
Proceedings of the Linux Symposium

Volume One

July 19th–22nd, 2006
Ottawa, Ontario
Canada


Contents

Enabling Docking Station Support for the Linux Kernel
Kristen Carlson Accardi

Open Source Graphic Drivers—They Don’t Kill Kittens
David M. Airlie

kboot—A Boot Loader Based on Kexec
Werner Almesberger

Ideas on improving Linux infrastructure for performance on multi-core platforms
Maxim Alt

A Reliable and Portable Multimedia File System
J.-Y. Hwang, J.-K. Bae, A. Kirnasov, M.-S. Jang, & H.-Y. Kim

Utilizing IOMMUs for Virtualization in Linux and Xen
M. Ben-Yehuda, J. Mason, J. Xenidis, O. Krieger, L. Van Doorn, J. Nakajima, A. Mallick, & E. Wahlig

Towards a Highly Adaptable Filesystem Framework for Linux
S. Bhattacharya & D. Da Silva

Multiple Instances of the Global Linux Namespaces
Eric W. Biederman

Fully Automated Testing of the Linux Kernel
M. Bligh & A.P. Whitcroft

Linux Laptop Battery Life
L. Brown, K.A. Karasyov, V.P. Lebedev, R.P. Stanley, & A.Y. Starikovskiy

The Frysk Execution Analysis Architecture
Andrew Cagney


Evaluating Linux Kernel Crash Dumping Mechanisms
Fernando Luis Vázquez Cao

Exploring High Bandwidth Filesystems on Large Systems
Dave Chinner & Jeremy Higdon

The Effects of Filesystem Fragmentation
Giel de Nijs, Ard Biesheuvel, Ad Denissen, Niek Lambert

The LTTng tracer: A low impact performance and behavior monitor for GNU/Linux
M. Desnoyers & M.R. Dagenais

Linux as a Hypervisor
Jeff Dike

System Firmware Updates Utilizing Software Repositories
Matt Domsch & Michael Brown

The Need for Asynchronous, Zero-Copy Network I/O
Ulrich Drepper

Problem Solving With Systemtap
Frank Ch. Eigler

Perfmon2: a flexible performance monitoring interface for Linux
Stéphane Eranian

OCFS2: The Oracle Clustered File System, Version 2
Mark Fasheh

tgt: Framework for Storage Target Drivers
Tomonori Fujita & Mike Christie

More Linux for Less
Michael Hennerich & Robin Getz


Hrtimers and Beyond: Transforming the Linux Time Subsystems
Thomas Gleixner and Douglas Niehaus

Making Applications Mobile Under Linux
C. Le Goater, D. Lezcano, C. Calmels, D. Hansen, S.E. Hallyn, & H. Franke

The What, The Why and the Where To of Anti-Fragmentation
Mel Gorman and Andy Whitcroft

GIT—A Stupid Content Tracker
Junio C. Hamano

Reducing fsck time for ext2 file systems
V. Henson, Z. Brown, T. Ts'o, & A. van de Ven

Native POSIX Threads Library (NPTL) Support for uClibc
Steven J. Hill

Playing BlueZ on the D-Bus
Marcel Holtmann

FS-Cache: A Network Filesystem Caching Facility
David Howells

Why Userspace Sucks—Or 101 Really Dumb Things Your App Shouldn’t Do
Dave Jones


Conference Organizers

Andrew J. Hutton, Steamballoon, Inc.
C. Craig Ross, Linux Symposium

Review Committee

Jeff Garzik, Red Hat Software
Gerrit Huizenga, IBM
Dave Jones, Red Hat Software
Ben LaHaise, Intel Corporation
Matt Mackall, Selenic Consulting
Patrick Mochel, Intel Corporation
C. Craig Ross, Linux Symposium
Andrew Hutton, Steamballoon, Inc.

Proceedings Formatting Team

John W. Lockhart, Red Hat, Inc.
David M. Fellows, Fellows and Carr, Inc.
Kyle McMartin

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights to all as a condition of submission.


Enabling Docking Station Support for the Linux Kernel
Is Harder Than You Would Think

Kristen Carlson Accardi
Open Source Technology Center, Intel Corporation

[email protected]

Abstract

Full docking station support has been a feature long absent from the Linux kernel—for good reason. From ACPI to PCI, full docking station support required modifications to multiple subsystems in the kernel, building on code that was designed for server hot-plug features rather than laptops with docking stations. This paper will present an overview of the work we have done to implement docking station support in the kernel as well as a summary of the technical challenges faced along the way.

We will first define what it means to dock and undock. Then, we will discuss a few variations of docking station implementations, both from a hardware and firmware perspective. Finally, we will delve into the guts of the software implementation in Linux—and show how adding docking station support is really harder than you would think.

1 Introduction and Motivation

It’s no secret that people have not been clamoring for docking station support. Most people do not consider docking stations essential, and indeed some feel they are completely unnecessary. However, as laptops become thinner and lighter, more vendors are seeking to replace functionality that used to be built into the laptop with a docking station. Commonly, docking stations will provide additional USB ports, PCI slots, and sometimes extra features like built-in media card readers or Ethernet ports. Most vendors seem to be marketing them as space-saving devices, and as an improved user experience for mobile users who do not wish to manage a lot of peripheral devices.

We embarked on the docking station project for a few reasons. Firstly, we knew there were a few members of the Linux community out there who actually did use docking stations. These people would hopefully post to the hotplug PCI mailing lists every once in a while, wondering if some round of I/O Hotplug patches would enable hot docking to work. Secondly, there was a need to be able to implement a couple of scenarios that dock stations provide a convenient test case for. Many dock stations are actually peer-to-peer bridges with a set of buses and devices located behind the bridge. Hot add of devices with P2P bridges on them had always been hard for us to test correctly due to lack of devices. Also, the ACPI _EJD method is commonly used in AML code for docking stations, but this method can also be applied to any hot-pluggable tree of devices. Finally, we felt that with the expanding product offerings for dock stations, filling this feature gap would eventually become important.

2 Docking Basics

There are three types of docking that are defined.

• Cold Docking/Undocking: The laptop is booted attached to the dock station and is powered off prior to removal from the dock station. This has always been supported by Linux. The devices on the dock station are enumerated as if they are part of the laptop.

• Warm Docking/Undocking: The laptop is booted either docked or undocked. The system is placed into a suspend state, and then either docked or undocked. This may be supported by Linux, assuming that your laptop actually suspends. It really depends on whether a driver’s resume routine will rescan for new devices or not.

• Hot Docking/Undocking: The laptop is booted outside the dock station and is then inserted into the dock station while completely powered and operating. This has recently had limited support, but only with a platform-specific ACPI driver. Hotplugging new devices on the dock station has never been supported.

Docking is controlled by ACPI. ACPI defines a dock as an object containing a method called _DCK. An example dock device definition is shown in Figure 1.

_DCK is what ACPI calls a "control method". Not only does it tell the OS that this ACPI object is a dock, it also is used to control the isolation logic on the dock connector.

Device (DOCK1) {
    Name (_ADR, ...)
    Method (_EJ0, 0) { ... }
    Method (_DCK, 1) { ... }
}

Figure 1: Example DSDT that defines a Dock Device

When the user places their system into the docking station, the OS will be notified with an interrupt, and the platform will send a Device Check notify. The notify will be sent to a notify handler and then that handler is responsible for calling the _DCK control method with the proper arguments to engage the dock connector.
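To make the calling convention concrete, the sketch below shows one way a handler could evaluate _DCK(1) through the ACPI interpreter. This is an illustration only, not the kernel's actual implementation; the engage_dock name is invented, and the acpi_evaluate_integer() helper (whose output argument type has varied between kernel versions) is assumed:

    #include <acpi/acpi_bus.h>

    static acpi_status engage_dock(acpi_handle dock_handle)
    {
            union acpi_object arg;
            struct acpi_object_list arg_list;
            unsigned long long result;      /* older kernels use unsigned long */

            arg.type = ACPI_TYPE_INTEGER;
            arg.integer.value = 1;          /* Arg0: 1 = dock, 0 = undock */
            arg_list.count = 1;
            arg_list.pointer = &arg;

            /* _DCK returns 1 if it succeeded, 0 if it failed (see Figure 2) */
            return acpi_evaluate_integer(dock_handle, "_DCK", &arg_list, &result);
    }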

_DCK as defined in the ACPI specification is shown in Figure 2. Assuming that _DCK returned successfully, the OS must now re-enumerate all enumerable buses (PCI) and also all the other devices on the dock that may not be on enumerable buses.

Undocking is just like docking, only in reverse. When the user hits the release button on the docking station, the OS is notified with an eject request. The notify handler must first execute _DCK(0) to release the docking connector, and then should execute the _EJ0 method after removing all the devices that are on the docking station from the OS.

The _DCK method is not only responsible for engaging the dock connector; it also seems to be a convenient place for system manufacturers to do device initialization. This is all implementation dependent. I have seen _DCK methods that do things such as programming a USB host controller to detect the USB hub on the dock station, issuing resets for PCI devices, and even attempting to modify PCI config space to assign new bus numbers¹ to the dock bridge.

¹Highly unacceptable behavior.


This control method is located in the device object that represents the docking station (that is, the device object with all the _EJx control methods for the docking station). The presence of _DCK indicates to the OS that the device is really a docking station.

_DCK also controls the isolation logic on the docking connector. This allows an OS to prepare for docking before the bus is activated and devices appear on the bus [1].

Arguments:
    Arg0: 1 = Dock (that is, remove isolation from connector)
          0 = Undock (isolate from connector)

Return Code: 1 if successful, 0 if failed.

Figure 2: _DCK method as defined in the ACPI Specification

The only way to know for sure what the _DCK method does is to disassemble the DSDT.

3 Driver Design Considerations

There are platform-specific drivers in the ACPI tree. The ibm_acpi driver had previously implemented a limited type of docking station support that would only work on certain IBM laptops. Essentially, this driver would hard-code the device name of the dock to find the dock, and then would execute the _DCK method without rescanning any of the buses or inserting any of the non-enumerable devices. It suffers from being platform specific, which is not ideal. We wanted to make a generic solution that would work for most platforms.

We originally assumed that all dock stations were the same: a dock bridge, which was a P2P bridge, would be located on the dock station, and all devices would be located behind the P2P bridge. The IBM ThinkPad Dock II is an example of this type of implementation, shown in Figure 4. The same driver (acpiphp) that could hotplug any device that had a P2P bridge on it could be used to hotplug the dock station devices, with the minor addition of needing to execute the _DCK method prior to scanning for new devices.

[Figure 4: The IBM ThinkPad Dock II. Block diagram of the notebook and docking station: a PCI-PCI bridge on the dock with a secondary PCI bus feeding a PCI adapter card slot, a USB 2.0 hub on the USB controller, an UltraBay, an IDE controller, and a PC Card controller with two PC Card slots, plus CRT, DVI, serial, parallel, mouse, keyboard, and other ports. c©2006, Noritoshi Yoshiyama, Lenovo Japan, Ltd—Used by Permission]

These were bad assumptions.

3.1 Variations in dock device definitions

The dock device definition for a few IBM ThinkPads that I had available is shown in Figure 5. The physical device is a P2P bridge.


IBM_HANDLE(dock, root,
    "\\_SB.GDCK",            /* X30, X31, X40 */
    "\\_SB.PCI0.DOCK",       /* 600e/x, 770e, ..., X20-21 */
    "\\_SB.PCI0.PCI1.DOCK",  /* all others */
    "\\_SB.PCI.ISA.SLCE",    /* 570 */
);

Figure 3: Defining a dock station in ibm_acpi.c

It appears to fit the ACPI definition of a standard PCI hotplug slot, in that it exists under the scope of PCI0, it has an _ADR function, and it is ejectable (has an _EJ0). It contains the _DCK method, indicating that it is a docking station as well. This was our original view of the docking station.

T20, T30, T41, T42 look like this:

Device (PCI0)
    Device (DOCK)
    {
        Name (_ADR, 0x00040000)
        Method (_BDN, 0, NotSerialized)
        Name (_PRT, Package (0x06)
        Method (_STA, 0, NotSerialized)
        Method (_DCK, 1, NotSerialized)
        Method (_EJ0, ...)

Figure 5: IBM T20, T30, T41, T42 DSDT

Unfortunately for us, not all dock stations are the same. Sometimes system manufacturers create a "virtual" device to represent the dock. It simply calls methods under the "real" dock bridge. In this case, the acpiphp driver will not recognize the GDCK device as an ejectable slot because it has no _ADR. In addition, it will not recognize the "real" dock device as an ejectable PCI slot because _EJ0 is not defined under the scope of the Dock() device, but instead under the virtual device GDCK. An example of this type of DSDT is shown in Figure 6. There are also dock stations that do not utilize a P2P bridge for PCI devices, such as the Lenovo ThinkPad Advanced Dock Station, shown in Figure 7. In addition, there are dock stations that do not have any PCI devices on them at all. This made using the ACPI PCI hotplug driver a bit nonsensical. However, the normal ACPI driver model also didn't work, because ACPI drivers will only load if a device exists. In the end, we decided to move the implementation from the PCI hotplug driver into ACPI, because there really was nowhere else to put it.

[Figure 7: The Lenovo ThinkPad Advanced Dock Station. Block diagram of the notebook and docking station: a PCI Express adapter card slot, a USB 2.0 hub on the USB controller, an UltraBay, a Super I/O on the LPC bus providing serial and parallel ports, a USB-IDE controller, and a USB media controller with two media card slots, plus CRT, DVI, Ethernet, modem, mouse, keyboard, and other ports. c©2006, Noritoshi Yoshiyama, Lenovo Japan, Ltd—Used by Permission]

In order to decouple the dock functionality from the hotplug functionality, the dock driver needs to allow other drivers to be notified upon a dock event, and also to register individual hotplug notification routines. This way, the dock driver can just handle the dock notifications from ACPI, and individual subsystems/drivers can handle how to hotplug new devices. In the case of PCI, acpiphp can still handle the device insertion, but it will not be used if there are no PCI devices on the dock station.

Scope (_SB)
    Device (GDCK)
        Method (_DCK, 1, NotSerialized)
        {
            Store (0x00, Local0)
            If (LEqual (GGID (), 0x03))
            {
                Store (\_SB.PCI0.LPC.EC.SDCK (Arg0), Local0)
            }
            If (LEqual (GGID (), 0x00))
            {
                Store (\_SB.PCI0.PCI1.DOCK.DDCK (Arg0), Local0)
            }
            Return (Local0)
        }
        Method (_EJ0, 1, NotSerialized)
        ...

Device (PCI1)
    Device (DOCK)
    {
        Name (_ADR, 0x00030000)
        Name (_S3D, 0x02)
        Name (_PRT, Package (0x06)

Figure 6: Alternative dock definition

4 Driver Implementation Details

The driver is located in drivers/acpi/dock.c. It makes a few external functions available to drivers who are interested in dock events.

4.1 External Functions

int is_dock_device(acpi_handle handle)

    This function will check to see if an ACPI device referenced by handle is a dock device. This means that the device either is a dock station, or a device on the dock station.

int register_dock_notifier(struct notifier_block *nb)

    Sign up for dock notifications. If a driver is interested in being notified when a dock event occurs, it can send in a notifier_block and be called right after _DCK has been executed, but before any devices have been hotplugged.


int unregister_dock_notifier(struct notifier_block *nb)

    Remove a driver's notifier_block.

acpi_status register_hotplug_dock_device(acpi_handle, acpi_notify_handler, void *)

    Pass an ACPI notify handler to the dock driver, to be called when a dock event has occurred. This allows drivers such as acpiphp, which need to re-enumerate buses after a dock event, to register their own routine to handle this activity.

acpi_status unregister_hotplug_dock_device(acpi_handle handle)

    Remove a notify handler from the dock station's hotplug list.
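Putting these together, a minimal and purely illustrative use of these interfaces by a hotplug driver might look like the sketch below. The my_* names are invented for this example; only the dock.c entry points listed above are assumed, and the exact header locations may differ:

    #include <linux/errno.h>
    #include <linux/notifier.h>
    #include <acpi/acpi_bus.h>

    static int my_dock_notify(struct notifier_block *nb,
                              unsigned long event, void *data)
    {
            /* called right after _DCK has run, before any hotplug happens */
            return NOTIFY_OK;
    }

    static struct notifier_block my_nb = {
            .notifier_call = my_dock_notify,
    };

    static void my_hotplug(acpi_handle handle, u32 event, void *context)
    {
            /* rescan this driver's bus for devices added by the dock */
    }

    static int my_probe(acpi_handle handle)
    {
            if (!is_dock_device(handle))
                    return 0;       /* not related to a dock station */

            register_dock_notifier(&my_nb);
            if (ACPI_FAILURE(register_hotplug_dock_device(handle,
                                                          my_hotplug, NULL)))
                    return -ENODEV;
            return 0;
    }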

4.2 Driver Init

At init time, the dock driver walks the ACPI namespace, looking for devices which have defined a _DCK method.

    /* look for a dock station */
    acpi_walk_namespace(ACPI_TYPE_DEVICE, ACPI_ROOT_OBJECT,
                        ACPI_UINT32_MAX, find_dock, &num, NULL);

If we find a dock station, then we create a private data structure to hold a list of devices dependent on the dock station, and also hotplug notify blocks.
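A rough sketch of the kind of private structure this implies is shown below; the field names are illustrative guesses based on the description in this paper, not the actual dock.c definitions, and <linux/list.h> plus the ACPI type headers are assumed:

    struct dock_dependent_device {
            struct list_head list;          /* linked into a dock station list */
            acpi_handle handle;             /* the dependent ACPI device */
            acpi_notify_handler handler;    /* registered hotplug routine, if any */
            void *context;
    };

    struct dock_station {
            acpi_handle handle;                  /* the object that defines _DCK */
            unsigned long flags;                 /* e.g. a "docking in progress" bit */
            struct list_head dependent_devices;  /* devices whose _EJD names this dock */
            struct list_head hotplug_devices;    /* devices with hotplug handlers */
    };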

We can detect devices dependent on the dock by walking the namespace looking for _EJD methods. _EJD is another method defined by ACPI that is associated with devices that have a dependency on other devices. From the spec:

    This object is used to specify the name of a device on which the device, under which this object is declared, is dependent. This object is primarily used to support docking stations. Before the device indicated by _EJD is ejected, OSPM will prepare the dependent device (in other words, the device under which this object is declared) for removal [1].

So, to translate: all devices that are behind a dock bridge should have an _EJD method defined in them that names the dock.
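As an illustration only, a namespace-walk callback that collects such dependent devices could look roughly like this; find_dock_dependents and add_dock_dependent_device are hypothetical names, and acpi_bus_get_ejd() is assumed to resolve a device's _EJD reference to a handle:

    static acpi_status find_dock_dependents(acpi_handle handle, u32 lvl,
                                            void *context, void **rv)
    {
            struct dock_station *ds = context;
            acpi_handle ejd;

            /* skip devices without an _EJD, or whose _EJD cannot be resolved */
            if (ACPI_FAILURE(acpi_bus_get_ejd(handle, &ejd)))
                    return AE_OK;

            /* remember the device if its _EJD names this dock station */
            if (ejd == ds->handle)
                    add_dock_dependent_device(ds, handle);

            return AE_OK;
    }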

Drivers or subsystems can register for dock notifications if they control a device dependent on the dock station. Drivers use the is_dock_device() function to determine if they are a device on a dock station. This allows for re-enumeration of the subsystem after a dock event if it is necessary. In the case of PCI devices, the acpiphp driver is modified to detect not only ejectable PCI slots, but also PCI dock bridges or hotpluggable PCI devices. If it does find one of these devices, then it will request that the dock driver notify acpiphp whenever a dock event occurs. When a system docks, the acpiphp driver will treat the event like any other PCI hotplug event, and rescan the appropriate bus to see if new devices have been added.

4.3 Dock Events

At driver init time, the dock driver registers an ACPI event handler with the ACPI subsystem. When a dock event occurs, the dock driver event handler will be called. A dock is an ACPI_NOTIFY_BUS_CHECK event type. First, the event handler will make sure that we are not already in the middle of docking. This check is needed because I found on some dock station/laptop combos that false dock events were being generated by the system—probably due to a faulty physical connection. We ignore these false events. It is also necessary to ensure that the dock station is actually present before performing the _DCK operation. This is accomplished by the dock_present() function. dock_present() just executes the ACPI _STA method. _STA will report whether or not the device is present.
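For illustration, a dock_present() along these lines could be a thin wrapper around _STA. This is a hedged sketch rather than the actual dock.c code, and it assumes the acpi_evaluate_integer() helper:

    static int dock_present(struct dock_station *ds)
    {
            unsigned long long sta;   /* older kernels pass an unsigned long */
            acpi_status status;

            /* bit 0 of _STA means "device present" */
            status = acpi_evaluate_integer(ds->handle, "_STA", NULL, &sta);
            return ACPI_SUCCESS(status) && (sta & 0x01);
    }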

    if (!dock_in_progress(ds) &&
        dock_present(ds)) {

begin_dock() just sets some state bits to indicate that we are now in the middle of handling a dock event.

begin_dock(ds);

dock() will execute the _DCK method with the proper arguments.

dock(ds);

We confirm that the device is still present and functioning after the _DCK method.

    if (!dock_present(ds)) {
            printk(KERN_ERR PREFIX
                   "Unable to dock!\n");
            break;
    }

We notify all drivers who have registered with the register_dock_notifier() function. This allows drivers to do anything that they want prior to handling a hotplug notification. This can be important if _DCK does something that needs to be undone. For example, on the IBM T41, the _DCK method will clear the secondary bus number for the parent of the dock bridge.² This makes it a bit hard for acpiphp to scan buses looking for new devices. acpiphp can register a function that is called by the notifier_call_chain that will clean up this mistake prior to calling the hotplug notification function.

²Also highly unacceptable.

    notifier_call_chain(&dock_notifier_list, event, NULL);

Drivers or subsystems that need to be notified so that devices can be hotplugged can register a hotplug notification function with the dock driver by using the register_hotplug_dock_device() function. hotplug_devices() just walks the list of hotplug notification routines and calls each one of them in the order that it was received.

hotplug_devices(ds, event);
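A hedged sketch of what such a walk might look like, reusing the illustrative structure fields sketched in section 4.2 (this is not the actual dock.c code):

    static void hotplug_devices(struct dock_station *ds, u32 event)
    {
            struct dock_dependent_device *dd;

            /* call every registered hotplug routine, in registration order */
            list_for_each_entry(dd, &ds->hotplug_devices, list)
                    if (dd->handler)
                            dd->handler(dd->handle, event, dd->context);
    }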

We clear the dock state bits to indicate that we are finished docking.

complete_dock(ds);

Now we alert userspace that a dock event has occurred. This event should be sent to the acpid program. If a userspace program is ever written or modified to care about dock events, it can use acpid to get those events.

    if (!acpi_bus_get_device(ds->handle, &device))
            acpi_bus_generate_event(device, event, 0);

Undocking is mostly just the reverse of docking. An undock is an ACPI_NOTIFY_EJECT_REQUEST type. Once again, we must not be in the middle of handling a dock event, and the dock device must be present in order to handle the eject request properly.

    if (!dock_in_progress(ds) &&
        dock_present(ds)) {

Because undocking may remove the acpi_device structure that we need to send dock events to userspace, we send our undock notification to acpid prior to actually executing _DCK.

    if (!acpi_bus_get_device(ds->handle, &device))
            acpi_bus_generate_event(device, event, 0);

We also must call all the hotplug routines to notify them of the eject request. This is important to do prior to executing _DCK, since _DCK will release the physical connection and may make clean removal of some devices impossible. Finally, we can call undock(), which simply executes the _DCK method with the proper arguments.

    hotplug_devices(ds, event);
    undock(ds);

The ACPI spec requires that all dock stations (i.e., objects which define _DCK) also define an _EJ0 routine. This must be called after _DCK in order to properly undock. What this routine actually does is system dependent.

eject_dock(ds);

At this point, a call to _STA should indicate that the dock device is not present.

    if (dock_present(ds))
            printk(KERN_ERR PREFIX
                   "Unable to undock!\n");

The design of the driver was intentionally kept strictly to handling dock events. For this reason, this is the only thing of interest that this driver does.

5 Conclusions

Dock stations make excellent test cases for hotplug-related kernel code. Attempting to hotplug a device which can be a PCI bridge with a tree of devices under it exposed some interesting problems that apply to other devices besides dock stations. Right now we require the use of the pci=assign-busses parameter, mainly because the BIOS may not reserve enough bus numbers for us to insert a new dock bridge and other buses behind it. I found a couple of problems with how bus numbers are assigned during my work, which required patches to the PCI core. In many ways the problems that are faced with implementing hot dock are directly applicable to hotplugging on servers. Therefore, it is valuable work to continue, even if only 3 people in the world still use a docking station. We do believe that docking station usage will rise as system vendors create more compelling uses for them.

Dock station hardware implementations can really vary. It's very common to have a P2P bridge located on the dock station, with a tree of devices underneath it; however, it isn't the only implementation. Because of this, it's important to handle docking separately from any hotplug activity, so that all the intelligence for hotplug can be handled by the individual subsystems or drivers rather than in one gigantic dock driver. I have only implemented changes to allow one driver to hotplug after a dock, but more drivers or subsystems may be modified in the future.

I have done very limited testing at this point, and every time a new person tries the dock patches, the design must be modified to handle yet another hardware implementation. As usage increases, I expect that the implementation described in this paper will evolve into something which hopefully allows more and more laptop docking stations to "just work" with Linux.

References

[1] Advanced Configuration and Power Interface specification. www.acpi.info, 3.0a edition.

[2] PCI Hot Plug specification. www.pcisig.com, 1.1 edition.

[3] PCI Local Bus specification. www.pcisig.com, 3.0 edition.

[4] PCI-to-PCI Bridge specification. www.pcisig.com, 1.2 edition.

[5] Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. Linux Device Drivers. O'Reilly, http://lwn.net/Kernel/LDD3/.


Open Source Graphic Drivers—They Don’t Kill Kittens

David M. Airlie
Open Source
[email protected]

Abstract

This paper is a light-hearted look at the state of current support for Linux / X.org graphics drivers and explains why closed source drivers are in fact responsible for the death of a lot of small cute animals.

The paper discusses how the current trend of using closed-source graphics drivers is affecting the Open Source community, e.g. users are claiming that they are running an open source operating system which then contains a 1 MB binary kernel module and a 10 MB user space module. . .

The paper finally looks at the current state of open source graphics drivers and vendors and how they are interacting with the kernel and X.org communities at this time (this changes a lot). It discusses methods for producing open source graphics drivers such as the r300 project and the recently started NVIDIA reverse engineering project.

1 Current Official Status

This section examines the current status (as of March 2006) of the support from various manufacturers for Linux, X.org [4], and DRI [1] projects. The three primary manufacturers, Intel, ATI, and NVIDIA, are looked at in depth along with a brief overview of other drivers.

1.1 Intel

Currently Intel contract Tungsten Graphics to implement drivers for their integrated graphics chipsets. TG directly contribute code to the Linux kernel, X.org, and Mesa projects to support Intel cards from the i810 to the i945G. Intel have previously released complete register and programmer's reference guides for the i810 chipset; however, from the i830 chipset onwards, no information was given to external entities with the exception of Tungsten.

As of March 2006, all known Intel chipsets are supported by the i810 X.org driver.

1.1.1 2D

The Intel integrated chipsets vary in the number and type of connectable devices. The desktop chipsets commonly only have native support for CRTs, and external devices are required to drive DVI or tv-out displays. These external devices are connected to the chipset using either the DVO (digital video output) on i8xx, or sDVO (serial digital video output) on i9xx. These external devices are controlled over an i2c bus. The mobile chipsets allow for an in-built LVDS controller and sometimes an in-built tv-out controller.

Due to the number of external devices available from a number of manufacturers (e.g. Silicon Image, Chrontel), writing a driver is a lot of hard work without a lot of manufacturer datasheets and hardware. The Intel BIOS supports a number of these devices.

For this reason the driver uses the Video BIOS (VBE) to do mode setting on these chipsets. This means that unless the BIOS has a mode pre-configured in its tables, or a mode is hacked in (using tools like 915resolution), the driver cannot set it. This stops the driver being used properly on systems where the BIOS isn't configured properly (widescreen laptops) or the BIOS doesn't like the mode from the monitor (Dell 2005FPW via DVI).

1.1.2 3D

The Intel chipsets have varying hardware support for accelerating 3D operations. Intel "distinguish" themselves from other manufacturers by not putting support for TNL in hardware, preferring to have optimized drivers do that stuff in software. The i9xx added support for HW vertex shaders; however, fragment shading is still software-based. The 3D driver for the Intel chipsets supports all the features of the chipset with the exception of Zone Rendering, a tile-based rendering approach. Implementing zone rendering is a difficult task which changes a lot of how Mesa works, and the returns are not considered great enough yet.

However, in terms of open source support for 3D graphics, Intel provide by far the best support via Tungsten Graphics.

1.2 ATI

ATI, once a fine upstanding citizen of the open source world (well, they got paid for it), no longer have any interest in our little adventures and have joined the kitten killers. The ATI cards can be broken up into three broad categories: pre-r300, r3xx–r4xx, and r5xx.

1.2.1 pre-r300

Thanks to the Weather Channel wanting open-source drivers for the Radeon R100 and R200 family of cards, and paying the money, ATI made available cut-down register specifications for these chipsets to the open-source developer community via their Developer Relations website. These specifications allowed Tungsten Graphics to implement basically complete 3D drivers for the r100 and r200 series of cards. However, ATI didn't provide any information on programming any company-proprietary features such as Hyper-Z or TruForm. ATI's engineering also provided some support over a number of years for 2D on these cards, such as initial render acceleration, and errata for many chips.

1.2.2 r300–r4xx

The R300 marked the first chipset for which ATI weren't willing to provide any 3D support for their cards. A 2D register spec and development kit was provided to a number of developers, and errata support was provided by ATI engineering. However, no information on the 3D sections of this chipset was ever revealed to the open source community. A number of OEMs have been provided information on these cards, but no rights to use it for open-source work.

ATI's fglrx closed-source driver appeared with support for many of the 3D features on the cards; it, however, has had certain stability problems, and later versions do not always run on older cards.


This range of cards also saw the introduction of PCI Express cards. Support for these cards came quite late, and a number of buggy fglrx releases were required before it was stabilised.

1.2.3 r5xx

The R5xx is the latest ATI chipset. This chipset has a completely redesigned mode setting and memory controller compared to the r4xx; the 3D engine is mostly similar. Again, no information has been provided to the open-source community. As of this writing, no support beyond vesa is available for these chipsets. ATI have not released an open-source 2D driver or a version of fglrx that supports these chips, making them totally useless for any Linux users.

1.2.4 fglrx

The ATI fglrx driver supports r200, r300, and r400 cards, and is built using the DRI framework. It installs its own libGL (the DRI one used to be insufficient for their needs) and a quite large kernel module. FGLRX OpenGL support can sometimes be a bit useless; Doom 3, for example, crashed horribly on fglrx when it first came out.

1.3 NVIDIA

Most people have thought that NVIDIA were always evil and never provided any specs, which isn't true. Back in the days of the riva chipsets, the Utah-GLX project implemented 3D support for the NVIDIA chipsets using NVIDIA-provided documentation.

1.3.1 2D

NVIDIA have some belief in having a driver for their chipsets shipped with X.org even if it only supports basic 2D acceleration. This at least allows users to get X up and running so they can download the latest binary driver for their cards, or at least use X on PPC and other non-x86 architectures. The nv driver in X.org is supported by NVIDIA employees and, despite it being written in obfuscated C^Hhex code, the source is there to be tweaked. BeOS happens to have a better open source NVIDIA driver with dual-head support, which may be ported to X.org at some point.

1.3.2 3D

When it comes to 3D, the NVIDIA 3D drivers are considered the best "closed-source" drivers. From an engineering point of view, the drivers are well supported, and NVIDIA interact well with the X.org community when it comes to adding new features. The NVIDIA driver provides support for most modern NVIDIA cards; however, they recently dumped support for a lot of older cards into a legacy driver and are discontinuing support in the primary driver. NVIDIA drivers commonly support all features of OpenGL quite well.

1.4 Others

1.4.1 VIA and SiS

Other manufacturers of note are Matrox, VIA, and SiS. VIA and SiS both suffer from a serious lack of interaction with the open-source community, most likely due to some cultural differences between Taiwanese manufacturers and open-source developers. Both companies occasionally code-drop drivers for hardware with bad license files and no response to feedback, but with a nice shiny press release, or get in touch with open-source driver writers with promises of support and NDA'd documentation, but nothing ever comes of it. Neither company has a consistent approach to open source drivers. VIA chipsets have good support for features thanks to people taking their code drops and making proper X.org drivers from them (unichrome and openchrome projects), and SiS chipsets (via Thomas Winischhofer) have probably by far the best 2D driver available in terms of features, but their 3D drivers are a bit hit-and-miss, and only certain chipsets are supported at all.

1.4.2 Matrox

Matrox provide open source drivers for their chipsets below G550; however, newer chipsets use a closed-source driver.

2 Closed Source Drivers—Reasons

So a question the author is asked a lot is why he believes closed source drivers are a bad thing. I don't consider them bad so much as pure evil, in the kitten-killing, seal-clubbing sense. Someone has to hold an extreme view on these things, and in the graphics driver case that is the author's position. This section explores some of the reasons why open-source drivers for graphics cards seem to be going in the opposite direction to open-source drivers for every other type of hardware.

This is all the author's opinion and doesn't try to reflect truth in any way.

2.1 Reason—Microsoft

The conspiracy theorists among us (I'm not a huge fan) find a way to blame Microsoft for every problem in Linux. So to keep them happy, I've noticed two things.

• Microsoft decided to use a vendor's chip in the XBOX series → no specs anymore.

• A chipset vendor puts DirectX 8.0 support into a chip → no specs anymore.

Hope this keeps that section happy.

2.2 Reason—???

Patents, and the fear of competitors or patent scum-sucking companies bringing infringement against the latest chipset release and delaying it, are probably valid fears amongst chip manufacturers. They claim releasing chipset docs to the public may make it easier for these things to be found; however, most X.org developers have no problem signing suitable NDAs with manufacturers to access specs. Open source drivers may show a company's hand to a greater degree. This is probably the most valid fear, and it is getting more valid due to the great U.S. patent system.

2.3 Reason—Profit

Graphics card manufacturing is a very competitive industry, especially in the high-end gaming, 3–6 month development cycle, grind-out-as-many-different-cards-as-you-can world that ATI and NVIDIA inhabit. I can't see how open sourcing drivers would slow down these cycles or get in the way—apart from the fact that the dirty tricks to detect and speed up Quake 3 might be spotted more easily (everyone spots them in the closed source drivers anyway). It doesn't quite explain Matrox and those guys who don't really engage in the gamer market to any great degree. It also doesn't really explain fglrx, which is one of the most unsuitable drivers for gaming on Linux.

Also, things like SLI and Crossfire bring into question some of the profit motivation; the number of SLI and Crossfire users is certainly smaller than the number of Linux users.

3 Closed Source Drivers—Killing Kittens

3.1 Fluffy—Open Source OS

Linux is an open source OS. Linux has become a highly stable OS due to its open source nature. The ability for anyone to be able to fix a bug in any place in the OS, within reason, is very useful for implementing Linux in a lot of server and embedded environments. Things like Windows CE work for embedded systems as long as you do what MS wanted you to do; Linux works in these systems because you don't have to follow the plan of another company: you are free to do your own things. Closed source drivers take this freedom away.

If you load a 1 MB binary into your Linux kernel or X.org, you are NO LONGER RUNNING AN OPEN SOURCE OS. Lots of users don't realise this; they tell their friends all about open source, but use NVIDIA drivers.

3.2 Mopsy—Leeching

So, on to why the drivers are a bad thing. Linux developers have developed a highly stable OS and provide the source to it; X.org is finally getting together a window system with some modern features in it and is providing the source to it. These developers are also providing the ideas and infrastructure for these things openly. Closed source vendors are just not contributing to the pool of knowledge and features in any decent way. Open source developers are currently implementing acceleration architectures and memory management systems that the closed source drivers have had for a few years. These areas aren't exactly the family jewels; surely some code might have been contributed, or some ideas on how things might be done.

3.3 Kitty—niche systems

There are a lot of niche systems out there, installations in the thousands that normally don't interest the likes of NVIDIA or ATI. The author implements embedded graphics systems, and originally used ATI M7s but now uses Intel chipsets where possible. These sales, while not significant to ATI or NVIDIA on an individual basis, add up to a lot more than the SLI or CrossFire sales ever will. However, these niche systems usually require open source drivers in order to do something different. For example, the author's systems require a single 3D application but not an X server. Implementing this is possible using open source drivers; however, doing so with a closed source driver is not possible. Also, for non-x86 systems such as PPC or Sparc, where these chips are also used, getting a functional driver under Linux just isn't possible.

3.4 Spot—out-dated systems

Once a closed vendor has sold enough of a card, it's time to move on and somehow force people to buy later cards. Supporting older cards is no longer a priority or profitable. This allows them to stop support for these cards at a certain level and not provide any new features on those cards even if it is possible. Looking at the features added to the open-source radeon driver since its inception shows that continuing development on these cards is possible in the open source community. NVIDIA recently relegated all cards before a certain date to their legacy drivers. Eventually these drivers will probably stop being updated, meaning running a newer version of Linux on those systems will become impossible.

4 Open Source Drivers—Future

This section discusses the future plans of open source graphic driver development. This paper was written in March 2006, and a lot may have happened between now and the publishing date in July.¹ The presentation at the conference will hopefully have all-new information.

¹[Well, besides an awful lot of formatting, copyediting, and such—OLS Formatting Team.]

4.1 Intel

Going forwards, Intel appear to be in the best position. Recent hirings show their support for open source graphics, and Tungsten Graphics have added a number of features to their drivers and are currently implementing an open source video memory manager, initially on the Intel chipsets. Once the mode-setting issues are cleared up, the drivers will be the best example out there.

The author has done some work to implement BIOS-less mode setting on these cards for certain embedded systems, and hopes that work can be taken forward to cover all cards and integrated into the open source X.org driver and become supported by Intel/TG.

4.2 ATI

4.2.1 R3xx + R4xx 3D support

The R300 project is an effort to provide an open-source 3D driver for the r300 and greater by reverse engineering methods. The project has used the fglrx and Windows drivers to reverse engineer the register writes used by the r3xx cards. The method used involved running a simple OpenGL application and changing one thing at a time to see what registers were written by the driver. There are still a few problems with this approach in terms of stability, as certain card setup sequences for certain cards are not yet known (Radeon 9800s fall over a lot). These sequences are not that easy to discover; however, tracing the fglrx startup using valgrind might help a lot.

While this project has been highly successful in terms of implementing the feature set of the cards, the lack of documentation and/or engineering support hampers any attempts to make this a completely stable system.

4.2.2 R5xx 2D support

A 2D driver for the R5xx series of cards from the author may appear; however, the author would like to engage ATI so as to avoid getting sued into the ground, due to a lot of information being available under NDA via an OEM. Most of the driver has, however, been reverse engineered by tracing the outputs from the video BIOS when asked to set a mode, using a modified x86 emulator.


4.3 NVIDIA

Recently an X.org DRI developer (Stephane Marcheu) announced the renoveau project [3], an attempt to build open-source 3D drivers for NVIDIA cards. This project will use the same methods as the r300 project to attempt to get, at first, a basic 3D driver for the NVIDIA cards.

5 Reverse Engineering Methodologies

This section just looks at some of the commonly used reverse engineering methodologies in developing graphics drivers.

5.1 2D Modesetting

Most cards come with a video BIOS that can set modes. Using a tool like LRMI [2], the Linux Real Mode Interface, the BIOS can be run inside an emulator. When the BIOS uses an inl or outl instruction to write to the card, LRMI must actually do this for it, so these calls can be trapped and used to figure out the sequence of register writes necessary to set a particular mode on a particular chipset. Multiple runs can be used to track exactly where the mode information is emitted.
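As a rough illustration of the trapping idea (not LRMI's actual code), the emulator's port-write hook can simply log each access before forwarding it to the hardware; the traced_outl name is invented here, and the process needs port access (e.g. via iopl()) for the real write to succeed:

    #include <stdio.h>
    #include <sys/io.h>

    static void traced_outl(unsigned int value, unsigned short port)
    {
            /* record the register write the Video BIOS is attempting... */
            fprintf(stderr, "outl 0x%08x -> port 0x%04x\n", value, port);
            /* ...then forward it to the real card */
            outl(value, port);
    }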

This method has been used by the author in writing mode-setting code for Intel i915 chipsets, for intelfb and X.org, and also for looking at the R520 mode setting.

Another method, if a driver exists for a card under Linux already: a number of developers have discussed using an mmap trick, whereby a framework is built which loads the driver and fakes the mmap for the card registers.

The framework then catches all the segmentation faults and logs them while passing them through. This has been used by Ben Herrenschmidt for radeonfb suspend/resume support on Apple drivers. An enhancement to this by the author (who was too lazy to write an mmap framework for x86) uses a valgrind plugin to track the mmap and reads/writes to an mmapped area. This solution isn't perfect (it only allows reading back writes after they happen), but it has been sufficient for work on the i9xx reverse engineering.

5.2 3D

Most 3D cards have some form of two-section driver: a kernel-space manager and a userspace 3D driver linked into the application via libGL. The kernel-space code normally queues up command buffers from the userspace driver. The userspace driver normally mmaps the command queues into its address space. An application linked with libGL can do some simple 3D operations and then watch the command buffers as the app fills them. Tweaking what the high-level application does allows different command buffers to be compared, and a map of card registers vs. features can be built up. The r300 project has been very successful with this approach.

The r300 project also has a tool that runs under Windows that constantly scans the shared buffers used by the Windows drivers, and dumps them whenever they change, in order to do similar work.

6 Conclusion

This paper has looked at the current situation with graphics driver support for Linux and X.org from card manufacturers. It looks at why closed source drivers are considered evil, and at what the open source community is doing to try and provide open source drivers for these cards. Just remember: save those kittens.

References

[1] DRI Project. http://dri.freedesktop.org.

[2] Linux Real Mode Interface. http://lrmi.sf.net.

[3] The Renoveau Project. http://renoveau.sf.net/.

[4] The X.org Project. http://www.x.org/.


kboot—A Boot Loader Based on Kexec

Werner Almesberger
[email protected]

Abstract

Compared to the “consoles” found on traditional Unix workstations and mini-computers, the Linux boot process is feature-poor, and the addition of new functionality to boot loaders often results in massive code duplication. With the availability of kexec, this situation can be improved.

kboot is a proof-of-concept implementation of a Linux boot loader based on kexec. kboot uses a boot loader like LILO or GRUB to load a regular Linux kernel as its first stage. Then, the full capabilities of the kernel can be used to locate and to access the kernel to be booted, perform limited diagnostics and repair, etc.

1 Oh no, not another boot loader!

There is already no shortage of boot loaders for Linux, so why have another one? The motivation for making kboot is simply that the boot process of Linux is still not as good as it could be, and that recent technological advances have made it comparably easy to do better.

Looking at traditional Unix servers and workstations, one often finds very powerful boot environments, offering a broad choice of possible sources for the kernel and other system files to load. It is also quite common to find various tools for hardware diagnosis and system software repair. On Linux, many boot loaders are much more limited than this.

Even boot loaders that provide several of these advanced features, like GRUB, suffer from the problem that they need to replicate functionality or at least include code found elsewhere, which creates an ever increasing maintenance burden. Similarly, any drivers or protocols the boot loader incorporates will have to be maintained in the context of that boot loader, in parallel with the original source.

New boot loader functionality is not only required because administrators demand more powerful tools, but also because technological progress leads to more and more complex mechanisms for accessing storage and other devices, which a boot loader eventually should be able to support.

It is easy to see that a regular Linux system happens to support a superset of all the functionality described above.

With the addition of the kexec system call to the 2.6.13 mainline Linux kernel, we now have an instrument that allows us to build boot loaders with a fully featured Linux system, tailored according to the needs of the boot process and the resources available for it.

Kboot is a proof-of-concept implementation of such a boot loader. It demonstrates that new functionality can be merged from the vast code base available for Linux with great ease, and without incurring any significant maintenance overhead. This way, it can also serve as a platform for the development of new boot concepts.

The project’s home page is at http://kboot.sourceforge.net/

The remainder of this section gives a high-level view of the role of a boot loader in general, and what kboot aims to accomplish. Additional technical details about the boot process, including tasks performed by the Linux kernel when bringing up user space, can be found in [1].

Section 2 briefly describes Eric Biederman's kexec [2], which plays a key role in the operation of kboot. Section 3 introduces kboot proper, explains its structure, and discusses its application. Section 4 gives an outlook on future work, and we conclude with section 5.

1.1 What a boot loader does

After being loaded by the system’s firmware, a boot loader spends a few moments making itself comfortable on the system. This includes loading additional parts, moving itself to other memory regions, and establishing access to devices.

After that, it typically tries to interact with the user. This interaction can range from checking whether the user is trying to get the boot loader’s attention by pressing some key, through a command line or a simple full-screen menu, to a lavish graphical user interface.

Whatever the interface may be, in the end its main purpose is to allow the user to select, perhaps along with some other options, which operating system or kernel will be booted. Once this choice is made, the boot loader proceeds to load the corresponding data into memory, does some additional setup, e.g., to pass parameters to the operating system it is booting, and transfers control to the entry point of the code it has loaded.

In the case of Linux, two items deserve special mention: the boot parameter line and the initial RAM disk.

The boot parameter line was at its inception intended primarily as a means for passing a “boot into single user mode” flag to the kernel, but this got a little out of hand, and it is nowadays often used to pass dozens if not hundreds of bytes of essential configuration data to the kernel, such as the location of the root file system, instructions for how certain drivers should initialize themselves (e.g., whether it is safe for the IDE driver to try to use DMA or not), and the selection and tuning of items included in a generic kernel (e.g., disabling ACPI support).
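For illustration only, a boot parameter line covering exactly the kinds of settings just mentioned might read as follows (the device name and options are installation-specific examples, not a recommendation):

    root=/dev/hda3 ro single ide=nodma acpi=off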

Since a kernel would often not even boot without the correct set of boot parameters, a boot loader must store them in its configuration, and pass them to the kernel without requiring user action. At the same time, users should of course be able to manually set and override such parameters.

The initial RAM disk (initrd), which at the time of writing is gradually being replaced by the initial RAM file system (initramfs), provides an early user space, which is put into memory by the boot loader, and is thus available even before the kernel is fully capable of interacting with its surroundings. This early user space is used for extended setup operations, such as the loading of driver modules.

Given that the use of initrd is an integral part of many Linux distributions, any general-purpose Linux boot loader must support this functionality.


[Figure 1: The boot process exists in a world full of changes and faces requirements from many directions: hard- and firmware (new device drivers, new protocols), administration (combination of services, new file systems), and the user experience (convenience, compatible "look and feel"). All this leads to the need to continuously grow in functionality.]

1.2 What a boot loader should be like

A boot loader has much in common with the operating system it is loading: it shares the same hardware, exists in the same administrative context, and is seen by the same users. From all these directions originate requirements on the boot process, as illustrated in figure 1.

The boot loader has to be able to access at leastthe hardware that leads to the locations fromwhich data has to be loaded. This does notonly include physical resources, but also anyprotocols that are used to communicate withdevices. Firmware sometimes provides a setof functions to perform such accesses, but newhardware or protocol extensions often requiresupport that goes beyond this. For example, al-though many PCs have a Firewire port, BIOSsupport for booting from storage attached viaFirewire is not common.

Above basic access mechanisms lies the do-main of services the administrator can combinemore or less freely. This begins with file system

formats, and gets particularly interesting whenusing networks. For example, there is noth-ing inherently wrong in wanting to boot kernelsthat happen to be stored in RPM files on an NFSserver, which is reached through an IPsec link.

The hardware and protocol environment of theboot process extends beyond storage. For ex-ample, keyboard or display devices for userswith disabilities may require special drivers.With kboot, such devices can also be used tointeract with the boot loader.

Last but not least, whenever users have to per-form non-trivial tasks with the boot loader, theywill prefer a context similar to what they areused to from normal interaction with the sys-tem. For instance, path names starting at theroot of a file system hierarchy tend to be easierto remember than device-local names prefixedwith a disk and partition number.

In addition to all this, it is often desirable ifsmall repair work on an unbootable system canbe done from the boot loader, without havingto find or prepare a system recovery medium,or similar.


Figure 2: Simplified boot sequence of kexec: (1) copy file(s) through user space into kernel memory, (2) run the kexec reboot code, (3) order the pages holding the new kernel, and (4) jump to the new kernel's setup code.

The bottom line is that a general-purpose boot loader will always grow in functionality along the lines of what the full operating system can support.

1.3 The story so far

The two principal boot loaders for Linux on the i386 platform, LILO and GRUB, illustrate this trend nicely.

LILO was designed with the goal of being able to load kernels from any file system the kernel may support. Other functionality has been added over time, but growth has been limited by the author's choice of implementing the entire boot loader in assembler.1

GRUB appeared several years later and was written in C from the beginning, which helped it to absorb additional functionality more quickly. For instance, GRUB can directly read a large number of different file system formats, without having to rely on external help, such as the map file used by LILO. GRUB also offers limited networking support.

1 LILO was written in 1992. At that time, 32-bit real mode of the i386 processor was not generally known, and the author therefore had to choose between programming in the 16-bit mode in which the i386 starts, or implementing a fully-featured 32-bit protected mode environment, complete with real-mode callbacks to invoke BIOS functions. After choosing the less intrusive of the two approaches, there was the problem that no suitable and reasonably widely deployed free C compiler was available. Hence the decision to write LILO in assembler.

Unfortunately, GRUB still requires that any new functionality, be it drivers, file systems, file formats, network protocols, or anything else, be integrated into GRUB's own environment. This somewhat slows initial incorporation of new features, and, worse yet, leads to an increasing amount of code that has to be maintained in parallel with its counterpart in regular Linux.

In an ideal boot loader, the difference between the environment found on a regular Linux system and that in the boot loader would be reduced to a point where integration of new features, and their subsequent maintenance, is trivial. Furthermore, reducing the barrier for working on the boot loader should also encourage customization for specific environments, and more experimental uses.

The author has proposed the use of the Linux kernel as the main element of a boot loader in [1]. Since then, several years have passed, some of the technology has first changed, then matured, and with the integration of the key element required for all this into the mainstream kernel, work on this new kind of boot loader could start in earnest.

2 Booting kernels with kexec

One prediction in [1] came true almost immediately, namely that major changes to the bootimg mechanism described there were quite probable: when Eric Biederman released kexec, it swiftly replaced bootimg, being technologically superior and also better maintained.

Unfortunately, adoption of kexec into the mainstream kernel took much longer than anyone expected, in part also because it underwent design changes to better support the very elegant kdump crash dump mechanism [3], and it was only with the 2.6.13 kernel that it was finally accepted.

2.1 Operation

This is a brief overview of the fundamental aspects of how kexec operates. More details can be found in [4], [5], and also [3].

As shown in figure 2, the user space tool kexec first loads the code of the new kernel plus any additional data, such as an initial RAM disk, into user space memory, and then invokes the kexec_load system call to copy it into kernel memory (1). During the loading, the user space tool can also add or omit data (e.g., setup code), and perform format conversions (e.g., when reading from an ELF file).

After that, a reboot system call is made to boot the new kernel (2). The reboot code tries to shut down all devices, such that they are in a defined and inactive state, from which they can be instantly reactivated after the reboot.

Since data pages containing the new kernel have been loaded to arbitrary physical locations and could not occupy the same space as the code of the old kernel before the reboot anyway, they have to be moved to their final destination (3).

Finally, the reboot code jumps to the entry point of the setup code of the new kernel. That kernel then goes through its initialization, brings up drivers, etc.
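From the command line, this two-step sequence is typically driven with the kexec user space tool along these lines (an illustrative invocation; the kernel and initrd paths are placeholders):

    # load the new kernel and initrd into the running kernel
    kexec -l /boot/bzImage-2.6.16 --initrd=/boot/initrd.img \
          --append="root=/dev/hda1 ro"
    # shut down devices and jump into the loaded kernel
    kexec -e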

2.2 Debugging

The weak spot of kexec is the drivers: some drivers may simply ignore the request to shut down, others may be overzealous and deactivate the device in question completely, and some may leave the device in a state from which it cannot be brought back to life, be this either because the state itself is incorrect or irrecoverable, or because the driver simply does not know how to resume from this specific state.

Failure may also be only partial, e.g., VGA often ends up in a state where the text does not scroll properly until the card is reset by loading a font.

Many of these problems have not become visible yet, because those drivers have not been subjected to this specific shutdown and reboot sequence so far.

The developers of kexec and kdump have made a great effort to make kexec work with a large set of hardware, but given the sheer number of drivers in the kernel and also in parallel trees, there are doubtlessly many more problems still awaiting discovery.


Figure 3: The software stack of the kboot environment: the kboot shell and kboot utils on top of BusyBox (sh, cat, mount, ...), udev, dropbear, kexec, etc., which in turn sit on uClibc/glibc and a lean Linux kernel.

Since kboot is the first application of kexec that should attract interest from more than a relatively small group of developers, many of the expected driver conflicts will surface in the form of boot failures occurring under kboot, after which they can be corrected.

3 Putting it all together

Kboot bundles the components needed for a boot loader, and provides the "glue" to hold them together. For this, it needs very little code: as of version 10, only roughly 3'500 lines, about half of this shell scripts. Already LILO exceeds this by one order of magnitude, and GRUB further doubles LILO's figure.2

Of course, during its build process, kboot pulls in various large packages, among them the entire GCC tool chain, a C library, BusyBox, assorted other utilities, and the Linux kernel itself. In this regard, kboot more closely resembles a distribution like Gentoo, OpenEmbedded, or Rock Linux, which consist mainly of meta-information about packages maintained by other parties.

2 These numbers were obtained by quite unscientifically running wc -l on a somewhat arbitrary set of the files in the respective source trees.

Figure 4: The boot sequence when using kboot: the firmware starts a first-stage boot loader, which loads the kboot kernel and initramfs; from there, kexec starts the main system (the "booted environment"), or the machine reboots to a legacy OS.

3.1 The boot environment

Figure 3 shows the software packages that constitute the kboot environment. Its basis is a Linux kernel. This kernel only needs to support the devices, file systems, and protocols that will be used by kboot, and can therefore, if space is an issue, be made considerably smaller than a fully-featured production kernel for the same machine.

In order to save space, kboot can use uClibc [6] instead of the much larger glibc. Unfortunately, properly supporting a library different from the one on the host system requires building a dedicated version of GCC. Since uClibc is sensitive to the compiler version, kboot also builds a local copy of GCC for the host. To be on the safe side, it also builds binutils.

After this tour de force, kboot builds the applications for its user space, which include BusyBox [7], udev [8], the kexec tools [2], and dropbear [9]. BusyBox provides a great many common programs, ranging from a Bourne shell, through system tools like "mount," to a complete set of networking utilities, including "wget" and a DHCP client. Udev is responsible for the creation of device files in /dev. It is a user space replacement for the kernel-based devfs. The kexec tools provide the user space interface to kexec.

Last but not least, dropbear, an SSH server and client package, is included to demonstrate the flexibility afforded by this design. It also offers simple remote access to the boot prompt, without the need to set up a serial console for just this purpose.

3.2 The boot sequence

The boot sequence, shown in figure 4, is as follows: first, the firmware loads and starts the first-stage boot loader. This would typically be a program like GRUB or LILO, but it could also be something more specialized, e.g., a loader for on-board Flash memory. This boot loader then immediately proceeds to load kboot's Linux kernel and kboot's initramfs.

The kernel goes through the usual initialization and then starts the kboot shell, which updates its configuration files (see section 3.5), may bring up networking, and then interacts with the user.

If the user chooses, either actively or through a timeout, to start a Linux system, kboot then uses kexec to load the kernel and maybe also an initial RAM disk.

Although not yet implemented at the time of writing, kboot will also be able to boot legacy operating systems. The plan is to initially avoid the quagmire of restoring the firmware environment to the point that the system can be booted from it, but to hand the boot request back to the first-stage boot loader (e.g., with lilo -R or grub-set-default), and to reboot through the firmware.
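With LILO and GRUB (legacy), handing back such a request might look roughly like this; the entry names are hypothetical:

    lilo -R legacy-os        # boot the "legacy-os" entry on the next reboot only
    grub-set-default 2       # make entry 2 the saved default (needs "default saved")
    reboot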

3.3 The boot shell

At the time of writing, the boot shell is fairly simple. After initializing the boot environment, it offers a command line with editing, command and file name completion, and a history function for the current session.

The following types of items can be entered:

• Names of variables containing a command. These variables are usually defined in the kboot configuration file, but can also be set during a kboot session.3 The variable is expanded, and the shell then processes the command. This is a slight generalization of the label in LILO, or the title in GRUB.

• The path to a file containing a bootable kernel. Path names are generalized in kboot, and also allow direct access to devices and some network resources. They are described in more detail in the next section. When such a path name is entered, kboot tries to boot the file through kexec.

• The name of a block device containing the boot sector of a legacy operating system, or the path to the corresponding device file.

• An internal command of the kboot shell. It currently supports cd and pwd, with the usual semantics.

3 In the latter case, they are lost when the session ends.


Syntax               Example                     Description
variable             my_kernel                   Command stored in a variable
/path                /boot/bzImage-2.6.13.2      Absolute path in booted environment
//path               cat //etc/fstab             Absolute path in kboot environment
path                 cd linux-2.6.14             Relative path in current environment
device               hda7                        Device containing a boot sector
/dev/device          /dev/hda7                   Device file of device with boot sector
device:/path         hda1:/bzImage               File or directory on a device
device:path          hda1:bzImage                  (implicit /dev/)
/dev/device:/path    /dev/sda6:/foo/bar          File or directory on a device
/dev/device:path     /dev/sda6:foo/bar             (explicit /dev/)
host:/path           server:/home/k/bzImage-a    File or directory on an NFS server
http://host/path     http://server/foo           File on an HTTP server
ftp://host/path      ftp://server/foo/bar        File on an FTP server

Table 1: Types of path names recognized by kboot.

• A shell command. The kboot shell performs path name substitution, and then runs the command. If the command uses an executable from the booted environment, it is run with chroot, since the shared libraries available in the kboot environment may be incompatible with the expectations of the executable.

With the exception of a few helper programs, like the command line editor, the kboot shell is currently implemented as a Bourne shell script.
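To make the item types above concrete, a session might look roughly like the following; the prompt, kernel path, and device are invented for illustration:

    kboot> my_kernel                   # expand and run a command stored in a variable
    kboot> /boot/bzImage-2.6.13.2      # boot this kernel via kexec
    kboot> hda7                        # boot the boot sector found on this device
    kboot> cat //etc/fstab             # run a shell command on a file in the kboot environment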

3.4 Generalized path names

Kboot automatically mounts file systems of the booted environment, on explicitly specified block devices, and—if networking is enabled—also from NFS servers. Furthermore, it can copy and then boot files from HTTP and FTP servers.

For all this, it uses a generalized path name syntax that reflects the most common forms of specifying the respective resources. E.g., for NFS, the host:path syntax is used, for HTTP, it is a URL, and paths on the booted environment look just like normal Unix path names. Table 1 shows the various forms of path names.

Absolute paths in the kboot environment are an exception: they begin with two slashes instead of one.

We currently assume that there is one principal booted system environment, which defines the "normal" file system hierarchy on the machine in question. Support for systems with multiple booted environments is planned for future versions of kboot.

3.5 Configuration files

When kboot starts, it only has access to the configuration files stored in its initramfs. These were gathered at build time, either from the user (who placed them in kboot's config/ directory), or from the current configuration of the build host.

This set of files includes kboot's own configuration /etc/kboot.conf, /etc/fstab, and /etc/hosts. The kboot build process also adds a file /etc/kboot-features containing settings needed for the initialization of the kboot shell.

Figure 5: Some of the configuration files used by kboot: kboot.conf, fstab, and hosts are taken from the build environment, and more recent versions can be copied from the booted environment by mounting its /etc.

Kboot can now either use these files, or it can, at the user's discretion, try to mount the file system containing the /etc directory of the booted environment, and obtain more recent copies of them.

The decision of whether kboot will use its own copies, or attempt an update first, is made at build time. It can be superseded at boot time by passing the kernel parameter kboot=local.

3.6 When not to use kboot

While kboot is designed to be a flexible and extensible solution, there are areas where this type of boot loader architecture does not fit.

If only very little persistent storage is available, which is a common situation in small embedded systems, or if large enough storage devices would be available but cannot be made an integral part of the boot process, e.g., removable or unreliable media, only a boot loader optimized for tiny size may be suitable.

Similarly, if boot time is critical, the time spent loading and initializing an extra kernel may be too much. The boot time of regular desktop or server type machines already greatly exceeds the minimum boot time of a kernel, which embedded system developers aim to bring well below one second [10], so loading another kernel does not add too much overhead, particularly if the streamlining proposed below is applied.

Finally, the large hidden code base of kboot is unsuitable if high demands on system reliability, at least until the point when the kernel is loaded, require that the number of software components be kept to a minimum.

3.7 Extending kboot

The most important aspect of kboot is not the set of features it already offers, but that it makes it easy to add new ones.

New device drivers, low-level protocols (e.g., USB), file systems, network protocols, etc., are usually directly supported by the kernel, and need no or only little additional support from user space. So kboot can be brought up to date with the state of the art by a simple kernel upgrade.

Most of the basic system software runs out of the box on virtually all platforms supported by Linux, and particularly distributions for embedded systems provide patches that help with the occasional compatibility glitches. They also maintain compact alternatives to packages where size may be an issue.

Similarly, given that kboot basically provides a regular Linux user space, the addition of new ornaments and improvements to the user interface, which is an area with a continuous demand for development, should be easy.

When porting kboot to a new platform, the foremost—and also technically most demanding—issue is getting kexec to run. Once this is accomplished, interaction with the boot loader has to be adapted, if such interaction is needed. Finally, any administrative tools that are specific to this platform need to be added to the kboot environment.

4 Future work

At the time of writing, kboot is still a very young program, and has only been tested by a small number of people. As more user feedback arrives, new lines of development will open. This section gives an overview of currently planned activities and improvements.

4.1 Reducing kernel delays

The Linux kernel spends a fair amount of time looking for devices. In particular, IDE or SCSI bus scans can try the patience of the user, also because they repeat similar scans already done by the firmware. The use of kboot now adds another round of the same.

A straightforward mechanism that should help to alleviate such delays would be to predict their outcome, and to stop the scan as soon as the list of discovered devices matches the prediction. Such a prediction could be made by kboot, based on information obtained from the kernel it is running under, and be passed as a boot parameter to be interpreted by the kernel being booted.

Once this is in place, one could also envision configuring such a prediction at the first-stage boot loader, and passing it directly to the first kernel. This way, slow device scans that are known to always yield the same result could be completely avoided.

4.2 From shell to C

At the time of writing, the boot shell is a Bourne shell script. While this makes it easy to integrate other executables into the kboot shell, execution speed may become an issue, and also other language properties, such as the difficulty of separating name spaces, and how easily subtle quoting bugs may escape discovery, are turning into serious problems.

Rewriting the kboot shell in C should yield a program that is still compact, but easier to maintain.

4.3 Using a real distribution

The extensibility of kboot can be further increased by replacing its build process, which is very similar to that of buildroot [11], with the use of a modular distribution with a large set of maintained packages. In particular, OpenEmbedded [12] and Rock Linux [13] look very promising.

The reasons for not reusing an existing build process already from the beginning were mainly that kboot needs tight control over the configuration process (to reuse kernel configuration, and to propagate information from there to other components) and package versions (in order to know what users will actually be building), the sometimes large set of prerequisites, and also problems encountered during trials.

4.4 Modular configuration

Adding new functionality to the kboot environment usually requires an extension of the build process and changes to the kboot shell. For common tasks, such as the addition of a new type of path names, it would be desirable to be able to just drop a small description file into the build system, which would then interface with the rest of kboot over a well-defined interface.

Regarding modules: at the time of writing, kboot does not support loadable kernel modules.

5 Conclusions

Kboot shows that a versatile boot loader can be built with relatively little effort, if using a Linux kernel supporting kexec and a set of programs designed with the space constraints of embedded systems in mind.

By making it considerably easier to synchronize the boot process with regular Linux development, this kind of boot loader architecture should facilitate more timely support for new functionality, and encourage developers to explore new ideas whose implementation would have been considered too tedious or too arcane in the past.

References

[1] Almesberger, Werner. Booting Linux: The History and the Future, Proceedings of the Ottawa Linux Symposium 2000, July 2000. http://www.almesberger.net/cv/papers/ols2k-9.ps

[2] Biederman, Eric W. Kexec tools and patches. http://www.xmission.com/~ebiederm/files/kexec/

[3] Goyal, Vivek; Biederman, Eric W.; Nellitheertha, Hariprasad. Kdump, A Kexec-based Kernel Crash Dumping Mechanism, Proceedings of the Ottawa Linux Symposium 2005, vol. 1, pp. 169–180, July 2005. http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf

[4] Pfiffer, Andy. Reducing System Reboot Time with kexec, April 2003. http://www.osdl.org/archive/andyp/kexec/whitepaper/kexec.pdf

[5] Nellitheertha, Hariprasad. Reboot Linux Faster using kexec, May 2004. http://www-128.ibm.com/developerworks/linux/library/l-kexec.html

[6] Andersen, Erik. uClibc. http://www.uclibc.org/

[7] Andersen, Erik. BusyBox. http://busybox.net/

[8] Kroah-Hartman, Greg; et al. udev. http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev.html

[9] Johnston, Matt. Dropbear SSH server and client. http://matt.ucc.asn.au/dropbear/dropbear.html

[10] CE Linux Forum. Bootup Time Resources, CE Linux Public Wiki. http://tree.celinuxforum.org/pubwiki/moin.cgi/BootupTimeResources

[11] Andersen, Erik. Buildroot. http://buildroot.uclibc.org/

[12] OpenEmbedded. http://oe.handhelds.org/

[13] Rock Linux. http://www.rocklinux.org/


Ideas on improving Linux infrastructure for performance on multi-core platforms

Maxim Alt
Intel Corporation

[email protected]

Abstract

With maturing compiler technologies, compile-time analysis can be a very powerful tool for optimizing and monitoring code on any architecture. In combination with modern run-time analysis tools and existing program interfaces to monitor hardware counters, we will survey modern techniques for analyzing performance issues. We propose using performance counter data and sequences of performance events to trigger event handlers in either the application or the operating system. In this way, a sequence of performance events can be your debugging breakpoint or a callback. This paper will try to bridge the capabilities of advanced performance monitoring with common software development infrastructure (debuggers, gcc, loader, process scheduler). One proposed approach is to extend the run-time environment with an interface layer that will filter performance profiles, capture sequences of performance hazards, and provide summary data to the OS, debuggers, or application.

With the introduction of hyper-threading technology several years ago, there were obvious challenges to look beyond a single running process to monitor and schedule compute intensive processes on multi-threaded cores. Multi-level memory hierarchy and scaling on SMP systems complicated the situation even further, causing essential changes in kernel schedulers and performance tools. In the era of parallel and platform computing, we rely less on single execution process performance—with each component optimized by the compiler—and it becomes important to evaluate the performance of the platform as a whole. The new concept of performance adaptive schedulers is one example of intelligently maximizing the performance on the platform level of CMP systems. Performance data at higher granularity and a concept of processor efficiency per functionality can be applied to making intelligent decisions on process scheduling in the operating system.

Towards the end, we will suggest particular improvements in performance and run-time tools as a reflection of the proposed approaches and the transition to platform-level optimization goals.

1 Introduction

This paper introduces a series of ideas for bringing together existing disparate technologies to improve tools for the detection and amelioration of performance hazards.

About 20 years ago, an exceptionally thin 16-bit real-time iRMX operating system had an extremely simple but important built-in feature: when debugging race conditions in shared memory at any point of run-time execution, a developer could bring the system to a halt, set an internal OS breakpoint at an address, and the operating system would halt whenever a value was being written at the specified address. No high level debugger was needed to debug race conditions, nor were they capable.

With the introduction of hyper-threading technology, many software vendors started to experiment with running their multi-threaded software on hyper-threaded processors. With the increase in the number of processors, those vendors expected to get good scaling after elimination of synchronization and scheduling issues. These issues are quite difficult to track, debug, or find with Gdb or the VTune analyzer.

The problem of debugging synchronization issues in multi-threaded applications is growing more important and more complex with the advent of parallelizing compilers and language support for multithreading, such as OpenMP. The compiler has knowledge of program semantics, but does not generally have run-time information. The OS is in a position to make decisions based on run-time information, but it doesn't have the semantic information that was available to the compiler, since it sees only the binary. Further, any run-time analysis and decision-making affects application performance, either directly by using CPU time, or indirectly by effects such as cache pollution.

Spin locks are an example of the kind of construct for which higher-level information would be helpful. The compiler can easily detect spin lock constructs using simple pattern-recognition. From the run-time perspective a spin lock is a loop that repeatedly reads a shared memory address and compares it to a loop-invariant value. A spin-lock is a useful synchronization mechanism if the stall is not long. For long stalls, it wastes a lot of processor time. Typically, after spinning for some number of cycles, the thread will yield to let other threads make progress. If the OS could identify which threads were waiting for a lock and which threads held the lock, it could adjust priorities to maximize throughput automatically.

Another relevant example of how a scheduler might use performance data is not based on debugging. Consider two processes running concurrently: two floating point intensive loops, only one of which has long memory stalls. Should the scheduling of the processes alter? For example, two floating point intensive loops with unknown memory latency, or two loops of unknown execution property, or two blocks of code of unknown programming construct?

This question was raised by my colleagues in [1] about 3 years ago, where it was discussed how beneficial it would be if the OS scheduler had built-in micro-architectural intelligence.

Another issue with debugging the performance of an application using spin locks is that it typically doesn't provide much insight to know that the spin-lock library code is hot. The application programmer needs to know which lock is being held. That information can be gathered from the address of the lock, but often it is more useful to have a stack trace gathered at the time the lock is identified as hot. This requires sampling the return stack in the performance monitor handler, not just the current instruction pointer.

A process or a code block (hot block or block of interest) can be characterized by performance and code profiles, where the code profile is represented by a hierarchy of basic programming constructs, and the performance profile is represented by the execution path along with a register image captured at any given point in time. In this paper we will describe a set of new profile-guided static and dynamic approaches for efficient run-time decisions: debugging, analyzing and scheduling.


Please refer to Appendix A for an overview of existing technologies, tools and utilities.

2 Bridging the Technologies

I would like to explore extending the scope of sampling-analysis hybrid tools, for example by profiling with helper threads [10.5].

This section will provide a series of examples on how to combine building block components to get useful sample-analyze schemas, which could potentially turn into standalone tools:

- A Pintool doing Pronto. Given that Pin cannot sample performance counters in optimize/analyze mode, a hybrid tool in which Pronto is based on a Pintool would allow dynamic implementation of collecting performance data via pintool instrumentation. If sampling is needed, then a sampling driver can be initiated and called from a pintool using PAPI. The PAPI interface allows you to start, stop, and read the performance counters (e.g., calls to PAPI_start and PAPI_stop using Pin's instrumentation interface; a sketch of these calls follows this list). PAPI does require a kernel extension. This idea may also be implemented through HP's static Caliper tool.1

- A Pintool to seek hotspots. Hotspot analysis can be done by defining "what it means to be a hotspot" in a pintool, or statically by parsing sampling data with scripts.

- A Pintool which uses performance feedback data. Theoretically, a Pintool could be built which uses performance data which has been collected (perhaps by some other tool) on previous runs. Intel has a file format for storing performance monitor unit data (".hpi" files) which are used by the compiler. The Pintool reads events and performance counters dynamically as it executes a binary.

1 http://h21007.www2.hp.com/dspp/tech/tech_TechSoftwareDetailPage_IDX/1,1703,1174,00.html (currently available for Itanium microarchitecture only)

- A Pintool to recognize event patterns. Sequitur can be used for static or dynamic analysis (with a certain performance overhead) for complex grammars or, in the context of this paper, event sequences.
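For the first item above, the counter handling itself is plain PAPI. The following minimal C sketch is not from the paper; it assumes PAPI is installed, abbreviates error handling, and simply shows the start/stop calls that a pintool's analysis routines could wrap around a code region:

    #include <stdio.h>
    #include <papi.h>

    static int event_set = PAPI_NULL;

    /* Create an event set with two common preset counters. */
    static void counters_init(void)
    {
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return;                               /* PAPI not available */
        PAPI_create_eventset(&event_set);
        PAPI_add_event(event_set, PAPI_TOT_INS);  /* instructions retired */
        PAPI_add_event(event_set, PAPI_TOT_CYC);  /* total cycles */
    }

    static void region_begin(void) { PAPI_start(event_set); }

    static void region_end(void)
    {
        long long v[2];
        PAPI_stop(event_set, v);                  /* v[0]=instructions, v[1]=cycles */
        printf("CPI estimate: %.2f\n", (double)v[1] / (double)v[0]);
    }

    int main(void)
    {
        volatile long s = 0;
        counters_init();
        region_begin();
        for (long i = 0; i < 10000000; i++)       /* some work to measure */
            s += i;
        region_end();
        return 0;
    }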

2.1 Performance Overhead of Sample-analyze Workflow

Pintool instrumentation can be intrusive, and the overhead is dependent upon the particular tool used. The generated code under a pintool is the only code that executes, and the original code is kept for reference. Because instrumentation can perturb the performance behavior of the application, it is desirable to be able to collect performance data during an uninstrumented run, and use that data during a later instrumentation or execution run.

In addition, Sequitur performance for generic grammars containing many symbols may be extremely heavy.2

2.2 Developers’ Pain

As one could imagine, the multi-threaded application developers who are debugging and running on multi-core architectures need profiling tools (such as the VTune analyzer or EMON) to be aware of the stack frame and the execution context. The tools also need to take into account procedure inlining. It would also be useful if the simple data derived out of these tools could be used by the run-time environment to adapt the environment for better throughput.

2 Incorporation of the Sequitur algorithm into your instrumentation is an essential part of the techniques described in this paper. Due to the significant performance overhead, Sequitur is not used in generic form, and requires tailoring for particular usage cases with a simplified grammar. It helps to find performance hazards and sequences of events of interest, or those which characterize the application process.

In this section we will consider code examples known to cause much pain when debugging or scheduling. Then, we will suggest ways to adjust the Linux infrastructure to leverage and integrate the existing tools mentioned above to address these painful situations.

2.3 Profile-guided Debugging

A very common example is when you profile a multi-threaded application with frequent inter-process communications and interlocking. Many enterprise applications (web servers, data base servers, application servers, telecommunication applications, distributed content providers, etc.) suffer from complex synchronization issues when scaled. While optimizing such an application with standard profiling tools, it is common to observe most of the cycles being spent in synchronization objects themselves: for example, waiting for an object, idle loops, spin loops on shared memory.

Whether the implementation of synchronization objects is proprietary or via POSIX threads [15], a hot spot is noted as entering/leaving the critical section or locking/unlocking the shared object. Deeper analysis of such hotspots usually shows there is not much to optimize further on a micro-architectural level unless one is trying to optimize the performance of glibc. The real question is how to find out what objects actually originated a problematic idle or spin time in a millions-of-code-lines application with hundreds of locks? To track and instrument locks is not an easy task; it is similar to tracking memory allocations. Standard debugger techniques are not effective in identifying the underlying application issue.

Spinlock(mutex_t *m) {
    int i;
    for (i = 0; i < spin_count; i++)
        if (pthread_mutex_trylock(m) != EBUSY)
            return;
    pthread_mutex_lock(m);   /* or sometimes Sleep(M) */
}

Figure 1: Spin lock

spin_start:
    pause
    test  [mem], val           ; pre-read to save the atomic op penalty
    j     skip_xchg
    lock cmpxchg [mem], val    ; shows bus serialization stall
skip_xchg:
    jnz   spin_start

Figure 2: Spin Lock Loop

A standard adaptive spin lock implementation looks similar to Figure 1, where the inner spin lock loop translates to the instructions shown in Figure 2.

Let's analyze what characterizes this code. One obvious implication of using atomic operations (for entering/leaving a critical section it is an atomic add/dec) is that such operations serialize the memory bus, which yields significant stalls due to pipeline flush and cache line invalidation for [mem].

The code in Figure 2 would generate similar performance event patterns on most architectures. Following are the properties which characterize the code block profile:

• Very short loop (2–5 instructions)

• Very short loop containing nop or rep nop (pause)

• Contains an instruction with a lock prefix, yielding bus serialization

• Contains either an xchg, dec, or add instruction


The performance event profile for this block has the following properties:

• Likely branch misprediction at the 'loop' statement

• Very high CPI (cycles per instruction), as there is no parallelism possible

• Data bus utilization ratio (> 50%)

• Bus Not Ready ratio (> 0.01)

• Burst read contribution to data bus utilization (> 50%)

• Processor/Bus Writeback contribution to data bus utilization (> 50%)

• Parallelization ratio (< 1)

• Microops per instruction retired (> 2)

• Memory requests per instruction (> 0.1)

• Modified data sharing ratio (> 0.1)

• Context switch rate is high

We can define the grammar which consists of: loop length, loops with nops, locks, adds, dec, xchg; mispredicted branches, high CPI, context switches, high data bus utilization.

It is necessary to quantify each of the performance counters' values (also called knobs) so we can establish a trigger for a potential performance hazard. From the properties of the spin lock block we can define a rule for a block to become the hot block or the block of interest. Then, similar to the hot stream data prefetch example in Appendix A, we will use Sequitur to detect hot blocks containing event sequences within the defined grammar, as you can see below in Figure 5.

In order to simplify the problem for spin lock detection, we may limit ourselves to the analysis of only the code profile, as the entire performance profile is a direct result of having an instruction with a 'lock' prefix (e.g., a 'lock' prefix on any instruction results in data bus serialization, yielding a known set of performance stalls). The 'spin lock' code properties can be dynamically obtained at run-time by instrumentation. We can use the Pin command line knob facility to define a block's heat. For example, these knobs may be:

• Matched number of samples to trigger the hazard

• Number of consecutive samples

• Minimum and maximum length for a hot block

• Minimum spins to consider it hot

• Maximum number of instructions to consider a loop short

This technique3 would allow us to insert a breakpoint on an event of performance hazard - the hot spin lock according to the user's definition of a performance concern. The debugger would be able to stop and display the stack frame of the context when the given sequence of events had occurred.4

The described dynamic mechanism (one of the suggested "sample-analyze" workflows) detects a block of interest and breaks the execution on a performance issue.

3 For the spin lock profile, running the sampling alongside the pintool-instrumented executable is safe and correct, since the main characteristics of the spin lock could not be disrupted by the instrumented code shown above.

4 Ideas for a standalone profiling tool - Assume we have the ability to improve an open source (PAPI-based) or proprietary (VTune analyzer) profiling tool. Instead of Int 3 insertion we could insert a macro operation to dump a stack frame and register contents by the sampling driver. The existing symbol table would allow tracking source-level performance hazards defined by the pintool's knobs.


// Run in optimize mode only - no need for a sampling mode run

for (each basic block)
    for (each instruction in block) {
        if (instruction is branch) {
            target = TargetAddress(ins);
            if (trace_start < target && target < address(ins) &&
                (target - address < short loop knob)) {
                Insert IfCall(trace_count--);
                Insert ThenCall(spin_count++);
                New grammar(knobs);
            }
        }
        if ((instruction in block has lock prefix) &&
            (instruction is either xchg, cmpxchg, add, dec)) {
            if (grammar->AddEvent(address(ins)) &&
                (block_heat++ > hot block knob)) {
                Insert Interrupt 3;   // for debugging the application
            }
        }
    }

Figure 3: Pintool's trace instrumentation pseudocode

For the static profiling schema we will modify this workflow as follows:

On the run-time side:

- Pintool instrumentation inserting software-generated interrupts would stay the same as in the dynamic case. The Pintool would read the information about hot spin locks from static profiling results (PGO) or the pronto repository.

- Allow the debugger to read the pronto repository directly. This data would contain pairs, such as (ip address, number of times the IP is reached - signifying the heat of the block).

On the compile-time side:

- A newly developed pintool that would be similar in functionality to PGO and the profrun utility without sampling. However, this pintool would contain the same detection algorithm by the Sequitur as described in Figure 3, which detects hot spin locks according to user-defined knob values marking the heat. In this manner, pronto_tool is virtually replaced with the Sequitur. Upon detection of a hot block, the pintool spills the pair (ip, frequency) into the pronto repository.

- Due to unique code properties, the current implementations of PGO and the profrun utility already contain the needed information about the spin lock block's code profile. We still need to build a script which would replace the analysis tool pronto_tool and is based on parsing profile data with the Sequitur. This mechanism would extract event patterns matching our definition of the spin lock block's heat. The detected pair (ip address, number of times this IP has been reached until the block became hot) is inserted into the profiling info, and subsequently passed to the debugger.


This summarizes another suggestion for a new standalone run-time tool that reads in the profile data and uses it to find hot locks, and then tells the debugger the IP of those blocks so it can stop there.

With our attempt to characterize code by its performance profile, one may ask how adequate the mapping between an actual code block and its performance profile is. Would a sequence of events spanned by performance properties symbolize spin lock code, or in other words, how uniquely do code block properties define the code itself? For performance debugging or adaptation, the functionality of a hot block itself is not important. Rather, what is important is its algorithm's mapping onto the micro-architecture and the stalls caused by this mapping. Therefore, it is sufficient to accurately describe a performance hazard and signal when its properties have occurred.

2.4 Performance Adaptive Run-time Scheduler

Consider running a high performance multi-threaded application. Many computation and memory intensive applications (rendering, encoding, signal processing, etc.) suffer from complex scheduling issues when scaling on modern multi-threaded and multi-processor architectures. Often, the developers optimize these applications by parallelizing single threaded computation to run multiple threads. In order to make the application run well in parallel, the developers perform functional decomposition.5

Then, the OS scheduler takes over the decision on how to schedule these functionally decomposed threads onto the available hardware. Since the OS scheduler is not aware of micro-architecture, functional decomposition, or OpenMP, parallelization often leads to performance degradation. In order to analyze this phenomenon there has been much research on informed multi-threaded schedulers [12], symbiotic job scheduling for SMP [13], [14], and MASA [1]. In this paper we will take an approach of bridging existing profiling tools and advanced compiler technologies to take a step further in solving this problem.

5 Functional decomposition is the analysis of the activity of a system as the product of a set of subordinate functions performed by independent subsystems, each with its own characteristic domain of application.

As an example, consider the open source LAME mp3 encoder.6 It is clear that the application is both computation and memory intensive, where the computation is mostly floating point. Functional decomposition of the hotspot function lame_encode_mp3_frame() is equivalent to a functional decomposition of the L3psycho_anal() function. All decomposed threads at any point in time could unfold into a situation when running processes utilize similar resources on the same physical core (e.g., the threads are: floating point intensive, floating point intensive, heavy integer computations, heavy integer computations, long memory latency operations, long memory latency operations).

As in the previous section, a running thread's profile consists of performance and code properties. Below, we will analyze such properties and the knobs defining the heat:

Following is the structure of properties for a floating point intensive code block:

- Estimated functional imbalance originated by the compiler's scheduler

- Estimated CPI by the compiler's scheduler

- Outer loop iteration count

6 http://lame.sourceforge.net/download/download.html


You can see similar characteristics in integer intensive and memory intensive code blocks. These code block properties ignore possible coding style inhibitors and are agnostic to some optimization techniques (such as code motion). Nested loops and non-inlined calls within a loop are merged into a single region at run-time, since the block of interest in this case would be an outer block encapsulating multiple iterations to the same instruction pointer.

Event profile to determine performance properties for a floating point intensive block:

- Balanced execution and parallelism – actual cycles per instruction ratio

- Microops per instruction retired for very long latency instructions (FP)

- FP assist and saturation events per retired FLOPs

- Retired FLOPs, relative to instructions retired

- Conversion operations (RTDC), relative to instructions retired

- SSE instructions retired per instruction retired

Event profile for memory intensive block:

- Data bus utilization ratio (> 30%)

- Bus Not Ready ratio (> 0.001)

- Burst read contribution to data bus utilization (> 30%)

- Processor/Bus Writeback contribution to data bus utilization (> 30%)

- Microops per instruction retired (> 2) for repeat instructions

- Memory requests per instruction (> 0.1)

- Modified data sharing ratio (> 0.1)

In order to write a pintool for this topic, it is necessary to be able to deliver some compile-time derived data to the run-time. In particular, some code block characteristics can be easily determined by the compiler's scheduler: for example, CPI and other parallelism metrics, scheduled memory operations, scheduled floating point operations, etc. Compilers can output such information via object code annotations, optimizer reports, or post-compilation scripts that can strip the required statistics from the generated assembly code. The recent changes in GCC's vectorizer and optimizer include Tree SSA.7 It is possible to get the compiler scheduler's reports using the -ftree-vectorizer-verbose compiler option. The code block properties derived from the compiler scheduler's data are only needed on hot blocks. However, the compiler does not know which block is hot unless PGO or Pronto was used.
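For instance, with GCC 4.x the vectorizer's report can be requested along these lines (an illustrative command; the verbosity level and source file are arbitrary):

    gcc -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 -c loops.c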

On the run-time side, Pin has a disassembly engine built in. A pintool would be able to easily determine functional unit imbalance in a hot block if it isn't available from the compiler. Assuming that Pin has the ability to retrieve some compiler scheduler data, there are a few ways to create a pintool to determine whether running code has the properties of floating point or memory intensive hot blocks:

For the compile time:

- A "2-model" compilation can usually do both: determine and process hot block properties. The generated assembly, PGO and Pronto repository can be concurrently processed with a script to extract the instruction level parallelism (ILP) information per hot block.

7 Static Single Assignment for Trees [18]: the new GCC 4.x optimizer: http://gcc.gnu.org/projects/tree-ssa/#intro


- Extend the code's debug info into information containing the ILP of basic blocks during compilation. It is an estimated value, not based on run-time performance. The scheduler's compile-time data could be passed through the executable itself as a triple (start block address, end block address, parallelism data). Pintool has an extensive set of APIs that access debug info.

For the run-time:

- Instrument the binary with a pintool that traces loops with a large number of counts (a potential knob). Then, count the number of floating point, memory and integer operations in a loop.

- PGO and Pronto may also contain ILP-related ratios, which are derived from basic sampling during a profiling run (with the PAPI interface). Extending the Pronto repository to carry parallelism info can improve the ability of the pintool's instrumentation analysis.

Additional run-time instrumentation can be based on the performance profile of the hot compute intensive blocks by running sampling along with instrumentation.8

This schema shows the feasibility of obtaining a process property. However, the possible performance overhead of sampling and processing (even if it is incorporated in one instrumentation-sampling step) may be too heavy to make run-time decisions for the OS scheduler.

Now we will analyze the data collection process for the performance profile-aware OS scheduler. First, let's exclude 2-step models as inappropriate schemas for OS scheduling. Assuming the OS cannot contain low-overhead continuous sampling, a pintool instrumentation embedded into a running process cannot be extensive, but can be discrete.

8 The unique performance profile reflecting compute intensive block properties would not be disrupted by instrumentation performance overhead.

We will also assume that estimated ILP information and the compiler scheduler's data can be retrieved via debug information for each basic block.9 Instrumentation can count the frequency and count of each basic block, determining the estimated heat of the block.

On the run-time side, we would require implementation of one of the following: limited sampling, processing of the Pronto repository, or decomposing the compiler scheduler's decision for the length of one basic block. We propose that context switch time might be an appropriate place to insert this lightweight process.

Pin instrumentation can be done at basic block granularity with Pin itself setting up instrumentation calls, which greatly improves performance.

Thus, we are considering three approaches for a dynamic performance adaptive scheduler:

1. Annotation, no instrumentation. A limited lightweight instrumentation is done only on the level of basic blocks. This instrumentation would not be based on performance counters, clock cycles or actual ILP info. This assumes some basic compiler scheduling data can be incorporated into an executable using mechanisms similar to debug symbols. This would provide an estimated code block profile. As soon as a code block gets to a specified heat (a user pre-defined knob on loop iteration count), the pintool triggers an internal OS scheduling event carrying the code profile signature.

2. Limited sampling. As noted earlier, limited sampling may be possible at the OS scheduler's checkpoint, such as a context switch. This could refine the information obtained from item 1 above and give more accurate data on actual ILP. A single sampling iteration over a basic block could detect performance profile hazards based on counters which are specific to the compute intensive blocks shown above. A trial sampling run would last only during the length of a single iteration of the loop, assuming a context switch had occurred several times during execution of a large loop count.

9 The Pronto data is mapped using debug info in DWARF2 format. Some compiler-based info such as predicted ratios, IPC, or FPU/ALU/SIMD utilization could be added to the pronto repository data derived from pre-characterized hot blocks.

3. Instrumentation under the "2-compile" model. Assuming ILP information can be incorporated into the binary, we would use PGO or Pronto mechanisms to generate actual sampling ratios within the profile feedback repository. After a training run we collect the profile information, which includes each basic block's frequency and count along with its ILP info. Assuming this information is available in the binary, it would indicate to the OS scheduler the performance properties of the running process. This workflow would be enabled by a simple pintool instrumentation that analyzes each basic block's information.

From the workflow above, there is an obvious conclusion that the loader/linker has to have certain abilities to map and maintain the new information passed within the generated binary. Investigating the glibc code for potential loader/linker changes in elf/dl-open.c, dl-sym.c, and dl-load.c, we noted the possibility of creating a number of loading threads that could load libraries in parallel. With _dl_map_object_from_fd(), each of the threads would retrieve the various information carried in the executable by the link-time procedure of locating symbols. In this way the hashing mechanism for objects with a large number of symbols can be parallelized in dl_lookup_symbol_x(), which calls the expensive hashing algorithm do_lookup_x(). However, conducting this experiment any further is out of scope for this paper.

It is appropriate to comment on the possible workflow combinations of "sampling-analysis" for the static algorithm of the OS performance adaptive scheduler. Due to the nature of the usage model, the static algorithm may be suitable for a feasibility study or for prototyping the approach, but it is least likely to be used in real life and therefore is not described in detail in this paper.

Each of the workflows discussed requires certain capabilities to be developed:

1. Sampling drivers are closed source but can be distributed. The open source interface for TB5 format analysis should be implemented by PAPI or the VTune Analyzer/SEP.

Most of the performance events required for determining the code's performance profile are public.

2. Compiler (GCC). Performance analysis tooling with basic profile feedback and vectorizer report mechanisms already exists. The following enhancements would be needed:

- The vectorizer reports must include compiler scheduling information on parallelism.

- Mechanisms to incorporate compiler reports per basic block into a binary need to be developed.

- A utility which collects sampling data for profile feedback needs to be developed, based on the PAPI interface.

3. Pronto repository and profrun utility. These utilities currently exist as part of the Intel Compiler, but are closed source because they use the VTune TB5 file format. The following enhancements would be needed:

- PAPI interface for the profrun utility and Pronto repository


- Pronto using pintool instrumentation

4. Pin. The following enhancements would be needed:

- API extensions to retrieve compiler scheduler info that is embedded into a binary

- Compiler scheduler decomposition API (APIs that retrieve the compiler's scheduler information, especially related to ILP)

- API ability to read the Pronto repository from memory or a file

- A "timer" pintool to help development activities track performance (via gettime())

- New pintool instrumentation libraries to provide descriptions of hazardous performance event sequences based on common code and performance profiles

- New Pin APIs that can perform independent sampling via the PAPI interface to hide architecture dependencies, "a Pintool doing Pronto"

- For each pintool instrumentation, specify a set of performance counters that may be affected by the instrumentation itself.10

5. Loader/linker. The loader can easily be instrumented with the Pin interface relying on the IMG_ API set. The following enhancements would be needed:

- Properly dispatch the additional compiler scheduler information embedded into the binary, similarly to debug info

- For faster linking, improve the OS loader's speed by creating loading threads

10 These performance counters or ratios may not be present on your architecture under the specified name, but modern architectures are assumed to have similar ones.

6. VTune analyzer. The following enhancements would be filed with the Intel VTune development team:

- Ability to recognize and sample pintool-instrumented code.

- Capability to receive a signal from instrumented code in order to display and translate the process context and stack frame.

7. Debugger (GDB):

- Compile-time feedback: enable reading basic block ILP information along with the debug information incorporated by the compiler's scheduler.

- Run-time feedback: enable reading the Pronto repository with (frequency, count) information for each basic block. This may eliminate the need for pintool instrumentation for the debugger.

- Consider a scripting language to describe event sequences and pintool instrumentation algorithms.

8. OS Scheduler:

- Needs the ability to retrieve a process profile signature which characterizes the performance and code constructs derived from the running executable.

In order to test the feasibility of the suggested tools without changes in the kernel, we can write an emulation application with a simplified user-mode scheduler algorithm that sets CPU affinity based on the process's profile data, as sketched after this list.

- Have the option of signaling to the pintool instrumentation process that a context switch is about to occur, and start a lightweight instrumentation mechanism with non-intrusive sampling for one iteration of a basic block. The pintool may communicate with the OS scheduler via ioctl, considering that the events are coming from a driver.
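As a concrete illustration of the emulation idea above, the following minimal user-mode sketch pins a process according to a hypothetical profile signature. The signature structure and the even/odd core policy are illustrative assumptions, not part of the proposal; only the standard glibc sched_setaffinity() call is real.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for CPU_ZERO/CPU_SET and sched_setaffinity */
#endif
#include <sched.h>
#include <sys/types.h>

/* Hypothetical profile signature delivered by the instrumentation side,
   reduced here to a single "is this process FPU/SIMD-bound?" flag. */
struct profile_sig { int fpu_simd_heavy; };

/* Pin the process to cores chosen from its profile: FPU/SIMD-heavy
   processes go to even logical CPUs, everything else to odd ones
   (an illustrative policy for an HT/dual-core box, not the real scheduler). */
static int place_by_profile(pid_t pid, const struct profile_sig *sig, int ncpus)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = sig->fpu_simd_heavy ? 0 : 1; cpu < ncpus; cpu += 2)
        CPU_SET(cpu, &set);
    return sched_setaffinity(pid, sizeof(set), &set);
}

A real emulation would refresh the mask whenever the profile signature changes, approximating the scheduling event described above without kernel modifications.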


3 Conclusion

In this paper we surveyed ideas spanning several technologies and tools developed by a vast community over the past 10 years. We tried to bridge recent accomplishments in mainstream compiler technology, performance counters, pattern recognition algorithms, advanced binary instrumentation tools, debugging approaches, and advanced dynamic optimization techniques. Many of these technologies were also inherited from previous research on database optimizations and compression algorithms. The demonstrated complex workflows, which incorporate "sample-analyze" technologies into an enhanced run-time Linux infrastructure, make another step towards advanced dynamic optimization, debugging, and process scheduling. Quantifying the significance and usefulness of the proposed approaches is a subject for separate research and experiments.

Please refer to Appendix B for potential applications of the suggested ideas.

4 Acknowledgements

I would like to thank my colleagues and highlight their involvement in supplying important supporting materials, thoughts, and comments:

Siddha Suresh, Venkatesh Pallipadi: Intel, Open Source Technology Center

John Piper, Anton Chernoff, Shalom Goldenberg: Intel, Dynamic Optimization Labs

Robert Cohn, Harish Patil: Intel

Raymond Paik: Intel, Channel Product Group

5 Appendix A. Technology Background and Current Situation

Most modern micro-processors have a PMU, a virtual or physical performance monitoring unit that contains hundreds or even thousands of performance counters and events. Modern micro-architectural profiling technology is divided into two distinct steps: sampling and analysis. The sampling mechanism records instruction pointers (IP) with performance counters, as in (IP, frequency, count) or (IP, value). The analysis processes the sampling data for the maximum time spent in repeated blocks (hot blocks), possibly including disassembly, mapping to the source code, and affiliation to a function, a process, or a thread.

As these are the building blocks of the proposed workflow, it is important to review the existing tools and technology. Some of these tools are open source; some are proprietary or have a closed-source engine with a BSD-style license for free distribution.

Sampling tools:

Well-known profiling tools and programming interfaces (such as the VTune analyzer, EMON, PAPI, and the compiler's profile guided optimization (PGO) with sampling) are usually system-wide and process agnostic. Sampling tools can be attached to any running process, but do not have access to the full run-time environment and functionality context: thread storage, register values, loop count, frequency, and stack.
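For reference, reading such counters through the PAPI interface mentioned above is compact. The following sketch measures cycles and retired instructions around a code region; the events used are standard PAPI presets, though not every CPU exposes both.

#include <papi.h>
#include <cstdio>

// Measure total cycles and retired instructions for a region of interest
// using the PAPI low-level event-set API.
int measure_region(void)
{
    int es = PAPI_NULL;
    long long values[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return -1;
    if (PAPI_create_eventset(&es) != PAPI_OK)                    return -1;
    if (PAPI_add_event(es, PAPI_TOT_CYC) != PAPI_OK)             return -1;
    if (PAPI_add_event(es, PAPI_TOT_INS) != PAPI_OK)             return -1;

    PAPI_start(es);
    /* ... code region being profiled ... */
    PAPI_stop(es, values);

    printf("cycles=%lld instructions=%lld IPC=%.2f\n",
           values[0], values[1], (double)values[1] / (double)values[0]);
    return 0;
}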

Another type of sampling tool does not have the concept of time and is built upon executed instructions along the execution path. Such tools are not aware of stalls and clock cycles, but can sample executed instruction properties such as instruction count, instruction operands, branches, addresses, functions, images, etc.


Pin [8], [17] can be considered a "JIT" (just-in-time) compiler, with the originating binary's execution intercepted at a specified granularity. This execution is almost identical to the original. Pin contains examples of instrumentation tools like basic block profilers, cache simulators, instruction trace generators, etc. It is easy to derive new Pin-based tools using the examples as a template, as the sketch below illustrates.
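A sketch in the spirit of Pin's own basic-block counting example; it follows the documented Pin 2.x tool structure, while the counters and output format are ours.

#include "pin.H"
#include <iostream>

static UINT64 bblCount = 0;
static UINT64 insCount = 0;

// Analysis routine: executed before every dynamic basic block.
static VOID CountBbl(UINT32 numInsInBbl)
{
    bblCount += 1;
    insCount += numInsInBbl;
}

// Instrumentation routine: Pin calls this once per JIT-compiled trace,
// and we attach the analysis call to each basic block in the trace.
static VOID Trace(TRACE trace, VOID *v)
{
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
        BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)CountBbl,
                       IARG_UINT32, BBL_NumIns(bbl), IARG_END);
}

static VOID Fini(INT32 code, VOID *v)
{
    std::cerr << "basic blocks: " << bblCount
              << "  instructions: " << insCount << std::endl;
}

int main(int argc, char *argv[])
{
    if (PIN_Init(argc, argv)) return 1;
    TRACE_AddInstrumentFunction(Trace, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();   // never returns
    return 0;
}

Replacing the counters with a heat threshold that raises a signal once a block exceeds a given iteration count would give the "annotation, no instrumentation" trigger described earlier.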

Analysis tools:

Well-known analysis tools are compilers and debuggers. Beyond actual code generation and instruction scheduling, the compiler has the ability to report on optimizations, scheduler heuristics and decisions, and predicted performance counter ratios (please refer to [5]). The compiler can determine the code profile and estimate the performance profile of any code block, specifying the execution balance across CPU units.

Some sampling tools such as VTune and EMON incorporate data analysis within them.

There are advanced parts of a compiler's optimizer which can serve as standalone tools able to parse sampling files and extract profile data.

Before moving on to another class of tools, there are also more sophisticated data analysis tools which work in formal language environments. Since Sequitur originated from compression algorithms, it is capable of defining a sequence of events as a grammar and tracing event samples for correct expression composition from given sequences of samples. The Sequitur algorithm description can be found in [9]; public implementations of Sequitur are available.

Hybrid tools:

The following tools are hybrids between sampling and analysis. Most require a complex build model with up to a 3-compilation process (see Figure 5).

Profile-guided optimization (PGO for the Intel Compilers, or profile feedback for GCC) is part of many modern compilers [3], [4]. Currently, PGO mostly samples executed branches, calls, frequencies, and loop counts, with subsequent output of the data to disk in an intermediate format. The PGO analysis is also part of the compiler's optimizer, which is invoked with a second compilation. It assists the compiler's scheduler in making better decisions by using actual run-time data instead of heuristics targeting probable application behavior.

As an extension of the PGO mechanism, a tool incorporating a trace of run-time specific events and samples (for instance, actual memory latency or cache misses) has been developed. This mechanism can be considered a bundled sampling tool, the profrun utility, with an analysis tool, pronto_tool [6], [16], which we will refer to as Pronto. Rather than just being a natural extension of PGO capabilities, these tools are standalone and not incorporated into the compiler. Architecture-wise, profrun is built upon proprietary sampling drivers spilling the data onto disk; this data store is called the Pronto repository. Profrun is currently incorporated in the Intel Compiler package using Intel sampling drivers, but conceptually it could be based on the open source PAPI interface; see [7] for PAPI documentation. The pronto_tool reads and analyzes the Pronto repository for various data representations. A typical output is shown in Figure 4.

A hybrid tool, pintools, based on Pin, is a critical component of this paper's focus. Pintools incorporate targeted instrumentation and analysis into a single executable. This is an implementation powered by Pin API callbacks providing instrumentation for any running image at any granularity. The pintools mechanism can be considered a generic binary instrumentation template to create your own hybrid of sampling and analysis implementations.


$ profrun -dcache mark
$ pronto_tool -d 10 pgopti.hpi

PRONTO: Profiling module "mark":
PRONTO: Reading samples from TB5 file 'pgopti.tb5'
PRONTO: Reading samples for module at path: 'mark'

Dumping PRONTO Repository

Sample source 0: pgopti.tb5 UID: TYPE = TB5SAMP (54423553 414d5000 80ac9d3c 8f39c501 0043363d 8f39c501 00000000 00000000)

Module: "mark"
Event: "DCache miss": 35 samples
#0  :    1 samples: [0x00001c70] mark.c:main(20:14)
     total latency=    17  maximum latency=  17
     [0:7]=0    [8:15]=0    [16:31]=1  [32:99]=0 [100:inf]=0
#1  : 5757 samples: [0x00001701] mark.c:main(23:21)
     total latency= 43132  maximum latency= 366
     [0:7]=4070 [8:15]=1668 [16:31]=18 [32:99]=0 [100:inf]=1
#29 : 5786 samples: [0x00001700] mark.c:main(23:42)
     total latency= 40047  maximum latency= 439
     [0:7]=5294 [8:15]=433  [16:31]=55 [32:99]=0 [100:inf]=4

Figure 4: pronto_tool output

Current Pin instrumentation capabilities can extract only profiles that are not related to actual clock cycles: for example, taken branches, loop iteration counts, calls, memory references, etc.

Other examples of more sophisticated pintool-based technologies are helper threads [10.2] and hot data stream prefetching [2].

Helper thread technology, or software-based speculative pre-computation (SSP), originated from complex database architectures and is based on compiler-based pre-execution [10.1] to generate a thread that prefetches long-latency memory accesses at run time. This is the 3-compilation-model static technique. Its implementation is currently done in the Intel Compilers [16] with Pronto mechanisms (the option used for the first compilation is --prof-gen-sampling, for the second --prof-use --ssp, and for the third --ssp). The workflow diagram is shown in Figure 5.

Figure 5: Helper Threads (SSP) build diagram. (First compilation: counters are added to each block and an instrumented executable is built; instrumented execution produces .dpi files. Second compilation: the profile data is applied to produce an optimized executable; monitored execution using profrun produces .hpi files with DCache miss events. Third compilation: the profile data and the SSP transform are applied to produce the SSP-optimized object code.)



Figure 6: Hot data stream prefetch injection algorithm. (Run 1, analyze instrumentation mode with sampling on: the instrumented program image is profiled and its data reference sequence is analyzed with a Sequitur grammar to find hot data streams and their access strides. Run 2 executes with sampling off. Run 3, optimize instrumentation mode: prefetch insertion produces a new optimized image.)

As a dynamic equivalent to this technique, the mechanism described in [2] incorporates a pintool for dynamic instrumentation of memory read bursts in the sampling mode; Sequitur for fast dynamic analysis of memory access sequences with long latencies; and a matching mechanism for sequences that would benefit from prefetching (called the hot data streams detection phase). Based on the hot data streams, Pin can inject prefetching instructions, as shown in Figure 6.

6 Appendix B. Usage Model Examples On Proposed Workflows

Here are some potential applications of profile-guided debugging and performance adaptive scheduling:

1. OS scheduling decisions may be based on recurring patterns of hardware performance events, event hazard detection, or platform resource utilization hazards:

Some event sequences can indicate a hazard, upon which the OS scheduler may redefine priorities in the run queue and affinity to a logical/physical CPU.

2. Hyper-threading and Dual Core: immediate performance gains. If a recurring pattern of utilization of similar CPU resources is detected, thread affinity should be assigned to distribute these threads onto different physical cores. This approach is expected to show immediate performance gains on HT-enabled systems for a series of dedicated applications.

3. Independence from the usage model while adopting Dual-Core/Multi-Core. In order to adopt DC/MC for maximizing system performance, a user should be aware of the system usage model. With a performance adaptive scheduling infrastructure, usage model alternation becomes less relevant for performance. In turn, this may stimulate efficient adoption of multi-core technology by application developers, since user awareness of the usage model will not affect extracting optimal performance from the software.

4. Simplified software development schemes. Background/foreground and process priority management based on performance.

5. A hybrid of OpenMP & MPI for high performance programming will be simplified. A performance-adaptive OS scheduler will handle optimal scheduling dependent on processor resource utilization for each OpenMP thread.

6. Power utilization optimization and energy control. Modern micro-architectures have an extensive set of energy-control-related performance counters. When power restrictions are enforced for a process execution, the number of stall cycles due to platform resource saturation should be minimized. Optimized scheduling of processes to prevent such platform performance hazards should be handled by the OS scheduler.

7. Dynamic capacity planning analysis. Analyzing profiling data logs per thread and detecting certain event sequence hazards may assist in identifying capacity requirements for the application.

8. An out-of-order execution layer for statically scheduled code, better utilization of "free" CPU cycles, and compensation for possible compiler scheduler inefficiencies.

Having a performance-feedback-based OS scheduler will provide information with additional granularity (on top of the compiler scheduler) for filling the empty cycles generated by the compiler (or, if present, even during OOO execution on x86).

9. Virtualization Technology. When code is running on virtual processing units and utilizing a virtual pool of resources, it is important to provide optimal performance and dynamic code migration suggestions. The assignment between virtual and physical processing units should be done based on actual performance execution statistics. If Linux is a "guest" OS, the presence of a performance adaptive scheduling mechanism will allow the OS scheduler to be aware of resource utilization across all the virtual processes.

10. The profile-guided debugging proposal targets the most difficult areas of debugging: performance debugging and scalability issues. See [10.6] for examples of how to utilize Helper Threads technology for memory debugging. By combining principles of profile-guided optimization and conventional debugging mechanisms, we showed it is possible to architect a debugger extension to set a breakpoint at a performance or power pattern occurrence. As a result, a variety of performance metrics may be reflected in the debugging, such as the mips/watt ratio or instruction level parallelism. State-of-the-art debuggers allow users to manually define a breakpoint on expressions which involve values obtained from memory during application execution. This approach helps to extend these mechanisms to combine expression values received from CPU and chipset performance counters at run-time.

Examples of possible breakpoints which would be set by a user who debugs multi-threaded applications are:

• Hazardous spin locks

• Shared memory race conditions

• Too long or too short object waits

• Heavy ITLB, Instruction, or Trace Cache misses

• Power-consuming blocks; floating point intensive procedures

• Loops with extremely low CPIs; low-power blocks

• Long latency memory bus operations

• Irregular data structure accesses; alignment issues during run-time

• Queue and pipeline flushes, unexpected long latency execution

• Opcode or series of opcodes being executed

• Hyper-threading contentions or race conditions

• OpenMP issues

There are already working applications with Pin-based instrumentation for simulation and performance prediction purposes. Extending these simulation technologies [15], similarly to the PGD technique generating Interrupt 3, we would be able to emit other interrupts and signals, and eventually generate an alternate sequence of events.


References

[1] "Enhancements for Hyper-Threading Technology in the Operating Systems – Seeking the Optimal Scheduling", Jun Nakajima and Venkatesh Pallipadi, Intel Corporation.

[2] "Dynamic Hot Data Stream Prefetching for General-Purpose Programs", Trishul M. Chilimbi and Martin Hirzel, Microsoft Research and University of Colorado.

[3] Compiler for PGO (profile feedback) and its repository data coverage: http://gcc.gnu.org/onlinedocs/gccint/Profile-information.htm

[4] Compiler optimizer, gcc 4.1: http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Optimize-Options.html

[5] Compiler vectorization and reports: http://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html

[6] Pronto repository content, profrun and pronto_tool – Intel Compiler tools, based on the 2-compile model: http://www.intel.com/software/products/compilers/clin/docs/main_cls/index.htm

[7] PAPI: http://icl.cs.utk.edu/papi/overview/index.html

[8] Pin & pintool: http://rogue.colorado.edu/pin; Pin manual for x86: http://rogue.colorado.edu/pin/documentation.php; Pin-related papers: http://rogue.colorado.edu/pin/papers.html

[9] Sequitur. For the Sequitur algorithm description see:

[9.1] "Compression and explanation in hierarchical grammars", Craig G. Nevill-Manning and Ian H. Witten, University of Waikato, New Zealand.

[9.2] "Identifying Hierarchical Structure in Sequences: A linear time algorithm", Craig G. Nevill-Manning and Ian H. Witten, University of Waikato, New Zealand, 1997.

[9.3] "Efficient Representation and Abstractions for Quantifying and Exploiting Data Reference Locality", Trishul M. Chilimbi, Microsoft Research, 2001.

[10] Helper threads and compiler-based pre-execution:

[10.1] For a concept overview see "Compiler Based Pre-execution", Dongkeun Kim dissertation, University of Maryland, 2004.

[10.2] Threads: Basic Theory and Libraries: http://www.cs.cf.ac.uk/Dave/C/node29.html

[10.3] Usage model for helper threads in "Helper threads via Multi-threading", IEEE Micro, 11/2004.

[10.4] "Helper Threads via Virtual Multithreading on an experimental Itanium 2 Processor-based platform", Perry Wang et al., Intel, 2002.

Helper threads and pre-execution technology for:

[10.5] Profiling, see "Profiling with Helper threads", T. Tokunaga and T. Sato (Japan), 2006.

[10.6] Debugging, see "HeapMon: A helper thread approach to programmable, automatic, and low overhead memory bug detection", R. Shetty et al., IBM Journal of Research and Development, 2005.

[11] "Dynamic run-time architecture technique for enabling continuous optimizations", Tipp Moseley, Daniel A. Connors, et al., University of Colorado.

[12] "Chip Multithreading Systems Need a New Operating System Scheduler", Alexandra Fedorova, Christopher Small, et al., Harvard University & Sun Microsystems.

[13] "Methods for Modeling Resource Contention on Simultaneous Multithreading Processors", Tipp Moseley and Daniel A. Connors, University of Colorado.

[14] "Pthreads Primer, A guide to multithreaded programming", Bill Lewis and Daniel J. Berg, SunSoft Press, 1996.

[15] SimPoint toolkit by UCSD: http://www-cse.ucsd.edu/~calder/simpoint/simpoint_overview.htm

[16] Intel Compiler Documentation – keywords: Software-based Speculative Precomputation (SSP); profrun utility, prof-gen-sampling: http://www.intel.com/software/products/compilers/clin/docs/main_cls/index.htm

[17] "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation", Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, Kim Hazelwood. Programming Language Design and Implementation (PLDI), Chicago, IL, June 2005.

[18] "Tree SSA: A new optimization infrastructure for GCC", Diego Novillo, Red Hat Canada, 2003: http://people.redhat.com/dnovillo/pub/tree-ssa/papers/tree-ssa-gccs03.pdf; http://gcc.gnu.org/projects/tree-ssa/#intro


A Reliable and Portable Multimedia File System

Joo-Young Hwang, Jae-Kyoung Bae, Alexander Kirnasov, Min-Sung Jang, Ha-Yeong Kim

Samsung Electronics, Suwon, Korea
{jooyoung.hwang, jaekyoung.bae, a78.kirnasov}@samsung.com

{minsung.jang, hayeong.kim}@samsung.com

Abstract

In this paper we describe the design and implementation of a database-assisted multimedia file system named XPRESS (eXtendible Portable Reliable Embedded Storage System). In XPRESS, conventional file system metadata like inodes, directories, and free space information are handled by a transactional database, which guarantees metadata consistency against various kinds of system failures. File system implementation and upgrades are made easy because the metadata scheme can be changed by modifying the database schema. Moreover, using a well-defined database transaction programming interface, complex transactions like non-linear editing operations are developed easily. Since XPRESS runs in user level, it is portable to various OSes. XPRESS shows streaming performance competitive with the Linux XFS real-time extension on Linux 2.6.12, which indicates the file system architecture can provide performance, maintainability, and reliability altogether.

1 Introduction

Previously, consumer electronics (CE) devices didn't use disk drives, but these days disks are being used in various CE devices, from personal video recorders (PVRs) to hand-held camcorders, portable media players, and mobile phones. File systems for such devices have requirements for multimedia extensions, reliability, portability, and maintainability.

Multimedia Extension. The multimedia extensions required for CE devices are non-linear editing and advanced file indexing. As CE devices become capable of capturing and storing A/V data, consumers want to personalize media data according to their preferences. They make their own titles and share them with their friends via the internet. PVR users want to edit recorded streams to remove advertisements and uninteresting portions. Non-linear editing systems used to be necessary only in the studio to produce broadcast content, but they will be necessary for consumers as well. A multimedia file system for CE devices should support non-linear editing operations efficiently. The file system should also support file indexing by content-aware attributes, for example the director and actor/actress of a movie clip.

Reliability. On the occurrence of system failures (e.g. power failures, reset, and bad blocks), a file system for CE devices should be recovered to a consistent state. Implementing a reliable file system from scratch, or adding API extensions to an existing file system, is difficult and requires a long stabilization effort. In the case of appending multimedia API extensions (e.g. non-linear edit operations), reliability can be a main concern.

Portability. Conventional file systems have dependencies on the underlying operating system. Since CE device manufacturers usually use various operating systems, the file system should be easily portable to various OSes.

Maintainability. Consumer devices are diverse, and the requirements for the file system differ slightly from device to device. Moreover, file system requirements are time-varying. Maintainability is desired to reduce the cost of file system upgrades and customization.

In this paper, we propose a multimedia file system architecture that provides all the above requirements. A research prototype named XPRESS (eXtensible Portable Reliable Embedded Storage System) is implemented on Linux. XPRESS is a user-level, database-assisted file system. It uses a transactional Berkeley database as the XPRESS metadata store. XPRESS supports zero-copy non-linear edit (cut & paste) operations. XPRESS shows performance, in terms of bandwidth and IO latencies, which is competitive with Linux XFS. This paper is organized as follows. The architecture of XPRESS is described in Section 2. XPRESS's database schema is presented in Section 3. Space allocation is detailed in Section 4. Section 5 gives experimental results and discussions. Related works are described in Section 6. We conclude and indicate future works in Section 7.

2 Architecture

File system consistency is one of the important file system design issues because it affects overall design complexity. File system consistency can be classified into metadata consistency and data consistency. Metadata consistency means supporting transactional metadata operations with ACID (atomicity, consistency, isolation, durability) semantics or a subset of ACID. Typical file system metadata consist of inodes, directories, the disk's free space information, and free inode information. Data consistency means supporting ACID semantics for data transactions as well: if a data transaction updating a portion of a file is aborted for some reason, the data update is either completed or discarded entirely.

There have been several approaches to implementing file system consistency: log-structured file systems [11] and journaling file systems [2]. In a log-structured file system, all operations are logged to the disk drive; the file system is structured as a log of consecutive file system update operations. Journaling stores operations in a separate journal space before updating the file system. A journaling file system writes twice (to the journal space and to the file system) while a log-structured file system writes once. However, the journaling approach is popular because it can upgrade an existing non-journaling file system to a journaling file system without losing or rewriting the existing contents.

Implementing a log-structured or journaling file system is a time-consuming task. There has been a lot of research on, and implementations of, ACID transaction mechanisms in database technology, and there are many stable open source and commercial databases which provide a transaction mechanism. So we decided to exploit a database's transaction mechanism in building our file system. Since we aimed to design a file system with performance and reliability altogether, we decided not to save file contents in the database: streaming performance could be bottlenecked by the database if file contents were stored in it. So only metadata is stored in the database, while file contents are stored in a partition. Placement of file contents on the data partition is guided by a multimedia-optimized allocation strategy.

Storing file system metadata in a database makes file system upgrades and customization much easier than in conventional file systems. XPRESS handles metadata through the database API and is not responsible for the disk layout of the metadata. The file system developer only has to design a high-level database schema and use the well-defined database API to upgrade an existing database. To upgrade conventional file systems, the developer has to handle the details of the physical layout of metadata and modify custom data structures.

XPRESS is implemented in user level because a user-level implementation gives high portability and maintainability. The file system source code is not dependent on the OS. Kernel-level file systems must comply with OS-specific infrastructures: a Linux kernel-level file system compliant with the VFS layer cannot be ported easily to different RTOSes. There can be overhead due to the user-level implementation: XPRESS must make system calls to read/write the block device file to access file contents, and if file data is sparsely distributed, a context switch happens for each extent. There was an approach to port an existing user-level database to kernel level [9] and develop a kernel-level database file system. XPRESS could be improved by using Linux AIO efficiently; currently it does not use Linux AIO but has no significant performance overhead.

Figure 1 gives a block diagram of the XPRESS file system. XPRESS is designed to be independent of the database. A DB Abstraction Layer (DBAL) is located between the metadata manager and Berkeley DB. DBAL defines a basic set of interfaces which modern databases usually have; SQL or complex data types are not used. XPRESS has few OS dependencies: the OSAL of XPRESS has only wrappers for interfacing with the block device file, which may differ across operating systems.

Figure 1: Block Diagram of XPRESS File System

There are four modules handling file system metadata: the superblock, directory, inode, and alloc modules. The directory module implements the name space using a lookup database (dir.db) and a path_lookup function. The function looks up the path of a file or directory through a recursive DB query for each token of the path separated by the '/' character; a simplified sketch of this process is shown below. The superblock module maintains free inodes. The inode module maintains the logical-to-physical space mappings for files. The alloc module maintains free space.
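A minimal sketch of the recursive lookup described above; dir_db_get() is a hypothetical wrapper around a dir.db query and is not part of the published XPRESS interface, and the root inode number is assumed to be 1.

#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: query dir.db for the (parent inode, name) key and
// return the child's inode number.  Stubbed here; XPRESS would issue a DB get.
bool dir_db_get(uint64_t parent_ino, const std::string &name, uint64_t *child_ino);

// Resolve an absolute path to an inode number by walking dir.db one
// path component at a time: exactly one DB query per '/'-separated token.
bool path_lookup(const std::string &path, uint64_t *ino)
{
    uint64_t cur = 1;                        // assumed root inode number
    std::istringstream tokens(path);
    std::string name;
    while (std::getline(tokens, name, '/')) {
        if (name.empty())
            continue;                        // skip leading or doubled '/'
        if (!dir_db_get(cur, name, &cur))
            return false;                    // component not found
    }
    *ino = cur;
    return true;
}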

In terms of file IO, the file module calls the inode module to update or query extents.db and reads or writes the file contents. The block device file corresponding to the data partition (say /dev/sda1) is opened in read/write mode at mount time, and its file descriptor is saved in the environment descriptor, which is referred to by every application accessing the partition. XPRESS does not implement caching but relies on the Linux buffer cache. XPRESS supports both direct IO and cached IO: for cached IO, the block device file is opened without the O_DIRECT flag, while for direct IO it is opened with O_DIRECT.

To support using multiple partitions concurrently, XPRESS manages mounted partition information using environment descriptors. An environment descriptor has information about a partition and its mount point, which is used for name space resolution. Since all the DB resources (transactions, locking, logging, etc.) corresponding to a partition belong to one DB environment, and a separate logging daemon is necessary for each environment, a new logging daemon is launched on mounting a new partition.

3 Transactional Metadata Management

3.1 Choosing Database

As the infrastructure of XPRESS, the database should conform to the following requirements. First, it should have transactional operation and recovery support; this is important for implementing metadata consistency in XPRESS. Second, it should be highly concurrent: since the file system is used by many threads or processes, the database should support and perform well under high concurrency. Third, it should be light-weight. The database does not have to support the many features which are unnecessary for file system development; it only has to provide the API necessary for XPRESS efficiently. Finally, it should be highly efficient. A database implemented as a separate process has a cleaner interface and better maintainability; however, a library architecture is more efficient.

Berkeley DB is an open source embedded database that provides scalable, high-performance, transaction-protected data management services to applications. Berkeley DB provides a simple function-call API for data access and management. Berkeley DB is embedded in its application: it is linked directly into the application and runs in the same address space as the application. As a result, no inter-process communication, either over the network or between processes on the same machine, is required for database operations. All database operations happen inside the library. Multiple processes, or multiple threads in a single process, can all use the database at the same time as each uses the Berkeley DB library. Low-level services like locking, transaction logging, shared buffer management, memory management, and so on are all handled transparently by the library.

Berkeley DB offers important data management services, including concurrency, transactions, and recovery. All of these services work on all of the storage structures. The library provides strict ACID transaction semantics by default; however, applications can relax the isolation or the durability, and XPRESS uses relaxed semantics for performance. Multiple operations can be grouped into a single transaction, and can be committed or rolled back atomically. Berkeley DB uses a technique called two-phase locking to support highly concurrent transactions, and a technique called write-ahead logging to guarantee that committed changes survive application, system, or hardware failures.
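To make the transaction model concrete, here is a minimal sketch of a transactional record update against a Berkeley DB environment using the standard C API; the environment path, database name, and key layout are illustrative, not the actual XPRESS ones.

#include <db.h>
#include <cstring>

// Open a transactional Berkeley DB environment and put one record atomically.
// Error handling is reduced to early returns to keep the sketch short.
int put_record(const char *home, const char *dbfile,
               const void *key, size_t klen, const void *val, size_t vlen)
{
    DB_ENV *env; DB *dbp; DB_TXN *txn;
    if (db_env_create(&env, 0) != 0) return -1;
    if (env->open(env, home, DB_CREATE | DB_RECOVER | DB_INIT_TXN |
                  DB_INIT_LOCK | DB_INIT_LOG | DB_INIT_MPOOL, 0) != 0) return -1;
    if (db_create(&dbp, env, 0) != 0) return -1;
    if (dbp->open(dbp, NULL, dbfile, NULL, DB_BTREE,
                  DB_CREATE | DB_AUTO_COMMIT, 0644) != 0) return -1;

    DBT k, v;
    memset(&k, 0, sizeof(k)); memset(&v, 0, sizeof(v));
    k.data = (void *)key; k.size = (u_int32_t)klen;
    v.data = (void *)val; v.size = (u_int32_t)vlen;

    env->txn_begin(env, NULL, &txn, 0);   // one transaction per operation
    if (dbp->put(dbp, txn, &k, &v, 0) == 0)
        txn->commit(txn, 0);              // all-or-nothing update
    else
        txn->abort(txn);                  // roll back on any failure

    dbp->close(dbp, 0);
    env->close(env, 0);
    return 0;
}

In XPRESS, each file system call would group all of its metadata reads and updates under one such transaction, so a crash in mid-call leaves no partial state.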

If a BDB environment is opened with the recovery option, Berkeley DB runs recovery for all databases belonging to the environment. Recovery restores the databases to a clean state, with all committed changes present, even after a crash. The databases are guaranteed to be consistent, and all committed changes are guaranteed to be present, when recovery completes.


DB          unit       key                   data             structure   secondary index
super       partition  partition number      superblock data  RECNO       -
dir         partition  parent INO/file name  INO              B-tree      INO (B-tree)
inode       partition  INO                   inode data       RECNO       -
free space  partition  PBN                   length           B-tree      length (B-tree)
extents     file       file offset           PBN/length       B-tree      -

Table 1: Database Schema. INO: inode number, RECNO: record number, PBN: physical block number

3.2 Database Schema Design

XPRESS defines five databases to store file system metadata; their schema is shown in Table 1. Each B-tree database has a specific key-comparison function that determines the order in which keys are stored and retrieved. Secondary indices are used for performance acceleration.

Superblock DB. The superblock database, super.db, stores the file system status and inode bitmap information. As the overall file system status can be stored in just one record, and the inode bitmap also needs just a few records, this database has a RECNO structure and does not need any secondary index. The super database keeps a candidate inode number; when a new file is created, XPRESS uses this inode number and then replaces the candidate inode number with another one selected after scanning the inode bitmap records.

Directory DB. The dir database, dir.db, maps directory and file name information to inode numbers. The key used is a structure with two values: the parent inode number and the child file name. This is similar to a standard UNIX directory that maps names to inodes. A directory in XPRESS is a simple file with a special mode bit. As XPRESS is a user-level file system, it does not depend on the Linux VFS layer; as a result it cannot use the directory-related caches of the Linux kernel (i.e., the dentry cache), and the database cache is used for that purpose instead.

Inode DB. The inode database, inode.db, maps inode numbers to file information (e.g., file size, last modification time, etc.). When a file is created, a new inode record is inserted into this database, and when a file is deleted, the inode record is removed from this database. XPRESS assigns inode numbers in increasing order, and the upper limit on the number of inodes is determined when creating the file system. It would be possible to create secondary indices on inode.db for efficiency (e.g., searching for files whose size is larger than 1 MB), but currently no secondary index is used.

Free Space DB. The free-space database, freespace.db, manages the free extents of the partition. Initially the free-space database has one record whose data is a single big extent covering the whole partition. When a change in file size happens, this database updates its records.

Extents DB. A file in XPRESS consists of several extents whose size is not fixed. The extents database, extents.db, maps file offsets to the physical block addresses of the extents holding the file data. As this database corresponds to a single file, its lifetime is the same as that of the file. The exact database name is identified with an inode number; extents.db is just the database file name. This database is the only one that can be removed dynamically, while all the other databases live on with the file system.

3.3 Transactional System Calls

A transaction is atomic if it ensures that all the updates in the transaction are done altogether or none of them is done at all. After a transaction is either committed or aborted, all the databases for file system metadata are in a consistent state. In a multi-process or multi-thread environment, concurrent operations should be possible without any interference from other operations; we call this property isolation. An operation will have already been committed to the databases, or be safely in the transaction log, if the transaction is committed, so the file system operations are durable, which means file system metadata is never lost.

Each file system call in XPRESS is protected by a transaction. Since XPRESS system calls use the transactional mechanism provided by Berkeley DB, ACID properties are enabled for each file system call of XPRESS. A file system call usually involves multiple reads of and updates to metadata, and an error in the middle of a system call can cause problems for the file system. By storing file system metadata in the databases and enabling transactional operations for accessing those databases, the file system is kept in a stable state in spite of many unexpected errors.

An error in updating any of the databases during a system call will cause the system call, which is protected by a transaction, to be aborted. There can be no partially done system calls: XPRESS ensures that any system call is either complete or not started at all. In this sense, an XPRESS system call is atomic.

Satisfying strict ACID properties can cause performance degradation. Especially in a file system, since durability may not be a strict requirement, we can relax this property for better performance. XPRESS compromises durability by not syncing the log on transaction commit. Note that the durability policy applies to all databases in an environment. Flushing takes place periodically, and the flushing cycle is configurable; the default cycle is 10 seconds.
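In Berkeley DB terms, a minimal sketch of this relaxation is to set DB_TXN_NOSYNC on the environment and flush the log from a background thread. The 10-second period mirrors the default cycle mentioned above; the flusher function itself is illustrative, not the actual XPRESS code.

#include <db.h>
#include <unistd.h>

// Relax durability: transaction commits no longer force the log to disk.
static void relax_durability(DB_ENV *env)
{
    env->set_flags(env, DB_TXN_NOSYNC, 1);
}

// Background flusher (pthread-style entry point): periodically force the
// write-ahead log to stable storage, bounding how many committed
// transactions a crash could lose.
static void *log_flusher(void *arg)
{
    DB_ENV *env = static_cast<DB_ENV *>(arg);
    for (;;) {
        sleep(10);                   // default flush cycle from the paper
        env->log_flush(env, NULL);   // flush all buffered log records
    }
    return NULL;
}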

Every XPRESS system call handles three types of failures: power-off, deadlock, and unexpected operation failure. These errors are handled case by case. A power-off cannot be handled immediately; after the system is rebooted, the recovery procedure starts automatically. A deadlock may occur during a transaction when a collision between concurrent processes or threads happens. In this case, one winning process accesses the database successfully and all other processes or threads are reported a deadlock; they have to abort and retry their transactions. In the case of an unexpected operation failure, the on-going transaction is aborted and the system call returns an error to its caller.

3.4 Non-linear Editing Support

Non-linear editing of A/V data means cutting a segment out of a continuous stream or inserting a segment inside a stream. This is required when users want to remove unnecessary portions of a stream; for example, after recording two hours of drama including advertisements, the user wants to remove the advertisement segments from the stream. Inserting is useful when a user wants to merge several streams into one title. These needs are growing because consumers capture many short clips while traveling and want to make a title consisting of selected segments of all the clips.

Conventional file systems do not consider this requirement, so to support these operations a portion of the file must be migrated. If the front end of a file is to be cut, the remaining contents of the file must be copied to a new file because the remaining contents are not block aligned. File systems assume block-aligned starts of files; in other words, they do not allow a file to start in the middle of a block. Moreover, a block is not assumed to be shared by different files. These have been general assumptions for hard disk file systems because a disk is a block-based device, which means space is allocated and accessed in blocks. The problem is more complicated for inserting: a file should have a hole to accommodate new contents inside it, and the hole may not be block-aligned, so there can be breaches at the boundaries of the hole.

We solve these problems by using byte-precision extent allocation. XPRESS allows a physical extent not to be aligned with the disk block size. After cutting a logical extent, the physical extents corresponding to the cut logical extent can be inserted into another file. Implementation of these operations involves updating the databases managing the extent information; data copying is not required for the editing operations, only the extent information is updated accordingly.
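A compact in-memory sketch of the idea: the real operations update extents.db records inside a transaction, whereas here a plain vector stands in for a file's extent list. No data bytes are touched when a range is cut.

#include <algorithm>
#include <cstdint>
#include <vector>

// One mapping entry: a run of file bytes backed by a run of disk bytes.
struct Extent { uint64_t disk_off; uint64_t len; };

// Cut the byte range [pos, pos+len) out of a file's extent list and return
// the removed extents; extents straddling the cut are split.  The removed
// list can then be spliced into another file's list (the "paste" half),
// so neither operation copies any file data.
std::vector<Extent> cut(std::vector<Extent> &file, uint64_t pos, uint64_t len)
{
    std::vector<Extent> before, removed, after;
    uint64_t off = 0, end = pos + len;
    for (const Extent &e : file) {
        uint64_t e_end = off + e.len;
        if (e_end <= pos || off >= end) {            // entirely outside the cut
            (e_end <= pos ? before : after).push_back(e);
        } else {                                     // overlaps the cut: split
            if (off < pos)
                before.push_back({e.disk_off, pos - off});
            uint64_t lo = std::max(off, pos), hi = std::min(e_end, end);
            removed.push_back({e.disk_off + (lo - off), hi - lo});
            if (e_end > end)
                after.push_back({e.disk_off + (end - off), e_end - end});
        }
        off = e_end;
    }
    before.insert(before.end(), after.begin(), after.end());
    file.swap(before);                               // file now lacks the cut range
    return removed;
}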

4 Extent Allocation and Inode Management

In this section, the term "block" refers to the allocation unit in XPRESS. The block size is configurable at format time; for byte-precision extents, the block size is configured to one byte.

4.1 Extent Allocation

There have been several approaches to improving the contiguity of file allocation. Traditionally (FAT, ext2, ext3) disk free space was handled by means of bitmaps: bit 0 at position n of the bitmap designates that the n-th disk block is free (can be allocated for a file). This approach has several drawbacks; searching for free space is not efficient, and bitmaps do not explicitly provide information on contiguous chunks of free space.

The concept of an extent was introduced in order to overcome the mentioned drawbacks. An extent is a pair consisting of a block number and a length in units of blocks; it represents a contiguous chunk of blocks on the disk. The free space can be represented by a set of corresponding extents. Such an approach is used, for example, in XFS (with the exception of the real-time volume). In the case of XFS, the set of extents is organized in two B+ trees, the first sorted by starting block number and the second sorted by the size of the extent. Due to such an organization, the search for free space becomes essentially more efficient compared to using bitmaps.

One way to increase contiguity is to increase the block size. This approach is especially useful for real-time file systems which deal with large video streams, in combination with the idea of using a special volume specifically for these large files; a similar approach is utilized in XFS for the real-time volume. Another way to improve file locality on the disk is preallocation. This technique can be described as allocating space for a file in advance, before the file is written to. Preallocation can be accomplished either at the application or at the file system level. Delayed allocation can also be used for contiguous allocation: this technique postpones the actual disk space allocation for the file, accumulating the file contents to be written in a memory buffer, thus providing better information to the allocation module.

XPRESS manages free space by using extents. The allocation algorithm uses two databases, freespace1.db and freespace2.db, which collect information on the free extents not occupied by any file in the file system. freespace1.db orders all free extents by their left end, and freespace2.db, which is a secondary index of freespace1.db, orders all free extents by length. The algorithm tries to allocate the entire req_len as one extent in the neighborhood of last_block. If that fails, it tries to allocate the maximum extent within the neighborhood. If no free extent is found within the neighborhood, it tries to allocate the maximum free extent in the file system. After allocating a free extent, the neighborhood is updated and the same process is repeated to allocate the remainder, iterating until the entire requested length is allocated. The search in the neighborhood is performed in the left and right directions using two cursors on freespace1.db. The neighborhood size is determined heuristically, proportional to req_len. A simplified sketch of this policy follows.
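A compact in-memory sketch of this neighborhood-first policy: the std::map stands in for freespace1.db (extents keyed by starting block), the full scan stands in for the length-ordered freespace2.db, and the real implementation walks DB cursors in both directions and sizes the window from req_len.

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

struct Extent { uint64_t start, len; };

// In-memory stand-in for freespace1.db: free extents keyed by starting block.
static std::map<uint64_t, uint64_t> free_by_start;

// Carve up to 'want' blocks off the free extent that begins at 'start'.
static Extent take(uint64_t start, uint64_t want)
{
    uint64_t len = free_by_start[start];
    uint64_t got = std::min(len, want);
    free_by_start.erase(start);
    if (got < len)
        free_by_start[start + got] = len - got;   // tail stays free
    return Extent{start, got};
}

// Allocate req_len blocks, preferring extents within 'window' of 'hint'
// (the file's last block), then falling back to the largest free extent
// anywhere in the file system.
std::vector<Extent> allocate(uint64_t hint, uint64_t req_len, uint64_t window)
{
    std::vector<Extent> out;
    while (req_len > 0 && !free_by_start.empty()) {
        uint64_t best = 0, best_len = 0;
        // 1. largest free extent whose start lies in the neighborhood of hint
        auto it = free_by_start.lower_bound(hint > window ? hint - window : 0);
        for (; it != free_by_start.end() && it->first <= hint + window; ++it)
            if (it->second > best_len) { best = it->first; best_len = it->second; }
        // 2. otherwise, largest free extent anywhere (freespace2.db's role)
        if (best_len == 0)
            for (const auto &e : free_by_start)
                if (e.second > best_len) { best = e.first; best_len = e.second; }
        Extent got = take(best, req_len);
        out.push_back(got);
        hint = got.start + got.len;   // keep allocating near the piece just placed
        req_len -= got.len;
    }
    return out;
}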

Pre-allocation is important for multi-threaded IO applications. When multiple IO threads try to write files, the file module tries to pre-allocate extents large enough to guarantee disk IO efficiency; otherwise, disk blocks become fragmented because contiguous blocks are allocated to different files when multiple threads allocate simultaneously. The pre-allocation size in XPRESS is 32 MB by default, and may vary according to disk IO bandwidth. On file close, unused space among the pre-allocated extents is returned to the free space database, which is handled by the FILE module of XPRESS.

4.2 Inode Management

File extents are stored in extents.db. Each file extent is represented by a logical file offset and a corresponding physical extent. Both the offset and the physical extent are specified with byte precision in order to provide facilities for partial truncation, that is, truncation of some portion of the file from a specified position with byte precision.

Figure 2: Logical to physical mapping

Let us designate a logical extent starting from the logical file offset a and mapped to a physical extent [b,c] as [a,[b,c]] for the explanations that follow. A logical file interval may contain some regions which do not have a physical image on the disk; such regions are referred to as holes. The main interface function of the INODE module is xpress_make_segment_op(). The main parameters of the function are the type of operation (READ, WRITE, or DELETE) and the logical file segment, specified by a logical offset and a length.

extents.db is read on file access to convert a logical segment to a set of physical extents. Figure 2 shows an example of segment mapping: the logical segment [start, end] corresponds to the following set of physical extents, {[start,[e,f]], [a,[g,h]], [b,[]], [c,[f,g]], [d,[i,k]]}, where the logical extent [b,c] is a file hole. In the case of a read operation on extents.db, the list of physical extents is retrieved and returned to the file module.

When the specified operation is WRITE and the specified segment contains a yet unmapped area (that is, a write to a hole or beyond the end of the file), an allocation request may be generated after aligning the requested size to the block size, since, as was mentioned, the allocation module uses blocks as units of allocation. In Figure 3, blocks are allocated when writing the segment [start,end] of the file.


Figure 3: Block allocation details

Figure 4: Throughput with 1 thread

XPRESS allows performing a partial file truncate operation as well as a cut-and-paste operation. The first operation removes some fragment of the file, while the latter also inserts the removed file portion into a specified position of another file. On a partial truncate operation, the truncated logical segment is left as a hole. On a cut operation, the logical segment mapping is updated to avoid the hole. On a paste operation, the mapping is updated to avoid overwriting as well.

5 Experimental Results

The test platform is configured as follows.
Target H/W: Pentium 4 CPU 1.70 GHz with 128 MB RAM and a 30 GB SAMSUNG SV3002H IDE disk
Target O/S: Linux 2.6.12
Test tools: tiotest (tiobench [1]), rwrt

The XPRESS consistency semantics are metadata journaling and ordered data writing, which is similar to the XFS file system and to ext3 in ordered data mode. Hence we chose XFS and ext3 as performance comparison targets. As the XFS file system provides a real-time sub-volume option, we also used it as one of the comparison targets; by calling it XFS-RT, we distinguish it from normal XFS. tiotest is a threaded I/O benchmark tool which can test disk I/O throughput and latency. rwrt is a basic file I/O application used to test ABISS (Adaptive Block IO Scheduling System) [3]. It performs isochronous read operations on a file and gathers information on the timeliness of system responses. It gives us the I/O latencies or jitters of streaming files, which is useful for analyzing streaming quality. We did not use XPRESS's I/O scheduling and buffer cache module because we can get better performance with Linux's native I/O scheduling and buffer cache.

Figure 5: Throughput with 4 threads

IO Bandwidth Comparison. Figure 4 and Figure 5 show the results of the I/O throughput comparison for each file system. These two tests were conducted with the tiotest tool with a block size option of 64 KB, which means the unit of read and write is 64 KB. In both cases, there is no big difference between the file systems.

IO Latency Comparison. We used both the tiotest and rwrt tools to measure I/O latencies when running multiple concurrent I/O threads. rwrt is dedicated to profiling sequential read latencies, and tiobench is used for profiling the latencies of write, random write, read, and random read operations.

Figure 6: MAX latency with 1 thread

Figure 7: MAX latency with 4 threads

Figure 6 and Figure 7 show the maximum I/O latencies for each file system obtained from tiotest. The maximum I/O latency is a little lower on the XPRESS file system: in terms of write latency, XPRESS outperforms the others, while the maximum read latencies are similar.

To investigate the read case further, we ran rwrt with the number of threads varying from 1 to 8 and measured the read latency of each read request. Each process tries its best to read blocks of a file. The results of these tests include the latencies of each read request, whose size is 128 KB. Table 2 shows the statistics of the measured read latencies for 2, 4, and 8 threads. The average latencies are nearly the same for all the experimented file systems. XPRESS shows a slightly smaller standard deviation than the others and an improvement in the maximum latencies; note that the XPRESS maximum latency is 413 milliseconds, while the maximum latency of XFS-RT is 524 milliseconds.

File System   Average    Std Dev    Max
  2 threads
Ext3           8.47       26.03      512
XFS            8.62       25.15      239
XFS-RT         8.96       25.66      260
XPRESS         8.62       24.93      262
  4 threads
Ext3          17.9594     72.70      877
XFS           17.8389     72.70      528
XFS-RT        18.7506     74.33      524
XPRESS        17.8467     71.79      413
  8 threads
Ext3          36.38      166.47     1144
XFS           36.35      166.86     1049
XFS-RT        38.34      171.39     1023
XPRESS        36.33      166.38      923

Table 2: Read latency statistics. All values are measured in milliseconds.

Jitters during Streaming at a Constant Data Rate

Jitter performance is important from the user's point of view because high jitter leads to low-quality video streaming. We ran the rwrt tool with ABISS I/O scheduling turned off. rwrt is configured to read a stream at a constant specified data rate, say 3MB/sec or 5MB/sec. Table 3 shows jitter statistics for each file system when running a single thread at a 5MB/sec rate, four concurrent threads at 5MB/sec each, and six concurrent threads at 3MB/sec each, respectively. The results of these tests include the jitter of each read request, whose size is 128KB.

File System     Average    Std Dev     Max
1 thread (5MB/s)
  Ext3            3.97       2.44       43
  XFS             3.91       2.03       46
  XFS-RT          3.94       1.79       25
  XPRESS          3.89       1.92       40
4 threads (each 5MB/s)
  Ext3           18.00      41.72      694
  XFS            15.67      30.25      249
  XFS-RT         19.22      34.63      267
  XPRESS         17.93      38.83      257
6 threads (each 3MB/s)
  Ext3           24.80      48.43      791
  XFS            23.81      42.20      297
  XFS-RT         26.57      43.48      388
  XPRESS         23.66      45.31      337

Table 3: Jitter statistics. All values are measured in milliseconds.

The mean values for the tested file systems are nearly the same. XPRESS, XFS, and XFS-RT show a similar standard deviation of jitter, which is much smaller than that of ext3.
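To make the measurement concrete, the following user-level sketch approximates the kind of constant-rate read loop that rwrt performs. It is an illustration of the methodology only, not the actual rwrt source; the 5MB/sec rate and 128KB request size are simply the values used in our tests, and error handling is elided.

/* Illustrative only: issue 128KB reads at a constant data rate and report
 * how far each request completes past its scheduled issue time ("jitter").
 * This is not the actual rwrt implementation. */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <time.h>

#define REQ_SIZE (128 * 1024)                  /* 128KB per read request   */

static void advance(struct timespec *t, long long ns)
{
        t->tv_nsec += ns;
        while (t->tv_nsec >= 1000000000L) {
                t->tv_nsec -= 1000000000L;
                t->tv_sec++;
        }
}

int main(int argc, char **argv)
{
        long long rate = 5 * 1024 * 1024;      /* e.g. a 5MB/sec stream    */
        long long period_ns = REQ_SIZE * 1000000000LL / rate;
        char *buf = malloc(REQ_SIZE);
        int fd = open(argv[1], O_RDONLY);      /* error handling elided    */
        struct timespec deadline, now;

        (void)argc;
        clock_gettime(CLOCK_MONOTONIC, &deadline);
        for (;;) {
                /* Wait until this request's scheduled issue time. */
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
                if (read(fd, buf, REQ_SIZE) != REQ_SIZE)
                        break;
                clock_gettime(CLOCK_MONOTONIC, &now);
                /* Jitter: completion time minus the scheduled issue time. */
                printf("%lld ms\n",
                       ((now.tv_sec - deadline.tv_sec) * 1000000000LL +
                        (now.tv_nsec - deadline.tv_nsec)) / 1000000LL);
                advance(&deadline, period_ns);
        }
        return 0;
}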

Metadata and Data Separation

Metadata accesses are usually small requests, while streaming accesses are continuous bulk data transfers. By placing the metadata and data partitions on different media, performance can be optimized according to their workload types. XPRESS allows metadata to be placed in a separate partition, which can reside on another disk or on NAND flash memory. Table 4 summarizes the effect of changing the db partition. This is a test of the extreme case, since we used a ramdisk (a virtual in-memory disk) as the separate disk. However, it lets us identify the theoretical potential for performance enhancement. According to the results, the enhancement occurs mainly in the write test, by 11%. We also expect a separate partition to reduce latencies significantly, although those results are not shown here.

Configuration    Write    Read
same disk        25.20    27.82
separate disk    28.20    28.53

Table 4: Placing metadata on a ramdisk and data on a disk. All values are measured in MB/s.

Non-linear Editing

To test the cut-and-paste operation, we prepared two files, each of 131,072,000 bytes, then cut the latter half (65,536,000 bytes) of one file and appended it to the other file. In XPRESS, this operation takes 0.274 seconds. For comparison, the operation was implemented on ext3 by iteratively copying 65,536-byte blocks, which took 3.9 seconds. The performance gap comes from not copying file data but only updating file system metadata (extents.db and inode.db). In XPRESS, the operation is an atomic transaction and can be undone if it fails during the operation. Our implementation of the operation on ext3, however, does not guarantee atomicity.
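The essence of the operation is that only extent and inode records change inside one transaction; no file data moves. The sketch below illustrates this structure. All helper and type names (xpress_txn_begin, extents_move_range, and so on) are hypothetical, since the paper does not publish the actual XPRESS API.

/* Hypothetical sketch: the helpers and types below are illustrative,
 * not XPRESS's published interface. */
#include <sys/types.h>

struct xpress_fs;
struct xpress_txn;
struct xpress_txn *xpress_txn_begin(struct xpress_fs *fs);
int xpress_txn_commit(struct xpress_txn *txn);
int xpress_txn_abort(struct xpress_txn *txn);
int extents_move_range(struct xpress_txn *txn, ino_t src, off_t start,
                       size_t len, ino_t dst);            /* extents.db update */
int inode_set_size(struct xpress_txn *txn, ino_t ino, off_t size);
int inode_grow(struct xpress_txn *txn, ino_t ino, size_t delta); /* inode.db */

/* Cut the range [cut_start, cut_start + cut_len) out of 'src' and append it
 * to 'dst' purely by updating metadata inside a single transaction. */
int cut_and_paste(struct xpress_fs *fs, ino_t src, ino_t dst,
                  off_t cut_start, size_t cut_len)
{
        struct xpress_txn *txn = xpress_txn_begin(fs);

        if (extents_move_range(txn, src, cut_start, cut_len, dst) ||
            inode_set_size(txn, src, cut_start) ||   /* source loses its tail */
            inode_grow(txn, dst, cut_len)) {         /* target grows by len   */
                xpress_txn_abort(txn);               /* undone atomically     */
                return -1;
        }
        return xpress_txn_commit(txn);
}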

6 Related Works

Traditionally, file systems and databases have been developed separately, without depending on each other. They are aimed at different purposes: a file system is for byte streaming and a database is for advanced query support. Recently there has been a growing need to index files by attributes other than the file name. For example, file systems containing photos and e-mails need to be indexed by date, sender, or subject. XML documents provide jit (just-in-time) schemas to support content-based indexing. Conventional file systems lack indexing for these new types of files. There have been a few efforts to implement database file systems to enhance file indexing: BFS [5], GnomeStorage [10], DBFS [6], kbdbfs [9], and WinFS [7]. Those do not address streaming performance issues. For video streaming, data placement on disk is important. While conventional database file systems resort to a DB or other file systems for file content storage, XPRESS controls the placement of file content by itself.

Compared to custom multimedia file systems (e.g. [12], [8], [4]), XPRESS has a well-defined file system metadata design and implementation framework, and its file system metadata is protected by transactional databases. Adding multimedia extensions like indexing and non-linear editing is easier. Moreover, since it is implemented at user level, it is highly portable to various OSes.

7 Conclusions and Future Works

In this paper we described a novel multimedia file system architecture satisfying streaming performance, multimedia extensions (non-linear editing), reliability, portability, and maintainability. We described the detailed design and issues of XPRESS, which is a research prototype implementation. In XPRESS, file system metadata are managed by a transactional database, so metadata consistency is ensured by the transactional DB. Upgrading and customizing the file system is an easy task in XPRESS because developers do not have to deal with the metadata layout on the disk drive. We also implemented atomic non-linear editing operations using DB transactions. Cutting and pasting A/V data segments is implemented as an extents database update without copying segment data. Compared to ext3, XFS, and the XFS real-time sub-volume extension, XPRESS showed competitive streaming performance and more deterministic response times.

This work indicates the feasibility of a database-assisted multimedia file system. The database-based, user-level implementation makes future design changes and porting easy, while streaming performance is not compromised. Future work includes application binary compatibility support using system call hijacking, adding contents-based extended attributes, and encryption support. Code migration to kernel level will also be helpful for embedded devices with low computing power.

References

[1] Threaded I/O tester. http://sourceforge.net/projects/tiobench/.

[2] M. Cao, T. Tso, B. Pulavarty, S. Bhattacharya, A. Dilger, and A. Tomas. State of the art: Where we are with the ext3 filesystem. In Proceedings of the Linux Symposium, July 2005.

[3] Giel de Nijs, Benno van den Brink, and Werner Almesberger. Active block I/O scheduling system. In Proceedings of the Linux Symposium, pages 109–126, July 2005.

[4] Pallavi Galgali and Ashish Chaurasia. SAN file system as an infrastructure for multimedia servers. http://www.redbooks.ibm.com/redpapers/pdfs/redp4098.pdf.

[5] Dominic Giampaolo. Practical File System Design with the Be File System. Morgan Kaufmann Publishers, Inc., 1999. ISBN 1-55860-497-9.

[6] O. Gorter. Database file system. Technical report, University of Twente, Aug 2004. http://ozy.student.utwente.nl/projects/dbfs/dbfs-paper.pdf.

[7] Richard Grimes. Revolutionary file storage system lets users search and manage files based on content, 2004. http://msdn.microsoft.com/msdnmag/issues/04/01/WinFS/default.aspx.

[8] R. L. Haskin. Tiger Shark: A scalable file system for multimedia. IBM Journal of Research and Development, 42(2):185–197, 1998.

[9] Aditya Kashyap. File System Extensibility and Reliability Using an in-Kernel Database. PhD thesis, Stony Brook University, 2004. Technical Report FSL-04-06.

[10] Seth Nickell. A cognitive defense of associative interfaces for object reference, Oct 2004. http://www.gnome.org/~seth/storage/associative-interfaces.pdf.

[11] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26–52, 1992.

[12] Philip Trautman and Jim Mostek. Scalability and performance in modern file systems. http://linux-xfs.sgi.com/projects/xfs/papers/xfs_white/xfs_white_paper.html.


Utilizing IOMMUs for Virtualization in Linux and Xen

Muli Ben-Yehuda
[email protected]

Jon Mason
[email protected]

Orran Krieger
[email protected]

Jimi Xenidis
[email protected]

Leendert Van Doorn
[email protected]

Asit Mallick
[email protected]

Jun Nakajima
[email protected]

Elsie Wahlig
[email protected]

Abstract

IOMMUs are hardware devices that translate device DMA addresses to proper machine physical addresses. IOMMUs have long been used for RAS (prohibiting devices from DMA'ing into the wrong memory) and for performance optimization (avoiding bounce buffers and simplifying scatter/gather). With the increasing emphasis on virtualization, IOMMUs from IBM, Intel, and AMD are being used and re-designed in new ways, e.g., to enforce isolation between multiple operating systems with direct device access. These new IOMMUs and their usage scenarios have a profound impact on some of the OS and hypervisor abstractions and implementation.

We describe the issues and design alternatives of kernel and hypervisor support for new IOMMU designs. We present the design and implementation of the changes made to Linux (some of which have already been merged into the mainline kernel) and Xen, as well as our proposed roadmap. We discuss how the interfaces and implementation can adapt to upcoming IOMMU designs and how they can be tuned for different workload/reliability/security scenarios. We conclude with a description of some of the key research and development challenges new IOMMUs present.

1 Introduction to IOMMUs

An I/O Memory Management Unit (IOMMU) creates one or more unique address spaces which can be used to control how a DMA operation from a device accesses memory. This functionality is not limited to translation, but can also provide a mechanism by which device accesses are isolated.

IOMMUs have long been used to address the disparity between the addressing capability of some devices and the addressing capability of the host processor. As the addressing capability of those devices was smaller than the addressing capability of the host processor, the devices could not access all of physical memory. The introduction of 64-bit processors and the Physical Address Extension (PAE) for x86, which allowed processors to address well beyond the 32-bit limits, merely exacerbated the problem.

Legacy PCI32 bridges have a 32-bit interface which limits the DMA address range to 4GB. The PCI SIG [11] came up with a non-IOMMU fix for the 4GB limitation, Dual Address Cycle (DAC). DAC-enabled systems/adapters bypass this limitation by having two 32-bit address phases on the PCI bus (thus allowing 64 bits of total addressable memory). This modification is backward compatible to allow 32-bit, Single Address Cycle (SAC) adapters to function on DAC-enabled buses. However, this does not solve the case where the addressable range of a specific adapter is limited.

An IOMMU can create a unique translated address space, independent of any address space instantiated by the MMU of the processor, that can map the addressable range of a device to all of system memory.

When the addressable range of a device is limited and no IOMMU exists, the device might not be able to reach all of physical memory. In this case a region of system memory that the device can address is reserved, and the device is programmed to DMA to this reserved area. The processor then copies the result to the target memory that was beyond the "reach" of the device. This method is known as bounce buffering.

IOMMU isolation solves a very different problem than IOMMU translation. Isolation restricts the access of an adapter to the specific area of memory that the IOMMU allows. Without isolation, an adapter controlled by an untrusted entity (such as a virtual machine when running with a hypervisor, or a non-root user-level driver) could compromise the security or availability of the system by corrupting memory.

The IOMMU mechanism can be located on the device, the bus, the processor module, or even in the processor core. Typically it is located on the bus that bridges the processor/memory areas and the PCI bus. In this case, the IOMMU intercepts all PCI bus traffic over the bridge and translates the in- and out-bound addresses. Depending on the implementation, these in- and out-bound addresses, or translation window, can be as small as a few megabytes or as large as the entire memory space addressable by the adapter (4GB for 32-bit PCI adapters). If isolation is not an issue, it may be beneficial to have addresses beyond this window pass through unmodified.

AMD IOMMUs: GART, Device Exclusion Vector, and I/O Virtualization Technology

AMD’s Graphical Aperture Remapping Table(GART) is a simple translation-only hardwareIOMMU [4]. GART is the integrated transla-tion table designed for use by AGP. The sin-gle translation table is located in the proces-sor’s memory controller and acts as an IOMMUfor PCI. GART works by specifying a physicalmemory window and list of pages to be trans-lated inside that window. Addresses outside thewindow are not translated. GART exists in theAMD Opteron, AMD Athlon 64, and AMD Tu-rion 64 processors.

AMD’s Virtualization (AMD-V(TM) / SVM)enabled processors have a Device ExclusionVector (DEV) table define the bounds of aset of protection domains providing isolation.DEV is a bit-vectored protection table that as-signs per-page access rights to devices in thatdomain. DEV forces a permission check ofall device DMAs indicating whether devicesin that domain are allowed to access the cor-responding physical page. DEV uses one bitper physical 4K page to represent each page inthe machine’s physical memory. A table of size128K represents up to 4GB.


AMD’s I/O Virtualization Technology [1] de-fines an IOMMU which will translate and pro-tect memory from any DMA transfers by pe-ripheral devices. Devices are assigned into aprotection domain with a set of I/O page tablesdefining the allowed memory addresses. Be-fore a DMA transfer begins, the IOMMU inter-cepts the access and checks its cache (IOTLB)and (if necessary) the I/O page tables for thatdevice. The translation and isolation functionsof the IOMMU may be used independently ofhardware or software virtualization; however,these facilities are a natural extension to virtu-alization.

The AMD IOMMU is configured as a capability of a bridge or device which may be HyperTransport or PCI based. A device downstream of the AMD IOMMU in the machine topology may optionally maintain a cache (IOTLB) of its own address translations. An IOMMU may also be incorporated into a bridge downstream of another IOMMU-capable bridge. Both topologies form scalable networks of distributed translations. The page structures used by the IOMMU are maintained in system memory by hypervisors or privileged OS's.

The AMD IOMMU can be used in conjunction with or in place of the GART or DEV. While GART is limited to a 2GB translation window, the AMD IOMMU can translate accesses to all physical memory.

Intel Virtualization Technology for Directed I/O (VT-d)

Intel Virtualization Technology for Directed I/O Architecture provides DMA remapping hardware that adds support for isolation of device accesses to memory as well as translation functionality [2]. The DMA remapping hardware intercepts device attempts to access system memory. Then it uses I/O page tables to determine whether the access is allowed and its intended location. The translation structure is unique to an I/O device function (PCI bus, device, and function) and is based on a multi-level page table. Each I/O device is given either a DMA virtual address space that is the same as the physical address space, or a purely virtual address space defined by software. The DMA remapping hardware uses a context-entry table, indexed by PCI bus, device, and function, to find the root of the address translation table. The hardware may cache context-entries as well as the effective translations (IOTLB) to minimize the overhead incurred in fetching them from memory. DMA remapping faults detected by the hardware are processed by logging the fault information and reporting the faults to software through a fault event (interrupt).

IBM IOMMUs: Calgary, DART, and Cell

IBM’s Calgary PCI-X bridge chips providehardware IOMMU functionality to both trans-late and isolate. Translations are defined by aset of Translation Control Entries (TCEs) in atable in system memory. The table can be con-sidered an array where the index is the pagenumber in the bus address space and the TCEat that index describes the physical page num-ber in system memory. The TCE may also con-tain additional information such as DMA di-rection access rights and specific devices (ordevice groupings) that each translation can beconsidered valid. Calgary provides a uniquebus address space to all devices behind eachPCI Host Bridge (PHB). The TCE table can belarge enough to cover 4GB of device accessi-ble memory. Calgary will fetch translations asappropriate and cache them locally in a man-ner similar to a TLB, or IOTLB. The IOTLB,much like the TLB on an MMU, provides asoftware accessible mechanism that can invali-date cache entries as the entries in system mem-ory are modified. Addresses above the 4GB

Page 74: Proceedings of the Linux Symposium Volume One · 2006-07-19 · Eric W. Biederman Fully Automated Testing of the Linux Kernel 113 ... Kristen Carlson Accardi Open Source Technology

74 • Utilizing IOMMUs for Virtualization in Linux and Xen

boundary are accessible using PCI DAC com-mands. If these commands originate from thedevice and are permitted, they will bypass theTCE translation. Calgary ships in some IBMSystem P and System X systems.

IBM’s CPC925 (U3) northbridge, which canbe found on JS20/21 Blades and Apple G5machines, provides IOMMU mechanisms us-ing a DMA Address Relocation Table (DART).DART is similar to the Calgary TCE table, butdiffers in that the entries only track validityrather than access rights. As with the Cal-gary, the U3 maintains an IOTLB and providesa software-accessible mechanism for invalidat-ing entries.

The Cell Processor has a translation- and isolation-capable IOMMU implemented on chip. Its bus address space uses a segmented translation model that is compatible with the MMU in the PowerPC core (PPE). This two-level approach not only allows for efficient user-level device drivers, but also allows applications running on the Synergistic Processing Engine (SPE) to interact with devices directly. The Cell IOMMU maintains two local caches: one for caching segment entries and another for caching page table entries (the IOSLB and IOTLB, respectively). Each has a separate software-accessible mechanism to invalidate entries. However, all entries in both caches are software accessible, so it is possible to program all translations directly in the caches, increasing the determinism of the translation stage.

2 Linux IOMMU support and the DMA mapping API

Linux runs on many different platforms. Those platforms may have a hardware IOMMU, an emulation such as SWIOTLB, or direct hardware access. The software that enables these IOMMUs must abstract its internal, device-specific DMA mapping functions behind the generic DMA API. In order to allow generic, platform-independent drivers to be written, Linux abstracts the IOMMU details inside a common API, known as the "DMA" or "DMA mapping" API [7] [8]. As long as the implementation conforms to the semantics of the DMA API, a well written driver that is using the DMA API properly should "just work" with any IOMMU.
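As a concrete and deliberately minimal example of what such a well-behaved driver looks like, the fragment below maps a buffer for a device-bound transfer through the generic API; which mechanism ends up being used underneath (NOMMU, SWIOTLB, GART, or an isolation-capable IOMMU) is invisible to it. Error handling is elided.

#include <linux/device.h>
#include <linux/dma-mapping.h>

/* Minimal sketch of a driver using the DMA mapping API for one
 * streaming (single-use) mapping; error handling is elided. */
static void xmit_one_buffer(struct device *dev, void *buf, size_t len)
{
        dma_addr_t handle;

        /* Obtain a DMA address the adapter can use.  Depending on the
         * platform this may be the physical address, a bounce buffer,
         * or an IOMMU-translated address. */
        handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

        /* ... program 'handle' and 'len' into the adapter, start the DMA,
         *     and wait for the completion interrupt ... */

        /* Release the mapping (and any bounce/IOMMU resources) when done. */
        dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
}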

Prior to this work, Linux's x86-64 architecture included three DMA API implementations: NOMMU, SWIOTLB, and GART. NOMMU is a simple, architecture-independent implementation of the DMA API. It is used when the system has neither a hardware IOMMU nor software emulation. All it does is return the physical memory address of the memory region it is handed as the DMA address for the adapter to use.

Linux includes a software implementation of an IOMMU's translation function, called SWIOTLB. SWIOTLB was first introduced in arch/ia64 [3] and is used today by both IA64 and x86-64. It provides translation through a technique called bounce buffering. At boot time, SWIOTLB sets aside a large physically contiguous memory region (the aperture), which is off-limits to the OS. The size of the aperture is configurable and ranges from several to hundreds of megabytes. SWIOTLB uses this aperture as a location for DMAs that need to be remapped to system memory higher than the 4GB boundary. When a driver wishes to DMA to a memory region, the SWIOTLB code checks the system memory address of that region. If it is directly addressable by the adapter, the DMA address of the region is returned to the driver and the adapter DMAs there directly. If it is not, SWIOTLB allocates a "bounce buffer" inside the aperture, and returns the bounce buffer's DMA address to the driver. If the requested DMA operation is a DMA read (read from memory), the data is copied from the original buffer to the bounce buffer, and the adapter reads it from the bounce buffer's memory location. If the requested DMA operation is a write, the data is written by the adapter to the bounce buffer, and then copied to the original buffer.

SWIOTLB treats the aperture as an array, and during a DMA allocation it traverses the array searching for enough contiguous empty slots to satisfy the request, using a next-fit allocation strategy. If it finds enough space to satisfy the request, it passes the location within the aperture for the driver to perform DMA operations. On a DMA write, SWIOTLB performs the copy ("bounce") once the driver unmaps the IO address. On a DMA read or bidirectional DMA, the copy occurs during the mapping of the memory region. Synchronization of the bounce buffer and the memory region can be forced at any time through the various dma_sync_xxx function calls.
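Conceptually, the mapping decision reduces to the sketch below. It is illustrative pseudo-code, not the actual swiotlb implementation; map_to_aperture() stands in for the next-fit slot search described above.

#include <linux/dma-mapping.h>
#include <linux/string.h>
#include <asm/io.h>

extern void *map_to_aperture(size_t len);   /* assumed: next-fit slot search */

/* Illustrative sketch of the SWIOTLB mapping decision; not the real code. */
static dma_addr_t swiotlb_style_map(struct device *dev, void *buf,
                                    size_t len, enum dma_data_direction dir)
{
        unsigned long pa = virt_to_phys(buf);
        void *bounce;

        /* Directly reachable by the adapter: hand back the physical address. */
        if (pa + len - 1 <= *dev->dma_mask)
                return (dma_addr_t)pa;

        /* Otherwise bounce: grab contiguous free slots in the aperture ... */
        bounce = map_to_aperture(len);

        /* ... and for DMA reads (device reads memory), copy the data into
         * the bounce buffer now; for DMA writes it is copied back at unmap. */
        if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
                memcpy(bounce, buf, len);

        return (dma_addr_t)virt_to_phys(bounce);
}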

SWIOTLB is wasteful in CPU operations and memory, but is the only way some adapters can access all memory on systems without an IOMMU. Linux always uses SWIOTLB on IA64 machines, which have no hardware IOMMU. On x86-64, Linux will only use SWIOTLB when the machine has more than 4GB of memory and no hardware IOMMU (or when forced through the iommu=force boot command line argument).

The only IOMMU that is specific to x86-64 hardware is AMD's GART. GART's implementation works in the following way: the BIOS (or kernel) sets aside a chunk of contiguous low memory (the aperture), which is off-limits to the OS. There is a single aperture for all of the devices in the system, although not every device needs to make use of it. GART uses addresses in this aperture as the IO addresses of DMAs that need to be remapped to system memory higher than the 4GB boundary. The GART Linux code keeps a list of the used buffers in the aperture via a bitmap. When a driver wishes to DMA to a buffer, the code verifies that the system memory address of the buffer's memory falls within the device's DMA mask. If it does not, then the GART code will search the aperture bitmap for an opening large enough to satisfy the number of pages spanned by the DMA mapping request. If it finds the required number of contiguous pages, it programs the appropriate remapping (from the aperture to the original buffer) in the IOMMU and returns the DMA address within the aperture to the driver.

We briefly described seven different IOMMU designs. AMD's GART and IBM's DART provide translation but not isolation. Conversely, AMD's Device Exclusion Vector provides isolation but not translation. IBM's Calgary and Cell are the only two architectures available today which provide both translation and isolation. However, AMD's and Intel's forthcoming IOMMUs will soon be providing these capabilities on most x86 machines.

3 Xen IOMMU support

Xen [5] [6] is a virtual machine monitor for x86, x86-64, IA64, and PowerPC that supports execution of multiple guest operating systems on the same physical machine with high performance and resource isolation. Operating systems running under Xen are either para-virtualized (their source code is modified in order to run under a hypervisor) or fully virtualized (their source code was designed and written to run on bare metal and has not been modified to run under a hypervisor). Xen makes a distinction between "physical" (interchangeably referred to as "pseudo-physical") frames and machine frames. An operating system running under Xen runs in a contiguous "physical" address space, spanning from physical address zero to the end of guest "physical" memory. Each guest "physical" frame is mapped to a host "machine" frame. Naturally, the physical frame number and the machine frame number will be different most of the time.

Xen has different uses for an IOMMU than traditional Linux. Xen virtual machines may straddle or completely reside in system memory above the 4GB boundary. Additionally, Xen virtual machines run with a physical address space that is not identity mapped to the machine address space. Therefore, Xen would like to utilize the IOMMU so that a virtual machine with direct device access need not be aware of the physical-to-machine translation, by presenting an IO address space that is equivalent to the physical address space. Additionally, Xen would like virtual machines with hardware access to be isolated from other virtual machines.

In theory, any IOMMU driver used by Linux on bare metal could also be used by Linux under Xen after being suitably adapted. The changes required depend on the specific IOMMU, but in general the modified IOMMU driver would need to map from PFNs to MFNs and allocate a machine-contiguous aperture rather than a pseudo-physically contiguous aperture. In practice, as of Xen's 3.0.0 release, only a modified version of SWIOTLB is supported.

Xen’s controlling domain (dom0) always uses amodified version of SWIOTLB. Xen’s SWIOTLB

serves two purposes. First, since Xen domainsmay reside in system memory completelyabove the 4GB mark, SWIOTLB provides amachine-contiguous aperture below 4GB. Sec-ond, since a domain’s pseudo-physical memorymay not be machine contiguous, the apertureprovides a large machine contiguous area forbounce buffers. When a stock Linux driver run-ning under Xen makes a DMA API call, the call

always goes through dom0’s SWIOTLB, whichmakes sure that the returned DMA address isbelow 4GB if necessary and is machine con-tiguous. Naturally, going through SWIOTLB onevery DMA API call is wasteful in CPU cyclesand memory and has a non-negligible perfor-mance cost. GART or Calgary (or any othersuitably capable hardware IOMMU) could beused to do in hardware what SWIOTLB does insoftware, once the necessary support is put inplace.

One of the main selling points of virtualization is machine consolidation. However, some systems would like to access hardware directly in order to achieve maximal performance. For example, one might want to put a database virtual machine and a web server virtual machine on the same physical machine. The database needs fast disk access and the web server needs fast network access. If a device error or system security compromise occurs in one of the virtual machines, the other is immediately vulnerable. Because of this need for security, there is a need for software or hardware device isolation.

Xen supports the ability to allocate different physical devices to different virtual machines (multiple "driver domains" [10]). However, due to the architectural limitations of most PC hardware, notably the lack of an isolation-capable IOMMU, this cannot be done securely. In effect, any domain that has direct hardware access is considered "trusted." For some scenarios, this can be tolerated. For others (e.g., a hosting service that wishes to run multiple customers' virtual machines on the same physical machine), this is completely unacceptable.

Xen’s grant tables are a software solution to thelack of suitable hardware for isolation. Granttables provide a method to share and transferpages of data between domains. They give (or“grant”) other domains access to pages in thesystem memory allocated to the local domain.These pages can be read, written, or exchanged

Page 77: Proceedings of the Linux Symposium Volume One · 2006-07-19 · Eric W. Biederman Fully Automated Testing of the Linux Kernel 113 ... Kristen Carlson Accardi Open Source Technology

2006 Linux Symposium, Volume One • 77

(with the proper permission) for the purpose ofproviding a fast and secure method for domainsto receive indirect access to hardware.

How does data get from the hardware to the local domain that wishes to make use of it, when only the driver domain can access the hardware directly? One alternative would be for the driver domain to always DMA into its own memory, and then pass the data to the local domain. Grant tables provide a more efficient alternative by letting driver domains DMA directly into pages in the local domain's memory. However, it is only possible to DMA into pages specified within the grant table. Of course, this is only significant for non-privileged domains (as privileged domains could always access the memory of non-privileged domains). Grant tables have two methods for allowing access to remote pages in system memory: shared pages and page flipping.

For shared pages, a driver in the local domain's kernel will advertise a page to be shared via a hypervisor function call ("hypercall" or "hcall"). The hcall notifies the hypervisor that other domains are allowed to access this page. The local domain then passes a grant table reference ID to the remote domain it is "granting" access to. Once the remote domain is finished, the local domain removes the grant. Shared pages are used by block devices and any other device that receives data synchronously.
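For illustration, the XenLinux tree of this era exposed grant-table helpers along the following lines to drivers such as blkfront; exact header locations, names, and signatures vary between versions, so treat this as a sketch rather than a definitive interface.

/* Illustrative sketch of sharing a page with a driver (backend) domain via
 * grant tables; helper names follow the XenLinux tree of this era but may
 * differ between versions. */
#include <linux/mm.h>
#include <xen/gnttab.h>

static int share_page_with_backend(domid_t backend, struct page *page)
{
        /* pfn_to_mfn() is the XenLinux pseudo-physical to machine lookup. */
        unsigned long mfn = pfn_to_mfn(page_to_pfn(page));
        int ref;

        /* Advertise the frame to the backend domain (read/write access). */
        ref = gnttab_grant_foreign_access(backend, mfn, 0 /* not read-only */);
        if (ref < 0)
                return ref;

        /* ... pass 'ref' to the backend through the shared I/O ring and
         *     wait for the request to complete ... */

        /* Revoke the grant once the backend has finished with the page. */
        gnttab_end_foreign_access(ref, 0 /* readonly */, 0UL /* keep page */);
        return 0;
}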

Network devices, as well as any other device that receives data asynchronously, use a method known as page flipping. When page flipping, a driver in the local domain's kernel will advertise a page to be transferred. This call notifies the hypervisor that other domains can receive this page. The local domain then transfers the page to the remote domain and takes a free page in exchange (via a producer/consumer ring).

Incoming network packets need to be inspected before they can be transferred, so that their intended destination can be deduced. Since block devices already know which domain requested data to be read, there is no need to inspect the data prior to sending it to its intended domain. Newer networking technologies (such as RDMA NICs and Infiniband) know, when a packet is received from the wire, for which domain it is destined and will be able to DMA it there directly.

Grant tables, like SWIOTLB, are a software implementation of certain IOMMU functionality. Much like SWIOTLB provides the translation functionality of an IOMMU, grant tables provide the isolation and protection functionality. Together they provide (in software) a fully functional IOMMU (i.e., one that provides both translation and isolation). Hardware acceleration of grant tables and SWIOTLB is possible, provided a suitable hardware IOMMU exists on the platform, and is likely to be implemented in the future.

4 Virtualization: IOMMU design requirements and open issues

Adding IOMMU support for virtualization raises interesting design requirements and issues. Regardless of the actual functionality of an IOMMU, there are a few basic design requirements that it must support to be useful in a virtualized environment. Those basic design requirements are: memory isolation, fault isolation, and virtualized operating system support.

To achieve memory isolation, an operating system or hypervisor should not allow one virtual machine with direct hardware access to cause a device to DMA into an area of physical memory that the virtual machine does not own. Without this capability, it would be possible for any virtual machine to have access to the memory of another virtual machine, thus precluding running an untrusted OS on any virtual machine and thwarting basic virtualization security requirements.

To achieve fault isolation, an operating system or hypervisor should not allow a virtual machine that causes a bad DMA (which leads to a translation error in the IOMMU) to affect other virtual machines. It is acceptable to kill the errant virtual machine or take its devices off-line, but it is not acceptable to kill other virtual machines (or the entire physical machine) or take devices that the errant virtual machine does not own offline.

To achieve virtualized operating system support, an operating system or hypervisor needs to support para-virtualized operating systems, fully virtualized operating systems that are not IOMMU-aware, and fully virtualized IOMMU-aware operating systems. For para-virtualized OS's, the IOMMU support should mesh in seamlessly and take advantage of the existing OS IOMMU support (e.g., Linux's DMA API). For fully virtualized but not IOMMU-aware OS's, it should be possible for control tools to construct IOMMU translation tables that mirror the OS's pseudo-physical to machine mappings. For fully virtualized IOMMU-aware operating systems, it should be possible to trap, validate, and establish IOMMU mappings such that the semantics the operating system expects with regard to the IOMMU are maintained.

There are several outstanding issues and open questions that need to be answered for IOMMU support. The first and most critical question is: "who owns the IOMMU?" Satisfying the isolation requirement requires that the IOMMU be owned by a trusted entity that will validate every map and unmap operation. In Xen, the only trusted entities are the hypervisor and privileged domains (i.e., the hypervisor and dom0 in standard configurations), so the IOMMU must be owned by either the hypervisor or a trusted domain. Mapping and unmapping entries into the IOMMU is a frequent, fast-path operation. In order to impose as little overhead as possible, it will need to be done in the hypervisor. At the same time, there are compelling reasons to move all hardware-related operations outside of the hypervisor. The main reason is to keep the hypervisor itself small and ignorant of any hardware details except those absolutely essential, to keep it maintainable and verifiable. Since dom0 already has all of the required IOMMU code for running on bare metal, there is little point in duplicating that code in the hypervisor.

Even if mapping and unmapping of IOMMU entries is done in the hypervisor, should dom0 or the hypervisor initialize the IOMMU and perform other control operations? There are arguments both ways. The argument in favor of the hypervisor is that the hypervisor already does some IOMMU operations, and it might as well do the rest of them, especially if no clear-cut separation is possible. The argument in favor of dom0 is that it can utilize all of the bare metal code that it already contains.

Let us examine the simple case where a physical machine has two devices and two domains with direct hardware access. Each device will be dedicated to a separate domain. From the point of view of the IOMMU, each device has a different IO address space, referred to simply as an "IO space." An IO space is a virtual address space that has a distinct translation table. When dedicating a device to a domain, we either establish the IO space a priori or let the domain establish mappings in the IO space that will point to its machine pages as it needs them. IO spaces are created when a device is granted to a domain, and are destroyed when the device is brought offline (or when the domain is destroyed). A trusted entity grants access to devices, and therefore necessarily creates and grants access to their IO spaces. The same trusted entity can revoke access to devices, and therefore revoke access to and destroy their IO spaces.

There are multiple considerations that need to be taken into account when designing an IOMMU interface. First, we should differentiate between the administrative interfaces that will be used by control and management tools, and the "data path" interfaces which will be used by unprivileged domains. Creating and destroying an IO space is an administrative interface; mapping a machine page is a data path operation.

Different hardware IOMMUs have different characteristics, such as different degrees of device isolation. They might support no isolation (a single global IO address space for all devices in the system), isolation between different busses (an IO address space per PCI bus), or isolation at the PCI Bus/Device/Function (BDF) level (i.e., a separate IO address space for each logical PCI device function). The IO space creation interface should expose the level of isolation that the underlying hardware is capable of, and should support any of the above isolation schemes. Exposing a finer-grained isolation than the hardware is capable of could lead software to a false sense of security, and exposing a coarser-grained isolation would not fully utilize the capabilities of the hardware.

Another related question is whether several devices should be able to share the same IO address space, even if the hardware is capable of isolating between them. Let us consider a fully virtualized operating system that is not IOMMU aware and has several devices dedicated to it. Since the OS is not capable of utilizing isolation between these devices and each IO space consumes a small, yet non-negligible amount of memory for its translation tables, there is no point in giving each device a separate IO address space. For cases like this, it would be beneficial to share the same IO address space among all devices dedicated to a given operating system.

We have established that it may be beneficial for multiple devices to share the same IO address space. Is it likewise beneficial for multiple consumers (domains) to share the same IO address space? To answer this question, let us consider a smart IO adapter such as an Infiniband NIC. An IB NIC handles its own translation needs and supports many more concurrent consumers than PCI allows. PCI dedicates 3 bits for different "functions" on the same device (8 functions in total) whereas IB supports 24 bits of different consumers (millions of consumers). To support such "virtualization friendly" adapters, one could run with translation disabled in the IOMMU, or create a single IO space and let multiple consumers (domains) access it.

Since some hardware is only capable of having a shared IO space between multiple non-cooperating devices, it is beneficial to be able to create several logical IO spaces, each of which is a window into a single large "physical IO space." Each device gets its own window into the shared address space. This model only provides "statistical isolation." A driver programming a device may guess another device's window and where it has entries mapped, and if it guesses correctly, it could DMA there. However, the probability of its guessing correctly can be made fairly small. This mode of operation is not recommended, but if it's the only mode the hardware supports...

Compared to the creation of an IO space, mapping and unmapping entries in it is straightforward. Establishing a mapping requires the following parameters:

• A consumer needs to specify which IO space it wants to establish a mapping in. Alternatives for identifying IO spaces are either an opaque, per-domain "IO space handle" or the BDF that this IO space translates for.

• The IO address in the IO address space to establish a mapping at. The main advantage of letting the domain pick the IO address is that it has control over how IOMMU mappings are allocated, enabling it to optimize their allocation based on its specific usage scenarios. However, in the case of shared IO spaces, the IO address the device requests may not be available or may need to be modified. A reasonable compromise is to make the IO address a "hint" which the hypervisor is free to accept or reject.

• The access permissions for the given mapping in the IO address space. At a minimum, any of none, read-only, write-only, or read-write should be supported.

• The size of the mapping. It may be specified in bytes for convenience and to easily support different page sizes in the IOMMU, but ultimately the exact size of the mapping will depend on the specific page sizes the IOMMU supports.

To reduce the number of required hypercalls, the interface should support multiple mappings in a single hypervisor call (i.e., a "scatter gather list" of mappings).

Tearing down a mapping requires the following parameters:

• The IO space this mapping is in.

• The mapping, as specified by an IO address in the IO space.

• The size of the mapping.

Naturally, the hypervisor needs to validate that the passed parameters are correct. For example, it needs to validate that the mapping actually belongs to the domain requesting to unmap it, if the IO space is shared.

Last but not least, there are a number of miscellaneous issues that should be taken into account when designing and implementing IOMMU support. Since our implementation is targeting the open source Xen hypervisor, some considerations may be specific to a Xen or Linux implementation.

First and foremost, Linux and Xen already include a number of mechanisms that either emulate or complement hardware IOMMU functionality. These include SWIOTLB, grant tables, and the PCI frontend/backend drivers. Any IOMMU implementation should "play nicely" and integrate with these existing mechanisms, both on the design level (i.e., provide hardware acceleration for grant tables) and on the implementation level (i.e., do not duplicate common code).

One specific issue that must be addressed stems from Xen's use of page flipping. Pages that have been mapped into the IOMMU must be pinned as long as they are resident in the IOMMU's table. Additionally, any pages that are involved in IO may not be relinquished by a domain (e.g., by use of the balloon driver).

Devices and domains may be added or removed at arbitrary points in time. The IOMMU support should handle "garbage collection" of IO spaces and pages mapped for IO when the domain or domains that own them die or the device they map is removed. Likewise, hot-plugging of new devices should also be handled.


5 Calgary IOMMU Design and Implementation

We have designed and implemented IOMMU support for the Calgary IOMMU found in high-end IBM System X servers. We developed it first on bare metal Linux, and then used the bare metal implementation as a stepping-stone to a "virtualization enabled" proof-of-concept implementation in Xen. This section describes both implementations. It should be noted that Calgary is an isolation-capable IOMMU, and thus provides isolation between devices residing on different PCI Host Bridges. This capability is directly beneficial in Linux even without a hypervisor, for its RAS capabilities. For example, it could be used to isolate a device in its own IO space while developing a driver for it, thus preventing DMA-related errors from randomly corrupting memory or taking down the machine.

5.1 x86-64 Linux Calgary support

The Linux implementation is included, at the time of this writing, in 2.6.16-mm1. It is composed of several parts: initialization and detection code, IOMMU-specific code to map and unmap entries, and a DMA API implementation.

The bring-up code is done in two stages: detection and initialization. This is due to the way the x86-64 arch-specific code detects and initializes IOMMUs. In the first stage, we detect whether the machine has the Calgary chipset. If it does, we mark that we found a Calgary IOMMU, and allocate large contiguous areas of memory for each PCI Host Bridge's translation table. Each translation table consists of a number of entries that total the addressable range given to the device (in page size increments). This stage uses the bootmem allocator and happens before the PCI subsystem is initialized. In the second stage, we map Calgary's internal control registers and enable translation on each PHB.

The IOMMU requires hardware-specific code to map and unmap DMA entries. This part of the code implements a simple TCE allocator to "carve up" each translation table for different callers, and includes code to create TCEs (Translation Control Entries) in the format that the IOMMU understands and to write them into the translation table.

Linux has a DMA API interface to abstract the details of exactly how a driver gets a DMA'able address. We implemented the DMA API for Calgary, which allows generic DMA mapping calls to be translated to Calgary-specific DMA calls. This existing infrastructure enabled the Calgary Linux code to be hooked into Linux more easily, without many non-Calgary-specific changes.

The Calgary code keeps a list of the used pages in the translation table via a bitmap. When a driver makes a DMA API call to allocate a DMA address, the code searches the bitmap for an opening large enough to satisfy the DMA allocation request. If it finds enough space to satisfy the request, it updates the TCEs in the translation table in main memory to let the DMA through. The offset of those TCEs within the translation table is then returned to the device driver as the DMA address to use.
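A simplified sketch of that allocation path follows. It is illustrative only (the actual code lives in the 2.6.16-mm1 tree and differs in detail); the iommu_table layout shown, the tce_build() helper, and the omission of the contiguity check are abbreviations and assumptions made for clarity.

#include <linux/bitops.h>
#include <linux/dma-mapping.h>
#include <asm/io.h>

struct iommu_table {            /* illustrative subset, not the real layout  */
        unsigned long *it_map;  /* in-use bitmap, one bit per TCE            */
        unsigned long it_size;  /* number of entries in the table            */
        unsigned long it_hint;  /* where to start the next search            */
};

/* Assumed stand-in for the hardware-format TCE writer. */
extern void tce_build(struct iommu_table *tbl, unsigned long index,
                      unsigned long phys, int direction);

/* Simplified, illustrative sketch of the Calgary allocation path. */
static dma_addr_t alloc_tce_range(struct iommu_table *tbl, void *vaddr,
                                  unsigned int npages, int direction)
{
        unsigned long entry, i;

        /* Find a run of free entries in the in-use bitmap (the check that
         * all npages bits are actually clear is elided for brevity). */
        entry = find_next_zero_bit(tbl->it_map, tbl->it_size, tbl->it_hint);

        for (i = 0; i < npages; i++) {
                __set_bit(entry + i, tbl->it_map);
                /* Write a TCE mapping IO page (entry + i) onto the physical
                 * page backing vaddr + i*PAGE_SIZE. */
                tce_build(tbl, entry + i,
                          virt_to_phys(vaddr) + i * PAGE_SIZE, direction);
        }

        /* The DMA address returned to the driver is simply the byte offset
         * of the first TCE within the table's bus address space. */
        return (dma_addr_t)entry << PAGE_SHIFT;
}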

5.2 Xen Calgary support

Prior to this work, Xen did not have any support for isolation-capable IOMMUs. As explained in previous sections, Xen does have software mechanisms (such as SWIOTLB and grant tables) that emulate IOMMU-related functionality, but it does not have any hardware IOMMU support, and specifically does not have any isolation-capable hardware IOMMU support.

We added proof-of-concept IOMMU support to Xen. The IOMMU support is composed of a thin "general IOMMU" layer and hardware IOMMU-specific implementations. At the moment, the only implementation is for the Calgary chipset, based on the bare-metal Linux Calgary support. As upcoming IOMMUs become available, we expect more hardware IOMMU implementations to show up.

It should be noted that the current implementation is proof-of-concept and is subject to change as IOMMU support evolves. In theory it targets numerous IOMMUs, each with distinct capabilities, but in practice it has only been implemented for the single isolation-capable IOMMU that is currently available. We anticipate that by the time you read this, the interface will have changed to better accommodate other IOMMUs.

The IOMMU layer receives the IOMMU-related hypercalls (both the "management" hcalls from dom0 and the IOMMU map/unmap hcalls from other domains) and forwards them to the IOMMU-specific layer. The following hcalls are defined:

• iommu_create_io_space – this call is used by the management domain (dom0) to create a new IO space that is attached to specific PCI BDF values. If the IOMMU supports only bus-level isolation, the device and function values are ignored.

• iommu_destroy_io_space – this call is used to destroy an IO space, as identified by a BDF value.

Once an IO space exists, a domain can ask to map and unmap translation entries in its IOMMU using the following calls:

• u64 do_iommu_map(u64 ioaddr, u64 mfn, u32 access, u32 bdf, u32 size);

• int do_iommu_unmap(u64 ioaddr, u32 bdf, u32 size);

When mapping an entry, the domain passes the following parameters:

• ioaddr – The address in the IO space at which the domain would like to establish a mapping. This is a hint; the hypervisor is free to use it or to ignore it and return a different IO address.

• mfn – The machine frame number that this entry should map. In the current Xen code base, a domain running on x86-64 and doing DMA is aware of the physical/machine translation, and thus there is no problem with passing the MFN. In future implementations this API will probably change to pass the domain's PFN instead.

• access – This specifies the read/write permissions of the entry (read here refers to what the device can do: whether it can only read from memory, or can write to it as well).

• bdf – The PCI Bus/Device/Function of the IO space that we want to map this entry in. This parameter might be changed in later revisions to an opaque IO-space identifier.

• size – How large is this entry? The current implementation only supports a single IOMMU page size of 4KB, but we anticipate that future IOMMUs will support large page sizes.

The return value of this function is the IO address at which the entry has been mapped.


When unmapping an entry, the domain passes the BDF, the IO address that was returned, and the size of the entry to be unmapped. The hypervisor validates the parameters, and if they validate correctly, unmaps the entry.
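Putting the two calls together, a driver domain might use this proof-of-concept interface roughly as follows. The prototypes match the signatures listed above; the direct calls (a real guest would go through a hypercall wrapper) and the access-permission encoding are assumptions made purely for illustration.

/* Hypothetical usage of the proof-of-concept interface from a driver
 * domain; the hypercall plumbing and the permission encoding (3 == read
 * + write) are assumptions for illustration only. */
#include <linux/types.h>

extern u64 do_iommu_map(u64 ioaddr, u64 mfn, u32 access, u32 bdf, u32 size);
extern int do_iommu_unmap(u64 ioaddr, u32 bdf, u32 size);

static int dma_one_page(u32 bdf, unsigned long mfn)
{
        u64 ioaddr;

        /* Map one 4KB page; pass 0 as the IO-address hint so the hypervisor
         * chooses where in the IO space the entry lands. */
        ioaddr = do_iommu_map(0, mfn, 3 /* assumed read+write encoding */,
                              bdf, 4096);

        /* ... program 'ioaddr' into the device as the DMA address and
         *     perform the transfer ... */

        /* Tear the mapping down once the DMA has completed. */
        return do_iommu_unmap(ioaddr, bdf, 4096);
}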

An isolation-capable IOMMU is likely to either have a separate translation table for different devices, or have a single, shared translation table where each entry in the table is valid for specific BDF values. Our scheme supports both usage models. The generic IOMMU layer finds the right translation table to use based on the BDF, and then calls the hardware IOMMU-specific layer to map or unmap an entry in it. In the case of one domain owning an IO space, the domain can use its own allocator and the hypervisor will always use the IO addresses the domain wishes to use. In the case of a shared IO space, the hypervisor will be the one controlling IO address allocation. In this case IO address allocation could be done in cooperation with the domains, for example by adding a per-domain offset to the IO addresses the domains ask for, in effect giving each domain its own window into the IO space.

6 Roadmap and Future Work

Our current implementation utilizes the IOMMU to run dom0 with isolation enabled. Since dom0 is privileged and may access all of memory anyway, this is useful mainly as a proof of concept for running a domain with IOMMU isolation enabled. Our next immediate step is to run a different, non-privileged and non-trusted "direct hardware access domain" with direct access to a device and with isolation enabled in the IOMMU.

Once we’ve done that, we plan to continuein several directions simultaneously. We in-tend to integrate the Calgary IOMMU support

with the existing software mechanisms such asSWIOTLB and grant tables, both on the inter-face level and the implementation (e.g., sharingfor code related to pinning of pages involvedin ongoing DMAs). For configuration, we arelooking to integrate with the PCI frontend andbackend drivers, and their configuration mech-anisms.

We are planning to add support for more IOMMUs as hardware becomes available. In particular, we look forward to supporting Intel and AMD's upcoming isolation-capable IOMMUs.

Longer term, we see many exciting possibilities. For example, we would like to investigate support for other types of translation schemes used by some devices (e.g., those used by Infiniband adapters).

We have started looking at tuning the IOMMU for different performance/reliability/security scenarios, but do not have any results yet. Most current-day machines and operating systems run without any isolation, which in theory should give the best performance (least overhead on the DMA path). However, IOMMUs make it possible to perform scatter-gather coalescing and bounce buffer avoidance, which could lead to increased overall throughput.

When enabling isolation in the IOMMU, one could enable it selectively for "untrusted" devices, or for all devices in the system. There are many trade-offs that can be made when enabling isolation: one example is static versus dynamic mappings, that is, mapping the entire OS's memory into the IOMMU up front when it is created (no need to make map and unmap hypercalls) versus only mapping those pages that are involved in DMA activity. When using dynamic mappings, what is the right mapping allocation strategy? Since every IOMMU implements a cache of IO mappings (an IOTLB), we anticipate that the IO mapping allocation strategy will have a direct impact on overall system performance.

7 Conclusion: Key Research and Development Challenges

We implemented IOMMU support on x86-64 for Linux and have proof-of-concept IOMMU support running under Xen. We have shown that it is possible to run virtualized and non-virtualized operating systems on x86-64 with IOMMU isolation. Other than the usual woes associated with bringing up a piece of hardware for the first time, there are also interesting research and development challenges for IOMMU support.

One question is simply how we can build better, more efficient IOMMUs that are easier to use in a virtualized environment. The upcoming IOMMUs from IBM, Intel, and AMD have unique capabilities that have not been explored so far. How can we best utilize them, and what additional capabilities should future IOMMUs have?

Another open question is whether we can use the indirection IOMMUs provide for DMA activity to migrate devices that are being accessed directly by a domain, without going through an indirect software layer such as the backend driver. Live virtual machine migration ("live" refers to migrating a domain while it continues to run) is one of Xen's strong points [9], but at the moment it is mutually incompatible with direct device access. Can IOMMUs mitigate this limitation?

Another set of open questions relates to the ongoing convergence between IOMMUs and CPU MMUs. What is the right allocation strategy for IO mappings? How can large pages be efficiently supported in the IOMMU? Does the fact that some IOMMUs share the CPU's page table format (e.g., AMD's upcoming IOMMU) change any fundamental assumptions?

What is the right way to support fully virtualized operating systems, both those that are IOMMU-aware, and those that are not?

We continue to develop Linux and Xen's IOMMU support and investigate these questions. Hopefully, the answers will be forthcoming by the time you read this.

8 Legal

Any statements about support or other commitments may be changed or canceled at any time without notice. All statements regarding future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Information is provided "AS IS" without warranty of any kind. The information could include technical inaccuracies or typographical errors. Improvements and/or changes in the product(s) and/or the program(s) described in this publication may be made at any time without notice.

References

[1] AMD I/O Virtualization Technology (IOMMU) Specification, 2006. http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/34434.pdf.

[2] Intel Virtualization Technology for Directed I/O Architecture Specification, 2006. ftp://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf.

[3] IA-64 Linux Kernel: Design and Implementation, by David Mosberger and Stephane Eranian, Prentice Hall PTR, 2002. ISBN 0130610143.

[4] Software Optimization Guide for the AMD64 Processors, 2005. http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF.

[5] Xen and the Art of Virtualization, by B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, I. Pratt, A. Warfield, P. Barham, and R. Neugebauer, in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003.

[6] Xen 3.0 and the Art of Virtualization, by I. Pratt, K. Fraser, S. Hand, C. Limpach, A. Warfield, D. Magenheimer, J. Nakajima, and A. Mallick, in Proceedings of the 2005 Ottawa Linux Symposium (OLS), 2005.

[7] Documentation/DMA-API.txt.

[8] Documentation/DMA-mapping.txt.

[9] Live Migration of Virtual Machines, by C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, in Proceedings of the 2nd Symposium on Networked Systems Design and Implementation, 2005.

[10] Safe Hardware Access with the Xen Virtual Machine Monitor, by K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, and M. Williamson, in Proceedings of the OASIS ASPLOS 2004 workshop, 2004.

[11] PCI Special Interest Group. http://www.pcisig.com/home.


Towards a Highly Adaptable Filesystem Framework for Linux

Suparna Bhattacharya
Linux Technology Center
IBM Software Lab, [email protected]

Dilma Da Silva
IBM T.J. Watson Research Center, USA

[email protected]

Abstract

Linux® is growing richer in independent general purpose file systems with their own unique advantages; however, fragmentation and divergence can be confusing for users. Individual file systems are also adding an expanding number of options (e.g. ext3) and variations (e.g. reiser4 plugins) to satisfy new requirements. Both of these trends indicate a need for improved flexibility in file system design to benefit from the best of all worlds. We explore ways to address this need, using as our basis KFS (K42 file system), a research file system designed for fine-grained flexibility.

KFS aims to support a wide variety of file structures and policies, allowing the representation of a file or directory to change on the fly to adapt to characteristics not well known a priori, e.g. list-based to tree-based, or small to large directory layouts. It is not intended as yet another file system for Linux, but as a platform to study trade-offs associated with adaptability and evaluate new algorithms before incorporation into an established file system. We hope that ideas and lessons learnt from the experience with KFS will be beneficial for Linux file systems to evolve to be more adaptable and for the VFS to enable better building-block-based code sharing across file systems.

1 Introduction

The Linux 2.6 kernel includes over 40 filesystems, about 0.6 million lines of code in total. The large number of Linux filesystems as well as their sheer diversity is a testament to the power and flexibility of the Linux VFS. This has enabled Linux to support a wide range of existing file system formats and protocols. Interestingly, this has also resulted in a growing number of new file systems that have been developed for Linux. The 2.5 development series saw the inclusion of four (ext3, reiserFS, JFS, and XFS) general purpose journalling filesystems, while in recent times multiple cluster filesystems have been submitted to the mainline kernel, providing users a slew of alternatives to choose from. Allowing multiple independent general purpose filesystems to co-exist has also had the positive effect of enabling each to innovate in parallel within its own space, making different trade-offs and evolving across multiple production releases. New file systems have a chance to prove themselves out in the real world, letting time pick the best one rather than standardizing on one single default filesystem [6].

At the same time, the multiplicity of filesystems that essentially address very similar needs also sometimes leads to unwarranted fragmentation and divergence from the perspective of users, who may find themselves faced with complex administrative choices with associated lock-in to a file system format chosen at a certain point in time. This is probably one reason why most users tend to simply adopt the default file system provided by their Linux distribution (i.e. ext3 or reiserfs), despite the availability of possibly more advanced filesystems like XFS and JFS, which might have been more suitable for their primary workloads. Each individual filesystem has been evolving to expand its capabilities by adding support for an increasing number of options and variations, to satisfy new requirements, and make continuous improvements while also maintaining compatibility.

We believe that these trends point to a need for a framework for a different kind of flexibility in file system design for the Linux kernel, one that allows continuous adaptability to evolving requirements, but reduces duplicate effort while providing users the most appropriate layout and capabilities for the workload and file access patterns they are running.

This work has two main goals: (1) to investigate how far a design centered on supporting dynamic customization of services can help to address Linux's needs for flexible file systems and (2) to explore performance benefits and trade-offs involved in a design for flexibility.

The basis of our exploration is the HFS/KFS research effort started a decade ago. In the Hurricane File System (HFS), the support for dynamic alternatives for file layout was motivated by the scalability requirements of the workloads targeted by the Hurricane operating system project. The K42 File System (KFS) built on the flexibility basis of HFS, expanding it by incorporating the architectural principles of the K42 Research Operating System project. While KFS was designed as a separate filesystem of its own, in this work we explore what it would take to apply similar techniques to existing filesystems, where such fine-grained flexibility was not an original design consideration. We also update KFS to work with Linux 2.6, aiming at carrying out experimental work that can provide concrete information about the impact of KFS's approach on addressing workload-specific performance requirements.

The rest of the paper is organized as follows: Section 2 illustrates how adaptability is currently handled in Linux filesystems; Section 3 presents the basic ideas from KFS's design; Section 4 discusses KFS's potential as an adaptable filesystem framework for Linux and Section 5 describes required future work to enable this vision. Section 6 concludes.

2 Adaptability in Linux filesystems

2.1 Flexibility provided by the VFS layer

The Linux Virtual File System (VFS) [24] is quite powerful in terms of the flexibility it provides for implementing filesystems, whether disk-based filesystems, network filesystems or special purpose pseudo filesystems. This flexibility is achieved by abstracting key filesystem objects, i.e. the super block, inode and file (for both files and directories) including the address space mapping, the directory entry cache, and methods associated with each of these objects. It is possible to allow different inodes in the same filesystem to have different operation vectors. This is used, for example, by ext3 to support different journalling modes elegantly, by some types of stackable/filter filesystems to provide additional functionality to an existing filesystem, and even for specialized access modes determined by open mode, e.g. execute-in-place support. Additionally, the inclusion of extended attributes support in the VFS methods allows customizable persistent per-inode state to be maintained, which can be used to specify variations in functional behavior at an individual file level.
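To make the per-inode operation vectors concrete, a filesystem can install a different set of address-space operations on each inode as it is set up; the sketch below is loosely modelled on how ext3 picks an aops vector per journalling mode, but all myfs_* and MYFS_* names are invented for illustration rather than the actual ext3 symbols.

    #include <linux/fs.h>

    /* Invented data-consistency policies for the sketch. */
    #define MYFS_JOURNAL_DATA       0
    #define MYFS_ORDERED_DATA       1
    #define MYFS_WRITEBACK_DATA     2

    /* One operations vector per policy; their writepage/commit_write
     * methods would differ in how they interact with the journal. */
    static struct address_space_operations myfs_journalled_aops;
    static struct address_space_operations myfs_ordered_aops;
    static struct address_space_operations myfs_writeback_aops;

    /* Called when an inode is read or created: pick the vector matching
     * the policy in force for this particular inode. */
    static void myfs_set_aops(struct inode *inode, int journal_mode)
    {
            switch (journal_mode) {
            case MYFS_ORDERED_DATA:
                    inode->i_mapping->a_ops = &myfs_ordered_aops;
                    break;
            case MYFS_WRITEBACK_DATA:
                    inode->i_mapping->a_ops = &myfs_writeback_aops;
                    break;
            default:
                    inode->i_mapping->a_ops = &myfs_journalled_aops;
            }
    }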

The second aspect of flexibility ensues from common code provided for implementation of various file operations, i.e. generic routines which can be invoked by filesystems or wrapped with additional filesystem-specific code. The bulk of interfacing between the VFS and the Virtual Memory Manager (VMM), including read-ahead logic, page-caching and access to disk blocks for block-device-based filesystems, happens in this manner. Some of these helper routines (e.g. mpage_writepages()) accept function pointers as arguments, e.g. for specifying a filesystem-specific get_block() routine, which also allows for potential variation of block mapping and consistency and allocation policies even within the same filesystem.
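In practice the function-pointer hook looks roughly as follows; mpage_writepages() and the get_block_t callback type are the real 2.6 helpers referred to above, while every myfs_* name is invented for the example.

    #include <linux/fs.h>
    #include <linux/buffer_head.h>
    #include <linux/mpage.h>
    #include <linux/writeback.h>

    /* The filesystem-specific block-mapping policy lives here: map a file
     * block number to an on-disk block, allocating one if 'create' is set. */
    static int myfs_get_block(struct inode *inode, sector_t iblock,
                              struct buffer_head *bh_result, int create)
    {
            /* ... consult (or update) myfs's own block-mapping metadata ... */
            return 0;
    }

    /* The generic helper does all the page and BIO handling and calls back
     * into the policy above for every block it needs mapped. */
    static int myfs_writepages(struct address_space *mapping,
                               struct writeback_control *wbc)
    {
            return mpage_writepages(mapping, wbc, myfs_get_block);
    }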

Another category of helper routines called libfs [4] is intended for simplifying the task of writing new virtual filesystems, though currently targeted mainly at stand-alone virtual filesystems developed for special-purpose interfacing between kernel and user space, as an alternative to using ioctls or new system calls.

2.2 Flexibility within individual filesystems

Over time, the need to satisfy new requirements while maintaining on-disk compatibility to the extent possible has led individual filesystems to incorporate a certain degree of variability within each filesystem-specific implementation, using mount options, fcntls/ioctls, file attributes and internal layering. In this sub-section we cover a few such examples. While we limit this discussion to a few disk-based general purpose filesystems, similar approaches apply to other filesystem types as well.

We do not presently consider file systems that are not intended to be part of the mainline Linux kernel tree.

2.2.1 Ext3 options and backward compatibility

One of the often cited strengths of the ext3 filesystem is its high emphasis on backwards and forwards compatibility and dependability, even as it continues evolving to incorporate new features and enhancements. This has been achieved through a carefully designed compatibility bitmap scheme [22], the use of mount options and tune2fs to turn on individual features per filesystem, conversion utilities to ease migration to new incompatible features (e.g. using resize2fs-type capability), and per-file flags and attributes that can be controlled through the chattr command.

Three compatibility bitmaps in the super block determine if and how an old kernel would mount a filesystem with an unknown feature bit marked in each of these bitmaps: read-write (COMPAT), read-only (RO_COMPAT), and incompatible (INCOMPAT). For safety reasons, though, the filesystem checker e2fsck takes the stringent approach of not touching a filesystem with an unknown feature bit even if it is in the COMPAT set, recommending the usage of a newer version of the utility instead. Backward compatibility with an older kernel is useful during a safe revert of an installation/upgrade or in enabling the disk to be mounted from other Linux systems for emergency recovery purposes. For these reasons interesting techniques have been used in the development of features like directory indexing [16], making interior index nodes look like deleted directory entries and clearing directory-indexing flags when updating directories in older kernels to ensure compatibility as far as possible. Similar considerations are being debated during the design of file pre-allocation support, to avoid exposing uninitialized pre-allocated blocks if mounted by an older kernel. With the inclusion of per-inode compatibility flags, the granularity of backward compatibility support can be narrowed down to a per-file level. This may be useful during integration of extent maps.
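The effect of the three bitmaps can be summarised in a small mount-time check; this is our simplification rather than the actual ext3 code, and the field, macro, and function names are illustrative only.

    #include <linux/errno.h>
    #include <linux/types.h>

    /* Illustrative stand-in for the on-disk superblock fields. */
    struct myfs_super_block {
            __u32 s_feature_compat;
            __u32 s_feature_incompat;
            __u32 s_feature_ro_compat;
    };

    /* Feature bits this kernel understands (values are hypothetical). */
    #define SUPPORTED_INCOMPAT      0x0007
    #define SUPPORTED_RO_COMPAT     0x0003

    static int myfs_check_features(struct myfs_super_block *es, int readonly)
    {
            if (es->s_feature_incompat & ~SUPPORTED_INCOMPAT)
                    return -EINVAL;  /* unknown INCOMPAT bit: refuse to mount */

            if ((es->s_feature_ro_compat & ~SUPPORTED_RO_COMPAT) && !readonly)
                    return -EROFS;   /* unknown RO_COMPAT bit: read-only only */

            /* Unknown COMPAT bits are ignored: safe to mount read-write,
             * although, as noted above, e2fsck is stricter than the kernel. */
            return 0;
    }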

The use of mount options and tune2fs makes it possible for new, relatively less established or incompatible features to be turned on optionally with explicit administrator knowledge for a few production releases before being made the default. Also, in the future, advanced feature sets may be bundled into a higher level group that signifies a generational advance of the filesystem [23]. Mount options are also used for setting consistency policies (i.e. journalling mode) on a per filesystem basis. Additionally, ext3 makes use of persistent file attributes (through the introduction of the chattr command), in combination with the ability to use different operation vectors for different inodes, to allow certain features/policies (e.g. full data journalling support for certain log files, preserving reservation beyond file close for slow-growing files) to be specified for individual files.

2.2.2 JFS and XFS

Although JFS [20] does not implement compatibility bitmaps as ext3 does, its on-disk layout is scalable, and backward compatibility has not been much of an issue. The on-disk directory structure was changed shortly after JFS was ported from OS/2® to Linux, causing a version bump in the super block. Since then, there has been no need to change the on-disk layout of JFS. The kernel will still support the older, OS/2-compatible, format.

JFS uses extent-based data structures and uses 40 bits to store block offsets and addresses. The on-disk layout supports various block sizes, although the kernel currently only supports a 4 KB block size. Without increasing the block size, JFS can support files and partitions up to 4 petabytes in length. B+ trees are used to implement both the extent maps and directories. Inodes are dynamically allocated as needed, and extents of free inodes are reclaimed. JFS can store file names in 16-bit unicode, translating from the code page specified by the iocharset mount option. The default is to do no translation.
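A quick sanity check of that limit (the arithmetic is ours): with 40-bit block numbers and 4 KB (2^12-byte) blocks, the addressable range is 2^40 × 2^12 = 2^52 bytes, i.e. 4 PiB, which matches the 4 petabyte figure above.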

JFS supports most mount options and chattr flags that ext3 does. Journaling can be temporarily disabled with the nointegrity mount flag. This is primarily intended to speed up the population of a partition, for instance, from backup media, where recovery would involve reformatting and restarting the process, with no risk of data loss.

XFS [21] also has a scalable on-disk layout, uses extent-based structures, variable block sizes, dynamic inode allocation and separate allocation groups. It supports an optional real-time allocator suitable for media streaming applications.

2.2.3 Reiser4 plugins

Much like ext3 and JFS, the reiserfs v3 filesystem included in the current Linux kernel uses mount options and persistent inode attributes (set via chattr) to provide new features and variability of policies. For example, the hash function for directories can be selected by a mount option, as can some tuning of the block allocator, while tail-merging can be disabled on either a per mount point or per inode basis. It supports the same journalling modes as ext3.

The next generation of reiserfs, the reiser4 filesystem [15] (not yet in the mainline Linux kernel), includes an internal layering architecture that is intended to allow for the development of different kinds of plugins to make extensions to the filesystem without having to upgrade/format to a new version. This approach enables the addition of new features like compression, support for alternate semantics, directory ordering, security schemes, block allocation and consistency policies, and node balancing policies for different items. The stated goal of this architecture is to enable and encourage developers to build plugins to customize the filesystem with features desired by applications. From the available material at this stage, it is not clear to us yet the extent to which plugins are intended to address fine-grain flexibility or dynamic changes of on-disk data representation beyond tail formatting and item merging in dancing trees. Also, at a first glance our (possibly mistaken) perception is that reiser4 flexibility support comes with the price of complexity: the code base is large, and the interfaces appear to be tied to reiser4 internals.

2.3 Limitations of current approaches

While there is a considerable amount of flexibility within the existing framework, both at the VFS level and within individual file systems, there are some observations and issues emerging with the evolution of multiple new filesystems and new generations of filesystems that have been in existence for a while.

• Code commonality is supported at higher levels but not for lower level data manipulation; e.g., with the inclusion of extents support for ext3 there would be over five separate B+ tree implementations across individual filesystems.

• While the combination of per-inode operation vectors and persistent attribute flags allows for flexibility at a per inode level, and alternate allocation schemes can potentially be encapsulated in the get_blocks() function pointer used for a given operation, there is no general framework to support different layouts for different files in a cohesive manner, to move from an old scheme to a new one in a modular fashion, or to supply different meta-data or data allocation policies for a group of files, because the inode representation is typically fixed upfront.

• There is no framework for filesystems to provide their building blocks for use by other filesystems. For example, even though OCFS2 [5] chose to start with a lot of code from ext3 as its base, this involved copying source code and then editing it to add all the functionality it needed for clustering support and scalability. As a result, extensions like 64-bit and extents support from OCFS2 can not be applied back as extensions to ext3 as an option. Because of its long history of simplicity and dependability, ext2/3 is often the preferred choice for basing experimentation for advanced capabilities [9], so the ability to specialize behaviour starting from a common code base is likely to be useful.

• Difficulty with making changes to the on-disk format for an existing file system results in implementers getting locked into supporting a change once made, and hence requires very careful consideration and may take years to reach real deployment, especially for incompatible features. Even compatible features on ext2/3 are incompatible with older filesystem checkers.


3 Overview of KFS

KFS builds on the research done by the Hurricane File System (HFS) [14, 13, 12]. HFS was designed for (potentially large-scale) shared-memory multiprocessors, based on the principle that, in order to maximize performance for applications with diverse requirements, a file system must support a wide variety of file structures, file system policies, and I/O interfaces. As an extreme example, HFS allows a file's structure to be optimized for concurrent random-access write-only operations by 10 threads, something no other file system can do. HFS explored its flexibility to achieve better performance and scalability. It proved that its flexibility came with little processing or I/O overhead. KFS took HFS's principles further by eliminating global data structures or policies. KFS runs as a file system for the K42 [10] and Linux operating systems.

The basic aspect of KFS's enablement of fine-grained customization is that each virtual or physical resource instance (e.g., a particular file, open file instance, block allocation map) is implemented by a different set of (C++) objects. The goal of this object-oriented design is to allow each file system element to have the logical and physical representation that better matches its size and access pattern characteristics. Each element in the system can be serviced by the object that best fits its requirements; if the requirements change, the component representing the element in KFS can be replaced accordingly. Applications can achieve better performance by using the services that match their access patterns, scalability, and synchronization requirements.

When a KFS file system is mounted, the blocks on disk corresponding to the superblock are read, and a SuperBlock object is instantiated to represent it. A BlockMap object is also instantiated to represent block allocation information.

Another important object instantiated at file system creation time is the RecordMap, which keeps the association between file system elements, their object type, and their disk location. In many traditional Unix file systems, this association is fixed and implicit: every file or directory corresponds to an inode number; inode location and inode data representation is fixed a priori. Some file systems support dynamic block allocation for inodes and a set of alternative inode representations. In KFS, instead of trying to accommodate new possibilities for representation and dynamic policies incrementally, we take the riskier route of starting with a design intended to support change and diversity of representations. KFS explores the impact of an architecture centered on supporting the design and deployment of evolving alternative representations for file system resources. The goal is to learn how far this architecture can go in supporting flexibility, and what are the trade-offs involved in this approach.

An element (file or directory) in KFS is represented in memory by two objects: one providing a logical view of the object (called Logical Server Object, or LSO), and one encapsulating its persistent characteristics (Physical Server Object, or PSO).

Figure 1 portrays the scenario where three files are instantiated: a small file, a very large file, and a file where extended allocation is being used. These files are represented by a common logical object (LSO) and by PSO objects tailored for their specific characteristics: PSOsmall, PSOextent, PSOlarge. If the small file grows, the PSOsmall is replaced by the appropriate object (e.g., PSOlarge). The RecordMap object is updated in order to reflect the new object type and the (potential) new file location on disk.
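The on-the-fly replacement just described can be pictured with a small sketch. KFS itself is written in C++; the C rendering below, with a function-pointer table standing in for the PSO class hierarchy and every name invented, is only meant to show the shape of the mechanism, not KFS's actual code.

    /* Invented C sketch of KFS-style per-file physical objects. */
    struct file_elem;

    struct pso_ops {
            int (*read_block)(struct file_elem *fe, unsigned long blk, void *buf);
            int (*write_block)(struct file_elem *fe, unsigned long blk,
                               const void *buf);
    };

    struct file_elem {
            const struct pso_ops *pso;      /* current on-disk representation */
            unsigned long nr_blocks;
    };

    /* Two alternative representations, implemented elsewhere. */
    extern const struct pso_ops pso_small_ops;
    extern const struct pso_ops pso_large_ops;

    #define SMALL_FILE_LIMIT 12             /* arbitrary threshold for the sketch */

    /* On the write path: if a "small" file outgrows its layout, convert the
     * on-disk data, swap the ops pointer, and record the new object type and
     * location in the RecordMap (not shown). */
    static void maybe_promote(struct file_elem *fe)
    {
            if (fe->pso == &pso_small_ops && fe->nr_blocks > SMALL_FILE_LIMIT) {
                    /* ... migrate blocks to the large layout ... */
                    fe->pso = &pso_large_ops;
            }
    }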

Figure 1: Objects representing files with different size and access pattern characteristics in KFS.

KFS file systems may span multiple disks. Figure 2 pictures a scenario where file X is being replicated on the two available disks, while file Y is being striped on the two disks in a round-robin fashion, and file Z is also being replicated, but with its content being compressed before going to disk.

In the current KFS implementation, when a file is created the choice of object implementation to be used is explicitly made by the file system code based on simplistic heuristics. Also, the user could specify intended behavior by changing values on the appropriate objects residing on /proc, or use extended attributes to provide hints about the data elements they are creating or manipulating.

As the file system evolves, compatibility with "older" formats is kept as long as the file system knows how to instantiate the object type to handle the old representation.
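That compatibility rule is easy to picture if the RecordMap is imagined as a (type, location) pair per element plus a table of constructors keyed by type; again an invented C sketch, in the same spirit as the one above:

    /* Old on-disk formats keep working as long as a constructor for their
     * type id is still compiled in; all names here are hypothetical. */
    struct file_elem;

    enum pso_type { PSO_SMALL_V1, PSO_SMALL_V2, PSO_EXTENT };

    struct record {
            enum pso_type   type;           /* which PSO implementation */
            unsigned long   disk_loc;       /* where its metadata lives */
    };

    struct file_elem *pso_small_v1_create(unsigned long disk_loc); /* legacy */
    struct file_elem *pso_small_v2_create(unsigned long disk_loc);
    struct file_elem *pso_extent_create(unsigned long disk_loc);

    static struct file_elem *(*pso_create[])(unsigned long) = {
            [PSO_SMALL_V1]  = pso_small_v1_create,
            [PSO_SMALL_V2]  = pso_small_v2_create,
            [PSO_EXTENT]    = pso_extent_create,
    };

    static struct file_elem *record_instantiate(const struct record *r)
    {
            return pso_create[r->type](r->disk_loc);
    }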

The performance experiments with KFS for Linux 2.4 indicate that KFS's support for flexibility doesn't result in unreasonable overheads. KFS on Linux 2.4 (with an implementation of inode and directory structures matching ext2) was found to run with a performance similar to ext2 on many workloads and 15% slower on some. These results are described in [18]. New performance data is being made available at [11] as we tune KFS's integration with Linux 2.6.

There are two ongoing efforts on using KFS's flexible design to quickly prototype and evaluate opportunities for alternative policies and data representation:

• Co-location of meta-data and data: a prototype of directory structure for embedding file attributes in directory entries, where we extend the ideas proposed in [8] by doing meta-data and block allocation on a per-directory basis;

• Local vs global RecordMap structure: although KFS's design tried to avoid the use of global data structures, its initial prototype implementation has a single object (one instance of the RecordMap class) responsible for mapping elements onto type and disk location. As our experimentation progressed, it became clear that this data structure was hindering scalability and flexibility, and imposing performance overhead due to lack of locality between the RecordMap entry for a file and its data. Our current design explores associating different RecordMap objects with different parts of the name space.

More information about KFS can be found at [18, 7].


Figure 2: KFS objects and block allocation for files X (replicated; blocks X1, X2, X3), Y (striped; blocks Y1, Y2, Y3, Y4), and Z (compressed and replicated; blocks Z1, Z2, Z3).

4 Learnings from KFS towards an adaptable filesystem framework for Linux

What possibilities does the experience with KFS suggest towards addressing some of the concerns discussed in section 2.3? Intuitively, it appears that the ideas originating from HFS, and developed further in KFS, of a bottom-up, building-block-based, fine-grained approach to flexibility for experimenting with specialized behaviour and data organization on a per-element basis, could elegantly complement the current VFS approach of abstracting higher levels of commonality that allows an abundance of filesystems. While KFS was developed as a separate filesystem, we do not believe that adding yet another filesystem to Linux is the right answer. Instead, developing simple methods of applying KFS-like flexibility to existing filesystems incrementally, while possibly non-trivial, may be a promising approach to pursue.

4.1 Switching formats

While many of the file systems mentioned in Section 2 (ext2/3, reiserfs, JFS, XFS, OCFS2) are intended to be general-purpose file systems, each appears to have its own sweet-spot usage scenarios or access patterns that sets it apart from the rest. Various comparison charts [1, 17, 19] exist that highlight the strengths and weaknesses of these alternatives to enable administrators to make the right choice for their systems when they format their disks at the time of first installation. However, predicting the nature of workloads ahead of time, especially when mixed access patterns may apply to different files in the same filesystem at different times, is difficult. Unlike configuration options or run-time tunables, switching a choice of on-disk format on a system is highly disruptive and resource consuming. With an adaptable framework that can alter representations on a per-element basis in response to significant changes in access patterns, living with a less than optimal solution as a result of being locked into a given on-disk filesystem representation would no longer be necessary.

We have described earlier (section 3) that in KFS it is possible to create new formats and work with them, with other formats being simultaneously active as well. This is possible because (1) the association between a Logical Storage Object (LSO) and its associated Physical Storage Object (PSO) is not fixed, and (2) local rather than global control structures are used for different types of resource allocation. We plan to experiment with abstracting this mechanism so that it can be used by existing filesystems, for example for upgrading to new file representations like 64-bit and extents support in ext3, including the new multi-block allocator implementation from Alex Tomas [2]. We would like to compare the results with the current approach in ext3 for achieving the format switch, which relies on per-inode flags and alternate inode operation vectors.

4.2 Evaluation of alternate layouts and policies

The ability to switch formats and layout policies enables KFS to provide a platform for determining optimal layout and allocation policies for a filesystem through ongoing experimentation and comparative evaluation. Typically, layout decisions are very closely linked to the particular context in which the file system is intended to be used, e.g. considering underlying media, nature of application workloads, data access patterns, available space, filesystem aging/fragmentation, etc. With the mechanisms described in the previous sub-section in place, we intend to demonstrate performance advantages over the long run from being able to choose and switch from alternative representations, for example from a very small file oriented representation with data embedded within the inode, to direct, indirect block mapping, to extents maps based on file size, distribution and access patterns.

4.3 Assessment of overheads imposed by flexibility

It is said that any problem in computer science can be solved by adding an extra level of indirection. The only problem is that indirections do not come for free, especially if they involve extra disk seeks to perform an operation. It is for this reason that ext2/3, for example, attempts to allocate indirect blocks contiguously with the data blocks they map to, and why support for extended attributes storage in large inodes has been demonstrated to deliver significant performance gains for Samba workloads [3] compared to storage of attributes in a separate block. This issue has been a primary design consideration for KFS. The original HFS/KFS effort has been specifically conceptualized with a view towards enabling high performance and scalability through fine-grained flexibility, rather than with an intent of adding layers of high level semantic features.

Initial experiments with KFS seem to indicate that the benefits of flexibility outweigh overheads, given a well-designed meta-data caching mechanism and the right choice of building blocks, where indirections go straight to the most appropriate on-disk objects when a file element is first looked up. The ability to have entirely different and variable-sized "inode" representations for different types of files amortizes the cost across all subsequent operations on a file. It remains to be verified whether this holds on a broad set of workloads and whether the same argument would apply in the context of adding flexibility to existing filesystems without requiring invasive changes.

Would the reliance on C++ in KFS be a concern for application to existing Linux filesystems?


Inheritance has proven to be very useful during the development of per-element specialization in KFS, as it simplified coding to a great extent and enabled workload-specific optimizations. However, being able to move to C with a minimalist scheme may be a desirable goal when working with existing filesystems in the Linux kernel.

4.4 Ability to drop experimental layout changes easily

As described in section 2.3, the problem of being stuck with on-disk format changes once made necessitates a stringent approach towards adoption of layout improvements, which may result in years of lead time into actual deployment of state-of-the-art filesystem enhancements.

In KFS, as we add new formats, we can still work with old ones for compatibility, but the old ones can be kept out of the mainstream implementation. The code is activated when reading old representations, and we can on the fly "migrate" to the new format as we access it, albeit at a certain run-time overhead.

This does not however address the issue of handling backward compatibility with older kernels. Perhaps it would make sense to include a compatibility information scheme at a per-PSO level, similar to the ext2/3 filesystem's superblock level compatibility bitmaps. For example, in the situation where we are able to extend a given PSO type to a new scheme in a way that is backward compatible with the earlier PSO type (e.g. as in case of the directory indexing feature), we would like to indicate that so that an older kernel does not reject it.

5 Future work

Section 4 has discussed ongoing work on KFS that, in the short term, may result in useful insights for achieving an adaptable filesystem framework for Linux. In this section we discuss new work to be done that is essential to realizing this adaptable framework.

5.1 Adaptability in the filesystem checker and tools

With adaptable building blocks and support for multiple alternate formats as part of a per element specialization, it follows that filesystem checker changes would be required to be able to recognize, load, and handle new PSOs as well as perform some level of general consistency checks based on the indication of location and sizes common to all elements, and parsing the record map(s). As with existing filesystem checkers, backward compatibility remains a tricky issue. The PSO compatibility scheme discussed earlier could be used to flag situations where newer versions of tools are required. Likewise, other related utilities like low-level backup utilities (e.g. dump), migration tools, defragmenters, debugfs, etc. would need to be dynamically extendable to handle new formats.

5.2 Address the problem of sharing granular building blocks across filesystems

KFS was not originally designed with the intent of enabling building-block sharing across existing file systems. However, given potential benefits in factoring core ext2/3 code, data-consistency schemes and additional capabilities (e.g. clustering), as well as extensions to libfs beyond its current limited scope of application, it is natural to ask whether KFS-type constructs could be useful in this context.


Could the same principles that allow per element flexibility through building block composition be taken further, to enable abstraction of these building blocks in a way that is not tightly tied to the containing filesystem? Design issues that may need to be explored in order to evaluate the viability of this possibility include figuring out how a combination of PSOs, e.g. alternate layouts and alternate consistency schemes, could be built efficiently in the context of an existing filesystem in a manner that is reusable in the context of another filesystem. The effort involved in trying to refactor existing code into PSOs may not be small; a better approach may be to start this with a few simple building blocks and use the framework for new building blocks created from here on.

5.3 Additional Issues

Another aspect that needs further exploration is whether the inclusion of building blocks and associated specialization adds to overall testing complexity for distributions, or if the framework can be enhanced to enable a systematic approach to be devised to simplify such verification.

6 Conclusions

We presented KFS and discussed how the experience so far indicates that KFS can be a powerful approach to support flexibility of services in a file system down to a per-element granularity. We also argued that the approach does not come with unacceptable performance overheads, and that it allows for workload-specific optimizations. We believe that the flexibility in KFS makes it a good prototype environment for experimenting with new file system resource management policies or file representations. As new workloads emerge, KFS can be useful to the Linux community by providing evaluation results for alternative representations and by advancing the application of different data layouts for different files within the same filesystem, determined either statically or dynamically in response to changing access patterns.

In this paper we propose a new exploration of KFS: to investigate how its building-block approach could be abstracted from KFS's implementation to allow code sharing among file systems, providing a library-like collection of alternative implementations to be experimented with across file systems. We do not yet have evidence that this approach is feasible, but as we improve the integration of KFS (originally designed for the K42 operating system) with Linux 2.6 we hope to have a better understanding of KFS's general applicability.

Acknowledgements

We would like to thank Dave Kleikamp for contributing the section on JFS flexibility, and Christoph Hellwig for his inputs on XFS.

Orran Krieger started KFS based on his experience with designing and implementing HFS. He has been a steady source of guidance and motivation for KFS developers, always available for in-depth design reviews that improved KFS and made our work more fun. We thank him for his insight and support.

We would like to thank Paul McKenney, Val Henson, Mingming Cao, and Chris Mason for their valuable review feedback on early drafts of this paper, and H. Peter Anvin and Mel Gorman for helping us improve our proposal. This paper might never have been written but for insightful discussions with Stephen Tweedie, Theodore Ts'o, and Andreas Dilger over the course of the last year. We would like to thank them and other ext3 filesystem developers for their steady focus on continuously improving the ext3 filesystem, devising interesting techniques to address conflicting goals of dependability vs state-of-the-art advancement. Finally we would like to thank the linux-fsdevel community for numerous discussions over the years which motivated this paper.

This work was partially supported by DARPA under contract NBCH30390004.

Availability

KFS is released as open source as part of the K42 system, available from a public CVS repository; for details refer to the K42 web site: http://www.research.ibm.com/K42/.

Legal Statement

Copyright © 2006 IBM.

This work represents the view of the authors and does not necessarily represent the view of IBM.

IBM and OS/2 are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, and service names may be trademarks or service marks of others.

References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

This document is provided "AS IS," with no express or implied warranties. Use the information in this document at your own risk.

References

[1] Ray Bryant, Ruth Forester, and John Hawkes. Filesystem performance and scalability in Linux 2.4.17. In USENIX Annual Technical Conference, 2002.

[2] M. Cao, T. Y. Ts'o, B. Pulavarty, S. Bhattacharya, A. Dilger, and A. Tomas. State of the art: Where we are with the ext3 filesystem. In Proceedings of the Ottawa Linux Symposium (OLS), pages 69–96, 2005.

[3] Jonathan Corbet. Which filesystem for samba4? http://lwn.net/Articles/112566/.

[4] Jonathan Corbet. Creating Linux virtual file systems. http://lwn.net/Articles/57369/, November 2003.

[5] Jonathan Corbet. The OCFS2 filesystem. http://lwn.net/Articles/137278/, May 2005.

[6] Alan Cox. Posting on linux-fsdevel. http://marc.theaimsgroup.com/?l=linux-fsdevel&m=112558745427067&w=2, September 2005.

[7] Dilma da Silva, Livio Soares, and Orran Krieger. KFS: Exploring flexibility in file system design. In Proc. of the Brazilian Workshop in Operating Systems, Salvador, Brazil, August 2004.


[8] Greg Ganger and Frans Kaashoek. Embedded inodes and explicit groupings: Exploiting disk bandwidth for small files. In Proceedings of the 1997 Usenix Annual Technical Conference, pages 1–17, January 1997.

[9] Val Henson, Zach Brown, Theodore Ts'o, and Arjan van de Ven. Reducing fsck time for ext2 filesystems. In Proceedings of the Ottawa Linux Symposium (OLS), 2006.

[10] The K42 operating system, http://www.research.ibm.com/K42/.

[11] KFS performance experiments. http://k42.ozlabs.org/Wiki/KfsExperiments, 2006.

[12] O. Krieger and M. Stumm. HFS: A performance-oriented flexible filesystem based on building-block compositions. ACM Transactions on Computer Systems, 15(3):286–321, 1997.

[13] Orran Krieger. HFS: A Flexible File System for Shared-Memory Multiprocessors. PhD thesis, Department of Electrical and Computer Engineering, University of Toronto, 1994.

[14] Orran Krieger and Michael Stumm. HFS: A flexible file system for large-scale multiprocessors. In Proceedings of the DAGS/PC Symposium (The Second Annual Dartmouth Institute on Advanced Graduate Studies in Parallel Computation), 1993.

[15] Namesys. Reiser4. http://www.namesys.com/v4/v4.html, August 2004.

[16] Daniel Phillips. A directory index for ext2. In 5th Annual Linux Showcase and Conference, pages 173–182, 2001.

[17] Justin Piszcz. Benchmarking File Systems Part II.

[18] Livio Soares, Orran Krieger, and Dilma Da Silva. Meta-data snapshotting: A simple mechanism for file system consistency. In SNAPI'03 (International Workshop on Storage Network Architecture and Parallel I/O), pages 41–52, 2003.

[19] John Troy Stepan. Linux File Systems Comparative Performance. Linux Gazette, January 2006. http://linuxgazette.net/122/TWDT.html.

[20] IBM JFS Core Team.

[21] SGI XFS Team. XFS: a high-performance journaling filesystem.

[22] Stephen Tweedie and Theodore Y. Ts'o. Planned extensions to the Linux ext2/3 filesystem. In USENIX Annual Technical Conference, pages 235–244, 2002.

[23] Stephen C. Tweedie. Re: RFC: mke2fs with dir_index, resize_inode by default, March 2006.

[24] Linux kernel source code VFS documentation. File Documentation/filesystems/vfs.txt.


Multiple Instances of the Global Linux Namespaces

Eric W. Biederman
Linux Networx

[email protected]

Abstract

Currently Linux has the filesystem namespace for mounts, which is beginning to prove useful. By adding additional namespaces for process ids, SYS V IPC, the network stack, user ids, and probably others we can, at a trivial cost, extend the UNIX concept and make novel uses of Linux possible. Multiple instances of a namespace simply means that you can have two things with the same name.

For servers the power of computers is growing, and it has become possible for a single server to easily fulfill the tasks of what previously required multiple servers. Hypervisor solutions like Xen are nice but they impose a performance penalty and they do not easily allow resources to be shared between multiple servers.

For clusters, application migration and preemption are interesting cases but almost impossibly hard, because you cannot restart the application once you have moved it to a new machine, as usually there are resource name conflicts.

For users, certain desktop applications interface with the outside world and are large and hard to secure. It would be nice if those applications could be run in their own little world to limit what a security breach could compromise.

Several implementations of this basic idea have been done successfully. Now the work is to create a clean implementation that can be merged into the Linux kernel. The discussion has begun on the linux-kernel list and things are slowly progressing.

1 Introduction

1.1 High Performance Computing

I have been working with high performance clusters for several years and the situation is painful. Each Linux box in the cluster is referred to as a node, and applications running or queued to run on a cluster are jobs.

Jobs are run by a batch scheduler and, once launched, each job runs to completion, typically consuming 99% of the resources on the nodes it is running on.

In practice a job cannot be suspended, swapped, or even moved to a different set of nodes once it is launched. This is the oldest and most primitive way of running a computer. Given the long runs and high computation overhead of HPC jobs it isn't a bad fit for HPC environments, but it isn't a really good fit either.

Linux has much more modern facilities. What prevents us from just using them?


HPC jobs currently can be suspended, but that just takes them off the cpu. If you have sufficient swap there is a chance the jobs can even be pushed to swap, but frequently these applications lock pages in memory so they can be ensured of low latency communication.

The key problem is simply having multiple machines and multiple kernels. In general, how to take an application running under one Linux kernel and move it completely to another kernel is an unsolved problem.

The problem is unsolved not because it is fundamentally hard, but simply because it has not been a problem in the UNIX environment. Most applications don't need multiple machines or even big machines to run on (especially with Moore's law exponentially increasing the power of small machines). For many of the rest, the large multiprocessor systems have been large enough.

What has changed is the economic observation that a cluster of small commodity machines is much cheaper and equally as fast as a large supercomputer.

The other reason this problem has not been solved (besides the fact that most people working on it are researchers) is that it is not immediately obvious what a general solution is. Nothing quite like it has been done before, so you can't look into a text book or into the archives of history and find a solution, which in the broad strokes of operating system theory is a rarity.

The hard part of the problem also does not lie in the obvious place people first look—how to save all of the state of an application. Instead, the problem is how to restore a saved application so it runs successfully on another machine.

The problem with restoring a saved application is all of the global resources an application uses: process ids, SYS V IPC identifiers, filenames, and the like. When you restore an application on another machine there is no guarantee that it can reuse the same global identifiers, as another process on that machine may be using those identifiers.

There are two general approaches to solving this problem: modifying things so these global machine identifiers are unique across all machines in a cluster, or modifying things so these machine global identifiers can be repeated on a single machine. Many attempts have been made to scale global identifiers cluster-wide—Mosix, OpenSSI, bproc, to name a few—and all of them have had to work hard to scale. So I choose to go with an implementation that will easily scale to the largest of clusters, with no communication needed to do so.

This has the added advantage that in a cluster it doesn't change the programming model presented to the user. Just some machines will now appear as multiple machines. As the rise of the internet has shown, building applications that utilize multiple machines is not foreign to the rest of the computing world either.

To make this happen I need to solve the challenging problem of how to refactor the UNIX/Linux API so that we can have multiple instances of the global Linux namespaces. The Plan 9 inspired mount/filesystem namespace has already proved how this can be done and is slowly proving useful.

1.2 Jails

Outside the realm of high performance computing, people have been restricting their server applications to chroot jails for years. The problems with chroot jails have become well understood and people have begun fixing them. First BSD jails, and then Solaris containers, are some of the better known examples.


Under Linux the open source community has not been idle. There is the linux-jail project, Vserver, Openvz, and related projects like SELinux, UML, and Xen.

Jails are a powerful general purpose tool useful for a great variety of things. In resource utilization jails are cheap: dynamically loading glibc is likely to consume more memory than the additional kernel state needed to track a jail. Jails allow applications to be run in simple stripped down environments, increasing security and decreasing maintenance costs, while leaving system administrators with the familiar UNIX environment.

The only problem with the current general purpose implementations of jails under Linux is that nothing has been merged into the mainline kernel, and the code from the various projects is not really mergable as it stands. The closest I have seen is the code from the linux-jail project, and that is simply because it is less general purpose and implemented completely as a Linux security module.

Allowing multiple instances of global names, which is needed to restore a migrated application, is a more constrained problem than that of simply implementing a jail. But a jail that ensures you can have multiple instances of all of the global names is a powerful general purpose jail that you can run just about anything in. So the two problems can share a common kernel solution.

1.3 Future Directions

A cheap and always available jail mechanism is also potentially quite useful outside the realm of high performance computing and server applications. A general purpose checkpoint/restart mechanism can allow desktop users to preserve all of their running applications when they log out. Vulnerable or untrusted applications like a web browser or an irc client could be contained so that if they are attacked all that is gained is the ability to browse the web and draw pictures in an X window.

There would finally be an answer to the age-old question: How do I preserve my user space while upgrading my kernel?

All this takes is an eye to designing the interfaces so they are general purpose and nestable. It should be possible to have a jail inside a jail inside a jail forever, or at least until the resources to support it run out.

2 Namespaces

How much work is this? Looking at the existing patches it appears that 10,000 to 20,000 lines of code will ultimately need to be touched. The core of Linux is about 130,000 lines of code, so we will need to touch between 7% and 15% of the core kernel code. This clearly indicates that one giant patch to do everything, even if it were perfect, would be rejected simply because it is too large to be reviewed.

In Linux there are multiple classes of global identifiers (i.e. process id, SYS V IPC keys, user ids). Each class of identifier can be thought of as living in its own namespace.

This gives us a natural decomposition of the problem, allowing each namespace to be modified separately so we can support multiple instances of that namespace. Unfortunately this also increases the difficulty of the problem, as we need to modify the kernel's reporting and configuration interfaces to support multiple instances of a namespace instead of having them tightly coupled.


The plan then is simple. Maintain backwards compatibility. Concentrate on one namespace at a time. Focus on implementing the ability to have multiple objects of a given type with the same name. Configure a namespace from the inside using the existing interfaces. Think of these things ultimately not as servers or virtual machines but as processes with peculiar attributes. As far as possible, implement the namespaces so that an application can be given the capability bit that allows full control over a namespace and still not be able to escape. Think in terms of a recursive implementation so we always keep in mind what it takes to recurse indefinitely.

What the system call interface will be to create a new instance of a namespace is still up for debate. The current contenders are a new CLONE_ flag or individual system calls. I personally think a flag to clone and unshare is all that is needed, but if arguments are actually needed a new system call makes sense.
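If the clone/unshare flag approach wins, using a new namespace from user space could look roughly like the following. CLONE_NEWUTS is only one possible spelling of such a flag (the value below is the one later kernels adopted), so the whole program should be read as a hypothetical sketch rather than an existing interface.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/utsname.h>
    #include <unistd.h>

    #ifndef CLONE_NEWUTS
    #define CLONE_NEWUTS 0x04000000   /* hypothetical at the time of writing */
    #endif

    int main(void)
    {
            struct utsname u;

            /* Ask for a private copy of the UTS namespace. */
            if (unshare(CLONE_NEWUTS) < 0) {
                    perror("unshare");
                    return 1;
            }

            /* Only this namespace sees the new nodename; the rest of the
             * system keeps the old hostname. */
            sethostname("private-node", strlen("private-node"));

            uname(&u);
            printf("nodename here: %s\n", u.nodename);
            return 0;
    }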

Currently I have identified ten separate namespaces in the kernel: the filesystem mount namespace, the uts namespace, the SYS V IPC namespace, the network namespace, the pid namespace, the uid namespace, the security namespace, the security keys namespace, the device namespace, and the time namespace.

2.1 The Filesystem Mount Namespace

Multiple instances of the filesystem mount namespace are already implemented in the stable Linux kernels, so there are few real issues with implementing it. There are still outstanding questions on how to make this namespace usable and useful to unprivileged processes, as well as some ongoing work to allow bind mounts in the filesystem mount namespace to have bind flags, for example so that a bind mount can be restricted read-only while other mounts of the filesystem are still read/write.

    int uname(struct utsname *buf);

    struct utsname {
            char sysname[];
            char nodename[];
            char release[];
            char version[];
            char machine[];
            char domainname[];
    };

Figure 1: uname

CAP_SYS_ADMIN is currently required to modify the mount namespace, although there has been some discussion on how to relax the restrictions for bind mounts.

2.2 The UTS Namespace

The UTS namespace characterizes and identifies the system that applications are running on. It is characterized by the uname system call. uname returns six strings describing the current system. See Figure 1.

The returned utsname structure has only two members that vary at runtime: nodename and domainname. nodename is the classic hostname of a system. domainname is the NIS domain name. CAP_SYS_ADMIN is required to change these values, and when the system boots they start out as "(none)".

The pieces of the kernel that report and modify the utsname structure are not connected to any of the big kernel subsystems, build even when CONFIG_NET is disabled, and use CAP_SYS_ADMIN instead of one of the more specific capabilities. This clearly shows that the code has no affiliation with one of the larger namespaces.

Allowing for multiple instances of the UTS namespace is a simple matter of allocating a new copy of struct utsname in the kernel for each different instance of this namespace, and modifying the system calls that modify this value to look up the appropriate instance of struct utsname by looking at current.
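A minimal sketch of that approach is shown below; the structure and field names are illustrative assumptions for this discussion, not a final in-kernel API.

/* Sketch only: per-instance UTS state, referenced from the task. */
struct uts_namespace {
        struct kref    kref;    /* lifetime of this namespace instance */
        struct utsname name;    /* this instance's sysname, nodename, ... */
};

/* uname()-style code then copies from the instance found via current
   instead of from a single global struct utsname. */
static long do_uname(struct utsname __user *buf)
{
        struct uts_namespace *ns = current->uts_ns;  /* assumed task field */

        if (copy_to_user(buf, &ns->name, sizeof(ns->name)))
                return -EFAULT;
        return 0;
}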

2.3 The IPC Namespace

The SYS V interprocess communication namespace controls access to the flavor of shared memory, semaphores, and message queues introduced in SYS V UNIX. Each object has associated with it a key and an id, and all objects are globally visible and exist until they are explicitly destroyed.

The id values are unique for every object of that type and are assigned by the system when the object is created.

The key is assigned by the application, usually at compile time, and is unique unless it is specified as IPC_PRIVATE, in which case the key is simply ignored.

The ipc namespace is currently limited by the following universally readable and uid 0 settable sysctl values:

• kernel.shmmax The maximum shared memory segment size.

• kernel.shmall The maximum combined size of all shared memory segments.

• kernel.shmmni The maximum number of shared memory segments.

• kernel.msgmax The maximum message size.

• kernel.msgmni The maximum number of message queues.

• kernel.msgmnb The maximum number of bytes in a message queue.

• kernel.sem An array of 4 control integers:

  – sc_semmsl The maximum number of semaphores in a semaphore set.

  – sc_semmns The maximum number of semaphores in the system.

  – sc_semopm The maximum number of semaphore operations in a single system call.

  – sc_semmni The maximum number of semaphore sets in the system.

Operations in the ipc namespace are limited by the following capabilities:

• CAP_IPC_OWNER Allows overriding the standard ipc ownership checks, for the following operations: shm_attach, shm_stat, shm_get, msgrcv, msgsnd, msg_stat, msg_get, semtimedop, sem_getall, sem_setall, sem_stat, sem_getval, sem_getpid, sem_getncnt, sem_getzcnt, sem_setval, sem_get. For filesystems namei.c uses CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH to provide the same level of control.

• CAP_IPC_LOCK Required to control locking of shared memory segments in memory.

• CAP_SYS_RESOURCE Allows setting the maximum number of bytes in a message queue to exceed kernel.msgmnb.

• CAP_SYS_ADMIN Allows changing the ownership of, and removing, any ipc object.

Allowing for multiple instances of the ipc namespace is a straightforward process of duplicating the tables used for lookup by key and id, and modifying the code to use current to select the appropriate table. In addition the sysctls need to be modified to look at current and act on the corresponding copy of the namespace.
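Concretely, the state that today lives in globals would be gathered into a per-instance structure along these lines; the structure and field names below are assumptions made for illustration, not existing kernel symbols.

struct ipc_namespace {
        struct kref    kref;
        struct ipc_ids sem_ids;     /* per-instance key/id lookup tables */
        struct ipc_ids msg_ids;
        struct ipc_ids shm_ids;
        size_t         shm_ctlmax;  /* kernel.shmmax for this instance */
        size_t         shm_ctlall;  /* kernel.shmall */
        int            shm_ctlmni;  /* kernel.shmmni */
        int            msg_ctlmax;  /* kernel.msgmax */
        int            msg_ctlmni;  /* kernel.msgmni */
        int            msg_ctlmnb;  /* kernel.msgmnb */
        int            sem_ctls[4]; /* semmsl, semmns, semopm, semmni */
};

The shm, msg and sem code would then look up current's ipc namespace before consulting these tables and limits, and the sysctl handlers would read and write the copy belonging to the caller.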

The ugly side of working with this namespace is the capability situation. CAP_IPC_OWNER trivially becomes restricted to the current ipc namespace. CAP_IPC_LOCK still remains dangerous. CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH might be ok; it really depends on the state of the filesystem mount namespace. CAP_SYS_RESOURCE and CAP_SYS_ADMIN are still unsafe to give to untrusted applications.

2.4 The Network Namespace

By volume the code implementing the network stack is the largest of the subsystems that needs its own namespace. Once you look at the network subsystem from the proper slant it is straightforward to allow user space to have what appears to be multiple instances of the network stack, and thus you have a network namespace.

The core abstractions used by the network stack are processes, sockets, network devices, and packets. The rest of the network stack is defined in terms of these.

In adding the network namespace I add a few simple rules:

• A network device belongs to exactly one network namespace.

• A socket belongs to exactly one network namespace.

A packet comes into the network stack from the outside world through a network device. We can then look at that device to find the network namespace and the rules by which to process that packet.

We generate a packet and feed it to the kernel through a socket. The kernel looks at the socket, finds the network namespace, and from there the rules by which to process the packet.

What this means is that most of the network stack global variables need to be moved into the network namespace data structure. Hopefully we can write this so the extra level of indirection will not reduce the performance of the network stack.
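A rough sketch of the resulting data structure and the device rule is shown below; all names are illustrative assumptions, not existing kernel symbols.

/* Formerly global network state gathered into a per-namespace structure. */
struct net_ns {
        struct kref        kref;
        struct net_device *loopback_dev; /* each instance gets its own loopback */
        struct net_device *dev_list;     /* devices owned by this namespace */
        struct hlist_head *tcp_ehash;    /* formerly global socket hash tables */
        /* routing tables, per-protocol sysctls, statistics, ... */
};

/* Rule: a device belongs to exactly one namespace, so an incoming
   packet's namespace is found from the device it arrived on. */
static inline struct net_ns *packet_ns(struct sk_buff *skb)
{
        return skb->dev->net_ns;         /* assumed back-pointer on net_device */
}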

Looking up the network stack global variables through the network namespace is not quite enough. Each instance of the network namespace needs its own copy of the loopback device, an interface needs to be added to move network devices between network namespaces, and a two-headed tunnel device (a cousin of the loopback device) needs to be added so we can send packets between different network namespaces.

With these in place it is safe to give processes in a separate network namespace CAP_NET_BIND_SERVICE, CAP_NET_BROADCAST, CAP_NET_ADMIN, and CAP_NET_RAW. All of the functionality will work, and working through the existing network devices there won't be an ability to escape the network namespace. Care should be taken in giving an untrusted process access to real network devices, though, as hardware or software bugs in the implementation of that network device could reduce security.

The future direction of the network stack is towards Van Jacobson network channels, where more of the work is pushed towards process context, and happens in sockets. That work appears to be a win for network namespaces in two ways. More work happening in process context and in well defined sockets means it is easier to look up the network namespace, and thus cheaper. Having a lightweight packet classifier in the network drivers should allow a single network device to appear to user space as multiple network devices, each with a different hardware address. Then, based upon the destination hardware address, the packet can be placed in one of several different network namespaces. Today, to get the same semantics I need to configure the primary network device as a router, or configure ethernet bridging between the real network and the network interface that sends packets to the secondary network namespace.

2.5 The Process Id Namespace

The venerable process id is usually limited to 16 bits so that the bitmap allocator is efficient and so that people can read and remember the ids. The identifiers are allocated by the kernel, and identifiers that are no longer in use are periodically reused for new processes. A single identifier value can refer to a process, a thread group, a process group and a session, but every use starts life as a process identifier.

The only capability associated with process ids is CAP_KILL, which allows sending signals to any process even if the normal security checks would not allow it.

Process identifiers are used to identify the current process, to identify the process that died, to specify a process or set of processes to send a signal to, to specify a process or set of processes to modify, to specify a process to debug, and in system monitoring tools to specify which process the information pertains to. In other words, process identifiers are deeply entrenched in the user/kernel interface and are used for just about everything.

In a lot of ways implementing a process id namespace is straightforward, as it is clear how everything should look from the inside. There should be a group of all of the processes in the namespace that kill -1 sends signals to. Either a new pid hash table needs to be allocated or the key in the pid hash table needs to be modified to include the pid namespace. A new pid allocation bitmap needs to be allocated. /proc/sys/kernel/pid_max needs to be modified to refer to the current pid allocation bitmap. When the pid namespace exits, all of the processes in the pid namespace need to be killed. Kernel threads need to be modified to never start up in anything except the default pid namespace. A process that has pid 1 must exist that will not receive any signals except for the ones it installs a signal handler for.
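The per-instance state that this list implies might look roughly as follows; the names are illustrative and simply mirror the description above rather than any committed implementation.

struct pid_namespace {
        struct kref           kref;
        struct pidmap         pidmap[PIDMAP_ENTRIES]; /* private pid allocation bitmap */
        int                   last_pid;
        int                   pid_max;                /* this instance's pid_max */
        struct task_struct   *child_reaper;           /* the namespace's "pid 1" */
        struct pid_namespace *parent;                  /* pid namespaces form a hierarchy */
};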

How a pid namespace should look from the outside is a much more delicate question. How should processes in a non-default pid namespace be displayed in /proc? Should any of the processes in a pid namespace show up in any other pid namespace? How much of the existing infrastructure that takes pids should continue to work?

This is one of the few areas where the discussion on the kernel list has come to a complete standstill, as an inexpensive technical solution to everyone's requirements was not visible at the time of the conversation.

A big piece of what makes the process id namespace different is that processes are organized into a hierarchical tree. Maintaining the parent/child relationship between the process that initiates the pid namespace and the first process in the new pid namespace requires that the first process in the new pid namespace have two pids: a pid in the namespace of the parent and pid 1 in its own namespace. This results in namespaces that are hierarchical, unlike most namespaces, which are completely disjoint.


Having one process with two pids looks like a serious problem. It gets worse if we want that process to show up in other pid namespaces.

After looking deeply at the underlying mechanisms in the kernel I have started moving things away from pid_t to pointers to struct pid. The immediate gain is that the kernel becomes protected from pid wraparound issues.

Once all of the references that matter are struct pid pointers inside the kernel, a different implementation becomes possible. We can hang multiple <pid namespace, pid_t> tuples off struct pid, allowing us to have a different name for the same pid in several pid namespaces.
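One way to picture that is sketched below; the layout and names are hypothetical, chosen only to show a struct pid carrying a different numeric value in each namespace it is visible in.

/* One <namespace, number> pair per pid namespace the pid is visible in. */
struct pid_name {
        struct pid_namespace *ns;
        pid_t                 nr;
};

struct pid {
        atomic_t        count;
        int             level;      /* depth in the pid namespace hierarchy */
        struct pid_name names[1];   /* names[0] .. names[level] */
};

/* Translate a struct pid into the number a particular namespace sees. */
static pid_t pid_nr_in_ns(struct pid *pid, struct pid_namespace *ns)
{
        int i;

        for (i = 0; i <= pid->level; i++)
                if (pid->names[i].ns == ns)
                        return pid->names[i].nr;
        return 0;   /* not visible in that namespace */
}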

With processes in subordinate pid namespaces at least potentially showing up when we need them, we can preserve the existing UNIX API for all functions that take pids and not need to reinvent pid namespace specific solutions.

The question yet to be answered in my mind is: do we always map a process's struct pid into all of its ancestors' pid namespaces, or do we provide a mechanism that performs those mappings on demand?

2.6 The User and Group ID Namespace

In the kernel user ids are used both for accounting and for performing security checks. The per-user accounting is connected to the user_struct. Security checks are done against uid, euid, suid, fsuid, gid, egid, sgid, fsgid, process capabilities, and variables maintained by a Linux security module. The kernel allows any of the uid/gid values to rotate between slots, or, if a process has CAP_SETUID, arbitrary values to be set into the filesystem uids.

With a uid namespace the security checks for equality of uids become checks to ensure the entire tuple <uid namespace, uid> is equal, which means that if two uids are in different namespaces the check will always fail. So the only permitted cases across uid namespaces will be when everyone is allowed to perform the action the process is trying to perform, or when the process has the appropriate capability to perform the action on any process.
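In code this amounts to widening the comparison, roughly as below (the types and helper name are assumptions for illustration):

/* A uid only has meaning together with the namespace it belongs to. */
struct ns_uid {
        struct uid_namespace *ns;
        uid_t                 uid;
};

static inline int ns_uid_eq(struct ns_uid a, struct ns_uid b)
{
        /* Equal only when both the namespace and the numeric id match. */
        return a.ns == b.ns && a.uid == b.uid;
}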

An alternative in some cases to modifying all of the checks to be against <namespace, uid> tuples is to modify some of the checks to be against user_struct pointers.

Since uid namespaces are not persistent, mapping a uid namespace to filesystems requires some new mechanisms. The primary mechanism is to associate with each super_block the uid namespace of the filesystem, probably moving that information into each struct inode in the kernel for speed and flexibility.

Allowing sharing of filesystem mounts between different uid namespaces requires either using acls to tag inodes with non-default filesystem namespace information, or using the key infrastructure to provide a mapping between different uid namespaces.

Virtual filesystems require special care, as frequently they allow access to all kinds of special kernel functionality without any capability checks if the uid of a process equals 0. So virtual filesystems like proc and sysfs must specify the default kernel uid namespace in their superblock, or it will be trivial to violate the kernel security checks.

There is a question of whether the change in rules and mechanisms should take place in the core kernel code, making it uid namespace aware, or in a Linux security module. A key part of that decision is the uid hash table and user_struct. From my reading of the kernel code it appears that current Linux security modules can only further restrict the default kernel permission checks, and there is not a hook that makes it possible to allocate a different user_struct depending on security module policies.

This means that at least the allocation of user_struct, and quite possibly making all of the uid checks fail if the uid namespaces are not equal, should happen in the core of the kernel, with security modules standing in the background providing really advanced facilities.

With a uid namespace it becomes safe to give untrusted users CAP_SETUID without reducing security.

2.7 Security Modules and Namespaces

There are two basic approaches that can be pursued to implement multiple instances of user space. Objects in the kernel can be isolated by policy and security checks with security modules, or they can be isolated by making visible only the objects you are allowed to access, using namespaces.

The Linux Jail module (http://sf.net/projects/linuxjail) implemented by Serge E. Hallyn <[email protected]> is a good example of what can be done with just a security module, isolating a group of processes with permission checks and policy rather than simply making the inaccessible parts of the system disappear.

Following that general principle, Linux security modules have two different roles they can play when implementing multiple instances of user space. They can make up for any unimplemented kernel namespace by isolating objects with additional permission checks, which is good as a short term solution. Linux security modules modified to be container aware can also provide enhanced security enforcement mechanisms in containers. In essence this second modification is the implementation of a namespace for security mechanisms and policy.

2.8 The Security Keys Namespace

Not long ago someone added to the kernel what is the frustration of anyone attempting to implement namespaces to allow for the migration of user space: another obscure and little-known global namespace.

In this case each key on a key ring is assigned a global key_serial_t value. CAP_SYS_ADMIN is used to guard ownership and permission changes.

I have yet to look at this in detail, but at first glance it looks like one of the easy cases, where we can simply implement another copy of the lookup table. It appears the key_serial_t values are just used for manipulation of the security keys from user space.

2.9 The Device Namespace

Not giving children in containers CAP_SYS_MKNOD and not mounting sysfs is sufficient to prevent them from accessing any device nodes that have not been audited for use by that container. Getting a new instance of the uid/gid namespace is enough to remove access to magic sysfs entries controlling devices, although there is some question of how to bring them back.

For purposes of migration, unless all devices a set of processes has access to are purely virtual, pretending the devices haven't changed is nonsense. Instead it makes much more sense to explicitly acknowledge that the devices have changed and send hotplug remove and add events to the set of processes.

With the use of hotplug events, the assumption that the global major and minor numbers that a device uses are constant is removed.

Equally as sensitive as CAP_SYS_MKNOD, and probably more important if mounting sysfs is allowed, is CAP_CHOWN, which allows changing the owner of a file; it would be required to change the owner of a sysfs file before a sensitive file could be accessed.

So in practice managing the device namespace appears to be a user space problem, with restrictions on CAP_SYS_MKNOD and CAP_CHOWN being used to implement the filter policy of which devices a process has access to.

2.10 The Time Namespace

The kernel provides access to several clocks, with CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_PROCESS_CPUTIME_ID, and CLOCK_THREAD_CPUTIME_ID being the primary ones.

CLOCK_REALTIME reports the current wall clock time.

CLOCK_MONOTONIC is similar to CLOCK_REALTIME except that it cannot be set and thus never goes backwards.

CLOCK_PROCESS_CPUTIME_ID reports how much aggregate cpu time the threads in a process have consumed.

CLOCK_THREAD_CPUTIME_ID reports how much cpu time an individual thread has consumed.

If process migration is not a concern, none of these clocks except possibly CLOCK_REALTIME is interesting. In the context of process migration all of these clocks become interesting.

The thread and process clocks simply need an offset field so the amount of time spent on the previous machine can be added in, so that we can prevent these clocks from going backwards.

The monotonic clock needs an offset field so that we can guarantee that it never goes backwards in the presence of process migration.

The realtime clock matters the least, but having an additional offset field for this clock adds additional flexibility to the system and comes at practically no cost.
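A sketch of what those offsets amount to when a clock is read is shown below; the structure, the raw_clock_gettime() helper, and timespec_add() are placeholders used for illustration only.

/* Per-namespace offsets applied on read, so a migrated process never
   observes a clock running backwards. */
struct time_namespace {
        struct timespec monotonic_offset;
        struct timespec realtime_offset;
};

static void ns_clock_gettime(struct time_namespace *ns, clockid_t clock,
                             struct timespec *ts)
{
        raw_clock_gettime(clock, ts);   /* placeholder for the unadjusted clock */

        if (clock == CLOCK_MONOTONIC)
                *ts = timespec_add(*ts, ns->monotonic_offset);
        else if (clock == CLOCK_REALTIME)
                *ts = timespec_add(*ts, ns->realtime_offset);
}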

All of the clocks except for CLOCK_MONOTONIC support setting the clock with clock_settime, so the existing control interfaces are sufficient. For the monotonic clock things are different: clock_settime is not allowed to set the clock, ensuring that the time will never run backwards, and there is a huge amount of sense in that logic.

The only reasonable course I can see is setting the monotonic clock when the time namespace is created, which probably means we will need a syscall (and not a clone flag) to create the time namespace, and we should provide it with the minimum acceptable value of the monotonic clock.

3 Modifications of Kernel Interfaces

One of the biggest challenges with implementing multiple namespaces is how we modify the existing kernel interfaces in a way that retains backwards compatibility for existing applications while still allowing the reality of the new situation to be seen and worked with.


Modifying existing system calls is probably the easiest case. Instead of referencing global variables we reference variables through current.

Modifying /proc is trickier. Ideally we would introduce a new subdirectory for each class of namespace, and in that directory list each instance of that namespace, adding symbolic links from the existing names where appropriate. Unfortunately it is not possible to obtain a list of namespaces, and the extra maintenance cost does not yet seem to justify the extra complexity and cost of a linked list. So for proc what we are likely to see is the information for each namespace listed in /proc/pid, with a symbolic link from the existing location into the new location under /proc/self/.

Modifying sysctl is fairly straightforward but a little tricky to implement. The problem is that sysctl assumes it is always dealing with global variables. As we put those variables into namespaces we can no longer store a pointer to a global variable in the sysctl table. So we need to modify the implementation of sysctl to call a function which takes a task_struct argument to find where the variable is located. Once that is done we can move /proc/sys into /proc/pid/sys.
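Sketched out, the sysctl table entry would carry a lookup function instead of a bare pointer; the names here are invented for the example and assume the ipc namespace fields sketched earlier.

/* Instead of a pointer to a global, each entry resolves its variable in
   the namespace of the task using /proc/<pid>/sys. */
struct ns_ctl_table {
        const char *name;
        int         maxlen;
        void      *(*lookup)(struct task_struct *task);
};

static void *shmmax_lookup(struct task_struct *task)
{
        return &task->ipc_ns->shm_ctlmax;   /* assumed fields, see the ipc sketch */
}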

Modifying sysfs doesn't come up for most of the namespaces, but it is a serious issue for the network namespace. I haven't a clue what the final outcome will be, but assuming we want global visibility for monitoring applications, something like /proc/pid and /proc/self needs to be added so we can list multiple instances and add symbolic links from their old locations in sysfs.

The netlink interface is the most difficult kernel interface to work with. Because control and query packets are queued and not necessarily processed by the application that sends the query, getting the context information necessary to look up the appropriate global variables is a challenge. It can even be difficult to figure out which port namespace to reply to. As long as the only users of netlink are part of the networking stack I have an implementation that solves the problems. However, as netlink has been suggested for other things, I can't count on just the network stack processing packets.

Resource counters are one of the more challenging interfaces to specify. Deciding if an interface should be per namespace or global is a challenge, as is answering the question of how this works when we have recursive instances of a namespace. All of these concerns are exemplified when there are multiple untrusted users on the system. For starters we should be able to punt and implement something simple, and require CAP_SYS_RESOURCE if there are any per namespace resource limits. That should leave us with a simple and correct implementation. Then the additional concerns can be addressed from there.


Fully Automated Testing of the Linux Kernel

Martin Bligh, Google Inc.
[email protected]

Andy P. Whitcroft, IBM Corp.
[email protected]

Abstract

Some changes in the 2.6 development process have made fully automated testing vital to the ongoing stability of Linux®. The pace of development is constantly increasing, with a rate of change that dwarfs most projects. The lack of a separate 2.7 development kernel means that we are feeding change more quickly and directly into the main stable tree. Moreover, the breadth of hardware types that people are running Linux on is staggering. Therefore it is vital that we catch at least a subset of introduced bugs earlier on in the development cycle, and keep up the quality of the 2.6 kernel tree.

Given a fully automated test system, we can run a broad spectrum of tests with high frequency, and find problems soon after they are introduced; this means that the issue is still fresh in the developer's mind, and the offending patch is much more easily removed (not buried under thousands of dependent changes). This paper will present an overview of the current early testing publication system used on the http://test.kernel.org website. We then use our experiences with that system to define requirements for a second generation fully automated testing system.

Such a system will allow us to compile hundreds of different configuration files on every release, cross-compiling for multiple different architectures. We can also identify performance regressions and trends, adding statistical analysis. A broad spectrum of tests is necessary—boot testing, regression, function, performance, and stress testing; from disk intensive to compute intensive to network intensive loads. A fully automated test harness also empowers other techniques that are impractical when testing manually, in order to make debugging and problem identification easier. These include automated binary chop search amongst thousands of patches to weed out dysfunctional changes.

In order to run all of these tests, and collate the results from multiple contributors, we need an open-source client test harness to enable sharing of tests. We also need a consistent output format in order to allow the results to be collated, analysed and fed back to the community effectively, and we need the ability to "pass" the reproduction of issues from test harness to the developer. This paper will describe the requirements for such a test client, and the new open-source test harness, Autotest, that we believe will address these requirements.

1 Introduction

It is critical for any project to maintain a high level of software quality, and consistent interfaces to other software that it uses or that uses it.


There are several methods for increasing quality, but none of these works in isolation; we need a combination of:

• skilled developers carefully developing high quality code,

• static code analysis,

• regular and rigorous code review,

• functional tests for new features,

• regression testing,

• performance testing, and

• stress testing.

Whilst testing will never catch all bugs, it will improve the overall quality of the finished product. Improved code quality results in a better experience not only for users, but also for developers, allowing them to focus on their own code. Even simple compile errors hinder developers.

In this paper we will look at the problem of automated testing, the current state of it, and our views for its future. Then we will take a case study of the test.kernel.org automated test system. We will examine a key test component, the client harness, in more detail, and describe the Autotest test harness project. Finally we will conclude with our vision of the future and a summary.

2 Automated Testing

It is obvious that testing is critical; what is perhaps not so obvious is the utility of regular testing at all stages of development. It is important to catch bugs as soon as possible after they are created, as:

• it prevents replication of the bad code into other code bases,

• fewer users are exposed to the bug,

• the code is still fresh in the author's mind,

• the change is less likely to interact with subsequent changes, and

• the code is easy to remove should that be required.

In a perfect world all contributions would be widely tested before being applied; however, as most developers do not have access to a large range of hardware this is impractical. More reasonably, we want to ensure that any code change is tested before being introduced into the mainline tree, and fixed or removed before most people will ever see it. In the case of Linux, Andrew Morton's -mm tree (the de facto development tree) and other subsystem specific trees are good testing grounds for this purpose.

Test early, test often!

The open source development model and Linux in particular introduce some particular challenges. Open-source projects generally suffer from the lack of a mandate to test submissions and the fact that there is no easy funding model for regular testing. Linux is particularly hard hit as it has a constantly high rate of change, compounded with the staggering diversity of the hardware on which it runs. It is completely infeasible to do this kind of testing without extensive automation.

There is hope; machine-power is significantly cheaper than man-power in the general case. Given a large quantity of testers with diverse hardware it should be possible to cover a useful subset of the possible combinations. Linux as a project has plenty of people and hardware; what is needed is a framework to coordinate this effort.

Page 115: Proceedings of the Linux Symposium Volume One · 2006-07-19 · Eric W. Biederman Fully Automated Testing of the Linux Kernel 113 ... Kristen Carlson Accardi Open Source Technology

2006 Linux Symposium, Volume One • 115

[Figure 1: Linux Kernel Change Flow — patches flow from many contributors into maintainer test trees, are concentrated into the mainline prerelease and -mm development trees and then into mainline, and finally fan back out through the distros to users.]

2.1 The Testing Problem

As we can see from the diagram in figure 1, Linux's development model forms an hourglass, starting highly distributed, with contributions being concentrated in maintainer trees before merging into the development releases (the -mm tree) and then into mainline itself. It is vital to catch problems here in the neck of the hourglass, before they spread out to the distros—even once a contribution hits mainline it has not yet reached the general user population, most of whom are running distro kernels which often lag mainline by many months.

In the Linux development model, each actual change is usually small and attribution for each change is known, making it easy to track the author once a problem is identified. It is clear that the earlier in the process we can identify there is a problem, the less impact the change will have, and the more targeted we can be in reporting and fixing the problem.

Whilst contributing untested code is discouraged, we cannot expect lone developers to be able to do much more than basic functional testing; they are unlikely to have access to a wide range of systems. As a result, there is an opportunity for others to run a variety of tests on incoming changes before they are widely distributed. Where problems are identified and flagged, the community has been effective at getting the change rejected or corrected.

By making it easier to test code, we can encourage developers to run the tests before ever submitting the patch; currently such early testing is often not extensive or rigorous, where it is performed at all. Much developer effort is being wasted on bugs that are found later in the cycle, when it is significantly less efficient to fix them.

2.2 The State of the Union

It is clear that a significant amount of testing resource is being applied by a variety of parties; however, most of the current testing effort goes on after the code has forked from mainline. The distribution vendors test the code that they integrate into their releases, and hardware vendors test alpha or beta releases of those distros with their hardware. Independent Software Vendors (ISVs) are often even later in the cycle, first testing beta or even post-release distros. Whilst integration testing is always valuable, this is far too late to be doing primary testing, and makes it extremely difficult and inefficient to fix problems that are found. Moreover, neither the tests that are run, nor the results of this testing, are easily shared and communicated to the wider community.

There is currently a large delay between a mainline kernel releasing and that kernel being accepted and released by the distros, embedded product companies and other derivatives of Linux. If we can improve the code quality of the mainline tree by putting more effort into testing mainline earlier, it seems reasonable to assume that those "customers" of Linux would update from the mainline tree more often. This will result in less time being wasted porting changes backwards and forwards between releases, and a more efficient and tightly integrated Linux community.

2.3 What Should we be Doing?

Linux's constant evolutionary approach to software development fits well with a wide-ranging, high-frequency regression testing regime. The "release early, release often" development philosophy provides us with a constant stream of test candidates; for example the -git snapshots which are produced twice daily, and Andrew Morton's collecting of the specialised maintainer trees into a bleeding-edge -mm development tree.

In an ideal world we would be regression testing at least daily snapshots of all development trees, the -mm tree and mainline on all possible combinations of hardware, feeding the results back to the owners of the trees and the authors of the changes. This would enable problems to be identified as early as possible in the concentration process and get the offending change updated or rejected. The test.kernel.org testing project provides a preview of what is possible, providing some limited testing of the mainline and development trees, and is discussed more fully later.

Just running the tests is not sufficient; all this does is produce large swaths of data for humans to wade through. We need to analyse the results to engender meaning, and isolate any problems identified.

Regression tests are relatively easy to analyse; they generate a clean pass or fail. However, even these can fail intermittently. Performance tests are harder to analyse: a result of 10 has no particular meaning without a baseline to compare it against. Moreover, performance tests are not 100% consistent, so taking a single sample is not sufficient; we need to capture a number of runs and do simple statistical analysis on the results in order to determine if any differences are statistically significant or not. It is also critically important to try to distinguish failures of the machine or harness from failures of the code under test.

3 Case Study: test.kernel.org

We have tried to take the first steps towards the automated testing goals we have outlined above with the testing system that generates the test.kernel.org website. Whilst it is still far from what we would like to achieve, it is a good example of what can be produced utilising time on an existing in-house testing harness and a shared results repository.

New kernel releases are picked up automatically within a few minutes of release, and a predefined set of tests are run across them by a proprietary IBM® system called ABAT, which includes a client harness called autobench. The results of these tests are then collated, and pushed to the TKO server, where they are analysed and the results published on the TKO website.

Whilst all of the test code is not currently open, the results of the testing are, which provides a valuable service to the community, indicating (at least at a gross level) a feel for the viability of that release across a range of existing machines, and the identification of some specific problems. Feedback is in the order of hours from release to results publication.


[Figure 2: test.kernel.org Architecture — a new release passes through the mirror / trigger engine into the server job queues and on to the client harness; results flow through results collation and results analysis to results publication, with observed problems and patches feeding back in as manual jobs.]

3.1 How it Works

The TKO system is architected as shown in figure 2. It is made up of a number of distinct parts, each described below.

The mirror / trigger engine: test execution is keyed from kernel releases; by any -mm tree release (2.6.16-rc1-mm1), git release (2.6.17-rc1-git10), release candidate (2.6.17-rc1), stable release (2.6.16) or stable patch release (2.6.16.1). A simple rsync local mirror is leveraged to obtain these images as soon as they are available. At the completion of the mirroring process any newly downloaded image is identified, and those which represent new kernels trigger testing of that image.

Server Job Queues: for each new kernel, a predefined set of test jobs are created in the server job queues. These are interspersed with other user jobs, and are run when time is available on the test machines. IBM's ABAT server software currently fulfils this function, but a simple queueing system could serve the needs of this project.

Client Harness: when the test system is available, the control file for that test is passed to the client harness. This is responsible for setting up the machine with the appropriate kernel version, running the tests, and pushing the results to a local repository. Currently this function is served by autobench. It is here that our efforts are currently focused, with the Autotest client replacement project which we discuss in detail in section 4.4.

Results Collation: results from relevant jobs are gathered asynchronously as the tests complete and are pushed out to test.kernel.org. A reasonably sized subset of the result data is pushed; mostly this involves stripping the kernel binaries and system information dumps.

Results Analysis: once uploaded, the results analysis engine runs over all existing jobs and extracts the relevant status; this is then summarised on a per-release basis to produce an overall red, amber or green status for each release/machine combination. Performance data is also analysed, in order to produce historical performance graphs for a selection of benchmarks.

Results Publication: results are made available automatically on the TKO web site. However, this is currently a "polled" model; no automatic action is taken in the face of either test failures or detected performance regressions, and it relies on developers to monitor the site. These failures should be actively pushed back to the community via an appropriate publication mechanism (such as email, with links back to more detailed data).

Observed problems: when a problem (functional or performance) is observed by a developer monitoring the analysed and published results, this is manually communicated back to the development community (normally via email). This often results in additional patches to test, which can be manually injected into the job queues via a simple script, but currently only by an IBM engineer. These then automatically flow through with the regular releases, right through to publication on the matrix and performance graphs, allowing comparison with those releases.

3.2 TKO in Action

The regular compile and boot testing frequently shakes out bugs as the patch that carried them enters the -mm tree. By testing multiple architectures, physical configurations, and kernel configurations we often catch untested combinations and are able to report them to the patch author. Most often these are compile failures or boot failures, but several performance regressions have also been identified.

As a direct example, recently the performance of highly parallel workloads dropped off significantly on some types of systems, specifically with the -mm tree. This was clearly indicated by a drop off in the kernbench performance figures. In the graph in figure 3 we can see the sudden increase in elapsed time to a new plateau with 2.6.14-rc2-mm1. Note the vertical error bars for each data point—doing multiple test runs inside the same job allows us to calculate error margins, and clearly display them.

Once the problem was identified, some further analysis narrowed the bug down to a small number of scheduler patches, which were then also tested; these appear as the blue line ("other" releases) in the graph. Once the regression was identified the patch owner was contacted, and several iterations of updated fixes were produced and tested before a corrected patch was applied. This can be seen in the figures for 2.6.16-rc1-mm4.

The key thing to note here is that the regression never made it to the mainline kernel, let alone into a released distro kernel; user exposure was prevented. Early testing ensured that the developer was still available and retained context on the change.

3.3 Summary

The current system is providing regular and useful testing feedback on new releases and providing ongoing trend analysis against historical releases. It is providing the results of this testing in a public framework available to all developers with a reasonable turn-round time from release. It is also helping developers by testing on rarer hardware combinations to which they have no access and cannot test.

However, the system is not without its problems. The underlying tests are run on an in-house testing framework (ABAT) which is currently not in the public domain; this prevents easy transport of these tests to other testers. As a result there is only one contributor to the result set at this time, IBM. Whilst the whole stack needs to be made open, we explain in the next section why we have chosen to start first with the client test harness.

The tests themselves are very limited, covering a subset of the kernel. They are run on a small number of machines, each with a few fixed configurations. There are more tests which should be run, but lack of developer input and lack of hardware resources on which to test prevent significant expansion.

The results analysis also does not communicate data back as effectively as it could to the community—problems (especially performance regressions) are not as clearly isolated as they could be, and notification is not as prompt and clear as it could be. More data "folding" needs to be done as we analyse across a multi-dimensional space of kernel version, kernel configuration, machine type, toolchain, and tests.

[Figure 3: Kernbench Scheduler Regression — elapsed time in seconds plotted against kernel release for mainline, -mm and other kernels, showing the jump at 2.6.14-rc2-mm1 and the recovery at 2.6.16-rc1-mm4.]

4 Client Harnesses

As we have seen, any system which will provide the required level of testing needs to form a highly distributed system, and be able to run across a large test system base. This will necessitate a highly flexible client test harness, a key component of such a system. We have used our experiences with the IBM autobench client and the TKO analysis system to define requirements for such a client. This section will discuss client harnesses in general and lead on to a discussion of the Autotest project's new test harness.

We chose to attack the problem of the client harness first as it seems to be the most pressing issue. With this solved, we can share not only results, but the tests themselves, more easily, and empower a wide range of individuals and corporations to run tests easily and share the results. By defining a consistent results format, we can enable automated collation and analysis of huge amounts of data.

4.1 Requirements / Design Goals

A viable client harness must be operable standalone or under an external scheduler infrastructure. Corporations already have significant resources invested in bespoke testing harnesses which they are not going to be willing to waste; the client needs to be able to plug into those, and timeshare resources with them. On the other hand, some testers and developers will have a single machine and want something simple they can install and use. This bimodal flexibility is particularly relevant where we want to be able to pass a failing test back to a patch author, and have them reproduce the problem.

The client harness must be modular, with a clean internal infrastructure with simple, well defined APIs. It is critical that there is clear separation between tests, and between tests and the core, such that adding a new test cannot break existing tests.

The client must be simple to use for newcomers, and yet provide a powerful syntax for complex testing if necessary. Tests across multiple machines, rebooting, loops, and parallelism all need to be supported.

We want distributed, scalable maintainership, the core being maintained by a core team and the tests by the contributors. It must be able to reuse the effort that has gone into developing existing tests, by providing a simple way to encapsulate them. Whilst open tests are obviously superior, we also need to allow the running of proprietary tests which cannot be contributed to the central repository.

There must be a low knowledge barrier to entry for development, in order to encourage a wide variety of new developers to start contributing. In particular, we desire it to be easy to write new tests and profilers, abstracting the complexity into the core as much as possible.

We require a high level of maintainability. We want a consistent language throughout, one which is powerful and yet easy to understand when returning to the code later, not only by the author, but also by other developers.

The client must be robust, and produce consistent results. Error handling is critical—tests that do not produce reliable results are useless. Developers will never add sufficient error checking into scripts; we must have a system which fails on any error unless you take affirmative action. Where possible it should isolate hardware or harness failures from failures of the code under test; if something goes wrong in initialisation or during a test we need to know, and reject that test result.

Finally, we want a consistent results architecture—it is no use to run thousands of tests if we cannot understand or parse the results. On such a scale the analysis must be fully automatable. Any results structure needs to be consistent across tests and across machines, even if the tests are being run by a wide diversity of testers.

4.2 What Tests are Needed?

As we mentioned previously, the current published automated testing is very limited in its scope. We need very broad testing coverage if we are going to catch a high proportion of problems before they reach the user population, and need those tests to be freely sharable to maximise test coverage.

Most of the current testing is performed in order to verify that the machine and OS stack is fit for a particular workload. The real workload is often difficult to set up, may require proprietary software, is overly complex, and does not give sufficiently consistent reproducible results, so use is made of a simplified simulation of that workload encapsulated within a test. This has the advantage of allowing these simulated workloads to be shared. We need tests in all of the areas below.

Build tests simply check that the kernel will build. Given the massive diversity of different architectures to build for, different configuration options to build for, and different toolchains to build with, this is an extensive problem. We need to check for warnings, as well as errors.


Static verification tests run static analysis across the code with tools like sparse, lint, and the Stanford checker, in the hope of finding bugs in the code without having to actually execute it.

Inbuilt debugging options (e.g. CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_SLAB) and fault insertion routines (e.g. fail every 100th memory allocation, fake a disk error occasionally) offer the opportunity to allow the kernel to test itself. These need to be a separate set of test runs from the normal functional and performance tests, though they may reuse the same tests.

Functional or unit tests are designed to exercise one specific piece of functionality. They are used to test that piece in isolation to ensure it meets some specification for its expected operation. Examples of this kind of test include LTP and Crashme.

Performance tests verify the relative performance of a particular workload on a specific system. They are used to produce comparisons between tests to either identify performance changes, or confirm none is present. Examples of these include: CPU performance with kernbench and AIM7/reaim; disk performance with bonnie, tbench and iobench; and network performance with netperf.

Stress tests are used to identify system behaviour when pushed to the very limits of its capabilities. For example, a kernel compile executed completely in parallel creates a compile process for each file. Examples of this kind of test include kernbench (configured appropriately), and deliberately running under heavy memory pressure, such as running with a small physical memory.

Profiling and debugging is another key area. If we can identify a performance regression, or some types of functional regression, it is important for us to be able to gather data about what the system was doing at the time in order to diagnose it. Profilers range from statistical tools like readprofile and lockmeter to monitoring tools like vmstat and sar. Debug tools might range from dumping out small pieces of information to full blown crashdumps.

4.3 Existing Client Harnesses

There are a number of pre-existing test harnesses in use by testers in the community. Each has its features and problems; we touch on a few of them below.

IBM autobench is a fairly fully featured client harness, completely written in a combination of shell and perl. It has support for tests containing kernel builds and system boots. However, error handling is very complex and must be explicitly added in all cases, though it does encapsulate the success or failure state of the test. The use of multiple different languages may have been very efficient for the original author, but greatly increases the maintenance overheads. Whilst it does support running multiple tests in parallel, loops within the job control file are not supported, nor is any complex "programming."

OSDL STP The Open Source Development Labs (OSDL) has the Scalable Test Platform (STP). This is a fully integrated testing environment with both a server harness and client wrapper. The client wrapper here is very simple, consisting of a number of shell support functions. Support for reboot is minimal and kernel installation is not part of the client. There is no inbuilt handling of the meaning of results. Error checking is down to the test writer; as this is shell it needs to be explicit, else no checking is performed. It can operate in isolation and results are emailable; reboot support is currently being added.

Page 122: Proceedings of the Linux Symposium Volume One · 2006-07-19 · Eric W. Biederman Fully Automated Testing of the Linux Kernel 113 ... Kristen Carlson Accardi Open Source Technology

122 • Fully Automated Testing of the Linux Kernel

LTP The Linux Test Project (http://ltp.sourceforge.net/) is a functional / regression test suite. It contains approximately 2900 small regression tests which are applied to the system running LTP. There is no support for building kernels or booting them, performance testing or profiling. Whilst it contains a lot of useful tests, it is not a general heavyweight testing client.

A number of other testing environments currently exist; most appear to suffer from the same basic issues. They evolved from the simplest possible interface (a script) into a test suite; they were not designed to meet the level of requirements we have identified and specified.

All of those we have reviewed seem to have a number of key failings. Firstly, most lack bottom-up error handling; where support exists it must be handled explicitly, and testers never will think of everything. Secondly, most lack consistent machine-parsable results. There is often no consistent way to tell if a test passes, let alone get any details from it. Lastly, due to their evolved nature they are not easy to understand nor to maintain. Fortunately it should be reasonably easy to wrap tests such as LTP, or to port tests from STP and autobench.

4.4 Autotest a Powerful Open Client

The Autotest open client is an attempt to address the issues we have identified. The aim is to produce a client which is open source, implicitly handles errors, produces consistent results, is easily installable, simple to maintain and runs either standalone or within any server harness.

Autotest is an all-new client harness implementation. It is completely written in Python, chosen for a number of reasons: it has a simple, clean and consistent syntax, it is object oriented from inception, and it has very powerful error and exception handling. Whilst no language is perfect, it meets the key design goals well, and it is open source and widely supported.

As we have already indicated, there are a number of existing client harnesses; some are even open-source and therefore a possible basis for a new client. Starting from scratch is a bold step, but we believe that the benefits from a designed approach outweigh the effort required initially to get to a workable position. Moreover, much of the existing collection of tests can easily be imported or wrapped.

Another key goal is the portability of the tests and the results; we want to be able to run tests anywhere and to contribute those test results back. The use of a common programming language, one with a strict syntax and semantics, should make the harness and its contained tests very portable. Good design of the harness and results specifications should help to maintain portable results.

4.5 The autotest Test Harness

Autotest utilises an executable control file to represent and drive the user's job. This control file is an executable fragment of Python and may contain any valid Python constructs, allowing the simple representation of loops and conditionals. Surrounding this control file is the Autotest harness, which is a set of support functions and classes to simplify execution of tests and allow control over the job.

The key component is the job object, which represents the executing job, provides access to the test environment, and provides the framework to the job. It is responsible for the creation of the results directory, for ensuring the job output is recorded, and for any interactions with any server harness. Below is a trivial example of a control file:

job.runtest(’test1’, ’kernbench’, 2, 5)

One key benefit of the use of a real programming language is the ability to use the full range of its control structures; in the example below we use an iterator:

for i in range(0, 5):
    job.runtest('test%d' % i, 'kernbench', 2, 5)

Obviously, as we are interested in testing Linux, support for building, installing, and booting kernels is key. When using this feature, we need a little added complexity to cope with the interruption to control flow caused by the system reboot. This is handled using a phase stepper which maintains flow across execution interruptions; below is an example of such a job, combining booting with iteration:

def step_init():
    step_test(1)

def step_test(iteration):
    if (iteration < 5):
        job.next_step([step_test, iteration + 1])

    print "boot: %d" % iteration

    kernel = job.distro_kernel()
    kernel.boot()

Tests are represented by the test object; each test added to Autotest will be a subclass of this. This allows all tests to share behaviour, such as creating a consistent location and layout for the results, and recording the result of the test in a computer-readable form. Figure 4 shows the class definition for the kernbench benchmark. As we can see, it is a subclass of test, and as such benefits from its management of the results directory hierarchy.

4.6 Summary

We feel that Autotest is a much more powerful and robust design than the other client harnesses available, and will produce more consistent results. Adding tests and profilers is simple, with a low barrier to entry, and they are easy to understand and maintain.

Much of the power and flexibility of Autotest stems from the decision to have a user-defined control file, and for that file to be written in a powerful scripting language. Whilst this was more difficult to implement, the interface the user sees is still simple. If the user wishes to repeat tests, run tests in parallel for stress, or even write a bisection search for a problem inside the control file, that is easy to do.

The Autotest client can be used either standalone, or easily linked into any scheduling backend, from a simple queueing system to a huge corporate scheduling and allocation engine. This allows us to leverage the resources of larger players, and yet easily allows individual developers to reproduce and debug problems that were found in the lab of a large corporation.

Each test is a self-contained modular package. Users are strongly encouraged to create open-source tests (or wrap existing tests) and contribute those to the main test repository on test.kernel.org.2 However, private tests and repositories are also allowed, for maximum flexibility. The modularity of the tests means that different maintainers can own and maintain each test, separate from the core harness. We feel this is critical to the flexibility and scalability of the project.

We currently plan to support the Autotest client across the range of architectures and across the main distros. There are no plans to support other operating systems, as it would add unnecessary complexity to the project. The Autotest project is released under the GNU Public License.

2See the autotest wiki http://test.kernel.org/autotest.

import test
from autotest_utils import *

class kernbench(test):

    def setup(self,
              iterations = 1,
              threads = 2 * count_cpus(),
              kernelver = '/usr/local/src/linux-2.6.14.tar.bz2',
              config = os.environ['AUTODIRBIN'] + "/tests/kernbench/config"):

        print "kernbench -j %d -i %d -c %s -k %s" % (threads, iterations, config, kernelver)

        self.iterations = iterations
        self.threads = threads
        self.kernelver = kernelver
        self.config = config

        top_dir = job.tmpdir + '/kernbench'
        kernel = job.kernel(top_dir, kernelver)
        kernel.config([config])

    def execute(self):
        testkernel.build_timed(threads)        # warmup run
        for i in range(1, iterations + 1):
            testkernel.build_timed(threads, '../log/time.%d' % i)

        os.chdir(top_dir + '/log')
        system("grep elapsed time.* > time")

Figure 4: Example test: kernbench

5 Future

We need a broader spectrum of tests added to the Autotest project. Whilst the initial goal is to replace autobench for the published data on test.kernel.org, this is only a first step; there is a much wider range of tests that could and should be run. There is a wide body of tests already available that could be wrapped and corralled under the Autotest client.
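To make the idea of wrapping concrete, here is a hypothetical sketch of a test that wraps an existing, pre-installed benchmark binary. It is modelled on the kernbench example in Figure 4; the binary path, class name, and use of helpers such as system() and job.tmpdir are illustrative assumptions, not the definitive Autotest API.

import test
from autotest_utils import *

class wrapped_bench(test):
    # Hypothetical wrapper around an existing benchmark binary.
    # Subclassing test gives us a per-test results directory for free.

    def setup(self, bench_cmd = '/usr/local/bin/somebench --quick'):
        # Remember how to invoke the wrapped benchmark (path is hypothetical).
        self.bench_cmd = bench_cmd

    def execute(self, iterations = 3):
        # Run the wrapped benchmark several times, capturing its output
        # under the job's temporary directory so it can be parsed later.
        log_dir = job.tmpdir + '/wrapped_bench'
        system('mkdir -p ' + log_dir)
        for i in range(1, iterations + 1):
            system('%s > %s/run.%d 2>&1' % (self.bench_cmd, log_dir, i))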

We need to encourage multiple different entities to contribute and share testing data for maximum effect. This has been stalled waiting on the Autotest project, which is now nearing release, so that we can have a consistent data format to share and analyse. There will be problems to tackle with the quality and consistency of data that comes from a wide range of sources.

Better analysis of the test results is needed. Whilst the simple red/yellow/green grid on test.kernel.org and simple gnuplot graphs are surprisingly effective for so little effort, much more could be done. As we run more tests, it will become increasingly important to summarise and fold the data in different ways in order to make it digestible and useful.

Testing cannot be an island unto itself: not only must we identify problems, we must communicate those problems effectively and efficiently back to the development community, provide them with more information upon request, and be able to help test attempted fixes.


We must also track issues identified to closure.

There is great potential to automate beyond just identifying a problem. An intelligent automation system should be able to narrow the problem down to an individual patch (by bisection search, for example, which takes O(log2 n) steps in the number of patches). It could drill down into a problem by running more detailed sets of performance tests, or repeating a failed test several times to see whether a failure was intermittent or consistent. Tests could be selected automatically based on the area of code the patch touches, correlated with known code coverage data for particular tests.
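As an illustration of that bisection idea (a sketch, not code from either project), a control file could narrow a failure down to a single patch along these lines; run_and_check() is a hypothetical helper that applies the first n patches, builds and boots the result, and reports whether the test passed.

def bisect_patches(patches, run_and_check):
    # Find the first patch that breaks the test, assuming the series
    # passes with no patches applied and fails with all of them applied.
    good, bad = 0, len(patches)          # invariant: good passes, bad fails
    while bad - good > 1:
        mid = (good + bad) // 2
        if run_and_check(patches[:mid]): # hypothetical: apply, build, boot, test
            good = mid
        else:
            bad = mid
    return patches[bad - 1]              # the offending patch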

6 Summary

We are both kernel developers who started the test.kernel.org and Autotest projects out of frustration with the current tools available for testing, and for fully automated testing in particular. We are now seeing a wider range of individuals and corporations showing interest in both the test.kernel.org and Autotest projects, and have high hopes for their future.

In short we need:

• more automated testing, run at frequent intervals,

• those results need to be published consistently and cohesively,

• to analyse the results carefully,

• better tests, and to share them, and

• a powerful, open source test harness that is easy to add tests to.

There are several important areas where interested people can help contribute to the project:

• run a diversity of tests across a broad range of hardware,

• contribute those results back to test.kernel.org,

• write new tests and profilers, contribute those back, and

• for the kernel developers ... fix the bugs!!!

An intelligent system can not only improve code quality, but also free developers to do more creative work.

Acknowledgements

We would like to thank OSU for the donation of the server and disk space which supports the test.kernel.org site.

We would like to thank Mel Gorman for his input to and review of drafts of this paper.

Legal Statement

This work represents the view of the authors and does not necessarily represent the views of either Google or IBM.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, and service names may be the trademarks or service marks of others.


Linux Laptop Battery Life
Measurement Tools, Techniques, and Results

Len Brown Konstantin A. Karasyov

Vladimir P. Lebedev Alexey Y. Starikovskiy

Intel Open Source Technology Center

{len.brown,konstantin.a.karasyov}@intel.com

{vladimir.lebedev,alexey.y.starikovskiy}@intel.com

Randy P. Stanley
Intel Mobile Platforms Group
[email protected]

Abstract

Battery life is a valuable metric for improving Linux laptop power management.

Battery life measurements require repeatable workloads. While BAPCo® MobileMark® 2005 is widely used in the industry, it runs only on Windows® XP. Intel's Open Source Technology Center has developed a battery life measurement tool-kit for making reliable battery life measurements on Linux® with no special lab equipment necessary.

This paper describes this Linux battery life measurement tool-kit, and some of the techniques for measuring Linux laptop power consumption and battery life.

This paper also includes example measurement results showing how selected system configurations differ.

1 Introduction

First we examine common industry practice for measuring and reporting laptop battery life.

Next we examine the methods available on ACPI-enabled Linux systems to measure the battery capacity and battery life.

Then we describe the implementation of a battery life measurement toolkit for Linux.

Finally, we present example measurement results applying this toolkit to high-volume hardware, and suggest some areas for further work.

1.1 State of the Industry

Laptop vendors routinely quote MobileMark® battery measurement results when introducing new systems. The authors believe that this is not only the most widely employed industry measurement, but that MobileMark also reflects best-known industry practice. So we will focus this section on MobileMark and ignore what we consider lesser measurement programs.

1.2 Evolution of MobileMark®

In 1995, BAPCo®, the Business Applications Performance Corporation, introduced a battery-life (BL) workload to support application-based power evaluation. The first incarnation, SYSmark BL, was a Windows® 3.1 based workload which utilized office applications to implement a repeatable workload to produce a "battery run-down" time. Contrary to performance benchmarks, which executed a stream of commands, this workload included delays which were intended to represent real user interaction, much like a player piano represents real tempos. Because the system is required to deplete its own battery, a master system and physical interface were required as well as the slave system under test. In late 1996 the workload was re-written to support Windows 95 and adapted to 32-bit applications.

When Windows® 98 introduced ACPI support, BAPCo overhauled the workload to shed the cumbersome and expensive hardware interface. SYSmark98 BL became the first software-only BL workload. (No small feat, as the system was now required to resurrect itself and report BL without adding additional overhead.) Additionally, a more advanced user delay model was introduced and an attempt was made to understand the power/performance trade-off within mobile systems by citing the number of loops completed during the life of the battery. Although well intended, this qualification provided only gross-level insight into the power/performance balance of mobile systems.

In 2002, BAPCo released MobileMark 2002 [MM02], which modernized the workload and adopted a response-based performance qualifier that provided considerably more insight into the power/performance balance attained by modern power management schemes. Additionally, they attempted to define a more level playing field by providing a more rigorous set of system setting requirements and recommendations, and by strongly recommending a light meter to calibrate the LCD panel brightness setting to a common value. They also introduced a "Reader" module to complement the Office productivity module. Reader provided a time metric for an optimal BL usage model, to define a realistic upper bound while executing a real and practical use.

In 2005, BAPCo's MobileMark 2005 [MM05] added to the MobileMark 2002 BL "suite" by introducing new DVD and Wireless browsing modules, as well as making slight changes to increase robustness and hold the work/time constant for all machines. Today these modules help us to better understand the system balance of power and performance. Multiple results also form a contour of solutions reflective of the respective user and usage models.

1.3 Learning from MobileMark® 2005

While MobileMark is not available for Linux, it illustrates some of the best industry practices for real-use power analysis that Linux measurements should also employ.

1.3.1 Multiple Workloads

Mobile systems are subject to different user1 and usage models,2 each with its own battery life considerations. To independently measure different usage models, MobileMark 2005 provides 4 workloads:

1Different users type, think, and operate the system differently.
2Usage models refers to application choices and content.

1. Office productivity 2002SE

   This workload is the second edition of MobileMark 2002 Office productivity. Various office productivity tools are used to open and modify office documents.

   Think time is injected between various operations to reflect that real users need to look at the screen and react before issuing additional input.

   The response time of selected operations is recorded (not including delays) to be able to qualify the battery life results and differentiate the performance level available while attaining that battery life.

2. Reader 2002SE

   This workload is a second edition of MobileMark 2002 Reader. Here, a web browser reads a book from local files, opening a new page every 2 minutes. This workload is almost completely idle time, and can be considered an upper bound, which no "realistic" activity can possibly exceed.

3. DVD Playback 2005

   InterVideo® WinDVD® plays a reference DVD movie repeatedly until the battery dies. WinDVD monitors the standard frame rate, so that the harness can abort the test if the work level is not sustained. In practice, modern machines have ample capacity to play DVDs, and frames are rarely dropped.

4. Wireless browsing 2005

   Here the system under test loads a web page every 15 seconds until the battery dies. The web pages are an average of 150 KB. This workload is not specific to wireless networks, however, and in theory could be run over wired connections.

1.3.2 Condition the Battery

In line with manufacturers' recommendations, BAPCo documentation recommends conditioning the battery before measurement. This entails simply running the battery from full charge until full discharge at least once.

For popular laptop batteries today, conditioning tends to minimize memory effects, extend the battery life, and increase the consistency of measurements.

MobileMark recommends conditioning the battery before taking measurements.

1.3.3 Run the Battery until Fully Discharged

Although conditioning tends to improve the accuracy of the internal battery capacity instrumentation, this information is not universally accurate or reliable before or after conditioning.

MobileMark does not trust the battery instrumentation, and disables the battery low-capacity warnings. It measures battery life by running on battery power until the battery is fully discharged and the system crashes.

1.3.4 Qualify Battery Life with Performance

In addition to the battery life (in minutes), MobileMark Office productivity results always report response time.


This makes it easy to tell the difference between a battery life result for a low-performance system and a similar result for a high-performance system that employs superior power management.

There is no performance component reported for the other workloads, however, as the user experience for those workloads is relatively insensitive to performance.

1.3.5 Constant Work/Time

The MobileMark Office productivity workload was calibrated to a minimal machine that completed one workload iteration in about 90 minutes. If a faster machine completes the workload iteration in less time, the system idles until the next activity cycle starts at 90 minutes.

2 Measurement Methods

Here we take a closer look at the methods available to observe and measure battery life in a Linux context.

2.1 Using an AC Watt Meter

Consumer-grade Watt Meters with a resolution of 0.1 Watt and a 1-second sampling rate are available for about 100 U.S. Dollars.3 While intended to tell you the cost of operating your old refrigerator, they can just as easily tell you the A/C draw for a computer.

It is important to exclude the load of battery charging from this measurement. This can be done by measuring only when the battery is fully charged, or, for laptops that allow it, by running on A/C with the battery physically removed.

3Watt's Up Pro: https://www.doubleed.com

You'll be able to see the difference between such steady-state operations as LCD on vs. off, LCD brightness, C-states, and P-states. However, it will be very difficult to observe transient behavior with the low sampling rate.

Unfortunately, the A/C power will include the loss in the AC-to-DC power supply "brick." While an "External Power Adapter" sporting an Energy Star logo4 rated at 20 Watts or greater will be more than 75% efficient, others will not meet that criterion, and that can significantly distort your measurement results.

So while this method is useful for some types of comparisons, it isn't ideal for predicting battery life. This is because most laptops behave differently when running on DC battery vs. running on AC. For example, it is extremely common for laptops to enable deep C-states only on DC power and to disable them on AC power.

2.2 Using a DC Watt Meter on the DC Converter

It is possible to modify the power adapter by inserting a precise high-wattage, low-ohm resistor in series on the DC rail and measuring the voltage drop over this resistor to calculate the current, and thus Watts.

This measurement is on the DC side of the converter, and thus avoids the inaccuracy from AC-DC conversion above. But this method suffers the same basic flaw as the AC meter method above: the laptop is running in AC mode, and that is simply different from DC mode.

4http://www.energystar.gov


2.3 Replacing the Battery with a DC Power Supply

The next most interesting method to measure battery consumption on a laptop is to pull apart the battery and connect it to a lab-bench DC power supply.

This addresses the issue of the laptop running in DC mode. However, few reading this paper will have the means to set up this supply, or the willingness to destroy their laptop battery.

However, for those with access to this type of test setup, including a high-speed data logger, DC consumption rates can be had in real time, with never a wait for battery charging.

Further, it is possible that system designers may choose to make the system run differently depending on battery capacity. For example, high-power P-states may be disabled when on low battery power; but these enhancements would be disabled when running on a DC power supply that emulates a fully charged battery.

2.4 Using a DC Watt Meter on an Instrumented Battery

Finally, it is possible to instrument the output of the battery itself. Like the DC power supply method above, this avoids the issues with the AC wattmeter and the instrumented power converter methods, in that the system is really running on DC. Further, this allows the system to adapt as the battery drains, just as it would in real use. But again, most people who want to measure power have neither a data logger nor a soldering iron available.

2.5 Using Built-in Battery Instrumentation

Almost all laptops come with built-in battery instrumentation from which the OS can read capacity, calculate drain and charge rates, and receive capacity alarms.

On Linux, /proc/acpi/battery/*/info and state will tell you about your battery and its current state, including drain rate.

Sometimes the battery drain data will give a good idea of average power consumption, but oftentimes this data is misleading.

One way to find out if your drain rate is accurate is to plot the battery capacity from fully charged until depleted. If the system is running a constant workload, such as idle, then the instrumentation should report a full capacity equal to the design capacity of the battery at the start, it should report 0 capacity just as the lights go out, and it should report a straight line in between. In practice, only new, properly conditioned batteries do this. Old batteries and batteries that have not been conditioned tend to supply very poor capacity data.
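As a minimal sketch of how such a plot can be gathered with no special equipment, the following script polls the ACPI battery instrumentation once a minute and prints capacity and drain rate for later plotting. It assumes the common 'remaining capacity' and 'present rate' fields of /proc/acpi/battery/*/state; the units (mWh/mW vs. mAh/mA) vary between batteries.

import glob, re, time

def read_battery_field(field):
    # Return the integer value of 'field' from the first battery's state file.
    for state in glob.glob('/proc/acpi/battery/*/state'):
        for line in open(state):
            if line.startswith(field):
                m = re.search(r'(\d+)', line)
                if m:
                    return int(m.group(1))
    return None

# Log "seconds capacity rate" once a minute until the battery dies
# (Ctrl-C to stop); plot the first two columns with gnuplot to
# reproduce a capacity-vs-time graph like Figure 1.
start = time.time()
while True:
    cap = read_battery_field('remaining capacity')   # typically mWh
    rate = read_battery_field('present rate')        # typically mW
    print int(time.time() - start), cap, rate
    time.sleep(60)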

Figure 1 shows a system with an old (4.4 AH * 10.4 V) = 47.520 Wh battery. After fully charging the battery, the instrumentation at the start of the 1st run indicates a battery capacity of under 27.000 Wh. If the battery threshold warning had been enabled for that run, the system would have shut down well before 5,000 seconds, even though the battery actually lasted past 7,000 seconds.

The 1st run was effectively conditioning the battery. The 2nd run reported a fully charged capacity of nearly 33.000 Wh. The actual battery life was only slightly longer than the initial conditioning run, but in this case the reported capacity was closer to the truth. The 3rd run started above 38.000 Wh and was linear from there until the battery died after 7,000 seconds. The 4th run showed only marginally more truthful results. Note that a 10% battery warning at 4,752 would actually be useful to a real user after the battery has been conditioned.

Figure 1: System A under-reports capacity until conditioned (reported capacity vs. time for four office runs)

Note also that the slope of all 4 lines is the same. In this case, the rate of discharge shown by the instrumentation appears accurate, even for the initial run.

The battery life may not be longer than the slope suggests; it may be shorter. Figure 2 shows system B suddenly losing power near the end of its conditioning run. However, the 2nd (and subsequent) runs were quite well behaved.

Figure 3 shows system C with a drop-off that is sure to fool the user's low-battery trip points. In this case the initial reported capacity does not change, staying at about 6800 of 71.000 Wh (95%). However, the first run drops off a cliff at about 11,000 seconds. The second and third runs drop at about 13,500. But subsequent runs all drop at about 12,000 seconds. So conditioning the battery didn't make this one behave any better.

Finally, Figure 4 shows system D reporting an initial capacity equal to 100% of its 47.950 Wh design capacity. But upon use, this capacity drops almost immediately to about 37.500 Wh. Even after being conditioned 5 times, the battery followed the same pattern. So either the initial capacity was correct and the drain rate is wrong, or the initial capacity is incorrect and the drain rate is correct. Note that this behavior went away when a new battery was used. A new battery reported 100% initial capacity and 0% final capacity, connected by a straight line.

In summary, the only reliable battery life measurement is a wall clock measurement from full charge until the battery is depleted. Depleted here means ignoring any capacity warnings and running until the lights go out.


Figure 2: System B over-reports capacity until conditioned (reported capacity vs. time for two office runs)

Figure 3: System C over-reports final capacity, conditioning does not help (reported capacity vs. time for six office runs)


Figure 4: System D over-reports initial capacity, conditioning does not help (reported capacity vs. time)

3 Linux Battery Life Toolkit

The Linux Battery Life Toolkit (bltk) consists of a test framework and six example workloads. There are common test techniques that should be followed to assure repeatable results no matter what the workload.

3.1 Toolkit Framework

The toolkit framework is responsible for launching the workload, collecting statistics during the run, and summarizing the results after a test completes.

The framework can launch any arbitrary workload, but currently has knowledge of 6 example workloads: Idle, Reader, Office, DVD Player, SW Developer, and 3D-Gamer.

3.2 Idle Workload

The idle workload simply executes the framework without invoking any programs. Statistics are collected the same way as for the other workloads.

3.3 Web Reader Workload

The web reader workload opens an HTML-formatted version of War and Peace by Leo Tolstoy5 in Firefox® and then sends "next page" keyboard events to the browser every two minutes, simulating interaction with the human reader.

5We followed the lead of BAPCo's MobileMark here on the selection of reading material.


3.4 Open Office Workload

Open Office rev 1.1.4 was chosen for this toolkit because it is stable and freely available. It is intended to be automatically installed by the toolkit to avoid results corruption due to local settings and version differences.

3.4.1 Open Office Activities

Currently, 3 applications from the OpenOffice suite are used for the Office workload: oowriter, oocalc, and oodraw. A set of common operations is applied to these applications to simulate activities typical of office application users.

Using oowriter, the following operations are performed:

• text typing

• text pattern replacement

• file saving

Using oocalc, the following operations are performed:

• creating a spreadsheet;

• editing cell values;

• assigning a math expression to a cell;

• expanding a math expression over a set of cells;

• assigning a set of cells to the math expression;

• file saving.

Using oodraw, the following operations are performed:

• duplicating an image;

• moving an image over the document;

• typing text over the image;

• inserting a spreadsheet;

• file saving.

3.4.2 Open Office User Input

User input consists of actions and delays. Actions are represented by the keystrokes sent to the application window through the X server. This approach makes the application perform the same routines as it does during interaction with a real user.6 Delays are inserted between actions to represent a real user.

The Office workload scenario is not hard-coded, but is scripted using the capabilities shown in Appendix A.

3.4.3 Open Office Performance Scores

A single iteration of the office workload scenario completes in 720 seconds. When a faster machine completes the workload in less than 720 seconds, it is idle until the next iteration starts.

720 seconds = Workload_time + Idle

Workload_time consists of Active_time (the time it takes for the system to start applications and respond to user commands) plus the delays that the workload inserts to model user type-time and think-time.

6The physical input devices, such as the keyboard and mouse, are not used here.


Workload_time = Active_time + Delay_time

So the performance metric for each workload scenario iteration is Active_time, which is calculated by measuring Workload_time and simply subtracting the known Delay_time.

The reported performance score is the average Active_time over all loop iterations, normalized to a reference Active_time so that bigger numbers represent better performance:

Performance_score = 100 * Active_reference / Average_Active_measured
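As a worked illustration of these formulas (with made-up numbers, not measurements from this paper), the score calculation reduces to a few lines:

def performance_score(workload_times, delay_time, active_reference):
    # Active_time for each iteration is the measured Workload_time minus
    # the known, scripted Delay_time.
    active_times = [wt - delay_time for wt in workload_times]
    average_active = sum(active_times) / float(len(active_times))
    # Bigger scores represent better performance.
    return 100.0 * active_reference / average_active

# Example: three iterations of a 720-second cycle with 600 s of scripted delays.
print performance_score([700.0, 690.0, 695.0], 600.0, active_reference = 90.0)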

3.5 DVD Movie Playback Workload

mplayer is invoked to play a DVD movie until the battery dies. Note that mplayer does not report the frame rate to the toolkit framework. For battery life comparisons, equal work/time must be maintained, so it is assumed, but not verified by these tools, that modern systems can play DVD movies at equal frame rates.

3.6 Software Developer Workload

The software developer workload mimics a Linux ACPI kernel developer: it invokes vi to insert a comment string into one of the Linux ACPI header files and then invokes make -jN on a Linux kernel source tree, where N is chosen to be three times the number of processors in the system. This cycle is extended out to 12 minutes with idle time to more closely model constant work/time on different systems.

The Active_time for the developer workload is the time required for the make command to complete, and it is normalized into a performance score the same way as for the Office workload:

Performance_score = 100 * Active_reference / Average_Active_measured
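A minimal sketch of the build step described above might look as follows; the CPU-counting and timing logic follow the description in the text, but the helper names are illustrative and are not bltk's actual implementation.

import os, time

def count_cpus():
    # Count "processor" lines in /proc/cpuinfo (works on x86; other
    # architectures format this file differently).
    return sum(1 for line in open('/proc/cpuinfo')
               if line.startswith('processor')) or 1

def developer_build_once():
    # Active_time is the wall-clock time of the make; the harness would
    # then idle out the remainder of the 12-minute cycle.
    n = 3 * count_cpus()
    start = time.time()
    os.system('make -j%d > build.log 2>&1' % n)
    return time.time() - start

print developer_build_once()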

3.7 3D Gamer Workload

A 3D gamer workload puts an entirely different workload on a laptop, one that is generally more power-hungry than all the workloads above.

However, we found that 3D video support is not universally deployed or enabled in Linux, nor is there a large selection of games that are simultaneously freely available, run on a broad range of platforms, and include a demo mode that reports performance.

glxgears satisfies the criteria of being freely available and universally supported, and it reports performance; however, that performance is not likely to correlate closely with what a real 3D game would see. So we are not satisfied that we have a satisfactory 3D-Gamer metric yet.

In the case of a 3D game workload, a reasonable performance metric to qualify battery life would be based on frames/second:

3D_Performance_Score = FPS_measured / FPS_reference

4 Example Measurement Results

This section includes example battery life measurements to show what a typical user can do on their system without the aid of any special instrumentation.

Unless otherwise specified, the Dell Inspiron™ 6400 shown in Table 1 was used as the example system.

Note that this system is a somewhat arbitrary reference. It has a larger and brighter screen than many available on the market. It arrived with a 53 Wh 6-cell battery, but is also available with an 85 Wh 9-cell battery, which would increase the absolute battery life results by over 50%. But the comparisons here are generally of this system to itself, so these system-specific parameters are equal on both sides.

System          Dell Inspiron 6400
Battery         53 Wh
Processor       Intel Core Duo T2500, 2 GHz, 2MB cache, 667 MHz bus
LCD             15.4" WXGA, min bright
Memory          1GB DDR2, 2 DIMM, 533 MHz
Distribution    Novell SuSE 10.1 BETA
GUI             KDE
Linux           2.6.16 or later
HZ              250
cpufreq         ondemand governor
Battery Alerts  ignored
Screen Saver    disabled
DPMS            disabled
Wired net       disabled
Wireless        disabled

Table 1: Nominal System Under Test

4.1 Idle Workload

4.1.1 Idle: Linux vs. Windows

Comparing Linux7 with Windows8 on the same hardware tells us how Linux measures up to high-volume expectations.

This baseline comparison is done with a pure-idle workload. While trivial, this "workload" is also crucial, because a difference in idle power consumption will have an effect on virtually all other workloads.

7Linux-2.6.16+ as delivered with Novell SuSE 10.1 BETA
8Windows® XP SP2

Figure 5: Idle: Linux vs. Windows (battery life in minutes)

Here the i6400 lasts 288 minutes on Windows, but only 238 minutes on Linux, a 50-minute deficit. One can view this as a percentage, e.g., Linux has 238/288 = 83% of the idle battery life as compared to Windows.

One can also estimate the average power using the fixed 53 Wh battery capacity: (53 Wh * 60 min/hr) / 288 min = 11.0 W for Windows, and (53 Wh * 60 min/hr) / 238 min = 13.4 W for Linux. So here Linux is at a 2.4 W deficit compared to Windows in idle.
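The same back-of-the-envelope conversion can be written down once and reused; the numbers below are the ones from this comparison.

def average_watts(battery_wh, minutes):
    # Average power = energy drained / time, assuming the full battery
    # capacity is consumed over the measured run time.
    return battery_wh * 60.0 / minutes

print round(average_watts(53.0, 288), 1)   # Windows idle: ~11.0 W
print round(average_watts(53.0, 238), 1)   # Linux idle:   ~13.4 W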

4.1.2 Idle: The real cost of the LCD

While the i6400 has a larger-than-average display, the importance of LCD power cannot be overstated, even for systems with smaller displays.

The traditional screen saver that draws pretty pictures on an idle screen is exactly the opposite of what you want for long battery life. The reason isn't that the display takes more power; the reason is that it takes the processor out of the deepest available C-state when there is nothing "useful" to do.

Figure 6: Idle: Effect of LCD (battery life in minutes for LCD off, bright, and dim)

The CPU can be removed from that equation by switching to a screen saver that does not run any programs. Many select the "blank" screen saver on the assumption that it saves power, but it does not. A black LCD actually saves no power vs. a white LCD. This is because the black LCD has the backlight on just as bright as the white LCD, but it is actually using fractionally more energy to block that light with every pixel.

So the way to save LCD power is to dim the backlight so it is no brighter than necessary to read the screen, and/or turn it off completely when you are not looking at the screen. Note that an LCD that is off should appear black in a darkened room. If it is glowing, then the pixels are simply fighting to obscure a backlight that is still on. A screen saver that runs no programs and has DPMS (Display Power Management Signaling) enabled to turn off the display is hugely important to battery life.

On the example system, the 238-minute "dim" idle time drops to 171 minutes for maximum LCD brightness, and increases to 280 minutes for LCD off. Expressed as Watts, dim is 13.4 W, bright is 18.6 W, and off is 11.4 W. So this particular LCD consumes between 2.0 and 7.2 W. Your mileage will vary.

Note that because of its large demands on system power, analysis of the power consumption of the other system components is generally most practical when the LCD is off.

Figure 7: Idle: Effect of USB (battery life in minutes with and without a USB device)

4.1.3 Idle: The real cost of USB

The i6400 has no integrated USB devices. So if you execute lsusb, you'll see nothing until you plug in an external device.

If an (unused) USB 1.0 mouse is connected to the system, battery life drops 12 minutes, to 226 from 238. This corresponds to (14.1 - 13.4) = 0.7 W.


Figure 8: Idle: Effect of HZ (battery life in minutes at HZ = 100, 250, and 1000)

4.1.4 Idle: Selecting HZ

In Linux-2.4, the periodic system timer tick ran at 100 HZ. Linux-2.6 started life running at 1000 HZ. Linux-2.6.13 added CONFIG_HZ, with selections of 100, 1000, and a compromise default of 250 HZ.

Figure 8 shows that the selection of HZ has a very small effect on the i6400, though others have reported larger differences on other systems. Note that since this difference was small, this comparison was made in single-user mode.

4.1.5 Idle: init1 vs. init5

The Linux vs. Windows measurement above was in multi-user GUI mode, though the network was disabled. One question often asked is whether the GUI (KDE, in this example) and other standard daemons have a significant effect on Linux battery life.

In this example, the answer is yes, but not much. Multi-user battery life is 238 minutes, and single-user battery life is 10 minutes longer at 248, only a 4% difference. Expressed as Watts, 13.4 - 12.8 = 0.6 W to run in multi-user GUI mode.

Figure 9: Idle: init1 vs. init5 (battery life in minutes)

However, init5 battery consumption may depend greatly on how the administrator configures the system.

4.1.6 Idle: 7200 vs. 5400 RPM Disk Drives

The i6400 arrived with a 5400 RPM 40 GB Fujitsu MHT2040BH SATA drive. Upgrading that drive to a 7200 RPM 60 GB Hitachi HTS721060G9SA00 SATA drive reduced single-user9 idle battery life by 16 minutes, to 232 from 248 (6%). This corresponds to an average power difference of 0.89 W. The specifications for the drives show the Hitachi consuming about 0.1 W more in idle and standby, and the same for read/write. So it is not immediately clear why Linux loses an additional 0.79 W here.

9init1 idle is used as the baseline here because the difference being measured is small, and to minimize the risk that the two different drives are configured differently.

4.1.7 Idle: Single Core vs. Idle Dual Core

Disabling one of the cores by booting with maxcpus=1 has no measurable effect on idle battery life. This is because the BIOS leaves the cores in the deepest available C-state. When Linux neglects to start the second core, it behaves almost exactly as if Linux had started that core and entered the deepest available C-state on it.

Note that taking a processor off-line at run-time in Linux does not currently put that processor into the deepest available C-state. There is a bug10 where offline processors instead enter C1. So taking a processor offline at run-time can actually result in worse battery life than if you leave it alone and let Linux go idle automatically.

4.1.8 The case against Processor Throttling (T-states)

Processor Throttling States (T-states) are available to the administrator under /proc/acpi/processor/*/throttling to modulate the clock supplied to the processors.

Most systems support 8 throttling states to decrease the processor frequency in steps of 12.5%. Throttling the processor frequency is independent of P-state frequency changes, so the two are combined. For example, Table 2 shows the potential effect of throttling when the example system is in P0 or P3.

10http://bugzilla.kernel.org/show_bug.cgi?id=5471

State   P0 MHz   P3 MHz
T0        2000     1000
T1        1750      875
T2        1500      750
T3        1250      625
T4        1000      500
T5         750      375
T6         500      250
T7         250      125

Table 2: Throttling States for the Intel® Core™ Duo T2500
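The table follows directly from the 12.5% steps; a throwaway calculation such as the following reproduces the T-state frequencies for any base P-state frequency (illustrative only):

def tstate_mhz(pstate_mhz, t):
    # Each T-state removes another 12.5% of the duty cycle: T0 = 8/8, T7 = 1/8.
    return pstate_mhz * (8 - t) / 8.0

for t in range(8):
    print "T%d  %6.0f  %6.0f" % (t, tstate_mhz(2000, t), tstate_mhz(1000, t))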

Throttling has an effect on processor frequency only when the system is in the C0 state executing instructions. In the idle loop, Linux is in a Cx state (x != 0) where no instructions are executed and throttling has no effect, as the clock is already stopped.

Indeed, T-states have been shown to have a net negative impact on battery life on some systems, as they can interfere with the mechanisms to efficiently enter deep C-states.

On the example system, throttling the idle system down to T7, the slowest speed, had a net negative impact on idle battery life of 4 minutes.

Throttling is used by Linux for passive cooling mode and for thermal emergencies. It is not intended for the administrator to use throttling to maximize performance/power or to extend battery life; that is what cpufreq processor performance states are for. So the next time you are exploring the configuration menus of the powersaved GUI, do NOT check the box that enables processor clock throttling. It is a bug that the administrator is given the opportunity to make that mistake.


Figure 10: DVD: Linux vs. Windows (battery life in minutes)

4.2 Reader Workload equals Init 5 Idle

Adding the Reader workload to init5 idle results in exactly the same battery life: 238 minutes. The bar chart is left as an exercise for the reader.

4.3 DVD Movie Workload

4.3.1 DVD Movie on Linux vs. Windows

The DVD movie playback workload is also attractive for comparing Linux and Windows. This constant work/time workload leaves little room for disagreement about what the operating environment is supplying to the user. DVD movie playback is also a realistic workload; people really do sit down and watch DVDs on battery power.

However, different DVD player software is used in each operating environment. The Windows solution uses WinDVD®, and the Linux measurement uses mplayer.

Here the i6400 plays a DVD on Linux for 184 minutes (3h4m). The i6400 plays the same DVD on Windows for 218 minutes (3h38m). This 34-minute deficit puts Linux at about 84% of Windows. In terms of Watts, Linux is at a (17.3 - 14.6) = 2.7 W deficit compared to Windows on DVD movie playback.

4.3.2 DVD Movie Single vs. Dual Core

DVD playback was measured with 1 CPU available vs. 2 CPUs, and there was zero impact on battery life.

4.3.3 DVD Movie: Throttling is not helpful

DVD playback was measured at T4 (50% throttling) and there was a net negative impact of 9 minutes on battery life. Again, throttling should be reserved for thermal management, and is almost never an appropriate tool where efficient performance/power is the goal.

4.4 Office Workload Battery Life and Performance

The Office workload battery life and performance are shown in Figure 11 and Figure 12, respectively. The example system lasted 232 minutes with maxcpus=1 and a 5400 RPM drive, achieving a performance rating of 94 (UP5K in Figures 11 and 12). Enabling the second core cost 6 minutes (-3%) of battery life, but increased performance by 89% to 178 (MP5K in Figures 11 and 12).

Upgrading the 5400 RPM disk drive to the 7200 RPM model had an 18-minute (8%) impact on the UP battery life, and a 12-minute (5%) impact on MP battery life. But the 7200 RPM drive had negligible performance benefit on this workload (UP7K and MP7K in Figures 11 and 12).

Figure 11: Office Battery Life (minutes, for the UP5K, MP5K, UP7K, and MP7K configurations)

Figure 12: Office Performance (performance score for the same configurations)

Figure 13: Developer Battery Life (minutes, for the same configurations)

Note that the size of memory compared to the working set of the Office workload impacts how much the disk is accessed. Were memory smaller, or the workload modified to access the disk more, the faster drive would undoubtedly have a measurable benefit.

In summary, the second core has a significant performance benefit, with minimal battery cost on this workload. However, upgrading from a 5400 RPM to a 7200 RPM drive does not show a significant performance benefit on this workload as it is currently implemented.

4.5 Developer Workload Battery Life and Performance

The Developer workload battery life and performance are shown in Figure 13 and Figure 14, respectively.

Here the maxcpus=1 5400 RPM baseline scores 220 minutes with a performance of 96. Enabling the second core had a net positive impact on battery life of 2 minutes, and increased performance to 172 (+79%). Starting from the same baseline, upgrading to the 7200 RPM drive from the 5400 RPM drive dropped battery life 26 minutes to 194 from 220 (-12%), but increased performance to 175 (+82%). Simultaneously enabling the second core and upgrading the drive reduced battery life 34 minutes to 186 from 220 (-15%), but increased performance to 287 (+198%).

Figure 14: Developer Performance (performance score for the UP5K, MP5K, UP7K, and MP7K configurations)

Clearly, developers using this class of machine should always have both cores enabled and should be using 7200 RPM drives.

5 Future Work

5.1 Enhancing the Tools

The current version of the tools, 1.0.4, could use some improvements.

• Concurrent Office applications. The current scripts start an application, use it, and then close it. Real users tend to have multiple applications open at once. It is unclear if this will have any significant effect on battery life, but it would be better eye candy.

• Add sanity checking that the system is properly configured before starting a measurement.

5.2 More Comparisons to make

The example measurements in this paper suggest even more measurements:

• Effect of run-time device power states.

• Comparison of default policies of different Linux distributors.

• Benefits of the laptop patch?

• USB 2.0 memory stick cost

• Gbit LAN linked vs. unplugged

• WLAN seeking

• WLAN associated

• Bluetooth

• KDE vs. Gnome GUI

• LCD: brightness vs. power consumption; is there an optimal brightness/power setting?

• Power consumption while suspended to RAM vs. power consumption to reboot. What is the break-even length of time suspended vs. halt, off, boot?

• Suspend to Disk and wakeup vs. staying idle

• Suspend to RAM and wakeup vs. staying idle


6 Conclusion

The authors hope that the tools and techniques shown here will help the Linux community effectively analyze system power, understand laptop battery life, and improve Linux power management.

Appendix A: Scripting Commands

Keystrokes, keystroke conditions (like <Alt>, <Ctrl>, <Shift>, etc.), and delays are scripted in a scenario file along with other actions (run command, wait command, select window, send signal, etc.). The scenario file is passed to the workload script execution program; strings are parsed and the appropriate actions are executed.

The scenario script is linear; no procedure definitions are (currently) supported. Each string consists of 5 white-space-separated fields and begins with a command name followed by 4 arguments (State, Count, Delay, String). For each particular command the arguments can have different meanings or be meaningless, though all 4 arguments must be present. The following commands are implemented:

Commands to generate user input

DELAY 0 0 Delay 0
    Suspends execution for 'Delay' msecs.

PRESSKEY State Count Delay String
    Sends Count State + String keystrokes, with Delay msec intervals between them, to the window in focus; e.g., the command PRESSKEY S 2 500 Down would generate two <Shift> + <Down> keystrokes with 1/2-second intervals. The State values are:

    S for Shift,
    A for Alt,
    C for Ctrl.

    Some keys should be given by their respective names: Up, Down, Left, Right, Return, Tab, ESC.

RELEASEKEY 0 0 0 String
    Similar to the PRESSKEY command, except that the Release event is sent. It can be useful since some menu buttons react on key release; i.e., the pair PRESSKEY 0 0 <Delay> Return and RELEASEKEY 0 0 <Delay> Return should be used in this case.

TYPETEXT State 0 Delay String
    Types the text from String with a Delay msec interval between keystrokes. If State is F, then the text from the file named by String is typed instead of String itself.

ENDSCEN 0 0 0 0
    End of scenario. No strings beyond this one will be executed.

Commands to operate applications

RUNCMD 0 0 0 String
    Executes the command String; exits on completion.

WAITSTARTCMD 0 Count Delay String
    Checks Count times, with Delay msec intervals, whether the String command has started (the total wait time is Count * Delay msecs).

WAITFINISHCMD 0 Count Delay String
    Checks Count times, with Delay msec intervals, whether the String command has finished (the total wait time is Count * Delay msecs).


Commands to interact with X windows

SETWINDOWID State 0 0 String
    Makes the window whose X window ID is located in the 'String' object active. If State is F, then String is treated as a file; if E or 0, as an environment variable.

SETWINDOW 0 0 0 String
    Waits for the window with the String title to appear and makes it active.

FOCUSIN 0 0 0 0
    Sets focus to the current active window.

FOCUSOUT 0 0 0 0
    Gets focus out of the current active window.

ENDWINDOW 0 0 0 String
    Waits for the window with the String title to disappear.

SYNCWINDOW 0 0 0 0
    Tries to synchronize the current active window.

To reach one particular window, the SETWINDOW and FOCUSIN commands should be performed.

Commands to generate statistics

SENDWORKMSG 0 0 0 String
    Generates a 'WORK' statistics string in the log file with the String comment.

SENDIDLEMSG 0 0 0 String
    Generates an 'IDLE' statistics string in the log file with the String comment.

Note that the harness generates statistics regularly, so the above commands are intended to generate strings marking the beginning and ending of a set of operations (e.g., a 'hot-spot') for which special measurements are required.

Debugging Commands

TRACEON 0 0 0 0
    Enables debug prints.

TRACEOFF 0 0 0 0
    Disables debug prints.
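Putting a few of these commands together, a hypothetical scenario fragment might look like the following. The application, window title, file name, and delays are invented for illustration; only the command names and the five-field argument layout follow the definitions above (TYPETEXT is shown in its F form so that the String field remains a single token).

RUNCMD       0 0 0    oowriter
SETWINDOW    0 0 0    Writer
FOCUSIN      0 0 0    0
SENDWORKMSG  0 0 0    typing
TYPETEXT     F 0 200  paragraph.txt
PRESSKEY     C 1 500  s
SENDIDLEMSG  0 0 0    typing
DELAY        0 0 2000 0
ENDSCEN      0 0 0    0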

References

[ACPI] Hewlett-Packard, Intel, Microsoft, Phoenix, Toshiba. Advanced Configuration & Power Interface Specification, Revision 3.0a, December 30, 2005. http://www.acpi.info.

[Linux/ACPI] Linux/ACPI Project Homepage, http://acpi.sourceforge.net.

[MM02] MobileMark® 2002, Business Applications Performance Corporation, http://bapco.com, June 4, 2002, Revision 1.0.

[MM05] MobileMark® 2005, Business Applications Performance Corporation, http://bapco.com, May 26, 2005, Revision 1.0.

BAPCo is a U.S. Registered Trademark of the Business Applications Performance Corporation. MobileMark is a U.S. Registered Trademark of the Business Applications Performance Corporation. Linux is a registered trademark of Linus Torvalds. All other trademarks mentioned herein are the property of their respective owners.


The Frysk Execution Analysis Architecture

Andrew Cagney
Red Hat Canada
[email protected]

Abstract

The goal of the Frysk project is to create an intelligent, distributed, always-on system-monitoring and debugging tool. Frysk will allow GNU/Linux developers and system administrators to monitor running processes and threads (including creation and destruction events); to monitor the use of locking primitives; to expose deadlocks; and to gather data. Users debug any given process either by choosing it from a list or by accepting Frysk's offer to open a source code or other window on a process that is in the process of crashing or that has been misbehaving in certain user-definable ways.

1 Introduction

This paper will first present a typical Frysk use-case. The use-case will then be used to illustrate how Frysk differs from a more traditional debugger, and how those differences benefit the user. The paper will then go on to provide a more detailed overview of Frysk's internal architecture and show how that architecture facilitates Frysk's objectives. Finally, Frysk's future development will be reviewed, drawing attention to areas of the Linux Kernel that can be enhanced to better facilitate advanced debugging tools.

2 Example: K. the Compiler Engineer

K., a compiler engineer, spends a lot of time running a large, complex test-suite involving lots of processes and scripts, constantly monitoring each run for compiler crashes. When a crash occurs, K. must first attempt to reproduce the problem in isolation, then reproduce it under a command-line debugging tool, and then finally attempt to diagnose the problem.

Using Frysk, K. creates a monitored terminal:

[Screenshot: a monitored Terminal window]

From within that terminal, K. can directly run the test framework:

$ make -j5 check
$ ls
Makefile


When a crash occurs, K. is alerted by the blinking Frysk icon in the toolbar. K. can then click on the Frysk icon and bring up the source window displaying the crashing program at the location at which the crash occurred.

3 Frysk Compared To Traditional Debugger Technology

In the 1980s, when debuggers such as GDB, SDB, and DBX were first developed, UNIX application complexity was typically limited to single-threaded, monolithic applications running on a single machine and written in C. Since that period, applications have grown in both complexity and sophistication, utilizing multiple threads and multiple processes; shared libraries; shared memory; a distributed structure, typically a client-server architecture; and implementations in C++, Java, C#, and scripting languages.

Unfortunately, the debugger tools developed at that time have failed to keep pace with these advances. Frysk, in contrast, has the goal of supporting these features from the outset.

3.1 Frysk Supports Multiple Threads, Processes, and Hosts

Given that even a simple application, such as firefox, involves both multiple processes and threads, Frysk was designed from the outset to follow Threads, Processes, and Hosts. That way the user, such as K., is provided with a single consistent tool that monitors the entire application.

3.2 Frysk is Non-stop

Historically, since an application had only asingle thread, and since any sign of a prob-lem (e.g., a signal) was assumed to herald dis-aster, the emphasis on debugging tools was tostop an application at the earliest sign of trou-ble. With modern languages, and their man-aged run-times, neither of those these assump-tions apply. For instance, where previously aSIGSEGV was indicative of a fatal crash, it isnow a normal part of an application’s executionbeing used by the system’s managed run-timeas part of memory management.

With Frysk, the assumption is that the user requires the debugging tool to be as unobtrusive as possible, permitting the application to run freely. Only when the user explicitly requests control over one or more threads, or when a fatal situation such as the one K. encountered is detected, will Frysk halt a thread or process.

3.3 Frysk is Event Driven

Unlike simpler command-line debugging tools, which are largely restricted to exclusively monitoring just the user's input or just the running application, Frysk is event-driven and able to co-ordinate both user and application events simultaneously. When implementing a graphical interface this becomes especially important, as the user expects Frysk to always be responsive.


3.4 Frysk has Transparent Attach and Detach

With a traditional debugging tool, a debugging session for an existing process takes the form:

• attach to process

• examine values, continue, or stop

• detach from process

That is, the user is firstly very much aware of the state of the process (attached or detached), and secondly, is restricted to just manipulating attached processes. With Frysk, the user can initiate an operation at any time, the need to attach being handled transparently.

For instance, when a user requests a stack back-trace from a running process, Frysk automatically attaches to, and then stops, the process.

3.5 Frysk is Graphical, Visual

While a command-line based tool is useful for examining a simple single-threaded program, it is not so effective when examining an application that involves tens if not hundreds of threads. In contrast, Frysk strongly emphasizes its graphical interface, providing visual mechanisms for examining an application. For instance, to examine the history of processes and events, Frysk provides an event line:

3.6 Frysk Handles Optimized and In-line Code

Rather than limiting debugging to applications that are written in C and compiled unoptimized, Frysk is focused on supporting applications that have been compiled with optimized and in-lined code. Frysk exploits its graphical interface by permitting the user to examine the in-lined code in place. For instance, an in-lined function b() with a further in-line call to f() can be displayed as:

3.7 Frysk Loads Debug Information On-demand

Given that a modern application often has gigabytes of debug information, the traditional approach of reading all debug information into memory is not practical. Instead Frysk, using libelf and libdw, reads the debug information on demand, and hence ensures that Frysk's size is minimized.

3.8 Frysk Itself is Multi-Threaded and Object Oriented

It is often suggested that a debugging tool is best suited to debugging itself. This view is based on the assumption that, since developers spend most of their time using their own tools for debugging their own tools, they will be strongly motivated to at least make debugging their tool easy. Consequently, a single-threaded procedural debugging tool written in C would be best suited for debugging C, while developers working on a multi-threaded, object-oriented, event-driven debugging tool are going to have a stronger motivation to make the tool work with that class of application.

3.9 Frysk is Programmable

In addition to a graphical interface, the Frysk architecture facilitates the rapid development of useful standalone command-line utilities implemented using Frysk's core. For instance, the command-line utility ftrace, similar to strace, was implemented by adding a system call observer that prints call information to the threads being traced, and the program fstack was implemented by adding a stop observer to all threads of a process so that, as each thread stopped, its stack back-trace could be printed.

4 The Frysk Architecture

4.1 Overview

At a high level, Frysk's architecture can be viewed as a collection of clients that interact with Frysk's core. The core provides clients with alternate models, or views, of the system.

[Figure: Frysk's core and its clients — the gui, cli, utilities, and eclipse clients use public interfaces to the core's proc and lang models, which in turn sit above the kernel]

Frysk's core then uses the target system's kernel interfaces to maintain the internal models of the running system.

4.2 The Core, A Layered Architecture

Aspects of a Linux system can be viewed, or modeled, at different levels of abstraction. For instance:

• a process model: as a set of processes, each containing threads, and each thread having registers and memory

• a language model: a process executing a high-level program, written in C++, having a stack, variables, and code

Conceptually, the models form a sequence of layers, and each layer is implemented using the one below:


[Figure: layered models — the Language Model (stack, variable, source code) sits above the Process Model (host, process, thread), which sits above the Kernel]

For instance, the language model, which abstracts a stack, would construct that stack's frames using register information obtained from the process model.

The core then makes each of those models available to client applications.

4.2.1 Frysk’s Process Model

Frysk's process model implements a process-level view of the Linux system. The model consists of host, process, and task (or thread) objects corresponding to the Linux system equivalents:

[Figure: the process model object tree — a Host object containing Proc objects]

Frysk then makes this model available to the user as part of the process window:

When a user requests that Frysk monitor for a process model event, such as a process exiting, that request is implemented by adding an observer (or monitor) to the objects to which the request applies. When the corresponding event occurs, the observers are notified.

4.2.2 Frysk’s Language Model

Corresponding to the run-time state of a high-level program, Frysk provides a run-time language model. This model provides an abstraction of a running program's stack (consisting of frames), variables, and objects.

[Figure: language-model stack frames (foo, bar, baz) linked by "calls" and "inlines" relations]

The model is then made available to the user through the source window's stack and source code browsers:


5 Future Direction

Going forward, Frysk's development is expected to be increasingly focused on large, complex, and distributed applications. Consequently, Frysk is expected to continue pushing its available technology.

Internally, Frysk has already identified limitations of the current Linux Kernel debugging interfaces (ptrace and proc). For instance: only the thread that did the attach is permitted to manipulate the debug target, and waiting on kernel events still requires the juggling of SIGCHLD and waitpid. Addressing these issues will be critical to ensuring Frysk's scalability.

At the user level, Frysk will continue its exploration of interfaces that allow the user to analyze and debug increasingly large and distributed applications. For instance, Frysk's interface needs to be extended so that it is capable of visualizing and managing distributed applications involving hundreds or thousands of nodes.

6 Conclusion

Through the choice of a modern programming language, and the application of modern software design techniques, Frysk is well advanced in its goal of creating an intelligent, distributed, always-on monitoring and debugging tool.

7 Acknowledgments

Thanks go to Michael Behm, Stan Cox, Adam Jocksch, Rick Moseley, Chris Moller, Phil Muldoon, Sami Wagiaalla, Elena Zannoni, and Wu Zhou, who provided feedback, code, and screenshots.


Evaluating Linux Kernel Crash Dumping Mechanisms

Fernando Luis Vázquez Cao
NTT Data Intellilink

[email protected]

Abstract

There have been several kernel crash dump capturing solutions available for Linux for some time now, and one of them, kdump, has even made it into the mainline kernel.

But the mere fact of having such a feature does not necessarily imply that we can obtain a dump reliably under any conditions. The LKDTT (Linux Kernel Dump Test Tool) project was created to evaluate crash dumping mechanisms in terms of success rate, accuracy, and completeness.

A major goal of LKDTT is maximizing the coverage of the tests. For this purpose, LKDTT forces the system to crash by artificially recreating crash scenarios (panic, hang, exception, stack overflow, etc.), taking into account the hardware conditions (such as ongoing DMA or interrupt state) and the load of the system. The latter is key to the significance and reproducibility of the tests.

Using LKDTT the author was able to confirm the superior reliability of the kexec-based approach to crash dumping, although several deficiencies in kdump were revealed too. Since the final goal is having the best crash dumping mechanism possible, this paper also addresses how the aforementioned problems were identified and solved. Finally, possible applications of kdump beyond crash dumping will be introduced.

1 Introduction

Mainstream Linux lacked a kernel crash dumping mechanism for a long time despite the fact that there were several solutions (such as Diskdump [1], Netdump [2], and LKCD [3]) available out of tree. Concerns about their intrusiveness and reliability prevented them from making it into the vanilla kernel.

Eventually, a handful of crash dumping solutions based on kexec [4, 5] appeared: Kdump [6, 7], Mini Kernel Dump [8], and Tough Dump [9]. On paper, the kexec-based approach seemed very reliable and the impact on the kernel code was certainly small. Thus, kdump was eventually proposed as the Linux kernel's crash dumping mechanism and subsequently accepted.

However, having a crash dumping mechanism does not necessarily imply that we can get a dump under any crash scenario. It is necessary to do proper testing, so that the success rate and accuracy of the dumps can be estimated and the different solutions compared fairly. Besides, having a standardised test suite would also help establish a quality standard and, collaterally, detecting regressions would be much easier.

Unless otherwise indicated, henceforth all the explanations will refer to the i386 and x86_64 architectures, and the Linux 2.6.16 kernel.


1.1 Shortcomings of current testing methods

Typically, to test crash dumping mechanisms a kernel module is created that artificially causes the system to die. Common methods to bring the system down from this module consist of directly invoking panic, making a null pointer dereference, and other similar techniques.

Sometimes, to ease testing, a user-space tool is provided that sends commands to the kernel-space part of the testing tool (via the /proc file system or a new device file), so that things like the crash type to be generated can be configured at run-time.

Beyond the crash type, there are no provisions to further define the crash scenario to be recreated. In other words, parameters like the load of the machine and the state of the hardware are undefined at the time of testing.

Judging from the results obtained with this approach to testing, all crash dumping solutions seem to be very close in terms of reliability, regardless of whether they are kexec-based or not, which seems to contradict theory. The reason is that the coverage of the tests is too limited, as a consequence of leaving important factors out of the picture. Just to give some examples, the hardware conditions (such as ongoing DMA or interrupt state), the system load, and the execution context are not taken into consideration. This greatly diminishes the relevance of the results.

1.2 LKDTT motivation

The critical role crash dumping solutions play in enterprise systems calls for proper testing, so that we can have an estimate of their success rate under realistic crash scenarios. This is something the current testing methods cannot achieve and, as an attempt to fill this gap, the LKDTT project [10] was created.

Using LKDTT many deficiencies in kdump, LKCD, mkdump, and other similar projects were found. Over time, some regressions were observed too. This type of information is of great importance to both Linux distributions and end-users, and making sure it does not pass unnoticed is one of the commitments of this project.

To create meaningful tests it is necessary to understand the basics of the different crash dumping mechanisms. A brief introduction follows in the next section.

2 Crash dump

A variety of crash dumping solutions have been developed for Linux and other UNIX®-like operating systems over time. Even though implementations and design principles may differ greatly, all crash dumping mechanisms share a multistage nature:

1. Crash detection.

2. Minimal machine shutdown.

3. Crash dump capture.

2.1 Crash detection

For the crash dump capturing process to start, a trigger is needed. And this trigger is, most interestingly, a system crash.

The problem is that this peculiar trigger sometimes passes unnoticed or, in other words, the kernel is unable to detect that it has itself crashed.


The culprits of system crashes are software errors and hardware errors. Often a hardware error leads to a software error, and vice versa, so it is not always easy to identify the original problem. For example, behind a panic in the VFS code a damaged memory module might be lurking.

There is one principle that applies to both software and hardware errors: if the intention is to capture a dump, as soon as an error is detected control of the system should be handed to the crash dumping functionality. Deferring the crash dumping process by delegating invocation of the dump mechanism to functions such as panic is potentially fatal, because the crashing kernel might well lose control of the system completely before getting there (due to a stack overflow, for example).

As one might expect, the detection stage of the crash dumping process does not show marked implementation-specific differences. As a consequence, a single implementation could easily be shared by the different crash dumping solutions.

2.1.1 Software errors

A list of the most common crash scenarios the kernel has to deal with is provided below:

• Oops: Occurs when a programming mistake or an unexpected event causes a situation that the kernel deems grave. Since the kernel is the supervisor of the entire system, it cannot simply kill itself as it would do with a user-space application that goes nuts. Instead, the kernel issues an oops (which results in a stack trace and error message on the console) and strives to get out of the situation. But often, after the oops, the system is left in an inconsistent state the kernel cannot recover from and, to avoid further damage, the system panics (see panic below). For example, a driver might have been in the middle of talking to hardware or holding a lock at the time of the crash, and it would not be safe to resume execution. Hence, a panic is issued instead.

• Panic: Panics are issued by the kernel upon detecting a critical error from which it cannot recover. After printing an error message the system is halted.

• Faults: Faults are triggered by instructions that cannot or should not be executed by the CPU. Even though some of them are perfectly valid, and in fact play an essential role in important parts of the kernel (for example, page faults in virtual memory management), there are certain faults caused by programming errors, such as divide-error, invalid TSS, or double fault (see below), which the kernel cannot recover from.

• Double and triple faults: A double fault indicates that the processor detected a second exception while calling the handler for a previous exception. This might seem a rare event but it is possible. For example, if the invocation of an exception handler causes a stack overflow, a page fault is likely to happen, which, in turn, would cause a double fault. On i386 architectures, if the CPU faults again during the inception of the double fault, it triple faults, entering a shutdown cycle that is followed by a system RESET.

• Hangs: Bugs that cause the kernel to loop in kernel mode, without giving other tasks the chance to run. Hangs can be classified into two big groups:

– Soft lockups: These are transitory lockups that delay execution and scheduling of other tasks. Soft lockups can be detected using a software watchdog.

– Hard lockups: These are lockups that leave the system completely unresponsive. They occur, for example, when a CPU disables interrupts and gets stuck trying to get a spinlock that is not freed due to a locking error. In such a state timer interrupts are not served, so scheduler-based software watchdogs cannot be used for detection. The same happens to keyboard interrupts, and that is why the SysRq key cannot be used to trigger the crash dump. The solution here is using the NMI handler.

• Stack overflows: In Linux the size of the kernel stacks is limited (at the time of writing i386's default size is 8KB) and, for this reason, the kernel has to make sensible use of the stack to avoid bloating it. It is a common mistake by novice kernel programmers to declare large automatic variables or to use deeply nested recursive algorithms; both of these practices tend to cause stack overflows. Stacks are also jeopardised by other factors that are not so evident. For example, on i386 interrupts and exceptions use the stack of the current task, which puts extra pressure on it. Consequently, interrupt nesting should also be taken into account when programming interrupt handlers. (An illustrative sketch of the two programming mistakes follows this list.)
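Purely as an illustration of the two mistakes named above (this fragment is not part of LKDTT or of the kernel, and the names walk_tree and fill_scratch_buffer are invented), the following C code shows how a large automatic variable and unbounded recursion quickly eat a small 4KB-8KB stack:

/* Illustrative only: two classic ways to exhaust a small stack.
 * Neither function comes from the kernel or from LKDTT. */
#include <string.h>

struct node {
        struct node *left, *right;
        int value;
};

/* Deep recursion: every level of the tree costs another stack frame. */
int walk_tree(const struct node *n)
{
        if (!n)
                return 0;
        return n->value + walk_tree(n->left) + walk_tree(n->right);
}

/* Large automatic variable: 4KB gone in a single frame. */
void fill_scratch_buffer(void)
{
        char buf[4096];

        memset(buf, 0, sizeof(buf));
}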

2.1.2 Hardware errors

Not only software has errors; sometimes machines fail too. Some hardware errors are recoverable, but when a fatal error occurs the system should come to a halt as fast as possible to avoid further damage. It is not even clear whether trying to capture a crash dump in the event of a serious hardware error is a sensible thing to do. When the underlying hardware cannot be trusted, one would rather bring the system down to avoid greater havoc.

The Linux kernel can make use of some error-detection facilities of computer hardware. Currently the kernel is furnished with several infrastructures which deal with hardware errors, although the tendency seems to be to converge around EDAC (Error Detection and Correction) [11]. Common hardware errors the Linux kernel knows about include:

• Machine checks: Machine checks occur in response to CPU-internal malfunctions or as a consequence of hardware resets. Their occurrence is unpredictable and can leave memory and/or registers in a partially updated state. In particular, the state of the registers at the time of the event is completely undefined.

• System RAM errors: In systems equipped with ECC memory the memory chip has extra circuitry that can detect errors in the ingoing and outgoing data flows.

• PCI bus transfer errors: The data travelling to/from a PCI device may experience corruption whilst on the PCI bus. Even though a majority of PCI bridges and peripherals support such error detection, most systems do not check for them. It is worth noting that, despite the fact that some of these errors might trigger an NMI, it is not possible to figure out what caused it, because there is no more information.

2.2 Minimal machine shutdown

When the kernel finds itself in a critical situation it cannot recover from, it should hand control of the machine to the crash dumping functionality. In contrast to the previous stage (detection), the things that need to be done at this point are quite implementation dependent. That said, all the crash dumping solutions, regardless of their design principles, follow the basic execution flow indicated below:

1. Right after entering the dump route the crashing CPU disables interrupts and saves its context in a memory area specially reserved for that purpose.

2. In SMP environments, the crashing CPU sends NMI IPIs to the other CPUs to halt them.

3. In SMP environments, each IPI-receiving processor disables interrupts and saves its context in a special-purpose area. After this, the processor busy-loops until the dump process ends.
Note: Some kexec-based crash dump capturing mechanisms relocate to the boot CPU after a crash, so this step becomes different in those cases (see Section 7.4 for details).

4. The crashing CPU waits a certain amount of time for the IPIs to be processed by the other CPUs, if any, and resumes execution.

5. Device shutdown/reinitialization, if done at all, is kept to a minimum, for it is not safe after a crash.

6. Jump into the crash dump capturing code.

2.3 Crash dump capture

Once the minimal machine shutdown is completed, the system jumps into the crash dump capturing code, which takes control of the system to do the dirty work of capturing the dump and saving the dump image in a safe place.

Before continuing, it is probably worth defining what is understood by kernel crash dump. A kernel crash dump is an image of the resources in use by the kernel at the time of the crash, whose analysis is an essential element in clarifying what went wrong. This usually comprises an image of the memory available to the crashed kernel and the register states of all the processors in the system. Essentially, any information deemed useful to figure out the source of the problem is a candidate for inclusion in the dump image.

This final and decisive stage is probably the one that varies most between implementations. Attending to the design principles, though, two big groups can be identified:

• In-kernel solutions: LKCD, Diskdump, Netdump.

• kexec-based solutions: Kdump, MKDump, Tough Dump.

2.3.1 In-kernel solutions

The main characteristic of the in-kernel approach is, as the name suggests, that the crash dumping code uses the resources of the crashing kernel. Hence, these mechanisms, among other things, make use of the drivers and the memory of the crashing kernel to capture the crash dump. As might be expected, this approach has many reliability issues (see Section 5.1 for an explanation and Table 2 for test results).

2.3.2 kexec-based solutions

The core design principle behind the kexec-based approach is that the dump is captured from an independent kernel (the crash dump kernel or second kernel) that is soft-booted after the crash. From here onwards, for discussion purposes, the crashing kernel is referred to as the first kernel and the kernel which captures the dump as either the capture kernel or the second kernel.

As a general rule, the crash dumping mechanism should avoid fiddling with the resources (such as memory and CPU registers) of the crashing kernel. And, when this is inevitable, the state of these resources should be saved before they are used. This means that if the capture kernel wants to use a memory region the first kernel was using, it should first save the original contents so that this information is not missing from the dump image. These considerations, along with the fact that we are soft-booting into a new kernel, determine the possible implementations of the capture kernel:

• Booting the capture kernel from the standard memory location
The capture kernel is loaded in a reserved memory region and, in the event of a crash, it is copied to the memory area from where it will boot. Since this kernel is linked against the architecture's default start address, it needs to reside in the same place in memory as the crashing kernel. Therefore, the memory area necessary to accommodate the capture kernel is preserved by copying it to a backup region just before doing the copy. This was the approach taken by Tough Dump.

• Booting the second kernel from a reserved memory region
After a crash the system is unstable and the data structures and functions of the crashing kernel are not reliable. For this reason there is no attempt to perform any kind of device shutdown and, as a consequence, any DMAs ongoing at the time of the crash are not stopped. If the approach discussed before is used, the capture kernel is prone to being stomped on by DMA transactions initiated in the first kernel. As long as IOMMU entries are not reassigned, this problem can be solved by booting the second kernel directly from the reserved memory region it was loaded into. To be able to boot from the reserved memory region the kernel has to be relocated there. The relocation can be accomplished at compile time (with the linker) or at run-time (see discussion in Section 7.2.1). Kdump and mkdump take the first and second approach, respectively.

For the reliability reasons mentioned above, the second approach is considered the right solution.

3 LKDTT

3.1 Outline

LKDTT is a test suite that forces the kernel to crash by artificially recreating realistic crash scenarios. LKDTT accomplishes this by taking into account both the state of the hardware (for example, execution context and DMA state) and the load conditions of the system for the tests.

LKDTT has kernel-space and user-space components. It consists of a kernel patch that implements the core functionality, a small utility to control the testing process (ttutils), and a set of auxiliary tools that help recreate the necessary conditions for the tests.

Usually tests proceed as follows:

• If it was not built into the kernel, load the DTT (Dump Test Tool) module (see Section 3.2.2).


• Indicate the point in the kernel where the crash is to be generated using ttutils (see Section 3.3). This point is called a Crash Point (CP).

• Reproduce the necessary conditions for the test using the auxiliary tools (see Section 3.4).

• Configure the CP using ttutils. The most important configuration item is the crash type.

• If the CP is located in a piece of code rarely executed by the kernel, it may become necessary to use some of the auxiliary tools again to direct the kernel towards the CP.

A typical LKDTT session is depicted in Table 1.

3.2 Implementation

LKDTT is pretty simple and specialized in testing kernel crash dumping solutions. LKDTT's implementation is sketched in Figure 1, which will be used throughout this section for explanation purposes.

The sequence of events that leads to an artificial crash is summarized below:

• Kernel execution flow reaches a crash point, or CP (see 3.2.1). In the figure the CP is called HD_CP and is located in the hard disk device driver.

• If the CP is enabled the kernel jumps into the DTT module. Otherwise execution resumes from the instruction immediately after the CP.

[Figure 1: Crash points implementation — the HD_CP crash point in the hard disk driver jumps into the DTT module, which decrements the HD_CP_CNT counter and generates the configured crash when it reaches zero]

• The DTT module checks the counter associated with the CP. This counter (HD_CP_CNT in the example) keeps track of the number of times the CP in question has been crossed.

• This counter is in reality a reverse counter, and when it reaches 0 the DTT (Dump Test Tool) module induces the system crash associated with the CP. If the counter is still greater than zero, execution returns from the module and continues from the instruction right after the CP. (A rough sketch of such a hook is shown after this list.)
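The following fragment is only a rough sketch of the hook mechanism just described; it is not LKDTT's actual CPOINT macro (which, as noted later, is based on IBM's Kernel Hooks), and the names dtt_cp and dtt_fire are invented for illustration:

/* Hypothetical sketch of a crash-point hook -- not LKDTT's real CPOINT.
 * Each crash point has an enable flag, a reverse counter and a crash
 * type configured from user space; when the counter reaches zero the
 * DTT module induces the configured failure. */
struct dtt_cp {
        int enabled;            /* configured via /proc/dtt/ctrl        */
        int count;              /* pass_num, decremented on every cross */
        int crash_type;         /* none, panic, oops, hang, overflow    */
};

void dtt_fire(struct dtt_cp *cp);       /* provided by the DTT module */

#define CPOINT_SKETCH(cp)                               \
        do {                                            \
                if ((cp)->enabled && --(cp)->count == 0)\
                        dtt_fire(cp);                   \
        } while (0)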

Both the initial value of the counter and the crash type to be generated are run-time configurable from user space using ttutils (see 3.3). Crash points, however, are inserted in the kernel source code, as explained in the following section.


# modprobe dtt
# ./ttutils ls
id  crash type  crash point name    count  location
1   none        INT_HARDWARE_ENTRY  0      kern
3   none        FS_DEVRW            0      kern
4   panic       MEM_SWAPOUT         7      kern
5   none        TASKLET             0      kern
# ./ttutils add -p IDE_CORE_CP -n 50
# ./ttutils ls
id  crash type  crash point name    count  location
1   none        INT_HARDWARE_ENTRY  0      kern
3   none        FS_DEVRW            0      kern
4   panic       MEM_SWAPOUT         7      kern
5   none        TASKLET             0      kern
50  none        IDE_CORE_CP         0      dyn
# ./helper/memdrain
# ./ttutils set -p IDE_CORE_CP -t panic -c 10

Table 1: LKDTT usage example

3.2.1 Crash Points

Each of the crash scenarios covered by the test suite is generated at a Crash Point (CP), which is a mere hook in the kernel. There are two different approaches to inserting hooks at arbitrary points in the kernel: patching the kernel source and dynamic probing. At first glance, the latter may seem the clear choice, because it is more flexible and it would not be necessary to recompile the kernel to insert a new CP. Besides, there is already an implementation of dynamic probing in the kernel (Kprobes [12]), with which the need for a kernel recompilation disappears completely.

Despite all these advantages, dynamic probing was discarded because it changes the execution mode of the processor (a breakpoint interrupt is used) in a way that can modify the result of a test. Using /dev/mem to crash the kernel is another option, but in this case there is no obvious way of carrying out the tests in a controlled manner. These are the main motives behind LKDTT's choice of the kernel-patch approach. Specifically, the current CP implementation is based on IBM's Kernel Hooks.

In any case, a recent extension to Kprobes called Djprobe [13], which uses the breakpoint trap just once to insert a jump instruction at the desired probe point and uses the jump thereafter, looks promising (the jump instruction does not alter the CPU's execution mode and, consequently, should not alter the test results).

As pointed out before, each CP has two attributes: the number of times the CP is crossed before causing the system crash, and the crash type to be generated.

At the time of writing, 5 crash types are supported:

• Oops: generates a kernel oops.

• Panic: generates a kernel panic.

• Exception: dereferences a null pointer.

• Hang: simulates a locking error by busy looping.


• Overflow: bloats the stack. (A sketch of how these failures can be provoked follows this list.)
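Purely for illustration (this is not LKDTT's code; kernel context is assumed and includes are omitted), the failures behind these crash types can be provoked with a few lines of C each, the helper names being hypothetical:

/* Hypothetical sketches of the failure modes listed above. */
void cause_panic(void)
{
        panic("dtt: induced panic");            /* panic */
}

void cause_oops_or_exception(void)
{
        *(volatile int *)0 = 0;                 /* NULL dereference: oops/exception */
}

void cause_hang(void)
{
        local_irq_disable();                    /* hang: busy loop with IRQs off,  */
        for (;;)                                /* simulating a lost spinlock      */
                cpu_relax();
}

int cause_overflow(int levels)                  /* overflow: roughly 1KB of stack  */
{                                               /* consumed per recursion level    */
        char pad[1024];

        memset(pad, 0, sizeof(pad));
        return levels ? cause_overflow(levels - 1) : pad[0];
}

Note that an optimizing compiler may shrink or merge such frames, something a real test tool needs to guard against.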

As mentioned before, crash points are inserted in the kernel source code. This is done using the macro CPOINT provided by LKDTT's kernel patch (see 3.5). An example can be seen in code listing 1.

3.2.2 DTT module

The core of LKDTT is the DTT (Dump Test Tool) kernel module. Its main duties are: managing the state of the CPs, interfacing with the user-space half of LKDTT (i.e. ttutils) through the /proc file system, and generating the crashes configured by the user using the aforementioned interface.

Figure 2 shows the pseudo state diagram of a CP (the square boxes identify states). From LKDTT's point of view, CPs come into existence when they are registered. This is done automatically in the case of CPs compiled into the kernel image. CPs residing in kernel modules, on the other hand, have to be registered either by calling a CP registration function from the module's init method, or from user space using a special ttutils option (see add in ttutils, Section 3.3, for a brief explanation and Table 1 for a usage example). The latter mechanism is aimed at reducing the intrusiveness of LKDTT.

Once a CP has been successfully registered, the user can proceed to configure it using ttutils (see set in ttutils, Section 3.3, and Table 1 for an example). When the CP is enabled, every time the CP is crossed the DTT module decreases the counter associated with it by one and, when it reaches 0, LKDTT simulates the corresponding failure.

[Figure 2: State diagram of crash points — states: CP registered, CP enabled, countdown, CP disabled, CP deleted; transitions: CP registration, CP configuration, CP reached, crash generation/dump capture, CP deactivation, CP deletion]

If it is an unrecoverable failure, the crash dumping mechanism should assume control of the system and capture a crash dump. However, if the kernel manages to recover from this eventuality the CP is marked as disabled, and remains in this state until it is configured again or deleted. There is a caveat here though: in-kernel CPs cannot be deleted.

LKDTT can be enabled either as a kernel module or compiled into the kernel. In the modular case it has to be modprobed:

# modprobe dtt [rec_num={>0}]

rec_num sets the recursion level for the stack overflow test (the default is 10). The stack growth is approximately rec_num*1KB.

3.3 ttutils

ttutils is the user-space bit of LKDTT. It is a simple C program that interfaces with the DTT module through /proc/dtt/ctrl and /proc/dtt/cpoints. The first file is used to send commands to the DTT module, commonly to modify the state of a CP. The second file, on the other hand, can be used to retrieve the state of crash points. It should be fairly easy to integrate the command in scripts to automate the testing process.

The ttutils command has the following format:

ttutils command [options]

The possible commands are:

• help: Display usage information.

• ver(sion): Display the version number of LKDTT.

• list|ls: Show registered crash points.

• set: Configure a Crash Point (CP). set takes the following options:
  -p cpoint_name: CP's name.
  -t cpoint_type: CP's crash type. Currently the available crash types are: none (do nothing), panic, bug (oops), exception (generates an invalid exception), loop (simulates a hang), and overflow.
  -c pass_num: Number of times the crash point has to be crossed before the failure associated with its type is induced. The default value for pass_num is 10.

• reset: Disable a CP. Besides, the associated counter that keeps track of the number of times the crash point has been traversed is also reset. The options available are:
  -p cpoint_name: CP's name.
  -f: Reset not only the CP's counter but also revert its type to none.

• add: Register a CP from a kernel module so that it can actually be used. This is aimed at modules that do not register the CPs inserted in their own code. Please note that registering does not imply activation. Activation is accomplished using set. add has two options:
  -p cpoint_name: CP's name.
  -n id: ID to be associated with the CP.

• rmv: Remove a CP registered using add. rmv has one single option:
  -p cpoint_name: CP's name.

3.4 Auxiliary tools

One of the main problems that arises when testing crash dumping solutions is that artfully inserting crash points in the kernel does not always suffice to recreate certain crash scenarios. Some execution paths are rarely trodden and the kernel has to be lured to take the right wrong way.

Besides, the tester may want the system to be in a particular state (ongoing DMA or certain memory and CPU usage levels, for example).

This is when the set of auxiliary tools included in LKDTT comes into play to reproduce the desired additional conditions.

A trivial example is memdrain (see Table 1 for a usage example), a small tool included in the LKDTT bundle. memdrain is a simple C program that reserves huge amounts of memory so that swapping is initiated. By doing so the kernel is forced to traverse the CPs inserted in the paging code, so that we can see how the crash dumping functionality behaves when a crash occurs in the middle of paging anonymous memory. (A sketch of such a tool is shown below.)
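As a rough sketch only (the actual memdrain source is not reproduced in this paper, and the 64MB chunk size below is an arbitrary assumption), a memory-drain helper along the described lines could look like this:

/* Hypothetical sketch of a memdrain-like helper -- not LKDTT's memdrain.
 * Allocates and touches memory until allocation fails, pushing the
 * system into swapping out anonymous pages. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (64UL * 1024 * 1024)      /* arbitrary 64MB chunks */

int main(void)
{
        unsigned long total = 0;
        char *p;

        while ((p = malloc(CHUNK)) != NULL) {
                memset(p, 0xaa, CHUNK); /* touch every page so it is really backed */
                total += CHUNK;
                printf("reserved %lu MB\n", total >> 20);
        }
        puts("allocation failed; holding memory (Ctrl-C to exit)");
        for (;;)
                pause();
}

Depending on the overcommit policy, the loop may be stopped by the OOM killer rather than by a failed malloc.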


3.5 Installation

First, the kernel patch has to be applied:

# cd <PATH_TO_KERNEL_X.Y.Z>
# zcat <PATH_TO_PATCH>/dtt-full-X.Y.Z.patch.gz | patch -p1

Once the patch has been applied we can proceed to configure the kernel as usual, but making sure that we select the options indicated below:

# make menuconfig
Kernel hacking --->
    Kernel debugging        [*]
    Kernel Hook support     [*] or [M]
    Crash points            [*] or [M]

The final steps consist of compiling the kernel and rebooting the system:

# make
# make modules_install
# make install
# shutdown -r now

The user-space tools (ttutils and the auxiliary tools) can be installed as follows:

# tar xzf dtt_tools.tar.gz
# cd dtt_tools
# make

4 Test results

The results of some tests carried out with LKDTT against LKCD and two different versions of the vanilla kernel with kdump enabled can be seen in Table 2.

For each crash point all the crash types supported by LKDTT were tried: oops, panic, exception, hang, and overflow. The meaning of the crash points used during the tests is explained in Section 3.2.1.

The specifications of the test machine are as follows:

• CPU type: Intel Pentium 4 Xeon with Hyper-Threading.

• Number of CPUs: 2.

• Memory: 1GB.

• Disk controller: ICH5 Serial ATA.

The kernel was compiled with the options below turned on (when available): CONFIG_PREEMPT, CONFIG_PREEMPT_BKL, CONFIG_DETECT_SOFTLOCKUP, CONFIG_4KSTACKS, CONFIG_SMP. And the kernel command line for the kdump tests was:

root=/dev/sda1 ro crashkernel=32M@16M nmi_watchdog=1 console=ttyS0,38400 console=tty0

The test results in Table 2 are indicated using the conventions below:

• O: Success.

• O(nrbt): The system recovered from the induced failure. However, a subsequent reboot attempt failed, leaving the machine hung.

• O(nrbt, nmiw): The dump image was captured successfully but the system hung when trying to reboot. The NMI watchdog detected this hang and, after determining that the crash dump had already been taken, tried to reboot the system again.


Crash point         Crash type  LKCD 6.1.0      kdump 2.6.13-rc7  kdump 2.6.16
INT_HARDWARE_ENTRY  panic       X               O                 O
                    oops        X(nmiw, nrbt)   X(nmiw)           O
                    exception   X(nmiw, nrbt)   X(nmiw)           O
                    hang        X               O                 O
                    overflow    X               X                 X
INT_HW_IRQ_EN       panic       X               O                 O
                    oops        X               O                 X(2c)
                    exception   X               O                 O
                    hang        X               X                 X
                    overflow    X               X                 X
INT_TASKLET_ENTRY   panic       O(nrbt, nmiw)   O                 O
                    oops        O(nrbt, nmiw)   O                 O
                    exception   O(nrbt, nmiw)   O                 O
                    hang        X               X(SysRq)          X(det, SysRq)
                    overflow    X               X                 X
TASKLET             panic       O(nrbt, nmiw)   O                 O
                    oops        O(nrbt, nmiw)   O                 O
                    exception   O(nrbt, nmiw)   O                 O
                    hang        O(nrbt, nmiw)   O                 O
                    overflow    X               X                 X
FS_DEVRW            panic       O(nrbt, nmiw)   O                 X(2c)
                    oops        O(nrbt, nmiw)   O                 X(log, SysRq)
                    exception   O(nrbt, nmiw)   O                 X(log, SysRq)
                    hang        X               X(SysRq)          X(SysRq)
                    overflow    X               X                 X
MEM_SWAPOUT         panic       O(nrbt, nmiw)   O                 O
                    oops        O(nrbt, nmiw)   O                 O(nrbt)
                    exception   O(nrbt, nmiw)   O                 O(nrbt)
                    hang        X               X                 X(unk, SysRq, 2c)
                    overflow    X               X                 X
TIMERADD            panic       O(nrbt, nmiw)   O                 O
                    oops        O(nrbt, nmiw)   O                 O
                    exception   O(nrbt, nmiw)   O                 O
                    hang        X               O                 O
                    overflow    X               X                 X
SCSI_DISPATCH_CMD   panic       X               O                 O
                    oops        X               O                 O
                    exception   X               O                 O
                    hang        X               X(SysRq)          X(det, SysRq)
                    overflow    X               X                 X

Table 2: LKDTT results


• X: Failed to capture dump.

• X(2c): After the crash, control of the system was handed to the capture kernel, but it crashed due to a device initialization problem.

• X(SysRq): The crash was not detected by the kernel, but the dump process was successfully started using the SysRq key.
Note: Often, when plugged in after the crash, the keyboard does not work and the SysRq key is not effective as a trigger for the dump.

• X(SysRq, 2c): Like the previous case, but the capture kernel crashed trying to initialize a device.

• X(det, SysRq): The hang was detected by the soft lockup watchdog (CONFIG_DETECT_SOFTLOCKUP). Since this watchdog only notifies about the lockup without taking any active measures, the dump process had to be started using the SysRq key.
Note: Even though the dump was successfully captured, the result was marked with an X because it required user intervention. The action to take upon lockup should be configurable.

• X(log, SysRq): The system became increasingly unstable, eventually making it impossible to log into the system anymore (the prompt did not return after introducing the login name and password). After the system locked up like this, the dump process had to be initiated using the SysRq key, because neither the NMI watchdog nor the soft lockup watchdog could detect any anomaly.

• X(nmiw): The error was detected but the crashing kernel failed to hand control of the system to the crash dumping mechanism and hung. This hang was subsequently detected by the NMI watchdog, which succeeded in invoking the crash dumping functionality. Finally, the dump was successfully captured.
Note: The result was marked with an X because the NMI watchdog sometimes fails to start the crash dumping process.

• X(nmiw, nrbt): Like the previous case, but after capturing the dump the system hung trying to reboot.

• X(unk, SysRq, 2c): The auxiliary tool used for the test (see Section 3.4) became unkillable. After triggering the dump process using the SysRq key, the capture kernel crashed attempting to reinitialize a device.

4.1 Crash points

Even though testers are free to add new CPs, LKDTT is furnished with a set of essential CPs, that is, crash scenarios considered basic and that should always be tested. The list follows:

IRQ handling with IRQs disabled (INT_HARDWARE_ENTRY) This CP is crossed whenever an interrupt that is to be handled with IRQs disabled occurs (see code listing 1).

IRQ handling with IRQs enabled (INT_HW_IRQ_EN) This is the equivalent of the previous CP with interrupts enabled (see code listing 2).

Tasklet with IRQs disabled (TASKLET) If this CP is active, crashes during the servicing of Linux tasklets with interrupts disabled can be recreated.


fastcall unsigned int __do_IRQ(unsigned int irq,
                               struct pt_regs *regs)
{
        .....
        CPOINT(INT_HARDWARE_ENTRY);
        for (;;) {
                irqreturn_t action_ret;

                spin_unlock(&desc->lock);

                action_ret = handle_IRQ_event(irq, regs, action);
                .....
        }

Listing 1: INT_HARDWARE_ENTRY crash point (kernel/irq/handle.c)

Tasklet with IRQs enabled (INT_TASKLET_ENTRY) Using this CP it is possible to cause a crash when the kernel is in the middle of processing a tasklet with interrupts enabled.

Block I/O (FS_DEVRW) This CP is used to bring down the system while the file system is performing low-level access to block devices (see code listing 3).

Swap-out (MEM_SWAPOUT) This CP is located in the code that tries to allocate space for anonymous process memory.

Timer processing (TIMERADD) This is a CP situated in the code that starts and re-starts high resolution timers.

SCSI command (SCSI_DISPATCH_CMD) This CP is situated in the SCSI command dispatching code.

fastcall int handle_IRQ_event(.....

        if (!(action->flags & SA_INTERRUPT)) {
                local_irq_enable();
                CPOINT(INT_HW_IRQ_EN);
        }

        do {
                ret = action->handler(irq, action->dev_id, regs);
                if (ret == IRQ_HANDLED)
                        status |= action->flags;
                .....
}

Listing 2: INT_HW_IRQ_EN crash point (kernel/irq/handle.c)

IDE command (IDE_CORE_CP) This CP is situated in the code that handles I/O operations on IDE block devices.

5 Interpretation of the results and possible improvements

5.1 In-kernel crash dumping mechanisms (LKCD)

The primary cause of the bad results obtained by LKCD, and in-kernel crash dumping mechanisms in general, is the flawed assumption that the kernel can be trusted and will in fact be operating in a normal fashion. This creates two major problems.

First, there is a problem with resources, notably with resources locking up, because it is not possible to know the locking status at the time of the crash. LKCD uses drivers and services of the crashing kernel to capture the dump.


void ll_rw_block(int rw, int nr,
                 struct buffer_head *bhs[])
{
        .....
                                get_bh(bh);
                                submit_bh(rw, bh);
                                continue;
                        }
                }
                unlock_buffer(bh);
                CPOINT(FS_DEVRW);
        }
}

Listing 3: FS_DEVRW crash point (fs/buffer.c)

As a consequence, if the operation that caused the crash was locking resources necessary to capture the dump, the dump operation will end up deadlocking. For example, the driver for the dump device may try to obtain a lock that was held before the crash occurred and, as it will never be released, the dump operation will hang. Similarly, on SMP systems, as operations being run on other CPUs are forced to stop in the event of a crash, there is the possibility that resources needed during the dumping process may be locked, because they were in use by one of the other CPUs and were not released before they halted. This may put the dump operation into a lockup too. Even if this doesn't result in a lock-up, insufficient system resources may also cause the dump operation to fail.

The source of the second problem is the reliability of the control tables, kernel text, and drivers. A kernel crash means that some kind of inconsistency has occurred within the kernel and that there is a strong possibility a control structure has been damaged. As in-kernel crash dump mechanisms employ functions of the crashed system for outputting the dump, there is the very real possibility that the damaged control structures will be referenced. Besides, page tables and CPU registers such as the stack pointer may be corrupted too, which can potentially lead to faults during the crash dumping process. In these circumstances, even if a crash dump is finally obtained, the resulting dump image is likely to be invalid, so that it cannot be properly analyzed.

For in-kernel crash dumping mechanisms there is no obvious solution to the memory corruption problems. However, the locking issues can be alleviated by using polling mode (as opposed to interrupt mode) to communicate with the dump devices.

Setting up a controllable dump route within the kernel is very difficult, and this is increasingly true as the size and complexity of the kernel grow. This is what sparked the appearance of methods capable of capturing a dump independently of the existing kernel.

5.2 Kdump

Even though kdump proved to be much more reliable than in-kernel crash dumping mechanisms, there are still issues in the three stages that constitute the dump process (see Section 2):

• Crash detection: hang detection, stack overflows, faults in the dump route.

• Minimal machine shutdown: stack overflows, faults in the dump route.

• Crash dump capture: device reinitialization, APIC reinitialization.

5.2.1 Stack overflows

In the event of a stack overflow, critical data that usually resides at the bottom of the stack is likely to be stomped on and, consequently, its use should be avoided.

In particular, in the i386 and IA64 architectures the macro smp_processor_id() ultimately makes use of the cpu member of struct thread_info, which resides at the bottom of the stack. x86_64, on the other hand, is not affected by this problem because it benefits from the use of the PDA infrastructure.
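To make the dependency concrete, the following fragment mirrors (but is not copied from) the i386 mechanism, with kernel context assumed: thread_info is found by masking the stack pointer, so anything read from it is only as trustworthy as the stack itself:

/* Sketch of why smp_processor_id() trusts the stack on i386: thread_info
 * sits at the bottom of the current stack and holds the cpu field. */
static inline struct thread_info *current_thread_info_sketch(void)
{
        unsigned long sp;

        asm volatile("movl %%esp, %0" : "=r" (sp));
        return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}

#define smp_processor_id_sketch()  (current_thread_info_sketch()->cpu)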

Kdump makes heavy use of smp_processor_id() in the reboot path to the second kernel, which can lead to unpredictable behaviour. This issue is particularly serious in SMP systems because not only the crashing CPU but also the rest of the CPUs are highly dependent on likely-to-be-corrupted stacks. The reason is that during the minimal machine shutdown stage (see Section 2.2 for details) NMIs are used to stop the CPUs, but the NMI handler was designed on the premise that stacks can be trusted. This obviously does not hold good in the event of a stack overflow.

The NMI handler (see code listing 4) uses the stack indirectly through nmi_enter(), smp_processor_id(), default_do_nmi(), nmi_exit(), and also through the crash-time NMI callback function (crash_nmi_callback()).

Even though the NMI callback function can easily be made stack-overflow-safe, the same does not apply to the rest of the code.

To circumvent some of these problems, at the very least the following measures should be adopted:

• Create a stack overflow-safe replacement for smp_processor_id, which could be called safe_smp_processor_id (there is already an implementation for x86_64). (A sketch of the idea follows this list.)

fastcall void do_nmi(struct pt_regs *regs,
                     long error_code)
{
        int cpu;

        nmi_enter();

        cpu = smp_processor_id();
        ++nmi_count(cpu);

        if (!rcu_dereference(nmi_callback)(regs, cpu))
                default_do_nmi(regs);

        nmi_exit();
}

Listing 4: do_nmi (i386)

• Substitute smp_processor_id with safe_smp_processor_id, which is stack overflow-safe, in the reboot path to the second kernel.

• Add a new NMI low-level handling routine (crash_nmi) in arch/*/kernel/entry.S that invokes a stack overflow-safe NMI handler (do_crash_nmi) instead of do_nmi.

• In the event of a system crash, replace the default NMI trap vector so that the new crash_nmi is used.
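For illustration, a stack-overflow-safe CPU lookup can bypass the on-stack thread_info entirely by asking the local APIC which processor is executing. The sketch below is modelled on the existing x86_64 implementation mentioned above and assumes that hard_smp_processor_id() and the x86_cpu_to_apicid[] mapping are available; it is not the patch actually proposed here:

/* Sketch of a stack-overflow-safe smp_processor_id() replacement:
 * instead of reading current_thread_info()->cpu from a possibly
 * overflowed stack, read the local APIC ID and map it back to a
 * logical CPU number. */
int safe_smp_processor_id(void)
{
        int apicid, i;

        if (!cpu_has_apic)
                return 0;                       /* UP or no APIC: CPU 0 */

        apicid = hard_smp_processor_id();       /* read from the local APIC */

        for (i = 0; i < NR_CPUS; i++)
                if (x86_cpu_to_apicid[i] == apicid)
                        return i;

        return 0;                               /* fall back to the boot CPU */
}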

If we want to be paranoid (and being paranoid is what crash dumping is all about, after all), all the CPUs in the system should switch to new stacks as soon as a crash is detected. This introduces the following requirements:

• Per-CPU crash stacks: Reserve one stack per CPU for use in the event of a system crash. A CPU that has entered the dump route should switch to its respective per-CPU stack as soon as possible, because the cause of the crash might have been a stack overflow, and continuing to use the stack in such circumstances can lead to the generation of invalid faults (such as a double fault or invalid TSS). If this happens the system is bound to either hang or reboot spontaneously. In SMP systems, the rest of the CPUs should follow suit, switching stacks at the NMI gate (crash_nmi).

• Private stacks for NMIs: The NMI watchdog can be used to detect hard lockups and invoke kdump. However, this dump route consumes a considerable amount of stack space, which could cause a stack overflow, or contribute to further bloating the stack if it has already overflowed. As a consequence of this, the processor could end up faulting inside the NMI handler, which is something that should be avoided at any cost. Using private NMI stacks would minimize these risks.

To limit the havoc caused by bloated stacks, the fact that a stack is about to overflow should be detected before it spills out into whatever is adjacent to it. This can be achieved in two different ways:

• Stack inspection: Check the amount of free space in the stack every time a given event, such as an interrupt, occurs. This could be easily implemented using the kernel's stack overflow debugging infrastructure (CONFIG_DEBUG_STACKOVERFLOW). (A sketch of such a check follows this list.)

• Stack guarding: The second approach is adding an unmapped page at the bottom of the stack so that stack overflows are detected at the very moment they occur. If a small impact on performance is considered acceptable this is the best solution.
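The interrupt-time check mentioned under stack inspection might look roughly like the following; it is only a sketch in the spirit of CONFIG_DEBUG_STACKOVERFLOW (kernel context assumed), with the 1KB warning threshold chosen arbitrarily:

/* Sketch of an interrupt-time stack-usage check for i386, where the
 * stack area is THREAD_SIZE-aligned.  Warn when little space is left. */
static inline void check_stack_usage(void)
{
        unsigned long sp, left;

        asm volatile("movl %%esp, %0" : "=r" (sp));

        /* distance from the stack pointer down to the thread_info area */
        left = sp & (THREAD_SIZE - 1);
        if (left < 1024)
                printk(KERN_WARNING
                       "stack nearly exhausted: %lu bytes left\n", left);
}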

5.2.2 Faults in the dump route

Critical parts of the kernel, such as fault handlers, should not make assumptions about the state of the stack. An example where proper checking is neglected can be observed in code listing 5. The mm member of the task structure (tsk) is dereferenced without making any checks on the validity of current. If current happens to be invalid, the seemingly inoffensive dereference can lead to recursive page faults or, if things go really bad, to a triple fault and a subsequent system reboot.

fastcall void __kprobes
do_page_fault(struct pt_regs *regs,
              unsigned long error_code)
{
        struct task_struct *tsk;
        struct mm_struct *mm;

        tsk = current;
        .....
        [ no checks are made on tsk ]
        mm = tsk->mm;
        .....
}

Listing 5: do_page_fault (i386)

Finally, to avoid risks, control should be handed to kdump as soon as a crash is detected. The invocation of the dump mechanism should not be deferred to panic or BUG, because many things can go bad before we get there. For example, it is not guaranteed that the possible code paths never use any of the things that make assumptions about the current stack.

5.2.3 Hang detection

The current kernel has the necessary infrastructure to detect hard lockups and soft lockups, but both have some issues:


• Hard lockups: This type of lockup is detected using the NMI watchdog, which periodically checks whether tasks are being scheduled (i.e. the scheduler is alive). The Achilles' heel of this detection method is that it is strongly vulnerable to stack overflows (see Section 5.2.1 for details). Besides, there is one inevitable flaw: hangs in the NMI handler cannot be detected.

• Soft lockups: There is a soft lockup detection mechanism implemented in the kernel that, when enabled (CONFIG_DETECT_SOFTLOCKUP=y), starts per-CPU watchdog threads which try to run once per second. A callback function in the timer interrupt handler checks the elapsed time since the watchdog thread was last scheduled, and if it exceeds 10 seconds it is considered a soft lockup.
Currently, the soft lockup detection mechanism limits itself to just printing the current stack trace and a simple error message. But the possibility of triggering the crash dumping process instead should be available.

Using LKDTT, a case in which the existing mechanisms are not effective was discovered: a hang with interrupts enabled (see IRQ handling with IRQs enabled in Section 4.1). In such a scenario timer interrupts continue to be delivered and processed normally, so both the NMI watchdog and the soft lockup detector end up judging that the system is running normally.

5.2.4 Device reinitialization

There are cases in which, after the crash, the capture kernel itself crashes attempting to initialize a hardware device.

In the event of a crash kdump does not do any kind of device shutdown and, what is more, the firmware stages of the standard boot process are also skipped. This may leave the devices in a state the second kernel cannot get them out of. The underlying problem is that the soft boot case is not handled by most drivers, which assume that only traditional boot methods are used (after all, many of the drivers were written before kexec even existed) and that all devices are in a reset state.

Sometimes even after a regular hardware reboot the devices are not reset properly. The culprit in such cases is a BIOS not doing its job properly.

To solve these issues the device driver model should be improved so that it accounts for the soft boot case, and kdump in particular. On some occasions it might be impossible to reinitialize a certain device without knowing its previous state. So it seems clear that, at least in some cases, some type of information about the state of devices should be passed to the second kernel. This brings the power management subsystem to mind, and in fact studying how it works could be a good starting point for solving the device reinitialization problem.

In the meantime, to minimize risks, each machine could have a dump device (a HD or NIC) set aside for crash dumping, so that the crash kernel would use that device and have no other devices configured.

5.2.5 APIC reinitialization

Kdump defers much of the job of actually saving the dump image to user space. This means that kdump relies on the scheduler and, consequently, the timer interrupt to be able to capture a dump.

This dependency on the timer represents a problem, especially on i386 and x86_64 SMP systems. Currently, on these architectures, during the initialization of the kernel the legacy i8259 must exist and be set up correctly, even if it will not be used past this stage. This implies that, in APIC systems, before booting into the second kernel the interrupt mode must return to legacy. However, doing this is not as easy as it might seem because the location of the i8259 varies between chipsets and the ACPI MADT (Multiple APIC Description Table) does not provide this information. The return to legacy mode can be accomplished in two different ways:

• Save/restore BIOS APIC states: All the APIC states are saved early in the boot process of the first kernel, before the kernel attempts to initialize them, so that the APIC configuration as performed by the BIOS can be obtained. In the event of a crash, the BIOS APIC settings are restored before booting into the capture kernel. Treating the APICs as black boxes like this has the benefit that the original states of the APICs can be restored even on systems with a broken BIOS. Besides, this method is theoretically immune to changes in the default configuration of APICs in new systems. There is one issue with this method though: it makes sure that the BIOS-designated boot CPU will always see timer interrupts in legacy mode, but this does not hold if the second kernel boots on some other CPU, as is possible with kdump. Therefore, for this method to work CPU relocation is necessary. It should also be noted that under certain rather unlikely circumstances relocation might fail (see Section 7.4 for details).

• Partial save/restore: Only the information that cannot be obtained any other way (i.e. the i8259's location) is saved off at boot time. Upon a crash, taking this piece of information into account, the APICs are reconfigured in such a way that all interrupts get redirected to the CPU on which the second kernel is going to be booted, which in kdump's case is the CPU where the crash occurred. This is the approach adopted by kdump.

6 LKDTT status and TODOs

Even though LKDTT makes it possible to test the first two stages of the crash dumping process rather thoroughly, that is, crash detection and minimal machine crash shutdown (see Section 2), the capture kernel is not yet being sufficiently tested. The author is currently working on the following test cases:

• Pending IRQs: Leave the system with pending interrupts before booting into the capture kernel, so that the robustness of device drivers against interrupts coming at an unexpected time can be tested.

• Device reinitialization: For each device, test whether it is correctly initialized after a soft boot.

Another area that is under development at the moment is test automation. However, due to the special nature of the functionality being tested, there is a big roadblock for automation: the system does not always recover gracefully from crashes so that tests can resume. That is, on some occasions the crash dumping mechanism being tested will fail, or the system will hang while trying to reboot after capturing the dump. In such cases human intervention will always be needed.

7 Other kdump issues

The kernel community has been expecting that the various groups which are interested in crash dumping would converge around kdump once it was merged, and the same was expected from end-users and distributors. However, so far, this has not been the case and work has continued on other strategies.

The causes of this situation are diverse and, to a great extent, unrelated to reliability aspects. Instead, the main issues have to do with availability, applicability, and usability. In some cases it is just a matter of time before they get naturally solved but, in others, improvements need to be made to kdump.

7.1 Availability and applicability

Most people use distribution-provided kernels that do not ship with kdump yet. Certainly, distributions will eventually catch up with the mainstream kernel and this problem will disappear.

But, in the meantime, there are users who would like to have a reliable crash dumping mechanism for their systems. This is especially the case for enterprise users, but they usually have the problem that updating or patching the kernel is not an option, because that would imply the loss of official support for their enterprise software (this includes DBMSs such as Oracle or DB2 and the kernel itself). It is an extreme case, but some enterprise systems cannot even afford the luxury of a system reboot.

This situation, along with the discontent with the crash dumping solutions provided by distributors, sparked the appearance of other kexec-based projects (such as mkdump and Tough Dump), which were targeting not only mainstream adoption but also existing Linux distributions. This is why these solutions sometimes come in two flavors: a kernel patch for vanilla kernels and a fully modularized version for distribution kernels.

7.2 Usability

There are some limitations in kdump that have a strong impact on its usability, which affects both end-users and distributors, as discussed below.

7.2.1 Hard-coding of the reserved area's start address

To use kdump it is necessary to reserve a memory region big enough to accommodate the dump kernel. The start address and size of this region are indicated at boot time with the command line parameter crashkernel=Y@X, Y denoting how much memory to reserve and X indicating at what physical address the reserved memory region starts. The value of X has to be used when configuring the capture kernel, so that it is linked to run from that start address. This means a displacement of the reserved area may render the dump kernel unusable. Besides, it is not guaranteed that the memory region indicated on the command line is available to the kernel. For example, it could happen that the memory region does not exist, or that it overlaps system tables, such as ACPI's. All these issues make distribution of pre-compiled capture kernels cumbersome.
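For illustration only (this exact configuration is not taken from the paper), a first kernel booted with crashkernel=128M@16M reserves 128 MiB of memory starting at physical address 16 MiB, and the capture kernel must then be configured to link and run from that same 16 MiB start address; moving the reservation without rebuilding the capture kernel would leave it unbootable.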

This undesirable dependency between the second and first kernel can be broken using a run-time relocatable kernel. The reason is that, by definition, a run-time relocatable kernel can run from any dedicated memory area the first kernel might reserve for it. To achieve run-time relocation, a relocation table has to be added to the kernel binary, so that the actual relocation can be performed either by a loader (such as kexec) or even by the kernel itself. The first calls for making the kernel an ELF shared object. The second can be accomplished by resolving all the symbols in arch/*/kernel/head.S (this is what mkdump does).


7.2.2 Memory requirements

Leaving the task of writing out the crash dump to user space introduces great flexibility at the cost of increasing the size of the memory area that has to be reserved for the capture kernel. But for systems with memory restrictions (such as embedded devices), a really small kernel with just the necessary drivers and no user space may be more appropriate. This connects with the following point.

7.3 Kernel-space based crash dumping

After a crash the dump capture kernel might not be able to restore interrupts to a usable state, whether because the system has a broken BIOS or because the interrupt controller is buggy. In such circumstances, processors may end up not receiving timer interrupts. Besides, the possibility of a timer failure should not be discarded either.

In any case, being deprived of timer interrupts is an insurmountable problem for user-space based crash dumping mechanisms such as kdump, because they depend on a working scheduler and hence the timer.

To tackle this problem a kernel-space driven crash dumping mechanism could be used, and it could even cohabit with the current user-space centered implementation. Which one to employ could be made configurable or, alternatively, the kernel-space solution could be used as a fallback mechanism in case of failure to bring up user space.

7.4 SMP dump capture kernel

On some architectures, such as i386 and x86_64, it is not possible to boot an SMP kernel from a CPU that is not the BIOS-designated boot CPU. Consequently, to do SMP in the capture kernel it is necessary to relocate to the boot CPU beforehand. Kexec achieves CPU relocation using scheduler facilities, but kdump cannot use the same approach because after a crash the scheduler cannot be trusted.

As a consequence, to make kdump SMP-capable a different solution is needed. In fact, there is a very simple method to relocate to the boot CPU that takes advantage of inter-processor NMIs. As discussed in Section 2.2 (Minimal machine shutdown), this type of NMI is issued by the crashing CPU in SMP systems to stop the other CPUs before booting into the capture kernel. But this behavior can be modified so that relocation to the boot CPU is performed too. Obviously, if the crashing CPU is the boot CPU nothing needs to be done. Otherwise, upon receiving the NMI the boot CPU should assume the task of capturing the dump, so that the NMI-issuing CPU (i.e. the crashing CPU) is relieved of that burden and can halt instead. This is the CPU relocation mechanism used by mkdump.

Even though being able to do SMP would boost the performance of the capture kernel, it was suggested that in some extreme crash cases the boot CPU might not even respond to NMIs and, therefore, relocation to the boot CPU will not be possible. However, after digging through the manuals the author could only find (and reproduce using LKDTT) one such scenario, which occurs when the two conditions below are met:

• The boot CPU is already attending a different NMI (from the NMI watchdog, for example) at the time the inter-processor NMI arrives.

• The boot CPU hangs inside the handler of this previous NMI, so it does not return.

Page 174: Proceedings of the Linux Symposium Volume One · 2006-07-19 · Eric W. Biederman Fully Automated Testing of the Linux Kernel 113 ... Kristen Carlson Accardi Open Source Technology

174 • Evaluating Linux Kernel Crash Dumping Mechanisms

The explanation is that while a CPU is servicing an NMI other NMIs are blocked, so a lockup in the NMI handler guarantees a system hang if relocation is attempted as described before. The possibility of such a hang seems remote and easy to evaluate, but it could also be seen as a trade-off between performance and reliability.

8 Conclusion

Existing testing methods for kernel crash dump capturing mechanisms are not adequate because they do not take into account the state of the hardware and the load conditions of the system. This makes it impossible to recreate many common crash scenarios, depriving test results of much of their validity. Solving these issues and providing a controllable testing environment were the major motivations behind the creation of the LKDTT (Linux Kernel Dump Test Tool) testing project.

Even though LKDTT showed that kdump is more reliable than traditional in-kernel crash dumping solutions, the test results revealed some deficiencies in kdump too. Among these, minor hang detection deficiencies, great vulnerability to stack overflows, and problems reinitializing devices in the capture kernel stand out. Solutions to some of these problems have been sketched in this paper and are currently under development.

Since the foundation of the testing project the author has observed that new kernel releases (including release candidates) are sometimes accompanied by regressions. Regressions constitute a serious problem for both end-users and distributors, one that requires regular testing and standardised test cases to be tackled properly. LKDTT aims at filling this gap.

Finally, several hurdles that are hampering the adoption of kdump were identified, the need for a run-time relocatable kernel probably being the most important of them.

All in all, it can be said that as far as kernel crash dumping is concerned Linux is heading in the right direction. Kdump is already very robust and most of the remaining issues are already being dealt with. In fact, it is just a matter of time before kdump becomes mature enough to focus on new fields of application.

9 Future lines of work

All the different crash dumping solutions do just that after a system crash: capture a crash dump. But there is more that kexec can do for a crashed system than crash dumping. For example, in high availability environments it may be desirable to notify the backup system after a crash, so that the failover process can be initiated earlier.

In the future, kdump could also benefit from the current PID virtualization efforts, which will provide the foundation for process migration in Linux. The process migration concept could be extended to the crash case, in such a way that, after some sanity-checking, tasks that have not been damaged can be migrated and resume execution on a different system.

Acknowledgements

I would like to express my gratitude to Itsuro Oda for his original contribution to LKDTT and valuable comments, as well as to all those who have laid the foundation for a reliable kernel crash dumping mechanism in Linux.

Page 175: Proceedings of the Linux Symposium Volume One · 2006-07-19 · Eric W. Biederman Fully Automated Testing of the Linux Kernel 113 ... Kristen Carlson Accardi Open Source Technology

2006 Linux Symposium, Volume One • 175



Exploring High Bandwidth Filesystems on Large Systems

Dave Chinner and Jeremy Higdon
Silicon Graphics, Inc.

[email protected] [email protected]

Abstract

In this paper we present the results of an investigation conducted by SGI into streaming filesystem throughput on the Altix platform with a high bandwidth disk subsystem.

We start by describing some of the background that led to this project and our goals for the project. Next, we describe the benchmark methodology and hardware used in the project. We follow this up with a set of baseline results and observations using XFS on a patched 2.6.5 kernel from a major distribution.

We then present the results obtained from XFS, JFS, Reiser3, Ext2, and Ext3 on a recent 2.6 kernel. We discuss the common issues that we found to adversely affect throughput and reproducibility and suggest methods to avoid these problems in the future.

Finally, we discuss improvements and optimisations that we have made and present the final results we achieved using XFS. From these results we reflect on the original goals of the project, what we have learnt from the project and what the future might hold.

1 Background and Goals

In the past, there have been many comparisons of the different filesystems supported by Linux. Most of these comparisons focus on activities typically performed by a kernel developer or use well known benchmark programs. Typically these tests are run on an average desktop machine with a single disk or, more rarely, a system with two or four CPUs with a RAID configuration of a few disks.

However, this really doesn't tell us anything about the maximum capabilities of the filesystems; these machine configurations don't push the boundaries of the filesystems and hence these observations have little relevance to those who are trying to use Linux in large configurations that require substantial amounts of I/O.

Over the past two years, we have seen a dramatic increase in the bandwidth customers require new machines to support. On older, modified 2.4.21 kernels, we could not achieve much more than 300 MiB/s on parallel buffered write loads. Now, on patched 2.6.5 kernels, customers are seeing higher than 1 GiB/s under the same loads. And, of course, there are customers who simply want all the I/O bandwidth we can provide.

The trend is unmistakable. A coarse correlation is that required I/O bandwidth matches the amount of memory in a large machine. Memory capacity is increasing faster than physical disk transfer rates are increasing, and this means that systems are being attached to larger numbers of disks in the hope that this provides higher throughput to populate and drain memory faster. Unfortunately, what we currently lack is any data on whether Linux can make use of the increased bandwidth that larger disk farms provide.

Some of the questions we need to answer include:

• How close to physical hardware limits can we push a filesystem?

• How stable is Linux under these loads?

• How does the Linux VM stand up to this sort of load?

• Do the Device Mapper (DM) and/or Multiple Device (MD) drivers limit performance or configurations?

• Are there NUMA issues we need to address?

• Do we have file fragmentation problems under these loads?

• How easily reproducible are the results we achieved and can we expect customers to be able to achieve them?

• What other bottlenecks limit the performance of a system?

To answer these questions, as they are important to SGI's customers, we put together a modestly sized machine to explore the limits of high-bandwidth I/O on Linux.

2 Test Hardware and Methodology

2.1 Hardware

The test machine was an Altix A3700 containing 24 Itanium2 CPUs running at 1.5 GHz in 12 nodes in a single cache-coherent NUMA domain. Each node is configured with 2 GiB of RAM for a system total of 24 GiB. Each node has 6.4 GB/s peak full duplex external interconnect bandwidth provided by SGI's NUMALink interconnect. A total of 12 I/O nodes, each with three 133 MHz PCI-X slots on two busses, were connected to the NUMALink fabric supplying 6.4 GB/s peak full duplex bandwidth per I/O node. The CPU and I/O nodes were connected via crossbar routers in a symmetric topology.

The I/O nodes were populated with a mix of U320 SCSI and Fibre Channel HBAs (64 SCSI controllers in total), and 256 disks were distributed amongst the controllers in JBOD configuration. This provided an infrastructure that allowed each disk to run at close to its maximum read or write bandwidth independently of any other disk in the machine.

The result is a machine with a disk subsystem theoretically capable of just over 11.5 GiB/s of throughput evenly distributed throughout the NUMALink fabric. Hence the hardware should be able to sustain maximum disk rates if the software is able to drive it that fast.

2.2 Methodology

The main focus of our investigation was on XFS performance. In particular, parallel sequential I/O patterns were of most interest as these are the most common patterns we see our customers using on their large machines. We also assessed how XFS compares with other mainstream Linux filesystems for these workloads.

The main metrics we used to compare performance were aggregate disk throughput and CPU usage. We used multiple programs and independent test harnesses to validate the results against each other so we had confidence in the results of individual test runs that weren't replicated.

To be able to easily compare different configurations and kernels, we present normalised I/O efficiency results along with the aggregate throughput achieved. This gives an indication of the amount of CPU time being expended for each unit of throughput achieved. The unit of efficiency reported is CPU/MiB/s, or the percentage of a CPU consumed per mebibyte per second of throughput. The lower the calculated number, the better the efficiency of the I/O executed.
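As an illustrative calculation (not a measured result from this paper): a workload that consumes four full CPUs (400% CPU) while sustaining 800 MiB/s scores 400 / 800 = 0.5 %CPU per MiB/s, and halving the CPU cost at the same throughput would halve that figure.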

The tests were run with file sizes large enough to make run times long enough to ensure that measurement was accurate to at least 1%. This, combined with running the tests in a consistent (scripted) manner, enabled us to draw conclusions about the reproducibility of the results obtained.

For most of the tests run, we used SGI's Performance Co-Pilot infrastructure [PCP] to capture high resolution archives of the system's behaviour during tests. This included disk utilisation and throughput, filesystem and volume manager behaviour, memory usage, CPU usage, and much more. We were able to analyse these archives after the fact, which gave us great insight into system wide behaviour during the testing.

To find the best throughput under different conditions, we varied many parameters during testing. These included:

• different volume configurations

• the effect of I/O size on throughput and CPU usage

• buffered I/O and direct I/O

• different allocation methods for writes

• block device readahead

• filesystem block size

• pdflush tunables

• NUMA allocation methods

We tested several different kernels so we could chart improvements or regressions over time that our customers would see as they upgraded. Hence we tested XFS on SLES9 SP2, SLES9 SP3, and 2.6.15-rc5.

We also ran a subset of the above tests on other Linux filesystems including Ext2, Ext3, ReiserFS v3, and JFS. We kept as many configuration parameters as possible constant across these tests. Where supported, we used mkfs and mount parameters that were supposed to optimise data transfer rates and large filesystem performance.

The volume size for Ext2, Ext3, and ReiserFS v3 was halved to approximately 4.2 TiB because they don't support sizes of greater than 8 TiB. We took the outer portion of each disk for this smaller volume, hence maintaining the same stripe configuration. Compared to the larger volume used by XFS and JFS, the smaller volume has lower average seek times and higher minimum transfer rates and hence should be able to maintain higher average throughputs than the larger volume as the filesystems fill up during testing.

The comparison tests were scripted to:

1. Run mkfs with relevant large filesystem optimisations.

2. Make a read file set with dd by writing out the files to be read back with increasing levels of parallelism.

3. Perform buffered read tests using one file per thread across a range of I/O sizes and thread counts, measuring throughput, CPU usage, average process run time, and other metrics required for analysis. The filesystem was unmounted and remounted between each test to ensure that all tests started without any cached filesystem data and with memory approximately 99% empty.

4. Repeat Step 3 using buffered write tests, including the time taken to truncate the file to be written in the overall test runtime.

Parallel writes were used to lay down the files for reading back to demonstrate the level of file fragmentation the filesystem suffered. The greater the fragmentation, the more seeking the disks will do and the lower the subsequent read rate achieved will be. Hence the read rate directly reflects the fragmentation resistance of the filesystem. This is also a best case result because the tests are being run on an empty filesystem.

Finally, after we fixed several of the worst problems we uncovered, we re-ran various tests to determine the effect of the changes on the system.

2.3 Volume Layout and Constraints

Achieving maximum throughput from a single filesystem required a volume layout that enabled us to keep every disk busy at the same time. In other words, we needed to distribute the I/O as evenly as possible.

Building a wide stripe was the easiest way to achieve even distribution since we were mostly interested in sequential I/O performance. This exposed a configuration limitation of DM; dmsetup was limited to a line length of 1024 characters, which meant we could only build a stripe approximately 90 disks wide.
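As a rough check of why that limit bites (the per-entry figure is our estimate, not a measurement from the paper): a striped dmsetup table names every member device together with its starting offset, so each disk costs on the order of a dozen or more characters on the table line. At that rate roughly 90 devices already approach the 1024-character limit, and a single 256-disk stripe would need a line several times longer than dmsetup would accept.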

Hence we ended up using a two level volume configuration where we had an MD stripe of 4 DM volumes, each with 64 disks. We used an MD stripe of DM volumes because it was unclear whether DM and dmsetup supported multi-level volume configurations.

[Figure 1: Baseline XFS Throughput. Throughput (MiB/s) versus thread count (1–32) for XFS-4k and XFS-16k, read and write.]

Using SGI's XVM volume manager, we were able to construct both a flat 256-disk stripe and a 4x64-disk multi-level stripe. Hence we were able to confirm that there was no measurable performance or disk utilisation difference between the two configurations.

Therefore we ran all the tests on the multi-level, MD-DM stripe volume layout. The only parameter that was varied in the layout was the stripe unit (and therefore stripe width), and most of the testing was done with stripe units of 512 KiB or 1 MiB.
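For reference, this means a full-width stripe spans 256 disks x 512 KiB = 128 MiB of data with the smaller stripe unit, or 256 MiB with the 1 MiB stripe unit; that is the amount of sequential I/O needed to touch every spindle once (a derived figure, not one quoted in the paper).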

3 Baseline XFS Results

Baseline XFS performance numbers were obtained from SuSE Linux Enterprise Server 9 Service Pack 2 (SLES9 SP2). We ran tests on XFS filesystems with both 4 KiB and 16 KiB block sizes. Performance varied little with I/O size, so the results presented used 128 KiB, which is in the middle of the test range.


Looking at read throughput, we can see from Figure 1 that there was very little difference between the different XFS filesystem configurations. In some cases the 16 KiB block size filesystem was faster, in other cases the 4 KiB block size filesystem was faster. Overall, they both averaged out at around 3.5 GiB/s across all block sizes.

In contrast, the 16 KiB block-size filesystem is substantially faster than the 4 KiB filesystem when writing. The 4 KiB filesystem appeared to be I/O bound as it was issuing much smaller I/Os than the 16 KiB filesystem and the disks were seeking significantly more.

From the CPU efficiency graph in Figure 2, we can see that there is no difference in CPU time expended by the filesystem for different block sizes on read. This was expected from the throughput results.

Both the read and write tests show that CPU usage is scaling linearly with throughput; increasing the number of threads doing I/O does not decrease the efficiency of the filesystem. In other words, we are limited either by the rate at which we can issue I/Os or by something else outside the filesystem. Also, write efficiency is substantially worse than read efficiency; it would seem that there is room for substantial improvement here.

4 Filesystem Comparison Results

The first thing to note about the results is that some of the filesystems were tested to higher numbers of threads and larger block sizes. The reasons for this were that some configurations were not stable enough to complete the whole test matrix and we had to truncate some of the longer test runs that would have prevented us from completing a full test cycle in our available time window. Consequently some of the results presented represent best-case performance rather than a mean of repeated test runs.

[Figure 2: Baseline XFS Efficiency. Efficiency (%CPU/MiB/s, smaller is better) versus thread count (1–32) for XFS-4k and XFS-16k, read and write.]

The kernel used for all these tests was 2.6.15-rc5.

4.1 Buffered Read Results

The maximum read rates achieved by each filesystem can be seen in Figure 3. The read rate changed very little with varying I/O block size; we saw the same maximum throughput using 4 KiB I/Os as using 1 MiB I/Os. The only real difference was the amount of CPU consumed.

It is worth noting that XFS read throughput is substantially higher on 2.6.15-rc5 compared to the baseline results on SLES9 SP2. A discussion of this improvement can be found in Section 6.2.

The performance of Ext2 and Ext3 is also quite different despite their common heritage. However, the results presented for Ext2 and Ext3 (as well as JFS) are the best of several test executions due to the extreme variability of the filesystem performance under these tests. The reasons for this variability are discussed in Section 5.2.


[Figure 3: Buffered Read Throughput Comparison. Throughput (MiB/s) versus thread count (1–32) for XFS-4k, XFS-16k, JFS, Ext2, Ext3, and Reiser3.]

It is clear that XFS and Ext3 give substantially better throughput, and this is reflected in the efficiency plots in Figure 4, where these are the most efficient filesystems. Both ReiserFS and JFS show substantial decreases in efficiency as thread count increases. This behaviour is discussed in Section 5.1.

4.2 Buffered Write Results

Figure 5 shows some very clear trends in buffered write throughput. Firstly, XFS is substantially slower than the SLES9 SP2 baseline results. Secondly, throughput peaks at four to eight concurrent writers for all filesystems except Ext2. XFS, using a 16 KiB filesystem block size, was still faster than Ext2 until high thread counts were reached.

The poor write throughput of Ext3 and JFS is worth noting. JFS was unable to exceed an average of 80 MiB/s write speed in all but two of the many test points executed, and Ext3 did not score above 250 MiB/s and decreased to less than 100 MiB/s at sixteen or more threads. We used the data=writeback mode for Ext3 as it was consistently 10% faster than the data=ordered mode.

[Figure 4: Buffered Read Efficiency Comparison. Efficiency (%CPU/MiB/s, smaller is better) versus thread count (1–32) for XFS-4k, XFS-16k, JFS, Ext2, Ext3, and Reiser3.]

The ReiserFS results are truncated due to problems running at higher thread counts. Writes would unexpectedly terminate without error, and sometimes the machine would hang. Due to time constraints this was not investigated further, but it is suspected that the cause was buffer initialisation problems which manifested on machines with both XFS and ReiserFS filesystems. The fixes did not reach the upstream kernel until well after testing had been completed [Scott][Mason].

JFS demonstrated low write throughput. We discovered that this was partially due to truncating a multi-gigabyte file taking several minutes to execute. However, the truncate time made up only half the elapsed time of each test. Hence, even if we disregarded the truncate time, JFS would still have had the lowest sustained write rate of all the filesystems.

Looking at the efficiency graph in Figure 6, we can see that only JFS and Ext2 had relatively flat profiles as the number of threads increased. However, the profile for JFS is relatively meaningless due to the low throughput. All the other filesystems show decreasing efficiency (increasing CPU time per MiB transferred to disk every second) at the same load points at which they also showed decreasing throughput. This is discussed further in Section 5.1.

[Figure 5: Buffered Write Throughput Comparison. Throughput (MiB/s) versus thread count (1–32) for XFS-4k, XFS-16k, JFS, Ext2, Ext3, and Reiser3.]

4.3 Direct I/O Results

Only XFS and Ext3 were compared for direct I/O due to time constraints. The tests were run over different block sizes and thread counts, and involved first writing a file per thread, then overwriting the file, and finally reading the file back again. A 512 KiB stripe unit was used for these tests.

Table 1 documents the maximum sustained throughput we achieved with these tests. Ext3 was fastest with only a single thread, but writes still fell a long way behind XFS. As the number of threads increased, Ext3 got slower and slower as it fragmented the files it was writing. At 18 threads, Ext3 direct I/O performance was between 10 and 20 times lower than for a single thread.

Threads  FS     Read   Write  Overwrite
1        XFS     5.5    4.0     7.5
1        Ext3    4.2    0.6     2.5
18       XFS    10.0    7.7     7.7
18       Ext3    0.58   0.06    0.12

Table 1: Sequential Direct I/O Throughput (GiB/s)

[Figure 6: Buffered Write Efficiency Comparison. Efficiency (%CPU/MiB/s, smaller is better) versus thread count (1–32) for XFS-4k, XFS-16k, JFS, Ext2, Ext3, and Reiser3.]

In contrast, from one to 18 threads XFS doubled its read and write throughput, and overwrite increased marginally from its already high single thread result. It is worth noting that the XFS numbers peaked substantially higher than the sustained throughput: reads peaked above 10.7 GiB/s, while writes and overwrites peaked at over 8.9 GiB/s.

5 Issues Affecting Throughput

5.1 Spinlocks in Hot Paths

One thing that is clear from the buffered I/O results is that global spinlocks in hot paths of a filesystem do not scale. Every journalled filesystem except JFS was limited by spinlock contention during parallel writes. In the case of JFS, it appeared to be some kind of sleeping contention that limited performance, and so the impact of contention on CPU usage was not immediately measurable. Both ReiserFS and JFS displayed symptoms of contention in their read paths as well.


From analysis of the contention on the XFS buffered write path, we found that the contended lock was not actually being held for very long. The fundamental problem is the number of calls being made. For every page we write on a 4 KiB filesystem, we are allocating four filesystem blocks. We do this in four separate calls to ->prepare_write(). Hence at the peak throughput of approximately 700 MiB/s, we are making roughly 180,000 calls per second that execute the critical section.
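As a rough cross-check of that figure (the arithmetic is ours, assuming the 16 KiB pages typical of these ia64 systems, i.e. four 4 KiB blocks per page and one ->prepare_write() call per block): 700 MiB/s divided by 4 KiB per call is about 179,200 calls per second, which matches the 180,000 quoted above.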

That gives us less than 5.6 microseconds to obtain the spinlock and execute our critical section to avoid contention. The code that XFS executes inside this critical section involves a function call, a memory read, two likely branches, a subtraction and a memory write. That is not a lot of code, but with enough CPUs trying to execute it in parallel it quickly becomes a bottleneck.

Of all the journalling filesystems, XFS appears to have the smallest global critical section in its write path. Filesystems that do allocation in the write path (instead of delaying it until later like XFS does) can't help but have larger critical sections here, and this shows in the throughput being achieved.

Looking to the future, we need to move away from allocating or mapping a block at a time in the generic write path to reduce the load on critical sections in the filesystems. While work is being done to reduce the number of block mapping calls on the read path, we need to do the same work for the write path. In the meantime, we have solved XFS's problem in a different way (see Section 6.1.2).

5.2 File Fragmentation and Reproducibility

From observation, the main obstacle in obtaining reproducible results across multiple test runs on each filesystem was file fragmentation. XFS was the only filesystem that almost completely avoided fragmentation of its working files. ReiserFS also seemed to be somewhat resistant to fragmentation, but the results are not conclusive due to the problems ReiserFS had writing files in parallel.

Ext2, Ext3 and JFS did not resist fragmentation at all well. From truncated test results, we know that the variation was extreme. A comparison of the best case results versus the worst case results for Ext2 can be seen in Table 2. Both Ext3 and JFS demonstrated very similar performance variation due to the different amounts of fragmentation of the files being read in each test run. While we present the best numbers we achieved for these filesystems, you should keep in mind that these are not consistently reproducible under real world conditions.

At the other end of the scale, the XFS results were consistently reproducible to within ±3%. This is due to the fact that we rarely saw fragmentation on the XFS filesystems and the disk allocation for each file was almost identical on every test run. Even when we did see fragmentation, the contiguous chunks of file data were never smaller than several gigabytes in size.

A further measure of fragmentation we used was the number of physical disk I/Os required to provide the measured throughput. In the case of XFS, we were observing stripe unit sized I/Os being sent to each disk (512 KiB) while sustaining roughly 13,000 disk I/Os per second to achieve 6.3 GiB/s.

In contrast, Ext2 and Ext3 were issuing approximately 60–70,000 disk I/Os per second to achieve 1.7 GiB/s and 4.5 GiB/s respectively. That equates to average I/O sizes of approximately 24 KiB and 56 KiB, with each disk executing more than 250 I/Os per second. The disks were seek bound rather than bandwidth bound. Sustained read throughput of less than 300 MiB/s at 60–70,000 disk I/Os per second with an average size of 4 KiB was not uncommon to see. This indicates worst case (single block) fragmentation in the filesystem. The same behaviour was seen with JFS as well.

Threads  Best Run  Worst Run
1         522.2     348.5
2         780.2      74.8
4        1130.3     105.0
8        1542.1     176.8

Table 2: Example of Ext2 Read Throughput Variability (MiB/s)

The source of the fragmentation on Ext2 and Ext3 would appear to be interleaved disk allocation when multiple files are written in parallel from multiple CPUs. This also occurred when running parallel direct I/O writes on Ext3 (see Table 1), so it would appear to be a general issue with the way Ext3 handles parallel allocation streams.

XFS solves this problem by decoupling disk block allocation from disk space accounting and then using well known algorithmic techniques to avoid lock contention and achieve write scaling.

The message being conveyed here is that most Linux filesystems do not resist fragmentation under parallel write loads. With parallelism hitting the mainstream now via multicore CPUs, we need to recognise that filesystems may not be as resistant to fragmentation under normal usage patterns as they were once recognised to be. This used to be a problem that only supercomputer vendors had to worry about...

5.3 kswapd and pdflush

While running single threaded tests, it was clear that there was something running in the background that was using more CPU time than the writer process and pdflush combined. A single threaded read from disk consuming a single CPU caused a further 10–15% of a CPU to be consumed on each node running memory reclaim via kswapd. For a single threaded write, this was closer to 30% of a CPU per node. On our twelve node machine, this meant that we were using between 1.5 and 3.5 CPUs to reclaim memory being allocated by a single CPU.

PID    State  %CPU  Name
23589  R        97  dd
345    R        88  kswapd7
344    R        83  kswapd6
23556  R        81  dd
348    R        80  kswapd10
346    R        79  kswapd8
347    R        77  kswapd9
339    R        76  kswapd1
349    R        74  kswapd11
343    R        72  kswapd5
23517  R        71  dd
23573  R        64  dd
338    R        64  kswapd0
23552  R        64  dd
23502  R        63  dd
340    S        63  kswapd2
23570  R        61  dd
23592  R        60  dd
341    R        57  kswapd3

Table 3: kswapd CPU usage during buffered writes

On buffered write tests, pdflush also appeared to be struggling to write out the dirty data. With a single write thread, pdflush would consume very little CPU; maybe 10% of a single CPU every five seconds. As the number of threads increased, however, pdflush quickly became overwhelmed. At four threads writing at approximately 1.5 GiB/s, pdflush ran permanently, consuming an entire CPU.

At eight or more write threads, pdflush consumed CPU time only sporadically; instead the kswapd CPU usage jumped from 30% of a CPU to 70–80% of a CPU per node. This can be seen in Table 3.

Threads  Average I/O Size
1        1000 KiB
2         450 KiB
4         400 KiB
8         250 KiB
16        200 KiB
32        220 KiB

Table 4: I/O size during buffered writes

Monitoring of the disk level I/O patterns indicated that writeback was occurring from the LRU lists rather than in file offset order from pdflush. This could also be seen in the I/O sizes that were being issued to disk, as shown in Table 4, as the thread count increased.

This is clearly not scalable writeback and memory reclaim behaviour; we need reclaim to consume less CPU time and for all writeback to occur in file offset order to maximise throughput. For XFS, this will also minimise fragmentation during block allocation. See Section 6.2.2 for details on how we improved this behaviour.

6 Improvements and Optimisations

6.1 XFS Modifications

6.1.1 Buffered Write I/O Path

In 2.6.15, a new buffered write I/O path implementation was introduced. This was written by Christoph Hellwig and Nathan Scott [Hellwig]. The main change this introduced was XFS clustering pages directly into a bio instead of by buffer heads and submit_bh() calls. Using buffer heads limited the size of an I/O to the number of buffer heads a bio could hold. In other words, the larger the block size of the filesystem, the larger the I/Os that could be formed in the write cluster path. This is the primary reason for the difference in throughput we see for the XFS filesystems with different block sizes.

By adding complete pages to a bio rather than buffer heads, we were able to make XFS write clustering independent of the filesystem block size. This means that any XFS filesystem can issue I/Os only limited in size by the number of pages that can be held by the bio vector.
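A simplified sketch of that pattern is shown below. This is illustrative only, not the actual XFS write clustering code; structure fields and function signatures differ between kernel versions, and the helper name and omitted completion handling are our own.

    static void submit_page_run(struct block_device *bdev, sector_t sector,
                                struct page **pages, int npages)
    {
            /* One bio can carry many whole pages, so the I/O size is no
             * longer tied to the filesystem block size as it was with
             * submit_bh(). */
            struct bio *bio = bio_alloc(GFP_NOIO, min(npages, BIO_MAX_PAGES));
            int i;

            bio->bi_bdev = bdev;
            bio->bi_sector = sector;

            for (i = 0; i < npages; i++) {
                    /* bio_add_page() returns the number of bytes it
                     * accepted; stop adding pages once the bio is full. */
                    if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) < PAGE_SIZE)
                            break;
            }

            submit_bio(WRITE, bio);     /* completion handling omitted */
    }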

Unfortunately, due to the locking issue described earlier in Section 5.1, XFS with the modified write path was actually slower on our test machine than without it. Clearly, the spinlock problem needed to be solved before we would see any benefit from the new I/O path.

6.1.2 Per-CPU Superblock Counters

Kernel profiles taken during parallel buffered write tests indicated contention within XFS on the in-core superblock lock. This lock protects the current in-core (in-memory) state of the filesystem.

In the case of delayed allocation, XFS uses the in-core superblock to track both disk space that is actually allocated on disk as well as space that has not yet been allocated but is dirty in memory. That means during a write(2) system call we allocate the space needed for the data being written but we don't allocate disk blocks. Hence the "allocation" is very fast whilst maintaining an accurate representation of how much space there is remaining in the filesystem.

This makes contention on this structure a difficult problem to solve. We need global accuracy, but we now need to avoid global contention.


The in-core superblock is a write-mostly structure, so we can't use atomic operations or RCU to scale it. The only commonly used method remaining is to make the counters per-CPU, but we still need to have some method of being accurate when necessary that performs in an acceptable manner.

Hence for the free space counter we decided to trade off performance for accuracy when we are close to ENOSPC. The algorithm that was implemented is essentially a distributed counter that gets slower and more accurate as the aggregated total of the counter approaches zero.

When an individual per-CPU counter reaches zero, we execute a balance operation. This operation locks out all the per-CPU counters, aggregates their values, and redistributes the aggregated value evenly over all the counters before re-enabling them. This requires a per-CPU atomic exclusion mechanism. The balance operation must lock every CPU's fast path out and so can be an expensive operation on a large machine.

However, on that same large machine, the fast path cost of the per-CPU counters is orders of magnitude lower than a global spinlock. Hence we are amortising the cost of an expensive rebalance very quickly compared to using a global spinlock on every operation. Also, when the filesystem has lots of free space we rarely see a rebalance operation, as the distributed counters can sink hundreds of gigabytes of allocation on a single CPU before running dry.

If a counter rebalance results in a very small amount being distributed to each CPU, the counter is considered to be near zero and we fall back to a slow, global, single threaded counter for the aggregated total. That is, we prefer accuracy over blazing speed. It should also be noted that using a global lock in this case tends to be more efficient than constant rebalancing on large machines.
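The following user-space sketch illustrates the shape of that algorithm. It is our own illustration of the idea described above, not the XFS code; a real implementation also needs per-CPU exclusion on the fast path, which is omitted here.

    #include <pthread.h>

    #define NCPUS 4

    static long per_cpu_free[NCPUS];        /* per-CPU slices of free space */
    static pthread_mutex_t balance_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Slow path: gather every slice and redistribute the total evenly. */
    static void rebalance(void)
    {
            long total = 0;
            int i;

            pthread_mutex_lock(&balance_lock);
            for (i = 0; i < NCPUS; i++) {
                    total += per_cpu_free[i];
                    per_cpu_free[i] = 0;
            }
            for (i = 0; i < NCPUS; i++)
                    per_cpu_free[i] = total / NCPUS;
            pthread_mutex_unlock(&balance_lock);
    }

    /* Fast path: consume blocks from this CPU's slice without touching any
     * global state; rebalance only when the local slice runs dry. */
    static int alloc_blocks(int cpu, long nblocks)
    {
            if (per_cpu_free[cpu] >= nblocks) {
                    per_cpu_free[cpu] -= nblocks;
                    return 0;
            }
            rebalance();
            if (per_cpu_free[cpu] >= nblocks) {
                    per_cpu_free[cpu] -= nblocks;
                    return 0;
            }
            return -1;  /* near ENOSPC: fall back to an accurate global counter */
    }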

The results (see Figure 7 and Figure 8) speak for themselves, and the code is to be released with 2.6.17 [Chinner].

6.2 VM and NUMA Issues

6.2.1 SN2 Specific TLB Purging

When first running tests on 2.6.15-rc5, it was noticed that XFS buffered read speeds were much higher than we saw on SLES9 SP2, SLES9 SP3 and 2.6.14. On these kernels we were only achieving a maximum of 4 GiB/s. Using 2.6.15-rc5 we achieved 6.4 GiB/s, and monitoring showed all the disks at greater than 90% utilisation, so we were now getting near to being disk bound.

Further study revealed that the memory reclaim rate limited XFS buffered read throughput. In this particular case, the global TLB flushing speed was found to make a large difference to the reclaim speed.

We found this when we reverted a platform-specific optimisation that was included in 2.6.15-rc1 to speed up TLB flushing [Roe]. Reverting this optimisation reduced buffered read throughput by approximately 30% on the same filesystem and files. Simply put, this improvement was an unexpected but welcome side effect of an optimisation made for different reasons.

6.2.2 Node Local Memory Reclaim

In a stroke of good fortune, Christoph Lameter completed a set of modifications to the memory reclaim subsystem [Lameter] while we were running tests. The modifications were included in Linux 2.6.16, and they modified the reclaim behaviour to reclaim clean pages on a given node before trying to allocate from a remote node.

[Figure 7: Improved XFS Buffered Write Throughput. Throughput (MiB/s) versus thread count (1–32) for Base-XFS-4k, Base-XFS-16k, Opt-XFS-4k, and Opt-XFS-16k.]

The first major difference in behaviour was that kswapd never ran during either buffered read or write tests. Buffered reads were now quite obviously I/O bound, with approximately half the disks showing 100% utilisation. Using a different volume layout with a 1 MiB stripe unit, sustained buffered read throughput increased to over 7.6 GiB/s.

The second most obvious thing was that pdflush was now able to flush more than 5 GiB/s of data whilst consuming less than half a CPU. Without the node local reclaim, it was only able to push approximately 500 MiB/s when it consumed an equivalent amount of CPU time. Writeback, especially at low thread counts, became far more efficient.

6.2.3 Memory Interleaving

While doing initial bandwidth characterisations using direct I/O, we found that it was necessary to ensure that buffer memory was allocated evenly from every node in the machine. This was achieved using the numactl -i all command prefix to the test commands being run.

[Figure 8: Improved XFS Buffered Write Efficiency. Efficiency (%CPU/MiB/s, smaller is better) versus thread count (1–32) for Base-XFS-4k, Base-XFS-16k, Opt-XFS-4k, and Opt-XFS-16k.]

Without memory interleaving, direct I/O (read or write) struggled to achieve much more than 6 GiB/s due to the allocation patterns limiting the buffers to only a few nodes in the machine. Hence we were limited by the per-node NUMALink bandwidth. Interleaving the buffer memory across all the nodes solved this problem.

With buffered I/O, however, we saw very different behaviours. In initial testing we saw little difference in throughput because the page cache ended up spread across all nodes of the machine due to memory reclaim behaviour.

However, when testing the node local memory reclaim patches we found that interleaving did make a big difference to performance, as the local reclaim reduced the number of nodes that the page cache ended up spread over. Interestingly, the improvement in memory reclaim speed that the local reclaim gave us meant that there was no performance degradation despite not spreading the pages all over the machine. Once we spread the pages using the numactl command we saw substantial performance increases.


[Figure 9: Improved XFS Buffered Read Throughput. Throughput (MiB/s) versus thread count (1–32) for Base-XFS-4k, Base-XFS-16k, Opt-XFS-4k, and Opt-XFS-16k.]

6.2.4 Results

We've compared the baseline buffered I/O results from Section 3 with the best results we achieved with our optimised kernel.

From Figure 7 it is clear that we achieved a substantial gain in write throughput. The outstanding result is the improvement for 4 KiB block size filesystems, which is a direct result of the I/O path rewrite. The improved write clustering resulted in consistently larger I/Os being sent to disk, and this has translated into improved throughput. Local memory reclaim has also prevented I/O sizes from decreasing as the number of writing threads increases, which has also contributed to higher throughputs.

On top of improved throughput, Figure 8 indicates that buffered write efficiency has improved by a factor of between three and four. It can be seen that the efficiency decreases somewhat as throughput and thread count go up, so there is still room for improvement here.

Buffered read throughput has roughly doubled, as shown in Figure 9. This improvement can be almost entirely attributed to the VM improvements, as the XFS read path is almost identical in the baseline and optimised kernels.

[Figure 10: Improved XFS Buffered Read Efficiency. Efficiency (%CPU/MiB/s, smaller is better) versus thread count (1–32) for Base-XFS-4k, Base-XFS-16k, Opt-XFS-4k, and Opt-XFS-16k.]

Once again, the improvement in throughput corresponds directly to an improvement in efficiency. Figure 10 indicates that we saw much greater improvements in efficiency at low thread counts than at high thread counts. The source of this decrease in efficiency is unknown and more investigation is required to understand it.

One potential reason for the decrease in efficiency of the buffered read test as throughput increases is that the NUMALink interfaces may be getting close to saturation. With the tests being run, the typical memory access pattern is a DMA write from the HBA to memory, which, due to the interleaved nature of the page cache, is distributed across the NUMALink fabric. The data is then read by a CPU, which gathers the data spread across every node, and is then written back out into a user buffer which is spread across every node.

With both bulk data and control logic memory references included, each node in the system is receiving at least 2 GiB/s and transmitting more than 1.2 GiB/s. With per-node receive throughput this high, remote memory read and write latencies can increase compared to an idle interconnect. Hence the increase in CPU usage may simply be an artifact of sustained high NUMALink utilisation.

7 Futures

The investigation that we undertook has provided us with enough information about the behaviour of these large systems for us to predict issues that SGI customers will see over the next year or two. It has also demonstrated that there are issues that mainstream Linux users are likely to start to see over this same timeframe. With technologies like SAS, PCI Express, multicore CPUs and NUMA moving into the mainstream, issues that used to affect only high end machines are rapidly moving down to the average user. We need to make sure that our filesystems behave well on the average machine of the day.

At the high end, while we are on top of filesys-tem scaling issues with XFS, we are startingto see interactions between high bandwidth I/Oand independent cpuset constrained jobs onlarge machines. These interactions are com-plex and are hinting that for effective deploy-ment on large machines at high I/O bandwidthsthe filesystem needs to be NUMA and I/O pathtopology aware so that filesystem placementand I/O bandwidth locality to the running jobcan be maximised. That is, we need to be ableto control placement in the filesystem to min-imise the NUMALink bandwidth that a job’sI/O uses.

This means that filesystems are likely to needallocation hints provided to them to enable thissort of functionality. We already have policy in-formation controlling how a job uses CPU andmemory in large machines, so extending thisconcept to how the filesystem does allocationis not as far-fetched as it seems.

Improving performance in filesystems is allabout minimising disk seeking, and this comes

down to the way the filesystem allocates its diskspace. We have new issues at the high endto deal with, while the issues that have beensolved at the high end are now becoming is-sues for mainstream. As the intrinsic paral-lelism of the average computer increases, algo-rithms need to be able to resist fragmentationwhen allocations occur simultaneously so thatfilesystem performance can grow with machinecapability.

8 Conclusion

The investigation we undertook has provided us with valuable information on the behaviour of Linux under high bandwidth I/O loads. We identified several areas which limited our performance and scalability, and fixed the worst of them during the investigation.

We improved the efficiency of buffered I/O under these loads and significantly increased the throughput we could achieve from XFS. We discovered interesting NUMA scalability issues and either fixed them or developed effective strategies to negate them.

We proved that we could achieve close to the physical throughput limits of the disk subsystem with direct I/O. From analysis, we found that even buffered I/O was approaching physical NUMALink bandwidth limits. We proved that Linux and XFS in combination could do this whilst maintaining reproducible and stable operation.

We also uncovered a set of generic filesystem issues that affected every filesystem we tested. We solved these problems in XFS, and provided recommendations on why we think they also need to be solved elsewhere.

Finally, we proved that XFS is the best choice for our customers, both on the machines they use and for the common workloads they run.


In conclusion, our investigation fulfilled all the goals we set at the beginning of the task. We gained insight into future issues we are likely to see, and we raised a new set of questions that need further research. Now all we need is a bigger machine and more disks.



The Effects of Filesystem Fragmentation

Giel de Nijs, Philips Research
[email protected]

Ard Biesheuvel, Philips Research
[email protected]

Ad Denissen, Philips Research
[email protected]

Niek Lambert, Philips Research
[email protected]

Abstract

To measure the actual effects of the fragmentation level of a filesystem, we simulate heavy usage over a longer period of time on a constantly nearly full filesystem. We compare various Linux filesystems with respect to the level of fragmentation, and the effects thereof on the data throughput. Our simulated load is comparable to prolonged use of a Personal Video Recorder (PVR) application.

1 Introduction

For the correct reading and writing of files stored on a block device (i.e., hard drive, optical disc, etc.), the actual location of the file on the platters of the disk is of little importance; the job of the filesystem is to transparently present files to higher layers in the operating system. For efficient reading and writing, however, the actual location does matter. As blocks of a file typically need to be presented in sequence, they are mostly also read in sequence. If the blocks of a file are scattered all over a hard drive, the head of the drive needs to seek to subsequent blocks very often, instead of just reading those blocks in one go. This takes time and energy, and so the effective transfer speed of the hard drive is lower and the energy spent per bit read is higher. Obviously, one wants to avoid this.

More sophisticated filesystems like Linux’sext2 [2] incorporate smarter allocation strate-gies, which eliminate the need for defragmenta-tion for everyday use. Specific usage scenariosmight exist where these, and other, filesystems

Page 194: Proceedings of the Linux Symposium Volume One · 2006-07-19 · Eric W. Biederman Fully Automated Testing of the Linux Kernel 113 ... Kristen Carlson Accardi Open Source Technology

194 • The Effects of Filesystem Fragmentation

perform significantly worse or better than aver-age. We have explored such a scenario and de-rive a theoretical background that can predict,up to a point, the stabilisation of the fragmen-tation level of a filesystem.

In this paper we study the level of fragmentation of various filesystems throughout their lifetime, dealing with a specific usage scenario. This scenario is described in section 2 and allows us to derive formulae for the theoretical fragmentation level in section 3. We elaborate on our simulation set-up in section 4, followed by our results in section 5. We end with some thoughts on future work in section 6 and conclude in section 7.

2 The scenario

Off-the-shelf computers are used more and more for storing, retrieving and processing very large video files as they assume the role of a more advanced, digital version of the classic VCR. These so-called Personal Video Recorders (PVRs) handle all the television needs of the home by recording broadcasts and playing them back at a time convenient for the user, either on the device itself or by streaming the video over a home network to a separate rendering device. Add multiple tuners and an ever increasing offer of broadcast programs to the mix and you have an application that demands an I/O subsystem able to handle the simultaneous reading and writing of a number of fairly high bandwidth streams. Both home-built systems and Consumer Electronics (CE) grade products with this functionality exist, running software like MythTV [3] or Microsoft Windows Media Center [4] on top of a standard operating system.

As these systems are meant to be always running, power consumption of the various components becomes an issue. The cost of the components is of course an important factor as well, especially for CE devices. Furthermore, the performance of the system should not deteriorate over time to such a level that it becomes unusable, as a PVR should be low maintenance and should just work. Clearly, over-dimensioning the system to overcome performance issues is not the preferred solution. A better way would be to design the subsystems in such a way that they are able to deliver the required performance efficiently and predictably. As stated above, the hard-disk drive (HDD) will be one of the most stressed components, so it is interesting to see if current solutions fulfil our demands.

2.1 Usage pattern

The task of a PVR is mainly to automatically record broadcast television programs, based on a personal preference, manual instruction or a recommendation system. As most popular shows are broadcast around the same time of day and PVRs are often equipped with more than one tuner, it is not uncommon that more than one program is being recorded simultaneously. As digital broadcast quality results in video streams of about 4 to 8 megabit/s, the size of a typical recording is in the range of 500 MB to 5 GB.

As the hard drive of a PVR fills up, older recordings are deleted to make room for newer ones. The decision which recording to delete is based on age and popularity, e.g., the news of last week can safely be deleted, but a user might want to keep a good movie for a longer period of time.

The result of this is that the system might be writing two 8 megabit/s streams to a nearly full filesystem, sometimes even while playing back one or more streams. For 5 to 10 recordings per day, totalling 3 to 5 hours of content, this results in about 10 GB of video data written to the disk. Will the filesystem hold up if this is done daily for two years? Will the average amount of fragmentation keep increasing or will it stabilise at some point? Will the effective data rate when reading the recorded files from a fresh filesystem differ from the data rate when reading from one that has been used extensively? We hope to answer these questions with our experiments.

Although the described scenario is fairly specific, it is one that is expected to be increasingly important. The usage and size of media files are both steadily increasing, and general Personal Computer (PC) hardware is finding its way into CE devices, for which cost and stability are main issues. The characterised usage pattern is a general filesystem stress test for situations involving large media files.

As an interesting side-note, our scenario describes a pattern that had hardly ever been encountered before. Normal usage of a computer system slowly fills the hard drive while reading, writing and deleting. If the hard drive is full, it is often replaced by a bigger one or used for read-only storage. Our PVR scenario, however, describes a full filesystem that remains in use for reading and writing large files over a prolonged period of time.

2.2 Performance vs. power

A filesystem would ideally be able to perform equally well during the lifetime of the system it is part of, without significant performance loss due to fragmentation of the files. This is not only useful for shorter processing times of non-real-time tasks (e.g., the detection of commercial blocks in a recorded television broadcast), but it also influences the power consumption of the system [5].

If a real-time task with a predefined streaming I/O behaviour is running, such as the recording of a television program, power can be saved if the average bit rate of the stream is lower than the maximum throughput of the hard drive. If a memory buffer is assigned for low power purposes, it can be filled as fast as possible by reading the stream from the hard drive and powering down the drive while serving the application from the memory buffer. This also holds for writing: the application can write into the memory buffer, which can be flushed to disk when it is full, allowing us to power off the drive between bursts. The higher the effective read or write data rate is, the more effective this approach will be. If the fragmentation of the filesystem is such that it influences the effective data rate, it directly influences the power consumption. A system that provides buffering capabilities for streaming I/O while providing latency guarantees is ABISS [6].

3 The theory

To derive a theory dealing with the fragmentation level of a filesystem, we first need some background information on filesystem allocation. This allocation can (and will) lead to fragmentation, as we will describe below. We determine what level of fragmentation is acceptable and, as a result, we can derive the fragmentation equilibrium formulae of section 3.3.

3.1 Block allocation in filesystems

Each filesystem has some sort of rationale that governs which of the available blocks it will use next when more space needs to be allocated. This rationale is what we call the allocation strategy of a filesystem. Some allocation strategies are more sophisticated than others. Also, the allocation strategy of a particular filesystem can differ between implementations without sacrificing interoperability, provided that every implementation meets the specifications of how the data structures are represented on disk.

3.1.1 The FAT filesystem

The File Allocation Table File System (FATFS) [1] is a filesystem developed by Microsoft in the late seventies for its MS-DOS operating system. It is still in wide use today, mainly on USB flash drives, portable music players and digital cameras. While recent versions of Windows still support FATFS, it is no longer the default filesystem on this platform.

A FATFS volume consists of a boot sector, two file allocation tables (FATs), a root directory and a collection of files and subdirectories spread out across the disk. Each entry of the FAT maps to a cluster in the data space of the disk, and contains the index number of the next cluster of the file (or subdirectory) it belongs to. An index of zero in the FAT means the corresponding cluster on the disk is free; other magic numbers exist that denote that a cluster is the last cluster of a file or that the cluster is damaged or reserved. The second FAT is a backup copy of the first one, in case the first one gets corrupted.

An instance of FATFS is characterised by three parameters: the number of disk sectors in a cluster (2^i for 0 ≤ i ≤ 7), the number of clusters on the volume, and the number of bits used for each FAT entry (12, 16 or 32 bits), which at least equals the base-2 logarithm of the number of clusters.

The allocation strategy employed by FATFS is fairly straightforward. It scans the FAT linearly, and uses the first free cluster found. Each scan starts from the position in the FAT where the previous scan ended, which results in all of the disk being used eventually, even if the filesystem is never full. However, this strategy turns out to be too naive for our purpose: if several files are allocated concurrently, the files end up interleaved on the disk, resulting in high fragmentation levels even on a near-empty filesystem.
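As an illustration of this rotating first-fit scan, consider the following sketch. It is not FAT driver code; the in-memory table, the FAT_FREE marker and the next_start cursor are assumptions made for the example.

#include <stddef.h>
#include <stdint.h>

#define FAT_FREE 0u /* a zero entry marks a free cluster */

/* Rotating first-fit scan over the FAT: start where the previous scan
 * ended, wrap around once, and return the first free cluster found
 * (or 0 if the volume is full). Data clusters are numbered from 2,
 * as on real FAT volumes, so the FAT array has nclusters + 2 entries. */
static uint32_t fat_alloc_cluster(uint32_t *fat, uint32_t nclusters,
                                  uint32_t *next_start)
{
    uint32_t scanned, c = *next_start;

    for (scanned = 0; scanned < nclusters; scanned++) {
        if (c < 2 || c >= nclusters + 2)
            c = 2;                    /* wrap to the first data cluster */
        if (fat[c] == FAT_FREE) {
            *next_start = c + 1;      /* the next scan resumes here */
            return c;
        }
        c++;
    }
    return 0;                         /* no free cluster found */
}

Because every file draws its next cluster from the same shared cursor, two files written concurrently end up with their clusters interleaved on disk, which is exactly the behaviour criticised above.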

3.1.2 The LIMEFS filesystem

The Large-file metadata-In-Memory Extent-based File System (LIMEFS) was developed as a research venture within Philips Research [7].

LIMEFS is extent-based, which means it keeps track of used and free blocks in the filesystem by maintaining lists of (index, count) pairs. Each pair is called an extent, and describes a set of count contiguous blocks starting at position index on the disk.

The allocation strategy LIMEFS uses is slightly more sophisticated than the strategy FAT incorporates, and turns out to be very effective in avoiding fragmentation when dealing with large files. When space needs to be allocated, LIMEFS first tries to allocate blocks in the free extent after the last written block of the current file. If there is no such extent, the list of free extents is scanned and an extent is chosen that is not preceded by an extent that contains the last block of another file that is currently open for writing. If it finds such an extent, it will start allocating blocks from the beginning of the extent. If it cannot find such an extent, it will pick an extent that is preceded by a file that is open for writing. In this case, however, it will split the free extent in half and will only allocate from the second half. When the selected extent runs out of free blocks, another extent is selected using the approach just described, with the added notion that extents close to the original one are preferred over more distant ones.
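The selection logic can be summarised in a sketch like the one below. The data structures and the tail_is_open_file() helper are our own simplifications for illustration; they are not the actual LIMEFS implementation.

#include <stdbool.h>
#include <stddef.h>

struct extent {
    unsigned long index;   /* first block of the extent */
    unsigned long count;   /* number of contiguous blocks */
    struct extent *next;
};

/* Placeholder: a real implementation would check whether the on-disk
 * area just before 'e' ends with the last written block of a file
 * that is still open for writing. */
static bool tail_is_open_file(const struct extent *e)
{
    (void)e;
    return false;
}

/* Pick the free extent to allocate from for the current file.
 * 'after_tail' is the free extent directly following the last block
 * written to this file, or NULL if there is none. */
static struct extent *limefs_pick_extent(struct extent *free_list,
                                         struct extent *after_tail)
{
    struct extent *e, *fallback = NULL;

    /* 1. Prefer growing the file in place. */
    if (after_tail && after_tail->count > 0)
        return after_tail;

    /* 2. Otherwise prefer a free extent that is not right behind
     *    another file that is currently being written. */
    for (e = free_list; e; e = e->next) {
        if (!tail_is_open_file(e))
            return e;          /* allocate from the start of this extent */
        if (!fallback)
            fallback = e;
    }

    /* 3. No such extent: take one behind an open file, but split it in
     *    half and allocate only from the second half, leaving room for
     *    the other file to keep growing contiguously. */
    if (fallback) {
        fallback->index += fallback->count / 2;
        fallback->count -= fallback->count / 2;
    }
    return fallback;
}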

Page 197: Proceedings of the Linux Symposium Volume One · 2006-07-19 · Eric W. Biederman Fully Automated Testing of the Linux Kernel 113 ... Kristen Carlson Accardi Open Source Technology

2006 Linux Symposium, Volume One • 197

3.2 How to count fragments

A hard drive reading data sequentially is able to transfer, on average, the amount of data present on one track in the time it takes the platter to rotate once. Reading more data than one track involves moving the head to the next track, which is time lost for reading. The layout of the data on the tracks is corrected for the track-to-track seek time of the hard drive, i.e., the data is laid out in such a way that the first block on the next track is just passing underneath the head the moment it has moved from the previous track and is ready for reading. This way no additional time is spent waiting for the right block. This is known as track skew, shown in figure 1. Hence, the effective transfer rate for sequential reading is the amount of data present on one track, divided by the rotation time increased by the track skew.

Figure 1: In the time it takes the head of the hard drive to move to the next track, the drive continues to rotate. The layout of blocks compensates for this. After reading block k, the head moves to the next track and arrives just in time to continue reading from block k+1 onwards. This is called track skew.

When a request is issued for a block that is not on the current track, the head has to seek to the correct track. Subsequently, after the seek, the drive has to wait until the correct block passes underneath the head, which takes on average half the rotation time of the disk. The time such a seek takes is time not spent reading; thus seeking lowers the data throughput of the drive. Seeks occur, for example, when blocks of a single non-contiguous file are read in sequence, i.e., when moving from fragment to fragment. Some non-sequential blocks do not induce seeks, however, and should not be counted as fragments.

Two consecutive blocks in a file that are not contiguously placed on the disk only lead to a seek if the seek time is lower than the time it would take to bridge the gap between the two blocks by just waiting for the block to turn up beneath the head (maybe after a track-to-track seek).

If $t_s$ is the average time to do a full seek, and $t_r$ is the rotation time of the disk, we can derive $T_s$, the access time resulting from a seek:

$$ T_s = t_s + \frac{1}{2} t_r \qquad (1) $$

The access time resulting from waiting, $T_w$, can be expressed in terms of track size $s_t$, gap size $s_g$ and track skew $t_{ts}$:

$$ T_w = \frac{s_g}{s_t} (t_r + t_{ts}) \qquad (2) $$

So the maximum gap size $S_g$ that does not induce a seek is the largest gap size $s_g$ for which $T_w \le T_s$ still holds, and therefore

$$ S_g = s_t \, \frac{\frac{t_s}{t_r} + \frac{1}{2}}{1 + \frac{t_{ts}}{t_r}} \qquad (3) $$


The relevance of the maximum gap size $S_g$ is that it allows us to determine how many rotations and how many seeks it takes to read a particular file, given the layout of its blocks on the disk.
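To give a feel for the magnitude involved, here is a rough worked example with assumed drive parameters; these particular numbers are illustrative and are not the measured values used later in the paper. Take a full seek of $t_s = 9\,\mathrm{ms}$, a 7200 rpm spindle so $t_r \approx 8.3\,\mathrm{ms}$, a track skew of $t_{ts} \approx 0.8\,\mathrm{ms}$, and a track size of $s_t = 600\,\mathrm{KiB}$:

$$ S_g \approx 600\,\mathrm{KiB} \cdot \frac{\frac{9}{8.3} + \frac{1}{2}}{1 + \frac{0.8}{8.3}} \approx 600\,\mathrm{KiB} \cdot \frac{1.58}{1.10} \approx 870\,\mathrm{KiB} $$

This is the same order of magnitude as the 676 KB gap threshold used for the measured drive in section 5.4.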

3.3 Fragmentation equilibrium

A small amount of fragmentation is not bad per se, if it is small enough to not significantly reduce the transfer speed of a hard drive. For instance, the ext3 filesystem [8] by default inserts one metadata block every 1024 data blocks. Although, strictly speaking, this leads to non-sequential data and thus fragmentation, this kind of fragmentation does not impact the data rate significantly. If anything, it increases performance because the relevant metadata is always nearby.

A more important factor is the predictability and the stabilisation of the fragmentation level. Dimensioning the I/O subsystem for a certain application is only possible if the effective data transfer rate of the hard drive is known a priori, i.e., predictable. As one should dimension a system for the worst case, it is also helpful if the fragmentation level has an upper bound, i.e., it reaches a maximum at some point in time after which it does not deteriorate any further. With the help of some simple mathematics, we can estimate the theoretical prerequisites for such stabilisation.

3.3.1 Neighbouring blocks

Suppose we have a disk with a size of N allocation blocks and from these blocks we make a random selection of M blocks. The first selected block will of course never be a neighbour of any previously selected blocks. The probability that the second block is a neighbour of the first is 2/(N−1), because 2 of the remaining N−1 blocks are adjacent to the first block. The probability of the third block being a neighbour of one of the previous two is then 4/(N−2), or, more generally, the probability of block i being a neighbour of one of the previously selected blocks is:

$$ P(i) = \frac{2i}{N-i} \qquad (1 \le i \ll N) \qquad (4) $$

The average number of neighbouring blocks when randomly selecting M blocks, the expected value, can be determined by summing the probability of a selected block being a neighbour of a previously selected block over all blocks, although with two errors: we are not correcting for blocks at the beginning or end of the disk with only one neighbour, and we are conveniently forgetting that two previously selected blocks could already be neighbours. As long as $N \gg 1$ and $M \ll N$ this will not influence the outcome very much and it simplifies the formulae. Furthermore, if $M \ll N$ we can approximate (4) by:

$$ P(i) \approx \frac{2i}{N} \qquad (5) $$

The expected value of the number of neighbouring blocks is then, according to (5):

$$ E = \sum_{i=1}^{M-1} \frac{2i}{N-i} \approx \sum_{i=1}^{M-1} \frac{2i}{N} = \frac{(M-1)M}{N} \qquad (6) $$
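Equation (6) is easy to sanity-check numerically. The following stand-alone program (our own illustration, not part of the paper's tool set) randomly marks M out of N blocks and counts how many picks land next to an already marked block; the average should approach (M−1)M/N as long as M ≪ N.

#include <stdio.h>
#include <stdlib.h>

#define N 1000000UL   /* blocks on the simulated disk */
#define M 1000UL      /* randomly selected blocks     */
#define RUNS 100

int main(void)
{
    static unsigned char used[N];
    double total = 0.0;

    srand(42);
    for (int run = 0; run < RUNS; run++) {
        unsigned long neighbours = 0, picked = 0;

        for (unsigned long i = 0; i < N; i++)
            used[i] = 0;

        while (picked < M) {
            unsigned long b = (unsigned long)rand() % N;
            if (used[b])
                continue;            /* already selected, pick again  */
            if ((b > 0 && used[b - 1]) || (b + 1 < N && used[b + 1]))
                neighbours++;        /* adjacent to an earlier pick   */
            used[b] = 1;
            picked++;
        }
        total += neighbours;
    }
    printf("measured %.2f, predicted %.2f\n",
           total / RUNS, (double)(M - 1) * M / N);
    return 0;
}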


3.3.2 Neighbouring fragments

The above holds for randomly allocated blocks, which is something not many filesystems do. The blocks in the above argumentation, however, can also be seen as fragments, of both files and free space. This only changes the meaning of N and M and does not change the argumentation. If we now look at a moderately fragmented drive, the location of the fragments will be more or less random, and our estimate of the expected number of neighbouring fragments will be closer to the truth.

Furthermore, to have a stable number of fragments, an equilibrium should exist between the number of new fragments created during allocation and the number of fragments eliminated during deletion (by deleting data next to a free fragment and thus 'glueing' two fragments together).

Let us interpret N as the total number of fragments on a drive, both occupied and free, and M as the number of free fragments. Also, assume that we are dealing with a drive that has already been in use for some time and is almost full, according to our PVR scenario as described in section 2.1.

The aforementioned equilibrium can be obtained with the simple allocation strategy of LIMEFS (described in section 3.1.2) used in our PVR scenario, by eliminating one fragment when deleting a file, as on average one fragment is created during the allocation of a file (as shown in figure 2). We assume that, as the files are on average of the same size, for each file created a file is deleted; when writing a file of L fragments (increasing the fragment count by one), a file of L fragments is deleted. This deletion should eliminate one fragment to create a balance in the number of fragments.

Figure 2: Empty fragments are filled (and thus do not create more fragmentation) until the end of the file. As an empty fragment is now divided into a fragment of the new file and a smaller empty fragment, one new fragment is created.

By deleting a file of L fragments on a drive with N total fragments, we increase the number of free fragments from (M−L) to M. The number of neighbouring fragments n before deleting and $n_d$ after deleting the file amongst the free fragments is, according to (6):

$$ n = \frac{(M-L-1)(M-L)}{N} \qquad (7) $$

$$ n_d = \frac{(M-1)M}{N} \qquad (8) $$

Combining (7) and (8) gives us the increase of neighbouring free fragments and thus the decrease of free fragments f (two neighbouring free fragments are one fragment):

$$ f = n_d - n = \frac{(M-1)M}{N} - \frac{(M-L-1)(M-L)}{N} = \frac{(2M-L-1)L}{N} \qquad (9) $$

3.3.3 Balancing fragmentation

Now we define two practical, understandable parameters: m for the fraction of the disk that is free, and s for the fraction of the disk occupied by a file:

$$ m = \frac{M}{N} \qquad (10) $$

$$ s = \frac{L}{N} \qquad (11) $$

This is mathematically not entirely correct, as we defined N, M and L as distributions of a number of fragments, but in the experiments of section 4 we will see that it works well enough for us, together with the numerous assumptions we have already made.

To fulfil the demand that the number of eliminated fragments f when deleting a file of L fragments should equal the number of fragments created by allocating space for a file of an average equal size, we set f = 1 in (9). Combining this statement with (10) and (11) gives:

$$ L = \frac{1+s}{2m-s} \qquad (12) $$

This can be simplified even more by assuming $s \ll 1$, which is true if $L \ll N$. This is a realistic assumption, as the size of a file is most often a lot smaller than the size of the disk. We get:

$$ L = \frac{1}{2m-s} \qquad (13) $$

A remarkable result of this equation is that the average number of fragments in each of our files does not depend on the allocation unit size. Of course, the above only holds if files consist of more than one fragment, and a fragment consists of a fairly large number of allocation units. This fits our PVR scenario, but the deduced formulae do not apply to general workloads.

Another interesting result of (13) is the insight that a disk without free space, besides the amount needed to write the next file, will result in a situation where M = L and thus m = s. The average number of fragments in a file then becomes L = 1/s, meaning the number of fragments in a file equals the number of files on the disk.

4 The simulation

To test the validity of our theory described in section 3 on the one hand, and to be able to assess whether fragmentation is an issue for PVR systems on the other hand, we have conducted a number of simulations. First, we have done in-memory experiments with the simple allocation strategy of LIMEFS [7]. Next, we have simulated a real PVR system working on a filesystem as closely as possible with our program pvrsim. This was combined with our hddfrgchk tool to analyse the output of the simulation. Finally, we have done throughput and actual seek measurements by observing the block I/O layer of the Linux kernel with Jens Axboe's blktrace [9].

All experiments were performed on a Linux 2.6.15.3 kernel, modified only with the blktrace patches. The PVR simulations on actual filesystems were run on fairly recent Pentium IV machines with a 250 GB Western Digital hard drive. The actual speed of the machine and hard drive should not influence the results of the fragmentation experiments, and the performance measurements were only compared with those conducted on the same machine.

4.1 LIMEFS

We isolated the simple allocation strategy of LIMEFS described in section 3.1.2 from the rest of the filesystem and implemented it in a simulation in user space. This way, the creation and deletion of files was performed by only modifying the in-memory metadata of the filesystem, and we were quickly able to investigate whether our theory has practical value. The results of this experiment can be found in section 5.2.

4.2 pvrsim

As we clearly do not want to use a PVR for two years to see what the effects of doing so are, we have written a simulation program called pvrsim. This multi-threaded application writes and deletes files, similar in size to recorded broadcasts, as fast as possible. It is able to do so with a number of concurrent threads defined at runtime, to simulate the simultaneous recording of multiple streams.

The size of the generated file is uniformly distributed in the range from 500 MB to 5 GB. Besides that, every file is assigned a Gaussian-distributed popularity score, ranging roughly from 10 to 100. This score is used to determine which file should be deleted to free up disk space for a new file.

When writing new files, pvrsim always keeps a minimum amount of free disk space to prevent excessive fragmentation. This dependency between fragmentation and free disk space was shown in (13) in section 3.3.3. If writing a new file would exceed this limit, older files are deleted until enough space has been freed. The file with the lowest weighed popularity is deleted first. The weighed popularity is determined by dividing the popularity by the logarithm of the age, where the age is expressed as the number of files created after the file at hand.
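A sketch of this victim selection, with hypothetical structure and function names (the paper does not show pvrsim's internals), could look like this; it would be applied repeatedly until enough space has been freed.

#include <math.h>
#include <stddef.h>

struct recording {
    double popularity;       /* Gaussian score, roughly 10..100        */
    unsigned long age;       /* number of files created after this one */
};

/* Weighed popularity: popularity divided by the logarithm of the age.
 * Age 0 or 1 would make the logarithm zero (or undefined), so it is
 * clamped here; how pvrsim handles that case is not stated in the paper. */
static double weighed_popularity(const struct recording *r)
{
    unsigned long age = r->age < 2 ? 2 : r->age;
    return r->popularity / log((double)age);
}

/* Return the index of the file to delete: the lowest weighed popularity. */
static size_t pick_victim(const struct recording *files, size_t n)
{
    size_t victim = 0;
    for (size_t i = 1; i < n; i++)
        if (weighed_popularity(&files[i]) < weighed_popularity(&files[victim]))
            victim = i;
    return victim;
}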

Blocks of 32 KB filled with zeroes are written to each file until it reaches the previously determined size. Next, the location of each block of the file is looked up and the extents of the file are determined. The extents are written to a log file for further processing by hddfrgchk.
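One way to look up block locations from user space is the FIBMAP ioctl, which maps a logical block number of a file to its physical block number; coalescing consecutive mappings yields the extent list. Whether pvrsim uses this exact interface is not stated in the paper, so the sketch below is only an illustration of the idea (FIBMAP typically requires root privileges, and holes are not handled here).

#include <fcntl.h>
#include <linux/fs.h>     /* FIBMAP, FIGETBSZ */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Print the extents of a file as (start block, length) pairs by
 * coalescing consecutive FIBMAP results. */
static int print_extents(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    int bsz;
    if (fstat(fd, &st) < 0 || ioctl(fd, FIGETBSZ, &bsz) < 0) {
        close(fd);
        return -1;
    }

    long nblocks = (st.st_size + bsz - 1) / bsz;
    long ext_start = -1, ext_len = 0, prev = -2;

    for (long i = 0; i < nblocks; i++) {
        int blk = (int)i;                 /* in: logical block number   */
        if (ioctl(fd, FIBMAP, &blk) < 0)  /* out: physical block number */
            break;
        if (blk == prev + 1) {
            ext_len++;                    /* still in the same extent   */
        } else {
            if (ext_len)
                printf("extent: start=%ld len=%ld\n", ext_start, ext_len);
            ext_start = blk;
            ext_len = 1;
        }
        prev = blk;
    }
    if (ext_len)
        printf("extent: start=%ld len=%ld\n", ext_start, ext_len);
    close(fd);
    return 0;
}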

4.3 hddfrgchk

Our main simulation tool pvrsim outputs the extents of each created file, which need further processing to be able to analyse the results properly. This processing is done by hddfrgchk, which provides two separate functions, described below.

4.3.1 Fragmentation measures

hddfrgchk is able to calculate a number of variables from the extent lists. First, it determines the number of fragments of which the file at hand consists. As explained in section 3.2, this number is often higher than the actual number of seeks the hard drive has to do when reading the file. We therefore calculate the theoretical number of seeks, based on the characteristics of a typical hard drive, with (3) from section 3.2. The exact parameters in this calculation were determined from a typical hard drive by measurements, as described in [10].

Furthermore, we have defined a measure for the relative effective data transfer speed. The minimum number of rotations the drive has to make to transfer all data of a file, if all its blocks were contiguous, can be calculated by dividing the number of blocks of the file by the average number of blocks on a track. The number of rotations the drive theoretically has to make to actually transfer the data can also be calculated. This is done by adding the blocks in the gaps that were not counted as fragments in our earlier calculations to the total number of blocks in the file. Furthermore, the number of rotations that take place in the time the hard drive is seeking (using the number of theoretical seeks calculated earlier) is also added to the estimate of the number of rotations the drive has to make to transfer the file. Dividing the minimum number of rotations by this estimate of the actual number of rotations gives us the relative transfer speed.
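Restated as code, the relative-speed measure reduces to a small formula. The variable names below are ours, not hddfrgchk's; seek_rotations stands for the number of rotations that elapse during one theoretical seek, derived from the drive characterisation.

/* Relative effective transfer speed as described above. All "blocks"
 * are filesystem blocks; blocks_per_track and seek_rotations come from
 * drive characterisation measurements. Names are illustrative. */
static double relative_speed(double file_blocks,
                             double gap_blocks_not_counted_as_seeks,
                             double theoretical_seeks,
                             double blocks_per_track,
                             double seek_rotations)
{
    double min_rotations = file_blocks / blocks_per_track;
    double actual_rotations =
        (file_blocks + gap_blocks_not_counted_as_seeks) / blocks_per_track
        + theoretical_seeks * seek_rotations;
    return min_rotations / actual_rotations;
}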

4.3.2 Filesystem lay-out

Besides these variables, hddfrgchk also generates a graphical representation of the simulation over time. The filesystem is depicted by an image, each pixel representing a number of blocks. With each file written by the simulator, the blocks belonging to that file are given a separate colour. When the file is deleted, this is also updated in the image of the filesystem.

A new picture of the state of the filesystem is generated for every file, and the separate pictures are combined into an animation. The animation gives a visualisation of the locations of the files on the drive, and gives insight into how the filesystem evolves.

4.4 Performance measurements

To verify the theoretical calculations of hddfrgchk, we have also done some measurements. We have looked at the requests issued to the block I/O device (the hard drive in this case) by the block I/O layer. The blktrace [9] patch by Jens Axboe provides useful kernel instrumentation for this purpose.

4.4.1 blktrace

The kernel-side mechanism collects request queue operations. The user space utility blktrace extracts those event traces via the Relay filesystem (RelayFS) [11]. The event traces are stored in a raw data format, to ensure fast processing. The blkparse utility produces formatted output of these traces afterwards, and generates statistics.

The events that are collected originate either from the filesystem or are SCSI commands. The filesystem block layer requests consist of the read or write actions of the filesystem. These actions are queued and inserted into the internal I/O scheduler queue. The requests might be merged with other items in the queue, at the discretion of the I/O scheduler. Subsequently, they are issued to the block device driver, which finally signals when a specific request is completed.

4.4.2 Deriving the number of seeks

All requests also include the Logical Block Address (LBA) of the starting sector of the request, as well as the size (in sectors) of the request. As we are interested in the exact actions the hard drive performs, we only look at the requests that are reported to be completed by the block device driver, along with their location and size. With this information, we can count the number of seeks the hard drive has made: if the starting location of a request is equal to the ending location of the previous request, no seek will take place. This does not yet account for the fact that small gaps might not induce seeks, but do lower the transfer rate.
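In code, this amounts to comparing each completed request's start sector with the end sector of the previous request. The structure and field names below are our own illustration rather than the blkparse output format.

#include <stdio.h>

struct completed_request {
    unsigned long long sector;    /* starting LBA of the request    */
    unsigned int       nsectors;  /* size of the request in sectors */
};

/* Count seeks in a stream of completed requests: a seek is charged
 * whenever a request does not start exactly where the previous one
 * ended. Gaps small enough to be bridged by waiting (see section 3.2)
 * would still have to be filtered out afterwards. */
static unsigned long count_seeks(const struct completed_request *reqs,
                                 unsigned long n)
{
    unsigned long seeks = 0;
    unsigned long long prev_end = 0;
    int have_prev = 0;

    for (unsigned long i = 0; i < n; i++) {
        if (have_prev && reqs[i].sector != prev_end)
            seeks++;
        prev_end = reqs[i].sector + reqs[i].nsectors;
        have_prev = 1;
    }
    return seeks;
}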

4.4.3 Determining the data transfer rate

The effective data transfer rate can be derived from the information provided by blktrace, but can also be calculated simply as the wall clock time needed to read a file, divided by the file size. Comparing transfer rates of various files should be done with caution: the physical location on the drive significantly influences this. To have a fair comparison, the average transfer rate over the whole drive should be compared at various stages in the simulation.

5 The results

We have conducted a number of variations of the simulations described in section 4. The variables under consideration were the filesystem on which the simulations were taking place, the size of the filesystem, the minimum amount of free space on the filesystem, the length of the simulation (i.e., the number of files created) and the number of concurrent files being written.

5.1 Simulation parameters

With exploratory simulations we discovered that the size of the filesystem does not significantly influence the outcome of the experiments, as long as the filesystem is sufficiently large compared to both the block size and the average file size. As typical PVR systems offer storage space ranging from 100 GB to 300 GB, we decided on a filesystem size of 138 GB. Due to circumstances, however, some of the experiments were conducted on a filesystem of 100 GB.

The minimum amount of space that is always kept free is fixed in the pvrsim simulations at 5% of the capacity of the drive. According to the preliminary in-memory LIMEFS experiments this is a reasonable value. The results of these experiments are elaborated in section 5.2.

The size of the created files is chosen randomly between 500 MB and 5 GB, uniformly distributed. As explained in section 2.1, these are typical file sizes for a PVR recording MPEG2 streams in Standard Definition (SD) resolution.

The length of the experiments, expressed in the number of files created, was initially set at 10,000. The results from these runs showed that after about 2,500 files the behaviour stabilised. Therefore, the length of the subsequent simulations was set at 2,500 files.

We have done simulations with up to four simultaneous threads, so several files were written to the disk concurrently. This was done to observe the behaviour when recording multiple simultaneous broadcasts.

The filesystems we have covered in the experiments described in this paper are FAT (both Linux and Microsoft Windows), ext3 [2] [8], ReiserFS [12], LIMEFS and NTFS [13] (Microsoft Windows). We plan to cover more filesystems.

5.2 LIMEFS in-memory simulation

The results of the simulation of LIMEFS as described in section 4.1 are shown in figure 3. We have run our in-memory simulation on an imaginary 250 GB hard drive, with the minimum amount of free space as a variable parameter.

As can be seen, in the runs with 5%, 10% and 20% free space the fragmentation stabilises quickly. In longer runs we observed that the 0%, 1% and 2% options also stabilise, but the final fragmentation count is much higher and the stabilisation takes longer. The sweet spot appears to be 5% minimum free space, as this gives a good balance between fragmentation and hard drive space usage.

A nice result from this experiment is that the observed fragmentation counts fit our formula.


We have taken a filesystem of 250 GB and an average file size of 2750 MB. For the 5% free space run, this makes the fraction of free space m = 0.05 (see section 3.3.3). The fraction of the disk occupied by a file is $s = \frac{L}{N} = \frac{2.75}{250} = 0.01$. So, the number of fragments in a file on average, according to the formula, is:

$$ L = \frac{1}{2m-s} = \frac{1}{2 \cdot 0.05 - 0.01} \approx 11 $$

From the plot in figure 3, we can see the calculation matches the outcome of the simulation. The same holds for the other values of the minimum amount of free space.

Figure 3: The average fragment count (moving average) during a simulation run of 10,000 files, plotted against the number of files written, with a variable percentage of space kept free (0%, 1%, 2%, 5%, 10% and 20%).

5.3 Fragmentation simulation

While the name pvrsim might suggest otherwise, we must stress that all results obtained with pvrsim were produced by writing real files to real filesystems. The filesystems were running on their native platforms (i.e., FAT and NTFS on Windows XP SP2, the others on Linux 2.6.15). For an interesting comparison, however, we have also tested how the Linux version of FAT performs in a number of situations.

These experiments resulted in the plots in figure 4, where the number of seeks according to our theory (see section 3.2) and the relative speed are shown for single- and multi-threaded situations.

5.3.1 Single-threaded performance

All single-threaded simulations show similar results on all filesystems: the effective read speed is not severely impacted by writing many large files in succession. Some filesystems handle the fragmentation more gracefully than others, but the effects on system performance are negligible in all cases, as can be seen in the top two plots of figure 4. Although ext3 seems to show quite some fragmentation, the relative speed does not suffer: 98% of the raw data throughput is a hardly noticeable slowdown.

5.3.2 Multi-threaded performance

The multi-threaded simulations show that a file allocation strategy that tries to cluster files that are created concurrently performs considerably better than one that does not. The performance of NTFS deteriorates very quickly (after having written only a couple of files) to a relative speed of around 0.6, while the relative speed of ReiserFS and ext3 does not drop below 0.8. Linux FAT does slightly worse, while LIMEFS is not impacted at all.

5.3.3 LIMEFS

According to our results, the area of PVR applications is one where LIMEFS really shines. LIMEFS never produces more than around ten fragments, even with four threads. We do admit that LIMEFS is the only filesystem used in this experiment that was designed specifically for the purpose of storing PVR recordings, and that it might perform horribly, or not at all, in other areas. However, the results are encouraging, and will hopefully serve as an inspiration for other filesystems.

5.3.4 FAT and NTFS

One interesting result of running the simulation on FAT and NTFS on Windows is that Windows appears to allocate 1-megabyte chunks regardless of the block size (extents are typically 16 clusters of 64 KB, or 32 clusters of 32 KB). As our simulation does not produce files smaller than 1 megabyte, we have no way of determining the effect of small files on the fragmentation levels of these filesystems. However, the chunked allocation seems to alleviate the negative effects of the otherwise quite naive allocation strategies of FAT and NTFS.

5.4 Seeks and throughput

We have measured the average data rate of files on a newly created filesystem and compared that with the data rate of files after pvrsim simulated a PVR workload, to confirm that our relative speed calculations are representative of the actual situation. Furthermore, we have analysed the activity in the block I/O layer with blktrace to see whether the number of seeks derived from the placement of the blocks on the drive can be used to estimate the real activity.

The raw data throughput of the drive on which we executed our single-threaded ext3 run was 60,590 KB/s. After writing 10,000 files with pvrsim, the data throughput while reading those files was on average 58,409 KB/s. This results in a relative speed of 0.96, which is close to the 0.95 we estimated with our calculations. The figures of the two-threaded ext3 run on a different machine (52,520 KB/s raw data throughput, 40,270 KB/s while reading the final files present, and thus a relative speed of 0.77, the same as calculated) confirm this.

With the use of blktrace we counted 10,267 seeks when reading the final files present on the disk after the single-threaded ext3 run. This is an average of 366 fragments per file. If we take into account that small fragments do not cause the drive to seek, as explained in section 3.2, the number of seeks caused by fragments after a gap of more than 676 KB was, again according to the blktrace observations, 5,692, or an average of 203 seeks per file. This is for all practical purposes close enough to the 198 seeks we derive from the location of the fragments on the disk.

6 Future work

The experiments and results presented in this paper are a starting point to improve the fragmentation robustness of filesystems for PVR-like scenarios. To obtain a good overview of the current state of the art, we are planning to run our simulations on other filesystems, e.g., on XFS [14]. We intend to cover more combinations of the parameters of our simulations as well, e.g., different distributions for file size and popularity, different filesystem sizes, and different filesystem options.

Following this route, experimental measurements of different workloads and scenarios might provide interesting insights as well. We feel our tools could easily be modified to incorporate other synthetic workloads, and could therefore be of great help for further experiments.


A more practical matter is improving the allocation strategy of the FAT filesystem. Making the allocation extent-based and multi-stream aware, like LIMEFS, would greatly improve the fragmentation behaviour, while the resulting filesystem would remain backwards compatible. On the one hand, this would prove our LIMEFS strategies in real-life applications; on the other hand, it would be useful for incorporation in mobile digital television devices, which might also act as a mass storage device and should therefore use a compatible filesystem. Unfortunately, the usefulness of FAT in a PVR context remains limited due to its file size limit of 4 gigabytes.

7 Conclusion

The formulae derived in section 3 give an indication of the average fragmentation level for simple allocation strategies and large files. Although we made quite a few assumptions and took some shortcuts in the mathematical justification, the results of the LIMEFS in-memory experiments of section 5.2 support the theory. Another useful outcome of the formulae is that, at least with large files, the fragmentation level stabilises, which seems to be true as well for filesystems with more sophisticated allocation strategies.

When dealing only with large files, a simple allocation strategy seems very efficient in terms of fragmentation prevention. Especially if only one stream is written, even FAT performs very well. Writing multiple streams simultaneously requires some precautions, but a strategy as implemented in LIMEFS suffices and outperforms all more complicated strategies with respect to the fragmentation level.

The ext3 and ReiserFS filesystems show relatively high fragmentation in our scenario. However, the fragmentation stabilises; the impact is therefore predictable, and is no real issue with large files. A file of 2 GB consisting of 500 fragments will result in 4.5 seconds of seeking (with an average seek time of 9 ms). This is not significant for a movie of two hours, provided the seeks are not clustered.

The relative speeds measured with ReiserFS and ext3 are not as good as those of LIMEFS, but still acceptable: 80% of the performance after prolonged use. NTFS, however, performs horribly when using multiple simultaneous streams. The Linux version of FAT does surprisingly well with two concurrent streams, much better than the Microsoft Windows implementation. We have yet to investigate why this is.

An interesting observation is the fact that ext3 keeps about 2% of unused free space at the end of the drive, independent of the "reserved space" option (used to prevent the filesystem from being filled up by a normal user). If this free space is kept clustered at the end instead of being used throughout the simulation, this is inefficient in terms of fragmentation, as our formulae tell us.

In general, a PVR-like device is able to provide sustainable I/O performance over time if a filesystem like ext3 or ReiserFS is used. This does not assert anything about scenarios where file sizes are in the order of magnitude of the size of an allocation unit. However, the ratio between rotation time and seek time in modern hard drives is such that seeks are no longer something to avoid at all costs. For optimal usage of the hard drive under a load of a number of concurrent streams, an allocation strategy that is aware of such a scenario is needed.

References

[1] Microsoft Corporation. Microsoft Extensible Firmware Initiative FAT32 File System Specification. Whitepaper, December 2000. http://www.microsoft.com/whdc/system/platform/firmware/fatgendown.mspx?

[2] Card, Rémy; Ts'o, Theodore; Tweedie, Stephen. Design and Implementation of the Second Extended Filesystem. Proceedings of the First Dutch International Symposium on Linux, 1994. http://web.mit.edu/tytso/www/linux/ext2intro.html

[3] MythTV. http://www.mythtv.org

[4] Microsoft Windows XP Media Center. http://www.microsoft.com/windowsxp/mediacenter/default.mspx

[5] Mesut, Özcan; Brink, Benno van den; Blijlevens, Jennifer; Bos, Eric; Nijs, Giel de. Hard Disk Drive Power Management for Multi-stream Applications. Proceedings of the International Workshop on Software Support for Portable Storage, March 2005.

[6] Nijs, Giel de; Almesberger, Werner; Brink, Benno van den. Active Block I/O Scheduling System (ABISS). Proceedings of the Linux Symposium, vol. 1, pp. 109–126, Ottawa, July 2005. http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf

[7] Springer, Rink. Time is of the Essence: Implementation of the LimeFS Realtime Linux Filesystem. Graduation Report, Fontys University of Applied Sciences, Eindhoven, 2005.

[8] Johnson, Michael K. Red Hat's New Journaling File System: ext3. Whitepaper, 2001. http://www.redhat.com/support/wpapers/redhat/ext3/

[9] Axboe, Jens; Brunelle, Alan D. blktrace User Guide. http://www.kernel.org/pub/linux/kernel/people/axboe/blktrace/

[10] Mesut, Özcan; Lambert, Niek. HDD Characterization for A/V Streaming Applications. IEEE Transactions on Consumer Electronics, Vol. 48, No. 3, pp. 802–807, August 2002.

[11] Dagenais, Michel; Moore, Richard; Wisniewski, Bob; Yaghmour, Karim; Zanussi, Tom. RelayFS - A High-Speed Data Relay Filesystem. http://relayfs.sourceforge.net/relayfs.txt

[12] Reiser, Hans. ReiserFS v.3 Whitepaper. Whitepaper, 2003.

[13] Microsoft Corporation. Local File Systems for Windows. WinHEC, May 2004. http://www.microsoft.com/whdc/device/storage/LocFileSys.mspx

[14] Hellwig, Christoph. XFS for Linux. UKUUG, July 2003. http://oss.sgi.com/projects/xfs/papers/ukuug2003.pdf


Figure 4: Results of pvrsim runs on various filesystems (Linux FAT, Windows FAT, LIMEFS, ext3, ReiserFS, NTFS). From top to bottom the number of concurrent threads was respectively one, two and four. The plots on the left side show the average number of seeks per file over time (files written), corrected to exclude small fragments as described in section 3.2. The plots on the right side show the average relative speed per file (see section 4.3).


The LTTng tracer: A low impact performance and behavior monitor for GNU/Linux

Mathieu Desnoyers, École Polytechnique de Montréal, [email protected]

Michel R. Dagenais, École Polytechnique de Montréal, [email protected]

Abstract

Efficient tracing of system-wide execution, allowing integrated analysis of both kernel space and user space, is difficult to achieve. The following article presents a new tracer core, Linux Trace Toolkit Next Generation (LTTng), which has taken over from the previous version known as LTT. It has the same goals of low system disturbance and architecture independence while being fully reentrant, scalable, precise, extensible, modular and easy to use. For instance, LTTng allows tracepoints in NMI code, multiple simultaneous traces and a flight recorder mode. LTTng reuses and enhances the existing LTT instrumentation and RelayFS.

This paper will focus on the approaches taken by LTTng to fulfill these goals. It will present the modular architecture of the project. It will then explain how NMI reentrancy requires atomic operations for writing and RCU lists for tracing behavior control. It will show how these techniques are inherently scalable to multiprocessor systems. Then, time precision limitations in the kernel will be discussed, followed by an explanation of the motivation for LTTng's own monotonic timestamps.

In addition, the template-based code generator for the architecture-agnostic trace format will be presented. The approach taken to allow nested types, variable fields and dynamic alignment of data in the trace buffers will be revealed. The paper will show the mechanisms deployed to facilitate the use and extension of this tool by adding custom instrumentation and analysis involving the kernel, libraries and user space programs.

It will also introduce LTTng's trace analyzer and graphical viewer counterpart: Linux Trace Toolkit Viewer (LTTV). The latter implements extensible analysis of the trace information through collaborating text and graphical plugins.¹ It can simultaneously display multiple multi-GByte traces of multi-processor systems.

¹Project website: http://ltt.polymtl.ca.

1 Tracing goals

With the increasing complexity of newer computer systems, the overall performance of applications often depends on a combination of several factors, including I/O subsystems, device drivers, interrupts, lock contention among multiple CPUs, scheduling and memory management. A low impact, high performance tracing system may therefore be the only tool capable of collecting the information produced by instrumenting the whole system, while not significantly changing the behavior and performance of the studied system.

Besides offering a flexible and easy to use interface to users, an efficient tracer must satisfy the requirements of the most demanding applications. For instance, the widely used printk and printf statements are relatively easy to use and are adequate for simple applications, but do not offer the performance needed for instrumenting interrupts in high performance multiprocessor computer systems, and cannot necessarily be used in some code paths such as non-maskable interrupt (NMI) handlers.

An important aspect of tracing, particularly inthe real-time and high performance comput-ing fields, is the precision of events times-tamps. Real-time is often used in embeddedsystems which are based on a number of dif-ferent architectures (e.g. ARM, MIPS, PPC)optimized for various applications. The chal-lenge is therefore to obtain a tracer with precisetimestamps, across multiple architectures, run-ning from several MHz to several GHz, somebeing multi-processors.

The number of ad hoc tracing systems devisedfor specific needs (several Linux device driverscontain a small tracer), and the experience withearlier versions of LTT, show the needs for aflexible and extensible system. This is the caseboth in terms of adding easily new instrumen-tation points and in terms of adding plugins forthe analysis and display of the resulting tracedata.

2 Existing solutions

Several different approaches have been taken by performance monitoring tools. They usually adhere to one of the following two paradigms. The first class of monitors, post-processing, aims to minimize CPU usage during the execution of the monitored system by collecting data for later off-line analysis. As the goal is to have minimum impact on performance, static instrumentation is habitually used in this approach. Static instrumentation consists in modifying the program source code to add logging statements that will compile with the program. Such systems include LTT [7], a Linux kernel tracer; K42 [5], a research operating system from IBM; and IrixView and Tornado, which are commercial proprietary products.

The second class of monitors aims at calculating well defined information (e.g. I/O requests per second, system calls per second per PID) on the monitored CPU itself: this is what is generally called a pre-processing approach. It is the case of SystemTAP [3], Kerninst [4], Sun's dtrace [1], and IBM's Performance and Environment Monitoring (PEM) [6]. All except PEM use a dynamic instrumentation approach. Dynamic instrumentation is performed by replacing assembly instructions with breakpoints in the program binary objects loaded in memory, as the gdb debugger does. It is suitable for their goal because it generally has a negligible footprint compared to the pre-processing they do.

Since our goal is to support high performance and real-time embedded systems, the dynamic probe approach is too intrusive, as it implies using a costly breakpoint interrupt. Furthermore, even if the pre-processing of information can sometimes be faster than logging raw data, it does not allow the same flexibility as post-processing analysis. Indeed, almost every aspect of a system can be studied once a trace of the complete flow of the system behavior is obtained. However, pre-processed data can be logged into a tracer, as PEM does with K42, for later combined analysis, and the two approaches are therefore not incompatible.


3 Previous Works

LTTng reuses research that has previously been done in the operating system tracing field in order to build new features and address currently unsolved questions more thoroughly.

The previous Linux Trace Toolkit (LTT) [7] project offers operating system instrumentation that has been quite stable through the 2.6 Linux kernels. It also has the advantage of being cross-platform, but with types limited to fixed sizes (e.g. fixed 8-, 16-, 32-, or 64-bit integers, compared to host sized byte, short, integer, and long). It also suffers from the monolithic implementation of both the LTT tracer and its viewer, which have proven to be difficult to extend. Another limitation is the use of the NTP corrected kernel time for timestamps, which is not monotonic. LTTng is based on LTT but is a new generation: layered, easily extensible with new event types and viewer plugins, with a more precise time base, and it will eventually support the combined analysis of several computers in a cluster [2].

RelayFS [8] has been developed as a standard high-speed data relay between the kernel and user space. It has been integrated into the 2.6.14 Linux kernel. It offers hooks for kernel clients to send information in large buffers and interacts with a user space daemon through file operations on a memory mapped file.

In the past years, IBM has developed K42 [5], an open source research kernel which aims at full scalability. It has been designed from the ground up with tracing as a necessity, not an option. It offers a very elegant lockless tracing mechanism based on the atomic compare-and-exchange operation.

The Performance and Environment Monitoring (PEM) [6] project shares a few similarities with LTTng and LTTV, since some work has been done in collaboration with members of their team. The XML file format for describing events came from these discussions, aiming at standardizing event description and trace formats.

4 The LTTng approach

The following subsections describe the five main components of the LTTng architecture. The first one explains the control of the different entities in LTTng. It is followed by a description of the data flow in the different modules of the application. The automated static instrumentation will thereafter be introduced. Event type registration, the mechanism that links the extensible instrumentation to the dynamic collection of traces, will then be presented.

4.1 Control

There are three main parts in LTTng: a user space command-line application, lttctl; a user space daemon, lttd, that waits for trace data and writes it to disk; and a kernel part that controls kernel tracing. Figure 1 shows the control paths in LTTng. lttctl is the command line application used to control tracing. It starts lttd and controls kernel tracing behavior through a library-module bridge which uses a netlink socket.

Figure 1: LTTng control architecture

The core module of LTTng is ltt-core. This module is responsible for a number of LTT control events. It controls the helper modules ltt-heartbeat, ltt-facilities, and ltt-statedump. Module ltt-heartbeat generates periodic events in order to detect and account for cycle counter overflows, thus allowing a single monotonically increasing time base even if shorter 32-bit (instead of 64-bit) cycle counts are stored in each event. Ltt-facilities lists the facilities (collections of event types) currently loaded at trace start time. Module ltt-statedump generates events to describe the kernel state at trace start time (processes, files, . . . ). A built-in kernel object, ltt-base, contains the symbols and data structures required by built-in instrumentation. This includes principally the tracing control structures.

4.2 Data flow

Figure 2 shows the data flow in LTTng. All data is written through ltt-base into RelayFS circular buffers. When subbuffers are full, they are delivered to the lttd disk writer daemon.

Figure 2: LTTng data flow

Lttd is a standalone multithreaded daemon which waits on RelayFS channels (files) for data by using the poll file operation. When it is awakened, it locks the channels for reading by using a relay buffer get ioctl. At that point, it has exclusive access to the subbuffer it has reserved and can safely write it to disk. It should then issue a relay buffer put ioctl to release it so it can be reused.
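To make this flow concrete, here is a minimal sketch of such a consumer loop. The ioctl numbers and the mmap layout are placeholders, not the actual LTTng/RelayFS constants, and error handling is reduced to the essentials.

#include <poll.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical ioctl numbers standing in for the relay buffer
 * "get" and "put" operations described above. */
#define RELAY_GET_SUBBUF  _IOR('r', 0, unsigned int)
#define RELAY_PUT_SUBBUF  _IOW('r', 1, unsigned int)

static void consume_channel(int chan_fd, int out_fd,
                            const char *mapped_buf, size_t subbuf_size)
{
    struct pollfd pfd = { .fd = chan_fd, .events = POLLIN };

    for (;;) {
        unsigned int idx;

        /* sleep until a full subbuffer is delivered */
        if (poll(&pfd, 1, -1) <= 0)
            continue;
        /* lock the subbuffer for exclusive reading */
        if (ioctl(chan_fd, RELAY_GET_SUBBUF, &idx) < 0)
            continue;
        /* safe to write it to disk now */
        write(out_fd, mapped_buf + idx * subbuf_size, subbuf_size);
        /* release it so the tracer can reuse it */
        ioctl(chan_fd, RELAY_PUT_SUBBUF, &idx);
    }
}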

A side-path, libltt-usertrace-fast, running completely in user space, has been developed for high throughput user space applications which need high performance tracing. It is explained in detail in Section 4.5.4.

Both lttd and the libltt-usertrace-fast companion process currently support disk output, but should eventually be extended to other media like network communication.

4.3 Instrumentation

LTTng instrumentation, as presented in Figure 3, consists of an XML event description that is used both for automatically generating tracing headers and as data metainformation in the trace files. These tracing headers implement the functions that must be called at instrumentation sites to log information in traces.

Figure 3: LTTng instrumentation

Most common types are supported in the XML description: fixed size integers, host size integers (int, long, pointer, size_t), floating point numbers, enumerations, and strings. All of these can be either host or network byte ordered. It also supports nested arrays, sequences, structures, and unions.
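Purely as an illustration (the names below are invented, not the actual generated API), a tracing header generated from such an XML description would expose a function that the call site invokes with plain C arguments:

/* Hypothetical sketch of a generated tracing function for an event
 * with one fixed-size integer field and one variable-length string. */
void trace_fs_open(int fd, const char *filename);   /* generated in the header */

/* call site, somewhere in the instrumented code: */
void instrumented_open_path(const char *filename, int fd)
{
    trace_fs_open(fd, filename);   /* serializes both fields into the trace */
}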

The tracing functions, generated in the tracing headers, serialize the C types given as arguments into the LTT trace format. This format supports both packed and aligned data types.

Figure 4: LTTng event type registration

A record generated by a probe hit is called an event. Event types are grouped in facilities. A facility is a dynamically loadable object, either a kernel module for kernel instrumentation or a user space library for user space instrumentation. An object that calls instrumentation should be linked with its associated facility object.


4.4 Event type registration

Event type registration is centralized in the ltt-facilities kernel object, as shown in Figure 4. It controls the rights to register specific types of information in traces. For instance, it does not allow a user space process using the ltt-usertrace API to register facilities with names conflicting with kernel facilities.

The ltt-heartbeat built-in object and ltt-statedump also have their own instrumentation to log events. Therefore, they also register with ltt-facilities, just like standard kernel instrumentation.

Registered facility names, checksums, and type sizes are stored locally in ltt-facilities so they can be dumped into a special low traffic channel at trace start. Dynamic registration of new facilities, while tracing is active, is also supported.

Facilities contain information concerning the type sizes in the compilation environment of the associated instrumentation. For instance, a facility for a 32-bit process would differ from the same facility compiled for a 64-bit process in its long and pointer sizes.

4.5 Tracing

There are many similarities between Figure 4 and Figure 5. Indeed, each piece of traced information must have its metainformation registered in ltt-facilities. The difference is that Figure 4 shows how the metainformation is registered, while Figure 5 shows the actual tracing. The tracing path has the biggest impact on system behavior because it is called for every event.

Each event recorded uses ltt-base, the container of the active traces, to get pointers to the RelayFS buffers. One exception is libltt-usertrace-fast, which will be explained in Subsection 4.5.4.

Figure 5: LTTng tracing

The algorithms used in these tracing sites, which make them reentrant, scalable, and precise, will now be explained.

4.5.1 Reentrancy

This section presents the lockless reentrancy mechanism used at LTTng instrumentation sites. Its primary goal is to provide correct tracing throughout the kernel, including non-maskable interrupt (NMI) handlers, which cannot be disabled like normal interrupts. The second goal is to have the minimum impact on performance, both by having fast code and by not disrupting normal system behavior by taking intrusive locks or disabling interrupts.

Figure 6: LTTng instrumentation site

To describe the reentrancy mechanism used by the LTTng instrumentation site (see Figure 6), we define the call site, which is the original code from the instrumented program where the tracing function is called. We also define the instrumentation site, which is the tracing function itself.

The instrumentation site found in the kernel and user space instrumentation has very well defined inputs and outputs. Its main input is the call site parameters. The call site must ensure that the data given as parameters to the instrumentation site is properly protected with its associated locks. Very often, such data is already locked by the call site, so there is often no need to add supplementary locking.

The other input that the instrumentation site takes is the global trace control information. It is contained in an RCU list of active traces in the ltt-base object. Note that the instrumentation site uses the trace control information both as an input and an output: this is both how tracing behavior is controlled and where the variables that control writing to RelayFS buffers are stored.

The main output of the instrumentation site is a serialized memory write of both an event header and the instrumentation site parameters to the per-CPU RelayFS buffers. The location in these buffers is protected from concurrent access by using a lockless memory write scheme inspired by the one found in K42 [5]:

First, the amount of memory space necessary for the memory write is computed. When the data size is known statically, this step is quite fast. If, however, variable length data (string or sequence) must be recorded, a first size calculation pass is performed. Alignment of the data is taken care of in this step. To speed up data alignment, the start address of the variable size data is always aligned on the architecture word size: this makes it possible to do a compile time alignment computation for all fixed size types.

Then, a memory region in the buffers is reserved atomically with a compare-and-exchange loop. The algorithm retries the reservation if a concurrent reservation occurs. The timestamp for the event is taken inside the compare-and-exchange loop so that it is monotonically increasing with buffer offsets. This is done to simplify data parsing in the post-processing tool.

A reservation can fail under the following conditions. In normal tracing mode, a buffer full condition causes the reservation to fail. On the other hand, in flight recorder mode, non-read buffers are overwritten, so the reservation never fails. When a reservation fails, the events lost counter is incremented and the instrumentation site returns without doing a commit.

The next step is to copy the data from the instrumentation site arguments to the reserved RelayFS memory region. This step must preserve the same data alignment that was calculated earlier.

Finally, a commit operation is done to release the reserved memory segment. No information is kept on a per memory region basis. We only keep a count of the number of reserved and committed bytes per subbuffer. A subbuffer is considered to be in a consistent state (non-corrupted and readable) when both counts are equal.
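The following is a simplified sketch of this reserve/commit scheme, written with C11 atomics for readability rather than the kernel's own primitives. Subbuffer boundaries and the event header layout are omitted, and read_tsc() is an x86-only stand-in for the architecture's cycle counter read.

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

struct ltt_buf {
    char          *start;       /* base of the per-CPU buffer   */
    size_t         size;        /* total size, a power of two   */
    atomic_size_t  reserved;    /* bytes reserved so far        */
    atomic_size_t  committed;   /* bytes committed so far       */
    atomic_ulong   events_lost;
};

static inline uint64_t read_tsc(void)  /* x86 cycle counter */
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Reserve 'len' bytes; returns the buffer offset, or -1 when full. */
static long reserve_slot(struct ltt_buf *buf, size_t len, uint64_t *tsc)
{
    size_t old = atomic_load(&buf->reserved), new;

    do {
        new = old + len;
        if (new - atomic_load(&buf->committed) > buf->size) {
            atomic_fetch_add(&buf->events_lost, 1);
            return -1;            /* normal mode: drop the event */
        }
        *tsc = read_tsc();        /* timestamp taken inside the loop */
    } while (!atomic_compare_exchange_weak(&buf->reserved, &old, new));

    return (long)(old & (buf->size - 1));
}

static void write_event(struct ltt_buf *buf, const void *payload, size_t len)
{
    uint64_t tsc;
    long off = reserve_slot(buf, len, &tsc);

    if (off < 0)
        return;
    /* the real tracer first writes an event header containing tsc;
     * here we only copy the payload, preserving its alignment */
    memcpy(buf->start + off, payload, len);
    /* commit: only a running byte count, no per-slot bookkeeping */
    atomic_fetch_add(&buf->committed, len);
}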

It is possible for a process to die between the slot reservation and the commit because of a kernel OOPS. In that case, the lttd daemon will be incapable of reading the subbuffer affected by this condition because of unequal reserve and commit counts. This situation is resolved when the reservation algorithm wraps back to the faulty subbuffer: if the reservation falls in a new subbuffer that has unequal reserve and commit counts, the reader (lttd) is pushed to the next subbuffer, the lost subbuffers counter is incremented, and the subbuffer is overwritten. To ensure that this condition will not be reached by the normal out of order commit of events (caused by nested execution contexts), the buffer must be big enough to contain the data recorded by the maximum number of out of order events, which is limited by the longest sequence of events logged from nestable contexts (softirqs, interrupts, and NMIs).

The subbuffer delivery is triggered by a flag from the call site on a subbuffer switch. It is periodically checked by a timer routine to take the appropriate actions. This ensures atomicity and correct lockless behavior when called from NMI handlers.

Compared to printk, which calls the scheduler, disables interrupts, and takes spinlocks, LTTng offers a more robust reentrancy that makes it callable from the scheduler code and from NMI handlers.

4.5.2 Scalability

Scalability of the tracing code for SMP machines is ensured by the use of per-CPU data and by the lockless tracing mechanism. The inputs of the instrumentation site are scalable: the data given as parameters is usually either on the caller's stack or already properly locked. The global trace information is organized in an RCU list which does not require any lock on the reader side.

Per-CPU buffers eliminate the false sharing of cachelines between multiple CPUs on the memory write side. The fact that the input-output trace control structures are per-CPU also eliminates false sharing.

To identify more precisely the performance cost of this algorithm, let's compare two approaches: taking a per-CPU spinlock or using atomic compare-and-exchange operations. The most frequent path implies either taking and releasing a spinlock along with disabling interrupts, or doing a compare-and-exchange, an atomic increment of the reserve count, and an atomic increment of the commit count.

On a 3 GHz Pentium 4, a compare-and-exchange without the LOCK prefix costs 29 cycles. With a LOCK prefix, it rises to 112 cycles. An atomic increment costs respectively 7 and 93 cycles without and with a LOCK prefix. Using a spinlock with interrupts disabled costs 214 cycles.

As LTTng uses per-CPU buffers, it does not need to take a lock on memory to protect against concurrent access from other CPUs when performing these operations. Only the non-locked versions of compare-and-exchange and atomic increment are then necessary. If we consider only the time spent in atomic operations, using a compare-and-exchange and two atomic increments takes 29 + 7 + 7 = 43 cycles, compared to 214 cycles for a spinlock.

Therefore, using atomic operations is five times faster than an equivalent spinlock on this architecture, while having the additional benefits of being reentrant for NMI code and not disturbing system behavior, as it does not disable interrupts for the duration of the tracing code.

4.5.3 Time (im)precision in the Linux kernel

Time precision in the Linux kernel is a research subject in its own right. However, looking at the Linux kernel x86 timekeeping code is very enlightening about the accuracy of the nanosecond timestamps provided by the kernel. Effectively, it is based on a CPU cycle to nanosecond scaling factor computed at boot time from the timer interrupt. The code that generates and uses this scaling factor takes for granted that the value only needs to be precise enough to keep track of scheduling periods. Therefore, the focus is on providing a fast computation of the time with shifting techniques rather than on providing a very accurate timestamp. Furthermore, doing integer arithmetic necessarily implies a loss of precision.

This causes problems when a tool like LTTng strongly depends on the monotonicity and precision of the time value associated with timestamps.

To overcome the inherent kernel time precision limitations, LTTng directly reads the CPU timestamp counters. It uses the cpu_khz kernel variable, which contains the most precisely calibrated CPU frequency available. This value will be used by the post-processing tool, LTTV, to convert cycles to nanoseconds in a precise manner with double precision numbers.
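For illustration, the conversion itself is straightforward once cpu_khz is known (a minimal sketch, not LTTV's actual code):

#include <stdint.h>

/* cpu_khz is the calibrated CPU frequency in kHz, so one cycle
 * lasts 1e6 / cpu_khz nanoseconds. */
static double cycles_to_ns(uint64_t cycles, double cpu_khz)
{
    return (double)cycles * (1e6 / cpu_khz);
}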

Due to the importance of the CPU timestamp counters in LTTng instrumentation, a workaround has been developed to support architectures that only have a 32-bit timestamp counter available. It uses the ltt-heartbeat module's periodic timer to keep a full 64-bit timestamp counter on architectures where it is missing, by detecting the 32-bit overflows in an atomic fashion; both the previous and the current TSC values are kept, swapped by a pointer change upon overflow. The read side must additionally check for overflows.
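Conceptually (this is a simplification of the actual ltt-heartbeat code, which keeps both the previous and current values and swaps them by a pointer change), the read side reconstructs a 64-bit value like this:

#include <stdint.h>

/* 64-bit reference value, refreshed by the periodic heartbeat more
 * often than the 32-bit hardware counter can wrap. */
static uint64_t heartbeat_tsc;

static uint64_t read_full_tsc(uint32_t hw_tsc32)
{
    uint64_t full = (heartbeat_tsc & 0xffffffff00000000ULL) | hw_tsc32;

    /* if the hardware counter is behind the reference low word,
     * a 32-bit overflow happened since the last heartbeat */
    if (hw_tsc32 < (uint32_t)heartbeat_tsc)
        full += 0x100000000ULL;
    return full;
}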

It is important to restate that the time base used by LTTng is based neither on the kernel do_gettimeofday, which is NTP corrected and thus non-monotonic, nor on the kernel monotonic time, which suffers from integer arithmetic imprecision. LTTng uses the CPU timestamp counter and its most accurate calibration.

4.5.4 User space tracing

User space tracing has been achieved in many ways in the past. The original LTT [7] used write operations on a device to send events to the kernel. It did not, however, give the same performance as in-kernel events, as it needs a round-trip to the kernel and many copies of the information.

K42 [5] solves this by sharing per-CPU memory buffers between the kernel and user space processes. Although this performs very well, it does not ensure secure tracing, as a given process can corrupt the traces that belong to other processes or to the kernel. Moreover, sharing memory regions between the kernel and user space might be acceptable for a research kernel, but for a production kernel it implies a weaker traceability of process-kernel communications and might bring limitations on architectures with mixed 32- and 64-bit processes.

LTTng provides user space tracing through two different schemes to suit two distinct categories of instrumentation needs.

The first category is characterized by a very low event throughput. This can be the case for an event that happens rarely, to show a specific error condition, or periodically at an interval typically greater than or equal to the scheduler period. The “slow tracing path” is targeted at this category.

The second category, which is addressed by the “fast tracing path,” is much more demanding. It is particularly I/O intensive and must be close to the performance of a direct memory write. This is the case when manually instrumenting the critical path of a program, or when automatically instrumenting every function entry/exit with gcc.

Both mechanisms share the same facility registration interface with the kernel, which passes through a system call, as shown in Figure 4. Validation is done by limiting these user space facilities to their own namespace so they cannot imitate kernel events.

The slow path uses a costly system call at each event call site. Its advantage is that it does not require linking the instrumented program against any library and does not have any thread startup performance impact like the fast path explained below. Every event logged through the system call is copied into the kernel tracing buffers. Before doing so, the system call verifies that the facility ID corresponds to a valid user space facility.

The fast path, the libltt-usertrace-fast library (see Figure 2), consists of a per-thread companion process which writes the buffers directly to disk. Communication between the thread and the library is done through circular buffers in an anonymous shared memory map. Writing the buffers to disk is done by a separate companion process to ensure that buffered data is never lost when the traced program terminates. The other goal is to account the time spent writing to disk to a different process than the one being traced. The trace is written to the filesystem, arbitrarily in /tmp/ltt-usertrace, in files following the naming convention process-tid-pid-timestamp, which makes them unique for the trace. When tracing is over, /tmp/ltt-usertrace must be manually moved into the kernel trace. The trace and usertrace do not have to coincide: although it is better to have the usertrace time span included in the kernel trace interval, to benefit from the scheduler information for the running processes, it is not mandatory and partial information will remain available.

Both the slow and the fast path reuse the lockless tracing algorithm found in the LTTng kernel tracer. In the fast path, it ensures reentrancy with signal handlers without the cost of disabling signals at each instrumentation site.

5 Graphical viewer: LTTV

LTTng is independent of the viewer: the trace format is well documented and a trace-reading library is provided. Nonetheless, the associated viewer, LTTV, will be briefly introduced. It implements optimised algorithms for random access to several multi-GB traces describing the behavior of one or several uniprocessor or multi-processor systems. Many plugin views can be loaded dynamically into LTTV for the display and analysis of the data. Developers can thus easily extend the tool by creating their own instrumentation with the flexible XML description and connecting their own plugin to that information. It is layered in a modular architecture.

On top of the LGPL low-level trace file reading library, LTTV recreates its own representation of the evolving kernel state through time and keeps statistical information in a generic hierarchical container. By combining the kernel state, the statistics, and the trace events, the viewers and analysis plugins can extend the information shown to the user. Plugins are kept focused (analysis, text or graphical display, control, . . . ) to increase modularity and reuse. The plugin loader supports dependency control.

LTTV also offers a rich and efficient event filter, which allows specifying, with a logical expression, the events a user is interested in seeing. It can be reused by the plugins to limit their scope to a subset of the information.

For performance reasons, LTTV is written in C. It uses the GTK graphical library and glib. It is distributed under the GPLv2 license.

6 Results

This section presents the results of several measurements. We first present the time overhead on the system running microbenchmarks of the instrumentation site. Then, taking these results as a starting point, the interrupt and scheduler impact will be discussed. Macrobenchmarks of the system under different loads will then be shown, detailing the time used for tracing.

The size of the instrumentation object code will be discussed along with possible size optimisations. Finally, time precision calibration is performed with an NMI timer.

6.1 Test environment

The test environment consists of a 3 GHz, uniprocessor Pentium 4, with hyperthreading disabled, running LTTng 0.5.41. The results are presented in cycles; the exact calibration of the CPU clock is 3,000.607 MHz.

6.2 Microbenchmarks

Table 1 presents probe site microbenchmarks. Kernel probe tests are done in a kernel module with interrupts disabled. User space tests are influenced by interrupts and the scheduler. Both consist of 20,000 hits of a probe that writes 4 bytes plus the event header (20 bytes). Each hit is surrounded by two timestamp counter reads.

When the LTTng tracing is compiled out, calibration of the tests shows that the time spent in the two TSC reads varies between 97 and 105 cycles, with an average of 100.0 cycles. We therefore removed this time from the raw probe time results.

As we can see, the best case for kernel tracing is a little slower than the ltt-usertrace-fast library: this is due to supplementary operations that must be done in the kernel (preemption disabling, for instance) that are not needed in user space. The maximum and average values of the time spent in user space probes do not mean much because they are sensitive to scheduling and interrupts.

The key result in Table 1 is the average 288.5 cycles (96.15 ns) spent in a probe.

Probe site    Test Series                      Time spent in probe (cycles)
                                               min      average    max
Kernel        Tracing dynamically disabled     0        0.000      338
Kernel        Tracing active (1 trace)         278      288.500    6,997
User space    ltt-usertrace-fast library       225      297.021    88,913
User space    Tracing through system call      1,013    1,042.200  329,062

Table 1: LTTng microbenchmarks for a 4-byte event probe hit 20,000 times

LTTng probe sites do not increase latency because they do not disable interrupts. However, the interrupt entry/exit instrumentation itself does increase interrupt response time, and therefore increases the latency of low priority interrupts by twice the probe time: 577.0 cycles (192.29 ns).

The scheduler response time is also affected by LTTng instrumentation because it must disable preemption around the RCU list used for control. Furthermore, the scheduler instrumentation itself adds a task switch delay equal to the probe time, for a total scheduler delay of twice the probe time: 577.0 cycles (192.29 ns). In addition, a small implementation detail (the use of preempt_enable_no_resched()), needed to ensure scheduler instrumentation reentrancy, has a downside: it can possibly make the scheduler miss a timer interrupt. This could be solved for real-time applications by using the no-resched flavour of preemption enabling only in the scheduler, wakeup, and NMI nested probe sites.

6.3 Macrobenchmarks

6.3.1 Kernel tracing

Table 2 details the time spent both in the instrumentation site and in lttd for different loads. Time spent in instrumentation is computed from the average probe time (288.5 cycles) multiplied by the number of probe hits. Time spent in lttd is the CPU time of the lttd process as given by the LTTV analysis. The load is computed by subtracting the time spent in system call mode in process 0 (the idle process) from the wall time.

It is quite understandable that the probes triggered by the ping flood take that much CPU time, as they instrument a code path that is called very often: the system call entry. The total CPU time used by tracing on a busy system (medium and high load scenarios) goes from 1.54 to 2.28%.

6.3.2 User space tracing

Table 3 compares the ltt-usertrace-fast user space tracer with gprof on a specific task: the instrumentation of each function entry and exit during a gcc compilation run. The userspace tracing of LTTng is only a constant factor of 2 slower than a gprof-instrumented binary, which is not bad considering the amount of additional data generated. The factor of 2 holds for the ideal case where the daemon writes to a /dev/null output. In practice, the I/O device can further limit the throughput. For instance, writing the trace to a SATA disk, LTTng is 4.13 times slower than gprof.

The next test consists in running an instrumented version of gcc, itself compiled with the -finstrument-functions option, to compile a 6.9 KiB C file into a 15 KiB object, with level 2 optimisation.

Load size    Test Series               CPU time (%)              Data rate   Events/s
                                       load     probes   lttd    (MiB/s)
Small        mozilla (browsing)        1.15     0.053    0.27    0.19        5,476
Medium       find                      15.38    1.150    0.39    2.28        120,282
High         find + gcc                63.79    1.720    0.56    3.24        179,255
Very high    find + gcc + ping flood   98.60    8.500    0.96    16.17       884,545

Table 2: LTTng macrobenchmarks for different loads

gcc instrumentation    Time (s)   Data rate (MiB/s)
not instrumented       0.446
gprof                  0.774
LTTng (null output)    1.553      153.25
LTTng (disk output)    3.197      74.44

Table 3: gcc function entry/exit tracing

As Table 3 shows, a gprof-instrumented gcc takes 1.73 times the normal execution time. The fast userspace instrumentation of LTTng is 3.22 times slower than normal. Gprof only extracts a sampling of function time by using a periodic timer and keeps per-function counters. LTTng extracts the complete function call trace of a program, which generates an output of 238 MiB in 1.553 seconds (153.25 MiB/s). The execution time is I/O-bound: it slows down to 3.197 s when writing the trace to a SATA disk through the operating system buffers (74.44 MiB/s).

6.4 Instrumentation object size

Another important aspect of instrumentation is the size of the binary instructions added to the programs. This wastes precious L1 cache space and grows the overall object code size, which is more problematic in embedded systems. Table 4 shows the size of stripped objects that only contain instrumentation.

Instrumentation                       Object code size (bytes)
log 4-byte integer                    2,288
log variable length string            2,384
log a structure of int, string,
  sequence of 8-byte integers         2,432

Table 4: Instrumentation object size

Independently of the amount of data to trace, the object code size varies in our tests by a maximum of 3.3% from the average size. Adding 2.37 kB per event might be too much for embedded applications, but a tradeoff can be made between inlining of tracing sites (and reference locality) and doing function calls, which would permit instrumentation code reuse.

A complete L1 cache hit profiling should be done to fully see the cache impact of the instrumentation and help tweak the inlining level. Such profiling is planned.

6.5 Time precision

Time precision measurement of a timestamp counter based clock source can only be done relative to another clock source. The following test traces the NMI watchdog timer, using it as a comparison clock source. It has the advantage of not being disturbed by CPU load, as these interruptions cannot be deactivated. It is, however, limited by the precision of the timer crystal. The hardware used for these tests is an Intel D915-GAG motherboard. Its timer is driven by a TXC HC-49S crystal with a ±30 PPM precision. Table 5 and Figure 7 show the intervals of the logged NMI timer events. Their precision is discussed below.

Figure 7: Traced NMI timer events interval

Statistic             Value (ns)
min                   3,994,844
average               3,999,339
max                   4,004,468
standard deviation    52
max deviation         5,075

Table 5: Traced NMI timer events interval

This table indicates a standard deviation of 52 ns and a maximum deviation of 5,075 ns from the average. If we take the maximum deviation as a worst case, we can assume that we have a ±5.075 µs error between the programmable interrupt timer (PIT) and the trace time base (derived from the CPU TSC). Part of it is due to CPU cache misses, higher priority NMIs, kernel minor page faults, and the PIT itself. A 52 ns standard deviation every 4 ms means a 13 µs error each second, for a 13 PPM frequency precision, which is within the expected limits.

7 Conclusion

As demonstrated in the previous section, LTTng is a low disturbance tracer that uses about 2% of CPU time on a heavy workload. It is entirely based on atomic operations to ensure reentrancy. This enables it to trace a wide range of code sites, from user space programs and libraries to kernel code, in every execution context, including NMI handlers.


Its time measurement precision gives a 13 PPM frequency error when reading the programmable interrupt timer (PIT) in NMI mode, which is coherent with the 30 PPM crystal precision.

LTTng proves to be an efficient and precise tracer. It offers an architecture independent instrumentation code generator, based on templates, to reduce instrumentation effort. It provides efficient and convenient mechanisms for kernel and user space tracing.

A plugin based analysis tool, LTTV, helps to further reduce the effort of analysing and visualising complex operating system behavior. Work is currently being done on time synchronisation between cluster nodes, to extend LTTV to cluster-wide analysis.

You are encouraged to use this tool and create new instrumentation, either in user space or in the kernel. LTTng and LTTV are distributed under the GPLv2 license.2

2 Project website: http://ltt.polymtl.ca

References

[1] Bryan M. Cantrill, Michael W. Shapiro, and Adam H. Leventhal. Dynamic instrumentation of production systems. In USENIX '04, 2004.

[2] Michel Dagenais, Richard Moore, Robert Wisniewski, Karim Yaghmour, and Thomas Zanussi. Efficient and accurate tracing of events in Linux clusters. In Proceedings of the Conference on High Performance Computing Systems (HPCS), 2003.

[3] Vara Prasad, William Cohen, Frank Ch. Eigler, Martin Hunt, Jim Keniston, and Brad Chen. Locating system problems using dynamic instrumentation. In OLS (Ottawa Linux Symposium) 2005, 2005.

[4] Ariel Tamches and Barton P. Miller. Fine-grained dynamic instrumentation of commodity operating system kernels. In 3rd Symposium on Operating Systems Design and Implementation, February 1999.

[5] Robert W. Wisniewski and Bryan Rosenburg. Efficient, unified, and scalable performance monitoring for multiprocessor operating systems. In Supercomputing, 2003 ACM/IEEE Conference, 2003.

[6] Robert W. Wisniewski, Peter F. Sweeney, Kartik Sudeep, Matthias Hauswirth, Evelyn Duesterwald, Calin Cascaval, and Reza Azimi. PEM: performance and environment monitoring for whole-system characterization and optimization. In PAC2 (Conference on Power/Performance interaction with Architecture, Circuits, and Compilers), 2004.

[7] Karim Yaghmour and Michel R. Dagenais. The Linux Trace Toolkit. Linux Journal, May 2000.

[8] Tom Zanussi, Karim Yaghmour, Robert Wisniewski, Richard Moore, and Michel Dagenais. relayfs: An efficient unified approach for transmitting data from kernel to user space. In OLS (Ottawa Linux Symposium) 2003, pages 519–531, 2003.


Linux as a Hypervisor: An Update

Jeff Dike, Intel Corp.

[email protected]

Abstract

Virtual machines are a relatively new workload for Linux. As with other new types of applications, Linux support was somewhat lacking at first and has improved over time.

This paper describes the evolution of hypervisor support within the Linux kernel, the specific capabilities which make a difference to virtual machines, and how they have improved over time. Some of these capabilities, such as ptrace, are very specific to virtualization. Others, such as AIO and O_DIRECT support, help applications other than virtual machines.

We describe areas where improvements have been made and are mature, areas where work is ongoing, and finally, areas where there are currently unsolved problems.

1 Introduction

Through its history, the Linux kernel has had increasing demands placed on it as it has supported new applications and new workloads. A relatively new demand is to act as a hypervisor, as virtualization has become increasingly popular. In the past, there were many weaknesses in the ability of Linux to be a hypervisor. Today, there are noticeably fewer, but they still exist.

Not all virtualization technologies stress the capabilities of the kernel in new ways. There are those, such as qemu, which are instruction emulators. These don't stress the kernel capabilities; rather, they are CPU-intensive and benefit from faster CPUs rather than more capable kernels. Others employ a customized hypervisor, which is often a modified Linux kernel. This will likely be a fine hypervisor, but that doesn't benefit the Linux kernel because the modifications aren't pushed into mainline.

User-mode Linux (UML) is the only prominent example of a virtualization technology which uses the capabilities of a stock Linux kernel. As such, UML has been the main impetus for improving the ability of Linux to be a hypervisor. A number of new capabilities have resulted in part from this, some of which have been merged and some of which haven't. Many of these capabilities have utility beyond virtualization, as they have also been pushed by people who are interested in applications that are unrelated to virtualization.

ptrace is the mechanism for virtualizing system calls, and is the core of UML's virtualization of the kernel. As such, some changes to ptrace have improved (and in one case, enabled) the ability to virtualize Linux.


Changes to the I/O system have also improved the ability of Linux to support guests. These were driven by applications other than virtualization, demonstrating that what's good for virtualization is often good for other workloads as well.

From a virtualization point of view, AIO and O_DIRECT allow a guest to do I/O as the host kernel does: straight to the disk, with no caching between its own cache and the device. In contrast, MADV_REMOVE allows a guest to do something which is very difficult for a physical machine, namely to implement hotplug memory, by releasing pages from the middle of a mapped file that is backing the guest's physical memory.

FUSE (Filesystems in Userspace), another recent addition, is also interesting, this time from a manageability standpoint. It allows a guest to export its filesystem to the host, where a host administrator can perform some guest management tasks without needing to log in to the guest.

There is a new effort to add a virtualization infrastructure to the kernel. A number of projects are contributing to this effort, including OpenVZ, vserver, UML, and others which are more interested in resource control than virtualization. This holds the promise of allowing guests to achieve near-native performance by allowing guest process system calls to execute on the host rather than be intercepted and virtualized by ptrace.

Finally, there are a few problem areas which are important to virtualization and for which there are no immediate solutions. It would be convenient to be able to create and manage address spaces separately from processes. This is part of the UML SKAS host patch, but the mechanism implemented there won't be merged into mainline. The current virtualization infrastructure effort notwithstanding, system call interception will be needed for some time to come, so system call interception will still be an area of concern. Ingo Molnar implemented a mechanism called VCPU which effectively allows a process to intercept its own system calls. This hasn't been looked at in any detail, so it's too early to see if this is a better way for virtual machines to do system call interception.

2 The past

2.1 ptrace

When UML was first introduced, Linux was incapable of acting as a hypervisor.1 ptrace allows one process to intercept the system calls of another, both at system call entry and exit. The tracing process can examine and modify the registers of the traced child. For example, strace simply examines the process registers in order to print the system call, its arguments, and return value. Other tools, UML included, modify the registers in order to change the system call arguments or return value. Initially, on i386, it was impossible to change the actual system call, as the system call number had already been saved before the tracing parent was notified of the system call. UML needed this in order to nullify system calls so that they would execute in such a way as to cause no effects on the host. This was done by changing the system call to getpid. A patch to fix this was developed soon after UML's first release, and it was fairly quickly accepted by Linus.
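A minimal i386 sketch of this nullification technique is shown below; it assumes the fix mentioned above is present in the host kernel, so that rewriting the syscall number actually takes effect, and the full UML tracing thread does considerably more bookkeeping than this.

#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <stddef.h>

/* Run the traced child to its next system call entry and replace the
 * pending call with getpid, so nothing happens on the host. */
static void nullify_next_syscall(pid_t child)
{
    struct user_regs_struct regs;
    int status;

    ptrace(PTRACE_SYSCALL, child, NULL, NULL);   /* continue to entry */
    waitpid(child, &status, 0);

    ptrace(PTRACE_GETREGS, child, NULL, &regs);
    regs.orig_eax = SYS_getpid;                  /* i386; orig_rax on x86_64 */
    ptrace(PTRACE_SETREGS, child, NULL, &regs);
}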

While this was a problem on i386, architectures differ in their handling of attempts to change system call numbers. The other architectures to which UML has been ported (x86_64, s390, and ppc) all handled this correctly, and needed no changes to their system call interception in order to run UML.

1 On i386, which was the only platform UML ran on at the time.

Once ptrace was capable of supporting UML, attention turned to its performance, as virtualized system calls are many times slower than non-virtualized ones. An intercepted system call involves the system call itself, plus four context switches: to the parent and back on both system call entry and exit. UML, and any other tool which nullifies and emulates system calls, has no need to intercept the system call exit. So, another ptrace patch, from Laurent Vivier, added PTRACE_SYSEMU, which causes only system call entry to notify the parent. There is no notification on system call exit. This reduces the context switching due to system call interception by 50%, with a corresponding performance improvement for benchmarks that execute a system call in a tight loop. There is also a noticeable performance increase for workloads that are not system call-intensive. For example, I have measured a ~3% improvement on a kernel build.

2.2 AIO and O_DIRECT

While these ptrace enhancements were driven solely by the needs of UML, most of the other enhancements to the kernel which make it more capable as a hypervisor were driven by other applications. This is the case for the I/O enhancements, AIO and O_DIRECT, which had been desired by database vendors for quite a while.

AIO (Asynchronous I/O) is the ability to issue an I/O request without having to wait for it to finish. The familiar read and write interfaces are synchronous: the caller can use them to make one I/O request and has to wait until it finishes before it can make another request. The wait can be long if the I/O requires disk access, which hurts the performance of processes which could have issued more requests or done other work in the meantime.

A virtual OS is one such process. The kernel typically issues many disk I/O requests at a time, for example in order to perform readahead or to swap out unused memory. When these requests are performed sequentially, as with read and write, there is a large performance loss compared to issuing them simultaneously. For a long time, UML handled this problem by using a separate dedicated thread for I/O. This allowed UML to do other work while an I/O request was pending, but it didn't allow multiple outstanding I/O requests.

The AIO capabilities which were introduced in the 2.6 kernel series do allow this. On a 2.6 host, UML will issue many requests at once, making it act more like a native kernel.
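As a rough sketch of how this is used (the libaio userspace interface, simplified, without error handling, and linked with -laio), several reads can be put in flight at once and reaped as they complete:

#include <libaio.h>
#include <stdlib.h>

#define NR_REQS 8
#define BLK     4096

/* Issue NR_REQS reads at once instead of one synchronous read at a time. */
static void submit_reads(int fd)
{
    io_context_t ctx = 0;
    struct iocb iocbs[NR_REQS], *iocbps[NR_REQS];
    struct io_event events[NR_REQS];
    void *bufs[NR_REQS];
    int i;

    io_setup(NR_REQS, &ctx);

    for (i = 0; i < NR_REQS; i++) {
        posix_memalign(&bufs[i], 512, BLK);
        io_prep_pread(&iocbs[i], fd, bufs[i], BLK, (long long)i * BLK);
        iocbps[i] = &iocbs[i];
    }

    io_submit(ctx, NR_REQS, iocbps);                    /* all requests in flight */
    io_getevents(ctx, NR_REQS, NR_REQS, events, NULL);  /* reap completions */
    io_destroy(ctx);
}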

A related capability is O_DIRECT I/O. This allows uncached I/O: the data isn't cached in the kernel's page cache. Unlike a cached write, which is considered finished when the data is stored in the page cache, an O_DIRECT write isn't completed until the data is on disk. Similarly, an O_DIRECT read brings the data in from disk, even if it is available in the page cache. The value of this is that it allows processes to control their own caching without the kernel performing duplicate caching on its own. For a virtual machine, which comes with its own caching system, this allows it to behave like a native kernel and avoid the memory consumption caused by buffered I/O.
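A small sketch of an uncached write follows; with O_DIRECT the buffer, file offset, and transfer size must be suitably aligned (typically to 512 bytes or the device's logical block size).

#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int direct_write(const char *path)
{
    void *buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);

    if (fd < 0 || posix_memalign(&buf, 512, 4096))
        return -1;
    memset(buf, 0, 4096);

    /* goes straight to the device; nothing is left in the page cache */
    ssize_t n = write(fd, buf, 4096);

    free(buf);
    close(fd);
    return n == 4096 ? 0 : -1;
}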

2.3 MADV_REMOVE

Unlike AIO and O_DIRECT, which allow a virtual kernel to act like a native kernel, MADV_REMOVE allows it to implement hotplug memory, which is much more difficult for a physical machine. UML implements its physical memory by creating a file on the host of the appropriate size and mapping pages from it into its own address space and those of its processes. I have long wanted a way to be able to free dirty pages from this file to the host as though they were clean. This would allow a simple way to manage the host's memory by moving it between virtual machines.

Removing memory from a virtual machine is done by allocating pages within it and freeing those pages to the host. Conversely, adding memory is done by freeing previously allocated pages back to the virtual machine's VM system. However, if dirty pages can't be freed on the host, there is no benefit.

I implemented one mechanism for doing this some time ago. It was a new driver, /dev/anon, which was based on tmpfs. UML physical memory is formed by mapping this device, which has the semantics that when a page is no longer mapped, it is freed. With /dev/anon, in order to pull memory from a UML instance, it is allocated from the guest VM system and the corresponding /dev/anon pages are unmapped. Those pages are freed on the host, and another instance can have a similar amount of memory plugged in.

This driver was never seriously considered for submission to mainline because it was a fairly dirty kludge to the tmpfs driver and because it was never fully debugged. However, the need for something equivalent remained.

Late in 2005, Badari Pulavarty from IBM proposed an madvise extension to do something equivalent. His motivation was that an IBM database wanted better control over its memory consumption and needed to be able to poke holes in a tmpfs file that it mapped. This is exactly what UML needed, and Hugh Dickins, who was aware of my desire for this, pointed Badari in my direction. I implemented a memory hotplug driver for UML, and he used it in order to test and debug his implementation.

MADV_REMOVE is now in mainline, and at this writing, the UML memory hotplug driver is in -mm and will be included in 2.6.17.
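From a guest's point of view, unplugging memory then amounts to punching a hole in the file backing its "physical memory" (a sketch; offsets and lengths must be page aligned, and the mapping must be backed by tmpfs/shmem):

#include <sys/mman.h>

/* Free a page-aligned range of the guest's physical-memory file back
 * to the host; the range reads back as zero-filled pages afterwards. */
static int unplug_guest_memory(void *physmem_base, size_t offset, size_t len)
{
    return madvise((char *)physmem_base + offset, len, MADV_REMOVE);
}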

3 Present

3.1 FUSE

FUSE2 is an interesting new addition to the kernel. It allows a filesystem to be implemented by a userspace driver and mounted like any in-kernel filesystem. It implements a device, /dev/fuse, which the userspace driver opens and uses to communicate with the kernel side of FUSE. It also implements a filesystem, with methods that communicate with the driver. FUSE has been used to implement things like sshfs, which allows filesystem access to a remote system over ssh, and ftpfs, which allows an ftp server to be mounted and accessed as a filesystem.

UML uses FUSE to export its filesystem to the host. It does so by translating FUSE requests from the host into calls into its own VFS. There were some mismatches between the interface provided by FUSE and the interface expected by the UML kernel. The most serious was the inability of the /dev/fuse device to support asynchronous operation: it didn't support O_ASYNC or O_NONBLOCK. The UML kernel, like any OS kernel, is event-driven, and works most naturally when requests and other things that require attention generate interrupts. It must also be possible to tell when a particular interrupt source is empty. For a file, this means that when it is read, it returns -EAGAIN instead of blocking when there is no input available. /dev/fuse did neither, so I implemented both O_ASYNC and O_NONBLOCK support and sent the patches to Miklos Szeredi, the FUSE maintainer.

2 http://fuse.sourceforge.net/

The benefit of exporting a UML filesystem to the host using FUSE is that it allows a number of UML management tasks to be performed on the host without needing to log in to the UML instance. For example, it would allow the host administrator to reset a forgotten root password. In this case, root access to the UML instance would be difficult, and would likely require shutting the instance down to single-user mode.

By chrooting to the UML filesystem mount on the host, the host admin can also examine the state of the instance. Because of the chroot, system tools such as ps and top will see the UML /proc and /sys, and will display the state of the UML instance. Obviously, this only provides read access to this state. Attempting to kill a runaway UML process from within this chroot will only affect whatever host process has that process ID.

3.2 Kernel virtualization infrastructure

There has been a recent movement to introduce a fairly generic virtualization infrastructure into the kernel. Several things seem to have happened at about the same time to make this possible. Two virtualization projects, Virtuozzo and vserver, which had long maintained their kernel changes outside the mainline kernel tree, expressed an interest in getting their work merged into mainline. There was also interest in related areas, such as workload migration and resource management.

This effort is headed in the direction of introducing namespaces for all global kernel data. The concept is the same as the current filesystem namespaces: processes are in the global namespace by default, but they can place themselves in a new namespace, at which point changes that they make to the filesystem aren't visible to processes outside the new namespace. The changes in question are changed mounts, not changed files; when a process in a new namespace changes a file, that's visible outside the namespace, but when it makes a mount in its namespace, that's not visible outside. For filesystems, the situation is more complicated than that, because there are rules for propagating new mounts between namespaces. However, for virtualization purposes, the simplest view of namespaces works: changes within a namespace aren't visible outside it.

When3 finished, it will be possible to create new instantiations of all of the kernel subsystems. At this point, virtualization approaches like OpenVZ and vserver will map pretty directly onto this infrastructure.

3 Or if: some subsystems will be difficult to virtualize.

UML will be able to put this to good use, but in a different way. It will allow UML to have its process system calls run directly on the host, without needing to intercept and emulate them itself. UML will create a virtualized instance of a subsystem and configure it as appropriate. At that point, UML process system calls which use that subsystem can run directly on the host and will behave the same as if they had been executed within UML.

For example, virtualizing time will be a matter of introducing a time namespace which contains an offset from the host time. Any process within this namespace will see a system time that's different from the host time by the amount of this offset. The offset is changed by settimeofday, which can now be an unprivileged operation since its effects are invisible outside the time namespace.

3or if—some subsystems will be difficult to virtualize


gettimeofday will take the host time and add the namespace offset, if any.

With the time namespace working, UML can take advantage of it by allowing gettimeofday to run directly on the host without being intercepted. settimeofday will still need to be intercepted because it will be a privileged operation within the UML instance. In order to allow it to run on the host, user and group IDs will need to be virtualized as well.

UML will be able to use the virtualized subsystems as they become available, and not have to wait until the infrastructure is finished. To do this, another ptrace extension will be needed. It will be necessary to selectively intercept system calls, so a system call mask will be added. This mask will specify which system calls should continue to be intercepted and which should be allowed to execute on the host.

Since some system calls will sleep when they are executed on the host, the UML kernel will need to be notified. When a process sleeps in a system call, UML will need to schedule another process to run, just as it does when a system call sleeps inside UML. Conversely, when the host system call continues running, the UML will need to be notified so that it can mark the process as runnable within its own scheduler. So, another ptrace extension, asking for notification when a child voluntarily sleeps and when it wakes up again, will be needed. As a side-benefit, this will also provide notification to the UML kernel when a process sleeps because it needs a page of memory to be read in, either because that page hadn't been loaded yet or because it had been swapped out. This will allow UML to schedule another process, letting it do some work while the first process has its page fault handled.

3.3 remap_file_pages

When page faults are virtualized, they are fixed by calling either mmap or mprotect4 on the host. In the case of mapping a new page, a new vm_area_struct (VMA) will be created on the host. Normally, a VMA describes a large number of contiguous pages, such as the process text or data regions, being mapped from a file into a region of a process virtual memory.

However, when page faults are virtualized, as with UML, each host VMA covers a single page, and a large UML process can have thousands of VMAs. This is a performance problem, which Ingo Molnar solved by allowing pages to be rearranged within a VMA. This is done by introducing a new system call, remap_file_pages, which enables pages to be mapped without creating a new VMA for each one. Instead, a single large mapping of the file is created, resulting in a single VMA on the host, and remap_file_pages is used to update the process page tables to change page mappings underneath the VMA.
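A sketch of the pattern (illustrative values only, no UML specifics): one large MAP_SHARED mapping is set up with mmap(), and individual pages are then redirected inside it with remap_file_pages() instead of creating one VMA per page.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Map 'len' bytes of 'fd' once, then make the page at index 'dst_page'
 * of the mapping show file page 'src_page' without adding a new VMA.
 * Page indices are in units of the system page size. */
static void *map_and_remap(int fd, size_t len,
                           size_t dst_page, size_t src_page)
{
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* prot must be 0; the last argument (flags) is currently unused. */
    if (remap_file_pages((char *)base + dst_page * pagesize,
                         pagesize, 0, src_page, 0) != 0)
        return NULL;
    return base;
}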

Paolo Giarrusso has taken this patch and is making it more acceptable for merging into mainline. This is a challenging process, as the patch is intrusive into some sensitive areas of the VM system. However, the results should be worthwhile, as remap_file_pages produces noticeable performance improvements for UML, and other mmap-intensive applications, such as some databases.

4 Future

So far, I’ve talked about virtualization enhance-ments which either already exist or which show

4depending on whether the fault was caused by no page being present or the page being mapped with insufficient access for the faulting operation


some promise of existing in the near future. There are a couple of areas where there are problems with no attractive solutions or a solution that needs a good deal of work in order to be possibly mergeable.

4.1 AIO enhancements

4.1.1 Buffered AIO

Currently AIO is only possible in conjunction with O_DIRECT. This is where the greatest benefit from AIO is seen. However, there is demand for AIO on buffered data, which is stored in the kernel buffer cache. UML has several filesystems which store data in the host filesystem, and the ability for these filesystems to perform AIO would be welcome. There is a patch to implement this, but it hasn't been merged.

4.1.2 AIO on metadata

Virtual machines would prefer to sleep in the host kernel only when they choose to, and for operations which may sleep to be performed asynchronously and deliver an event of some sort when they complete. AIO accomplishes this nicely for file data. However, operations on file metadata, such as stat, can still sleep while the metadata is read from disk. So, the ability to perform stat asynchronously would be a nice small addition to the AIO subsystem.

4.1.3 AIO mmap

When reading and writing buffered data, it is possible to save memory by mapping the data and modifying the data in memory rather than using read and write. When mapping a file, there is no copying of the data into the process address space. Rather, the page of data in the kernel's page cache is mapped into the address space.

Against the memory savings, there is the cost of changing the process memory mappings, which can be considerable—comparable to copying a page of data. However, on systems where memory is tight, the option of using mmap for guest file I/O rather than read and write would be welcome.

Currently, there is no support for doing mmap asynchronously. It can be simulated (which UML does) by calling mmap (which returns after performing the map, but without reading any data into the new page), and then doing an AIO read into the page. When the read finishes, the data is known to be in memory and the page can be accessed with high confidence5 that the access will not cause a page fault and sleep.
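A rough sketch of that simulation (hypothetical descriptors and offsets; this is not the actual UML code): the page is mapped first, then an AIO read of the backing data is queued into it, and the page is only touched once aio_error() reports completion.

#include <aio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one page of 'mem_fd' at the fixed address 'addr' and queue an
 * asynchronous read of 'data_fd' into it.  The caller polls
 * aio_error(cb) until it no longer returns EINPROGRESS before
 * touching the page. */
static int start_fault_in(void *addr, int mem_fd, off_t mem_off,
                          int data_fd, off_t data_off, struct aiocb *cb)
{
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);

    if (mmap(addr, pagesize, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, mem_fd, mem_off) == MAP_FAILED)
        return -1;

    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = data_fd;
    cb->aio_buf    = addr;            /* the data lands in the new page */
    cb->aio_nbytes = pagesize;
    cb->aio_offset = data_off;
    return aio_read(cb);              /* link with -lrt */
}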

This works well, but real AIO mmap support would have the advantage that the cost of the mmap and TLB flush could be hidden. If the AIO completes while another process is in context, then the address space of the process requesting the I/O can be updated for free, as a TLB flush would not be necessary.

4.2 Address spaces

UML has a real need for the ability of one process to be able to change mappings within the address space of another. In SKAS (Separate Kernel Address Space) mode, where the UML kernel is in a separate address space from its processes, this is critical, as the UML kernel needs to be able to fix page faults, COW processes' address spaces during fork, and empty process address spaces during execve.

5there is a small chance that the page could be swapped out between the completion of the read and the subsequent access to the data


In SKAS3 mode, with the host SKAS patch applied, this is done using a special device which creates address spaces and returns file descriptors that can be used to manipulate them. In SKAS0 mode, which requires no host patches, address space changes are performed by a bit of kernel code which is mapped into the process address space.

Neither of these solutions is satisfactory, nor are any of the alternatives that I know about.

4.2.1 /proc/mm

/proc/mm is the special device used in SKAS3 mode. When it is opened, it creates a new empty address space and returns a file descriptor referring to it. This address space remains in existence for as long as the file descriptor is open. On the last close, if it is not in use by a process, the address space is freed.

Mappings within a /proc/mm address space are changed by writing structures to the corresponding file descriptor. This structure is a tagged union with an arm each for mmap, munmap, and mprotect. In addition, there is a ptrace extension, PTRACE_SWITCH_MM, which causes the traced child to switch from one address space to another.

From a practical point of view, this has been a great success. It greatly improves UML performance, is widely used, and has been stable on i386 for a long time. However, from a conceptual point of view, it is fatally flawed. The practice of writing a structure to a file descriptor in order to accomplish something is merely an ioctl in disguise. If I had realized this at the time, I would have made it an ioctl. However, the requirement for a new ioctl is usually symptomatic of a design mistake. The use of write (or ioctl) is an abuse of the interface. It would have been better to implement three new system calls.

4.2.2 New system calls

My proposal, and that of Eric Biederman, who was also thinking about this problem, was to add three new system calls that would be the same as mmap, munmap, and mprotect, except that they would take an extra argument, a file descriptor, which would describe the address space to be operated upon, as shown in Figure 1.

This new address space would be returned by a fourth new system call which takes no arguments and returns a file descriptor referring to the address space:

int new_mm(void);

Linus didn’t like this idea, because he didn’twant to introduce a bunch of new system callswhich are identical to existing ones, except fora new argument. Instead he proposed a newsystem call which would run any other systemcall in the context of a different address space.

4.2.3 mm_indirect

This new system call is shown in Figure 2.

This would switch to the address space specified by the file descriptor and run the system call described by the second and third arguments.

Initially, I thought this was a fine idea, and I implemented it, but now I have a number of objections to it.

• It is unstructured—there is no type-checking on the system call arguments. This is generally considered undesirable in the system call interface as it makes it impossible for the compiler to detect many errors.


int fmmap(int address_space, void *start, size_t length,
          int prot, int flags, int fd, off_t offset);
int fmunmap(int address_space, void *start, size_t length);
int fmprotect(int address_space, const void *addr, size_t len,
              int prot);

Figure 1: Extended mmap, munmap, and mprotect

int mm_indirect(int fd, unsigned long syscall,
                unsigned long *args);

Figure 2: mm_indirect

• It is too general—it makes sense to invoke relatively few system calls under mm_indirect. For UML, I care only about mmap, mprotect, and munmap6. The other system calls for which this might make sense are those which take pointers into the process address space as either arguments or output values, but there is currently no demand for executing those in a different address space.

• It has strange corner cases—the implementation of mm_indirect has to be careful with address space reference counts. Several system calls change this reference count and mm_indirect would need to be aware of these. For example, both exit and execve dereference the current address space. mm_indirect has to take a reference on the new address space for the duration of the system call in order to prevent it disappearing. However, if the indirected system call is exit, it will never return, and that reference will never be dropped. This can be fixed, but the presence of behavior like this suggests that it is a bad idea. Also, the kernel stack could be attacked by nesting mm_indirect. The best way to deal with these problems is probably just to disallow running the problematic system calls under mm_indirect.

6and modify_ldt on i386 and x86_64

• There are odd implementation problems—for performance reasons, it is desirable not to do an address space switch to the new address space when it's not necessary, which it shouldn't be when changing mappings. However, mmap can sleep, and some systems (like SMP x86_64) get very upset when a process sleeps with current->mm != current->active_mm.

For these reasons, I now think that mm_indirect is really a bad idea.

These are all of the reasonable alternatives that I am aware of, and there are objections to all of them. So, with the exception of having these ideas aired, we have really made no progress on this front in the last few years.

4.3 VCPU

An idea which has independently come up several times is to do virtualization by introducing the idea of another context within a process. Currently, processes run in what would be called the privileged context. The idea is to add an unprivileged context which is entered using a new system call. The unprivileged context can't run system calls or receive signals. If it tries to execute a system call or a signal is delivered to it, then the original privileged context is resumed by the "enter unprivileged context" system call returning. The privileged context then decides how to handle the event before resuming the unprivileged context again.

In this scheme, the privileged context would be the UML kernel, and the unprivileged context would be a UML process. This idea has the promise of greatly reducing the overhead of address space switching and system call interception.

In 2004, Ingo Molnar implemented this, but didn't tell anyone until KS 2005. I haven't yet taken a good look at the patch, and it may turn out that it is unneeded given the virtualization infrastructure that is in progress.

5 Conclusion

In the past few years, Linux has greatly improved its ability to host virtual machines. The ptrace enhancements have been specifically aimed at virtualization. Other enhancements, such as the I/O changes, have broader application, and were pushed for reasons other than virtualization.

This progress notwithstanding, there are areas where virtualization support could improve. The kernel virtualization infrastructure project holds the promise of greatly reducing the overhead imposed on guests, but these are early days and it remains to be seen how this will play out.

If, for some reason, it goes nowhere, Ingo Molnar's VCPU patch is still a possibility.

There are still some unresolved problems, notably manipulating remote address spaces. This aside, all major problems with Linux hosting virtual machines have at least proposed solutions, if they haven't yet been actually solved.


System Firmware Updates Utilizing Software Repositories

OR: Two proprietary vendor firmware update packages walk into a dark alley, six RPMs in a yum repository walk out...

Matt Domsch
Dell

[email protected]

Michael Brown
Dell

[email protected]

Abstract

Traditionally, hardware vendors don't make it easy to update the firmware (motherboard BIOSes, RAID controller firmware, systems management firmware, etc.) that's flashed into their systems. Most provide DOS-based tools to accomplish this, requiring a reboot into a DOS environment. In addition, some vendors release OS-specific, proprietary tools, in proprietary formats, to accomplish this. Examples include Dell Update Packages for BIOS and firmware, HP Online System Firmware Update Component for Linux, and IBM ServRAID BIOS and Firmware updates for Linux. These tools only work on select operating systems, are large because they carry all necessary prerequisite components in each package, and cannot easily be integrated into existing Linux change management frameworks such as YUM repositories, Debian repositories, Red Hat Network service, or Novell/SuSE YaST Online Update repositories.

We propose a new architecture that utilizes native Linux packaging formats (.rpm, .deb) and native Linux change management frameworks (yum, apt, etc.) for delivering and installing system firmware. This architecture is OS distribution, hardware vendor, device, and change management system agnostic.

The architecture is easy as PIE: splitting Payload, Inventory, and Executable components into separate packages, using package format Requires/Provides language to handle dependencies at a package installation level, and using matching Requires/Provides language to handle runtime dependency resolution and installation ordering.

The framework then provides unifying applications such as inventory_firmware and apply_updates that handle runtime ordering of inventory, execution, and conflict resolution/notification for all of the plug-ins. These are the commands a system administrator runs. Once all of the separate payload, inventory, and execution packages are in package manager format, and are put into package manager repositories, then standard tools can retrieve, install, and execute them:

# yum install $(inventory_firmware -b)
# apply_updates

We present a proof-of-concept source code implementing the base of this system; web site and repository containing Dell desktop, notebook, workstation, and server BIOS images; open source tools for flashing Dell BIOSes; and open source tools to build such a repository yourself.

1 Overview

The purpose of this paper is to describe a proposal and sample implementation to perform generic, vendor-neutral, firmware updates using a system that integrates cleanly into a normal Linux environment. Firmware includes things such as system BIOS; addon-card firmware, e.g. RAID cards; system Baseboard Management (BMC), hard drives, etc. The first concept of the proposal is the definition of a basic update framework and a plugin API to inventory and update the system. For this, we define some basic utilities upon which to base the update system. We also define a plug-in architecture and API so that different vendor tools can cleanly integrate into the system. The second critical piece of the update system is cleanly separating system inventory, execution of updates, and payload, i.e. individual firmware images. After defining the basic utilities to glue these functions together, we define how a package management system should package each function. Last, we define the interaction between the package manager and the repository manager to create a defined interface for searching a repository for applicable updates. This paper will cover each of these points. The proposal describes an implementation, called firmware-tools [1].

2 Infrastructure

This section will detail the basic components used by the firmware update system, and firmware-tools. The base infrastructure for this system consists of two components: inventory and execution. These are named inventory_firmware and apply_updates, respectively. These are currently command-line utilities, but it is anticipated that, after the major architectural issues are worked out and this has been more widely peer-reviewed, there will be GUI wrappers written.

The basic assumption is that, before you can update firmware on a device, you need several pieces of information.

• What is the existing firmware version?

• What are the available versions of firmware that are on-disk?

• How do you do a version comparison?

• How do I get the correct packages installed for the hardware I have? In other words, solve the bootstrap issue.

It is important to note that all of these questions are independent of exactly how the files and utilities get installed on the system. We have deliberately split out the behavior of the installed utilities from the specification of how these utilities are installed. This allows us flexibility in packaging the tools using the "best" method for the system. Packaging will be discussed in a section below. The specification of packaging and how it interacts with the repository layer is an important aspect of how the initial set of utilities get bootstrapped onto the system, as well as how payload upgrades are handled over time.

2.1 Existing Firmware Version

The answer to the question, "What is the existing firmware version?" is provided by the inventory_firmware tool. The basic inventory_firmware tool has no capability to inventory anything; all inventory capability is provided by plugins. Plugins consist of a python module with a specific entry point, plus a configuration fragment to tell inventory_firmware about the plugin. Each plugin provides inventory capability for one device type. The plugin API is covered in the API section of this paper, below. It should be noted that, at this point, the plugin API is still open for suggestions and updates.

As an example, there is a dell-lsiflash package that provides a plugin to inventory firmware on LSI RAID adapters. The dell-lsiflash plugin package drops a configuration file fragment into the plugin directory /etc/firmware/firmware.d/ in order to activate the plugin. This configuration file fragment looks like this:

[delllsi]
# plugin that provides
# inventory for LSI RAID cards.
inventory_plugin=delllsi

This causes the inventory_plugin to load a python module named delllsi.py and use the entry points defined there to perform inventory on LSI RAID cards. The delllsi.py module is free to do the inventory any way it chooses. For example, there are vendor utilities that can sometimes be re-purposed to provide quick and easy inventory. In this specific case, we have written a small python extension module in C which calls a specific ioctl() in the LSI megaraid drivers to perform the inventory and works across all LSI hardware supported by the megaraid driver family. Note that while the framework is open source, the per-device inventory applications may choose their own licenses (of course, open source apps are strongly preferred).

2.2 Available Firmware Images

The next critical part of infrastructure lies in enumerating the payload files that are available on-disk. The main firmware-tools configuration file defines the top-level directory where firmware payloads are stored. The default location for firmware images is /usr/share/firmware/. This can be changed such that, for example, multiple systems network mount a central repository of firmware images. In general each type or class of firmware update will create a subdirectory under the main top-level directory, and each individual firmware payload will have another subdirectory under that.

Each individual firmware payload consists of two files: a binary data file of the firmware and a package.ini metadata file used by the firmware-tools utilities. It specifies the modules to be used to apply the update and the version of the update, among other things.

2.3 Version Comparison

Another interesting problem lies in doing version comparison between different version strings to try to figure out which is newer, due to the multitude of version string formats used by different firmware types. For example, some firmware might have version strings such as A01, A02, etc., while other firmware has version strings such as 2.7.0-1234, 2.8.1-1532, etc. Each different system may have different precedence rules. For example, current Dell BIOS releases have version strings in sequence like A01, A02, etc. But non-release, beta BIOS have version strings like X01, X02, etc., and developer test BIOS have version strings like P01, P02, etc. This poses a problem because a naive string comparison would always rank beta "X-rev" BIOS as higher version than production BIOS, which is undesirable.


The solution to this problem is to allow plugins to define version comparison functions. These functions take two strings as input and output which one is newer. Each package.ini configuration file contains the payload version, plus the name of the plugin to use for version comparison.

2.4 Initial Package Installation—Bootstrap

The last interesting problem arises when you consider how to decide which packages to download from the package repository and install on the local machine. This is a critical problem to solve in order to drive usability of this solution. If the user has to know details of the machine to manually decide which packages to download, then the system will not be successful. Next to consider is that a centralized solution does not fit in well with the distributed nature of Linux, Linux development, and the many vendors we hope to support with this solution. We aim to provide a distributed solution where the packages themselves carry the necessary metadata such that a repository manager metadata query can provide an accurate list of which package is needed.

Normal package metadata relates to the software in the package, including files, libraries, virtual package names, etc. The firmware-tools concept extends this by defining "applicability metadata" and adding it to the payload packages. For example, we add Provides: pci_firmware(...) RPM tags to tell that the given RPM file is applicable to certain PCI cards. Details on packaging are in the next section, including specifications on package Provides that must be in each package.

We then provide a special "bootstrap inventory" mode for the inventory tool. In this mode, inventory_firmware outputs a standardized set of package Provides names, based upon the current system hardware configuration. By default, this list only includes pci_firmware(...). Additional vendor-specific addon packs can add other, vendor-specific package names. For example, the Dell addon pack, firmware-addon-dell, adds system_bios(...) and bmc_firmware(...) standard packages to the list. We hope for wide vendor adoption in this area, where different vendors can provide addon packs for their standard systems. In this manner, the user need not know anything about their hardware, other than the manufacturer. They simply ask their repository manager to install the addon pack for their system. They then run bootstrap inventory to get a list of all other required packages. This list is fed to the OS repository manager, for example, yum, up2date, apt, etc. The repository manager will then search the repository for packages with matching Provides names. This package will normally be the firmware payload package. Through the use of Requires, the payload packages will then pull the execution and inventory packages into the transaction.

3 plugin-api

The current firmware-tools provides only infrastructure. All actual work is done by writing plugins to do either inventory, bootstrap, or execution tasks. We expect that as new members join the firmware-tools project this API will evolve. The current API is very straightforward, consisting of a configuration file, two mandatory function calls, and one optional function call. It is implemented in python, but we anticipate that in the future we may add a C API, or something like a WBEM API. The strength of the current implementation is its simplicity.


3.1 Configuration

Plugins are expected to write a configuration file fragment into /etc/firmware/firmware.d/. This fragment should be named modulename.conf. It is an INI-format configuration file that is read with the python ConfigParser module. Each configuration fragment should have one section named the same as the plugin, for example, [delllsi]. At the moment, there are only two configuration directives that can be placed in this section. The first is bootstrap_inventory_plugin= and the other is inventory_plugin=.

3.2 Bootstrap Inventory

When in bootstrap mode, inventory_firmware searches the configuration for bootstrap_inventory_plugin= directives. It then dynamically loads the specified python module. It then calls the BootstrapGenerator() function in that module. This function takes no arguments and is expected to be a python "generator" function [2]. This function yields, one-by-one, instances of the package.InstalledPackage class.

Figure 1 illustrates the Dell bootstrap generator for the firmware-addon-dell package.

This module is responsible for generating a list of all possible packages that could be applicable to Dell systems. As you can see, it outputs two standard packages, system_bios(...) and bmc_firmware(...). It is also responsible for outputting a list of pci_firmware(...) packages with the system name appended. In the future, as more packages are added to the system, we anticipate that the bootstrap will also output package names for things such as external SCSI/SAS enclosures, system backplanes, etc.

3.3 System Inventory

When in system inventory mode, inventory_firmware searches the configuration for inventory_plugin= directives. It then dynamically loads the specified python module. It then calls the InventoryGenerator() function in that module. This function takes no arguments and is expected to be a python "generator" function. This function yields, one-by-one, instances of the package.InstalledPackage class. The difference here between this and bootstrap mode is that, in system inventory mode, the inventory function will populate version and compareStrategy fields of the package.InstalledPackage class.

Figure 2 illustrates the Dell inventory generator for the firmware-addon-dell package.

The inventory generator in this instance outputs only the BIOS inventory, with more detailed version information. It is also responsible for setting up the correct comparison function to use for version comparison purposes.

3.4 On-Disk Payload Repository

The on-disk payload repository is the toplevel directory where firmware payloads are stored. There is currently not a separate tool to generate an inventory of the repository, but there is python module code in repository.py which will provide a list of available packages in the on-disk repository. The repository.Repository class handles the on-disk repository. The constructor should be given the top-level directory. After construction, the iterPackages() or iterLatestPackages() generator function methods can be called to get a list of packages in the repository. These generator functions output either all repository packages,


# standard entry point -- Bootstrap
def BootstrapGenerator():

    # standard function call to get Dell System ID
    sysId = biosHdr.getSystemId()

    # output packages for Dell BIOS and BMC
    for i in [ "system_bios(ven_0x1028_dev_0x%04x)",
               "bmc_firmware(ven_0x1028_dev_0x%04x)" ]:
        p = package.InstalledPackage(
            name = (i % sysId).lower())
        yield p

    # output all normal PCI bootstrap packages with system-specific name appended.
    module = __import__("bootstrap_pci", globals(), locals(), [])
    for pkg in module.BootstrapGenerator():
        pkg.name = "%s/%s" % (pkg.name,
            "system(ven_0x1028_dev_0x%04x)" % sysId)
        yield pkg

Figure 1: Dell bootstrap generator code

# standard entry point -- Inventory
def InventoryGenerator():

    sysId = biosHdr.getSystemId()
    biosVer = biosHdr.getSystemBiosVer()
    p = package.InstalledPackage(
        name = ("system_bios(ven_0x1028_dev_0x%04x)" % sysId).lower(),
        version = biosVer,
        compareStrategy = biosHdr.compareVersions,
        )
    yield p

Figure 2: Dell inventory generator code


or only latest packages, respectively. They read the package.ini file for each package and output an instance of package.RepositoryPackage. The package.ini specifies the wrapper to use for each repository package object. The wrapper will override the compareVersion() and install() methods as appropriate.

3.5 Execution

Execution is handled by calling the install() method on a package object returned from the repository inventory. The install() method is set up by a type-specific wrapper, as specified in the package.ini file. Figure 3 shows a typical wrapper class.

The wrapper constructor is passed a package object. The wrapper will then set up methods in the package object for install and version compare. A typical installation function is a simple call to a vendor command line tool. In this example, it uses the open-source dell_rbu kernel driver and the open-source libsmbios [3] dellBiosUpdate application to perform the update.

4 Packaging

The goal of packaging is to make it as easy as possible to integrate firmware update applications and payloads into existing OS deployments. This means following a standards-based packaging format. For Linux, this is the Linux Standard Base-specified Red Hat Package Manager (RPM) format, though we don't preclude native Debian or Gentoo package formats. The concepts are equally applicable; implementation is left as an exercise for the reader.

Base infrastructure components are in the firmware-tools package, detailed previously. Individual updates for specific device classes are split into two (or more) packages: an Inventory and Execution package, and a Payload package. The goal is to be able to provide newer payloads (the data being written into the flash memory parts) separate from providing newer inventory and execution components. In an ideal world, once you get the relatively simple inventory and execution components right, they would rarely have to change. However, one would expect the payloads to change regularly to add features and fix bugs in the product itself.

4.1 RPM Dependencies

Payload packages have a one-way (optionally versioned) RPM dependency on the related Inventory and Execution package. This allows tools to request the payload package, and the related Inventory and Execution package is downloaded as well. Should there be a compelling reason to do so, the Inventory and Execution components may be packaged separately, though most often they're done by the same tool.

Payload packages further Provide various tags, again to simplify automated download tools.

Let's look at the details, using the BIOS package for the Dell PowerEdge 6850 as an example. The actual BIOS firmware image is packaged in an RPM called system_bios_PE6850-a02-12.3.noarch.rpm. This package has RPM version-release a02-12.3, and is a noarch rpm because it does not contain any CPU architecture-specific executable content.

This package Provides:


class BiosPackageWrapper(object):
    def __init__(self, package):
        package.installFunction = self.installFunction
        package.compareStrategy = biosHdr.compareVersions
        package.type = self

    def installFunction(self, package):
        ret = os.system("/sbin/modprobe dell_rbu")
        if ret:
            out = ("Could not load Dell RBU kernel driver (dell_rbu).\n"
                   " This kernel driver is included in Linux kernel 2.6.14 and later.\n"
                   " For earlier releases, you can download the dell_rbu dkms module.\n\n"
                   " Cannot continue, exiting...\n")
            return (0, out)
        status, output = commands.getstatusoutput(
            """dellBiosUpdate -u -f %s""" % os.path.join(package.path, "bios.hdr"))
        if status:
            raise package.InstallError(output)
        return 1

Figure 3: Example wrapper class

system_bios(ven_0x1028_dev_0x0170) = a02-12.3
system_bios_PE6850 = a02-12.3

Let’s look at these one at a time.system_bios(ven_0x1028_dev_0x0170) = a02-12.3

This can be parsed as denoting a system BIOS, from a vendor with PCI SIG Vendor ID number of 0x1028 (Dell). For each vendor, there will be a vendor-specific system type numbering scheme which we care nothing about except to consume. In this example, 0x0170 is the software ID number of the PowerEdge 6850 server type. The BIOS version, again using a vendor-specific versioning scheme, is A02. All of the data in these fields can be determined programmatically, so it is suitable for automated tools.

Most systems and devices will have prettier, marketing names. Whenever possible, we want to use those, rather than the ID numbers, when interacting with the sysadmin. So this package also provides the same version information, only now using the marketing short name PE6850.

system_bios_PE6850 = a02-12.3

Presumably the marketing short names, though per-vendor, will not conflict in this flat namespace. The BIOS version, A02, is seen here again, as well as a release field (12.3) which can be used to indicate the version of the various tools used to produce this payload package. This version-release value matches that of the RPM package.

The firmware-addon-dell package provides an ID-to-shortname mapping config appropriate for Dell-branded systems. It is anticipated that other vendors will provide equivalent functionality for their packages. Users generating their own content for systems not in the list can accept the auto-generated name or add their system ID to the mapping config.

Epochs are used to account for version scheme changes, such as Dell's conversion from the Axx format to the x.y.z format.

To account for various types of firmware that may be present on the system, we have come up with a list of RPM Provides tags, seen in Figure 4. We anticipate adding new entries to this list as firmware updates for new types of devices are added to the system.

The combination pci_firmware/system entries


system_bios(ven_VEN_dev_ID)
pci_firmware(ven_VEN_dev_DEV)
pci_firmware(ven_VEN_dev_DEV_subven_SUBVEN_subdev_SUBDEV)
pci_firmware(ven_VEN_dev_DEV_subven_SUBVEN_subdev_SUBDEV)/system(ven_VEN_dev_ID)
bmc_firmware(ven_VEN_dev_ID)

system_bios_SHORTNAME
pci_firmware_SHORTNAME
pci_firmware_SHORTNAME/system_SHORTNAME
bmc_firmware_SHORTNAME

Figure 4: Package Manager Provides lines in payload packages

are to address strange cases where a given payload is applicable to a given device in a given system only, where the PCI ven/dev/subven/subdev values aren't enough to disambiguate this. It's very rare, and should be used with extreme caution, if at all.

These can be expanded to add additional firmware types, such as SCSI backplanes, hot plug power supply backplanes, disks, etc. as the need arises. These names were chosen to avoid conflicts with existing RPM packages' Provides.

4.2 Payload Package Contents

Continuing our BIOS example, the toplevel firmware storage directory is /usr/share/firmware. BIOS has its own subdirectory under the toplevel, at /usr/share/firmware/bios/, representing the top-level BIOS directory. The BIOS RPM payload packages install their files into subdirectories of the BIOS toplevel directory. Figure 5 shows this layout.

This allows multiple versions of each payload to be present on the file system, which may be handy for downrev'ing. It also allows an entire set of packages to be installed once on a file server and shared out to client servers.

In this example, the actual data being written to the flash is in the file bios.hdr. The package.ini file contains metadata about the payload described above and consumed by the framework apps. The package.xml file listed here was copied from the original vendor package. It contains additional metadata, and may be used by vendor-specific tools. The firmware-addon-dell package uses the information in this file to only attempt installing the payload onto the system type for which it was made (e.g. to avoid trying to flash a desktop system with a server BIOS image).

4.3 Obtaining Payload Content

We’ve described the format of the packages, butwhat if the existing update tools aren’t alreadyin the proper format? For example, as detailedat the beginning of this paper, most vendors re-lease their content in proprietary formats. Thesolution is to write a tool that will take the ex-isting proprietary formats and repackage theminto the firmware-tools format.

The fwupdate-tools package provides a script, mkbiosrepo.sh, which can download files from support.dell.com, extract and unpack the relevant payloads from them, and re-package them into packages as we've described here. This allows a graceful transition from an existing packaging format to this new format with little impact to existing business processes. The script can be extended to do likewise for other proprietary vendor package formats.


# rpm -qpl system_bios_PE6850-a02-12.3.noarch.rpm
/usr/share/firmware/bios
/usr/share/firmware/bios/system_bios_ven_0x1028_dev_0x0170_version_a02
/usr/share/firmware/bios/system_bios_ven_0x1028_dev_0x0170_version_a02/bios.hdr
/usr/share/firmware/bios/system_bios_ven_0x1028_dev_0x0170_version_a02/package.ini
/usr/share/firmware/bios/system_bios_ven_0x1028_dev_0x0170_version_a02/package.xml

Figure 5: Example Package Manager file layout

If this format proves to be popular, it is hoped that vendors will start to release packages in native firmware-tools format. The authors of this paper are already working internally to Dell to push for this change, although there is currently no ETA nor guarantee of official Dell support. We are working on the open-source firmware-tools project to prototype the solution and to get peer review on this concept from other industry experts in this area.

5 Repositories

We recognize that each OS distribution has its own model for making packages available in an online repository. Red Hat Enterprise Linux customers use Red Hat Network, or RHN Satellite Server, to host packages. Fedora and CentOS use Yellow dog Updater, Modified (YUM) repositories. SuSE uses Novell ZenWorks, YaST Online Update (YOU) repositories, and newer SuSE releases can use YUM repositories too. Debian uses FTP archives. Other third party package managers have their own systems and tools. The list goes on and on. In general, you can put RPMs or debs into any of these, and they "just work."

As an optimization, you can package RPMs in a single directory, and provide the multiple forms of metadata that each require in that same location, letting one set of packages, and one repository, be easily used by all of the system types. The mkbiosrepo.sh script manages metadata for both YUM and YOU tools. Creation of channels in Red Hat Network Satellite Server is, unfortunately, a manual process at present; uploading content into channels is easily done using RHN tools. Providing packages in other repository formats is another exercise left to the reader.

6 System Administrator Use

Up to this point, everything has focused on creating and publishing packages in a format for system administration tools to consume. So how does this all look from the sysadmin perspective?

6.1 Pulling from a Repository

First, you must configure your target systems to be able to pull files from the online repositories. How you do that is update system specific, but it probably involves editing a configuration file (/etc/yum.repos.d/, /usr/sysconfig/rhn/sources, ...) to point at the repository, configure GPG keys, and the like. Nothing here is specific to updating firmware.

The first tool you need is one that will match your system vendor, which pulls in the framework packages, which provides the inventory_firmware tool.

# yum install firmware-addon-dell


6.2 Bootstrapping from a Repository

Now it’s time to request from the repositoryall the packages that might match your targetsystem. inventory_firmware, in bootstrapmode, provides the list of packages that couldexist. Figure 6 shows an example.

We pass this value to yum or up2date, as such:

# yum install $(inventory_firmware -b)

or

# up2date -i $(inventory_firmware -b -u)

This causes each of the possible firmware Payload packages, if they exist in any of the repositories we have configured to use, to be retrieved and installed into the local file system. Because the Payload packages have RPM dependencies on their Inventory and Execution packages, those are downloaded and installed also.

Subsequent update runs, such as the nightly yum or up2date run, will then pick up any newer packages, using the list of packages actually on our target system. If packages for new device types are released into the repository (e.g. someone adds disk firmware update capability), then the sysadmin will have to run the above commands again to download those new packages.

6.3 Applying Firmware Updates

apply_updates will perform the actual flash part update using the inventory and execution tools and payloads for each respective device type.

# apply_updates

apply_updates can be configured to run automatically at RPM package installation time, though it's more likely to be run as a scheduled downtime activity.

7 Proof of Concept Payload Repository

Using the above tool set, we've created a proof-of-concept payload repository [4], containing the latest Dell system BIOS for over 200 system types, and containing Dell PERC RAID controller firmware for current generation controllers. It provides YUM and YOU metadata in support of target systems running Fedora Core 3, 4, and 5, Red Hat Enterprise Linux 3 and 4 (and its clones like CentOS), and Novell/SuSE Linux Enterprise Server 9 and 10. New device types and distributions will be added in the future.

8 Future Directions

We believe that this model for automatically downloading firmware can also be used for other purposes. For example, we could tag DKMS [5] driver RPMs with such tags and have the inventory system output pci_driver(...) lines to be fed into yum or up2date. A proposal has been sent to the dkms mailing list with subsequent commentary and discussion. This model could also be used for things like Intel ipw2x00 firmware, which typically is downloaded separately from the kernel and must match the kernel driver version.

9 Conclusion

While most sysadmins only update their BIOS and firmware when they have to, the process


# inventory_firmware -b
system_bios(ven_0x1028_dev_0x0170)
bmc_firmware(ven_0x1028_dev_0x0170)
pci_firmware(ven_0x8086_dev_0x3595)/system(ven_0x1028_dev_0x0170)
pci_firmware(ven_0x8086_dev_0x3596)/system(ven_0x1028_dev_0x0170)
pci_firmware(ven_0x8086_dev_0x3597)/system(ven_0x1028_dev_0x0170)
...

Figure 6: Running inventory_firmware -b

should be as easy as possible. By utilizing OS tools already present, BIOS and firmware change management becomes just as easy as other software change management. We've developed this to be Linux distribution, hardware manufacturer, system manufacturer, and update mechanism agnostic, and have demonstrated its capability with Dell BIOS and PERC Firmware on a number of Linux distributions and versions. We encourage additional expansion of the types of devices handled, types of OSs, and types of update systems, and would welcome patches that provide this functionality.

10 Glossary

Package: OS standard package (.rpm/.deb)

Package Manager: OS standard package manager (rpm/dpkg)

Repository Manager: OS standard repository solution (yum/apt)

References

[1] Firmware-tools Project
Home page: http://linux.dell.com/firmware-tools/
Mailing list: http://lists.us.dell.com/mailman/listinfo/firmware-tools-devel

[2] Python Generator documentation
http://www.python.org/dev/peps/pep-0255/

[3] Libsmbios Project
Home Page: http://linux.dell.com/libsmbios
Mailing list: http://lists.us.dell.com/mailman/listinfo/libsmbios-devel

[4] Proof of Concept Payload Repository
Home Page: http://fwupdate.com

[5] DKMS Project
Home Page: http://linux.dell.com/dkms
Mailing list: http://lists.us.dell.com/mailman/listinfo/dkms-devel


The Need for Asynchronous, Zero-Copy Network I/O

Problems and Possible Solutions

Ulrich Drepper
Red Hat, Inc.

[email protected]

Abstract

The network interfaces provided by today's OSes severely limit the efficiency of network programs. The kernel copies the data coming in from the network interface at least once internally before making the data available in the user-level buffer. This article explains the problems and introduces some possible solutions. These necessarily cover more than just the network interfaces themselves; there is a bit more support needed.

1 Introduction

Writing scalable network applications is today more challenging than ever. The problem is the antiquated (standardized) network API. The Unix socket API is flexible and remains usable, but with ever higher network speeds, new technologies like interconnects, and the resulting expected scalability, we reach its limits. CPUs and especially their interface to the memory subsystem are not capable of dealing with the high volume of data in the available short timeframes.

What is needed is an asynchronous interface for networking. The asynchronicity would primarily be a means to avoid unnecessary copying of data. It also would help to avoid congestion since network buffers can be freed earlier, which in turn ensures that retransmits due to full network buffers are minimized.

Existing interfaces like the POSIX AIO functions fall short of providing the necessary functionality. This is not only due to the fact that pre-posting of buffers is only possible at a limited scale. A perhaps bigger problem is the expensive event handling. The event handling itself has requirements and challenges which currently cannot be worked around (like waking up too many waiters).

In the remainder of the paper we will see the different set of interfaces which are needed:

• event handling

• physical memory handling

• asynchronous network interfaces

The event handling must work with the existing select()/poll() interfaces. It also should be generic enough to be usable for other events which do not map to file descriptors at the moment (like message queues, futexes, etc). This way we might finally have one unified inner loop in the event handling of a program.
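Today that unified inner loop is typically built around poll(), roughly as in the sketch below (handle_io() is a made-up placeholder for whatever the application does with a ready descriptor); the point of the requirement is that every event source should be representable in such a loop.

#include <poll.h>

void handle_io(struct pollfd *pfd);   /* application-specific placeholder */

/* Minimal event loop over a fixed set of descriptors. */
static void event_loop(struct pollfd *fds, nfds_t nfds)
{
    for (;;) {
        if (poll(fds, nfds, -1) <= 0)
            continue;
        for (nfds_t i = 0; i < nfds; i++)
            if (fds[i].revents != 0)
                handle_io(&fds[i]);
    }
}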


Physical memory suddenly becomes important because network devices address only physical memory. But physical memory is invisible in the Unix ABI and this is very much welcome. If this were not the case, the Copy-On-Write concept used on CPUs with MMU support would not work, and the implementation of functions like fork() would become harder. The proposal will center around making physical memory regions objects the kernel knows how to handle.

Finally, using event and physical memory handling, it is possible to define sane interfaces for asynchronous network handling. In the following sections we will see some ideas on how this can happen.

It is not at all guaranteed that these interfaces will stand the test of time or will even be implemented. The intention of this paper is to get the ball rolling because the problems are pressing and we definitely need something along these lines. Starting only from the bottom (i.e., from the kernel implementation) has the danger of ignoring the needs of programmers and might miss the bigger picture (e.g., integration into a bigger event handling scheme, for instance).

2 The Existing Implementation

Network stacks in Unix-like OSes have more or less the same architecture today as they had 10–20 years ago. The interface the OS provides for reading and writing from and to network interfaces consists of the interfaces in Table 1.

These interfaces all work synchronously. The interfaces for reading return only when data is available or in error conditions. In non-blocking mode they can return immediately if no data is available, but this is no real asynchronous handling. The data is not transferred to userlevel in an asynchronous fashion.

receiving      sending
read()         write()
recv()         send()
recvfrom()     sendto()
recvmsg()      sendmsg()

Table 1: Network APIs

receiving      sending
read()1        write()1
aio_read()     aio_write()
lio_listio()

Table 2: AIO APIs

Linux provides an asynchronous mode for terminals, sockets, pipes, and FIFOs. If a file descriptor has the O_ASYNC flag set, calls to read() and write() immediately return and the kernel notifies the program about completion by sending a signal. This is a slightly more complicated and more restrictive version of the AIO interfaces than when using SIGEV_SIGNAL (see below) and therefore suffers, in addition to its own limitations, from those of SIGEV_SIGNAL.

For truly asynchronous operations on files the POSIX AIO functions from Table 2 are available. With these interfaces it is possible to submit a number of input and output requests on one or more file descriptors. Requests are filled as the data becomes available. No particular order is guaranteed but requests can have priorities associated with them and the implementation is supposed to order the requests by priority. The interfaces also have a synchronous mode which comes in handy from time to time. Interesting here is the asynchronous mode. The big problem to solve is that somehow the program has to be able to find out when the submitted requests are handled.

1With the O_ASYNC flag set for the descriptor.


There are three modes defined by POSIX (a short usage sketch follows the list):

SIGEV_SIGNAL The completion is signaled by sending a specified signal to the process. Which thread receives the signal is determined by the kernel by looking at the signal masks. This makes it next to impossible to use this mechanism (and O_ASYNC) in a library which might be linked into arbitrary code.

SIGEV_THREAD The completion is signaled by creating a thread which executes a specified function. This is quite expensive in spite of NPTL.

SIGEV_NONE No notification is sent. The program can query the state of the request using the aio_error() interface which returns EINPROGRESS in case the request has not yet been finished. This is also possible for the other two modes but it is crucial for SIGEV_NONE.
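For concreteness, a minimal submission using the signal-based mode could look like this (descriptor, buffer size, and signal number are arbitrary examples; link with -lrt):

#include <aio.h>
#include <signal.h>
#include <string.h>

static char buf[4096];

/* Submit one asynchronous read; the kernel delivers the chosen
 * real-time signal to the process when the request completes. */
static int submit_read(int fd, struct aiocb *cb)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = buf;
    cb->aio_nbytes = sizeof(buf);
    cb->aio_offset = 0;
    cb->aio_sigevent.sigev_notify          = SIGEV_SIGNAL;
    cb->aio_sigevent.sigev_signo           = SIGRTMIN + 3;
    cb->aio_sigevent.sigev_value.sival_ptr = cb;  /* handed to the handler */
    return aio_read(cb);
}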

The POSIX AIO interfaces are designed for file operations. Descriptors for sockets might be used with the Linux implementation but this is not what the functions are designed for and there might be problems.

All I/O interfaces have some problems in common: the caller provides the buffer into which the received data is stored. This is a problem in most situations for even the best theoretical implementation. Network traffic arrives asynchronously, mostly beyond the control of the program. The incoming data has to be stored somewhere or it gets lost.

To avoid copying the data more than once it would therefore be necessary to have buffers usable by the user available right at the moment when the data arrives. This means:

• for the read() and recv() interfaces it would be necessary that the program is making such a call just before the data arrives. If there is no such call outstanding the kernel has to use its own buffers or (for unreliable protocols) it can discard the data.

• with aio_read() and the equivalent lio_listio() operation it is possible to pre-post a number of buffers. When the number goes down, more buffers can be pre-posted. The main problem with these interfaces is what happens next. Somehow the program needs to be notified about the arrival of data. The three mechanisms described above are either based on polling (SIGEV_NONE) or are far too heavy-weight. Imagine sending 1000s of signals a second, corresponding to the number of incoming packets. Creating threads is even more expensive.

Another problem is that for unreliable protocols it might be more important to always receive the last arriving data. It might contain more relevant information. In this case data which arrived before should be sacrificed.

A second problem all implementations have in common is that the caller can provide arbitrary memory regions for input and output buffers to the kernel. This is in general wanted. But if the network hardware is supposed to transfer directly into the memory regions specified, it is necessary for the program to use memory that is special. The network hardware uses Direct Memory Access (DMA) to write into RAM instead of passing data through the CPU. This happens at a level below the virtual address space management; DMA only uses physical addresses.

Besides possible limitations on where the RAM for the buffers is located in the physical address space, the biggest problem is that the buffers must remain in RAM until used. Ordinarily userlevel programs do not see physical RAM; the virtual address is an abstraction and the OS might decide to remove memory pages from RAM to make room for other processes. If this would happen while a network I/O request is pending, the DMA access of the network hardware would touch RAM which is now used for something else.

This means while buffers are used for DMA they must not be evicted from RAM. They must be locked. This is possible with the mlock() interface, but this is a privileged operation. If a process were able to lock down arbitrary amounts of memory it would impact all the other processes on the system, which would be starved of resources. Recent Linux kernels allow unprivileged processes to lock down a modest amount of memory (by default eight pages or so) but this would not be enough for heavily network oriented applications.
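The locking limit mentioned above can be observed from unprivileged code; the following small sketch (buffer size arbitrary) queries RLIMIT_MEMLOCK and then tries to mlock() a heap buffer, which fails once the allowance is exhausted:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
  struct rlimit rl;
  if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
    printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu\n",
           (unsigned long long) rl.rlim_cur,
           (unsigned long long) rl.rlim_max);

  size_t len = 64 * 1024;                 /* arbitrary buffer size */
  void *buf = malloc(len);
  if (buf == NULL)
    return 1;

  if (mlock(buf, len) != 0)
    perror("mlock");                      /* fails once the limit is exceeded */
  else
    munlock(buf, len);

  free(buf);
  return 0;
}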

The POSIX AIO interfaces certainly show the way for the interfaces which can solve the networking problems. But we have to solve several problems:

• make DMA-ready memory available to unprivileged applications;

• create an efficient event handling mechanism which can handle high volumes of events;

• create I/O interfaces which can use the new memory and event handling. As a bonus they should be usable for disk I/O as well.

At this point it should be mentioned that a working group of the OpenGroup, the Interconnect Software Consortium, tried to tackle this problem. The specification is available from their website at http://www.opengroup.org/icsc/. They arrived at the same set of three problems and proposed solutions. Their solutions are not implemented, though, and they have some problems. Most importantly, the event handling does not integrate with the file-descriptor-based event handling.

3 Memory Handling

The main requirement on the memory handling is to provide memory regions which are available at userlevel and which can be directly accessed by hardware other than the processor. Network cards and disk controllers can transfer data without the help of the CPU through DMA. DMA addresses memory based on the physical addresses. It does not matter how the physical memory is currently used. If the virtual memory system of the OS decides that a page of RAM should be used for some other purpose, the devices would overwrite the new user's memory unless this is actively prevented. There is no demand-paging as for the userlevel code.

To be sure the DMA access will use the correct buffer, it is necessary to prevent swapping the destination pages out. This is achieved by using mlock(). Memory locking depletes the amount of RAM the system can use to keep as much of the combined virtual memory of all processes in RAM as possible. This can severely limit the performance of the system or eventually prevent it from making any progress. Memory locking is therefore a privileged operation. This is the first problem to be solved.

The situation is made worse by the fact that locking can only be implemented on a per-page basis. Locking one small object on a page ties down the entire page.


One possibility would be to avoid locking pages in the program and have the kernel instead do the work all by itself and on demand. That means if a network I/O request specifies a buffer, the kernel could automatically make sure that the memory page(s) containing the buffer are locked. This would be the most elegant solution from the userlevel point of view. But it would mean significant overhead: for every operation the memory page status would have to be checked and if necessary modified. Network operations can be frequent and multiple buffers can be located on the same page. If this is known, the checks performed by the kernel would be unnecessary, and if they are performed the kernel must keep track of how many DMA buffers are located on the page. This solution is likely to be unattractive.

It is possible to defer solving this problem, fully or in part, to the user. In the least accommodating solution, the kernel could simply require the userlevel code to use mmap() and mprotect() with a new flag to create DMA-able memory regions. Inside these memory regions the program can carve out individual buffers, thereby mitigating the problem of locking down many pages which are only partially used as buffers. This solution puts all the burden on the userlevel runtime.

It also has a major disadvantage. Pages locked using mlock() are locked until they are unlocked or unmapped. But for the purpose of DMA the pages need not be permanently locked. The locking is really only needed while I/O requests using DMA are being executed. For the network I/O interfaces we are talking about here the kernel always knows when such a request is pending. Therefore it is theoretically possible for the kernel to lock the pages on request. For this the pages would have to be specially marked. While no request is pending, or if a network interface is used which does not provide DMA access, the virtual memory subsystem of the OS can move the page around in physical memory or even swap it out.

One relatively minor change to the kernel could allow for such optimizations. If the mmap() call could be passed a new flag MAP_DMA the kernel would know what the buffer is used for. It could keep track of the users of the page and avoid locking it unless it is necessary. In an initial implementation the flag could be treated as an implicit mlock() call. If the flag is correctly implemented it would also be possible to specify different limits on the amount of memory which can be locked and which can be used for DMA, respectively. This is no full solution to the problem of requiring privileges to lock memory, though (an application could simply have a read() call pending all the time).

The MAP_DMA flag could also help dealing with the effects of fork(). The POSIX specification requires that no memory locking is inherited by the child. File descriptors, on the other hand, are inherited. If parts of the solution for the new network interfaces use file descriptors (as is proposed later) we would run into a problem: the interface is usable, but before the first use it would be necessary to re-lock the memory. With the MAP_DMA flag this could be avoided. The memory would simply be automatically re-locked when it is used in the child for the first time. To help in situations where the memory is not used at all after fork(), for example if an exec call immediately follows, all MAP_DMA memory is unlocked in the child.

Using this one flag alone could limit the performance of the system, though. The kernel will always have to make sure that the memory is locked when an I/O request is pending. This is overhead which could potentially be a limiting factor. The programmer oftentimes has better knowledge of the program semantics. She would know which memory regions are used for longer periods of time so that one explicit lock might be more appropriate than implicit locking performed by the kernel.

A second problem is fragmentation. A program is usually not one homogeneous body of code. Many separate libraries are used which all could perform network I/O. With the MAP_DMA method proposed so far each of the libraries would have to allocate its own memory region. This use of memory might be inefficient because of the granularity of memory locking and because not all parts of the program might need the memory concurrently.

To solve the issue, the problem has to be tackled at a higher level. We need to abstract the memory handling. Providing interfaces to allocate and deallocate memory would give the implementation sufficient flexibility to solve these issues and more. The allocation interfaces could still be implemented using the MAP_DMA flag and the allocation functions could "simply" be userlevel interfaces and not system calls. One possible set of interfaces could look like this:

int dma_alloc(dma_mem_t *handlep, size_t size, unsigned int flags);
int dma_free(dma_mem_t handle, size_t size);

The interfaces which require DMA-able memory would be passed a value of type dma_mem_t. How this handle is implemented would be implementation defined and could in fact change over time. An initial, trivial implementation could even do without support for something like MAP_DMA and use explicit mlock() calls.
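To make the proposal concrete, a minimal userlevel sketch of such an initial, trivial implementation could look as follows; dma_mem_t, dma_alloc(), and dma_free() are the interfaces proposed here, not an existing API, and the sketch simply maps anonymous memory and locks it explicitly:

#include <sys/mman.h>

typedef void *dma_mem_t;   /* handle representation is up to the implementation */

int dma_alloc(dma_mem_t *handlep, size_t size, unsigned int flags)
{
  void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return -1;
  if (mlock(p, size) != 0) {   /* subject to RLIMIT_MEMLOCK */
    munmap(p, size);
    return -1;
  }
  *handlep = p;
  (void) flags;                /* unused in this trivial sketch */
  return 0;
}

int dma_free(dma_mem_t handle, size_t size)
{
  munlock(handle, size);
  return munmap(handle, size);
}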

epoll_wait()    poll()    select()
epoll_pwait()   ppoll()   pselect()

Table 3: Notification APIs

4 Event Handling

The existing event handling mechanisms of POSIX AIO use polling, signals, or the creation of threads. Polling is not a general solution. Signals are not only costly, they are also unreliable. Only a limited, small number of signals can be outstanding at any time. Once the limit is reached a program has to fall back on alternative mechanisms (like polling) until the situation is rectified. Also, due to the limitations imposed on code usable in signal handlers, writing programs using signal notification is awkward and error-prone. The creation of threads is even more expensive and despite the speed of NPTL has absolutely no chance to scale with high numbers of events.

What is needed is a completely new mechanism for event notification. We cannot use the same mechanisms as used for synchronous operations on a descriptor for a socket or a file. If data is available and can be sent, this does not mean that an asynchronously posted request has been fulfilled.

The structure of a program designed to run on a Unix-y system requires that the event mechanism can be used with the same interfaces used today for synchronous notification (see Table 3). It would be possible to invent a completely new notification handling mechanism and map the synchronous file descriptor operations to it. But why? The existing mechanisms work nicely, they scale well, and programmers are familiar with them. It also means existing code does not have to be completely rewritten.

Creating a separate channel (e.g., file descriptor) for each asynchronous I/O request is not scalable. The number of I/O requests can be high enough to forbid the use of the poll and select interfaces. The epoll interfaces would also be problematic because for each request the file descriptor would have to be registered and later unregistered. This overhead is too big. Furthermore, a file descriptor has a certain cost in the kernel and therefore the number is limited.

What is therefore needed is a kind of bus used to carry the notifications for many requests. A mechanism like netlink would be usable. The netlink sockets receive broadcast traffic for all the listeners and each process has to filter out the data which it is interested in. Broadcasting makes netlink sockets unattractive (at best) for event handling. The possible volume of notifications might be overwhelming. The overhead for the unnecessary wake-ups could be tremendous.

If filtering is accepted as not being a viable implementation requirement, we have as a requirement for the solution that each process can create multiple, independent event channels, each capable of carrying arbitrarily many notification events from multiple sources. If we were not able to create multiple independent channels, a program could not concurrently and uncoordinatedly create such channels.

Each channel could be identified by a descriptor. This would then allow the use of the notification APIs in as many places as necessary independently. At each site only the relevant events are reported, which allows the event handling to be as efficient as possible.

An event is not just an impulse, it has to transmit some information. The request which caused the event has to be identified. It is usually² regarded best to allow the programmer to add additional information. A single pointer is sufficient; it allows the programmer to refer to additional data allocated somewhere else. There is no need to allow adding an arbitrary amount of data. The event data structure can therefore be of fixed length. This simplifies the event implementation and possibly allows it to perform better. If the transmission of the event structure were implemented using sockets, the SOCK_SEQPACKET type could be used. The structure could look like this:

typedef struct event_data {
  enum { event_type_aio,
         event_type_msq,
         event_type_sig } ev_type;
  union {
    aio_ctx_t *ev_aio;
    mqd_t *ev_msq;
    sigevent_t ev_sig;
  } ev_un;
  ssize_t ev_result;
  int ev_errno;
  void *ev_data;
} event_data_t;

² See the sigevent structure.

This structure can be used to signal events other than AIO completion. It could be a general mechanism. For instance, there currently is no mechanism to integrate POSIX message queues into poll() loops. With an extension to the sigevent structure it could be possible to register the event channel using the mq_notify() interface. The kernel can be extended to send events in all kinds of situations.

One possible implementation consists of introducing a new protocol family PF_EVENT. An event channel could then be created with:

int efd = socket(PF_EVENT, SOCK_SEQPACKET, 0);


int ev_send(int s, const void *buf, size_t len, int flags,
            ev_t ec, void *data);
int ev_sendto(int s, const void *buf, size_t len, int flags,
              const struct sockaddr *to, socklen_t tolen,
              ev_t ec, void *data);
int ev_sendmsg(int s, const struct msghdr *msg, int flags,
               ev_t ec, void *data);
int ev_recv(int s, void *buf, size_t len, int flags,
            ev_t ec, void *data);
int ev_recvfrom(int s, void *buf, size_t len, int flags,
                struct sockaddr *to, socklen_t tolen,
                ev_t ec, void *data);
int ev_recvmsg(int s, struct msghdr *msg, int flags,
               ev_t ec, void *data);

Figure 1: Network Interfaces with Event Channel Parameters

The returned handle could be used in poll() calls and be used as the handle for the event channel. There are two potential problems which need some thought:

• The kernel cannot allow the event queue to take up arbitrary amounts of memory. There has to be an upper limit on the number of events which can be queued at the same time. When this limit is reached a special event should be generated. It might be possible to use out-of-band notification for this so that the error is recognized right away.

• The number of events on a channel can potentially be high. In this case the overhead of all the read()/recv() calls could be a limiting factor. It might be beneficial to apply some of the techniques for the network I/O discussed in the next section to this problem as well. Then it might be possible to poll for new events without the system call overhead.

To enable optimizations like possible userlevel-visible event buffers, the actual interface for the event handling should be something like this:

ec_t ec_create(unsigned flags);
int ec_destroy(ec_t ec);
int ec_to_fd(ec_t ec);
int ec_next_event(ec_t ec, event_data_t *d);

The ec_to_fd() function returns a file descriptor which can be used in poll() or select() calls. An implementation might choose to make this interface basically a no-op by implementing the event channel descriptor as a file descriptor. The ec_next_event() function returns the next event. A call might result in a normal read() or recv() call but it might also use a user-level-visible buffer to avoid the system call overhead. The events signaled by poll() etc. can be limited to the arrival of new data, i.e., the userlevel code is responsible for clearing the buffers before waiting for the next event using poll(). The kernel is involved in the delivery of new data and therefore this type of event can quite easily be generated.
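A sketch of how these proposed primitives might be driven from an application's main loop is shown below. None of ec_t, ec_create(), ec_to_fd(), ec_next_event(), or event_data_t exist today; the return-value conventions and the handle_aio_completion() helper are assumptions made purely for illustration:

#include <poll.h>

static void handle_aio_completion(const event_data_t *ev)
{
  /* Application-specific processing of ev->ev_result, ev->ev_errno,
     and the submission cookie in ev->ev_data would go here. */
  (void) ev;
}

static void event_loop(void)
{
  ec_t ec = ec_create(0);
  struct pollfd pfd = { .fd = ec_to_fd(ec), .events = POLLIN };

  for (;;) {
    if (poll(&pfd, 1, -1) <= 0)
      break;

    event_data_t ev;
    /* Drain everything that is queued; assume 0 means "one event
       returned" and that the call does not block on an empty queue. */
    while (ec_next_event(ec, &ev) == 0) {
      if (ev.ev_type == event_type_aio)
        handle_aio_completion(&ev);
    }
  }
  ec_destroy(ec);
}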

Handles of type ec_t can be passed to the asynchronous interfaces. The kernel can then create appropriate events on the channel. There will be no fixed relationship between the file descriptor or socket used in the asynchronous operation and the event channel. This gives the most flexibility to the programmer.

int aio_send(struct aiocb *aiocbp, int flags);
int aio_sendto(struct aiocb *aiocbp, int flags,
               const struct sockaddr *to, socklen_t tolen);
int aio_sendmsg(struct aiocb *aiocbp, int flags);
int aio_recv(struct aiocb *aiocbp, int flags);
int aio_recvfrom(struct aiocb *aiocbp, int flags,
                 struct sockaddr *to, socklen_t tolen);
int aio_recvmsg(struct aiocb *aiocbp, int flags);

Figure 2: Network Interfaces matching POSIX AIO

5 I/O Interfaces

There are several possible levels of innovation and complexity which can go into the design of the asynchronous I/O interfaces. It makes sense to go through them in sequence of increasing complexity. The more complicated interfaces will likely take advantage of the same functionality the less complicated ones need, too. Mentioning the new interfaces here is not meant to imply that all interfaces should be provided by the implementation.

The simplest of the interfaces can extend the network interfaces with asynchronous variants which use the event handling introduced in the previous section. One possibility is to extend interfaces like recv() and send() to take additional parameters to use event channels. The result is seen in Figure 1.

Calls to these functions immediately return. Valid requests are simply queued and the notifications about the completion are sent via the event channel ec. The data parameter is the additional value passed back as part of the event_data_t object read from the event channel. The event notification would signal the type of operation by setting ev_type appropriately. Success and the amount of data received or transmitted are stored in the ev_errno and ev_result elements.
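For illustration, pre-posting a set of receive buffers with the proposed ev_recv() interface might look like this; ev_recv() and ev_t are the proposals from Figure 1, and the buffer array and count are arbitrary choices for the sketch:

#define NBUF 16

struct rx_buf { char data[2048]; };

static void prepost_buffers(int sock, ec_t ec, struct rx_buf bufs[NBUF])
{
  for (int i = 0; i < NBUF; i++)
    /* The per-request cookie comes back as ev_data in the completion
       event and identifies which buffer was filled. */
    ev_recv(sock, bufs[i].data, sizeof(bufs[i].data), 0, ec, &bufs[i]);
}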

There are two objections to this approach. First, the other frequently used interfaces for sockets (read() and write()) are not handled. Although their functionality is a strict subset of recv() and send() respectively, it might be a deterrent. The second argument is more severe: there is no justification to limit the event handling to network transfers. The same functionality would be "nice to have"™ for file, pipe, and FIFO I/O. Extending the read() and write() interfaces in the same way as the network I/O interfaces makes no sense, though. We already have interfaces which could be extended.

With a simple extension of the sigevent structure we can reuse the POSIX AIO interfaces. All that would be left to do is to define appropriate versions of the network I/O interfaces to match the existing POSIX AIO interfaces and change the aiocb structure slightly. The new interfaces can be seen in Figure 2. The aiocb structure needs to have one additional element:


struct aiocb {
  ...
  struct msghdr *aio_msg;
  ...
};

It is used in the aio_sendmsg() and aio_recvmsg() calls. The implementation can choose to reuse the memory used for the aio_buf element because it never gets used at the same time as aio_msg. The other four interfaces use aio_buf and aio_nbytes to specify the source and destination buffer respectively.

The <signal.h> header has to be extended to define SIGEV_EC. If the sigev_notify element of the sigevent structure is set to this value the completion is signaled by an appropriate event available on an event channel. The channel is identified by a new element which must be added to the sigevent structure:

struct sigevent {
  ...
  ec_t sigev_ec;
  ...
};

The additional pointer value which is passed back to the application is also stored in the sigevent structure. The application has to store it in sigev_value.sival_ptr, which is in line with all the other uses of this part of the sigevent structure.
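Putting these pieces together, a hypothetical sketch of submitting an asynchronous receive with the proposed aio_recv() interface and SIGEV_EC notification could look as follows; SIGEV_EC, sigev_ec, and aio_recv() are the proposals made above, not existing APIs:

#include <aio.h>
#include <string.h>

static int post_recv(int sock, ec_t ec, void *buf, size_t len,
                     struct aiocb *cb, void *cookie)
{
  memset(cb, 0, sizeof(*cb));
  cb->aio_fildes = sock;
  cb->aio_buf = buf;
  cb->aio_nbytes = len;
  cb->aio_sigevent.sigev_notify = SIGEV_EC;        /* proposed value */
  cb->aio_sigevent.sigev_ec = ec;                  /* proposed member */
  cb->aio_sigevent.sigev_value.sival_ptr = cookie; /* comes back as ev_data */
  return aio_recv(cb, 0);                          /* proposed interface */
}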

Introducing these additional AIO interfaces and the SIGEV_EC notification mechanism would help to solve some problems.

• programs could get more efficient notification of events (at least more efficient than signals and thread creation), even for file I/O;

• network operations which require the extended functionality of the recv and send interfaces can be performed asynchronously;

• by pre-posting buffers with aio_read() or the new aio_recv and aio_send interfaces, network I/O might be able to avoid intermediate buffers.

Especially the first two points are good arguments to implement these interfaces, or at the very least to allow the existing POSIX AIO interfaces to use the event channel notification. As explained in section 2, the memory handling of the POSIX AIO functions makes direct use by the network hardware cumbersome and slower than necessary. Additionally, the system call overhead is high when many interfaces use the event channel notification. As explained, network requests have to be submitted. This can potentially be solved by extending the lio_listio() interface to allow submitting multiple requests at once. But this will not solve the problem of the resulting event notification storm. For this we need more radical changes.

6 Advanced I/O Interfaces

For the more advanced interfaces we need to integrate the DMA memory handling into the I/O interfaces. We need to consider synchronous and asynchronous interfaces. We could ignore the synchronous interfaces and require the use of lio_listio() or an equivalent interface, but this is a bit cumbersome to use.


int dma_assoc(int sock, dma_mem_t mem, size_t size, unsigned flags);
int dma_disassoc(int sock, dma_mem_t mem, size_t size);

Figure 3: Association of DMA-able memory to Sockets

int sio_reserve(dma_mem_t dma, void **memp, size_t size);
int sio_release(dma_mem_t dma, void *mem, size_t size);

Figure 4: Network Buffer Memory Management

For network interfaces it is ideally the interface which controls the memory into which incoming data is written. Today this happens with buffers allocated by and under full control of the kernel. It is conceivable to allow applications to allocate buffers and assign them to a given interface. This is where dma_alloc() comes in. The latter possibility has some distinct advantages; mainly, it gives the program the opportunity to influence the address space layout. This can be necessary for some programs.³

It is usually not possible to associate each network interface with a userlevel process. The network interface is in most cases a shared resource. The usual Unix network interface rules therefore need to be followed. A userlevel process opens a socket, binds the socket to a port, and it can send and receive data. For the incoming data the header decides which port the remote party wants to target. Based on the number, the socket is selected. Therefore the association of the DMA-able buffer should be with a socket. What is needed are interfaces as can be seen in Figure 3. It probably should be possible to associate more than one DMA-able memory region with a socket. This way it is possible to dynamically react to unexpected network traffic volume by adding additional buffers.

³ For instance, when address space is scarce or when fixed addresses are needed.

Once the memory is associated with the socket the application cannot use it anymore as it pleases until dma_disassoc() is called. The kernel has to be notified if the memory is written to, and the kernel needs to tell the application when data is available to be read. Otherwise the kernel might start using a DMA memory region which the program is also using, thus overwriting the data. We therefore need at least interfaces as shown in Figure 4. The sio_reserve() interface allows reserving (parts of) the DMA-able buffer for writing by the application. This will usually be done in preparation of a subsequent send operation. The dma parameter is the value returned by a previous call to dma_alloc(). We use a size parameter because this allows the DMA-able buffer to be split into several smaller pieces. As explained in section 3 it is more efficient to allocate larger blocks of DMA-able memory instead of many smaller ones because memory locking only works with page granularity. The implementation is responsible for not using the same part of the buffer more than once at the same time. A pointer to the available memory is returned in the variable pointed to by memp.

When reading from the network the situation is reversed: the kernel will allocate the memory region into which it stores the incoming data. This happens using the kernel equivalent of the sio_reserve() interface. Then the program is notified about the location and size of the incoming data. Until the program is done handling the data the buffer cannot be reused. To signal that the data has been handled, the sio_release() interface is used. It is also possible to use the interface to abort the preparation of a write operation by undoing the effects of a previous sio_reserve() call.

int sio_send(int sock, const void *buf, size_t size, int flags);
int sio_sendto(int sock, const void *buf, size_t size, int flags,
               const struct sockaddr *to, socklen_t tolen);
int sio_sendmsg(int sock, const void *buf, size_t size, int flags);
int sio_recv(int sock, void **buf, size_t size, int flags);
int sio_recvfrom(int sock, const void **buf, size_t size, int flags,
                 struct sockaddr *to, socklen_t tolen);
int sio_recvmsg(int sock, const void **buf, size_t size, int flags);

Figure 5: Advanced Synchronous Network Interfaces

The sio_reserve() and sio_release() interfaces basically implement dynamic memory allocation and deallocation. It adds an undue burden on the implementation to require a full-fledged malloc-like implementation. It is therefore suggested to require a significant minimum allocation size. If reservations are also rounded according to the minimum size, this will in turn limit the number of reservations which can be given out at any given time. It is possible to use a simple bitmap allocator.

What remains to be designed are the actual network interfaces. For the synchronous interfaces we need the equivalent of the send and recv interfaces. The send interfaces can basically work like the existing Unix interfaces with the one exception that the memory block containing the data must be part of a DMA-able memory region. The recv interfaces need to have one crucial difference: the implementation must be able to decide the location of the buffer containing the returned data. The resulting interfaces can be seen in Figure 5.

The programmer has to make sure the buffer pointers passed to the sio_send functions have been returned by a sio_reserve() call or as part of the notification of a previous sio_recv call. The implementation can potentially detect invalid pointers.

When the sio_recv functions return, the pointer pointed to by the second parameter contains the address of the returned data. This address is in the DMA-able memory area associated with the socket. After the data is handled and the buffer is not used anymore, the application has to mark the region as unused by calling sio_release(). Otherwise the kernel would run out of memory to store the incoming data in.
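A sketch of the complete zero-copy receive path under these proposals follows; all dma_*() and sio_*() calls are the proposed interfaces, the region size is arbitrary, the interpretation of sio_recv()'s return value as a byte count is an assumption, and process_packet() is a hypothetical consumer:

/* Hypothetical consumer of a received packet. */
extern void process_packet(const void *data, size_t len);

static void recv_loop(int sock)
{
  dma_mem_t mem;
  const size_t region_size = 1 << 20;      /* arbitrary 1 MB region */

  if (dma_alloc(&mem, region_size, 0) != 0)
    return;
  if (dma_assoc(sock, mem, region_size, 0) != 0)
    return;

  for (;;) {
    void *buf;
    int n = sio_recv(sock, &buf, region_size, 0);  /* kernel picks the buffer */
    if (n <= 0)
      break;
    process_packet(buf, (size_t) n);
    sio_release(mem, buf, (size_t) n);     /* hand the buffer back */
  }

  dma_disassoc(sock, mem, region_size);
  dma_free(mem, region_size);
}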

For the asynchronous interfaces one could imagine simply adding a sigevent structure parameter to the sio_recv and sio_send interfaces. This is unfortunately not sufficient. The program must be able to retrieve the error status and the actual number of bytes which have been received or sent. There is no way to transmit this information in the sigevent structure. We could extend it but would duplicate functionality which is already available. The asynchronous file I/O interfaces have the same problem and the solution is the AIO control block structure aiocb. It only makes sense to extend the POSIX AIO interfaces. We already defined the additional interfaces needed in Figure 2. What is missing is the tie-in with the DMA handling.

For this the most simplistic approach is to extend the aiocb structure by adding an element aio_dma_buf of type dma_mem_t, replacing the aio_buf pointer for DMA-ready operations. To use aio_dma_buf instead of aio_buf the caller passes the new AIO_DMA_BUF flag to the aio_recv and aio_send interfaces. For the lio_listio() interface it is possible to define new operations LIO_DMA_READ and LIO_DMA_WRITE. This leaves the existing aio_read() and aio_write() interfaces. It would be possible to define alternative interfaces which take a flag parameter, or one could simply ignore the problem and tell people to use lio_listio() instead.
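As an illustration, a batch of such DMA-ready receives could be submitted with lio_listio() roughly as follows; LIO_DMA_READ and aio_dma_buf are the proposed extensions and do not exist today, and the way the sketch reuses one DMA handle for all requests is merely an assumption:

#include <aio.h>

static int submit_dma_reads(int sock, dma_mem_t mem,
                            struct aiocb cbs[], int n)
{
  struct aiocb *list[n];
  for (int i = 0; i < n; i++) {
    cbs[i].aio_fildes = sock;
    cbs[i].aio_dma_buf = mem;               /* proposed member */
    cbs[i].aio_lio_opcode = LIO_DMA_READ;   /* proposed opcode */
    list[i] = &cbs[i];
  }
  /* Queue all requests with one system call; completions are
     reported later through the chosen notification mechanism. */
  return lio_listio(LIO_NOWAIT, list, n, NULL);
}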

The implementation of the AIO functions to receive data when operating on DMA-able buffers could do more than just pass the request to the kernel. The implementation can keep track of the buffers involved and check for available data in them before calling the kernel. If data is available the call can be avoided and the appropriate buffer can be made known through an appropriate event. When writing, the data could be written into the DMA-able buffer (if necessary). Depending on the implementation of the user-level/kernel interaction of the DMA-able buffers it might or might not be necessary to make a system call to notify the kernel about the new pending data.

7 Related Interfaces

The event channel mechanism is general enough to be used in other situations than just I/O. It can help solve a long-standing problem of the interfaces Unix systems provide. Programs, be they server or interactive programs, are often designed with a central loop from which the various activities requested are initiated. There can be one thread working the inner loop or many. The requested actions can be performed by the thread which received the request or a new thread can be created which performs the action. The threads in the program are then either waiting in the main loop or busy working on an action. If the action could potentially be delayed significantly the thread would add the wait event to the list the main loop handles and then enter the main loop again. This achieves maximum resource usage.

In reality this is not so easy. Not all events can be waited on with the same mechanism. POSIX does not provide mechanisms to use poll() to wait for messages to arrive in message queues, for mutexes to be unlocked, etc. This is where the event channels can help. If we can associate an event channel with these objects the kernel could generate events whenever the state changes.

For POSIX message queues there is fortunately not much which needs to be done. The mq_notify() interface takes a sigevent structure parameter. Once the implementation is extended to handle SIGEV_EC for I/O it should work here, too. One question to be answered is what to pass as the data parameter which can be used to identify the request.
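A hypothetical sketch of registering a message queue with an event channel through the existing mq_notify() interface, assuming the proposed SIGEV_EC extension and sigev_ec member:

#include <mqueue.h>
#include <signal.h>
#include <string.h>

static int watch_queue(mqd_t q, ec_t ec, void *cookie)
{
  struct sigevent sev;
  memset(&sev, 0, sizeof(sev));
  sev.sigev_notify = SIGEV_EC;           /* proposed notification type */
  sev.sigev_ec = ec;                     /* proposed member */
  sev.sigev_value.sival_ptr = cookie;    /* identifies the queue in the event */
  return mq_notify(q, &sev);
}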

For POSIX semaphores we need a new interface to initiate asynchronous waiting. Figure 6 shows the prototype for sem_await(). The first two parameters are the same as for sem_wait(). The latter two parameters specify the event channel and the parameter to pass back. When the event reports a successful operation the semaphore has been posted. It is not necessary to call sem_wait() again.


int sem_await(sem_t semdes, const struct timespec *abstime,
              ec_t ec, void *data);
int pthread_mutex_alock(pthread_mutex_t *mutex, ec_t ec, void *data);

Figure 6: Additional Event Channel Users

The actual implementation of this interface will be more interesting. Semaphores and also mutexes are implemented using futexes. Only part of the actual implementation is in the kernel. The kernel does not know the actual protocol used for the synchronization primitive; this is left to the implementation. In case the event channel notification is requested the kernel will have to learn about the protocol.

Once the POSIX semaphore problem is solved it is easy enough to add support for the POSIX mutexes, read-write mutexes, barriers, etc. The pthread_mutex_alock() interface in Figure 6 is a possible solution. The other synchronization primitives can be similarly handled. This extends also to the System V message queues and semaphores. The difference for the latter two is that the implementation is already completely in the kernel and therefore the implementation should be significantly simpler.

8 Summary

The proposed interfaces for network I/O have the potential of great performance improvements. They avoid using the most limiting resources in a modern computer: memory-to-CPU cache and CPU cache-to-memory bandwidth. By minimizing the number of copies which have to be performed the CPUs have the chance of keeping up with the faster increasing network speeds.

Along the way we sketched out an event handling implementation which is not only efficient enough to keep up with the demands of the network interfaces. It is also versatile enough to finally allow implementing a unified inner loop for all event driven programs. With the poll or select interfaces being able to receive event notifications for currently unobservable objects like POSIX/SysV message queues and futexes, many programs have the opportunity to become much simpler because special handling for these cases can be removed.


Problem Solving With Systemtap

Frank Ch. Eigler
Red Hat

[email protected]

Abstract

Systemtap is becoming a useful tool to help solve low-level OS problems. Most features described in the future tense at last year's OLS are now complete. We review the status and recent developments of the system. In passing, we present solutions to some complex low-level problems that bedevil kernel and application developers.

Systemtap recently gained support for static probing markers that are compiled into the kernel, to complement the dynamic kprobes system. It is a simple and fast mechanism, and we invite kernel developers and other trace-like tools to adopt it.

1 Project status

At OLS 2005, we presented [4] systemtap, the open source tool being developed for tracing/probing of a live unmodified linux system. It accepts commands in a simple scripting language, and hooks them up to probes inserted at requested code locations within the kernel. When the kernel trips across the probes, routines in a compiled form of the script are quickly run, then the kernel resumes. Over the last year, with the combined efforts of a dozen developers supported by four companies, much of this theory has turned into practice.

1.1 Scripting language

The systemtap script is a small domain-specific language resembling awk and C. It has only a few data types (integers and strings, plus associative arrays of these) and full control structures (blocks, conditionals, loops, functions). It is light on punctuation (semicolons are optional) and on declarations (types are inferred and checked automatically). Its core concept, the "probe," consists of a probe point (its trigger event) and its handler (the associated statements).

Probe points name the kernel events at which the statements should be executed. One may name nearly any function, or a source file and line number where the breakpoint is to be set (just like in a symbolic debugger), or request an asynchronous event like a periodic timer. Systemtap defines a hierarchical probe point namespace, a little like DNS.

Probe handlers have few constraints. They can print data right away, to provide a sort of on-the-fly printk. Or, they can save a timestamp in a variable and compare it with a later probe hit, to derive timing profiles. Or, they can follow kernel data structures, and speak up if something is amiss.

The scripting language is implemented by a translator that creates C code, which is in turn compiled into a binary kernel module. Probe points are mapped to virtual addresses by reference to the kernel's DWARF debugging information left over from its build. The same data is used to resolve references to kernel "target-side" variables. Their compiled nature allows even elaborate probe scripts to run fast.

Safety is an essential element of the design. All the language constructs are subjected to translation- and run-time checks, which aim to prevent accidental damage to the system. This includes prevention of infinite loops, excessive memory use, recursion, pointer faults, and several others. Many checks may be inspected within the translator-generated C code.

Some safety mechanisms are incomplete at present. Systemtap contains a blacklist of kernel areas that are deemed unsafe to probe, since they might trigger infinite probing recursion, locking reentrancy, or other nasty phenomena. This blacklist is just getting started, so probing using broad wildcards is a recipe for panics. Similarly, we haven't sufficiently analyzed the script-callable utility functions like our gettimeofday wrapper to ensure that they are safe to call from any probe handler. Work in these directions is ongoing.

1.2 Recent developments

The most basic development since last summer is that the system works, whereas last year we relied on several mock-ups. You can download it¹, build it, and use it on your already installed kernels today. It is not perfect nor complete, but nor is it vapourware.

kprobes has received a heart transplant. The most significant of these changes was truly concurrent probing on multiprocessor machines, made possible by a switch to RCU data structures.

¹ http://sourceware.org/systemtap/

Implementation details are discussed in a separate paper [3] during this conference. In order to exploit the parallelism enabled by this improvement, systemtap supports variables to track global statistics aggregates like averages or counts using contention-free data structures.

For folks who like the exhilaration of full control, or have a distaste for the scripting language, Systemtap supports bypassing the cushion. In "guru mode," systemtap allows intermingling of literal C code with script, to go beyond the limitations of pure script code. One can query or manipulate otherwise inaccessible kernel state directly, but bears responsibility for doing so safely.

Systemtap documentation is slowly growing, as is our collection of sample scripts. There is a fifteen-page language tutorial, and a few dozen worked out examples on our web site. More and more first-time users are popping up on the mailing list, so we are adapting to supporting new users, not just fellow project developers.

1.3 Usage scenarios

While it’s still early, systemtap has suggested several uses. First is simple exploration and profiling. A probe on “timer.profile” and collecting stack backtrace samples gets one a coarse profile. A probe at a troubled function, with a similar stack backtrace, tells one who is the troublemaker. Probes on system call handling functions (or more conveniently named aliases defined in a library) give one an instant system-wide strace, with as much filtering and summarizing as one may wish. As a taste, Figure 1 demonstrates probing function nesting within a compilation unit.

Daniel Berranger [1] arranged to run systemtap throughout a Linux boot sequence (/etc/init.d scripts) to profile the I/O and forking characteristics of the many startup scripts and daemons. Some wasteful behavior showed up right away in the reports. On a similar topic, Dave Jones [2] is presenting a paper at this conference.

Another problem may be familiar: an overactive kswapd. In an old Red Hat Enterprise Linux kernel, it was found that some inner page-scanning loop ran several orders of magnitude more iterations than anticipated, due to some error in queue management code. Does this kind of thing not happen regularly? Systemtap was not available for diagnosing this bug, but it would have been easy to probe loops in the suspect functions, say by source file and line number, to count and graph relative execution counts.

2 Static probing markers

Systemtap recently added support for static probing markers, or "markers" for short. This is a way of letting developers designate points in their functions as being candidates for systemtap-style probing. The developer inserts a macro call at the points of interest, giving the marker a name and some optional parameters, and grudgingly recompiles the kernel. (The name can be any alphanumeric symbol, and should be reasonably unique across the kernel or module. Parameters may be string or numeric expressions.)

In exchange for this effort, systemtap marker-based probes are faster and more precise than kprobes. The better precision comes from not having to covet the compiler's favours. Such fickle favours include retaining clean boundaries in the instruction stream between interesting statements, and precisely describing positions of variables in the stack frame. Since markers don't rely on debugging information, neither favour is required, and the compiler can channel its charms into unabated optimization. The speed advantage comes from using direct call instructions rather than int 3 breakpoints to dispatch to the systemtap handlers. We will see below just how big a difference this makes.

STAP_MARK (name);
STAP_MARK_NS (name, num, string);

Just putting a marker into the code does nothing except waste a few cycles. A marker can be "activated" by writing a systemtap probe associated with the marker name. All markers with the same name are identified, and are made to call the probe handler routine. Like any other systemtap probe, the handler can trace, collect, filter, and aggregate data before returning.

probe kernel.mark("name") { }
probe module("drv").mark("name") { }

2.1 Implementation

As hinted above, the probe marker is a macro² that consists of a conditional indirect function call. Argument expressions are evaluated in the conditional function call. Similarly to C++, an explicit argument-type signature is appended to the macro and the static variable name.

#define STAP_MARK(n) do { \
  static void (*__mark_##n##_)(); \
  if (unlikely (__mark_##n##_)) \
    (void) (__mark_##n##_()); \
} while (0)

In x86 assembly language, this translates to a load from a direct address, a test, and a conditional branch over a call sequence.

² Systemtap includes a header file that defines scores of type/arity permutations.


The load/zero-test is easily optimized by "hoisting" it up (earlier), since it is operating on private data. With GCC's -freorder-blocks optimization flag, the instructions for the function call sequence tend to be pushed well away from (beyond) the hot path, and get jumped to using a conditional forward branch. That is ideal from the perspective of hardware static branch prediction.

A new static variable is created for each macro. If the macro is instantiated within an inline function, all inlined instances within a program will share that same variable. Systemtap can search for the variables in the symbol table by matching names against the stylized naming scheme. Further, systemtap deduces argument types from the signature suffix, so it can write a type-safe function to accept the parameters and dispatch to a compiled probe handler.

During probe initialization, the static variable containing the marker's function pointer is simply overwritten to point at the handler, and it is cleared again at shutdown.³

³ These operations atomically synchronize using cmpxchg.

This design implies that only a single handler can be associated with any single marker: other systemtap sessions are locked out temporarily. Should this become a problem for particularly popular markers, we can add support for "multi-marker" macros that use some small number of synonymous static variables instead of one. This would trade utility for speed.

2.2 Performance

Several performance metrics are interesting: code bloat, slowdown due to a dormant marker, and dispatch cost of an active marker. These quantities may be compared to the classic kprobes alternative. On all these metrics, markers seem to perform well.

For demonstration purposes, we inserted marker macros in just two spots in a 2.6.16-based kernel: the scheduler context-switch routine, just before switch_to (passing the "from" and "to" task->pid numbers), and the system call handler sys_getuid (passing current->uid). All tests were run on a Fedora Core 5 machine with a 3 GHz Pentium 4 HT.

Code bloat is the number of bytes of instruction code needed to support the marker, which impacts the instruction cache. With kprobes, there is no code inserted, so those numbers are zero. We measured it for static markers by disassembling otherwise identical kernel binaries, compiled with and without markers.

function         test   call
getuid           10     19
context_switch   19     34

Slowdown due to a dormant marker is the time penalty for having a potential but unused probe point. This quantity is also zero for kprobes. For our static markers, it is the time taken to test whether the static variable is set, and, it being clear, to bypass the probe function call. It may incur a data cache miss (for loading the static variable), but the actual test and properly predicted branch can be nearly "free."

Indeed, a microbenchmark that calls a marker-instrumented getuid system call in a tight loop a million times has minimum and average times that match one that calls an uninstrumented system call (getgid). A different microbenchmark that runs the same marker macro but in user space, surrounded by rdtscll calls, indicates a cost of a handful of cycles each: 4–20.
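The user-space variant of that measurement can be reproduced in a few lines of C; the following sketch (x86 only) re-declares the marker macro locally and uses a local rdtsc() helper rather than the kernel's rdtscll(), so the loop count is arbitrary and the result is only indicative:

#include <stdint.h>
#include <stdio.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

/* Same shape as the kernel macro shown above.  Note: since the pointer
   is never set in this self-contained program, compile without heavy
   optimization or the compiler may fold the dormant test away. */
#define STAP_MARK(n) do { \
    static void (*__mark_##n##_)(void); \
    if (unlikely (__mark_##n##_)) \
      (void) (__mark_##n##_()); \
  } while (0)

static inline uint64_t rdtsc(void)
{
  uint32_t lo, hi;
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return ((uint64_t) hi << 32) | lo;
}

int main(void)
{
  const int iters = 1000000;                 /* arbitrary iteration count */
  uint64_t start = rdtsc();
  for (int i = 0; i < iters; i++)
    STAP_MARK(bench);                        /* dormant: pointer never set */
  uint64_t end = rdtsc();
  printf("%.1f cycles per dormant marker\n",
         (double) (end - start) / iters);
  return 0;
}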


Since the slowdown due to a dormant marker is so small, we plan to measure a heavily instrumented kernel macroscopically. However, adding markers strategically into the kernel is challenging, if they are to represent plausible extra load.

Finally, let’s discuss the dispatch speed of an active marker. This is important because it relates inversely to the maximum number of probes that can trigger per unit time. The overhead for a reasonable frequency of probe hits should not overwhelm the system. For our static markers, the dispatch overhead consists of the indirect function call. On the test platform, this additional cost is just 50–60 cycles.

For kprobes, an active probe includes an elaborate process involving triggering a breakpoint fault (int 3 on x86), entering the fault handler, identifying which handler belongs to that particular breakpoint address, calling the handler, single-stepping the original instruction under the breakpoint, and probably some other steps we left out.

A realistic systemtap-based microbenchmark measured the time required for one round trip of the same functions used above: the marker-instrumented sys_getuid and uninstrumented sys_getgid. Each probe handler is identical, and increments a script-level global counter variable for each visit. The following matrix summarizes the typical number of nanoseconds per system call (lower is better) with the listed instrumentation active.

function   marker   kprobe   both   neither
getuid     820      2100     2250   620
getgid     -        2100     -      620

Note that the complete marker-based probes run in 200 ns, and kprobes-based probes run in 1480 ns. Some arithmetic lets us work backward, to estimate just the dispatching times and exclude the systemtap probes. The cost of the 50–60 cycles of function call dispatch for the markers (measured earlier) takes about 20 ns on the test host. That implies that the systemtap probe handler took about 180 ns. Since identical probe handlers were run for both kprobes and markers, we can subtract that, leaving 1300 ns as the kprobes dispatch overhead.

While the above analysis only pretends to be quantitative, it gives some evidence that markers have attractive performance: cheap to sit around dormant, and fast when activated.

3 Next steps

3.1 User-space probes

Systemtap still lacks support for probing user-space programs: we can go no higher than the system call interface. A kprobes extension is under development to allow the same sorts of breakpoints to be inserted into shared libraries and executables at runtime that it now manages in the kernel. When this part is finished and accepted, systemtap will exploit it shortly. Probes in user space would use a similar syntax to refer to sources or symbols as already available for kernel probe points.

Probing in user space may seem like a task for a different sort of tool, perhaps a plain debugger like gdb, or a fancier one like frysk⁴, or another supervisor process based on ptrace. However, we believe that the handler routine of even a user-space probe should run in kernel space, because:

1. The microsecond-level speed of a kprobes "round trip" is still an order of magnitude faster than the equivalent process state query / manipulation using the ptrace API.

⁴ http://sources.redhat.com/frysk


2. Some problems require correlation of activities in the kernel with those in user-space. Such correlations are naturally expressed by a single script that shares variables amongst kernel- and user-space probes.

Once we pass that hurdle, joint application of user-space kprobes and static probing markers will make it possible for user-space programs and libraries to contain probing markers too. This would let libraries or programs designate their own salient probe points, while enjoying a low dormant probe cost. Language interpreters like Perl and PHP can insert markers into their evaluation loops to mark events like script function entries/exits and garbage collection. Complex applications can instrument multithreading events like synchronization and lock contention.

3.2 Debugging aid

Systemtap is becoming stable enough that kernel developers should feel comfortable with using it as a first-ditch debugging aid. When you run into a problem where a little bit of tracing, profiling, or event counting might help, we are eager to help you write the necessary scripts.

3.3 Sysadmin aid

We would like to develop a suite of systemtap scripts that supplant tools like netstat, vmstat, and strace. For inspiration, it may be desirable to port the OpenSolaris DTraceToolkit⁵, which is a suite of dtrace scripts to provide an overview of the entire system's activity. Systemtap will make it possible to save and reuse compiled scripts, so that deployment and execution of such a suite could be easier and faster.

⁵ http://www.opensolaris.org/os/community/dtrace/dtracetoolkit/

3.4 Grand unified tracing

There are many linux kernel tracing projects around. Every few months, someone reinvents LTT and auditing. While the author does not understand all the reasons for which these tools tend not to be integrated into the mainstream kernel, perhaps one of them is performance.

To the extent that is true, we propose that these groups consider using a shared pool of static markers as the basic kernel-side instrumentation mechanism. If they prove to have as low dormant cost and as high active performance as initial experience suggests, perhaps this could motivate the various tracing efforts and kernel subsystem developers to finally join forces. Let's designate standard trace/probe points once and for all. Tracing backends can attach to these markers the same way systemtap would. There would be no need for them to maintain kernel patches any more. Let's think about it.

References

[1] Daniel Berranger. http://people.redhat.com/berrange/systemtap/bootprobe/, January 2006.

[2] Dave Jones. Why Userspace Sucks. In Proceedings of the 2006 Ottawa Linux Symposium, July 2006.

[3] Ananth N. Mavinakayanahalli et al. Probing the Guts of Kprobes. In Proceedings of the 2006 Ottawa Linux Symposium, July 2006.


[4] Vara Prasad et al. Dynamic Instrumentation of Production Systems. In Proceedings of the 2005 Ottawa Linux Symposium, volume 2, pages 49–64, July 2005.


# cat socket-trace.stp
probe kernel.function("*@net/socket.c") {
  printf ("%s -> %s\n", thread_indent(1), probefunc())
}
probe kernel.function("*@net/socket.c").return {
  printf ("%s <- %s\n", thread_indent(-1), probefunc())
}

# stap socket-trace.stp
0 hald(2632): -> sock_poll
28 hald(2632): <- sock_poll
[...]
0 ftp(7223): -> sys_socketcall
1159 ftp(7223): -> sys_socket
2173 ftp(7223): -> __sock_create
2286 ftp(7223): -> sock_alloc_inode
2737 ftp(7223): <- sock_alloc_inode
3349 ftp(7223): -> sock_alloc
3389 ftp(7223): <- sock_alloc
3417 ftp(7223): <- __sock_create
4117 ftp(7223): -> sock_create
4160 ftp(7223): <- sock_create
4301 ftp(7223): -> sock_map_fd
4644 ftp(7223): -> sock_map_file
4699 ftp(7223): <- sock_map_file
4715 ftp(7223): <- sock_map_fd
4732 ftp(7223): <- sys_socket
4775 ftp(7223): <- sys_socketcall
[...]

Figure 1: Tracing and timing functions in net/socket.c.


Perfmon2: a flexible performance monitoring interface for Linux

Stéphane Eranian
HP Labs

[email protected]

Abstract

Monitoring program execution is becoming more than ever key to achieving world-class performance. A generic, flexible, and yet powerful monitoring interface to access the performance counters of modern processors has been designed. This interface allows performance tools to collect simple counts or profiles on a per-kernel-thread or system-wide basis. It introduces several innovations such as customizable sampling buffer formats and time- or overflow-based multiplexing of event sets. The current implementation for the 2.6 kernel supports all the major processor architectures. Several open-source and commercial tools based on the interface are available. We are currently working on getting the interface accepted into the mainline kernel. This paper presents an overview of the interface.

1 Introduction

Performance monitoring is the action of collecting information about the execution of a program. The type of information collected depends on the level at which it is collected. We distinguish two levels:

• the program level: the program is instrumented by adding explicit calls to routines that collect certain metrics. Instrumentation can be inserted by the programmer or the compiler, e.g., the -pg option of GNU cc. Tools such as HP Caliper [5] or Intel PIN [17] can also instrument at runtime. With those tools, it is possible to collect, for instance, the number of times a function is called, the number of times a basic block is entered, a call graph, or a memory access trace.

• the hardware level: the program is not modified. The information is collected by the CPU hardware and stored in performance counters. They can be exploited by tools such as OProfile and VTUNE on Linux. The counters measure the micro-architectural behavior of the program, i.e., the number of elapsed cycles, how many data cache stalls, how many TLB misses.

When analyzing the performance of a program, a user must answer two simple questions: where is time spent and why is it spent there? Program-level monitoring can, in many situations and with some high overhead, answer the first, but the second question is best answered with hardware-level monitoring. For instance, gprof can tell you that a program spends 20% of its time in one function. The difficulty is to know why. Is this because the function is called a lot? Is this due to algorithmic problems? Is it because the processor stalls? If so, what is causing the stalls? As this simple example shows, the two levels of monitoring can be complementary.

Current CPU hardware trends are increasing the need for powerful hardware monitoring. New hardware features present the opportunity to gain considerable performance improvements through software changes. To benefit from a multi-threaded CPU, for instance, a program must become multi-threaded itself. To run well on a NUMA machine, a program must be aware of the topology of the machine to adjust memory allocations and thread affinity to minimize the number of remote memory accesses. On the Itanium [3] processor architecture, the quality of the code produced by compilers is a big factor in the overall performance of a program, i.e., the compiler must extract the parallelism of the program to take advantage of the hardware.

Hardware-based performance monitoring can help pinpoint problems in how software uses those new hardware features. An operating system scheduler can benefit from cache profiles to optimize the placement of threads and avoid cache thrashing in multi-threaded CPUs. Static compilers can use performance profiles to improve code quality, a technique called Profile-Guided Optimization (PGO). Dynamic compilers, in Managed Runtime Environments (MRE), can also apply the same technique. Profile-Guided Optimizations can also be applied directly to a binary by tools such as iSpike [11]. In virtualized environments, such as Xen [14], system managers can also use monitoring information to guide load balancing. Developers can also use this information to optimize the layout of data structures, improve data prefetching, and analyze code paths [13]. Performance profiles can also be used to drive future hardware requirements such as cache sizes, cache latencies, or bus bandwidth.

Hardware performance counters are logically implemented by the Performance Monitoring Unit (PMU) of the CPU. By nature, this is a fairly complex piece of hardware distributed all across the chip to collect information about key components such as the pipeline, the caches, and the CPU buses. The PMU is, by nature, very specific to each processor implementation, e.g., the Pentium M and Pentium 4 PMUs [9] do not have much in common. The Itanium processor architecture specifies the framework within which the PMU must be implemented, which helps develop portable software.

One of the difficulties in standardizing on a performance monitoring interface is to ensure that it supports all existing and future PMU models without preventing access to some of their model-specific features. Indeed, some models, such as the Itanium 2 PMU [8], go beyond just counting events: they can also capture branch traces, record where cache misses occur, or filter on opcodes.

In Linux and across all architectures, the wealth of information provided by the PMU is oftentimes under-exploited because of the lack of a flexible and standardized interface on which tools can be developed.

In this paper, we give an overview of perfmon2, an interface designed to solve this problem for all major architectures. We begin by reviewing what Linux offers today. Then, we describe the various key features of this new interface. We conclude with the current status and a short description of the existing tools.

2 Existing interfaces

The problem with performance monitoring in Linux is not the lack of an interface, but rather the multitude of interfaces. There are at least three interfaces:

• OProfile [16]: it is designed for DCPI-style [15] system-wide profiling. It is supported on all major architectures and is enabled by major Linux distributions. It can generate a flat profile and a call graph per program. It comes with its own tool set, such as opcontrol. Prospect [18] is another tool using this interface.

• perfctr [12]: it supports per-kernel-thread and system-wide monitoring for most major processor architectures, except for Itanium. It is distributed as a stand-alone kernel patch. The interface is mostly used by tools built on top of the PAPI [19] performance toolkit.

• VTUNE [10]: the Intel VTUNE performance analyzer comes with its own kernel interface, implemented by an open-source driver. The interface supports system-wide monitoring only and is very specific to the needs of the tool.

All these interfaces have been designed with a specific measurement or tool in mind. As such, their design is somewhat limited in scope, i.e., they typically do one thing very well. For instance, it is not possible to use OProfile to count the number of retired instructions in a thread. The perfctr interface is the closest match to what we would like to build, yet it has some shortcomings. It is very well designed and tuned for self-monitoring programs, but sampling support is limited, especially for non-self-monitoring configurations.

With the current situation, it is not necessarily easy for developers to figure out how to write or port their tools. There is a question of the functionality of each interface and then a question of distributions, i.e., which interface ships with which distribution. We believe this situation does not make it attractive for developers to build modern tools on Linux. In fact, Linux is lagging in this area compared to commercial operating systems.

3 Design choices

First of all, it is important to understand why a kernel interface is needed. A PMU is accessible through a set of registers. Typically those registers are only accessible, at least for writing, at the highest privilege level of execution (pl0 or ring0), which is where only the kernel executes. Furthermore, a PMU can trigger interrupts which need kernel support before they can be converted into a notification to a user-level application such as a signal, for instance. For those reasons, the kernel needs to provide an interface to access the PMU.

The goal of our work is to solve the hardware-based monitoring interface problem by designing a single, generic, and flexible interface that supports all major processor architectures. The new interface is built from scratch and introduces several innovations. At the same time, we recognize the value of certain features of the other interfaces and we try to integrate them wherever possible.

The interface is designed to be built into the kernel. This is key for developers, as it ensures that the interface will be available and supported in all distributions.

To the extent possible, the interface must allow existing monitoring tools to be ported without many difficulties. This is useful to ensure undisrupted availability of popular tools such as VTUNE or OProfile, for instance.

The interface is designed from the bottom up, first looking at what the various processors provide and building up an operating system interface to access the performance counters in a uniform fashion. Thus, the interface is not designed for a specific measurement or tool.

There is efficient support for per-thread monitoring, where performance information is collected on a kernel-thread basis; the PMU state is saved and restored on context switch. There is also support for system-wide monitoring, where all threads running on a CPU are monitored and the PMU state persists across context switches.

In either mode, it is possible to collect simple counts or profiles. Neither applications nor the Linux kernel need special compilation to enable monitoring. In per-thread mode, it is possible to monitor unmodified programs, including multi-threaded programs. A monitoring session can be dynamically attached to and detached from a running thread. Self-monitoring is supported for both counting and profiling.

The interface is available to regular users and not just system administrators. This is especially important for per-thread measurements. As a consequence, it is not possible to assume that tools are necessarily well-behaved, and the interface must prevent malicious usage.

The interface provides a uniform set of features across platforms to maximize code re-use in performance tools. Measurement limitations are mandated by the PMU hardware, not the software interface. For instance, if a PMU does not capture where cache misses occur, there is nothing the interface nor its implementation can do about it.

The interface must be extensible because we want to support a variety of tools on very different hardware platforms.

4 Core Interface

The interface leverages a common property of all PMU models: the hardware interface always consists of a set of configuration registers, which we call PMCs (Performance Monitor Configuration), and a set of data registers, which we call PMDs (Performance Monitor Data). Thus, the interface provides basic read/write access to the PMC/PMD registers.

Across all architectures, the interface exposes a uniform register-naming scheme using the PMC and PMD terminology inherited from the Itanium processor architecture. As such, applications actually operate on a logical PMU. The mapping from the logical to the actual PMU is described in Section 4.3.

The whole PMU machine state is represented by a software abstraction called a perfmon context. Each context is identified and manipulated using a file descriptor.

4.1 System calls

The interface is implemented with multiple system calls rather than a device driver. Per-thread monitoring requires that the PMU machine state be saved and restored on context switch. Access to such routines is usually prohibited for drivers. A system call provides more flexibility than ioctl for the number, type, and type checking of arguments. Furthermore, system calls reinforce our goal of having the interface be an integral part of the kernel, and not just an optional device driver.

The list of system calls is shown in Table 1. A context is created by the pfm_create_context call. There are two types of contexts: per-thread or system-wide. The type is determined when the context is created. The same set of functionalities is available to both types of context.


int pfm_create_context(pfarg_ctx_t *c, void *s, size_t s)
int pfm_write_pmcs(int f, pfarg_pmc_t *p, int c)
int pfm_write_pmds(int f, pfarg_pmd_t *p, int c)
int pfm_read_pmds(int f, pfarg_pmd_t *p, int c)
int pfm_load_context(int f, pfarg_load_t *l)
int pfm_start(int fd, pfarg_start_t *s)
int pfm_stop(int f)
int pfm_restart(int f)
int pfm_create_evtsets(int f, pfarg_setdesc_t *s, int c)
int pfm_getinfo_evtsets(int f, pfarg_setinfo_t *i, int c)
int pfm_delete_evtsets(int f, pfarg_setdesc_t *s, int c)
int pfm_unload_context(int f)

Table 1: perfmon2 system calls

Upon return from the call, the context is identified by a file descriptor which can then be used with the other system calls.

The write operations on the PMU registers are provided by the pfm_write_pmcs and pfm_write_pmds calls. It is possible to access more than one register per call by passing a variable-size array of structures. Each structure consists, at a minimum, of a register index and value, plus some additional flags and bitmasks.

An array of structures is a good compromise between having a call per register, i.e., one register per structure per call, and passing the entire PMU state each time, i.e., one large structure per call for all registers. The cost of a system call is amortized, if necessary, by the fact that multiple registers are accessed, yet flexibility is not affected because the size of the array is variable. Furthermore, the register structure definition is generic and is used across all architectures.
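As a hedged illustration of the array-based calls, the C fragment below programs two configuration registers with a single pfm_write_pmcs call on an already-created context identified by fd. The pfarg_pmc_t field names (reg_num, reg_value), the header location, and the event encodings are assumptions made for this sketch; the exact structure layout is defined by the perfmon2 headers.

#include <perfmon/perfmon.h>   /* assumed header providing pfarg_pmc_t and the call wrappers */
#include <stdio.h>
#include <string.h>

static void program_two_pmcs(int fd)
{
        pfarg_pmc_t pc[2];

        memset(pc, 0, sizeof(pc));
        pc[0].reg_num   = 0;        /* logical PMC0 */
        pc[0].reg_value = 0x1234;   /* event encoding, PMU-model specific */
        pc[1].reg_num   = 1;        /* logical PMC1 */
        pc[1].reg_value = 0x5678;   /* another event encoding */

        if (pfm_write_pmcs(fd, pc, 2))   /* one call programs both registers */
                perror("pfm_write_pmcs");
}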

The PMU can be entirely programmed before the context is attached to a thread or CPU. Tools can prepare a pool of contexts and later attach them on-the-fly to threads or CPUs.

To actually load the PMU state onto the hardware, the context must be bound to either a kernel thread or a CPU with the pfm_load_context call. Figure 1 shows the effect of the call when attaching to a thread of a dual-threaded process.

Figure 1: attaching to a thread

A context can only be bound to one thread or CPU at a time. It is not possible to bind more than one context to a thread or CPU. Per-thread monitoring and system-wide monitoring are currently mutually exclusive. By construction, multiple concurrent per-thread contexts can co-exist. Potential conflicts are detected when the context is attached and not when it is created.

An attached context persists across a call to exec. On fork or pthread_create, the context is not automatically cloned into the new thread because it does not always make sense to aggregate results or profiles from child processes or threads. Monitoring tools can leverage the 2.6 kernel ptrace interface to receive notifications on the clone system call and decide whether or not to monitor a new thread or process. Because context creation and attachment are two separate operations, it is possible to batch creations and simply attach and start on notification.

Once the context is attached, monitoring can be started and stopped using the pfm_start and pfm_stop calls. The values of the PMD registers can be extracted with the pfm_read_pmds call. A context can be detached with pfm_unload_context. Once detached, the context can later be re-attached to any thread or CPU if necessary.

A context is destroyed using a simple close call. The other system calls listed in Table 1 relate to sampling or event sets and are discussed in later sections.
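To make the life cycle concrete, here is a minimal self-monitoring counting sketch. It assumes that the system-call wrappers listed in Table 1 are available (for example through a helper library), that pfm_create_context returns the context file descriptor, and that pfarg_load_t carries the target thread identifier in a field named load_pid; these field names, the header path, and the event encoding are illustrative, not authoritative.

#include <perfmon/perfmon.h>   /* assumed perfmon2 user header */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        pfarg_ctx_t  ctx;
        pfarg_load_t load;
        pfarg_pmc_t  pc[1];
        pfarg_pmd_t  pd[1];
        int fd;

        memset(&ctx,  0, sizeof(ctx));
        memset(&load, 0, sizeof(load));
        memset(pc, 0, sizeof(pc));
        memset(pd, 0, sizeof(pd));

        fd = pfm_create_context(&ctx, NULL, 0);   /* per-thread context, no custom format */
        if (fd < 0) { perror("pfm_create_context"); return 1; }

        pc[0].reg_num = 0; pc[0].reg_value = 0;   /* event encoding omitted */
        pd[0].reg_num = 0; pd[0].reg_value = 0;   /* counter starts at zero */
        pfm_write_pmcs(fd, pc, 1);
        pfm_write_pmds(fd, pd, 1);

        load.load_pid = getpid();                 /* attach to ourselves (assumed field name) */
        pfm_load_context(fd, &load);

        pfm_start(fd, NULL);                      /* NULL start arguments assumed acceptable */
        /* ... run the code to be measured ... */
        pfm_stop(fd);

        pfm_read_pmds(fd, pd, 1);
        printf("count=%llu\n", (unsigned long long)pd[0].reg_value);

        close(fd);                                /* destroys the context */
        return 0;
}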

Many 64-bit processor architectures provide the ability to run with a narrower 32-bit instruction set. For instance, on Linux for x86_64, it is possible to run unmodified 32-bit i386 binaries. Even though the PMU is very implementation specific, it may be interesting to develop or port tools in 32-bit mode. To avoid data conversions in the kernel, the perfmon2 ABI is designed to be portable between 32-bit (ILP32) and 64-bit (LP64) modes. In other words, all the data structures shared with the kernel use fixed-size data types.

4.2 System-wide monitoring

Figure 2: monitoring two CPUs

A perfmon context can be bound to only one CPU at a time. The CPU on which the call to pfm_load_context is executed determines the monitored CPU. It is necessary to set the affinity of the calling thread to ensure that it runs on the CPU to monitor. The affinity can later be modified, but all operations requiring access to the actual PMU must be executed on the monitored CPU, otherwise they will fail. In this setup, coverage of a multi-processor system (SMP) requires that multiple contexts be created and bound to each CPU to monitor. Figure 2 shows a possible setup for a monitoring tool on a 2-way system. Multiple non-overlapping system-wide attached contexts can co-exist.
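For example, a tool monitoring CPU 2 might pin itself before attaching. The sketch below uses the standard sched_setaffinity call and assumes that a system-wide context simply binds to whichever CPU the caller is running on when pfm_load_context is invoked; the pfarg_load_t contents for this case are left zeroed, which is an assumption of the sketch.

#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <perfmon/perfmon.h>   /* assumed perfmon2 user header */

static int bind_ctx_to_cpu(int ctx_fd, int cpu)
{
        cpu_set_t mask;
        pfarg_load_t load;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        /* run the calling thread on the CPU to monitor */
        if (sched_setaffinity(0, sizeof(mask), &mask))
                return -1;

        memset(&load, 0, sizeof(load));
        /* the context now monitors the CPU we are running on */
        return pfm_load_context(ctx_fd, &load);
}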

The alternative design is to have the kernel propagate the PMU accesses to all CPUs of interest using Inter-Processor Interrupts (IPI). Such an approach makes sense if all CPUs are always monitored. This is the approach chosen by OProfile, for instance.

With the perfmon2 approach, it is possible to measure subsets of CPUs. This is very interesting for large NUMA-style or multi-core machines where all CPUs do not necessarily run the same workload. And even then, with a uniform workload, it is possible to divide the CPUs into groups and capture different events in each group, thereby overlapping distinct measurements in one run. Aggregation of results can be done by monitoring tools, if necessary.

It is relatively straightforward to construct a user-level helper library that simplifies monitoring multiple CPUs from a single thread of control. Internally, the library can pin threads on the CPUs of interest. Synchronization between threads can easily be achieved using a barrier built with POSIX-threads primitives. We have developed and released such a library as part of the libpfm [6] package.

Because PMU access requires the controlling thread to run on the monitored CPU, processor and memory affinity are inherently enforced, thereby minimizing overhead, which is important when sampling on NUMA machines. Furthermore, this design meshes well with certain PMU features such as the Precise Event-Based Sampling (PEBS) support of the Pentium 4 processor (see Section 5.4 for details).

4.3 Logical PMU

PMU register names and implementations are very diverse. On the Itanium processor architecture, they are implemented by actual PMC and PMD indirect registers. On the AMD Opteron [1] processors, they are called PERFSEL and PERFCTR registers but are actually implemented by MSR registers. A portable tool would have to know about those names, and the interface would have to change from one architecture to another to accommodate the names and types of the registers for the read and write operations. This would defeat our goal of having a uniform interface on all platforms.

To mask the diversity without compromising access to all PMU features, the interface exposes a logical PMU. This PMU is tailored to the underlying hardware PMU for properties such as the number of registers it implements. But it also guarantees the following properties across all architectures:

• the configuration registers are called PMC registers and are managed as 64-bit wide indirect registers

• the data registers are called PMD registers and are managed as 64-bit wide indirect registers

• counters are 64-bit wide unsigned integers

The mapping of PMC/PMD registers to actual PMU registers is defined by a PMU description table where each entry provides the default value, a bitmask of reserved fields, and the actual name of the register. The mapping is defined by the implementation and is accessible via a sysfs interface.

The routine to access the actual register is part of the architecture-specific portion of a perfmon2 implementation. For instance, on the Itanium 2 processor, the mapping is defined such that the index in the table corresponds to the index of the actual PMU register, e.g., logical PMD0 corresponds to actual PMD0. The read function consists of a single mov rXX=pmd[0] instruction. On the Pentium M processor, however, the mapping is defined as follows:

% cat /sys/kernel/perfmon/pmu_desc/mappings

PMC0:0x100000:0xffcfffff:PERFEVTSEL0

PMC1:0x100000:0xffcfffff:PERFEVTSEL1

PMD0:0x0:0xffffffffffffffff:PERFCTR0

PMD1:0x0:0xffffffffffffffff:PERFCTR1

When a tool writes to logical register PMC0, it actually writes to PERFEVTSEL0. That register is implemented by MSR 0x186. There is an architecture-specific section of the PMU description table that provides the mapping to the MSR. The read function consists of a single rdmsr instruction.

On the Itanium 2 processors, we use this mapping mechanism to export the code (IBR) and data (DBR) debug registers as PMC registers because they can be used to restrict monitoring to a specific range of code or data, respectively. There was no need to create an Itanium 2 processor-specific system call in the interface to support this useful feature.

To make applications more portable, counters are always exposed as 64-bit wide unsigned integers. This is particularly interesting when sampling; see Section 5 for more details. Usually, PMUs implement narrower counters, e.g., 47 bits on the Itanium 2 PMU, 40 bits on the AMD Opteron PMU. If necessary, each implementation must emulate 64-bit counters. This can be accomplished fairly easily by leveraging the counter overflow interrupt capability present on all modern PMUs. Emulation can be turned off by applications on a per-counter basis, if necessary.

Oftentimes, it is interesting to associate PMU-based information with non-PMU-based information such as an operating system resource or other hardware resource. For instance, one may want to include in a sample the time since monitoring was started, the number of active network connections, or the identification of the current process. The perfctr interface provides this kind of information, e.g., the virtual cycle counter, through a kernel data structure that is re-mapped to user level.

With perfmon2, it is possible to leverage the mapping table to define virtual PMD registers, i.e., registers that do not map to actual PMU or PMU-related registers. This mechanism provides a uniform and extensible naming and access interface for those resources. Access to new resources can be added without breaking the ABI. When a tool invokes pfm_read_pmds on a virtual PMD register, a read call-back function, provided by the PMU description table, is invoked and returns a 64-bit value for the resource.

4.4 PMU description module

Hardware and software release cycles do not always align. Although Linux kernel patches are produced daily on the kernel.org web site, most end-users run packaged distributions which have a very different development cycle. Thus, new hardware may become available before there is an actual Linux distribution ready. Similarly, processors may be revised and new steppings may fix bugs in the PMU. Although providing updates is fairly easy nowadays, end-users tend to be reluctant to patch and recompile their own kernels.

It is important to understand that monitoring tool developers are not necessarily kernel developers. As such, it is important to provide simple mechanisms whereby they can enable early access to new hardware, add virtual PMD registers, and run experiments without full kernel patching and recompiling.

There are no technical reasons for having the PMU description tables built into the kernel. With a minimal framework, they can just as well be implemented by kernel modules, where they become easier to maintain. The perfmon2 interface provides a framework where a PMU description module can be dynamically inserted into the kernel at runtime. Only one module can be inserted at a time. When new hardware becomes available, assuming there are no changes needed in the architecture-specific implementation, a new description module can be provided quickly. Similarly, it becomes easy to experiment with virtual PMD registers by modifying the description table and not the interface nor the core implementation.

5 Sampling Support

Statistical sampling, or profiling, is the act of recording information about the execution of a program at some interval. The interval is commonly expressed in units of time, e.g., every 20ms. This is called Time-Based Sampling (TBS). But the interval can also be expressed in terms of a number of occurrences of a PMU event, e.g., every 2000 L2 cache misses. This is called Event-Based Sampling (EBS). TBS can easily be emulated with EBS by using an event with a fixed correlation to time, e.g., the number of elapsed cycles. Such emulation typically provides a much finer granularity than the operating system timer, which is usually limited to a millisecond at best. The interval, regardless of its unit, does not have to be constant.


At the end of an interval, the information is stored into a sample, which may contain information as simple as where the thread was, i.e., the instruction pointer. It may also include values of some PMU registers or other hardware or software resources.

The quality of a profile depends mostly on the duration of the run and the number of samples collected. A good profile can provide a lot of useful information about the behavior of a program; in particular it can help identify bottlenecks. The difficulty is to manage the overhead involved with sampling. It is important to make sure that sampling does not perturb the execution of the monitored program to the point where it no longer exhibits its normal behavior. As the sampling interval decreases, overhead increases.

The perfmon2 interface has an extensive set of features to support sampling. It is possible to manage sampling completely at the user level, but there is also kernel-level support to minimize the overhead. The interface provides support for EBS.

5.1 Sampling periods

All modern PMUs implement a counter overflow interrupt mechanism where the processor generates an interrupt whenever a counter wraps around to zero. Using this mechanism and supposing a 64-bit wide counter, it is possible to implement EBS by expressing a sampling period p as 2^64 - p or, in two's complement arithmetic, as -p. After p occurrences, the counter overflows and an interrupt is generated, indicating that a sample must be recorded.

Because all counters are 64-bit unsigned integers, tools do not have to worry about the actual width of counters when setting the period. When 64-bit emulation is needed, the implementation maintains a 64-bit software value and loads only the low-order bits onto the actual register, as shown in Figure 3. An EBS overflow is declared only when the 64-bit software-maintained value overflows.

Figure 3: 64-bit counter emulation

The interface does not have the notion of a sampling period; all it knows about are PMD values. Thus a sampling period p is programmed into a PMD register by setting its value to -p. The number of sampling periods is only limited by the number of counters. Thus, it is possible to overlap sampling measurements and collect multiple profiles in one run.

For each counter, the interface provides three values which are used as follows:

• value: the value loaded into the PMD register when the context is attached. This is the initial value.

• long_reset: the value to reload into the PMD register after an overflow with a user-level notification.

• short_reset: the value to reload into the PMD register after an overflow with no user-level notification.

The three values can be used to try to mask some of the overhead involved with sampling. The initial period would typically be large because it is not always interesting to capture samples in initialization code. The long and short reset values can be used to mask the noise generated by the PMU interrupt handler. We explain how they are used in Section 5.3.
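As a sketch, a tool might set up a sampling period of 100,000 events as follows. The reg_long_reset and reg_short_reset field names mirror the long/short reset values described above but are assumptions about the pfarg_pmd_t layout, as is the header location.

#include <perfmon/perfmon.h>   /* assumed perfmon2 user header */
#include <stdint.h>
#include <string.h>

/* Illustrative only: program a sampling period p as the two's
 * complement value -p, with distinct long/short reset periods. */
static void set_sampling_period(pfarg_pmd_t *pd, int pmd, uint64_t p)
{
        memset(pd, 0, sizeof(*pd));
        pd->reg_num         = pmd;
        pd->reg_value       = (uint64_t)-p;        /* initial period */
        pd->reg_long_reset  = (uint64_t)-p;        /* after a user-level notification */
        pd->reg_short_reset = (uint64_t)-(p / 2);  /* when the kernel resets silently */
}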


5.2 Overflow notifications

To support sampling at the user level, it is necessary to inform the tool when a 64-bit overflow occurs. The notification can be requested per counter and is sent as a message. There is only one notification per interrupt, even when multiple counters overflow at the same time.

Each perfmon context has a fixed-depth message queue. The fixed-size message contains information about the overflow, such as which counter(s) overflowed, the instruction pointer, and the current CPU at the time of the overflow. Each new message is appended to the queue, which is managed as a FIFO.

Instead of re-inventing yet another notification mechanism, existing kernel interfaces are leveraged and messages are extracted using a simple read call on the file descriptor of the context. The benefit is that common interfaces such as select or poll can be used to wait on multiple contexts at the same time. Similarly, asynchronous notifications via SIGIO are also supported.

Regular file descriptor sharing semantics apply; thus it is possible to delegate notification processing to a specific thread or child process.

During a notification, monitoring is stopped. When monitoring another thread, it is possible to request that this thread be blocked while the notification is being processed. A tool may choose the block-on-notification option when the context is created. Depending on the type of sampling, it may instead be interesting to have the thread run, just to keep the caches and TLB warm, for instance.

Once a notification is processed, the pfm_restart function is invoked. It is used to reset the overflowed counters using their long reset value, to resume monitoring, and potentially to unblock the monitored thread.
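A notification loop might look like the following sketch. The message type, shown here as pfarg_msg_t, and the size-based read check are assumptions; the point is only that messages are consumed with read on the context file descriptor and that pfm_restart resumes monitoring.

#include <perfmon/perfmon.h>   /* assumed perfmon2 user header */
#include <unistd.h>

static void notification_loop(int fd)
{
        pfarg_msg_t msg;       /* assumed overflow/buffer-full message type */

        while (read(fd, &msg, sizeof(msg)) == sizeof(msg)) {
                /* ... examine the message and process the samples ... */
                if (pfm_restart(fd))   /* reset counters and resume monitoring */
                        break;
        }
}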

5.3 Kernel sampling buffer

It is quite expensive to send a notification to user level for each sample. This is particularly bad when monitoring another thread because there could be, at least, two context switches per overflow and a couple of system calls.

One way to minimize this cost is to amortize it over a large set of samples. The idea is to have the kernel directly record samples into a buffer. It is not possible to take page faults from the PMU interrupt handler, usually a high priority handler; as such, the memory would have to be locked, an operation that is typically restricted to privileged users. As indicated earlier, sampling must be available to regular users, thus the buffer is allocated by the kernel and marked as reserved to avoid being paged out.

When the buffer becomes full, the monitoring tool is notified. A similar approach is used by the OProfile and VTUNE interfaces. Several issues must be solved for the buffer to become usable:

• how to make the kernel buffer accessible to the user?

• how to reset the PMD values after an overflow when the monitoring tool is not involved?

• what format to use for the buffer?

The buffer can be made available via a read call. This is how OProfile and VTUNE work. Perfmon2 uses a different approach to try to minimize overhead. The buffer is re-mapped read-only into the user address space of the monitoring tool with a call to mmap, as shown in Figure 4. The content of the buffer is guaranteed consistent when a notification is received.

Figure 4: re-mapping the sampling buffer
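A sketch of the mapping step is shown below; it assumes that buf_size matches the buffer size requested when the context was created with a sampling format, and that MAP_SHARED is the appropriate mapping type for this read-only view.

#include <sys/mman.h>
#include <stdio.h>

static void *map_sampling_buffer(int fd, size_t buf_size)
{
        /* re-map the kernel sampling buffer read-only into our address space */
        void *buf = mmap(NULL, buf_size, PROT_READ, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return NULL;
        }
        return buf;
}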

On counter overflow, the kernel needs to know what value to reload into an overflowed PMD register. This information is passed, per register, during the pfm_write_pmds call. If the buffer does not become full, the kernel uses the short reset value to reload the counter.

When the buffer becomes full, monitoring is stopped and a notification is sent. Reset is deferred until the monitoring tool invokes pfm_restart, at which point the buffer is marked as empty, the overflowed counter is reset with the long reset value, and monitoring resumes.

Figure 5: short vs. long reset values.

The distinction between long and short reset values allows tools to specify a different, potentially larger, value for the first period after an overflow notification. It is very likely that the user-level notification and subsequent processing will modify the CPU state, e.g., caches and TLB, such that when monitoring resumes, the execution will enter a recovery phase where its behavior may be different from what it would have been without monitoring. Depending on the type of sampling, the long vs. short reset values can be leveraged to hide that recovery period. This is demonstrated in Figure 5, which shows where the long reset value is used after overflow processing is completed. Of course, the impact and duration of the recovery period are very specific to each workload and CPU.

It is possible to request, per counter, that both reset values be randomized. This is very useful to avoid biased samples for certain measurements. The pseudo-random number generator does not need to be very fancy; simple variations are good enough. The randomization is specified by a seed value and a bitmask to limit the range of variation. For instance, a mask of 0xff allows a variation in the interval [0-255] from the base value. The existing implementation uses the Carta [2] pseudo-random number generator because it is simple and very efficient.

A monitoring tool may want to record the values of certain PMD registers in each sample. Similarly, after each sample, a tool may want to reset certain PMD registers. This could be used to compute event deltas, for instance. Each PMD register has two bitmasks to convey this information to the kernel. Each bit in the bitmask represents a PMD register, e.g., bit 1 represents PMD1. Let us suppose that on overflow of PMD4, a tool needs to record PMD6 and PMD7 and then reset PMD7. In that case, the tool would initialize the sampling bitmask of PMD4 to 0xc0 and the reset bitmask to 0x80.
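Continuing the pfarg_pmd_t sketches above, the PMD4 example could be expressed as follows; the bitmask field names (reg_smpl_pmds, reg_reset_pmds) and their representation as arrays of 64-bit words are assumptions about the structure layout.

pfarg_pmd_t pd;                      /* describes the sampling counter PMD4 */
memset(&pd, 0, sizeof(pd));
pd.reg_num           = 4;
pd.reg_smpl_pmds[0]  = (1ULL << 6) | (1ULL << 7);   /* 0xc0: record PMD6 and PMD7 */
pd.reg_reset_pmds[0] = (1ULL << 7);                  /* 0x80: reset PMD7 after the sample */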

With a kernel-level sampling buffer, the format in which samples are stored and what gets recorded become somewhat fixed and more difficult to evolve. Monitoring tools can have very diverse needs. Some tools may want to store samples sequentially into the buffer, some may want to aggregate them immediately, others may want to record non-PMU-based information, e.g., the kernel call stack.


As indicated earlier, it is important to ensure that existing interfaces such as OProfile or VTUNE, both using their own buffer formats, can be ported without having to modify a lot of their code. Similarly, it is important to ensure the interface can take advantage of advanced PMU sampling support such as the PEBS feature of the Pentium 4 processor.

Preserving a high level of flexibility for the buffer while having it fully specified in the interface did not look very realistic. We realized that it would be very difficult to come up with a universal format that would satisfy all needs. Instead, the interface uses a radically different approach, which is described in the next section.

5.4 Custom Sampling Buffer Formats

Figure 6: Custom sampling format architecture

The interface introduces a new flexible mechanism called Custom Sampling Buffer Formats, or formats for short. The idea is to remove the buffer format from the interface and instead provide a framework for extending the interface via specific sampling formats implemented by kernel modules. The architecture is shown in Figure 6.

Each format is uniquely identified by a 128-bit Universal Unique IDentifier (UUID) which can be generated by commands such as uuidgen. In order to use a format, a tool must pass this UUID when the context is created. It is possible to pass arguments, such as the buffer size, to a format when a context is created.

When a format module is inserted into the kernel, it registers with the perfmon core via a dedicated interface. Multiple formats can be registered. The list of available formats is accessible via a sysfs interface. Formats can also be dynamically removed like any other kernel module.

Each format provides a set of call-back functions invoked by the perfmon core during certain operations. To make developing a format fairly easy, the perfmon core provides certain basic services such as memory allocation and the ability to re-map the buffer, if needed. Formats are not required to use those services. They may, instead, allocate their own buffer and expose it using a different interface, such as a driver interface.

At a minimum, a format must provide a call-back function invoked on 64-bit counter overflow, i.e., an interrupt handler. That handler does not bypass the core PMU interrupt handler, which controls 64-bit counter emulation, overflow detection, notification, and monitoring masking. This layering makes it very simple to write a handler. Each format controls:

• how samples are stored

• what gets recorded on overflow

• how the samples are exported to user-level

• when an overflow notification must be sent

• whether or not to reset counters after an overflow

• whether or not to mask monitoring after an overflow


The interface specifies a simple and relatively generic default sampling format that is built in on all architectures. It stores samples sequentially in the buffer. Each sample has a fixed-size header containing information such as the instruction pointer at the time of the overflow and the process identification. It is followed by a variable-size body containing 64-bit PMD values stored in increasing index order. Those PMD values correspond to the information provided in the sampling bitmask of the overflowed PMD register. Buffer space is managed such that there can never be a partial sample. If multiple counters overflow at the same time, multiple contiguous samples are written.

Using the flexibility of formats, it was fairly easy to port the OProfile kernel code over to perfmon2. A new format was created to connect the perfmon2 PMU and OProfile interrupt handlers. The user-level OProfile opcontrol tool was migrated over to use the perfmon2 interface to program the PMU. The resulting format is about 30 lines of C code. The OProfile buffer format and management kernel code were totally preserved.

Other formats have been developed since then. In particular, we have released a format that implements n-way buffering. In this format, the buffer space is split into equal-size regions. Samples are stored in one region; when it fills up, the tool is notified, but monitoring remains active and samples are stored in the next region. The idea is to limit the number of blind spots by never stopping monitoring on counter overflow.

The format mechanism proved particularly useful to implement support for the Pentium 4 processor Precise Event-Based Sampling (PEBS) feature, where the CPU directly writes samples to a designated region of memory. By having the CPU write the samples, the skew observed on the instruction pointer with typical interrupt-based sampling can be avoided, giving much improved precision of the samples. That skew comes from the fact that the PMU interrupt is not generated exactly on the instruction where the counter overflowed. The phenomenon is especially important on deeply-pipelined processor implementations, such as the Pentium 4 processor. With PEBS, there is a PMU interrupt when the memory region given to the CPU fills up.

The problem with PEBS is that the sample format is now fixed by the CPU and cannot be changed. Furthermore, the format is different between the 32-bit and 64-bit implementations of the CPU. By leveraging the format infrastructure, we created two new formats, one for 32-bit and one for 64-bit PEBS, with less than one hundred lines of C code each. Perfmon2 is the first to provide support for PEBS, and doing so required no changes to the interface.

6 Event sets and multiplexing

On many PMU models, the number of counters is fairly limited, yet certain measurements require lots of events. For instance, on the Itanium 2 processor, it takes about a dozen events to gather a cycle breakdown showing how each CPU cycle is spent, yet there are only 4 counters. Thus, it is necessary to run the workload under test multiple times. This is not always very convenient, as workloads sometimes cannot be stopped or take a long time to restart. Furthermore, this inevitably introduces fluctuations in the collected counts which may affect the accuracy of the results.

Even with a large number of counters, e.g., 18 for the Pentium 4 processor, there are still hardware constraints which make it difficult to collect some measurements in one run. For instance, it is fairly common to have constraints such as:


• event A and B cannot be measured together

• event A can only be measured on counter C.

Those constraints are unlikely to go away in the future because removing them could impact the performance of CPUs. An elegant solution to these problems is to introduce the notion of event sets, where each set encapsulates the full PMU machine state. Multiple sets can be defined, and they are multiplexed on the actual PMU hardware such that only one set is active at a time. At the end of the multiplexed run, the counts are scaled to compute an estimate of what they would have been, had they been collected for the entire duration of the measurement.
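One common way to perform that scaling, assuming the event rate was uniform while a set was inactive, is shown below; the actual scaling policy is left to the monitoring tool and is not part of the interface.

#include <stdint.h>

/* Estimate the count over the full run from the count observed
 * while the set was active (a hedged sketch, not part of perfmon2). */
static uint64_t scale_count(uint64_t raw, uint64_t active_ns, uint64_t total_ns)
{
        return (uint64_t)((double)raw * (double)total_ns / (double)active_ns);
}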

The accuracy of the scaled counts depends a lot on the switch frequency and the workload, the goal being to avoid blind spots where certain events are not visible because the set that measures them did not activate at the right time. The key point is to balance the need for a high switch frequency against the higher overhead it incurs.

Sets and multiplexing can be implemented totally at the user level, and this is done by the PAPI toolkit, for instance. However, it is critical to minimize the overhead, especially for non-self-monitoring measurements where it is extremely expensive to switch because it could incur, at least, two context switches and a bunch of system calls to save the current PMD values and reprogram the new PMC and PMD registers. During that window of time the monitored thread usually keeps on running, opening up a large blind spot.

The perfmon2 interface supports event sets and multiplexing at the kernel level. Switching overhead is significantly minimized, and blind spots are eliminated by the fact that switching systematically occurs in the context of the monitored thread.

Sets and multiplexing are supported for per-thread and system-wide monitoring and for both counting and sampling measurements.

6.1 Defining sets

Figure 7: creating sets.

Each context is created with a default event set, called set0. Sets can be dynamically created, modified, or deleted when the context is detached, using the pfm_create_evtsets and pfm_delete_evtsets calls. Information, such as the number of activations of a set, can be retrieved with the pfm_getinfo_evtsets call. All these functions take array arguments and can, therefore, manipulate multiple sets per call.

A set is identified by a 16-bit number. As such, there is a theoretical limit of 65k sets. Sets are managed through an ordered list based on their identification numbers. Figure 7 shows the effect of adding set5 and set3 to the list.

Tools can program registers in each set by passing the set identification with each element of the array passed to the read or write calls. In one pfm_write_pmcs call, it is possible to program registers for multiple sets.
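As a hedged sketch, creating a second set and programming logical PMC0 in both sets with one call could look like this; it assumes a detached context file descriptor fd, and the set_id and reg_set field names are assumptions about pfarg_setdesc_t and pfarg_pmc_t.

pfarg_setdesc_t setd;
pfarg_pmc_t     pc[2];

memset(&setd, 0, sizeof(setd));
setd.set_id = 1;                        /* set0 already exists by default */
pfm_create_evtsets(fd, &setd, 1);

memset(pc, 0, sizeof(pc));
pc[0].reg_num = 0; pc[0].reg_set = 0;   /* PMC0 in set0 */
pc[1].reg_num = 0; pc[1].reg_set = 1;   /* PMC0 in set1 */
/* event encodings omitted */
pfm_write_pmcs(fd, pc, 2);              /* one call, two sets */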


6.2 Set switching

Set switching can be triggered by two different events: a timeout or a counter overflow. This is another innovation of the perfmon2 interface, again giving tools maximum flexibility. The type of trigger is determined, per set, when it is created.

The timeout is specified in micro-seconds when the set is created. The granularity of the timeout depends on the granularity of the kernel's internal timer tick, usually 1ms or 10ms. If the granularity is 10ms, then it is not possible to switch more than 100 times per second, i.e., the timeout cannot be smaller than 10ms. Because the granularity can greatly affect the accuracy of a measurement, the actual timeout, rounded up to the closest multiple of the timer tick, is returned by the pfm_create_evtsets call.

It is also possible to trigger a switch on counter overflow. To avoid dedicating a counter as a trigger, there is a trigger threshold value associated with each counter. At each overflow, the threshold value is decremented; when it reaches zero, switching occurs. It is possible to have multiple trigger counters per set, i.e., to switch on multiple conditions.

The next set is determined by the position in the ordered list of sets. Switching is managed in a round-robin fashion. In the example from Figure 7, this means that the set following set5 is set0.

Using overflow switching, it is possible to implement counter cascading, where a counter starts counting only when a certain number of occurrences, n, of an event E is reached. In a first set, a PMC register is programmed to measure event E, the corresponding PMD register is initialized to -n, and its switch trigger is set to 1. The next set is set up to count the event of interest, and it will be activated only when there is an overflow in the first set.

6.3 Sampling

Sets are fully integrated with sampling. Set information is propagated wherever necessary. The counter overflow notification carries the identification of the active set. The default sampling format fully supports sets. Samples from all sets are stored in the same buffer. The set active at the time of the overflow is identified in the header of each sample.

7 Security

The interface is designed to be built into the base kernel; as such, it must follow the same security guidelines.

It is not possible to assume that tools will always be well-behaved. Each implementation must check the arguments of calls. It must not be possible to use the interface for malicious attacks. A user cannot run a monitoring tool to extract information about a process or the system without proper permission.

All vector arguments have a maximum size to limit the amount of kernel memory necessary to perform the copy into kernel space. By nature, those calls are non-blocking and non-preemptible, ensuring that memory is eventually freed. The default limit is set to a page.

The sampling buffer size is also limited because it consumes kernel memory that cannot be paged out. There is a system-wide limit and a per-process limit. The latter uses the resource limit on locked memory (RLIMIT_MEMLOCK). The two-level protection is required to prevent users from launching lots of processes, each allocating a small buffer.

In per-thread mode, the user credentials are checked against the permissions of the thread to monitor when the context is attached. Typically, if a user cannot send a signal to the process, it is not possible to attach. By default, per-thread monitoring is available to all users, but a system administrator can restrict it to a user group. An identical, but separate, restriction is available for system-wide contexts.

On several architectures, such as Itanium, it is possible to read the PMD registers directly from user level, i.e., with a simple instruction. There is always a provision to turn this feature off. The interface enables this mode of access by default for all self-monitoring per-thread contexts. It is turned off by default for all other configurations, thereby preventing spy applications from peeking at values left in PMD registers by others.

All size and user group limitations can be configured by a system administrator via a simple sysfs interface.

As for sampling, we are planning on adding a PMU interrupt throttling mechanism to prevent Denial-of-Service (DoS) attacks when applications set very high sampling rates.

8 Fast user-level PMD read

Invoking a system call to read a PMD register can be quite expensive compared to the cost of the actual instruction. On an Itanium 2 1.5GHz processor, for instance, it costs about 36 cycles to read a PMD with a single instruction and about 750 cycles via pfm_read_pmds, which is not really optimized at this point. As a reference, the simplest system call, i.e., getpid, costs about 210 cycles.

On many PMU models, it is possible to directly read a PMD register from user level with a single instruction. This very lightweight mode of access is allowed by the interface for all self-monitoring threads. Yet, if the actual counter width is less than 64 bits, only the partial value is returned. The software-maintained value requires a kernel call.

To enable fast 64-bit PMD reads from user level, the interface supports re-mapping of the software-maintained PMD values to user level for self-monitoring threads. This mechanism was introduced by the perfctr interface. It enables fast access on architectures without hardware support for direct access. For the others, it enables a full 64-bit value to be reconstructed by merging the high-order bits from the re-mapped PMD with the low-order bits obtained from the hardware.

Re-mapping has to be requested when the context is created. For each event set, the PMD register values have to be explicitly re-mapped via a call to mmap on the file descriptor identifying the context. When a set is created, a special cookie value is passed back by pfm_create_evtsets. It is used as an offset for mmap and is required to identify the set to map. The mapping is limited to one page per set. For each set, the re-mapped region contains the 64-bit software value of each PMD register along with a status bit indicating whether the set is the active set or not. For non-active sets, the re-mapped value is the up-to-date full 64-bit value.

Given that the merge of the software and hardware values is not atomic, there can be a race condition if, for instance, the thread is preempted in the middle of building the 64-bit value. There is no way to avoid the race; instead, the interface provides an atomic sequence number for each set. The number is updated each time the state of the set is modified. The number must be read by user-level code before and after reading the re-mapped PMD value. If the number is the same before and after, the PMD value is current; otherwise the operation must be restarted. On the same Itanium 2 processor and without conflict, the cost is about 55 cycles to read the 64-bit value of a PMD register.
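The retry protocol can be sketched as follows; the layout of the re-mapped region (a sequence counter followed by the PMD values) is hypothetical and serves only to illustrate the read-check-retry loop described above.

#include <stdint.h>

struct pmd_view {                /* hypothetical layout of one re-mapped set */
        volatile uint64_t seq;   /* updated whenever the set state changes */
        volatile uint64_t pmd[64];
};

static uint64_t read_pmd64(const struct pmd_view *v, int i)
{
        uint64_t s1, s2, val;

        do {
                s1  = v->seq;    /* read the sequence number before ... */
                val = v->pmd[i];
                s2  = v->seq;    /* ... and after reading the PMD value  */
        } while (s1 != s2);      /* retry if the set changed underneath us */

        return val;
}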

9 Status

A first generation of this interface has been implemented for the 2.6 kernel series for the Itanium Processor Family (IPF). It uses a single multiplexing system call, perfmonctl, and is missing event sets, PMU description tables, and fast user-level PMD reads. It is currently shipping with all major Linux distributions for this architecture.

The second-generation interface, which we describe in this paper, currently exists as a kernel patch against the latest official 2.6 kernel from kernel.org. It supports the following processor architectures and/or models:

• all the Itanium processors

• the AMD Opteron processors in 64-bit mode

• the Intel Pentium M and P6 processors

• the Intel Pentium 4 and Xeon processors. That includes 32-bit and 64-bit (EM64T) processors. Hyper-Threading and PEBS are supported.

• the MIPS 5k and MIPS 20k processors

• preliminary support for the IBM Power5 processor

Certain ports were contributed by other companies or developers. As our user community grows, we expect other contributions to both kernel and user-level code. The kernel patch has been widely distributed and has generated a lot of discussions on various Linux mailing lists.

Our goal is to establish perfmon2 as the standard Linux interface for hardware-based performance monitoring. We are in the process of getting it reviewed by the community in preparation for a merge with the mainline kernel.

10 Existing tools

Several tools already exist for the interface. Most of them are only available for Itanium processors at this point, because an implementation has existed for several years.

The first open-source tool to use the interface is pfmon [6] from HP Labs. This is a command-line oriented tool initially built to test the interface. It can collect counts and profiles on a per-thread or system-wide basis. It supports the Itanium, AMD Opteron, and Intel Pentium M processors. It is built on top of a helper library, called libpfm, which handles all the event encodings and assignment logic.

HP Labs also developed q-tools [7], a replacement program for gprof. Q-tools uses the interface to collect a flat profile and a statistical call graph of all processes running in a system. Unlike gprof, there is no need to recompile applications or the kernel. The profile and call graph include both user- and kernel-level execution. The tool only works on Itanium 2 processors because it leverages certain PMU features, in particular the Branch Trace Buffer. This tool takes advantage of the interface by overlapping two sampling measurements to collect the flat profile and call graph in one run.

The HP Caliper [5] is an official HP product which is free for non-commercial use. This is a professional tool which works with all major Linux distributions for Itanium processors. It collects counts or profiles on a per-thread or per-CPU basis. It is very simple to use and comes with a large choice of preset metrics such as flat profile (fprof) and data cache misses (dcache_miss). It exploits all the advanced PMU features, such as the Branch Trace Buffer (BTB) and the Data Event Address Registers (D-EAR). The profiles are correlated to source and assembly code.

The PAPI toolkit has long been available on top of the perfmon2 interface for Itanium processors. We expect that PAPI will migrate over to perfmon2 on other architectures as well. This migration will likely simplify the code and allow better support for sampling and set multiplexing.

The BEA JRockit JVM on Linux/ia64, starting with version 1.4.2, also exploits the interface. The JIT compiler uses a dynamically collected, per-thread profile to improve code generation. This technique [4], called Dynamic Profile Guided Optimization (DPGO), takes advantage of the efficient per-thread sampling support of the interface and of the ability of the Itanium 2 PMU to sample branches and locations of cache misses (Data Event Address Registers). What is particularly interesting about this example is that it introduces a new usage model. Monitoring is used each time a program runs and not just during the development phase. Optimizations are applied in the end-user environment and for the real workload.

11 Conclusion

We have designed the most advanced performance monitoring interface for Linux. It provides a uniform set of functionalities across all architectures, making it easier to write portable performance tools. The feature set was carefully designed to allow efficient monitoring and a very high degree of flexibility to support a diversity of usage models and hardware architectures. The interface provides several key features such as custom sampling buffer formats, kernel support for event set multiplexing, and PMU description modules.

We have developed a multi-architecture implementation of this interface that supports all major processors. On the Intel Pentium 4 processor, this implementation is the first to offer support for PEBS.

We are in the process of getting it merged into the mainline kernel. Several open-source and commercial tools are available on Itanium 2 processors at this point and we expect that others will be released for the other architectures as well.

Hardware-based performance monitoring is the key tool to understand how applications and operating systems behave. The monitoring information is used to drive performance improvements in applications, operating system kernels, compilers, and hardware. As processor enhancements shift from pure clock speed to multi-core and multi-threaded designs, the need for powerful monitoring will increase significantly. The perfmon2 interface is well suited to address those needs.

References

[1] AMD. AMD64 Architecture Programmer's Manual: System Programming, 2005. http://www.amd.com/us-en/Processors/DevelopWithAMD.

[2] David F. Carta. Two fast implementations of the minimal standard random number generator. Communications of the ACM, 33(1):87–88, 1990. http://doi.acm.org/10.1145/76372.76379.

[3] Intel Corporation. The Itanium processor family architecture. http://developer.intel.com/design/itanium2/documentation.htm.

[4] Greg Eastman, Shirish Aundhe, Robert Knight, and Robert Kasten. Intel dynamic profile-guided optimization in the BEA JRockit JVM. In 3rd Workshop on Managed Runtime Environments, MRE'05, 2005. http://www.research.ibm.com/mre05/program.html.

[5] Hewlett-Packard Company. The Caliper performance analyzer. http://www.hp.com/go/caliper.

[6] Hewlett-Packard Laboratories. The pfmon tool and the libpfm library. http://perfmon2.sf.net/.

[7] Hewlett-Packard Laboratories. q-tools and q-prof tools. http://www.hpl.hp.com/research/linux.

[8] Intel. Intel Itanium 2 Processor Reference Manual for Software Development and Optimization, April 2003. http://www.intel.com/design/itanium/documentation.htm.

[9] Intel. IA-32 Intel Architecture Software Developers' Manual: System Programming Guide, 2004. http://developer.intel.com/design/pentium4/manuals/index_new.htm.

[10] Intel Corp. The VTune performance analyzer. http://www.intel.com/software/products/vtune/.

[11] Chi-Keung Luk, Robert Muth, et al. Ispike: A post-link optimizer for the Intel Itanium architecture. In Code Generation and Optimization Conference 2004 (CGO 2004), March 2004. http://www.cgo.org/cgo2004/papers/01_82_luk_ck.pdf.

[12] Mikael Pettersson. The Perfctr interface. http://user.it.uu.se/~mikpe/linux/perfctr/.

[13] Alex Shye et al. Analysis of path profiling information generated with performance monitoring hardware. In INTERACT HPCA'04 workshop, 2004. http://rogue.colorado.edu/draco/papers/interact05-pmu_pathprof.pdf.

[14] B. Dragovic et al. Xen and the art of virtualization. In Proceedings of the ACM Symposium on Operating Systems Principles, October 2003. http://www.cl.cam.ac.uk/Research/SRG/netos/xen/architecture.html.

[15] J. Anderson et al. Continuous profiling: Where have all the cycles gone?, 1997. http://citeseer.ist.psu.edu/article/anderson97continuous.html.

[16] John Levon et al. Oprofile. http://oprofile.sf.net/.

[17] Robert Cohn et al. The PIN tool. http://rogue.colorado.edu/Pin/.

[18] Alex Tsariounov. The Prospect monitoring tool. http://prospect.sf.net/.

[19] University of Tennessee, Knoxville. Performance Application Programming Interface (PAPI) project. http://icl.cs.utk.edu/papi.


OCFS2: The Oracle Clustered File System, Version 2

Mark Fasheh
Oracle

[email protected]

Abstract

This talk will review the various components of the OCFS2 stack, with a focus on the file system and its clustering aspects. OCFS2 extends many local file system features to the cluster, some of the more interesting of which are POSIX unlink semantics, data consistency, shared readable mmap, etc.

In order to support these features, OCFS2 logically separates cluster access into multiple layers. An overview of the low level DLM layer will be given. The higher level file system locking will be described in detail, including a walkthrough of inode locking and messaging for various operations.

Caching and consistency strategies will be discussed. Metadata journaling is done on a per node basis with JBD. Our reasoning behind that choice will be described.

OCFS2 provides robust and performant recovery on node death. We will walk through the typical recovery process including journal replay, recovery of orphaned inodes, and recovery of cached metadata allocations.

Allocation areas in OCFS2 are broken up into groups which are arranged in self-optimizing “chains.” The chain allocators allow OCFS2 to do fast searches for free space, and deallocation in a constant time algorithm. Detail on the layout and use of chain allocators will be given.

Disk space is broken up into clusters which can range in size from 4 kilobytes to 1 megabyte. File data is allocated in extents of clusters. This allows OCFS2 a large amount of flexibility in file allocation.

File metadata is allocated in blocks via a sub allocation mechanism. All block allocators in OCFS2 grow dynamically. Most notably, this allows OCFS2 to grow inode allocation on demand.

1 Design Principles

A small set of design principles has guided most of OCFS2 development. None of them are unique to OCFS2 development, and in fact, almost all are principles we learned from the Linux kernel community. They will, however, come up often in discussion of OCFS2 file system design, so it is worth covering them now.

1.1 Avoid Useless Abstraction Layers

Some file systems have implemented large abstraction layers, mostly to make themselves portable across kernels. The OCFS2 developers have held from the beginning that OCFS2 code would be Linux only. This has helped us in several ways. An obvious one is that it made the code much easier to read and navigate. Development has been faster because we can directly use the kernel features without worrying if another OS implements the same features, or worse, writing a generic version of them.

Unfortunately, this is all easier said than done. Clustering presents a problem set which most Linux file systems don't have to deal with. When an abstraction layer is required, three principles are adhered to:

• Mimic the kernel API.

• Keep the abstraction layer as thin as possible.

• If object life timing is required, try to use the VFS object life times.

1.2 Keep Operations Local

Bouncing file system data around a cluster can be very expensive. Changed metadata blocks, for example, must be synced out to disk before another node can read them. OCFS2 design attempts to break file system updates into node local operations as much as possible.

1.3 Copy Good Ideas

There is a wealth of open source file system implementations available today. Very often during OCFS2 development, the question “How do other file systems handle it?” comes up with respect to design problems. There is no reason to reinvent a feature if another piece of software already does it well. The OCFS2 developers thus far have had no problem getting inspiration from other Linux file systems.1 In some cases, whole sections of code have been lifted, with proper citation, from other open source projects!

1 Most notably Ext3.

2 Disk Layout

Near the top of the ocfs2_fs.h header, one will find this comment:

/*
 * An OCFS2 volume starts this way:
 *  Sector 0: Valid ocfs1_vol_disk_hdr that cleanly
 *            fails to mount OCFS.
 *  Sector 1: Valid ocfs1_vol_label that cleanly
 *            fails to mount OCFS.
 *  Block 2:  OCFS2 superblock.
 *
 * All other structures are found
 * from the superblock information.
 */

The OCFS disk headers are the only bit of backwards compatibility one will find within an OCFS2 volume. It is an otherwise brand new cluster file system. While the file system basics are complete, there are many features yet to be implemented. The goal of this paper, then, is to provide a good explanation of where things are in OCFS2 today.

2.1 Inode Allocation Structure

The OCFS2 file system has two main allocation units, blocks and clusters. Blocks can be anywhere from 512 bytes to 4 kilobytes, whereas clusters range from 4 kilobytes up to one megabyte. To make the file system mathematics work properly, cluster size is always greater than or equal to block size. At format time, the disk is divided into as many cluster-sized units as will fit. Data is always allocated in clusters, whereas metadata is allocated in blocks.

Inode data is represented in extents which are organized into a b-tree. In OCFS2, extents are represented by a triple called an extent record.

Extent records are stored in a large in-inode array which extends to the end of the inode block. When the extent array is full, the file system will allocate an extent block to hold the current array. The first extent record in the inode will be re-written to point to the newly allocated extent block. The e_clusters and e_cpos values will refer to the part of the tree underneath that extent. Bottom level extent blocks form a linked list so that queries across a range can be done efficiently.

Figure 1: An Inode B-tree

Record Field   Field Size   Description
e_cpos         32 bits      Offset into the file, in clusters
e_clusters     32 bits      Clusters in this extent
e_blkno        64 bits      Physical disk offset

Table 1: OCFS2 extent record
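
The triple in Table 1 maps naturally onto a small on-disk structure. The sketch below assembles the table's fields into a C declaration; the little-endian kernel types are assumptions about the on-disk convention, and the exact definition should be taken from ocfs2_fs.h rather than from this sketch.

/* On-disk extent record: maps a range of file clusters to disk (sketch). */
struct ocfs2_extent_rec {
        __le32 e_cpos;       /* offset into the file, in clusters   */
        __le32 e_clusters;   /* number of clusters in this extent   */
        __le64 e_blkno;      /* physical disk offset of the extent  */
};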

2.2 Directories

Directory layout in OCFS2 is very similar to Ext3, though unfortunately, htree has yet to be ported. The only difference in directory entry structure is that OCFS2 inode numbers are 64 bits wide. The rest of this section can be skipped by those already familiar with the dirent structure.

Directory inodes hold their data in the same manner as file inodes do. Directory data is arranged into an array of directory entries. Each directory entry holds a 64-bit inode pointer, a 16-bit record length, an 8-bit name length, an 8-bit file type enum (this allows us to avoid reading the inode block for type), and of course the set of characters which make up the file name.
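
For readers who prefer a declaration to prose, the fields just listed could be laid out as follows. The field names and the maximum name length are assumptions made for illustration, not a copy of the OCFS2 header.

/* Directory entry layout as described above (sketch). */
struct ocfs2_dir_entry {
        __le64 inode;       /* 64-bit inode pointer                     */
        __le16 rec_len;     /* record length                            */
        __u8   name_len;    /* length of the name that follows          */
        __u8   file_type;   /* file type enum, avoids reading the inode */
        char   name[255];   /* the file name characters (maximum assumed) */
};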

2.3 The Super Block

The OCFS2 super block information is contained within an inode block. It contains a standard set of super block information—block size, compat/incompat/ro features, root inode pointer, etc. There are four values which are somewhat unique to OCFS2.

• s_clustersize_bits – Cluster size for the file system.

• s_system_dir_blkno – Pointer to the system directory.

• s_max_slots – Maximum number of simultaneous mounts.

• s_first_cluster_group – Block offset of first cluster group descriptor.

s_clustersize_bits is self-explanatory. The reason for the other three fields will be explained in the next few sections.

2.4 The System Directory

In OCFS2, file system metadata is contained within a set of system files. There are two types of system files, global and node local. All system files are linked into the file system via the hidden system directory2 whose inode number is pointed to by the superblock. To find a system file, a node need only search the system directory for the name in question. The most common ones are read at mount time as a performance optimization. Linking to system files from the system directory allows system file locations to be completely dynamic. Adding new system files is as simple as linking them into the directory.

2 debugfs.ocfs2 can list the system dir with the ls // command.

Global system files are generally accessible by any cluster node at any time, given that it has taken the proper cluster-wide locks. The global_bitmap is one such system file. There are many others.

Node local system files are said to be owned by a mounted node which occupies a unique slot. The maximum number of slots in a file system is determined by the s_max_slots superblock field. The slot_map global system file contains a flat array of node numbers which details which mounted node occupies which set of node local system files.

Ownership of a slot may mean a different thing to each node local system file. For some, it means that access to the system file is exclusive—no other node can ever access it. For others it simply means that the owning node gets preferential access—for an allocator file, this might mean the owning node is the only one allowed to allocate, while every node may delete.

A node local system file has its slot number encoded in the file name. For example, the journal used by the node occupying the third file system slot (slot numbers start at zero) has the name journal:0002.

2.5 Chain Allocators

OCFS2 allocates free disk space via a special set of files called chain allocators. Remember that OCFS2 allocates in clusters and blocks, so the generic term allocation units will be used here to signify either. The space itself is broken up into allocation groups, each of which contains a fixed number of allocation units. These groups are then chained together into a set of singly linked lists, which start at the allocator inode.

Figure 2: Allocation Group

The first block of the first allocation unit within a group contains an ocfs2_group_descriptor. The descriptor contains a small set of fields followed by a bitmap which extends to the end of the block. Each bit in the bitmap corresponds to an allocation unit within the group. The most important descriptor fields follow; a structural sketch is given after the list.

• bg_free_bits_count – number of unallocated units in this group.

• bg_chain – describes which group chain this descriptor is a part of.

• bg_next_group – points to the next group descriptor in the chain.

• bg_parent_dinode – pointer to the disk inode of the allocator which owns this group.
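
The sketch below collects the descriptor fields just listed into a structure. The field widths and the zero-length bitmap member are assumptions for illustration; the real descriptor carries additional fields (signature, generation, and so on) that are omitted here.

/* Allocation group descriptor, reduced to the fields discussed (sketch). */
struct ocfs2_group_descriptor {
        __le16 bg_free_bits_count; /* unallocated units in this group          */
        __le16 bg_chain;           /* which chain this group belongs to        */
        __le64 bg_next_group;      /* next group descriptor in the chain       */
        __le64 bg_parent_dinode;   /* disk inode of the owning allocator       */
        __u8   bg_bitmap[0];       /* one bit per unit, extends to end of block */
};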

Embedded in the allocator inode is an ocfs2_chain_list structure. The chain list contains some fields followed by an array of ocfs2_chain_rec records. An ocfs2_chain_rec is a triple which describes a chain.

• c_blkno – First allocation group.

• c_total – Total allocation units.

• c_free – Free allocation units.

Figure 3: Chain Allocator

The two most interesting fields at the top of an ocfs2_chain_list are: cl_cpg, clusters per group; and cl_bpc, bits per cluster. The product of those two fields describes the total number of blocks occupied by each allocation group. As an example, the cluster allocator whose allocation units are clusters has a cl_bpc of 1 and cl_cpg is determined by mkfs.ocfs2 (usually it just picks the largest value which will fit within a descriptor bitmap).
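
A minimal sketch of the chain allocator metadata follows, assuming kernel-style little-endian types; the field widths and any members beyond those named in the text are assumptions, not copies of the OCFS2 headers. The helper simply expresses the cl_cpg × cl_bpc product described above.

/* Chain record and chain list header, reduced to the fields above (sketch). */
struct ocfs2_chain_rec {
        __le32 c_free;    /* free allocation units in this chain   */
        __le32 c_total;   /* total allocation units in this chain  */
        __le64 c_blkno;   /* block of the first allocation group   */
};

struct ocfs2_chain_list {
        __le16 cl_cpg;    /* clusters per group                    */
        __le16 cl_bpc;    /* bits (allocation units) per cluster   */
        struct ocfs2_chain_rec cl_recs[0];  /* one record per chain */
};

/* Size of each allocation group, per the text: cl_cpg * cl_bpc. */
static inline unsigned int units_per_group(const struct ocfs2_chain_list *cl)
{
        return le16_to_cpu(cl->cl_cpg) * le16_to_cpu(cl->cl_bpc);
}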

Chain searches are said to be self-optimizing. That is, while traversing a chain, the file system will re-link the group with the largest number of free bits to the top of the list. This way, full groups can be pushed toward the end of the list and subsequent searches will require fewer disk reads.

2.6 Sub Allocators

The total number of file system clusters is largely static as determined by mkfs.ocfs2 or optionally grown via tunefs.ocfs2. File system blocks, however, are dynamic. For example, an inode block allocator file can be grown as more files are created.

To grow a block allocator, cl_bpc clusters are allocated from the cluster allocator. The new ocfs2_group_descriptor record is populated and that block group is linked to the top of the smallest chain (wrapping back to the first chain if all are equally full). Other than the descriptor block, zeroing of the remaining blocks is skipped—when allocated, all file system blocks will be zeroed and written with a file system generation value. This allows fsck.ocfs2 to determine which blocks in a group are valid metadata.

2.7 Local Alloc

Very early in the design of OCFS2 it was determined that a large amount of performance would be gained by reducing contention on the cluster allocator. Essentially the local alloc is a node local system file with an in-inode bitmap which caches clusters from the global cluster allocator. The local alloc file is never locked within the cluster—access to it is exclusive to a mounted node. This allows the block to remain valid in memory for the entire lifetime of a mount.

As the local alloc bitmap is exhausted of free space, an operation called a window slide is done. First, any unallocated bits left in the local alloc are freed back to the cluster allocator. Next, a large enough area of contiguous space is found with which to re-fill the local alloc. The cluster allocator bits are set, the local alloc bitmap is cleared, and the size and offset of the new window are recorded. If no suitable free space is found during the second step of a window slide, the local alloc is disabled for the remainder of that mount.

The size of the local alloc bitmap is tuned at mkfs time to be large enough so that most block group allocations will fit, but the total size would not be so large as to keep an inordinate amount of data unallocatable by other nodes.

2.8 Truncate Log

The truncate log is a node local system file with nearly the same properties as the local alloc file. The major difference is that the truncate log is involved in de-allocation of clusters. This in turn dictates a difference in disk structure.

Instead of a small bitmap covering a section of the cluster allocator, the truncate log contains an in-inode array of ocfs2_truncate_rec structures. Each ocfs2_truncate_rec is an extent, with a start (t_start) cluster and a length (t_clusters). This structure allows the truncate log to cover large parts of the cluster allocator.

All cluster de-allocation goes through the truncate log. It is flushed when full, two seconds after the most recent de-allocation, or on demand by a sync(2) call.
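
For reference, the truncate log entry described above amounts to a very small record; the declaration below is a sketch built from the two named fields, with the widths assumed rather than taken from the header.

/* One truncate log entry: an extent of clusters pending de-allocation (sketch). */
struct ocfs2_truncate_rec {
        __le32 t_start;      /* first cluster in the range  */
        __le32 t_clusters;   /* number of clusters to free  */
};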

3 Metadata Consistency

A large amount of time is spent inside a cluster file system keeping metadata blocks consistent. A cluster file system not only has to track and journal dirty blocks, but it must understand which clean blocks in memory are still valid with respect to any disk changes which other nodes might initiate.

3.1 Journaling

Journal files in OCFS2 are stored as node local system files. Each node has exclusive access to its journal, and retains a cluster lock on it for the duration of its mount.

OCFS2 does block based journaling via the JBD subsystem which has been present in the Linux kernel for several years now. This is the same journaling system in use by the Ext3 file system. Documentation on the JBD disk format can be found online, and is beyond the scope of this document.

Though the OCFS2 team could have invented their own journaling subsystem (which could have included some extra cluster optimizations), JBD was chosen for one main reason—stability. JBD has been very well tested as a result of being in use in Ext3. For any journaled file system, stability in its journaling layer is critical. To have done our own journaling layer at the time, no matter how good, would have inevitably introduced a much larger time period of unforeseen stability and corruption issues which the OCFS2 team wished to avoid.

3.2 Clustered Uptodate

The small amount of code (less than 550 lines, including a large amount of comments) in fs/ocfs2/uptodate.c attempts to mimic the buffer_head caching API while maintaining those properties across the cluster.

The Clustered Uptodate code maintains a small set of metadata caching information on every OCFS2 memory inode structure (struct ocfs2_inode_info). The caching information consists of a single sector_t per block. These are stored in a 2 item array unioned with a red-black tree root item (struct rb_root). If the number of buffers that require tracking grows larger than the array, then the red-black tree is used.

A few rules were taken into account before designing the Clustered Uptodate code:

1. All metadata changes are done under cluster lock.

2. All metadata changes are journaled.

3. All metadata reads are done under a read-only cluster lock.

4. Pinning buffer_head structures is not necessary to track their validity.

5. The act of acquiring a new cluster lock can flush metadata on other nodes and invalidate the inode caching items.

There are actually a very small number of exceptions to rule 2, but none of them require the Clustered Uptodate code and can be ignored for the sake of this discussion.

Rules 1 and 2 have the effect that the return code of buffer_jbd() can be relied upon to tell us that a buffer_head can be trusted. If it is in the journal, then we must have a cluster lock on it, and therefore, its contents are trustable.

Rule 4 follows from the logic that a newly allocated buffer head will not have its BH_Uptodate flag set. Thus one does not need to pin them for tracking purposes—a block number is sufficient.

Rule 5 instructs the Clustered Uptodate code to ignore BH_Uptodate buffers for which we do not have a tracking item—the kernel may think they're up to date with respect to disk, but the file system knows better.

From these rules, a very simple algorithm is implemented within ocfs2_buffer_uptodate().

1. If buffer_uptodate() returns false, return false.

2. If buffer_jbd() returns true, return true.

3. If there is a tracking item for this block, return true.

4. Return false.

For existing blocks, tracking items are inserted after they are successfully read from disk. Newly allocated blocks have an item inserted after they have been populated.
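
The four steps translate almost directly into code. The sketch below shows only the decision logic; ocfs2_find_tracking_item() stands in for the array / red-black tree lookup and is an assumed helper, not the real function name.

/* Decision logic of ocfs2_buffer_uptodate(), following the four rules above. */
static int ocfs2_buffer_uptodate_sketch(struct ocfs2_inode_info *oi,
                                        struct buffer_head *bh)
{
        /* 1. The kernel itself does not consider the buffer up to date. */
        if (!buffer_uptodate(bh))
                return 0;

        /* 2. In the journal: changes are made under cluster lock, so trust it. */
        if (buffer_jbd(bh))
                return 1;

        /* 3. A tracking item means the block was read or populated under a
         *    cluster lock and is still valid. */
        if (ocfs2_find_tracking_item(oi, bh->b_blocknr))  /* assumed helper */
                return 1;

        /* 4. The kernel may think it is up to date, but the cluster file
         *    system knows better. */
        return 0;
}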

4 Cluster Locking

4.1 A Short DLM Tutorial

OCFS2 includes a DLM which exports a pared-down VMS style API. A full description of the DLM internals would require another paper the size of this one. This subsection will concentrate on a description of the important parts of the API.

A lockable object in the OCFS2 DLM is referred to as a lock resource. The DLM has no idea what is represented by that resource, nor does it care. It only requires a unique name by which to reference a given resource. In order to gain access to a resource, a process3 acquires locks on it. There can be several locks on a resource at any given time. Each lock has a lock level which must be compatible with the levels of all other locks on the resource. All lock resources and locks are contained within a DLM domain.

3 When we say process here, we mean a process which could reside on any node in the cluster.


Name      Access Type   Compatible Modes
EXMODE    Exclusive     NLMODE
PRMODE    Read Only     PRMODE, NLMODE
NLMODE    No Lock       EXMODE, PRMODE, NLMODE

Table 2: OCFS2 DLM lock modes

In OCFS2, locks can have one of three levels, also known as lock modes. Table 2 describes each mode and its compatibility.

Most of the time, OCFS2 calls a single DLM function, dlmlock(). Via dlmlock() one can acquire a new lock, or upconvert and downconvert existing locks.

typedef void (dlm_astlockfunc_t)(void *);
typedef void (dlm_bastlockfunc_t)(void *, int);

enum dlm_status dlmlock(struct dlm_ctxt *dlm,
                        int mode,
                        struct dlm_lockstatus *lksb,
                        int flags,
                        const char *name,
                        dlm_astlockfunc_t *ast,
                        void *data,
                        dlm_bastlockfunc_t *bast);

Upconverting a lock asks the DLM to change its mode to a level greater than the currently granted one. For example, to make changes to an inode it was previously reading, the file system would want to upconvert its PRMODE lock to EXMODE. The currently granted level stays valid during an upconvert.

Downconverting a lock is the opposite of an upconvert—the caller wishes to switch to a mode that is more compatible with other modes. Often, this is done when the currently granted mode on a lock is incompatible with the mode another process wishes to acquire on its lock.

All locking operations in the OCFS2 DLM are asynchronous. Status notification is done via a set of callback functions provided in the arguments of a dlmlock() call. The two most important are the AST and BAST calls.

The DLM will call an AST function after a dlmlock() request has completed. If the status value on the dlm_lockstatus structure is DLM_NORMAL then the call has succeeded. Otherwise there was an error and it is up to the caller to decide what to do next.

The term BAST stands for Blocking AST. The BAST is the DLM's method of notifying the caller that a lock it is currently holding is blocking the request of another process.

As an example, if process A currently holds an EXMODE lock on resource foo and process B requests a PRMODE lock, process A will be sent a BAST call. Typically this will prompt process A to downconvert its lock held on foo to a compatible level (in this case, PRMODE or NLMODE), upon which an AST callback is triggered for both process A (to signify completion of the downconvert) and process B (to signify that its lock has been acquired).
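
A short usage sketch of the prototype shown earlier may help; the mode and flag names (LKM_PRMODE, LKM_EXMODE, LKM_CONVERT) are assumptions about the API's spelling, and error handling is omitted.

/* Illustrative callbacks; the void * argument is the caller's private data. */
static void my_ast(void *data)
{
        /* dlmlock() request completed; the lksb status reports DLM_NORMAL
         * on success. */
}

static void my_bast(void *data, int blocked_mode)
{
        /* Another process wants a lock incompatible with ours at
         * blocked_mode; queue a downconvert to a compatible level. */
}

static void lock_example(struct dlm_ctxt *dlm, struct dlm_lockstatus *lksb,
                         void *priv)
{
        /* Acquire a shared lock on resource "foo" (mode name assumed). */
        dlmlock(dlm, LKM_PRMODE, lksb, 0, "foo", my_ast, priv, my_bast);

        /* ... later, upconvert the same lock to exclusive mode before
         * modifying the object it protects (flag name assumed). */
        dlmlock(dlm, LKM_EXMODE, lksb, LKM_CONVERT, "foo", my_ast, priv, my_bast);
}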

The OCFS2 DLM supports a feature called Lock Value Blocks, or LVBs for short. An LVB is a fixed length byte array associated with a lock resource. The contents of the LVB are entirely up to the caller. There are strict rules to LVB access. Processes holding PRMODE and EXMODE locks are allowed to read the LVB value. Only processes holding EXMODE locks are allowed to write a new value to the LVB. Typically a read is done when acquiring or upconverting to a new PRMODE or EXMODE lock, while writes to the LVB are usually done when downconverting from an EXMODE lock.


4.2 DLM Glue

DLM glue (for lack of a better name) is a performance-critical section of code whose job it is to manage the relationship between the file system and the OCFS2 DLM. As such, DLM glue is the only part of the stack which knows about the internals of the DLM—regular file system code never calls the DLM API directly.

DLM glue defines several cluster lock types with different behaviors via a set of function pointers, much like the various VFS ops structures. Most lock types use the generic functions. The OCFS2 metadata lock defines most of its own operations for complexity reasons.

The most interesting callback that DLM glue requires is the unblock operation, which has the following definition:

int (*unblock)(struct ocfs2_lock_res *, int *);

When a blocking AST is received for an OCFS2 cluster lock, it is queued for processing on a per-mount worker thread called the vote thread. For each queued OCFS2 lock, the vote thread will call its unblock() function. If possible, the unblock() function is to downconvert the lock to a compatible level. If a downconvert is impossible (for instance, the lock may be in use), the function will return a non-zero value indicating the operation should be retried.
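
The generic flavor of such a handler might look like the sketch below. Both helpers and the meaning of the second argument are assumptions (the text does not describe it), so this is only an illustration of the retry contract, not the real DLM glue code.

/* Downconvert if possible; a non-zero return asks the vote thread to retry. */
static int my_unblock(struct ocfs2_lock_res *lockres, int *ctl)
{
        if (lockres_in_use(lockres))   /* assumed helper: lock still busy */
                return 1;              /* cannot downconvert yet, retry later */

        downconvert_lock(lockres);     /* assumed helper: drop to a level
                                        * compatible with the blocked request */
        return 0;
}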

By design, the DLM glue layer never determines lifetiming of locks. That is dictated by the container object—in OCFS2, this is predominantly the struct inode, which already has a set of lifetime rules to be obeyed.

Similarly, DLM glue is only concerned with multi-node locking. It is up to the callers to serialize themselves locally. Typically this is done via well-defined methods such as holding inode->i_mutex.

The most important feature of DLM glue is that it implements a technique known as lock caching. Lock caching allows the file system to skip costly DLM communication for very large numbers of operations. When a DLM lock is created in OCFS2 it is never destroyed until the container object's lifetime makes it useless to keep around. Instead, DLM glue maintains its current mode and, instead of creating new locks, calling processes only take references on a single cached lock. This means that, aside from the initial acquisition of a lock and barring any BAST calls from another node, DLM glue can keep most lock / unlock operations down to a single integer increment.

DLM glue will not block locking processes in the case of an upconvert—say a PRMODE lock is already held, but a process wants exclusive access in the cluster. DLM glue will continue to allow processes to acquire PRMODE level references while upconverting to EXMODE. Similarly, in the case of a downconvert, processes requesting access at the target mode will not be blocked.

4.3 Inode Locks

A very good example of cluster locking in OCFS2 is the inode cluster locks. Each OCFS2 inode has three locks. They are described in locking order, outermost first.

1. ip_rw_lockres, which serializes file read and write operations.

2. ip_meta_lockres, which protects inode metadata.

3. ip_data_lockres, which protects inode data.

The inode metadata locking code is responsible for keeping inode metadata consistent across the cluster. When a new lock is acquired at PRMODE or EXMODE, it is responsible for refreshing the struct inode contents. To do this, it stuffs the most common inode fields inside the lock LVB. This allows us to avoid a read from disk in some cases. The metadata unblock() method is responsible for waking up a checkpointing thread which forces journaled data to disk. OCFS2 keeps transaction sequence numbers on the inode to avoid checkpointing when unnecessary. Once the checkpoint is complete, the lock can be downconverted.

The inode data lock has a similar responsibility for data pages. Complexity is much lower, however. No extra work is done on acquisition of a new lock. It is only at downconvert that work is done. For a downconvert from EXMODE to PRMODE, the data pages are flushed to disk. Any downconvert to NLMODE truncates the pages and destroys their mapping.

OCFS2 has a cluster wide rename lock, for the same reason that the VFS has s_vfs_rename_mutex—certain combinations of rename(2) can cause deadlocks, even between multiple nodes. A comment in ocfs2_rename() is instructive:

/* Assume a directory hierarchy thusly:
 * a/b/c
 * a/d
 * a,b,c, and d are all directories.
 *
 * from cwd of 'a' on both nodes:
 * node1: mv b/c d
 * node2: mv d b/c
 *
 * And that's why, just like the VFS, we need a
 * file system rename lock. */

Serializing operations such as mount(2) and umount(2) is the super block lock. File system membership changes occur only under an EXMODE lock on the super block. This is used to allow the mounting node to choose an appropriate slot in a race-free manner. The super block lock is also used during node messaging, as described in the next subsection.

4.4 Messaging

OCFS2 has a network vote mechanism which covers a small number of operations. The vote system stems from an older DLM design and is scheduled for final removal in the next major version of OCFS2. In the meantime it is worth reviewing.

Vote Type               Operation
OCFS2_VOTE_REQ_MOUNT    Mount notification
OCFS2_VOTE_REQ_UMOUNT   Unmount notification
OCFS2_VOTE_REQ_UNLINK   Remove a name
OCFS2_VOTE_REQ_RENAME   Remove a name
OCFS2_VOTE_REQ_DELETE   Query an inode wipe

Table 3: OCFS2 vote types

Each vote is broadcast to all mounted nodes (except the sending node) where they are processed. Typically vote messages about a given object are serialized by holding an EXMODE cluster lock on that object. That way the sending node knows it is the only one sending that exact vote. Other than errors, all votes except one return true. Membership is kept static during a vote by holding the super block lock. For mount/unmount that lock is held at EXMODE. All other votes keep a PRMODE lock. This way most votes can happen in parallel with respect to each other.

The mount/unmount votes inform the other mounted OCFS2 nodes of the mount status of the sending node. This allows them in turn to track whom to send their own votes to.

The rename and unlink votes instruct receiving nodes to look up the dentry for the name being removed, and call the d_delete() function against it. This has the effect of removing the name from the system. If the vote is an unlink vote, the additional step of marking the inode as possibly orphaned is taken. The flag OCFS2_INODE_MAYBE_ORPHANED will trigger additional processing in ocfs2_drop_inode(). This vote type is sent after all directory and inode locks for the operation have been acquired.

The delete vote is crucial to OCFS2 being able to support POSIX style unlink-while-open across the cluster. Delete votes are sent from ocfs2_delete_inode(), which is called on the last iput() of an orphaned inode. Receiving nodes simply check an open count on their inode. If the count is anything other than zero, they return a busy status. This way the sending node can determine whether an inode is ready to be truncated and deleted from disk.

5 Recovery

5.1 Heartbeat

The OCFS2 cluster stack heartbeats on disk and via its network connection to other nodes. This allows the cluster to maintain an idea of which nodes are alive at any given point in time. It is important to note that though they work closely together, the cluster stack is a separate entity from the OCFS2 file system.

Typically, OCFS2 disk heartbeat is done on every mounted volume in a contiguous set of sectors allocated to the heartbeat system file at file system create time. OCFS2 heartbeat actually knows nothing about the file system, and is only given a range of disk blocks to read and write. The system file is only used as a convenient method of reserving the space on a volume. Disk heartbeat is also never initiated by the file system, and always started by the mount.ocfs2 program. Manual control of OCFS2 heartbeat is available via the ocfs2_hb_ctl program.

Each node in OCFS2 has a unique node number, which dictates which heartbeat sector it will periodically write a timestamp to. Optimizations are done so that the heartbeat thread only reads those sectors which belong to nodes which are defined in the cluster configuration. Heartbeat information from all disks is accumulated together to determine node liveness. A node need only write to one disk to be considered alive in the cluster.

Network heartbeat is done via a set of keep-alive messages that are sent to each node. In the event of a split brain scenario, where the network connection to a set of nodes is unexpectedly lost, a majority-based quorum algorithm is used. In the event of a 50/50 split, the group with the lowest node number is allowed to proceed.

In the OCFS2 cluster stack, disk heartbeat is considered the final arbiter of node liveness. Network connections are built up when a node begins writing to its heartbeat sector. Likewise, network connections will be torn down when a node stops heartbeating to all disks.

At startup time, interested subsystems register with the heartbeat layer for node up and node down events. Priority can be assigned to callbacks and the file system always gets node death notification before the DLM. This is to ensure that the file system has the ability to mark itself needing recovery before DLM recovery can proceed. Otherwise, a race exists where DLM recovery might complete before the file system notification takes place. This could lead to the file system gaining locks on resources which are in need of recovery—for instance, metadata whose changes are still in the dead node's journal.


5.2 File System Recovery

Upon notification of an unexpected node death, OCFS2 will mark a recovery bitmap. Any file system locks which cover recoverable resources have a check in their locking path for any set bits in the recovery bitmap. Those paths will then block until the bitmap is clear again. Right now the only path requiring this check is the metadata locking code—it must wait on journal replay to continue.

A recovery thread is then launched which takes an EXMODE lock on the super block. This ensures that only one node will attempt to recover the dead node. Additionally, no other nodes will be allowed to mount while the lock is held. Once the lock is obtained, each node will check the slot_map system file to determine which journal the dead node was using. If the node number is not found in the slot map, then that means recovery of the node was completed by another cluster node.

If the node is still in the slot map then journal replay is done via the proper JBD calls. Once the journal is replayed, it is marked clean and the node is taken out of the slot map.

At this point, the most critical parts of OCFS2 recovery are complete. Copies are made of the dead node's truncate log and local alloc files, and clean ones are stamped in their place. A worker thread is queued to reclaim the disk space represented in those files, the node is removed from the recovery bitmap, and the super block lock is dropped.

The last part of recovery—replay of the copied truncate log and local alloc files—is (appropriately) called recovery completion. It is allowed to take as long as necessary because locking operations are not blocked while it runs. Recovery completion is even allowed to block on recovery of other nodes which may die after its work is queued. These rules greatly simplify the code in that section.

One aspect of recovery completion which has not been covered yet is orphan recovery. The orphan recovery process must be run against the dead node's orphan directory, as well as the local orphan directory. The local orphan directory is recovered because the now dead node might have had open file descriptors against an inode which was locally orphaned—thus the delete_inode() code must be run again.

Orphan recovery is a fairly straightforward process which takes advantage of the existing inode life-timing code. The orphan directory in question is locked, and the recovery completion process calls iget() to obtain an inode reference on each orphan. As references are obtained, the orphans are arranged in a singly linked list. The orphan directory lock is dropped, and iput() is run against each orphan.

6 What’s Been Missed!

Lots, unfortunately. The DLM has mostly been glossed over. The rest of the OCFS2 cluster stack has hardly been mentioned. The OCFS2 tool chain has some unique properties which would make an interesting paper. Readers interested in more information on OCFS2 are urged to explore the web page and mailing lists found in the references section. OCFS2 development is done in the open and, when not busy, the OCFS2 developers love to answer questions about their project.

7 Acknowledgments

A huge thanks must go to all the authors of Ext3, from which we took much of our inspiration.


Also, without JBD, OCFS2 would not be what it is today, so our thanks go to those involved in its development.

Of course, we must thank the Linux kernel community for being so kind as to accept our humble file system into their kernel. In particular, our thanks go to Christoph Hellwig and Andrew Morton whose guidance was critical in getting our file system code up to kernel standards.

8 References

The OCFS2 home page can be found at http://oss.oracle.com/projects/ocfs2/.

From there one can find mailing lists, documentation, and the source code repository.


tgt: Framework for Storage Target Drivers

Tomonori FUJITA
NTT Cyber Solutions Laboratories

[email protected]

Mike Christie
Red Hat, Inc.

[email protected]

Abstract

In order to provide block I/O services, Linux users have had to modify kernel code by hand, use binary kernel modules, or purchase specialized hardware. With the mainline kernel now having SCSI Parallel Interface (SPI), Fibre Channel (FC), iSCSI, and SCSI RDMA (SRP) initiator support, the Linux target framework (tgt) aims to fill the gap in storage functionality by consolidating several target driver implementations and providing a SCSI protocol independent API that will simplify target driver creation and maintenance.

Tgt's key goal and its primary hurdle has been implementing a great portion of tgt in user space, while continuing to provide performance comparable to a target driver implemented entirely in the kernel. By pushing the SCSI state machine, I/O execution, and the management components of the framework outside of the kernel, it enjoys debugging, maintenance and mainline inclusion benefits. However, it has created new challenges. Both traditional kernel target implementations and tgt have had to transform Block Layer and SCSI Layer designs, which assume requests will be initiated from the top of the storage stack (the request queue's make_request_fn()), into an architecture that can efficiently handle asynchronous requests initiated by the end of the stack (the low level driver's interrupt handler), but tgt also must efficiently communicate and synchronize with the user-space daemon that implements the SCSI target state machine and performs I/O.

1 Introduction

The SCSI protocol was originally designed to use a parallel bus interface and used to be tied closely to it. With the increasing demands of storage capacity and accessibility, it became obvious that Direct Attached Storage (DAS), the classic storage architecture, in which a host and storage devices are directly connected by system buses and parallel cable, cannot meet today's industry scalability and manageability requirements. This led to the invention of Storage Area Network (SAN) technology, which enables hosts and storage devices to be connected via high-speed interconnection technologies such as Fibre Channel, Gigabit Ethernet, Infiniband, etc.

To enable SAN technology, the SCSI-3 architecture, as can be seen in Figure 1, brought an important change to the division of the standard into interface, protocol, device model, and command set. This allows device models and command sets to be used with various transports (physical interfaces), such as Fibre Channel, Ethernet, and Infiniband. The device type specific command set, the primary command set, and transport are independent of each other.

Figure 1: SCSI-3 architecture

1.1 What is a Target

SCSI uses a client-server model (Figure 2). Requests are initiated by a client, which in SCSI terminology is called an Initiator Device, and are processed by a server, which in SCSI terminology is known as a Target Device. Each target contains one or more logical units and provides services performed by device servers and task management functions performed by task managers. A logical unit is an object that implements one or more device functional models described in the SCSI command standards and processes commands (e.g., reading from or writing to the media) [5].

Currently, the Linux kernel has support for several types of initiators including ones that use FC, TCP/IP, RDMA, or SPI for their transport protocol. There is, however, no mainline target support.

2 Overview of Target Drivers

2.1 Target Driver

Generally in the past, a target driver has been responsible for the following tasks:

Figure 2: SCSI target and initiator

1. Handling its interconnect hardware interface and transport protocol.

2. Processing the primary and device specific command sets.

3. Accessing local devices (attached to the server directly) when necessary.

Since hardware interfaces are unique, the kernel needs a specific target driver for every hardware interface. However, the rest of the tasks are independent of hardware interfaces and transport protocols.

The duplication of code between tasks two and three led to the necessity for a target framework that provides an API set useful for every target driver. In tgt, target drivers simply take SCSI commands from transport protocol packets, hand them over to the framework, and send back the responses to the clients via transport protocol packets. Figure 3 shows a simplified view of how hardware interfaces and transport protocols interact in tgt. It is more complicated than the above explanation of the ideal model due to some exceptions described below.

Tgt is integrated with Linux's SCSI Mid Layer (SCSI-ML), so it supports two hardware interface models:

Hardware: A Host Bus Adapter (HBA) handles the major part of transport protocol processing and the target driver implements the functionality to communicate between the HBA and tgt. Tgt needs a specific target driver for each type of HBA. FCP and SPI drivers follow this model. Drivers for other transports like iSCSI or SRP or for interconnects like iSER follow this model when there is specialized hardware to offload protocol or interconnect processing.

Software: For transports like iSCSI and SRP or interconnects like iSER, a target driver can implement the transport protocol processing in a kernel module and access low level hardware through another subsystem such as the networking or infiniband stack. This allows a single target driver to work with various hardware interfaces.

Figure 3: transport protocols and hardware interfaces

3 Target Framework (tgt)

Our key design philosophy is implementing a significant portion of tgt in user space while maintaining performance comparable to a target driver implemented in kernel space. This conforms to the current trend of pushing code that can be implemented in user space out of the kernel [6] and enables developers to use rich user space libraries and development tools such as gdb.

Figure 4: tgt components

As can be seen in Figure 4, the tgt architecture has two kernel components: the target driver and tgt core. The target driver's primary responsibilities are to manage the transport connections with initiator devices and pass commands and task management function requests between its hardware or interconnect subsystem and tgt core. Tgt core is a simple connector between target drivers and the user space daemon (tgtd) that enables the driver to send tgtd a vector of commands or task management function requests through a netlink interface.

Tgt core was integrated into scsi-ml with minor modifications to the scsi_host_template and various scsi helper functions for allocating scsi commands. This allows tgt to rely on scsi-ml and the Block Layer for tricky issues such as hot-plugging, command buffer mapping, scatter gather list creation, and transport class integration. Note that tgt does not change the current scsi-ml API set, so normally the only modifications required are to the initiator low level driver's (LLD) interrupt handler, to process target specific requests, and to the transport classes, so that they are able to present target specific attributes.

All SCSI protocol processing is performed in user space, so as can be seen in Figure 4 the bulk of tgt is implemented in tgtadm, tgtd, transport libraries, and driver libraries. tgtadm is a simple management tool. A transport library is equivalent to a kernel transport class where functionality common to a set of drivers using the same transport can be placed. Driver libraries are dynamically linked target driver specific libraries that can be used to implement functionality such as special setup and tear down operations. And tgtd is the SCSI state machine that executes commands and task management requests.

The clear concern over the user space SCSI protocol processing is degraded performance.1 We explain some techniques to overcome this problem as we discuss the tgt components in more detail.

3.1 API for Target Drivers

The target drivers interact with tgt core through a new tgt API set, and the existing mid-layer API set and data structures. For the most part, target drivers work in a very similar manner as the existing initiator drivers. In many cases the initiator only needs to implement the new target callbacks on the scsi_host_template: transfer_response(), transfer_data(), and tsk_mgmt_response(), to enable a target mode in its hardware. We examine the details of the new callbacks later in this section.

3.1.1 Kernel Setup

The first step in registering a target driver with scsi-ml and tgt core is to create a scsi host adapter instance. This is accomplished by calling the same functions that are used for the initiator: scsi_host_alloc() and scsi_add_host(). If an HBA will be running in both target and initiator mode, then only a single call to each of those functions is necessary for each HBA. The final step in setting up a target driver is to allocate a uspace_req_q for each scsi host that will be running in target mode. A uspace_req_q is used by tgt core to send requests to user space. It can be allocated and initialized by calling scsi_tgt_alloc_queue().
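As a rough illustration of this registration sequence, here is a minimal sketch for a hypothetical driver called example_tgt. The callback names come from the template fields listed above, but their signatures, the header name, and the error handling are assumptions made for illustration rather than the definitive tgt interface.

#include <scsi/scsi_host.h>
#include <scsi/scsi_tgt.h>      /* header name is an assumption */

/* target-mode callbacks (bodies omitted); signatures are assumptions */
extern int example_transfer_response(struct scsi_cmnd *cmd,
                                     void (*done)(struct scsi_cmnd *));
extern int example_transfer_data(struct scsi_cmnd *cmd,
                                 void (*done)(struct scsi_cmnd *));
extern int example_tsk_mgmt_response(u64 mid, int result);

static struct scsi_host_template example_tgt_template = {
        .name              = "example_tgt",
        .transfer_response = example_transfer_response,
        .transfer_data     = example_transfer_data,
        .tsk_mgmt_response = example_tsk_mgmt_response,
};

static int example_tgt_probe(struct device *dev)
{
        struct Scsi_Host *shost;
        int err;

        /* same allocation path as an initiator driver */
        shost = scsi_host_alloc(&example_tgt_template, 0);
        if (!shost)
                return -ENOMEM;

        err = scsi_add_host(shost, dev);
        if (err)
                goto put_host;

        /* allocate the uspace_req_q used to pass requests to tgtd */
        err = scsi_tgt_alloc_queue(shost);
        if (err)
                goto remove_host;
        return 0;

remove_host:
        scsi_remove_host(shost);
put_host:
        scsi_host_put(shost);
        return err;
}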

3.1.2 Processing SCSI Commands in the Target Driver

The target driver needs to allocate the scsi_cmnd data structure for a SCSI command received from a client via scsi_host_get_command(). This corresponds to scsi-ml's scsi_get_command() usage for allocating a scsi_cmnd for each request coming from the Block Layer or a scsi-ml Upper Layer Driver (ULD). While the former allocates both the scsi_cmnd and the request data structures, the latter allocates only the scsi_cmnd data structure.

The target driver sets up and passes the scsi_cmnd data structure to tgt core via scsi_tgt_queue_command(). The following information is passed to tgt core by the target driver (a minimal sketch of this path follows the list below):

SCSI command: buffer containing the SCSI command.

lun buffer: buffer representing the logical unit number.

tag: unique value identifying this SCSI command.

task attribute: task attribute used for ordering.

buffer length: number of data bytes to transfer.
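To tie the pieces together, here is a rough sketch of an LLD receive path allocating a command and queuing it to tgt core. The surrounding function, the field assignments, and the exact argument lists of scsi_host_get_command() and scsi_tgt_queue_command() are assumptions for illustration; only the function names come from the text above.

/* Hypothetical receive path in a target LLD (not the real tgt API). */
static void example_recv_command(struct Scsi_Host *shost,
                                 u8 *cdb, u8 *lun, u64 tag,
                                 int attribute, u32 data_len)
{
        struct scsi_cmnd *cmd;

        /* allocate a scsi_cmnd, the target-mode analogue of
         * scsi_get_command() (argument list is an assumption) */
        cmd = scsi_host_get_command(shost, DMA_BIDIRECTIONAL, GFP_ATOMIC);
        if (!cmd)
                return;                 /* real code would report BUSY */

        memcpy(cmd->cmnd, cdb, sizeof(cmd->cmnd));
        cmd->request_bufflen = data_len;        /* buffer length */
        /* the task attribute would also be recorded on the command */

        /* hand the command, lun buffer, and tag to tgt core
         * (parameter list is an assumption) */
        scsi_tgt_queue_command(cmd, lun, tag);
}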

On completion of executing a SCSI command, tgt core invokes transfer_response(), which is specified in the scsi_host_template data structure.

transfer_data() is invoked prior to transfer_response() if a SCSI command involves a data transfer. Like scsi-ml, a scatter gather list of pages at the request_buffer member in the scsi_cmnd data structure is used to specify the data to transfer. Also like scsi-ml, tgt core utilizes Block Layer and scsi-ml helpers to create scatter gather lists within the scsi_host_template limits such as max_sectors, dma_boundary, sg_tablesize, and use_clustering.

If the SCSI command involves a target-to-initiator data transfer, the target driver transfers the data pointed to by the scatter gather list to the client, and then invokes the function pointer passed as an argument of transfer_data() to notify tgt core of the completion of the operation.

If the SCSI command involves an initiator-to-target data transfer, the target driver copies (through a DMA operation or memcpy) data to the scatter gather list (the LLD or transport class requests the client to send the data to be written before the actual transfer, if necessary), and then invokes the function pointer passed as an argument of transfer_data() to notify tgt core of the completion of the transfer.
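For the target-to-initiator case, an LLD's transfer_data() implementation might look roughly like the following; the prototype and the example_send_sg() helper are assumptions made for illustration, not the actual tgt prototypes.

/* stand-in for the driver's DMA or transport send routine */
extern void example_send_sg(void *sglist, unsigned short nents);

/* Hedged sketch of a transfer_data() callback in a target LLD. */
static int example_transfer_data(struct scsi_cmnd *cmd,
                                 void (*done)(struct scsi_cmnd *))
{
        /* push the scatter gather list at cmd->request_buffer out to
         * the initiator */
        example_send_sg(cmd->request_buffer, cmd->use_sg);

        /* invoke the function pointer to tell tgt core this part of
         * the transfer has completed */
        done(cmd);
        return 0;
}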

Depending on the transfer size and hardware or transport limitations, tgt core may have to call transfer_data() multiple times to transmit the entire payload. To accomplish this, tgt is not able to easily reuse the existing Block Layer and SCSI APIs. This is because tgt core executes from interrupt context, and because the scatter list APIs tgt utilizes were not intended for requests starting at the end of the storage stack. To work around the Block Layer scatter gather list allocation function's assumption that a request will normally be completed in one scatter list, tgt required two modifications or workarounds. The first, and easiest, was the addition of an offset field to the scsi_cmnd to track where the LLD currently is in the transfer. The more difficult change, and probably more of a hack, was for tgt core to maintain two lists of BIOs for each request. One list contains BIOs that have not been mapped to scatter lists, and the second contains BIOs that have been mapped into scatter gather lists, completed, and need to be unmapped from process context when the command is completed.

3.1.3 Task Management Function

A target driver can send task management function (TMF) requests to tgt core via scsi_tgt_tsk_mgmt_request().

The first argument is the TMF type. Currently, the supported TMF types are ABORT_TASK, ABORT_TASK_SET, and LOGICAL_UNIT_RESET.

The second argument is the tag value identifying the command to abort. This corresponds to the tag argument of scsi_tgt_queue_command() and is used only with ABORT_TASK.

The third argument is the lun buffer identifying the logical unit against which the TMF request is performed. This is used with all TMF requests except ABORT_TASK.

The last argument is a pointer that enables target drivers to identify this TMF request on its completion.


tgt core invokes eh_abort_handler() per aborted command to allow the target driver to clean up any resources that it may have internally allocated for the command. Unlike when it is called by scsi-ml's error handler, the host is not guaranteed to be quiesced and may have initiator and target commands running.

Subsequent to eh_abort_handler(), tsk_mgmt_response() is invoked. The pointer identifying the completed TMF request is passed as the argument.

3.2 tgt core

tgt core conveys SCSI commands, TMF requests, and their results between target drivers and the user space daemon, tgtd, through a netlink interface, which enables a user space process to read and write a stream of data via the socket API. tgt core encapsulates the requests into netlink packets and sends them to user space to be executed. Then it receives netlink packets from user space, extracts the results of the operation, and performs auxiliary tasks in compliance with the results. Figure 5 shows the packet format for SCSI commands.

struct {
        int      host_no;
        uint32_t cid;
        uint32_t data_len;
        uint8_t  scb[16];
        uint8_t  lun[8];
        int      attribute;
        uint64_t tag;
} cmd_req;

Figure 5: Netlink packet for SCSI commands

Since moving large amounts of data via netlink leads to a performance drop because of the memory copies, for the command's data buffer tgt uses the memory mapped I/O technique utilized by the Linux SCSI generic (sg) device driver [4], which moves an address returned by the mmap() system call instead of lots of data.

When tgt core receives the address from user space, it increments the reference count on the pages of the mapped region and sets up the scatter gather list in the scsi_cmnd data structure. tgt core relies on the standard kernel APIs bio_map_user(), scsi_alloc_sgtable(), and blk_rq_map_sg() for these chores. Similarly, bio_unmap_user() and scsi_free_sgtable() decrement the reference counts and clean up the scatter gather list. The former also marks the pages as dirty in the case of an initiator-to-target data transfer (a WRITE_* command).

3.3 User Space Daemon (tgtd)

The user space daemon, tgtd, is the heart of tgt. It contains the SCSI state machine, executes requests, and provides a consistent API for management via Unix domain sockets. It communicates with the target drivers through tgt core's netlink interface.

tgtd currently uses a single-process model. This enables us to avoid tricky race conditions; imagine, for example, a SCSI command being sent to a particular device while a management request to remove that device arrives at the same time. However, it also means that tgtd always needs to work in an asynchronous manner.

The tgtd code is independent of transport protocols and target drivers. The transport-protocol dependent and target-driver dependent features, such as showing parameters, are implemented in dynamic libraries: transport-protocol libraries and target-driver libraries.


There are two kinds of instances that administrators must understand: target and device. A target instance works as a SCSI target device server. Every working scsi host adapter that implements a target driver is bound to a particular target instance. Multiple scsi host adapter instances can be bound to a single target instance. A device instance corresponds to a SCSI logical unit. A target instance can have multiple device instances.

3.3.1 SCSI Command Processing in tgtd

The previous sections detailed how a command is moved between the kernel and user space and how it is transferred between the target and initiator ports. Now the final piece of the process, where tgtd performs command execution, is described; a sketch of the receive path follows the steps below.

1. tgtd receives a netlink packet containing a SCSI command and finds the target instance (the device instance is looked up if necessary) to which the command should be routed. As shown in Figure 5, the packet contains the host bus adapter ID and the logical unit buffer.

2. tgtd processes the task attribute to decide when to execute the command (immediately or delayed).

3. When the command is scheduled, tgtd executes it and sends the result to tgt core.

4. tgtd is notified via tgt core's netlink interface that the target driver has completed any needed data transfer and has successfully sent the response. tgtd is then able to free the resources that it had allocated for the command.
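The following user space fragment is a minimal sketch of step 1, reading one cmd_req packet (Figure 5) from the netlink socket; the NETLINK_TGT protocol number and the overall structure are illustrative assumptions, not the real tgtd code.

#include <stdint.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#define NETLINK_TGT 23                  /* protocol number is hypothetical */

struct cmd_req {                        /* layout from Figure 5 */
        int      host_no;
        uint32_t cid;
        uint32_t data_len;
        uint8_t  scb[16];
        uint8_t  lun[8];
        int      attribute;
        uint64_t tag;
};

/* nl_fd would come from socket(AF_NETLINK, SOCK_RAW, NETLINK_TGT) */
static void handle_one_command(int nl_fd)
{
        struct {
                struct nlmsghdr nlh;
                struct cmd_req  req;
        } msg;

        if (recv(nl_fd, &msg, sizeof(msg), 0) < 0)
                return;

        /* step 1: use msg.req.host_no and msg.req.lun to find the target
         * and device instances, then schedule according to
         * msg.req.attribute (lookup and execution omitted) */
}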

In the case of non-I/O commands involving a target-to-initiator data transfer, tgtd allocates a buffer via valloc(), builds the response in it, and sends the address of the buffer to tgt core. The buffer is freed on completion of the command.

In the case of I/O commands, tgtd maps the requested length starting at the offset in the device's file, and sends the address to tgt core. On completion of the command, tgtd calls the munmap system call.

To improve performance, if tgtd can map the whole device file (typically possible on 64-bit architectures), tgtd does not call the mmap or munmap system calls per command. Instead, it maps the whole device file when the device instance is added to a target instance.
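For the I/O command path just described, a user space sketch of the per-command mapping might look like this (function and variable names are placeholders, and offset is assumed to be suitably aligned for mmap):

#include <sys/types.h>
#include <sys/mman.h>

/* map 'len' bytes of the backing file at 'offset' for one I/O command */
static void *map_command_buffer(int dev_fd, off_t offset, size_t len)
{
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, dev_fd, offset);
        return buf == MAP_FAILED ? NULL : buf;
}

/* undo the mapping when the command completes */
static void unmap_command_buffer(void *buf, size_t len)
{
        munmap(buf, len);
}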

3.3.2 Task Management Function

When tgtd receives a task management function request to abort SCSI commands, it searches for the commands, sends an abort request for each command found, and then sends the TMF completion notification to tgt core.

Once tgt core marks pages as dirty, it is impossible to stop them from being committed to disk. Thus, tgtd does not try to abort a command that is waiting for such a completion. If tgtd receives a request to abort such a command, it waits for the completion of the command and then sends the TMF completion notification indicating that the command was not found.

3.4 Configuration

The currently supported management operations are: creation and deletion of target and device instances, and binding a host adapter instance to a target instance. All objects are independent of transport protocols. Transport-protocol dependent management requests (such as showing parameters) are performed by using the corresponding transport-protocol library.

The command-line management tool, tgtadm, is distributed together with tgt for ease of use, though tgtd provides a management API via Unix domain sockets so that administrators or vendors can implement their own management tools.
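As an illustration of that management API, a third-party tool could talk to tgtd over its Unix domain socket roughly as follows; the socket path and the request text are hypothetical, since the paper does not define the management protocol.

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        char reply[256];
        ssize_t n;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
                return 1;
        /* socket path is a made-up example */
        strncpy(addr.sun_path, "/var/run/tgtd.sock", sizeof(addr.sun_path) - 1);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                return 1;

        /* request format is invented purely for illustration */
        write(fd, "show targets\n", 13);
        n = read(fd, reply, sizeof(reply) - 1);
        if (n > 0) {
                reply[n] = '\0';
                printf("%s", reply);
        }
        close(fd);
        return 0;
}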

4 Status and the Future

Today, tgt implements only what is necessary to benchmark it against kernel driver implementations and to provide basic functionality. This is because the code has been destabilized several times as a result of code review comments that have forced most of the code to be pushed to user space. This means that there is a long list of features to be implemented and ported from previous versions of tgt.

4.1 Target Driver Support

At the time of the writing of this paper, there is only one working target driver, ibmvstgt, which is an SRP target driver, though it works only in virtualization environments on IBM pSeries [2]. The virtual machines communicate via RDMA. One virtual machine (called the Virtual I/O Server) works as an SRP server that provides I/O services for the rest of the virtual machines. ibmvstgt is based on the IBM standalone (not a framework) target driver [3], ibmvscsis. By converting ibmvscsis to tgt, more than 2,000 lines were removed from the original driver.

Currently, a Qlogic FC (qla2xxx-based) target driver is being converted from the kernel based target framework SCST [9] to tgt, and an Emulex FC (lpfc-based) target driver that utilized a GPL FC target framework is being worked on. Both of these require FC transport class Remote Port (rport) changes that will allow the FC class and tgt core to perform transport level recovery for the target LLDs.

The iSCSI code in mainline has also begun to be modified to support target mode. With the introduction of libiscsi, a target could be implemented by creating a new iscsi transport module or by modifying the iscsi_tcp module.

4.2 Kernel and User Space Communication

Netlink usage leads to memory allocations and memory copies for every netlink packet. It also suffers from frequent system calls. tgt avoids a significant performance drop by using the memory mapped I/O technique for the command data buffer, but there is still room for improvement. By removing the copy of the command and its status, and by sending a vector of commands or statuses, we can reduce the memory copies and the kernel/user space trips.

Previous versions of tgt used an mmapped packet socket (AF_PACKET) to send messages to user space. This removes netlink from the command execution initiation and proved to be a performance gain, but difficulties in using the mmapped packet socket interface have prevented tgt from currently using it. Another option that provides high speed data transfers is the relayfs file system [10]. Unfortunately, both are unidirectional, and only communication from the kernel to user space is improved.

Another common technique for reducing system call usage is to send multiple requests in one invocation. During testing of tgt, this was attempted and showed promise, but it was difficult to tune. Investigation will be resumed when more drivers are stabilized and can be benchmarked.

4.3 Virtualization

The logical unit that executes requests from a remote initiator is not limited to being backed by a local device that provides a SCSI interface. The object used by tgt for the logical unit's device instance could be an IDE, SATA, SCSI, Device Mapper (DM), or Multiple Device (MD) device, or a file. To provide this flexibility, previous versions of tgt provided virtualized devices for the clients regardless of the attached local object. tgtd had two types of virtualization:

device virtualization: With device virtualization, a device_type_handler (DTH) emulates commands that cannot be executed directly by its device type. For example, an MD or DM device has no notion of a SCSI INQUIRY. In this case the DTH has the generic SCSI protocol emulation library execute the INQUIRY.

I/O virtualization: With I/O virtualization, an io_type_handler (IOTH) enables tgt to access regular and block device files using mmap, or to use specialized interfaces such as SG IO.

Many types and combinations of device and I/O virtualization are possible. A Virtual Tape Library (VTL) could provide a tape library using disk drives, or, by using the virtual CD device and file I/O virtualization, a tgt machine could provide cdrom devices for the client backed by ISO image files.

Another interesting virtualization mechanism is passthrough: directly passing SCSI commands to SCSI devices. This provides a feature, called a storage bridge, that binds different SAN protocols. For example, suppose that you already have a working FC environment, with an FC storage server and clients with FC HBAs. If you need to add new clients that must access the FC storage server, but you cannot afford to buy new FC HBAs, an iSCSI-FC bridge can connect the existing FC network with a new iSCSI network.

Currently, tgt supports disk device virtualization and file I/O virtualization. The file I/O virtualization simply opens a file and accesses its data via the mmap system call.

5 Related Work

SCST is the first GPLed attempt to implement a complete framework for target drivers. There are several major differences between tgt and SCST: SCST is mature and contains many features, all the SCST components reside in kernel space, and SCST duplicates functionality found in scsi-ml and the Block Layer instead of exploiting and modifying those subsystems.

Besides ibmvscsis, there have been several standalone target drivers. The iSCSI Enterprise Target software (IET) [1] is a popular iSCSI target driver for Ethernet adapters, which tgt has used as a base for istgt.2 Other software iSCSI targets include the UNH and Intel implementations [8, 7], and there are several iSCSI and FC drivers for specific hardware like Qlogic and Chelsio.

2 The first author has maintained IET. The failure to push it into the mainline kernel is one of the reasons why tgt was born.


6 Conclusion

This paper describes tgt, a new framework that adds storage target driver support to the SCSI subsystem. What differentiates tgt from other target frameworks and standalone drivers is its attempt to push the SCSI state model and I/O execution to user space.

By using the Block and SCSI layers, tgt has been able to quickly implement a solution that bypasses the performance problems that result from executing memory copies to and from the kernel. However, the Block and SCSI layers were not designed to handle large asynchronous requests originating from the LLD's interrupt handlers. Since the Block Layer SG IO and SCSI Upper Layer Drivers like SG share a common technique, code, and problems, we hope we will be able to find a final solution that will benefit tgt core and the rest of the kernel.

Tgt has undergone several rewrites as a result of code reviews, but it is now reaching a point where hardware interface vendors and part-time developers are collaborating to solidify tgt core and tgtd, implement new target drivers, make modifications to other kernel subsystems to support tgt, and implement new features.

The source code is available from http://stgt.berlios.de/.

References

[1] iSCSI Enterprise Target software, 2004. http://iscsitarget.sourceforge.net/.

[2] Dave Boutcher and Dave Engebretsen. Linux Virtualization on IBM POWER5 Systems. In Ottawa Linux Symposium, pages 113–120, July 2004.

[3] Dave Boutcher. SCSI target for IBM Power5 LPAR, 2005. http://patchwork.ozlabs.org/linuxppc64/patch?id=2285.

[4] Douglas Gilbert. The Linux SCSI Generic (sg) Driver, 1999. http://sg.torque.net/sg/.

[5] T10 Technical Editor. SCSI Architecture Model-3, 2004. http://www.t10.org/ftp/t10/drafts/sam3/sam3r14.pdf.

[6] Edward Goggin, Alasdair Kergon, Christophe Varoqui, and David Olien. Linux Multipathing. In Ottawa Linux Symposium, pages 147–167, July 2005.

[7] Intel iSCSI Reference Implementation, 2001. http://sourceforge.net/projects/intel-iscsi/.

[8] UNH-iSCSI initiator and target, 2003. http://unh-iscsi.sourceforge.net/.

[9] Vladislav Bolkhovitin. Generic SCSI Target Middle Level for Linux, 2003. http://scst.sourceforge.net/.

[10] Karim Yaghmour, Robert Wisniewski, Richard Moore, and Michel Dagenais. relayfs: An efficient unified approach for transmitting data from kernel to user space. In Ottawa Linux Symposium, pages 519–531, July 2003.


More Linux for Less
uClinux™ on a $5.00 (US) Processor

Michael Hennerich
Analog Devices
[email protected]

Robin Getz
Analog Devices
[email protected]

Abstract

While many in the Linux community focus on enterprise and multi-processor servers, there are also many who are working with and deploying Linux on the network edge. Due to its open nature, and the ability to swiftly develop complex applications, Linux is rapidly becoming the number one embedded operating system. However, there are many differences between running Linux on a quad-processor system with 16 Gig of memory and 250 Gig of RAID storage and running it on a system where the total cost of hardware is less than the price of a typical meal.

1 Introduction

In the past few years, Linux™ has become an increasingly popular operating system choice not only in the PC and server market, but also in the development of embedded devices, particularly consumer products, telecommunications routers and switches, Internet appliances, and industrial and automotive applications.

The advantage of Embedded Linux is that it is a royalty-free, open source, compact solution that provides a strong foundation for an ever-growing base of applications to run on. Linux is a fully functional operating system (OS) with support for a variety of network and file handling protocols, a very important requirement in embedded systems because of the need to "connect and compute anywhere at anytime." Modular in nature, Linux is easy to slim down by removing utility programs, tools, and other system services that are not needed in the targeted embedded environment. The advantages for companies using Linux in embedded markets are faster time to market, flexibility, and reliability.

This paper attempts to answer several questions that all embedded developers ask:

• Why use a kernel at all?

• What advantages does Linux provide over other operating systems?

• What is the difference between Linux on x86 and on low cost processors?

• Where can I get a kernel and how do I get started?

• Is Linux capable of providing real-time functionality?

• What are the possibilities for porting an existing real-time application to a system that also runs Linux?


2 Why use a kernel at all

All applications require control code as support for the algorithms that are often thought of as the "real" program. The algorithms require data to be moved to and/or from peripherals, and many algorithms consist of more than one functional block. For some systems, this control code may be as simple as a "super loop" blindly processing data that arrives at a constant rate. However, as processors become more powerful, considerably more sophisticated control or signal processing may be needed to realize the processor's potential, to allow the processor to absorb the required functionality of previously supported chips, and to allow a single processor to do the work of many. The following sections provide an overview of some of the benefits of using a kernel on a processor.

2.1 Rapid Application Development

The use of the Linux kernel allows rapid development of applications compared to creating all of the required control code by hand. An application or algorithm can be created and debugged on an x86 PC using powerful desktop debugging tools and standard programming interfaces to device drivers. Moving this code base to an embedded Linux kernel running on a low-cost embedded processor is trivial because the device driver model is exactly the same. Opening an audio device on the x86 desktop is done in exactly the same way as on an embedded Linux system. This allows you to concentrate on the algorithms and the desired control flow rather than on the implementation details. Embedded Linux kernels and applications support the use of C, C++, and assembly language, encouraging the development of code that is highly readable and maintainable, yet retaining the option of hand-optimizing if necessary.

2.2 Debugged Control Structures

Debugging a traditional hand-coded application can be laborious because development tools (compiler, assembler, and linker among others) are not aware of the architecture of the target application and the flow of control that results. Debugging complex applications is much easier when instantaneous snapshots of the system state and statistical runtime data are clearly presented by the tools. To help offset the difficulties in debugging software, embedded Linux kernels are tested with the same tests that many desktop distributions use before releasing a Linux kernel. This ensures that the embedded kernel is as bug-free as possible.

2.3 Code Reuse

Many programmers begin a new project by writing the infrastructure portions that transfer data to, from, and between algorithms. This necessary control logic usually is created from scratch by each design team and infrequently reused on subsequent projects. The Linux kernel provides much of this functionality in a standard, portable, and reusable manner. Furthermore, the kernel and its tight integration with the GNU development and debug tools are designed to promote good coding practice and organization by partitioning large applications into maintainable and comprehensible blocks. By isolating the functionality of subsystems, the kernel helps to prevent the morass all too commonly found in systems programming. The kernel is designed specifically to take advantage of commonality in user applications and to encourage code reuse. Each thread of execution is created from a user-defined template, either at boot time or dynamically by another thread. Multiple threads can be created from the same template, but the state associated with each created instance of the thread remains unique. Each thread template represents a complete encapsulation of an algorithm that is unaware of other threads in the system unless it has a direct dependency.

2.4 Hardware Abstraction

In addition to a structured model for algorithms, the Linux kernel provides a hardware abstraction layer. The presented programming interfaces allow you to write most of the application in a platform-independent, high-level language (C or C++). The Linux Application Programming Interface (API) is identical for all processors which support Linux, allowing code to be easily ported to a different processor core. When porting an application to a new platform, programmers must only address the areas necessarily specific to a particular processor, normally device drivers. The Linux architecture identifies a crisp boundary around these subsystems and supports the traditionally difficult development with a clear programming framework and code generation. Common devices can use the same driver interface (for example, a serial port driver may be specific to certain hardware, but the application ←→ serial port driver interface should be exactly the same, providing a well-defined hardware abstraction and making application development faster).

2.5 Partitioning an Application

A Linux application or thread is an encapsulation of an algorithm and its associated data. When beginning a new project, use this notion of an application or thread to leverage the kernel architecture and to reduce the complexity of your system. Since many algorithms may be thought of as being composed of subalgorithm building blocks, an application can be partitioned into smaller functional units that can be individually coded and tested. These building blocks then become reusable components in more robust and scalable systems.

You define the behavior of Linux applications by creating the application. Many applications or threads of the same type can be created, but for each thread type, only one copy of the code is linked into the executable. Each application or thread has its own private set of variables defined for the thread type, its own stack, and its own C run-time context.

When partitioning an application into threads, identify portions of your design in which a similar algorithm is applied to multiple sets of data. These are, in general, good candidates for thread types. When data is present in the system in sequential blocks, only one instance of the thread type is required. If the same operation is performed on separate sets of data simultaneously, multiple threads of the same type can coexist and be scheduled for prioritized execution (based on when the results are needed).

2.6 Scheduling

The Linux kernel can be a preemptive multitasking kernel. Each application or thread begins execution at its entry point. Then, it either runs to completion or performs its primary function repeatedly in an infinite loop. It is the role of the scheduler to preempt execution of an application or thread and to resume its execution when appropriate. Each application or thread is given a priority to assist the scheduler in determining precedence.

The scheduler gives processor time to the thread with the highest priority that is in the ready state. A thread is in the ready state when it is not waiting for any system resources it has requested.


2.7 Priorities

Each application or thread is assigned a dynamically modifiable priority. An application is limited to forty (40) priority levels. However, the number of threads at each priority is limited, in practice, only by system memory. Priority level one is the highest priority, and priority thirty is the lowest. The system maintains an idle thread that is set to a priority lower than that of the lowest user thread.

Assigning priorities is one of the most difficult tasks of designing a real-time preemptive system. Although there has been research in the area of rigorous algorithms for assigning priorities based on deadlines (for example, rate-monotonic scheduling), most systems are designed by considering the interrupts and signals triggering the execution, while balancing the deadlines imposed by the system's input and output streams.

2.8 Preemption

A running thread continues execution unless it requests a system resource using a kernel system call. When a thread requests a signal (semaphore, event, device flag, or message) and the signal is available, the thread resumes execution. If the signal is not available, the thread is removed from the ready queue; the thread is blocked. The kernel does not perform a context switch as long as the running thread maintains the highest priority in the ready queue, even if the thread frees a resource and enables other threads to move to the ready queue at the same or lower priority. A thread can also be interrupted. When an interrupt occurs, the kernel yields to the hardware interrupt controller. When the ISR completes, the highest priority thread resumes execution.

2.9 Application and Hardware Interaction

Applications should have minimal knowledge of hardware; rather, they should use device drivers for hardware control. An application can control and interact with a device in a portable and hardware-abstracted manner through a standard set of APIs.

The Linux Interrupt Service Routine framework encourages you to remove specific knowledge of hardware from the algorithms encapsulated in threads. Interrupts relay information to threads through signals to device drivers or directly to threads. Using signals to connect hardware to the algorithms allows the kernel to schedule threads based on asynchronous events. The Linux run-time environment can be thought of as a bridge between two domains, the thread domain and the interrupt domain. The interrupt domain services the hardware with minimal knowledge of the algorithms, and the thread domain is abstracted from the details of the hardware. Device drivers and signals bridge the two domains.

2.10 Downside of using a kernel

• Memory consumption: to have a usable Linux system, you should consider having at least 4–8 MB of SDRAM and at least 2 MB of flash.

• Boot Time: the kernel is fast, but sometimes not fast enough; expect a 2–5 second boot time.

• Interrupt Latency: On occasion, a Linux device driver, or even the kernel, will disable interrupts. Some critical kernel operations cannot be interrupted, and it is unfortunate, but interrupts must be turned off for a bit. Care has been taken to keep critical regions as short as possible, as they cause increased and variable interrupt latency.

• Robustness: although the kernel has gone through lots of testing, and many people are using it, it is always possible that there are some undiscovered issues. Only you can test it in the configuration in which you will ship it.

3 Advantages of Linux

Despite the fact that Linux was not originally designed for use in embedded systems, it has found its way into many embedded devices. Since the release of kernel version 2.0.x and the appearance of commercial support for Linux on embedded processors, there has been an explosion of embedded devices that use Linux as their OS. Almost every day there seems to be a new device or gadget that uses Linux as its operating system, in most cases going completely unnoticed by the end users. Today a large number of the available broadband routers, firewalls, access points, and even some DVD players utilize Linux; for more examples, see Linuxdevices.1

Linux offers a huge number of drivers for all sorts of hardware and protocols. Combine that with the fact that Linux does not have run-time royalties, and it quickly becomes clear why there are so many developers using Linux for their devices. In fact, in a recent embedded survey, 75% of developers indicated they are using, or are planning on using, an open source operating system.2

1 http://www.linuxdevices.org
2 Embedded systems survey: http://www.embedded.com/showArticle.jhtml?articleID=163700590

Many commercial and non-commercial Linux kernel trees and distributions enable a wide variety of choices for the embedded developer.

One of the special trees is the uClinux kernel tree, at http://www.uclinux.org (pronounced you-see-linux; the name uClinux comes from combining the Greek letter mu (µ) and the English capital C, where mu stands for micro and the C is for controller). This is a distribution which includes a Linux kernel optimized for low-cost processors, including processors without a Memory Management Unit (MMU). While the noMMU kernel patch has been included in the official Linux 2.6.x kernel, the most up-to-date development activity and projects can be found at the uClinux Project Page and the Blackfin/uClinux Project Page.3 Patches such as these are used by commercial Linux vendors in conjunction with their additional enhancements, development tools, and documentation to provide their customers an easy-to-use development environment for rapidly creating powerful applications on uClinux.

Contrary to most people's understanding, uClinux is not a "special" Linux kernel tree, but the name of a distribution which goes through testing on low-cost embedded platforms.

www.uclinux.org provides developers with a Linux distribution that includes different kernels (2.0.x, 2.4.x, 2.6.x) along with required libraries; basic Linux shells and tools; and a wide range of additional programs such as a web server, an audio player, programming languages, and a graphical configuration tool. There are also programs specially designed with size and efficiency as their primary considerations. One example is busybox, a multicall binary, which is a program that includes the functionality of a lot of smaller programs and acts like any one of them if it is called by the appropriate name. If busybox is linked to ls and contains the ls code, it acts like the ls command. The benefit of this is that busybox saves some overhead for unique binaries, and those small modules can share common code.

3 http://www.blackfin.uclinux.org

In general, the uClinux distribution is more than adequate to compile a full Linux image for a communication device, like a router, without writing a single line of code.

4 Differences between MMU Linux and noMMU Linux

Since Linux on processors with and without an MMU is similar to UNIX in that it is a multiuser, multitasking OS, the kernel has to take special precautions to assure the proper and safe operation of up to thousands of processes from different users on the same system at once. The UNIX security model, after which Linux is designed, protects every process in its own environment with its own private address space. Every process is also protected from processes being invoked by different users. Additionally, a Virtual Memory (VM) system has additional requirements that the Memory Management Unit (MMU) must handle, like dynamic allocation of memory and mapping of arbitrary memory regions into the private process memory.

Some processors, like Blackfin, do not provide a full-fledged MMU. These processors are more power efficient and significantly cheaper than the alternatives, while sometimes having higher performance.

Even on processors featuring Virtual Memory, some system developers target their application to run without the MMU turned on, because noMMU Linux can be significantly faster than Linux on the same processor: the overhead of MMU operations can be significant. Even when an MMU is available, it is sometimes not used in systems with high real-time constraints. Context switching and Inter Process Communication (IPC) can also be several times faster on uClinux. A benchmark on an ARM9 processor, done by H.S. Choi and H.C. Yun, has proven this.4

4 http://opensrc.sec.samsung.com/document/uc-linux-04_sait.pdf

To support Linux on processors without an MMU, a few trade-offs have to be made:

1. No real memory protection (a faulty process can bring the complete system down)

2. No fork system call

3. Only simple memory allocation

4. Some other minor differences

4.1 Memory Protection

Memory protection is not a real problem for most embedded devices. Linux is a very stable platform, particularly in embedded devices, where software crashes are rarely observed. Even on an MMU-based system running Linux, software bugs in kernel space can crash the whole system. Since Blackfin has memory protection, but not Virtual Memory, Blackfin/uClinux has better protection than other noMMU systems, and does provide some protection from applications writing into peripherals, and therefore will be more robust than uClinux running on different processors.

The two most common reasons for uClinux to crash are:

• Stack overflow: When Linux is running on an architecture where a full MMU exists, the MMU provides Linux programs with basically unlimited stack and heap space, through the virtualization of physical memory. However, most embedded Linux systems will have a fixed amount of SDRAM and no swap, so it is not really "unlimited." A program with a memory leak can still crash the entire system on embedded Linux with an MMU and virtual memory.

Because noMMU Linux cannot support VM, it allocates stack space at compile time, at the end of the data for the executable. If the stack grows too large on noMMU Linux, it will overwrite the static data and code areas. This means that the developer, who previously was oblivious to stack usage within the application, must now be aware of the stack requirements.

On gcc for Blackfin, there is a compiler option to enable stack checking. If the option -fstack-limit-symbol=_stack_start is set, the compiler will add extra code which checks to ensure that the stack is not exceeded. This ensures that random crashes due to stack corruption/overflow will not happen on Blackfin/uClinux. Once an application compiled with this option exceeds its stack limit, it gracefully dies. The developer can then increase the stack size at compile time, or at runtime with the flthdr utility program. On production systems, stack checking can either be removed (to increase performance and reduce code size) or left in for the increase in robustness (see the example command after this list).

• Null pointer reference: The Blackfin MMU does provide partial memory protection, and can segment user space from kernel (supervisor) space. On Blackfin/uClinux, the first 4K of memory starting at NULL is reserved as a buffer for bad pointer dereferences. If an application uses an uninitialized pointer that reads or writes into the first 4K of memory, the application will halt. This ensures that random crashes due to uninitialized pointers are less likely to happen. Other implementations of noMMU Linux will start writing over the kernel.
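For example, a hypothetical application could be built with stack checking enabled like this (the source and output file names are placeholders; the compiler driver and the -fstack-limit-symbol option are the ones named above):

host> bfin-uclinux-gcc -Wl,-elf2flt \
          -fstack-limit-symbol=_stack_start \
          myapp.c -o myapp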

4.2 No Fork

The second point can be a little more problematic. In software written for UNIX or Linux, developers sometimes use the fork system call when they want to do things in parallel. The fork() call makes an exact copy of the original process and executes it simultaneously. To do that efficiently, it uses the MMU to map the memory from the parent process to the child and copies to the child only those memory parts that it writes to. Therefore, uClinux cannot provide the fork() system call. It does, however, provide vfork(), a special version of fork(), in which the parent is halted while the child executes. Therefore, software that uses the fork() system call has to be modified to use either vfork() or the POSIX threads that uClinux supports, because they share the same memory space, including the stack.
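A minimal sketch of the usual conversion: where MMU Linux code might fork() and keep running arbitrary code in the child, noMMU code typically uses vfork() and immediately execs a new program (the program started here is only an example).

#include <unistd.h>
#include <sys/wait.h>

/* Spawn an external command without fork(). The child shares the
 * parent's address space, and the parent is suspended until the
 * child calls exec or _exit. */
int spawn_sync(void)
{
        pid_t pid = vfork();

        if (pid == 0) {                 /* child: only exec or _exit */
                execl("/bin/sync", "sync", (char *)NULL);
                _exit(127);             /* exec failed */
        }
        if (pid < 0)
                return -1;

        return waitpid(pid, NULL, 0) < 0 ? -1 : 0;
}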

4.3 Memory Allocation

As for point number three, there usually is no problem with the malloc support noMMU Linux provides, but sometimes minor modifications may have to be made. Memory allocation on uClinux can be very fast, but on the other hand a process can allocate all available memory. Since memory can only be allocated in contiguous chunks, memory fragmentation can sometimes be an issue.


4.4 Minor Differences

Most of the software available for Linux or UNIX (a collection of software can be found on http://freshmeat.net) can be directly compiled on uClinux. For the rest, there is usually only some minor porting or tweaking to do. There are only very few applications that do not work on uClinux, with most of those being irrelevant for embedded applications.

5 Developing with uClinux

When selecting development hardware, developers should not only carefully make their selection with price and availability considerations in mind, but also look for readily available open source drivers and documentation, as well as development tools that make life easier, e.g., a kernel, driver, and application debugger, a profiler, and strace.

5.1 Testing uClinux

Especially when developing with open source, where software is given as-is, developers making a platform decision should also keep a careful eye on the test methodology for the kernel, drivers, libraries, and toolchain. After all, how can a developer, in a short time, determine if the Linux kernel running on processor A is better or worse than on processor B?

The simplest way to test a new kernel on a new processor is to just boot the platform and try out the software you normally run. This is an important test, because it most quickly tests the things that matter most, and you are most likely to notice things that are out of the ordinary compared to the normal way of working. However, this approach does not give widespread test coverage; each user tends to use the GNU/Linux system only for a very limited range of the available functions it offers, and it can take significant time to build the processor tool chain, build the kernel, and download it to the target for the testing.

Another alternative is to run test suites. These are software packages written for the express purpose of testing, and they are written to cover a wide range of functions and often to expose things that are likely to go wrong.

The Linux Test Project (LTP), as an example, is a joint project started by SGI and maintained by IBM that has the goal of delivering test suites to the open source community that validate the reliability, robustness, and stability of Linux. The LTP test suite contains a collection of tools for testing the Linux kernel and related features. Analog Devices, Inc., sponsored the porting of LTP to architectures supported by noMMU Linux.

Testing with test suites applies not only to the kernel, but also to all the other tools involved in the development process. If you cannot trust your compiler or debugger, then you are lost. Blackfin/uClinux uses DejaGnu to ease and automate the over 44,000 toolchain tests, and the checking of their expected results, while running on target hardware. In addition, there are test suites included in Blackfin/uClinux to do automated stress tests on the kernel and device drivers using expect scripts. All these tests can be easily reproduced because they are well documented.

Here are the test results for the Blackfin gcc-4.x compiler.


=== gas Summary ===
# of expected passes            79

=== binutils Summary ===
# of expected passes            26
# of untested testcases         7

=== gdb Summary ===
# of expected passes            9018
# of unexpected failures        62
# of expected failures          41
# of known failures             27
# of unresolved testcases       9
# of untested testcases         5
# of unsupported tests          32

=== gcc Summary ===
# of expected passes            36735
# of unexpected failures        33
# of unexpected successes       1
# of expected failures          75
# of unresolved testcases       28
# of untested testcases         28
# of unsupported tests          393

=== g++ Summary ===
# of expected passes            11792
# of unexpected failures        10
# of unexpected successes       1
# of expected failures          67
# of unresolved testcases       14
# of unsupported tests          165

All of the unexpected failures have been analysed to ensure that the toolchain is as stable as possible with all types of software that someone could use it with.

6 Where can I get uClinux and how do I get started?

Normally, the first selection that is made once Linux is chosen as the embedded operating system is to identify the lowest cost processor that will meet the performance targets. Luckily, many silicon manufacturers are fighting for this position.

During this phase of development it is all about the 5 processor Ps:

• Penguins

• Price

• Power

• Performance

• Peripherals

6.1 Low-cost Processors

Although the Linux kernel supports many architectures, including alpha, arm, frv, h8300, i386, ia64, m32r, m68k, mips, parisc, powerpc, s390, sh, sparc, um, v850, x86_64, and xtensa, many Linux developers are surprised to hear of a recent new Linux port to the Blackfin Processor.

Blackfin Processors combine the ability to do real-time signal processing with the functionality of microprocessors, fulfilling the requirements of digital audio and communication applications. The combination of a signal processing core with a traditional processor architecture on a single chip avoids the restrictions, complexity, and higher costs of traditional heterogeneous multiprocessor systems.

All Blackfin Processors combine a state-of-the-art signal processing engine with the advantages of a clean, orthogonal, RISC-like microprocessor instruction set and Single Instruction Multiple Data (SIMD) multimedia capabilities into a single instruction set architecture. The Micro Signal Architecture (MSA) core is a dual-MAC (Multiply Accumulator Unit) modified Harvard Architecture that has been designed to have unparalleled performance on typical signal processing5 algorithms, as well as on standard program flow and the arbitrary bit manipulation operations mainly used by an OS.

The single-core Blackfin Processors have two large blocks of on-chip memory providing high bandwidth access to the core. These memory blocks are accessed at full processor core speed (up to 756 MHz). The two memory blocks sitting next to the core, referred to as L1 memory, can be configured either as data or instruction SRAM or as cache. When configured as cache, the speed of executing external code from SDRAM is nearly on par with running the code from internal memory. This feature is especially well suited for running the uClinux kernel, which doesn't fit into internal memory. Also, when programming in C, the memory access optimization can be left up to the core by using cache.

6.2 Development Environment

A typical uClinux development environment consists of a low-cost Blackfin STAMP board, the GNU Compiler Collection (gcc cross compiler), and the binutils (linker, assembler, etc.) for the Blackfin Processor. Additionally, some GNU tools like awk, sed, make, bash, etc., plus tcl/tk, are needed, although they usually come by default with the desktop Linux distribution.

An overview of some of the STAMP board features is given below:

• ADSP-BF537 Blackfin device with JTAG interface

• 500 MHz core clock

• Up to 133 MHz system clock

• 32M x 16-bit external SDRAM (64 MB)

• 2M x 16-bit external flash (4 MB)

• 10/100 Mbps Ethernet interface (via on-chip MAC, connected via DMA)

• CAN interface

• RS-232 UART interface with DB9 serial connector

• JTAG ICE 14-pin header

• Six general-purpose LEDs, four general-purpose push buttons

• Discrete IDC expansion ports for all processor peripherals

5 Analog Devices, Inc., Blackfin Processors: http://www.analog.com/blackfin

All sources and tools (compiler, binutils, GNU debugger) needed to create a working uClinux kernel on the Blackfin Processors can be freely obtained from http://www.blackfin.uclinux.org. To use the binary RPMs, a PC with a Linux distribution like Red Hat or SuSE is needed. Developers who cannot install Linux on their PC have an alternative. Cooperative Linux (coLinux) is a relatively new means to provide Linux services on a Windows host. There already exists an out-of-the-box solution that can be downloaded for free from http://blackfin.uclinux.org/projects/bfin-colinux. This package comes with a complete Blackfin uClinux distribution, including all user-space applications and a graphical Windows-like installer.

After the installation of the development environment and the decompression of the uClinux distribution, development may start.


Figure 1: BF537-STAMP Board from Analog Devices

6.3 Compiling a kernel & Root Filesystem

First, the developer uses the graphical configuration utility to select an appropriate Board Support Package (BSP) for the target hardware. Supported target platforms are the STAMP for the BF533 and BF537, and the EZKIT for the dual core Blackfin BF561. Other Blackfin derivatives not listed, like the BF531, BF532, BF536, or BF534, are also supported, but there isn't a default configuration file included.

After the default kernel is configured and successfully compiled, there is a full-featured Linux kernel and a filesystem image that can be downloaded and executed, or flashed, via the NFS, tftp, or Kermit protocols onto the target hardware with the help of the preinstalled u-boot boot loader. Once successful, further development can proceed.

6.4 Hello World

A further step could be the creation of a simple Hello World program.

Here is the program hello.c, as simple as it can be:

#include <stdio.h>

int main()
{
        printf("Hello World\n");
        return 0;
}

The first step is to cross compile hello.c on the development host PC:

host> bfin-uclinux-gcc -Wl,-elf2flt \
          hello.c -o hello

The output executable is hello.

When compiling programs that run on the target under the Linux kernel, bfin-uclinux-gcc is the compiler used. Executables are linked against the uClibc runtime library. uClibc is a C library for developing embedded Linux systems. It is much smaller than the GNU C Library, but nearly all applications supported by glibc also work perfectly with uClibc. Library function calls like printf() invoke a system call, telling the operating system to print a string to stdout, the console. The -elf2flt command line option tells the linker to generate a flat binary; elf2flt converts a fully linked ELF object file created by the toolchain into a binary flat (BFLT) file for use with uClinux.

The next step is to download hello to the target hardware. There are many ways to accomplish that. One convenient way is to place hello into an NFS- or SAMBA-exported file share on the development host, while mounting the share from the target uClinux system. Other alternatives are placing hello in a web server's root directory and using the wget command on the target board, or simply using ftp, tftp, or rcp to transfer the executable.
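For example, with the executable placed in a web server's root directory on the development host, the board could fetch and run it like this (the host IP address is just a placeholder):

target> wget http://192.168.0.1/hello
target> chmod +x hello
target> ./hello
Hello World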

6.5 Debugging in uClinux

Debugging tools are not a necessity in the hello case, but as programs become more sophisticated, the availability of good debugging tools becomes a requirement.

Sometimes an application just terminates after being executed, without printing an appropriate error message. The reasons for this are almost infinite, but most of the time it can be traced back to something really simple, e.g., it cannot open a file, device driver, etc.

strace is a debugging tool which prints out a trace of all the system calls made by another program. System calls and signals are events that happen at the user/kernel interface. A close examination of this boundary is very useful for bug isolation, sanity checking, and attempting to capture race conditions.

If strace does not lead to a quick result, developers can follow the unspectacular way most Linux developers go, using printf or printk to add debug statements in the code and recompile/rerun.

This method can be exhausting. The standard Linux GNU Debugger (GDB), with its graphical front-ends, can be used instead to debug user applications. GDB supports single stepping, backtraces, breakpoints, watchpoints, etc. There are several options for connecting gdb to the gdbserver on the target board: gdb can connect over Ethernet, Serial, or JTAG (rproxy). For debugging in the kernel space, for instance device drivers, developers can use the kgdb Blackfin patch for the gdb debugger.
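A typical Ethernet-based gdbserver session might look like the following; the cross debugger name bfin-uclinux-gdb, the target address, and the port are assumptions for illustration.

target> gdbserver :1234 ./hello

host> bfin-uclinux-gdb hello
(gdb) target remote 192.168.0.2:1234
(gdb) break main
(gdb) continue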

If a target application does not work because of hidden inefficiencies, profiling is the key to success. OProfile is a system-wide profiler for Linux-based systems, capable of profiling all running code at low overhead. OProfile uses the hardware performance counters of the CPU to enable profiling of a variety of interesting statistics, including basic time-spent profiling. All code is profiled: hardware and software interrupt handlers, kernel modules, the kernel, shared libraries, and applications.

The Blackfin gcc compiler has very favorable performance; a comparison with other gcc compilers can be found in the GCC Code-Size Benchmark Environment (CSiBE). But sometimes it might be necessary to do some hand optimization, to utilize all the enhanced instruction capabilities a processor architecture provides. There are a few alternatives: use inline assembly, assembly macros, or C-callable assembly.


Figure 2: Results from the GCC Code-Size Benchmark Environment (CSiBE), Department of Software Engineering, University of Szeged. The bar chart plots code size in bytes for the arm-elf, arm-linux, bfin-elf, i386-elf, i686-linux, m68k-elf, mips-elf, ppc-elf, sh-elf, and thumb-elf targets, with results ranging from roughly 2.7 to 3.9 million bytes.

6.6 C callable assembly

For a C program to be able to call an assembly function, the name of the function must be known to the C program. The function prototype is therefore declared as an external function.

extern int minimum(int,int);

In the assembly file, the same function name is used as the label at the jump address to which the function call branches. Names defined in C are used with a leading underscore, so the function is defined as follows.

.global _minimum;
_minimum:
    R0 = MIN(R0,R1);
    RTS; /* Return */

The function name must be declared using the .global directive in the assembly file to let the assembler and compiler know that it is used by another file. In this case registers R0 and R1 correspond to the first and second function parameters. The function return value is passed in R0. Developers should make themselves comfortable with the C runtime parameter passing model of the architecture used.
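From the C side the routine is then called like any other function; the small test program below is only an illustration:

#include <stdio.h>

extern int minimum(int, int);   /* implemented in the assembly file above */

int main(void)
{
    printf("minimum(3, 7) = %d\n", minimum(3, 7));
    return 0;
}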

6.7 Floating Point & Fractional Floating Point

Since many low-cost architectures (like Blackfin) do not include a hardware floating point unit (FPU), floating point operations are emulated in software. The gcc compiler supports two variants of soft floating-point support. These variants are implemented in terms of two alternative emulation libraries, selected at compile time.


The two alternative emulation libraries are:

• The default IEEE-754 floating-point library: a strictly-conforming variant, which offers less performance, but includes all the input checking that has been relaxed in the alternative library.

• The alternative fast floating-point library: a high-performance variant, which relaxes some of the IEEE rules in the interest of performance. This library assumes that its inputs will be valid numbers, rather than Not-a-Number values.

The selection of these libraries is controlled with the -ffast-math compiler option.

Luckily, most embedded applications do not use floating point.

However, many signal processing algorithms are performed using fractional arithmetic. Unfortunately, C does not have a fixed-point fractional data type. However, fractional operations can be implemented in C using integer operations. Most fractional operations must be implemented in multiple steps, and therefore consume many C statements for a single operation, which makes them hard to implement on a general purpose processor. Signal processors directly support single-cycle fractional and integer arithmetic: fractional arithmetic is used for the actual signal processing operations, while integer arithmetic is used for control operations such as memory address calculations, loop counters, and control variables.

The numeric format in signed fractional notation makes sense to use in all kinds of signal processing computations, because it is hard to overflow a fractional result: multiplying a fraction by a fraction results in a smaller number, which is then either truncated or rounded. The highest full-scale positive fractional number is 0.99999, while the highest full-scale negative number is -1.0. To convert a fractional back to an integer number, the fractional must be multiplied by a scaling factor so the result will always be between ±2^(N-1) for signed and 2^N for unsigned integers.
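As an illustration (not code from the distribution), a signed 1.15 fractional multiply can be expressed with plain integer operations roughly as follows:

#include <stdint.h>

/* Multiply two signed 1.15 fractional values using integer arithmetic.
 * The product of two Q15 numbers is a Q30 value; shifting right by 15
 * renormalizes it to Q15 (truncating). A production version would also
 * saturate the -1.0 * -1.0 case, which overflows the Q15 range. */
static int16_t fract16_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;   /* Q15 * Q15 = Q30 */
    return (int16_t)(product >> 15);             /* back to Q15 */
}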

6.8 Libraries

The standard uClinux distribution contains a rich set of available C libraries for compression, cryptography, and other purposes (openssl, libpcap, libldap, libm, libdes, libaes, zlib, libpng, libjpeg, ncurses, etc.). The Blackfin/uClinux distribution additionally includes libaudio, libao, libSTL, flac, tremor, libid3tag, mpfr, etc. Furthermore, Blackfin/uClinux developers are currently incorporating signal processing libraries into uClinux with highly optimized assembly functions that perform all kinds of common signal processing algorithms, such as convolution, FFT, DCT, and IIR/FIR filters, with low MIPS overhead.

6.9 Application Development

The next step would be the development of the special applications for the target device, or the porting of additional software. A lot of development can be done in shell scripts or languages like Perl or Python. Where C programming is mandatory, Linux, with its extraordinary support for protocols and device drivers, provides a powerful environment for the development of new applications.

Example: Interfacing a CMOS Camera Sensor

The Blackfin processor is a very I/O-balanced processor. This means it offers a variety of high-speed serial and parallel peripheral interfaces. These interfaces are designed in a way that they can be operated with very low or no overhead impact on the processor core, leaving enough time for running the OS and processing the incoming or outgoing data. A Blackfin processor, as an example, has multiple, flexible, and independent Direct Memory Access (DMA) controllers. DMA transfers can occur between the processor's internal memory and any of its DMA-capable peripherals. Additionally, DMA transfers can be performed between any of the DMA-capable peripherals and external devices connected to the external memory interfaces, including the SDRAM controller and the asynchronous memory controller.

The Blackfin processor provides, besides other interfaces, a Parallel Peripheral Interface (PPI) that can connect directly to parallel D/A and A/D converters, ITU-R-601/656 video encoders and decoders, and other general-purpose peripherals, such as CMOS camera sensors. The PPI consists of a dedicated input clock pin, up to three frame synchronization pins, and up to 16 data pins.

Figure 3 is an example of how easily a CMOS imaging sensor can be wired to a Blackfin processor, without the need for additional active hardware components.

Below is example code for a simple program that reads from a CMOS camera sensor, assuming a PPI driver is compiled into the kernel or loaded as a kernel module. There are two different PPI drivers available: a generic full-featured driver supporting various PPI operation modes (ppi.c), and a simple PPI Frame Capture Driver (adsp-ppifcd.c). The latter is used here. The application opens the PPI device driver and performs some I/O controls (ioctls), setting the number of pixels per line and the number of lines to be captured. After the application invokes the read system call, the driver arms the DMA transfer. The start of a new frame is detected by the PPI peripheral by monitoring the Line- and Frame-Valid strobes. A special correlation between the two signals indicates the start of frame and kicks off the DMA transfer, capturing pixels-per-line times lines samples. The DMA engine stores the incoming samples at the address allocated by the application. After the transfer is finished, execution returns to the application. The image is then converted into the PNG (Portable Network Graphics) format, utilizing libpng included in the uClinux distribution. The converted image is then written to stdout. Assuming the compiled program executable is called readimg, a command line to execute the program, writing the converted output image to a file, can look like the following:

root:~> readimg > /var/image.png

Example: Reading from a CMOS Camera Sensor

Audio, video, and still-image silicon products widely use an I2C-compatible Two Wire Interface (TWI) as a system configuration bus. The configuration bus allows a system master to gain access to device-internal configuration registers such as brightness. Usually, I2C devices are controlled by a kernel driver, but it is also possible to access all devices on an adapter from user space, through the /dev interface. The following example shows how to write a value of 0x248 into register 9 of an I2C slave device identified by I2C_DEVID:

#define I2C_DEVID   (0xB8>>1)
#define I2C_DEVICE  "/dev/i2c-0"

i2c_write_register(I2C_DEVICE, I2C_DEVID, 9, 0x0248);


Figure 3: Micron CMOS Imager gluelessly connected to Blackfin. Signals shown include the MICRON MT9T001 sensor's FRAME_VALID, LINE_VALID, PIXEL_CLOCK, DATA (8/16 bits), SCL, and SDA lines and the ADSP-BF537 PPI_FS1, PPI_FS2, PPI_FS3, PPI_CLK, PPIn_DATA, TMRn, SCL, and SDA pins.

Example: Writing configuration data to e.g. a CMOS Camera Sensor

The power of Linux is the inexhaustible number of applications released under various open source licenses that can be cross compiled to run on the embedded uClinux system. Cross compiling can sometimes be a little bit tricky, which is why it is discussed here.

6.10 Cross compiling

Linux or UNIX is not a single platform; there is a wide range of choices. Most programs distributed as source code come with a so-called configure script. This is a shell script that must be run to recognize the current system configuration, so that the correct compiler switches, library paths, and tools will be used. When there isn't a configure script, the developer can manually modify the Makefile to add target-processor-specific changes, or can integrate it into the uClinux distribution. Detailed instructions can be found here. The configure script is usually a big script, and it takes quite a while to execute. When this script is created from recent autoconf releases, it will work for Blackfin/uClinux with minor or no modifications.

The configure shell script inside a source package can be executed for cross compilation using the following command line:

host> CC='bfin-uclinux-gcc -O2 -Wl,-elf2flt' \
      ./configure \
      --host=bfin-uclinux \
      --build=i686-linux

Alternatively:

host> ./configure \
      --host=bfin-uclinux \
      --build=i686-linux \
      LDFLAGS='-Wl,-elf2flt' \
      CFLAGS=-O2


/* read.c: capture one frame from the PPI and write it to stdout as PNG */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <png.h>

#define WIDTH  1280
#define HEIGHT 1024

void put_image_png(char *image, int width, int height);

int main(int argc, char *argv[])
{
    int fd;
    char *buffer;

    /* Allocate memory for the raw image */
    buffer = (char *) malloc(WIDTH * HEIGHT);

    /* Open /dev/ppi */
    fd = open("/dev/ppi0", O_RDONLY, 0);
    if (fd == -1) {
        printf("Could not open /dev/ppi\n");
        free(buffer);
        exit(1);
    }

    /* Frame geometry: ioctl commands provided by the PPI frame capture driver */
    ioctl(fd, CMD_PPI_SET_PIXELS_PER_LINE, WIDTH);
    ioctl(fd, CMD_PPI_SET_LINES_PER_FRAME, HEIGHT);

    /* Read the raw image data from the PPI */
    read(fd, buffer, WIDTH * HEIGHT);

    put_image_png(buffer, WIDTH, HEIGHT);

    close(fd); /* Close PPI */
    return 0;
}

/*
 * convert image to png and write to stdout
 */
void put_image_png(char *image, int width, int height)
{
    int y;
    char *p;
    png_infop info_ptr;

    png_structp png_ptr = png_create_write_struct(PNG_LIBPNG_VER_STRING,
                                                  NULL, NULL, NULL);
    info_ptr = png_create_info_struct(png_ptr);

    png_init_io(png_ptr, stdout);

    png_set_IHDR(png_ptr, info_ptr, width, height,
                 8, PNG_COLOR_TYPE_GRAY, PNG_INTERLACE_NONE,
                 PNG_COMPRESSION_TYPE_DEFAULT, PNG_FILTER_TYPE_DEFAULT);

    png_write_info(png_ptr, info_ptr);
    p = image;

    for (y = 0; y < height; y++) {
        png_write_row(png_ptr, (png_bytep) p);
        p += width;
    }
    png_write_end(png_ptr, info_ptr);
    png_destroy_write_struct(&png_ptr, &info_ptr);
}

Figure 4: read.c file listing


#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/i2c.h>      /* struct i2c_msg */
#include <linux/i2c-dev.h>  /* I2C_SLAVE, I2C_RDWR, struct i2c_rdwr_ioctl_data */

#define I2C_SLAVE_ADDR 0x38 /* Randomly picked */

int i2c_write_register(char *device, unsigned char client,
                       unsigned char reg, unsigned short value)
{
    int addr = I2C_SLAVE_ADDR;
    char msg_data[32];
    struct i2c_msg msg = { addr, 0, 0, msg_data };
    struct i2c_rdwr_ioctl_data rdwr = { &msg, 1 };
    int fd, i;

    if ((fd = open(device, O_RDWR)) < 0) {
        fprintf(stderr, "Error: could not open %s\n", device);
        exit(1);
    }

    if (ioctl(fd, I2C_SLAVE, addr) < 0) {
        fprintf(stderr, "Error: could not bind address %x\n", addr);
    }

    msg.len = 3;
    msg.flags = 0;
    msg_data[0] = reg;
    msg_data[2] = (0xFF & value);
    msg_data[1] = (value >> 8);
    msg.addr = client;

    if (ioctl(fd, I2C_RDWR, &rdwr) < 0) {
        fprintf(stderr, "Error: could not write\n");
    }

    close(fd);
    return 0;
}

Figure 5: Application to write configuration data to a CMOS Sensor

There are at least two causes that can stop the running script: some of the files used by the script are too old, or there are missing tools or libraries. If the supplied scripts are too old to execute properly for bfin-uclinux, or they don't recognize bfin-uclinux as a possible target, the developer will need to replace config.sub with a more recent version, e.g. from an up-to-date gcc source directory. Only in very few cases is cross compiling not supported by the configure.in script manually written by the author and used by autoconf. In this case the latter file can be modified to remove or change the failing test case.

7 Network Oscilloscope

The Network Oscilloscope Demo is one of the sample applications, besides the VoIP Linphone application and the Networked Audio Player, included in the Blackfin/uClinux distribution. The purpose of the Network Oscilloscope Project is to demonstrate a simple remote GUI (Graphical User Interface) mechanism to share access and data distributed over a TCP/IP network. Furthermore, it demonstrates the integration of several open source projects and libraries as building blocks into a single application. For instance gnuplot, a portable command-line driven interactive data file and function plotting utility, is used to generate graphical data plots, while thttpd, a CGI (Common Gateway Interface) capable web server, services incoming HTTP requests. CGI is typically used to generate dynamic webpages. It's a simple protocol to communicate between web forms and a specified program. A CGI script can be written in any language, including C/C++, that can read stdin, write to stdout, and read environment variables.

The Network Oscilloscope works as follows. A remote web browser contacts the HTTP server running on uClinux where the CGI script resides, and asks it to run the program. Parameters from the HTML form, such as sample frequency, trigger settings, and display options, are passed to the program through the environment. The called program samples data from an externally connected Analog-to-Digital Converter (ADC) using a Linux device driver (adsp-spiadc.c). Incoming samples are preprocessed and stored in a file. The CGI program then starts gnuplot as a process and requests generation of a PNG or JPEG image based on the sampled data and form settings. The webserver takes the output of the CGI program and tunnels it through to the web browser. The web browser displays the output as an HTML page, including the generated image plot.

A simple C code routine can be used to supply data in response to a CGI request.

Example: Simple CGI Hello World application
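The original listing is not reproduced in this transcript; a minimal CGI program of this kind, written purely as an illustration, could look as follows. It emits the mandatory Content-type header followed by the HTML body:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *query = getenv("QUERY_STRING");

    /* The blank line after the header terminates the CGI header block */
    printf("Content-type: text/html\n\n");
    printf("<html><body>\n");
    printf("<h1>Hello World from uClinux</h1>\n");
    /* Form parameters arrive via environment variables such as QUERY_STRING */
    printf("<p>QUERY_STRING: %s</p>\n", query ? query : "(none)");
    printf("</body></html>\n");
    return 0;
}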

8 Real-time capabilities of uClinux

Since Linux was originally developed for server and desktop usage, it, like most other operating systems of comparable complexity and size, has no hard real-time capabilities. Nevertheless, Linux, and in particular uClinux, has excellent so-called soft real-time capabilities.

This means that while Linux or uClinux cannot guarantee certain interrupt or scheduler latencies, compared with other operating systems of similar complexity they show very favorable performance characteristics. If one needs a so-called hard real-time system that can guarantee scheduler or interrupt latency times, there are a few ways to achieve such a goal:

Provide the real-time capabilities in the form of an underlying minimal real-time kernel such as RT-Linux (http://www.rtlinux.org) or RTAI (http://www.rtai.org). Both solutions use a small real-time kernel that runs Linux as a real-time task with lower priority. Programs that need predictable real time are designed to run on the real-time kernel and are specially coded to do so. All other tasks and services run on top of the Linux kernel and can utilize everything that Linux can provide. This approach can guarantee deterministic interrupt latency while preserving the flexibility that Linux provides.

For the initial Blackfin port, included in Xenomai v2.1, the worst-case scheduling latency observed so far with userspace Xenomai threads on a Blackfin BF533 is slightly lower than 50 us under load, with an expected margin of improvement of 10–20 us in the future.

Xenomai and RTAI use Adeos as an underlying Hardware Abstraction Layer (HAL). Adeos is a real-time enabler for the Linux kernel. To this end, it enables multiple prioritized O/S domains to exist simultaneously on the same hardware, connected through an interrupt pipeline.

Xenomai as well as Adeos has been ported to the Blackfin architecture by Philippe Gerum, who leads both projects. This development has been significantly sponsored by Openwide, a specialist in embedded and real time solutions for Linux.

Nevertheless, in most cases hard real-time is not needed, particularly for consumer multimedia applications, in which the time constraints are dictated by the ability of the user to recognize glitches in audio and video. Those physically detectable constraints that have to be met normally lie in the area of milliseconds, which is no big problem on fast chips like the Blackfin processor. In Linux kernel 2.6.x, the new stable kernel release, those qualities have even been improved with the introduction of the new O(1) scheduler.

The figures below show the context switch time for a default Linux 2.6.x kernel running on Blackfin/uClinux:

Context switch time was measured with lat_ctx from lmbench. The processes are connected in a ring of Unix pipes. Each process reads a token from its pipe, possibly does some work, and then writes the token to the next process. As the number of processes increases, the effect of the cache decreases. For 10 processes the average context switch time is 16.2 us with a standard deviation of 0.58; 95% of the time it is under 17 us.

9 Conclusion

Blackfin Processors offer a good price/performance ratio (800 MMAC @ 400 MHz for less than (US)$5/unit in quantities), advanced power management functions, and small mini-BGA packages. This represents a very low-power, cost- and space-efficient solution. The Blackfin's advanced DSP and multimedia capabilities qualify it not only for audio and video appliances, but also for all kinds of industrial, automotive, and communication devices. Development tools are well tested and documented, and include everything necessary to get started and successfully finish in time. Another advantage of the Blackfin Processor in combination with uClinux is the availability of a wide range of applications, drivers, libraries, and protocols, often as open source or free software. In most cases, only basic cross compilation is necessary to get that software up and running. Combine this with such invaluable tools as Perl, Python, MySQL, and PHP, and developers have the opportunity to develop even the most demanding feature-rich applications in a very short time frame, often with enough processing power left for future improvements and new features.

10 Legal

This work represents the view of the authors and does not necessarily represent the view of Analog Devices, Inc.

Linux is a registered trademark of Linus Torvalds. uClinux is a trademark of Arcturus Networks Inc. SGI is a trademark of Silicon Graphics, Inc. ARM is a registered trademark of ARM Limited. Blackfin is a registered trademark of Analog Devices Inc. IBM is a registered trademark of International Business Machines Corporation. UNIX is a registered trademark of The Open Group. Red Hat is a registered trademark of Red Hat, Inc. SuSE is a registered trademark of Novell Inc.

All other trademarks belong to their respective owners.


Hrtimers and Beyond: Transforming the Linux Time Subsystems

Thomas Gleixner, linutronix

Douglas Niehaus, University of Kansas

Abstract

Several projects have tried to rework Linux time and timers code to add functions such as high-precision timers and dynamic ticks. Previous efforts have not been generally accepted, in part, because they considered only a subset of the related problems requiring an integrated solution. These efforts also suffered significant architecture dependence, creating complexity and maintenance costs. This paper presents a design which we believe provides a generally acceptable solution to the complete set of problems with minimum architecture dependence.

The three major components of the design are hrtimers, John Stultz's new timeofday approach, and a new component called clock events. Clock events manages and distributes clock events and coordinates the use of clock event handling functions. The hrtimers subsystem has been merged into Linux 2.6.16. Although the name implies "high resolution," there is no change to the tick-based timer resolution at this stage. John Stultz's timeofday rework addresses jiffy- and architecture-independent time keeping and has been identified as a fundamental prerequisite for high resolution timers and tickless/dynamic tick solutions. This paper provides details on the hrtimers implementation and describes how the clock events component will complement and complete the hrtimers and timeofday components to create a solid foundation for architecture-independent support of high-resolution timers and dynamic ticks.

1 Introduction

Time keeping and use of clocks is a fundamental aspect of operating system implementation, and thus of Linux. Clock related services in operating systems fall into a number of different categories:

• time keeping

• clock synchronization

• time-of-day representation

• next event interrupt scheduling

• process and in-kernel timers

• process accounting

• process profiling

These service categories exhibit strong interactions among their semantics at the design level and tight coupling among their components at the implementation level.


Hardware devices capable of providing clock sources vary widely in their capabilities, accuracy, and suitability for use in providing the desired clock services. The ability to use a given hardware device to provide a particular clock service also varies with its context in a uniprocessor or multi-processor system.

Figure 1: Linux Time System. For each architecture (Arch 1, Arch 2, Arch 3), separate TOD/clock source hardware and ISR/clock event source hardware code feeds timekeeping, the periodic tick, process accounting, profiling, jiffies, and the timer wheel.

Figure 1 shows the current software architecture of the clock related services in a vanilla 2.6 Linux system. The current implementation of clock related services in Linux is strongly associated with individual hardware devices, which results in parallel implementations for each architecture containing considerable amounts of essentially similar code. This code duplication across a large number of architectures makes it difficult to change the semantics of the clock related services or to add new features such as high resolution timers or dynamic ticks, because even a simple change must be made in so many places and adjusted for so many implementations. Two major factors make implementing changes to Linux clock related services difficult: (1) the lack of a generic abstraction layer for clock services and (2) the assumption that time is tracked using periodic timer ticks (jiffies), which is strongly integrated into much of the clock and timer related code.

2 Previous Efforts

A number of efforts over many years have addressed various clock related services and functions in Linux, including various approaches to high resolution time keeping and event scheduling. However, all of these efforts have encountered significant difficulty in gaining broad acceptance because of the breadth of their impact on the rest of the kernel, and because they generally addressed only a subset of the clock related services in Linux.

Interestingly, all those efforts have a common design pattern, namely the attempt to integrate new features and services into the existing clock and timer infrastructure without changing the overall design.

There are no projects to our knowledge which attempt to solve the complete problem as we understand and have described it. All existing efforts, in our view, address only a part of the whole problem as we see it, which is why, in our opinion, the solutions to their target problems are more complex than under our proposed architecture, and are thus less likely to be accepted into the mainline kernel.

3 Required Abstractions

The attempt to integrate high resolution timers into Ingo Molnar's real-time preemption patch led to a thorough analysis of the Linux timer and clock services infrastructure. While the comprehensive solution for addressing the overall problem is a large-scale task, it can be separated into different problem areas.

• clock source management for time keeping


• clock synchronization

• time-of-day representation

• clock event management for scheduling next event interrupts

• eliminating the assumption that timers are supported by periodic interrupts and expressed in units of jiffies

These areas of concern are largely independent and can thus be addressed more or less independently during implementation. However, the important points of interaction among them must be considered and supported in the overall design.

3.1 Clock Source Management

An abstraction layer and associated API are required to establish a common code framework for managing various clock sources. Within this framework, each clock source is required to maintain a representation of time as a monotonically increasing value. At this time, nanoseconds are the favorite choice for the time value units of a clock source. This abstraction layer allows the user to select among a range of available hardware devices supporting clock functions when configuring the system and provides the necessary infrastructure. This infrastructure includes, for example, mathematical helper functions to convert time values specific to each clock source, which depend on properties of each hardware device, into the common human-oriented time units used by the framework, i.e. nanoseconds. The centralization of this functionality allows the system to share significantly more code across architectures. This abstraction is already addressed by John Stultz's work on a Generic Time-of-day subsystem [5].
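As an illustration of the kind of scaled-math helper involved, conversion from raw clock-source cycles to nanoseconds is typically done with a per-source multiplier and shift; the function below is only a sketch of that idea, not the actual GTOD code:

/* Sketch: convert raw clock-source cycles to nanoseconds using
 * precomputed per-source scaling factors (mult, shift). */
static inline unsigned long long cycles_to_ns(unsigned long long cycles,
                                              unsigned long mult,
                                              unsigned int shift)
{
    return (cycles * mult) >> shift;
}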

3.2 Clock Synchronization

Crystal-driven clock sources tend to be imprecise due to variation in component tolerances and environmental factors, such as temperature, resulting in slightly different clock tick rates and thus, over time, different clock values on different computers. The Network Time Protocol (NTP) and, more recently, GPS/GSM based synchronization mechanisms allow the correction of system time values and of clock source drift with respect to a selected standard. Value correction is applied to the monotonically increasing value of the hardware clock source. This is optional functionality, as it can only be applied when a suitable reference time source is available. Support for clock synchronization is a separate component from those discussed here. There is work in progress by John Stultz and Roman Zippel to rework the current mechanism.

3.3 Time-of-day Representation

The monotonically increasing time value provided by many hardware clock sources cannot be set. The generic interface for time-of-day representation must thus compensate for drift as an offset to the clock source value, and represent the time-of-day (calendar or wall clock) time as a function of the clock source value. The drift offset and the parameters of the function converting the clock source value to a wall clock value can be set by manual interaction or under control of software for synchronization with external time sources (e.g. NTP).

It is important to note that the current Linux implementation of the time keeping component is the reverse of the proposed solution. The internal time representation tracks the time-of-day time fairly directly and derives the monotonically increasing nanosecond time value from it.


This is a relic of software development history, and the GTOD/NTP work is already addressing this issue.

3.4 Clock Event Management

While clock sources provide read access to the monotonically increasing time value, clock event sources are used to schedule the next event interrupt(s). The next event is currently defined to be periodic, with its period defined at compile time. The setup and selection of the event sources for various event driven functionalities is hardwired into the architecture dependent code. This results in duplicated code across all architectures and makes it extremely difficult to change the configuration of the system to use event interrupt sources other than those already built into the architecture. Another implication of the current design is that it is necessary to touch all the architecture-specific implementations in order to provide new functionality like high resolution timers or dynamic ticks.

The clock events subsystem tries to address this problem by providing a generic solution to manage clock event sources and their usage for the various clock event driven kernel functionalities. The goal of the clock event subsystem is to minimize the clock event related architecture dependent code to the pure hardware related handling and to allow easy addition and utilization of new clock event sources. It also minimizes the code duplicated across the architectures, as it provides generic functionality down to the interrupt service handler, which is almost inherently hardware dependent.

3.5 Removing Tick Dependencies

The strong dependency of Linux timers on using the periodic tick as the time source and representation was one of the main problems faced when implementing high resolution timers and variable interval event scheduling. All attempts to reuse the cascading timer wheel turned out to be incomplete and inefficient for various reasons. This led to the implementation of the hrtimers (formerly ktimers) subsystem. It provides the base for precise timer scheduling and is designed to be easily extensible for high resolution timers.

4 hrtimers

The current approach to timer management in Linux does a good job of satisfying an extremely wide range of requirements, but it cannot provide the quality of service required in some cases precisely because it must satisfy such a wide range of requirements. This is why the essential first step in the approach described here is to implement a new timer subsystem to complement the existing one by assuming a subset of its existing responsibilities.

4.1 Why a New Timer Subsystem?

The Cascading Timer Wheel (CTW), which was implemented in 1997, replaced the original time-ordered double linked list to resolve the scalability problem of the linked list's O(N) insertion time. It is based on the assumption that the timers are supported by a periodic interrupt (jiffies) and that the expiration time is also represented in jiffies. The difference in time value (delta) between now (the current system time) and the timer's expiration value is used as an index into the CTW's logarithmic array of arrays. Each array within the CTW represents the set of timers placed within a region of the system time line, where the size of the array's regions grows exponentially.


array  start      end         granularity
1      1          256         1
2      257        16384       256
3      16385      1048576     16384
4      1048577    67108864    1048576
5      67108865   4294967295  67108864

Table 1: Cascading Timer Wheel Array Ranges

Thus, the further into the future a timer's expiration value lies, the larger the region of the time line represented by the array in which it is stored. The CTW groups timers into 5 categories. Note that each CTW array represents a range of jiffy values and that more than one timer can be associated with a given jiffy value.

Table 1 shows the properties of the different timer categories. The first CTW category consists of n1 entries, where each entry represents a single jiffy. The second category consists of n2 entries, where each entry represents n1*n2 jiffies. The third category consists of n3 entries, where each entry represents n1*n2*n3 jiffies. And so forth. The current kernel uses n1=256 and n2..n5=64. This keeps the number of hash table entries in a reasonable range and covers the future time line range from 1 to 4294967295 jiffies.
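As a quick check of that range: with n1=256 and n2..n5=64, the five categories together span 256 * 64 * 64 * 64 * 64 = 2^32 = 4,294,967,296 jiffy values, which matches the end value of 4294967295 listed for the last category in Table 1.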

The capacity of each category depends on the size of a jiffy, and thus on the periodic interrupt interval. While the 10 ms tick period in 2.4 kernels implied 2560 ms for the CTW first category, this was reduced to 256 ms in the early 2.6 kernels (1 ms tick) and readjusted to 1024 ms when the HZ value was set to 250. Each CTW category maintains a time index counter which is incremented by the "wrapping" of the lower category index, which occurs when its counter increases to the point where its range overlaps that of the higher category. This triggers a "cascade" where timers from the matching entry in the higher category have to be removed and reinserted into the lower category's finer-grained entries. Note that in the first CTW category the timers are time-sorted with jiffy resolution.

While the CTW's O(1) insertion and removal is very efficient, timers with an expiration time larger than the capacity of the first category have to be cascaded into a lower category at least once. A single step of cascading moves many timers, and it has to be done with interrupts disabled. The cascading operation can thus significantly increase maximum latencies, since it occasionally moves very large sets of timers. The CTW thus has excellent average performance but unacceptable worst case performance. Unfortunately, the worst case performance determines its suitability for supporting high resolution timers.

However, it is important to note that the CTW is an excellent solution (1) for timers having an expiration time lower than the capacity of the primary category and (2) for timers which are removed before they expire or have to be cascaded. This is a common scenario for many long-term protocol-timeout related timers which are created by the networking and I/O subsystems.

The KURT-Linux project at the University of Kansas was the first to address implementing high resolution timers in Linux [4]. Its concentration was on investigating various issues related to using Linux for real-time computing. The UTIME component of KURT-Linux experimented with a number of data structures to support high resolution timers, including both separate implementations and those integrated with general purpose timers. The HRT project began as a fork of the UTIME code [1]. Both projects added a sub-jiffy time tracking component to increase resolution, and when integrating support with the CTW, sorted timers within a given jiffy on the basis of the sub-jiffy value.


This increased the overhead involved with cascading due to the O(N) sorting time. The experience of both projects demonstrated that timer management overhead was a significant factor, and that the necessary changes in the timer code were quite scattered and intrusive. In summary, the experience of both projects demonstrated that separating support for high resolution and longer-term generic (CTW) timers was necessary, and that a comprehensive restructuring of the timer-related code would be required to make future improvements and additions to timer-related functions possible. The hrtimers design and other aspects of the architecture described in this paper were strongly motivated by the lessons derived from both previous projects.

4.2 Solution

As a first step we categorized the timers into two categories:

Timeouts: Timeouts are used primarily by networking and device drivers to detect when an event (I/O completion, for example) does not occur as expected. They have low resolution requirements, and they are almost always removed before they actually expire.

Timers: Timers are used to schedule ongoing events. They can have high resolution requirements, and usually expire. Timers are mostly related to applications (user space interfaces).

The timeout-related timers are kept in the existing timer wheel, and a new subsystem optimized for (high resolution) timer requirements, hrtimers, was implemented.

hrtimers are entirely based on human time units: nanoseconds. They are kept in a time-sorted, per-CPU list, implemented as a red-black tree. Red-black trees provide O(log(N)) insertion and removal and are considered to be efficient enough, as they are already used in other performance critical parts of the kernel, e.g. memory management. The timers are kept in relation to time bases, currently CLOCK_MONOTONIC and CLOCK_REALTIME, ordered by their absolute expiration time. This separation allowed the removal of large chunks of code from the POSIX timer implementation, which was necessary to recalculate the expiration time when the clock was set either by settimeofday or NTP adjustments.

hrtimers went through a couple of revision cycles and were finally merged into Linux 2.6.16. The timer queues run from the normal timer softirq, so the resulting resolution is not better than the previous timer API. All of the structure is there to do better once the other parts of the overall timer code rework are in place.

After adding hrtimers the Linux time(r) system looks like this:

Figure 2: Linux time system + hrtimers. The structure of Figure 1 is extended with the new hrtimers subsystem alongside the timer wheel.

4.3 Further Work

The primary purpose of the separate implementation for the high resolution timers, discussed in Section 7, is to improve their support by eliminating the overhead and variable latency associated with the CTW. However, it is also important to note that this separation also creates an opportunity to improve the CTW behavior in supporting the remaining timers. For example, using a coarser CTW granularity may lower overhead by reducing the number of timers which are cascaded, given that an even larger percentage of CTW timers would be canceled under an architecture supporting high resolution timers separately. However, while this is an interesting possibility, it is currently a speculation that must be tested.

5 Generic Time-of-day

The Generic Time-of-day subsystem (GTOD) is a project led by John Stultz and was presented at OLS 2005. Detailed information is available in the OLS 2005 proceedings [5]. It contains the following components:

• Clock source management

• Clock synchronization

• Time-of-day representation

GTOD moves a large portion of code out of the architecture-specific areas into a generic management framework, as illustrated in Figure 3. The remaining architecture-dependent code is mostly limited to the direct hardware interface and setup procedures. It allows simple sharing of clock sources across architectures and allows the utilization of non-standard clock source hardware. GTOD is work in progress and intends to produce a set of changes which can be adopted step by step into the mainline kernel.

Figure 3: Linux time system + hrtimers + GTOD. The per-architecture time-of-day code of Figure 2 is replaced by generic clock source, TOD, and clock synchronization components built on shared hardware support code; the per-architecture clock event source ISRs remain.

6 Clock Event Source Abstraction

Just as it was necessary to provide a general abstraction for clock sources in order to move a significant amount of code into the architecture independent area, a general framework for managing clock event sources is also required in the architecture independent section of the source under the architecture described here. Clock event sources provide either periodic or individually programmable events. The management layer provides the infrastructure for registering event sources and manages the distribution of the events for the various clock related services. Again, this reduces the amount of essentially duplicate code across the architectures and allows cross-architecture sharing of hardware support code and the easy addition of non-standard clock sources.

The management layer provides interfaces for hrtimers to implement high resolution timers and also builds the base for a generic dynamic tick implementation. The management layer supports these more advanced functions only when appropriate clock event sources have been registered; otherwise the traditional periodic tick based behavior is retained.


6.1 Clock Event Source Registration

Clock event sources are registered either by the architecture dependent boot code or at module insertion time. Each clock event source fills a data structure with clock-specific property parameters and callback functions. The clock event management decides, by using the specified property parameters, the set of system functions a clock event source will be used to support. This includes the distinction between per-CPU and system-wide global event sources.

System-wide global event sources are used for the Linux periodic tick. Per-CPU event sources are used to provide local CPU functionality such as process accounting, profiling, and high resolution timers. The clock_event data structure contains the following elements (a structural sketch in C follows the list):

• name: clear text identifier

• capabilities: a bit field which describes the capabilities of the clock event source and hints about the preferred usage

• max_delta_ns: the maximum event delta (offset into the future) which can be scheduled

• min_delta_ns: the minimum event delta which can be scheduled

• mult: multiplier for scaled math conversion from nanoseconds to clock event source units

• shift: shift factor for scaled math conversion from nanoseconds to clock event source units

• set_next_event: function to schedule the next event

• set_mode: function to toggle the clock event source operating mode (periodic / one shot)

• suspend: function which has to be called before suspend

• resume: function which has to be called before resume

• event_handler: function pointer which is filled in by the clock event management code. This function is called from the event source interrupt

• start_event: function called before the event_handler function in case the clock event layer provides the interrupt handler

• end_event: function called after the event_handler function in case the clock event layer provides the interrupt handler

• irq: interrupt number in case the clock event layer requests the interrupt and provides the interrupt handler

• priv: pointer to clock source private data structures
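For orientation only, the field list above corresponds to a descriptor roughly like the following sketch; the exact types and callback signatures used in the actual patch set are not reproduced here and may well differ:

/* Hypothetical sketch of the clock_event descriptor implied by the list above */
struct clock_event {
    const char *name;                   /* clear text identifier */
    unsigned int capabilities;          /* capability / usage hint bit field */
    unsigned long max_delta_ns;         /* largest schedulable event delta */
    unsigned long min_delta_ns;         /* smallest schedulable event delta */
    unsigned long mult;                 /* ns -> clock units multiplier */
    int shift;                          /* ns -> clock units shift factor */
    void (*set_next_event)(unsigned long delta_ns);
    void (*set_mode)(int mode);         /* periodic / one shot */
    void (*suspend)(void);
    void (*resume)(void);
    void (*event_handler)(struct pt_regs *regs); /* filled in by management code */
    void (*start_event)(void);          /* hardware work before event_handler */
    void (*end_event)(void);            /* hardware work after event_handler */
    int irq;                            /* IRQ number, if managed by the layer */
    void *priv;                         /* clock source private data */
};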

The clock event source can delegate the interrupt setup completely to the management layer. It depends on the type of interrupt which is associated with the event source. This is possible for the PIT on the i386 architecture, for example, because the interrupt in question is handled by the generic interrupt code and can be initialized via setup_irq. This allows us to completely remove the timer interrupt handler from the i386 architecture-specific area and move the modest amount of hardware-specific code into appropriate source files. The hardware-specific routines are called before and after the event handling code has been executed.

In the case of the Local APIC on i386 and the Decrementer on PPC architectures, the interrupt handler must remain in the architecture-specific code, as it cannot be set up through the standard interrupt handling functions. The clock management layer provides the function which has to be called from the hardware level handler in a function pointer in the clock source description structure. Even in this case the shared code of the timer interrupt is removed from the architecture-specific implementation and the event distribution is managed by the generic clock event code. The Clock Events subsystem also has support code for clock event sources which do not provide a periodic mode, e.g. the Decrementer on PPC or match-register based event sources found in various ARM SoCs.

6.2 Clock Event Distribution

The clock event layer provides a set of predefined functions which allow the association of various clock event related services with a clock event source.

The current implementation distributes events for the following services:

• periodic tick

• process accounting

• profiling

• next event interrupt (e.g. high resolution timers, dynamic tick)

6.3 Interfaces

The clock event layer API is rather small. Aside from the clock event source registration interface, it provides functions to schedule the next event interrupt, a clock event source notification service, and support for suspend and resume.

6.4 Existing Implementations

At the time of this writing, the base framework code and the conversion of i386 to the clock event layer are available and functional.

The clock event layer has been successfully ported to ARM and PPC, but support has not been continued due to lack of human resources.

6.5 Code Size Impact

The framework adds about 700 lines of code, which results in a 2 KB increase in the kernel binary size.

The conversion of i386 removes about 100 lines of code. The binary size decrease is in the range of 400 bytes.

We believe that the increase in flexibility and the avoidance of duplicated code across architectures justify the slight increase in binary size.

The first goal of the clock event implementation was to prove the feasibility of the approach. There is certainly room for optimizing the size impact of the framework code, but this is an issue for further development.

6.6 Further Development

The following work items are planned:

• Streamlining of the code

• Revalidation of the clock distribution decisions

• Support for more architectures

• Dynamic tick support


6.7 State of Transformation

The clock event layer adds another level of abstraction to the Linux subsystems related to time keeping and time-related activities, as illustrated in Figure 4. The benefit of adding the abstraction layer is the substantial reduction in architecture-specific code, which can be seen most clearly by comparing Figures 3 and 4.

Figure 4: Linux time system + hrtimers + GTOD + clock events. Clock event sources and their ISRs are now handled by a generic clock events layer with shared hardware support code and a common event distribution component; only a small per-architecture ISR stub remains.

7 High Resolution Timers

The inclusion of the clock source and clock event source management and abstraction layers now provides the base for high resolution support in hrtimers.

While previous attempts at high resolution timer implementations needed modifications all over the kernel source tree, the hrtimers-based implementation only changes the hrtimers code itself. The required change to enable high resolution timers for an architecture which is supported by the Generic Time-of-day and clock event frameworks is the inclusion of a single line in the architecture-specific Kconfig file.

The next event modifications remove the implicit but strong binding of hrtimers to jiffy tick boundaries. When the high resolution extension is disabled, the clock event distribution code works in the original periodic mode and hrtimers are bound to jiffy tick boundaries again.

8 Implementation

While the base functionality of hrtimers remains unchanged, additional functionality had to be added:

• Management function to switch to high resolution mode late in the boot process

• Next event scheduling

• Next event interrupt handler

• Separation of the hrtimers queue from the timer wheel softirq

During system boot it is not possible to use the high resolution timer functionality, while making it possible would be difficult and would serve no useful function. The initialization of the clock event framework, the clock source framework, and hrtimers itself has to be done, and appropriate clock sources and clock event sources have to be registered, before the high resolution functionality can work. Up to the point where hrtimers are initialized, the system works in the usual low resolution periodic mode. The clock source and clock event source layers provide notification functions which inform hrtimers about the availability of new hardware. hrtimers validates the usability of the registered clock sources and clock event sources before switching to high resolution mode. This also ensures that a kernel which is configured for high resolution timers can run on a system which lacks the necessary hardware support.

The time-ordered insertion of hrtimers provides all the infrastructure needed to decide whether the event source has to be reprogrammed when a timer is added. The decision is made per timer base and synchronized across timer bases in a support function. The design allows the system to utilize separate per-CPU clock event sources for the per-CPU timer bases, but mostly only one reprogrammable clock event source per CPU is available. The high resolution timer code does not support SMP machines which have only global clock event sources.

The next event interrupt handler is called from the clock event distribution code; it moves expired timers from the red-black tree to a separate double linked list and invokes the softirq handler. An additional mode field in the hrtimer structure allows the system to execute callback functions directly from the next event interrupt handler. This is restricted to code which can safely be executed in the hard interrupt context and does not add the timer back to the red-black tree. This applies, for example, to the common case of a wakeup function as used by nanosleep. The advantage of executing the handler in the interrupt context is the avoidance of up to two context switches: from the interrupted context to the softirq and then to the task which is woken up by the expired timer. The next event interrupt handler also provides functionality which notifies the clock event distribution code that a requested periodic interval has elapsed. This allows a single clock event source to be used to schedule both high resolution timer and periodic events, e.g. the jiffies tick, profiling, and process accounting. This has been proven to work with the PIT on i386 and the Decrementer on PPC.

The softirq for running the hrtimer queues and executing the callbacks has been separated from the tick-bound timer softirq to allow accurate delivery of high resolution timer signals which are used by itimer and POSIX interval timers. The execution of this softirq can still be delayed by other softirqs, but the overall latencies have been significantly improved by this separation.

Figure 5: Linux time system + hrtimers + GTOD + clock events + high resolution timers. A next event component links the clock event distribution code to hrtimers, so high resolution timers are no longer bound to the periodic tick.

8.1 Accuracy

All tests have been run on a Pentium III 400 MHz based PC. The tables show comparisons of vanilla Linux 2.6.16, Linux-2.6.16-hrt5, and Linux-2.6.16-rt12. The tests for intervals less than the jiffy resolution have not been run on vanilla Linux 2.6.16. The test thread runs in all cases with SCHED_FIFO and priority 80.
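The benchmark source is not reproduced in the paper; a minimal sketch of such a measurement loop, using the same clock_nanosleep(TIMER_ABSTIME) pattern as the test cases below, could look like this (the SCHED_FIFO setup is omitted, and interval and loop count are the parameters being varied):

#include <stdio.h>
#include <time.h>

#define INTERVAL_NS 10000000L   /* 10000 microseconds */
#define LOOPS       10000

int main(void)
{
    struct timespec next, now;
    long long lat_ns, max_ns = 0;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (i = 0; i < LOOPS; i++) {
        /* advance the absolute wakeup time by one interval */
        next.tv_nsec += INTERVAL_NS;
        while (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec++;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

        /* latency = how far past the programmed expiry the wakeup happened */
        clock_gettime(CLOCK_MONOTONIC, &now);
        lat_ns = (now.tv_sec - next.tv_sec) * 1000000000LL
               + (now.tv_nsec - next.tv_nsec);
        if (lat_ns > max_ns)
            max_ns = lat_ns;
    }
    printf("max latency: %lld us\n", max_ns / 1000);
    return 0;
}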

Test case: clock_nanosleep(TIMER_ABSTIME), interval 10000 microseconds, 10000 loops, no load.


Kernel         min   max    avg
2.6.16          24   4043   1989
2.6.16-hrt5     12     94     20
2.6.16-rt12      6     40     10

Test case: clock_nanosleep(TIMER_ABSTIME), interval 10000 microseconds, 10000 loops, 100% load.

Kernel         min   max    avg
2.6.16          55   4280   2198
2.6.16-hrt5     11    458     55
2.6.16-rt12     16

Test case: POSIX interval timer, interval 10000 microseconds, 10000 loops, no load.

Kernel         min   max    avg
2.6.16          21   4073   2098
2.6.16-hrt5     22    120     35
2.6.16-rt12     20     60     31

Test case: POSIX interval timer, interval 10000 microseconds, 10000 loops, 100% load.

Kernel         min   max    avg
2.6.16          82   4271   2089
2.6.16-hrt5     31    458     53
2.6.16-rt12     21     70     35

Test case: clock_nanosleep(TIMER_ABSTIME), interval 500 microseconds, 100000 loops, no load.

Kernel         min   max    avg
2.6.16-hrt5      5    108     24
2.6.16-rt12      5     48      7

Test case: clock_nanosleep(TIMER_ABSTIME), interval 500 microseconds, 100000 loops, 100% load.

Kernel         min   max    avg
2.6.16-hrt5      9    684     56
2.6.16-rt12     10     60     22

Test case: POSIX interval timer, Interval 500micro seconds, 100000 loops, no load.

Kernel min max avg2.6.16-hrt5 8 119 222.6.16-rt12 12 78 16

Test case: POSIX interval timer, Interval 500micro seconds, 100000 loops, 100% load.

Kernel min max avg2.6.16-hrt5 16 489 582.6.16-rt12 12 95 29

The real-time preemption kernel results are significantly better under high load due to the general low latencies for high priority real-time tasks. Aside from the general latency optimizations, further improvements were implemented specifically to optimize the high resolution timer behavior.

Separate threads for each softirq. Long lasting softirq callback functions, e.g. in the networking code, do not delay the delivery of hrtimer softirqs.

Dynamic priority adjustment for high resolution timer softirqs. Timers store the priority of the task which inserts the timer, and the next event interrupt code raises the priority of the hrtimer softirq when a callback function for a high priority thread has to be executed. The softirq lowers its priority automatically after the execution of the callback function.

9 Dynamic Ticks

We have not yet done a dynamic tick implementation on top of the existing framework, but we considered the requirements for such an implementation in every design step.

The framework does not solve the general problem of dynamic ticks: how to find the next expiring timer in the timer wheel. In the worst case the code has to walk through a large number of hash buckets. This can not be changed without changing the basic semantics and implementation details of the timer wheel code.

The next expiring hrtimer is simply retrieved by checking the first timer in the time ordered red-black tree.

On the other hand, the framework will deliver all the necessary clock event source mechanisms to reprogram the next event interrupt and enable a clean, non-intrusive, out-of-the-box solution once an architecture has been converted to use the framework components.

The clock event functionalities necessary for dynamic tick implementations are available whether the high resolution timer functionality is enabled or not. The framework code takes care of those use cases already.

With the integration of dynamic ticks the transformation of the Linux time related subsystems will become complete, as illustrated in Figure 6.

Figure 6: Transformed Linux Time Subsystem

10 Conclusion

The existing parts and pieces of the overall solution have proved that a generic solution for high resolution timers and dynamic tick is feasible and provides a valuable benefit for the Linux kernel.

Although most of the components have been tested extensively in the high resolution timer patch and the real-time preemption patch, there is still a way to go until a final inclusion into the mainline kernel can be considered.

In general this can only be achieved by a step by step conversion of functional units and architectures. The framework code itself is almost self contained, so an architecture which has not yet been converted should not be impacted.

We believe that we have provided a clear vision of the overall solution, and we hope that more developers will get interested and help to take this further in the near future.

10.1 Acknowledgments

We sincerely thank all those people who helped us to develop this solution. Help has been provided in the form of code contributions, code reviews, testing, discussion, and suggestions. Especially we want to thank Ingo Molnar, John Stultz, George Anzinger, Roman Zippel, Andrew Morton, Steven Rostedt and Benedikt Spranger. A special thank you goes to Jonathan Corbet, who wrote some excellent articles about the hrtimers (and the previous ktimers) implementation [2, 3].

References

[1] George Anzinger and MontaVista. High resolution timers home page. http://high-res-timers.sourceforge.net.

[2] J. Corbet. LWN article: A new approach to kernel timers. http://lwn.net/Articles/152436.

[3] J. Corbet. LWN article: The high resolution timer API. http://lwn.net/Articles/167897.

[4] B. Srinivasan, S. Pather, R. Hill, F. Ansari, and D. Niehaus. A firm real-time system implementation using commercial off-the-shelf hardware and free software. In 4th Real-Time Technology and Applications Symposium, Denver, June 1998.

[5] J. Stultz. We are not getting any younger: A new approach to timekeeping and timers. In Ottawa Linux Symposium, Ottawa, Ontario, Canada, July 2005.


Making Applications Mobile Under Linux

Cédric Le Goater, Daniel Lezcano, Clément Calmels
IBM France

{clg, dlezcano, clement.calmels}@fr.ibm.com

Dave Hansen, Serge E. Hallyn
IBM Linux Technology Center

{haveblue, serue}@us.ibm.com

Hubertus Franke
IBM T.J. Watson Research Center

[email protected]

Abstract

Application mobility has been an operating system research topic for many years. Many approaches have been tried and solutions are found across the industry. However, performance remains the main issue and all the efforts are now focused on performant solutions. In this paper, we will discuss a prototype which minimizes the overhead at runtime and the amount of application state. We will examine constraints and requirements to enhance performance. Finally, we will discuss features and enhancements in the Linux kernel needed to implement migration of applications.

1 Introduction and Motivation

Applications increasingly run for longer periods of time and build more context over time as well. Recovering that context can be time consuming, depending on the application, and usually requires that the application be re-run from the beginning to reconstruct its context. A few applications now provide the ability to checkpoint their data or context to a file, enabling that application to be restarted later in the case of a failure, a system upgrade, or a need to redeploy hardware resources. This ability to checkpoint context is most common in what is referred to as the High Performance Computing (HPC) environment, which is often composed of large numbers of computers working on a distributed, long running computation. The applications often run for days or weeks at a time, some even as long as a year.

Even outside the HPC arena, there are many applications which have long start up times, long periods of processing configuration files, pre-computing information and so on. Historically, emacs was built with a script which included undump, the ability to checkpoint the full state of emacs into a binary which could then be started much more quickly. Some enterprise class applications have thirty minute start up times, and those applications continue to build complex context as they continue to run.

Increasingly we as users tend to expect that our applications will perform quickly, start quickly, re-start quickly on a failure, and be always available. However, we also expect to be able to upgrade our operating system, apply security fixes, add components, memory, sometimes even processing power, without losing all of the context that our applications have acquired.

This paper discusses a generic mechanism for saving the state of an application at any point, with the ability to later restart that application exactly where it left off. This ability to save state and restart an application is typically referred to as checkpoint/restart, abbreviated throughout as CPR. This paper focuses on the key areas for allowing applications to be virtualized, simplifying the ability to checkpoint and later restart an application. Further, the technologies covered here would allow applications to potentially be restarted on a different operating system image than the one from which they were checkpointed. This provides the ability to move an application (or even a set of applications) dynamically from one machine or virtual operating system image to another.

Once the fundamental mechanisms are in place, this technology can be used for more advanced capabilities such as checkpointing a cluster wide application: in other words, synchronizing and stopping coordinated, distributed applications, and restarting them. Cluster wide CPR would allow a site administrator to install a security update or perform scheduled maintenance on the entire cluster without impacting the running applications.

Also, CPR would enable applications to be moved from host to host depending on system load. For instance, an overloaded machine could have its workload rebalanced by moving an application set from one machine to another that is otherwise underutilized. Or, several systems which are underloaded could have their applications consolidated to a single machine. CPR plus migration will henceforth be referred to as CPRM.

Most of the capabilities we have highlighted here are best enabled via application virtualization.

Application virtualization is a means of abstracting, or virtualizing, the software resources of the system. These include such things as process ids, IPC ids, network connections, memory mappings, etc. It is also a means to contain and isolate resources required by the application to enable its mobility. Compared to the virtual machine approach, the application virtualization approach minimizes the state of the application to be transferred and also allows for a higher degree of resource sharing between applications. On the other hand, it has limited fault containment when compared to the virtual machine approach.

We built a prototype, called MCR, by modifying the Linux kernel and creating such a layer of containment. There are various other projects with similar goals, for instance VServer [8] and OpenVZ [7], and a dated Linux implementation of BSD Jails [4]. In this paper, we will describe our experiences from implementing MCR and examine the many commonalities of these projects.

2 Related Work in CPR

CPR is theoretically simple. Stop execution of the task and store the state of all memory, registers, and other resources. To restart, reload the executable image, load the state saved during the checkpoint, and restart execution at the location indicated by the instruction pointer register. In practice, complications arise due to issues like inter-process sharing, security implications, and the ways that the kernel transparently manages resources. This section groups some of the existing solutions and reviews their shortcomings.

Virtual machines control the entire system state, making CPR easy to implement. The state of all memory and resources can simply be stored into a file, and recreated by the machine emulator or the operating system itself. Indeed, the two most commonly mentioned VMs, VMware [9] and Xen [10], both enable live migration of their guest operating systems. The drawback of CPRM of an entire virtual machine is the increased overhead of dealing with all resources defining the VM. This can make the approach unsuitable for load balancing applications, since a requirement to add the overhead of a full VM and associated daemons to each migrateable application can have tremendous performance implications. This issue is further explored in Section 6.

A lighter weight CPRM approach can be achieved by isolating applications, which is predicated on the safe and proper isolation and migration of their underlying resources. In general, we look at these isolated and migrateable units as containers around the relevant processes and resources. We distinguish conceptually between system containers, such as VServer [8] or OpenVZ [7], and application containers, such as Zap [12] and our own prototype MCR. Since containers share a single OS instance, many resources provided by the OS must be specially isolated. These issues are discussed in detail in Section 5. Common to both container approaches is their requirement to be able to CPR an isolated set of individual resources.

For many applications CPR can be completely achieved from user space. An example implementation is ckpt [5]. Ckpt teaches applications to checkpoint themselves in response to a signal by either preloading a library, or injecting code after application startup. The new code, when triggered, writes out a new executable file. This executable reloads the application and resets its state before continuing execution where it left off. Since this method is implemented with no help from the kernel, there is state which cannot easily be stored, such as pending signals, or recreated, such as a process' original process id. This method is also potentially very inefficient as described in Section 5.2. The user space approaches also fall short by requiring applications to be rewritten and by exhibiting poor resource sharing.

CPR becomes much easier given some help from the kernel. Kernel-based CPR solutions include Zap [12], crak [2], and our MCR prototype. We will be analyzing MCR in detail in Section 4, followed by the requirements for a consolidated application virtualization and migration kernel approach.

3 Concepts and principles

A user application is a set of resources (tasks, files, memory, IPC objects, etc.) that are aggregated to provide some features. The general concept behind CPR is to freeze the application and save all its state (in both kernel and user spaces) so that it can be resumed later, possibly on another host. Doing this transparently without any modification to the application code is quite an easy task, as long as you maintain a single strong requirement: consistency.

3.1 Consistency

The kernel ensures consistency for each resource's internal state and also provides system identifiers to user space to manipulate these resources. These identifiers are part of the application state, and as such are critical to the application's correct behavior. For example, a process waiting for a child will use the known process id of that child. If you were to resume a checkpointed application, you would recreate the child's pid to ensure that the parent would wait on the correct process. Unfortunately, Linux does not provide such control on how pids, or many other system identifiers, are associated with resources.

An interesting approach to this problem is virtualization: a resource can be associated with a supplementary virtual system identifier for user space. The kernel can maintain associations between virtual and system identifiers and offer interfaces to control the way virtual identifiers are assigned. This makes it possible to change the underlying resource and its system identifier without changing the virtual identifier known by the application. A direct side effect is that such virtualized applications can be confused by virtual identifier collisions if they are not separated from one another. These conflicts can be avoided if the virtualization is implemented with resource containment features. For example, /proc should only export virtual pid entries for processes in the same virtualized container as the reading process.

3.2 Process subsystem

Processes are the essential resources. They offer many features and are involved in many relationships, such as parent, thread group, and process group. The pid is involved in many system calls and regular UNIX features such as session management and shell job control. Correct virtualization must address the whole picture to preserve existing semantics. For example, if we want to run multiple applications in different containers from the same login session, we will also want to keep the same system session identifier for the ancestor of the container so as to still benefit from the regular session cleanup mechanism. The consequence is that we need a system pid to be virtualized multiple times in different containers. This means that any kernel code dealing with pids that are copied to/from user space must be patched to provide containment and choose the correct virtualization space according to the implied context.

3.3 Filesystem and Devices

A filesystem may be divided into two parts: on one hand, global entries that are visible from all containers, like /usr; on the other hand, local entries that may be specific to the containers, like /tmp and of course /proc. /proc virtualization for numerical process entries is quite straightforward: simply generate a filename out of the virtual pid. Tagging the resultant dentries with a container id makes it possible to support conflicting names in the directory name cache and to filter out unrelated entries during readdir().

The same tagging mechanism can be applied using file attributes, for instance. Some user level administration commands can be used by the system administrator to keep track of files created by the containers. Device files can be tagged to be visible and usable by dedicated containers. And of course, the mknod system call should fail in containers or be restricted to minimal usage.

3.4 Network

If the application is network oriented, the migration is more complex because resources are seen from outside the container, such as IP addresses, communication ports, and all the underlying data related to the TCP protocol. Migration needs to take all these resources into account, including in-flight messages.

The IP address assigned to a source host should be recreated on the target host during the migration. This mobile IP is the foundation of the migration, but adds a constraint on the network: we will need to stay on the same network.


The migration of an application will be possible only if the communication channels are clearly isolated. The connections and the data associated with each application should be identified. To ensure such containment, we need to isolate the network interfaces.

We will need to freeze the TCP layer before the checkpoint, to make sure the state of the peers is consistent with the snapshot. To do this, we will block network traffic for both incoming and outgoing packets. The application will be migrated immediately after the checkpoint and all the network resources related to the container will be cleaned up.

4 Design Overview

We have designed and implemented MCR, a lightweight application oriented container which supports mobility. It is discussed here because it is one of a few implementations which are relatively complete. It provides an excellent view on what issues and complexities arise. However, from our prototype work, we have concluded that certain functionality implemented in user space in MCR is best supported by the kernel itself.

An important idea behind the design of MCR is that it is kernel-friendly and does not do everything in kernel space. A balance needed to be struck between striving towards a minimal kernel impact to facilitate proper forward ports and ensuring that functional correctness and acceptable performance are achieved. This means using available kernel features and mechanisms where possible and not violating important principles which ensure that user space applications work properly.

With that principle in mind, CPR from user space makes your life much easier. It also enables some nifty and useful extensions like distributed CPR and system management.

4.1 Architecture

The following section provides an overview of the MCR architecture (Figure 1). It relies on a set of user level utilities which control the container: creation, checkpoint, restart, etc. New features in the kernel and a kernel module are required today to enable the container and the CPR features.

Figure 1: MCR architecture

The CPR of the container is not handled by one component. It is distributed across the three components of MCR, depending on the locality of the resource to checkpoint. The user level utility mcr (Section 4.1.2) invokes and orchestrates the overall checkpoint of a container. It is also in charge of checkpointing the resources which are global to a container, like SYSV shared memory for instance. The user level plugin mcrp (Section 4.1.4) checkpoints the resources at the process level, such as memory, and at the thread level, such as signals and cpu state. Both rely on the kernel module mcrk (Section 4.1.3) to access kernel internals.


4.1.1 Kernel enhancements and API

Despite a user space-oriented approach, the kernel still requires modifications in order to support CPR. But, surprisingly, it may not be as much as one might expect. There are three high level needs:

The biggest challenge is to build a container for the application. The aim here is neither security nor resource containment, but making sure that the snapshot taken at checkpoint time is consistent. To that end we need a container much like the VServer [8] context. This will isolate and identify all kernel resources and objects used by an application. This kernel feature is not only a key requirement for application mobility, but also for other frameworks in the security and resource management domains. The container will also virtualize system identifiers to make sure that the resources used by the application do not overlap with other containers. This includes resources such as process IDs, thread IDs, SysV IPC IDs, UTS names, and IP addresses, among others.

The second need regards freezing a container. Today, we use the SIGSTOP signal to freeze all running tasks. This gives valid results, but for upstream kernel development we would prefer to use a container version of the refrigerator() service from swsusp. swsusp uses fake signals to freeze nearly all kernel tasks before dumping the memory.

Finally, MCR exposes the internals of different Linux subsystems and provides new services to get and set their state. These interfaces are by necessity very intrusive, and expose internal state. For an upstream migration solution, using a /proc or /sysfs interface which exposes more selective data would be more appropriate.

4.1.2 User level utilities

Applications are made mobile simply by being started under mcr-execute. The container is created before the exec() of the command starting the application. It is maintained around the application until the last process dies.

mcr-checkpoint and mcr-restart invoke and orchestrate the overall checkpoint or restart of a container. These commands also perform the get and set of the resources which are global to a container, as opposed to those local to a single process within the container. For instance, this is where the SYSV IPC and file descriptors are handled. They rely on the kernel module (Section 4.1.3) to manage the container and access kernel internals.

4.1.3 Kernel module

The kernel module is the container manager in terms of resource usage and resource virtualization. It maintains a real time definition of the container view around the application which ensures that a checkpoint will be consistent at any time.

It is in charge of the global synchronization of the container during the CPR sequence. It freezes all tasks running in the container, and maps a user level plugin into each process (see Section 4.1.4). It provides the synchronization barriers which unroll the full sequence of the checkpoint before letting each process resume its execution.

It also acts as a proxy to capture the states which cannot be captured directly from user space. Internal states handled by this module include, for example, the process' memory page mapping, socket buffers, clone flags, and AIO states.


4.1.4 User level plugin

When a checkpoint or a restart of a container is invoked, the kernel module maps a plugin into each process of the container. This plugin is run in the process context and is removed after completion of the checkpoint. It serves two purposes. The first is synchronization, which it orchestrates with the help of the kernel module. Secondly, it performs get and set of states which can be handled from user space using standard syscalls. Such states include sigactions, memory mappings, and rlimits.

When all threads of a process enter the plugin, a master thread, not necessarily the main thread, is elected to handle the checkpoint of the resources at process level. The other threads only checkpoint the resources at thread level, like cpu state.

4.2 Linux subsystems CPR

The following subsections describe the checkpoint and the restart of the essential resources without doing a deep dive into all the issues which need to be addressed. Section 5 will delve deeper into selected issues for the interested reader.

4.2.1 CPU state

Checkpointing the cpu state is indirectly done by the kernel because the checkpoint is signal oriented: it is saved by the kernel on the top of the stack before the signal handler is called. This stack is then saved with the rest of the memory. At restart, the kernel will restore the cpu state in sigreturn() when it jumps out of the signal handler.
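As an illustration of this signal-oriented scheme (not MCR's actual code), a handler installed with SA_SIGINFO can reach the register set that the kernel pushed onto the stack through its ucontext argument; SIGUSR1 stands in here for MCR's checkpoint signal.

#include <signal.h>
#include <string.h>
#include <ucontext.h>

static volatile sig_atomic_t checkpointed;

static void ckpt_handler(int sig, siginfo_t *info, void *uctx)
{
        ucontext_t *uc = uctx;

        /* uc->uc_mcontext holds the interrupted register set; saving the
         * stack and the rest of memory therefore captures the cpu state. */
        (void)uc;
        (void)info;
        checkpointed = 1;
}

int main(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = ckpt_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, NULL);

        raise(SIGUSR1);         /* the kernel saves the cpu state on the stack
                                   before calling ckpt_handler() and restores
                                   it in sigreturn() afterwards */
        return checkpointed ? 0 : 1;
}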

4.2.2 Memory

The memory mapping is checkpointed from the process context, parsing /proc/self/maps. However, some vm_area flags (e.g. MAP_GROWSDOWN) are not exposed through the /proc file system. The latter are read from the kernel using the kernel module. The same method is used to retrieve the list of the mapped pages for each vm_area and significantly reduce the size of the snapshot.

Special pages related to POSIX shared memory and POSIX semaphores are detected and skipped. They are handled by the checkpoint of resources global to a container.
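A minimal user-space sketch of the first step, walking /proc/self/maps, is shown below; the per-page information and the hidden vm_area flags mentioned above still require the kernel module.

#include <stdio.h>

int main(void)
{
        FILE *maps = fopen("/proc/self/maps", "r");
        char line[512];

        if (!maps)
                return 1;

        while (fgets(line, sizeof(line), maps)) {
                unsigned long start, end;
                char perms[5], path[256] = "";

                /* Each line: start-end perms offset dev inode [path] */
                if (sscanf(line, "%lx-%lx %4s %*s %*s %*s %255s",
                           &start, &end, perms, path) >= 3)
                        printf("%#lx-%#lx %s %s\n", start, end, perms, path);
        }
        fclose(maps);
        return 0;
}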

4.2.3 Signals

Signal handlers are registered using sigaction() and called when the process gets a signal. They are checkpointed and restarted using the same service in the process context. Signals can be sent to the process or directly to an individual thread using the tkill() syscall. In the former case, the signal goes into a shared sigpending queue, where any thread can be selected to handle it. In the latter case, the signal goes to a thread private sigpending queue. To guarantee correct signal ordering, these queues must be checkpointed and restored separately using a dedicated kernel service.
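A sketch of the user-space part, saving and re-installing the signal dispositions with sigaction(), is shown below; the pending-queue handling mentioned above needs the dedicated kernel service and is not shown. MAX_SIG is an arbitrary bound used here in place of the system's signal count.

#include <signal.h>

#define MAX_SIG 64              /* assumption: covers the usual signal range */

static struct sigaction saved[MAX_SIG + 1];

void checkpoint_sigactions(void)
{
        for (int sig = 1; sig <= MAX_SIG; sig++)
                sigaction(sig, NULL, &saved[sig]);      /* query only */
}

void restore_sigactions(void)
{
        for (int sig = 1; sig <= MAX_SIG; sig++)
                sigaction(sig, &saved[sig], NULL);      /* reinstall; fails
                                                           harmlessly for
                                                           SIGKILL/SIGSTOP */
}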

4.2.4 Process hierarchy

The relationships between processes must be preserved across CPR sequences. Group and session leaders are detected and taken into account. At checkpoint, each process and thread stores in the snapshot its execution command using /proc/self/exe and its pid and ppid using getpid() and getppid(). Threads also need to save their tid, ptid, and their stack frame. At restart time, the processes are recreated by execve() and immediately killed with the checkpoint signal. Each process then jumps into the user level plugin (see Section 4.1.4) and spawns its children. Each process also respawns its threads using clone(). The process tree is recreated recursively. On restart, attention must be paid to correctly setting the pid, pgid, tid, and tgid for each newly created process and thread.

4.2.5 Interprocess communication

The contents and attributes of SYSV IPCs, and more recently POSIX IPCs, are checkpointed as resources global to a container, except for semundos.

Most of the IPC resource checkpoint is done at the user level using standard system calls. For example, mq_receive() is used to drain all messages from a queue, and mq_send() to put them back into the queue. The two main drawbacks of such an approach are that access times to resources are altered and that the process must have read and write access to them. Some functionalities like mq_notify() are a bit trickier. In these cases, the kernel sends notification cookies using an AF_NETLINK socket which also needs to be checkpointed.
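The drain-and-refill idea for a POSIX message queue can be sketched as follows; this is not the MCR code, and error handling is omitted.

#include <mqueue.h>
#include <stdlib.h>

struct saved_msg {
        char *buf;
        ssize_t len;
        unsigned int prio;
};

/* Drain every queued message into msgs[]; returns the number of messages. */
int checkpoint_mq(mqd_t mqd, struct saved_msg *msgs, int max)
{
        struct mq_attr attr;
        int n = 0;

        mq_getattr(mqd, &attr);
        while (n < max && attr.mq_curmsgs-- > 0) {
                msgs[n].buf = malloc(attr.mq_msgsize);
                msgs[n].len = mq_receive(mqd, msgs[n].buf, attr.mq_msgsize,
                                         &msgs[n].prio);
                n++;
        }
        return n;
}

/* Put the saved messages back, preserving their priorities and order. */
void restore_mq(mqd_t mqd, struct saved_msg *msgs, int n)
{
        for (int i = 0; i < n; i++) {
                mq_send(mqd, msgs[i].buf, msgs[i].len, msgs[i].prio);
                free(msgs[i].buf);
        }
}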

4.2.6 Threads

Every thread in a process shares the same memory, but has its own register set. The threads can dump themselves in user context by asking the kernel for their properties. At restart time, the main thread can read the other threads' data back from the snapshot and respawn each with its original tid.

The thread local storage contains the thread specific information, the data set by pthread_setspecific() and the pthread_self() pointer. On some architectures it is stored in a general purpose register. In that case it is already covered by the signal handler frame. But on some other architectures, like Intel, it is stored in a separate segment, and this segment mapping must be saved using a dedicated call to the kernel.

4.2.7 Open files

File descriptors reference open files, which can be of any type, including regular files, pipes, sockets, FIFOs, and POSIX message queues.

CPR of file descriptors is not done entirely in the process context because they can be shared. Processes get their fd list by walking /proc/self/fd. They send this list, using ancillary messages, to a helper daemon running in the container during the checkpoint. Using the address of the struct file as a unique identifier, the daemon checkpoints the file descriptor only once per container, since two file descriptors pointing to the same opened file will have the same struct file.

File descriptors 0, 1, and 2 are considered special. We may not want to checkpoint or restart them if we do the restart on another login console, for example. These file descriptors are tagged and bypassed at checkpoint time.
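The transfer of a descriptor to the helper daemon uses the standard SCM_RIGHTS mechanism over an AF_UNIX socket; a minimal sketch (not MCR's actual protocol) looks like this:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_fd(int unix_sock, int fd)
{
        char dummy = 'F';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
                struct cmsghdr hdr;
                char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        /* attach the descriptor as an SCM_RIGHTS ancillary message */
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(unix_sock, &msg, 0);
}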

4.2.8 Asynchronous I/O

Asynchronous I/Os are difficult to handle by nature because they can not easily be frozen. The solution we found is to let the process reach a quiescence point where all AIOs have completed before checkpointing the memory. The other issue to cover is the ring buffer of completed events which is mapped in user space and filled by the kernel. This memory area needs to be mapped at the same address when the process is restarted. This requires a small patch to control the address used for the mapping.

5 Zooming in

This section will examine in more detail three major aspects of application mobility. The first topic covers a key requirement in process migration: the ability to restart a process keeping the same pid. Next we will discuss issues and solutions to VM migration, which has the biggest impact on performance. Finally, we will address network isolation and migration of live network communications.

5.1 Process Virtualization

A pid is a handle to a particular task or task group. Inside the kernel, a pid is dynamically assigned to a task at fork time. The relationship is recorded in the pid hash table (pid → task) and remains in place until a task exits. For system calls that return a pid (e.g. getpid()), the pid is typically extracted straight out of the task structure. For system calls that utilize a user provided pid, the task associated with that pid is determined from the pid hash table. In addition, various checks need to be performed that guarantee the isolation between users and system tasks (in particular during the sys_kill() call).

Because pids might be cached at the user level, processes should be restarted with their original pids. However, it is difficult if not impossible to ensure that the same pid will always be available upon restart of a checkpointed application, as another process could already have been started with this pid. Hence, pids need to be virtualized. Virtualization in this context can be and is interpreted in various manners. Ultimately the requirement, that an application consistently sees the same pid associated with a task (process/thread) across CPR, must be satisfied.

There are essentially three issues that need to be dealt with in any solution:

1. container init process visibility,

2. where in the kernel the virtualization interception will take place,

3. how the virtualization is maintained.

Issue 1 in particular is responsible for the non-trivial complexities of the various prototypes. It stems from the necessity to "rewrite" the pid relationships between the top process of a container (short cinit) and its parent. cinit essentially lives in both contexts, the creating container and the created container. The creating container requires a pid in its context for cinit to be able to deploy regular wait() semantics. At the same time, cinit must refer to its parent as the perceived system init process (vpid = 1).

5.1.1 Isolation

Various solutions have been proposed and implemented. Zap [12] intercepts and wraps all pid related system calls and virtualizes the pids in the interception layer, through a pid ↔ vpid lookup table associated with the caller's container, either before and/or after calling the original syscall implementation with the real pids. The benefit of this approach is that the kernel does not need to be modified. However, overwriting the syscall table is not a direction Linux embraces.

MCR, presented in greater detail in Section 4, pushes the interception further down into the various syscalls themselves, but also utilizes a pid ↔ vpid lookup function. In general, the calling task provides the context for the lookup.

The isolation between containers is implemented in the lookup function. Tasks that are created inside a container are looked up through this function. For global tasks, pid == vpid holds. In both implementations the vpid is not explicitly stored with the task, but is determined through the pid ↔ vpid lookup each and every time. On restart, tasks can be recreated through the fork(); exec() sequence and only the lookup table needs to record the different pid. The cinit parent problem mentioned earlier is solved by mapping cinit twice, in the created context as vpid=1 and in the creating container context with the assigned vpid. The lookup function is straightforward: essentially we need to ensure that we identify any cinit process and return the vpid/task associated with it relative to the provided container context.
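To make the distinction concrete, a pid ↔ vpid lookup of the kind used by Zap and MCR can be sketched as below. The structure and the hash are purely illustrative; they are not the actual data structures of either project, and only show that the virtual pid is derived from the (container, system pid) pair on every lookup rather than being stored in the task.

#include <sys/types.h>

struct vpid_entry {
        int container_id;
        pid_t pid;                      /* system-wide pid */
        pid_t vpid;                     /* pid seen inside the container */
        struct vpid_entry *next;
};

#define VPID_HASH_SIZE 256
static struct vpid_entry *vpid_hash[VPID_HASH_SIZE];

static pid_t pid_to_vpid(int container_id, pid_t pid)
{
        struct vpid_entry *e = vpid_hash[(unsigned)pid % VPID_HASH_SIZE];

        for (; e; e = e->next)
                if (e->container_id == container_id && e->pid == pid)
                        return e->vpid;
        return pid;                     /* global tasks: pid == vpid */
}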

The OpenVZ implementation [7] provides an interesting and worthwhile optimization that only requires a lookup for tasks that have been restarted. OpenVZ relies on the fact that tasks have a unique pid while they are in their original incarnation (not yet C/R'd). The lookup function, which is called at the same code locations as in the MCR implementation, hence only has to maintain the isolation property. In the case of a restarted task the uniqueness can no longer be guaranteed, so the pid must be virtualized. Common to all three approaches is the fact that virtual pids are all relative to their respective containers and that they are translated into system-wide unique pids. The guts of the pidhash have not changed.

A different approach is taken by the namespace proposal [6]. Here, the container principle is driven further down into the kernel. The pidhash is now defined as a ({pid, container} → task) function. The namespace approach naturally elevates the container to a first class kernel object. Hence minor changes were required to the pid allocation, which now maintains a pidmap for each and every namespace. The benefit of this approach is that the code modifications clearly highlight the conditions where container boundaries need to be crossed, whereas in the earlier virtualization approach these crossings came implicitly through the results of the lookup function. On the other hand, the namespace approach needs special provisioning for the cinit problem. To maintain the ability to wait() on the cinit process from cinit's parent (child_reaper), a task->wid is defined that reflects the pid in the parent's context and on which the parent needs to wait. There is no clear recommendation between the namespace and virtualization approaches that we want to give in this paper; both the OpenVZ and the namespace proposal are very promising.

5.1.2 CPR on Process Virtualization

The CPR of the process virtualization is straightforward in both cases. In the Zap, MCR, and OpenVZ cases, the lookup table is recreated upon restart and populated with the vpid and real pid translations, thus requiring the ability to select a specific vpid for a restarted process. Which real pid is chosen is irrelevant and is hence left to the pidmap management of the kernel. In the namespace approach, since the pid selection is pushed into the kernel, a function is required so that a task can be forked with a specific pid within a container's pidmap. Ultimately, both approaches are very similar to each other.


5.2 CPR on the Linux VM

At first glance, the mechanism for checkpointing a process's memory state is an easy task. The mechanism described in Section 2 can be implemented with a simple ptrace.

This approach is completely in user space, so why is it not used in the MCR prototype, nor any commercial CPR systems? Or in other words, why do we need to push certain functionalities further down into the kernel?

5.2.1 Anonymous Memory

One of the simplest kinds of memory to checkpoint is anonymous memory. It is never used outside the process in which it is allocated.

However, even this kind of memory would have serious issues with a ptrace approach.

When memory is mapped, the kernel does not fill it in at that time, but waits until it is used to populate it. Any user space program doing a checkpoint could potentially have to iterate over multiple gigabytes of sparse, entirely empty memory areas. While such an approach could consolidate such empty memory after the fact, simply iterating over it could be an incredibly significant resource drain.

The kernel has intimate knowledge of which memory areas actually contain memory, and can avoid such resource drains.

5.2.2 Shared Memory

The key to successful CPR is getting a consistent snapshot. If two interconnected processes are checkpointed at different times, they may become confused when restarted. Successful memory checkpointing requires a consistent quiescence of all tasks sharing data. This includes all shared memory areas and files.

5.2.3 Copy on Write

When a process forks, both the forker and the new child have exactly the same view of memory. The kernel gives both processes a read-only view into the same memory. Although not explicit, these memory areas are shared as long as neither process writes to the area.

The above proposed ptrace mechanism would be a very poor choice for any processes which have these copy-on-write areas. The areas have no practical bounds on their sizes, and are indistinguishable from normal, writable areas from the user's (and thus ptrace's) perspective.

Any mechanism utilizing the ptrace mechanism could potentially be forced to write out many, many copies of redundant data. This could be avoided with checksums, but that would mean reconstructing in user space information which the kernel already explicitly knows.

In addition, user space has no way of explicitly recreating these copy-on-write shared areas during a resume operation. The only mechanism is fork, which is an awfully blunt instrument by which to recreate an entire system full of processes sharing memory in this manner. The only alternative is restoring all processes and breaking any sharing that was occurring before the checkpoint. Breaking down any sharing is highly undesirable because it has the potential to greatly increase memory utilization.

5.2.4 Anonymous Shared

Anonymous shared memory is that which is shared, but has no backing in a file. In Linux there is no true anonymous shared memory.


The memory area is simply backed by a pseudo file on a ram-based filesystem. So, there is no disk backing, but there certainly is a file backing.

It can only be created by an mmap() call which uses the MAP_SHARED and MAP_ANONYMOUS flags. Such a mapping is unique to a single process and not truly shared. That is, until a fork().
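For illustration, such a mapping is created as follows and becomes shared only with children forked afterwards:

#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* backed only by a deleted pseudo-file on a ram-based filesystem */
        long *counter = mmap(NULL, sizeof(long), PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        if (counter == MAP_FAILED)
                return 1;

        if (fork() == 0) {              /* the child sees the same page */
                (*counter)++;
                _exit(0);
        }
        return 0;
}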

No running processes may attach to such memory because there is no handle by which to find or address it, neither does it have persistence. The pseudo-file is actually deleted, which creates a unique problem for the CPR system.

Since the "anonymous" file is mapped by some process, the entire addressable contents of the file can be recovered through the aforementioned ptrace mechanism. Upon resume, the "anonymous" areas can be written to a real file in the same ram-based filesystem. After all processes sharing the areas have recreated their references to the "anonymous" area, the file can be deleted, preserving the anonymous semantics. As long as the process performing the checkpoint has ptrace-like capabilities for all processes sharing the memory area, this should not be difficult to implement.

5.2.5 File-backed Shared

Shared memory backed by files is perhaps the simplest memory to checkpoint. As long as all dirty data has been written back, all that is required is that filesystem consistency be kept between a checkpoint and restart. This can be done completely from user space.

One issue is with deleted files. However, these can be treated in the same way as the "anonymous" shared memory mentioned above.

5.2.6 File-backed Private

When an application wants a copy of a file to be mapped into memory, but does not want any changes reflected back on the disk, it will map the file MAP_PRIVATE.

These areas have the same issues as anonymous memory. Just like anonymous memory, separately checkpointing a page is only necessary after a write. When simply read, these areas exactly mirror contents on the disk and do not need to be treated differently from normal file-backed shared memory.

However, once a write occurs, these areas' treatment resembles that of anonymous memory. The contents of each area must be read and preserved. As with anonymous memory, user space has no detailed knowledge of specific pages having been written. It must simply assume that the entire area has changed and must be checkpointed.

This assumption can, of course, be overridden by actually comparing the contents of memory with the contents of the disk, choosing not to explicitly write out any data which has not actually changed.

5.2.7 Using the Kernel

From the ptrace discussions above, it should be apparent that the various kinds of memory mappings in Linux can be checkpointed from user space while preserving many of their important pre-checkpoint attributes. However, it should now also be apparent that user space lacks the detailed knowledge to do these operations efficiently.

The kernel knows exactly which pages have been allocated and populated. Our MCR prototype uses this information to efficiently create memory snapshots.


It walks the pagetables of each memory area, and marks for checkpoint only those pages which actually have contents and have been touched by the process being checkpointed. For instance, it records the fact that "page 14" in a memory area has contents. This solves the issues with sparsely populated anonymous and private file-backed memory areas, because it accurately records the process's actual use of the memory.

However, it misses two key points: the simple presence of a page's mapping in the page tables does not indicate whether its contents exactly mirror those on the disk.

This is an issue for efficiently checkpointing the file-backed private areas because the page may be mapped, but it may be either a page which has been only read, or one to which a write has occurred. To properly distinguish between the two, the PageMappedToDisk() flag must be checked.

5.2.8 File-backed Remapped

Assume that a file containing two pages worth of data is mmap()ed. It is mapped from the beginning of the file through the end. One would assume that the first page of that mapping would contain the first page of data from the disk. By default, this is the behavior. But, Linux contains a feature which invalidates this assumption: remap_file_pages().

That system call allows a user to remap a memory area's contents such that the nth page of a mapping does not correspond to the nth page on the disk. The only place in which the information about the mapping is stored is in the pagetables. In addition, the presence of one of these areas is not openly available to user space.

Our user space ptrace mechanism could likely detect these situations by double-checking that each page in a file-backed memory area is truly backed by the contents on the disk, but that would be an enormous undertaking. In addition, it would not be a complete solution because two different pages in the file could contain the same data. User space would have absolutely no way to uniquely identify the position of a page in a file, simply given that page's contents.

This means that the MCR implementation is incomplete, at least in regards to any memory area to which remap_file_pages() has been applied.

5.2.9 Implementation Proposal

Any effective and efficient checkpoint mechanism must implement, at the least:

1. Detection and preservation of sharing of file-backed and other shared memory areas, for both efficiency and correctness.

2. Efficient handling of sparse files and untouched anonymous areas.

3. Lower-level visibility than simply the file and contents, for remap_file_pages() compatibility (such as effective page table contents).

There is one mechanism in the kernel today which deals with all these things: the swap code. It does not attempt to swap out areas which are file backed, or sparse areas which have not been populated. It also correctly handles the nonlinear memory areas from remap_file_pages().

We propose that the checkpointing of a process's memory could largely be done with a synthetic swap file used only by that container.


This swap file, along with the contents of the pagetables of the checkpointed processes, could completely reconstruct the contents of a process' memory. The process of checkpointing a container could become very similar to the operation which swsusp performs on an entire system.

The swap code also has a feature which makes it very attractive to CPR: the swap cache. The swap cache allows a page to be both mapped into memory and currently written out to swap space. The caveat is that, if there is a write to the page, the on-disk copy must be thrown away.

Memory which is very rarely written to, such as the file-backed private memory used in the jump table in dynamically linked libraries, has the most to gain from the swap cache. Users of this memory can run unimpeded, even during a checkpoint operation, as long as they do not perform writes.

Just as has been done with other live cross-system migration [11] systems, the process of moving the data across can be iterative. First, copy data in several passes until, despite the efforts to swap them out, the working set size of the applications ceases to decrease.

The application-level approach has the potential to be at least marginally faster than the whole-system migration because it is only concerned with application data. Xen must deal with the kernel's working set in addition to the application. This must increase the amount of data which must be migrated, and thus must increase the potential downtime during a migration.

5.3 Migrating Sockets

In Section 3.4, the needs for a network migration were roughly defined. This section focuses on the four essential networking components required for container migration: network isolation, network quiescent points, network state access for a CPR, and network resource clean-up.

5.3.1 Network isolation

The network interface isolation consists of selectively revealing network interfaces to containers. These can be either physical or aliased interfaces. Aliased interfaces are more flexible for managing the network in the containers because different IP addresses can be assigned to different containers with the same physical network interface. The net_device and in_ifaddr structures have been modified to store a list of the containers which may view the interface.

The isolation ensures that each container uses its own IP address. But any return packet must also go to the right interface. If a container connects to a peer without specifying the source address, the system is free to assign a source address owned by another container. This must be avoided. The tcp_v4_connect() and udp_sendmsg() functions are modified in order to choose a source address associated with an interface visible from the source container.

The network isolation ensures that network traffic is dispatched to the right container. Therefore it becomes quite easy to drop the traffic for any specific container.

5.3.2 Reaching a quiescent point

As the processes need to reach a quiescent point in order to stop their activities, the network must reach this same point in order to retrieve network resource states at a fixed moment.


This point is reached by blocking the network traffic for a specified container. The filtering mechanism relies on netfilter hooks. Every packet is contained in a struct sk_buff. This structure has a link to the struct sock connection, which has a record of the owner container. Using this mechanism, packets related to a container being checkpointed can be identified and dropped.

For dropping packets, iptables is not directly suitable because each drop rule returns NF_DROP, which interacts with the TCP stack. But we need the TCP stack to be frozen for our container. So a kernel module has been implemented which drops the packets but returns NF_STOLEN instead of NF_DROP.
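A sketch of such a hook is shown below, written against the netfilter API of current kernels (the original module targeted a 2.6 kernel, whose hook prototype differs). skb_owned_by_frozen_container() is a hypothetical stand-in for the container-ownership check, and a real implementation would hook the output path as well.

#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/skbuff.h>
#include <net/net_namespace.h>

static bool skb_owned_by_frozen_container(const struct sk_buff *skb)
{
        /* Placeholder: look up skb->sk and its owning container. */
        return false;
}

static unsigned int freeze_hook(void *priv, struct sk_buff *skb,
                                const struct nf_hook_state *state)
{
        if (skb_owned_by_frozen_container(skb)) {
                kfree_skb(skb);         /* swallow the packet silently ...   */
                return NF_STOLEN;       /* ... without waking the TCP stack */
        }
        return NF_ACCEPT;
}

static struct nf_hook_ops freeze_ops = {
        .hook     = freeze_hook,
        .pf       = NFPROTO_IPV4,
        .hooknum  = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_FIRST,
};

static int __init freeze_init(void)
{
        return nf_register_net_hook(&init_net, &freeze_ops);
}

static void __exit freeze_exit(void)
{
        nf_unregister_net_hook(&init_net, &freeze_ops);
}

module_init(freeze_init);
module_exit(freeze_exit);
MODULE_LICENSE("GPL");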

This of course relies on the TCP protocol's retransmission of the lost packets. However, traffic blocking has a drawback: if the time needed for the migration is too large, the connections on the peers will be broken. The same will occur if the TCP keep alive time is too small. This encourages any implementation to have a very short downtime during a migration. However, note that when both sides of a connection are checkpointed simultaneously, there are no problems with TCP timeouts. In that case the restart could occur years later.

5.3.3 Socket CPR

Now that we have successfully blocked the traffic and frozen the TCP state, the CPR can actually be performed.

Retrieving information on UDP sockets is straightforward. The protocol control block is simple and the queues can be dropped because UDP communication is not reliable. Retrieving a TCP socket is more complex. The TCP sockets are classified into two groups: SS_UNCONNECTED and SS_CONNECTED. The former have little information to retrieve because the PCB (Protocol Control Block) is not used and the send/receive queues are empty. The latter have more information to be checkpointed, such as information related to the socket, the PCB, and the send/receive queues. Minisocks and the orphan sockets also fall in the connected category.

A socket can be retrieved from /proc because the file descriptors related to the current process are listed there. getpeername(), getsockname() and getsockopt() can be used directly with the file descriptors. However, some information is not accessible from user space, particularly the list of the minisocks and the orphaned sockets, because no fd is associated with them. The PCB is also inaccessible because it is an internal kernel structure. MCR adds several accessors to the kernel internals to retrieve this missing information.

The PCB is not completely checkpointed and restored because there is a set of fields which need to be modified by the kernel itself, for example the round trip time values.

5.3.4 Socket Cleanup

When the container is migrated, the local network resources remaining in the source host should be cleaned up in order to avoid duplicate resources on the network. This is done using the container identifier. The IP addresses can not be used to find connections related to a container because, if several interconnected containers are running on the same machine, there is no way to find the connection owner.

5.3.5 CPR Dynamics

The fundamentals for the migration have been described in the previous sections. Figure 2 illustrates how they are used to move network resources from one machine to another.

Figure 2: Migrating network resources

1. Creation

A network administration component is launched; it creates an aliased interface and assigns it to a container.

2. Running

The IP address associated with the aliased interface is the only one seen by the applications inside the container.

3. Checkpoint

(a) All the traffic related to the aliased interface assigned to the container is blocked.

(b) The network resources are retrieved for each kind of socket and saved: addresses, ports, multicast groups, socket options, PCB, in-flight data and listening points.

4. Restart

(a) The traffic is blocked.

(b) The network resources are restored from the file to the system.

(c) An ARP (Address Resolution Protocol) response is sent to the network in order to establish and ensure the correct MAC ↔ IP address association.

(d) The traffic is unblocked.

5. Destruction

The traffic is blocked, the aliased interface is destroyed and the sockets related to the container are removed from the system.

6 What is the cost?

This section presents an overview of the cost of virtualization in different frameworks. We have focused on VServer, OpenVZ, and our own prototype MCR, which are all lightweight containers. We have also included Xen, when possible, as a point of reference in the field of full machine virtualization.

The first set of tests assesses the virtualization overhead on a single container for each above-mentioned solution. The second measures the scalability of each solution by measuring the impact of idle containers on one active container. The last set provides performance measures of the MCR CPR functionality with a real-world application.

6.1 Virtualization overhead

At the time of this writing, no single kernel version was supported by each of VServer, OpenVZ, and MCR. Furthermore, patches against newer kernel versions come out faster than we can collect results, and clearly by the time of publication the patches and base kernel used in testing will be outdated anyway. Hence, for each virtualization implementation we present results normalized against results obtained from a vanilla kernel of the same version.


6.1.1 Virtualization overhead inside a container

The following tests were made on a quad PIII 700MHz running Debian Sarge using the following versions:

• VServer version vs2.0.2rc9 on a 2.6.15.4 kernel with util-vserver version 0.30.210

• MCR version 2.5.1 on a 2.6.15 kernel

We used dbench to measure filesystem load, LMbench for microbenchmarks, and a kernel build test for a generic macro benchmark. Each test was executed inside a container, with only one container created.

[Figure 3 plots the overhead of MCR and VServer inside a container (y-axis 0.9 to 1.4, titled "Overhead with patched kernel and within container") for AF_UNIX socket bandwidth, LMbench (libc bcopy aligned), LMbench (mmap read open2close bandwidth), socket bandwidth over localhost, LMbench (memory load latency), kernel build time, and dbench.]

Figure 3: Various tests inside a container

The results shown in Figure 3 demonstrate that the overhead is hardly measurable. OpenVZ, being a fully virtualized server, was not taken into account.

6.1.2 Virtualization overhead within a virtual server

The next set of tests was run in a full virtual server rather than a simple container. For these tests, the nodes used were 64-bit dual Xeon 2.8GHz (4 threads/2 real processors). Nodes were equipped with a 25P3495a IBM disk (SATA disk drive) and a Tigon3 gigabit ethernet adapter. The host nodes were running RHEL AS 4 update 1 and all guest servers were running Debian Sarge. We ran tbench and a 2.6.15.6 kernel build test in three environments: on the system running the vanilla kernel, on the host system running the patched kernel, and inside a virtual server (or guest system). The kernel build test was done with a warmed-up cache.

• VServer version 2.1.0 on a 2.6.14.4 kernel with util-vserver version 0.30.209

• OpenVZ version 022stab064 on a 2.6.8 kernel with vzctl utilities version 2.7.0-26

• Xen version 3.0.1 on a 2.6.12.6 kernel

The results are shown in Figures 4 and 5.

[Figure 4 plots tbench bandwidth overhead (y-axis 0.9 to 1.4, relative to a vanilla kernel) for Xen (2.6.12.6) and VServer (2.6.14.4), with separate bars for host and guest.]

Figure 4: tbench results regarding a vanilla kernel

The OpenVZ virtual server did not survive all the tests: tbench looped forever and the kernel build test failed with a virtual memory allocation error. As expected, lightweight containers outperform a virtual machine. Considering the level of containment provided by Xen and the configuration of the domain, using a file-backed virtual block device, Xen also behaved quite well. It would surely have better results with an LVM-backed VBD.

[Figure 5 plots the system time of a kernel build (y-axis 0.5 to 3, relative to a vanilla kernel) for Xen (2.6.12.6) and VServer (2.6.14.4), with separate bars for host and guest.]

Figure 5: System time of a kernel build regarding a vanilla kernel

6.2 Resource requirement

Using the same tests, we have studied how performance is impacted when the number of containers increases. To do so, we have continuously added idle containers to the system and recorded the application performance of the reference test in the presence of an increasing number of idle containers. This gives some insight into the resource consumption of the various virtualization techniques and their impact on application performance. This set of tests compared:

• VServer version 2.1.0 on a 2.6.14.4 kernel with util-vserver version 0.30.209

• OpenVZ version 022stab064 on a 2.6.8 kernel with vzctl utilities version 2.7.0-26

• Xen version 3.0.1 on a 2.6.12.6 kernel

• MCR version 2.5.1 on a 2.6.15 kernel

[Figure 6 plots dbench bandwidth (y-axis 100 to 600) against the number of running containers (1 to 1000) for MCR, VServer chcontext, VServer virtual server, OpenVZ, Xen, and vanilla 2.6.15.]

Figure 6: dbench performance with an increasing number of containers

[Figure 7 plots tbench bandwidth (y-axis 240 to 400) against the number of running containers (1 to 1000) for the same set of kernels.]

Figure 7: tbench performance with an increasing number of containers

The results are shown in Figures 6, 7, and 8. Lightweight containers are not really impacted by the number of idle containers. Xen overhead is still very reasonable, but the number of simultaneous domains we were able to run stably was quite low. This issue is a bug in current Xen, and is expected to be solved. OpenVZ performance was poor again and it survived neither the tbench test nor the kernel build. The tests should definitely be rerun with a newer version.

[Figure 8 plots kernel build system time (y-axis 40 to 160) against the number of running containers (1 to 1000) for the same set of kernels.]

Figure 8: Kernel build time with an increasing number of containers

6.3 Migration performance

To illustrate the cost of a migration, we have set up a simple test case with Oracle (http://www.oracle.com/index.html) and Dots, a database benchmark coming from the Linux Test Project (http://ltp.sourceforge.net/). The nodes used in our test were dual Xeon 2.4GHz HT (4 cpus/2 real processors). Nodes were equipped with a ST380011A (ATA disk drive) and a Tigon3 gigabit ethernet adapter. These nodes were running RHEL AS 4 update 1 with a patched 2.6.9-11.EL kernel. We used Oracle version 9.2.0.1.0 running under MCR 2.5.1 on one node and Dots version 1.1.1 running on another node (the nodes are linked by a gigabit switch). We measured the duration of checkpoint, the duration of restart and the size of the resulting snapshot with different Dots CPU workloads: no load, 25%, 50%, 75%, and 100% CPU load. The results are shown in Figures 9 and 10.

The duration of the checkpoint is not impacted by the load but is directly correlated to the size of the snapshot (real memory size). Using a swap file dedicated to a container and incremental checkpoints in that swap file (Section 5.2) should dramatically improve the checkpoint time. It will also improve service downtime when the application is migrated from one node to another.

[Figure 9 plots checkpoint time, restart time, and checkpoint + restart time (0 to 20 seconds) against Dots CPU load (0 to 90%).]

Figure 9: Oracle CPR time under load on local disk

[Figure 10 plots checkpoint time (6 to 10 seconds) against statefile size (120 to 230 MiB).]

Figure 10: Time of checkpoint according to the snapshot size

At the time we wrote this paper, we were not able to run the same test with Xen, as its migration framework was not yet available.

7 Conclusion

We have presented in this paper a motivation for application mobility as an alternative to the heavier virtual machine approach. We discussed our prototype of application mobility using the simple CPR approach. This prototype helped us to identify issues and work on solutions that would bring useful features to the Linux kernel. These features are isolation of resources through containers, virtualization of resources, and CPR of the kernel subsystem to provide mobility. We also went through the various other alternative projects that are currently pursued within the community and exemplified the many commonalities and currents in this domain. We believe the time has come to consolidate these efforts and drive the necessary requirements into the kernel. These are the necessary steps that will lead us to live migration of applications as a native kernel feature on top of containers.

8 Acknowledgments

First of all, we would like to thank the former Meiosys team who developed the initial MCR prototype: Frédéric Barrat, Laurent Dufour, Gregory Kurz, Laurent Meyer, and François Richard. Without their work and expertise, there wouldn't be much to talk about. A very special thanks to Byoung-jip Kim from the Department of Computer Science at KAIST and Jong Hyuk Choi from the IBM T.J. Watson Research Center for their contributions to the s390 port, tests, benchmarks and this paper. And, most of all, we need to thank Gerrit Huizenga for his patience and his encouragement. Merci, DonkeySchön!

9 Download

Patches, documentation, and benchmark results will be available at http://lxc.sf.net.


Copyright © 2006 IBM.

This work represents the view of the authors and does not necessarily represent the view of IBM.

IBM and the IBM logo are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others. References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

This document is provided "AS IS," with no express or implied warranties. Use the information in this document at your own risk.


The What, The Why and the Where To of Anti-Fragmentation

Mel Gorman
IBM Corp. and Uni. of Limerick
[email protected]

Andy Whitcroft
LTC, IBM Corp.
[email protected]

Abstract

Linux® uses a variant of the binary buddy allocator that is fast but suffers badly from external fragmentation and is unreliable for large contiguous allocations. We begin by introducing two cases where large contiguous regions are needed: the allocation of HugeTLB pages during the lifetime of the system and using memory hotplug to on-line and off-line memory on demand in support of changing loads. We also mention subsystems that may benefit from using contiguous groups of pages. We then describe two anti-fragmentation strategies, discuss their strengths and weaknesses and examine their implementations within the kernel. We cover the standardised tests, the metrics used, and the system architectures tested in the evaluation of these strategies, and conclude with an examination of their effectiveness at satisfying large allocations. We also look at a page reclamation strategy that is suited to freeing contiguous regions of pages and finish with a look at the future direction of anti-fragmentation and related work.

1 Introduction

The page allocator in any operating system is a critical component. It must be fast and have the ability to satisfy all requests to avoid subsystems building reserve page pools [4]. Linux uses a variant of the binary buddy allocator that is known to be fast in comparison to other allocator types [3] but behaves poorly in the face of fragmentation [5].

Fragmentation is a space-efficiency problem affecting all dynamic memory allocators and comes in two varieties: internal and external. Internal fragmentation occurs when a larger free block than necessary is granted for a request, such as allocating one entire page to satisfy a request for 32 bytes. Linux uses a slab allocator for small requests to address this issue. External fragmentation refers to the inability to satisfy an allocation because a suitably large block of memory is not free even though enough memory may be free overall [6]. Linux deals with external fragmentation by rarely requiring larger (high-order) pages. Although this works well in general, Section 2 presents situations where it performs poorly.

To be clear, anti-fragmentation is not the same as defragmentation, which is a mechanism to reduce fragmentation by moving or reclaiming pages to have contiguous free space. Anti-fragmentation enables a system to conduct a partial defragmentation using the existing page reclamation mechanism. The remainder of this paper is arranged as described in the abstract.


2 Motivation for Low Fragmentation

HugeTLB pages are contiguous regions that match a large page size provided by an architecture, which is 1024 small pages on x86 and 4096 on PPC64. Use of these large pages reduces both expensive TLB misses [2] and the number of Page Table Entries (PTEs) required to map an area, thus increasing performance and reducing memory consumption. Linux keeps a HugeTLB freelist in the HugeTLB page pool. This pool is sized at boot time, which is a problem for workloads requiring different amounts of HugeTLB memory at different times. For example, workloads that use large in-memory data sets, such as X Windows, High-Performance Computing (HPC), many Java applications, and some desktop applications (e.g. Konqueror), require variable amounts of memory depending on the input data and type of usage. It is not possible to guess their needs at boot time. Instead it would be better to maintain low fragmentation so that their needs could be met as needed at run-time.

Contiguous regions are also required when a section of memory needs to be on-lined and then off-lined later. For example, a virtual machine running a service like a web server may require more memory due to a spike in usage, but later need to return the memory to the host. Some architectures can return memory to a hypervisor using a balloon driver, but this only works when memory can be off-lined at the page granularity. The minimum sized region that can be off-lined is the same as the size of a memory section defined for the SPARSEMEM memory model. This model mandates that the memory section size be a power-of-two number of pages and the architecture selects a size within that constraint. On PPC64, the minimum sized region of memory that can be off-lined is 16MiB, which is the minimum size OpenFirmware uses for a Logical Memory Block (LMB). On x86, the minimum sized region is 64MiB. This is the smallest DIMM size taken by the IBM xSeries® 445, which supports the memory hot-add feature. Low fragmentation increases the probability of finding regions large enough to off-line.

A third case where contiguous regions are desired, but not required, is for drivers that use DMA but do not support scatter/gather IO efficiently or do not have an IO-MMU available. These drivers must spend time breaking up the DMA request into page-sized units. Ideally, drivers could ask for a page-aligned block of memory and receive a list of large contiguous regions. With low fragmentation, the expectation is that the driver would have a better chance of getting one contiguous block and not need to break up the request.

3 External Fragmentation

The extent of fragmentation depends on the number of free blocks¹ in the system, their size and the size of the requested allocation. In this section, we define two metrics that are used to measure the ability of a system to satisfy an allocation and the degree of fragmentation.

We measure the fraction of available free memory that can be used to satisfy allocations of a specific size using an unusable free space index, F_u.

F_u(j) = \frac{\mathrm{TotalFree} - \sum_{i=j}^{n} 2^i k_i}{\mathrm{TotalFree}}

¹A free block is a single contiguous region stored on a freelist. In rare cases with the buddy allocator, two free blocks are adjacent but not merged because they are not buddies.


where TotalFree is the number of free pages, 2^n is the largest allocation that can be satisfied, j is the order of the desired allocation and k_i is the number of free page blocks of size 2^i. When TotalFree is 0, we define F_u to be 1. A more traditional, if slightly inaccurate², view of fragmentation is available by multiplying F_u(j) by 100. At 0, there is 0% fragmentation; at 1, there is 100% fragmentation; at 0.25, fragmentation is at 25% and 75% of available free memory can be used to satisfy a request for 2^j contiguous pages.

F_u(j) can be calculated at any time, but external fragmentation is not important until an allocation fails [5], when F_u(j) will be 1. We further define a fragmentation index, F_i(j), which determines if the failure to allocate a contiguous block of 2^j pages is due to lack of memory or to external fragmentation. The higher the fragmentation of the system, the more free blocks there will be. At the time of failure, the ideal number of blocks shall be related to the size of the requested allocation. Hence, the index at the time of an allocation failure is

F_i(j) = 1 - \frac{\mathrm{TotalFree}/2^j}{\mathrm{BlocksFree}}

where TotalFree is the number of free pages, j is the order of the desired allocation and BlocksFree is the number of contiguous regions stored on freelists. When BlocksFree is 0, we define F_i(j) to be 0. A negative value of F_i(j) implies that the allocation can be satisfied, and the fragmentation index is only meaningful when an allocation fails. A value tending towards 0 implies the allocation failed due to a lack of memory. A value tending towards 1 implies that the failure is due to fragmentation.

²Discussions on fragmentation are typically concerned with internal fragmentation, where the percentage represents wasted memory. A percentage value for external fragmentation is not as meaningful because it depends on the request size.

Obviously, the fewer times F_i has to be calculated, the better.
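To make the two metrics concrete, the following user-space sketch computes F_u(j) and F_i(j) from a per-order count of free blocks. The variable names mirror the definitions above; the code is illustrative only and is not taken from the kernel or the authors' patches.

/* Illustrative computation of the two indices defined above.
 * nr_free[i] is the number of free blocks of size 2^i pages,
 * for i = 0 .. max_order-1 (mirroring the buddy freelists). */
#include <stdio.h>

static double unusable_index(const unsigned long *nr_free, int max_order, int j)
{
        unsigned long total_free = 0, usable = 0;

        for (int i = 0; i < max_order; i++)
                total_free += nr_free[i] << i;
        for (int i = j; i < max_order; i++)
                usable += nr_free[i] << i;      /* pages in blocks of at least 2^j */
        if (total_free == 0)
                return 1.0;                     /* defined as 1 when nothing is free */
        return (double)(total_free - usable) / total_free;
}

static double fragmentation_index(const unsigned long *nr_free, int max_order, int j)
{
        unsigned long total_free = 0, blocks_free = 0;

        for (int i = 0; i < max_order; i++) {
                total_free += nr_free[i] << i;
                blocks_free += nr_free[i];
        }
        if (blocks_free == 0)
                return 0.0;                     /* defined as 0 when there are no free blocks */
        return 1.0 - ((double)total_free / (1UL << j)) / blocks_free;
}

int main(void)
{
        /* Hypothetical snapshot: plenty of order-0 pages, nothing larger. */
        unsigned long nr_free[11] = { 4096, 0 };

        printf("Fu(10) = %.3f, Fi(10) = %.3f\n",
               unusable_index(nr_free, 11, 10),
               fragmentation_index(nr_free, 11, 10));
        return 0;
}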

4 Allocator Placement Policies

It is common for allocators to exploit known characteristics of the request stream to improve their efficiency. For example, allocation size and the relative time of the allocation have been used to heuristically group objects of an expected lifetime together [1]. Similar heuristics cannot be used within an operating system as it does not have the same distinctive phases as application programs have. There is also little correlation between the size of an allocation and its expected use. However, operating system allocations do have unique characteristics that may be exploited to control placement, thereby reducing fragmentation.

First, certain pages can be freed on demand, saved to backing storage, or discarded. Second, a large number of kernel allocations are for caches, such as the buffer and inode caches, which may be reclaimed on demand. Since it is known in advance what the page will be used for, an anti-fragmentation strategy can group pages by allocation type. We define three types of reclaimability:

Easy to reclaim (EasyRclm) pages are allocated directly for a user process. Almost all pages mapped to a userspace page table and disk buffers, but not their management structures, are in this category.

Kernel reclaimable (KernRclm) pages are allocated for the kernel but can often be reclaimed on demand. Examples include the inode, buffer head and directory entry caches. Other examples, not applicable to Linux, include kernel data and PTEs where the system is capable of paging them to swap.


Kernel non-reclaimable (KernNoRclm) pages are essentially impossible to reclaim on demand.

To distinguish among the reclamation types, additional GFP flags are used when calling alloc_pages(). For simplicity, the strategies presented here treat KernNoRclm and KernRclm the same, so we use only one flag, GFP_EASYRCLM, to distinguish between user and kernel allocations. Variations exist that deal with all three reclamation types, but the resulting code is relatively more complex.

Allocation requests that specify the GFP_EASYRCLM flag include requests for buffer pages, process-faulted pages, high pages allocated with alloc_zeroed_user_highpage, and shared memory pages. The strategies principally differ in the semantics of the GFP flag and its treatment in the implementation.
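As a rough sketch of the calling convention (GFP_EASYRCLM is the flag introduced by these patches, so neither the flag nor this fragment is part of the mainline kernel API):

/* Sketch only: tagging an allocation as easily reclaimable under the
 * scheme described above.  GFP_EASYRCLM comes from the
 * anti-fragmentation patches; everything else is standard kernel API
 * from <linux/gfp.h>. */
static struct page *grab_user_page(void)
{
        return alloc_pages(GFP_HIGHUSER | GFP_EASYRCLM, 0);    /* order-0 user page */
}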

5 Anti-Fragmentation With Lists

The binary buddy allocator maintains max_order lists of free blocks of each power-of-two from 2^0 to 2^(max_order-1). Instead of one list at each order, this strategy uses two lists, by extending struct free_area. At each order, one list is used to satisfy EasyRclm allocations and the second list is used for all other allocations. struct per_cpu_pages is similarly extended to have one list for EasyRclm and one for kernel allocations.
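A minimal sketch of the data-structure change, assuming the two-list layout just described (the field names and layout are illustrative, not the actual patch):

/* Kernel-side sketch (requires <linux/list.h>): one free list per
 * reclaimability type at each order.  Index 0 holds kernel
 * (KernRclm + KernNoRclm) blocks, index 1 holds EasyRclm blocks. */
#include <linux/list.h>

#define RCLM_TYPES 2            /* 0: kernel, 1: EasyRclm */

struct free_area {
        struct list_head free_list[RCLM_TYPES];
        unsigned long    nr_free;
};

struct per_cpu_pages {
        int              count;                 /* pages in the lists */
        int              high;                  /* high watermark */
        int              batch;                 /* buddy add/remove chunk size */
        struct list_head list[RCLM_TYPES];      /* one per reclaim type */
};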

The difference in design between the standard and list-based anti-fragmentation allocator is illustrated in Figure 1. Where possible, allocations of a specified type use their own freelist, but they can steal pages from each other in low-memory conditions. When allocated, SetPageEasyRclm() is called for EasyRclm allocations so that they will be freed back to the correct lists. The two lists mean that a page's buddy is likely to be of the same reclaimability. The success of this strategy depends on there being a large enough number of EasyRclm pages and on there being no prolonged bursts of requests for kernel pages leading to excessive stealing.

One advantage of this strategy is that a high-order kernel allocation can push out EasyRclm pages to satisfy the allocation. The assumption is that high-order allocations during the lifetime of the system are short-lived. Performance regression tests did not show any problems despite the allocator hot paths being affected by this strategy.

A disadvantage is related to the advantage. As kernel order-0 allocations can use the EasyRclm freelists, the strategy can break down if there are prolonged periods of small allocations without frees. The likelihood is also that long-term light loads, such as desktops running for a number of days, will allow kernel pages to slowly leak to all areas of physical memory. Over time, the list-based strategy would have similar success rates to the standard allocator.

6 Anti-Fragmentation With Zones

The Linux kernel splits available memory into one or more zones, each representing memory with different usage limitations, as shown in Figure 2. On a typical x86, we have ZONE_DMA representing memory capable of use for Direct Memory Access (DMA), ZONE_NORMAL representing memory which is directly accessible by the kernel, and ZONE_HIGHMEM covering the remainder. Each zone has its own set of lists for the buddy allocator to track free memory within the zone.


[Figure 1 contrasts the standard binary buddy allocator (per-order free lists from 0 to MAX_ORDER-1, with a per-CPU cache of hot and cold order-0 pages in front of them) with the list-based anti-fragmentation allocator, in which every order has two free lists, one for EasyRclm pages and one for KernNoRclm+KernRclm pages, and the requesting process is directed to the list matching its allocation type.]

Figure 1: Comparison of the standard and list-based anti-frag allocators

This strategy introduces a new memory zone, ZONE_EASYRCLM, to contain EasyRclm pages, as illustrated in Figure 3. EasyRclm allocations that cannot be satisfied from this zone fall back to the regular zones, but non-EasyRclm allocations cannot use ZONE_EASYRCLM. This is a crucial difference between the list-based and zone-based strategies for anti-fragmentation, as list-based allows stealing in both directions.

While booting, the system memory is split into a portion required by the kernel for its operation and a portion that will be used for EasyRclm allocations. The size of the kernel portion is defined by the system administrator via the kernelcore= kernel parameter, which bounds the memory placed in the standard zones; the remaining memory constitutes ZONE_EASYRCLM. If kernelcore= is not specified, no pages are placed in ZONE_EASYRCLM.

The principal advantage of this strategy is that it provides a high likelihood of being able to reclaim appropriately sized portions of ZONE_EASYRCLM for any higher-order allocation, if the high-order allocation is also easily reclaimable. Another significant advantage is that ZONE_EASYRCLM may be used for HugeTLB page allocations, as they do not worsen the fragmentation state of the system in a meaningful way. This allows us to use ZONE_EASYRCLM as a "soft allocation" zone for the HugeTLB pool to expand into.

One disadvantage is similar to the HugeTLB pool sizing problem, because the usage of the system must be known in advance. Sizing is workload dependent and performance may suffer if an inappropriate size is specified with kernelcore=. The second major disadvantage is that the strategy does not provide any help for high-order kernel allocations.

[ZONE_DMA | ZONE_NORMAL | ZONE_HIGHMEM]

Figure 2: Standard Linux kernel zone layout


x86-based test machine:    CPU Xeon® 2.8GHz; # Physical CPUs: 2; # CPUs: 4; Main Memory: 1518MiB
Power5-based test machine: CPU Power5® PPC64 1.9GHz; # Physical CPUs: 2; # CPUs: 4; Main Memory: 4019MiB

Figure 4: Specification of Test Machines

[ZONE_DMA | ZONE_NORMAL | ZONE_HIGHMEM | ZONE_EASYRCLM]

Figure 3: Easy Reclaim zone layout

7 Experimental Methodology

The strategies were evaluated using five tests, two related to performance and three related to the system's ability to satisfy large contiguous allocations. The system is cleanly booted at the beginning of a single set of tests. Each of the five tests is run in order without intervening reboots to maximise the chances of the system suffering fragmentation. The tests are as follows.

kbuild is similar to kernbench and measures the time taken to extract and build a kernel. The test gives an overall view of the performance of a kernel, including the rate at which the kernel is able to satisfy allocations.

AIM9 is a micro-benchmark that includes tests for VM-related operations like page allocation and the time taken to call brk(). AIM9 is a good barometer for performance regressions. Crucially, it is sensitive to regressions in the page allocator paths.

HugeTLB-Capability is a kernel-compile-based benchmark. For every 250MiB of physical memory, a kernel compile is executed (in parallel, simultaneously). During the compile, one attempt is made to grow the HugeTLB page pool from 0 by echoing a large number to /proc/sys/vm/nr_hugepages. After the resize attempt, the pool is shrunk back to 0. The kernel compiles are then stopped and an attempt is made to grow the pool while the system is under no significant load. A zero-filled file that is the same size as physical memory is then created with dd, then deleted, before a third attempt is made to resize the HugeTLB pool. This test determines how capable the system is of allocating HugeTLB pages at run-time using the conventional interfaces.
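For reference, resizing the pool from a program rather than a shell echo could look like the sketch below; it simply writes a value to the proc file named above and is not part of the test scripts.

/* Sketch: grow or shrink the HugeTLB page pool by writing to
 * /proc/sys/vm/nr_hugepages, as the HugeTLB-Capability test does. */
#include <stdio.h>

int set_nr_hugepages(unsigned long nr)
{
        FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");

        if (!f)
                return -1;
        fprintf(f, "%lu\n", nr);
        return fclose(f);       /* 0 on success, EOF on error */
}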

Highalloc-Stress is a kernel-compile-based benchmark. Kernel compiles are started as in the HugeTLB-Capability test, plus updatedb is also run in the background. A kernel module is loaded to aggressively allocate as many HugeTLB-sized pages as the system has by calling alloc_pages(). These persistent attempts force kswapd to start reclaiming as well as triggering direct reclaim, which does not occur when resizing the HugeTLB pool via /proc/sys/vm/nr_hugepages. F_u(hugetlb_order) is calculated at each allocation attempt and F_i(hugetlb_order) is calculated at each failure (see Section 3). The results are graphed at the end of the test. This test indicates how many HugeTLB pages could be allocated under the best of circumstances.

HotRemove-Capability is a memory hotplug remove test. For each section of memory reported in /sys/devices/system/memory, an attempt is made to off-line the memory. Assuming the kernel supports hotplug remove, a report states how many sections and what percentage of memory were off-lined. The base kernel used for this paper was 2.6.16-rc6, which did not support hotplug remove, so no results were produced and it will not be discussed further.

All of these benchmarks were run using driver scripts from VMRegress 0.36³ in conjunction with the same system that generates the reports on http://test.kernel.org. Two machines were used to run the benchmarks, based on the x86 and Power5® architectures, as detailed in Figure 4. In both cases, the tests were run and results collected with scripts to minimise variation and prevent bias during testing. Four sets of configurations were run on each architecture:

1. List-based strategy under light load

2. List-based strategy under heavy load

3. Zone-based with no kernelcore specified, giving a ZONE_EASYRCLM with zero pages

4. Zone-based with kernelcore=1024MB on x86 and kernelcore=2048MB on PPC64

The list-based strategy is tested under light and heavy loads to determine if the strategy breaks down under pressure. We anticipated the results of the benchmarks to be similar if no breakdown was occurring. The zone-based strategy is tested with and without kernelcore to show that ZONE_EASYRCLM is behaving as expected and that the existence of the zone does not incur a performance penalty. The choice of 2048MB on PPC64 is 50% of physical memory. The choice of 1024MB on x86 is to give some memory to ZONE_EASYRCLM, but to leave some memory in ZONE_HIGHMEM for PTE use, as CONFIG_HIGHPTE was set.

³http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.37.tar.gz

8 Results

On the successful completion of a test run, a summarised report is generated, similar⁴ to the one shown in Figure 11. These reports get aggregated into the graphs shown in Figures 12 and 13. For each architecture, the graphs show how the two strategies compare against the base allocator in terms of performance and the ability to satisfy HugeTLB allocations. These graphs will be the focus of our discussion on performance in Section 8.1.

Figures 5 and 6 show the values of F_u(hugetlb_order) at each allocation attempt during the Highalloc-Stress test while the system was under no load. Note that in all cases, the starting value of F_u(hugetlb_order) is close to 1, indicating that free memory was not in large contiguous regions after the kernel compiles were stopped. The value drops over time as pages are reclaimed and buddies coalesce. Kernels using anti-fragmentation strategies had a higher rate of decline for the value of F_u(hugetlb_order), which implies that the anti-fragmentation strategies had a measure of success. These figures will be the focus of our discussion on the ability of the system to satisfy requests for contiguous regions in Section 8.2.

Finally, Figures 7 and 8 show the value of F_i(hugetlb_order) at each allocation failure during the Highalloc-Stress test while the system was under no load. These illustrate the root cause of the allocation failures and are discussed in Section 8.3.

8.1 Performance

⁴Edited to fit.

On both architectures, absolute performance was comparable. The "KBuild Comparison" graphs in Figures 12 and 13 show the timings were within seconds of each other and this was consistent among runs. The "AIM9 Comparison" graphs show that any regression was within 3% of the base kernel's performance. This is expected as that test varies by a few percent in each run and the results represent one run, not an average. This leads us to conclude that neither list-based nor zone-based has a significant performance penalty on either x86 or PPC64 architectures, at least for our sample workloads.

8.2 Free Space Usability

In general, zone-based was more predictable and reliable at providing contiguous free space. On both architectures, the zone-based anti-fragmentation kernels were able to allocate almost all of the pages in ZONE_EASYRCLM at rest after the tests. As shown in Figure 6, 0.66 was the final value of F_u(hugetlb_order) on PPC64 with half of physical memory in ZONE_EASYRCLM. We would expect it to reach 0.50 after multiple HugeTLB allocation attempts. Without specifying kernelcore, the scheme made no difference to absolute performance or fragmentation, as ZONE_EASYRCLM is empty.

The list-based strategy was potentially able to reduce fragmentation throughout physical memory. On x86, list-based anti-fragmentation kept overall fragmentation lower than zone-based, but it was only fractionally better than the standard allocator on PPC64. An examination of the x86 "High Allocation Stress Test Comparison" report in Figure 12 hints why. On x86, advantage is being taken of the existing zone-based groupings of allocation types in Normal and HighMem. Effectively, it was using a simple zone-based anti-fragmentation that did not take PTEs into account. The list-based strategy succeeds on x86 because it keeps the PTE pages in HighMem grouped together, in addition to some success in ZONE_NORMAL. Nevertheless, the strategy clearly breaks down in ZONE_NORMAL due to large amounts of kernel allocations falling back to the EasyRclm freelists in low-memory situations. The breakdown is illustrated by the different values of F_u(hugetlb_order) after the different loads, where similar values would be expected if no breakdown was occurring. Figure 5 shows that the light load performed worse than the full load due to the unpredictability of the strategy. On an earlier run, the list-based strategy under light load was able to allocate 119 HugeTLB pages from ZONE_NORMAL but only 77 after full load.

Under load, neither scheme was significantly better than the other at keeping free areas contiguous. This is because we were depending on the LRU approximation to reclaim a contiguous region. Under load, zone-based was generally better because page reclaim was able to reclaim within ZONE_EASYRCLM, but the list-based strategy did not have the same focus. With either anti-fragmentation strategy, LRU simply is not suitable for reclaiming contiguous regions, and an alternative strategy is discussed in Section 10.

8.3 Fragmentation Index at Failure

Figures 7 and 8 clearly show that allocations failed with both strategies due to fragmentation and not lack of memory. By design, the zone-based strategy does not reduce fragmentation in the kernel zones. When an allocation fails at rest, it is because ZONE_EASYRCLM is likely nearly depleted and we are looking at the high fragmentation in the kernel zones. The figures for list-based implied that, under load, fragmentation had crept into all zones, which means the strategy broke down due to excessive stealing.


[Figure 5 has two panels, "x86 list-based" and "x86 zone-based", plotting the unusable free space index (y-axis 0.4 to 1) against the allocation attempt (0 to 350); the list-based panel compares the base, list-full and list-light kernels, and the zone-based panel compares the base, zone-0MB and zone-1024MB kernels.]

Figure 5: x86 Unusable Free Space Index During Highalloc-Stress Test, System At Rest

[Figure 6 has the same two panels for PPC64 (allocation attempts 0 to 250), with zone-2048MB in place of zone-1024MB.]

Figure 6: PPC64 Unusable Free Space Index During Highalloc-Stress Test, System At Rest

[Figure 7 has two panels, "x86 list-based" and "x86 zone-based", plotting the fragmentation index (y-axis 0.97 to 1) against the allocation attempt (0 to 350) for the same sets of kernels.]

Figure 7: x86 Fragmentation Index at Allocation Failures During Highalloc-Stress Test

[Figure 8 has the same two panels for PPC64 (allocation attempts 0 to 250).]

Figure 8: PPC64 Fragmentation Index at Allocation Failures During Highalloc-Stress

9 Results Conclusions

The two strategies had different advantages and disadvantages, but both were able to increase availability of HugeTLB pages. The fact that list-based does not require configuration and works on all of memory makes it desirable, but our figures show that it breaks down in its current implementation. Once correctly configured, zone-based is more reliable, even though it does not help high-order kernel allocations.

The zone-based strategy is currently the best available solution. In the short-to-medium term, the zone-based strategy creates a soft area that can satisfy HugeTLB allocations on demand. In the long term, we intend to develop a strategy that takes the best from both approaches without incurring a performance regression.

10 Linear Reclaim

Anti-fragmentation improves our chances of finding contiguous regions of memory that may be reclaimed to satisfy a high-order allocation. However, the existing LRU-approximation algorithm for page reclamation is not suitable for finding contiguous regions.

In the worst-case scenario, the LRU list contains randomly ordered pages from across the system, so the release of pages will also be in random order. To free a contiguous region of 2^j pages within a zone containing N pages, we may need to release F_r(j) pages in that zone, where

F_r(j) = \left(\frac{N}{2^j} \times (2^j - 1)\right) + 1

The table in Figure 9 shows the relative proportion of memory we will need to reclaim before we can guarantee to free a contiguous region of sufficient size for the specified order. We can see that beyond the lowest orders we need to reclaim most pages in the system to guarantee freeing pages of the desired order. Orders 10 and 12 are interesting as they represent the HugeTLB page sizes for x86 and PPC64 respectively. The average case is not this severe, but a detailed analysis of the average case is beyond the scope of this paper.
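As a quick sanity check of the formula, the worst-case fraction F_r(j)/N approaches (2^j - 1)/2^j for large N; the short program below (illustrative only) reproduces the percentages listed in Figure 9.

/* Worst-case fraction of a zone that must be reclaimed to guarantee
 * one free region of order j, per the formula above.  N is arbitrary;
 * for large N the trailing +1 term is negligible. */
#include <stdio.h>

int main(void)
{
        const unsigned long N = 1UL << 20;      /* pages in the zone (example) */
        const int orders[] = { 1, 2, 3, 4, 5, 6, 10, 12 };

        for (unsigned int i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
                int j = orders[i];
                double fr = (double)N / (1UL << j) * ((1UL << j) - 1) + 1;

                printf("order %2d: %6.2f%%\n", j, 100.0 * fr / N);
        }
        return 0;
}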

Order   Percentage
  1       50.00
  2       75.00
  3       87.50
  4       93.75
  5       96.88
  6       98.44
 10       99.90
 12       99.98

Figure 9: Reclaim Difficulty

[Figure 10 depicts memory divided into candidate regions whose pages are marked Reclaimable, Non-Reclaimable, or Free.]

Figure 10: Linear Reclaim

We introduced an alternative reclaim algorithm called Linear Reclaim, designed to target larger contiguous regions of pages. It is used when the failing allocation is of order 3 or greater. With linear reclaim we view the entire memory space as a set of contiguous regions, each of the size we are trying to release. For each region, we check if all of the pages are likely to be reclaimable or are already free. If so, the allocated pages are removed from the LRU and an attempt is made to reclaim them. This continues until a proportion of the contiguous regions have been scanned.

In our example in Figure 10, linear reclaim will only attempt to reclaim pages in the second and fourth regions, applying reclaim to all the pages in the selected region at the same time. It is clear that in the case where reclaim succeeds we should be able to free the region by releasing just its pages, which is significantly less than that required with LRU-based reclaim.
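A small user-space model may make the scan easier to follow; the page states, the region size, and the helper below are invented for illustration and do not correspond to the kernel implementation.

/* Toy model of linear reclaim: walk memory in aligned regions of a
 * fixed size and treat a region as a reclaim candidate only if every
 * page in it is free or likely reclaimable.  All names are hypothetical. */
#include <stdio.h>

enum page_state { PG_FREE, PG_RECLAIMABLE, PG_PINNED };

static int region_is_candidate(const enum page_state *pages,
                               unsigned long start, unsigned long len)
{
        for (unsigned long i = start; i < start + len; i++)
                if (pages[i] == PG_PINNED)
                        return 0;       /* one unreclaimable page disqualifies it */
        return 1;
}

int main(void)
{
        /* Regions 2 and 4 (counting from 1) are fully reclaimable or free,
         * mirroring the Figure 10 example. */
        enum page_state mem[16] = {
                PG_RECLAIMABLE, PG_PINNED,      PG_FREE,        PG_RECLAIMABLE,
                PG_RECLAIMABLE, PG_RECLAIMABLE, PG_FREE,        PG_FREE,
                PG_PINNED,      PG_FREE,        PG_RECLAIMABLE, PG_FREE,
                PG_FREE,        PG_RECLAIMABLE, PG_RECLAIMABLE, PG_FREE,
        };
        const unsigned long region = 4;         /* order-2 regions in this toy */

        for (unsigned long start = 0; start < 16; start += region)
                printf("region %lu-%lu: %s\n", start, start + region - 1,
                       region_is_candidate(mem, start, region) ?
                       "candidate for reclaim" : "skipped");
        return 0;
}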

An early proof-of-concept implementation of linear reclaim was promising. A HugeTLB-capability test was run on the x86 machine. Under load, a clean kernel was able to allocate 6 HugeTLB pages, the zone-based anti-fragmentation allocator was able to allocate 10 HugeTLB pages, and with both zone-based anti-fragmentation and linear reclaim, it was able to allocate 41 HugeTLB pages. We do not have detailed timing information, but early indications are that linear reclaim is able to satisfy allocation requests faster but spends more time scanning than the existing page reclamation policy before a failure. In summary, linear reclaim is promising, but needs further development.

11 Future Work

We intend to develop the zone-based anti-fragmentation strategy further. The patches that exist at the time of writing include some complex architecture-specific code that calculates the size of ZONE_EASYRCLM. As the code for sizing zones and memory holes in each architecture is similar, we are developing code to calculate the size of zones and holes in an architecture-independent fashion. Our initial patches show a net reduction of code.

Once an anti-fragmentation strategy is in place, we would like to develop the linear reclaim scanner further, as LRU reclaims far too much memory to satisfy a request for a contiguous region. Our current testing strategy records how long it takes to satisfy a large allocation and we anticipate linear reclaim will show improvements in those figures.

In a perfect world, with everything in place, the plan is to work on the transparent support of HugeTLB pages in Linux. Although there are known applications that benefit from this, such as database and Java-based software, we would also like to show benefits for desktop software such as X.

We will then determine if there is a performance case for the use of higher-order allocations by the kernel. If there is, we will revisit the list-based approach and determine if a more general solution can be developed to control fragmentation throughout the system, and not just in pre-configured zones.

Acknowledgements

We would like to thank Nishanth Aravamudan for reviewing a number of drafts of this paper and for his suggestions on how to improve its quality. We would like to thank Dave Hansen for his clarifications on the content, particularly on the size of SPARSEMEM memory sections on x86. Finally, we would like to thank Paul McKenney for his in-depth commentary on an earlier version of the paper and particularly for his feedback on Section 3.

Legal Statement

This work represents the view of the authors and does not necessarily represent the view of IBM.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, and service names may be the trademarks or service marks of others.

References

[1] D. A. Barrett and B. G. Zorn. Using lifetime predictors to improve memory allocation performance. In PLDI, pages 187–196, 1993.

[2] J. B. Chen, A. Borg, and N. P. Jouppi. A simulation based study of TLB performance. In ISCA, pages 114–123, 1992.

[3] D. G. Korn and K.-P. Bo. In search of a better malloc. In Proceedings of the Summer 1985 USENIX Conference, pages 489–506, 1985.

[4] M. K. McKusick. The design and implementation of the 4.4BSD operating system. Addison-Wesley, 1996.

[5] J. L. Peterson and T. A. Norman. Buddy systems. Communications of the ACM, 20(6):421–431, 1977.

[6] B. Randell. A note on storage fragmentation and program segmentation. Commun. ACM, 12(7):365–369, 1969.


Kernel comparison report
------------------------
Architecture: x86
Huge Page Size: 4 MB
Physical memory: 1554364 KB
Number huge pages: 379

KBuild Comparison
-----------------
                               2.6.16-rc6-clean  2.6.16-rc6-zone-0MB  2.6.16-rc6-zone-1024MB
Time taken to extract kernel:         25                 24                    24
Time taken to build kernel:          393                391                   391

AIM9 Comparison
---------------
                2.6.16-rc6-clean      zone-0MB               zone-1024MB
1 creat-clo          105965.67     105866.67  -0.09%     106500.00   0.50%   File Creations and Closes/s
2 page_test          259306.67     271558.07   4.72%     258300.28  -0.39%   System Allocations & Pages/s
3 brk_test          1666572.24    1866883.33  12.02%    1880766.67  12.85%   System Memory Allocations/s
4 jmp_test         14805650.00   13949966.67  -5.78%   15088700.00   1.91%   Non-local gotos/second
5 signal_test        286252.29     280183.33  -2.12%     282950.00  -1.15%   Signal Traps/second
6 exec_test             131.79        131.98   0.14%        131.68  -0.08%   Program Loads/second
7 fork_test            3857.69       3842.69  -0.39%       3862.69   0.13%   Task Creations/second
8 link_test           21291.90      21693.58   1.89%      21499.37   0.97%   Link/Unlink Pairs/second

High Allocation Stress Test Comparison
--------------------------------------
HighAlloc Under Load Test Results Pass 1
                       2.6.16-rc6-clean  2.6.16-rc6-zone-0MB  2.6.16-rc6-zone-1024MB
Order                         10                 10                    10
Success allocs                72                 20                    82
Failed allocs                307                359                   297
DMA zone allocs                1                  1                     1
Normal zone allocs             5                  5                     6
HighMem zone allocs           66                 14                     7
EasyRclm zone allocs           0                  0                    68
% Success                     18                  5                    21

HighAlloc Under Load Test Results Pass 2
                       2.6.16-rc6-clean  2.6.16-rc6-zone-0MB  2.6.16-rc6-zone-1024MB
Order                         10                 10                    10
Success allocs                82                 70                   106
Failed allocs                297                309                   273
DMA zone allocs                1                  1                     1
Normal zone allocs             5                  5                     6
HighMem zone allocs           76                 64                     7
EasyRclm zone allocs           0                  0                    92
% Success                     21                 18                    27

HighAlloc Test Results while Rested
                       2.6.16-rc6-clean  2.6.16-rc6-zone-0MB  2.6.16-rc6-zone-1024MB
Order                         10                 10                    10
Success allocs               110                130                   181
Failed allocs                269                249                   198
DMA zone allocs                1                  1                     1
Normal zone allocs            16                 46                    44
HighMem zone allocs           93                 83                     9
EasyRclm zone allocs           0                  0                   127
% Success                     29                 34                    47

HugeTLB Page Capability Comparison
----------------------------------
                                  2.6.16-rc6-clean  2.6.16-rc6-zone-0MB  2.6.16-rc6-zone-1024MB
During compile:                           5                  5                     5
At rest before dd of large file:         51                 52                    48
At rest after dd of large file:          67                 64                    92

Figure 11: Example Kernel Comparison Report


[Figure 12 aggregates four x86 bar charts: a kbuild comparison (extract and build times in seconds for the base, zone-0MB, zone-1024MB, list-full and list-light kernels), a High Allocation Stress Test comparison (HugeTLB pages allocated from the DMA, Normal, HighMem and EasyRclm zones for each kernel in pass 1, pass 2 and at rest), an AIM9 comparison (percentage deviation of page_test, brk_test, exec_test and fork_test from the base kernel), and a HugeTLB page capability comparison (pages obtained during compile, at rest before and at rest after the dd of a large file).]

Figure 12: Anti-Fragmentation Strategy Comparison on x86

[Figure 13 aggregates the same four comparisons for PPC64, with zone-2048MB in place of zone-1024MB.]

Figure 13: Anti-Fragmentation Strategy on PPC64


GIT—A Stupid Content Tracker

Junio C. Hamano
Twin Sun, Inc.

[email protected]

Abstract

Git was hurriedly hacked together by Linus Torvalds, after the Linux kernel project lost its license to use BitKeeper as its source code management system (SCM). It has since quickly grown to become capable of managing the Linux kernel project source code. Other projects have started to replace their existing SCMs with it.

Among the interesting things that it does are:

1. giving a quick whole-tree diff,

2. quick, simple, stupid-but-safe merge,

3. facilitating an e-mail based patch exchange workflow, and

4. helping to pin-point the change that caused a particular bug by a bisection search in the development history.

The core git functionality is implemented as a set of programs to allow higher-layer systems (Porcelains) to be built on top of it. Several Porcelains have been built on top of git, to support different workflows and the individual tastes of users. The primary advantage of this architecture is ease of customization, while keeping the repositories managed by different Porcelains compatible with each other.

The paper gives an overview of how git evolved and discusses the strengths and weaknesses of its design.

1 Low level design

Git is a “stupid content tracker.” It is designedto record and compare the whole tree states ef-ficiently. Unlike traditional source code controlsystems, its data structures are not geared to-ward recording changes between revisions, butfor making it efficient to retrieve the state of in-dividual revisions.

The unit of storage in git is an object, which can be one of four types:

• blob – the contents of a file (either the contents of a regular file, or the path pointed at by a symbolic link).

• tree – the contents of a directory, by recording the mapping from names to objects (either a blob object or a tree object that represents a subdirectory).

• commit – a commit associates a tree with meta-information that describes how the tree came into existence. It records:

– The tree object that describes the project state.


– The author name and time of creation of the content.

– The committer name and time of creation of the commit.

– The parent commits of this commit.

– The commit log that describes why the project state needs to be changed to the tree contained in this commit from the trees contained in the parent commits.

• tag – a tag object names another object and associates arbitrary information with it. A person creating the tag can attest that it points at an authentic object by GPG-signing the tag.

Each object is referred to by taking the SHA1 hash (160 bits) of its internal representation, and the value of this hash is called its object name. A tree object maps a pathname to the object name of the blob (or another tree, for a subdirectory).

By naming a single tree object that represents the top-level directory of a project, the entire directory structure and the contents of any project state can be recreated. In that sense, a tree object is roughly equivalent to a tarball.

A commit object, by tying its tree object with other commit objects in the ancestry chain, gives the specific project state a point in project history. A merge commit ties two or more lines of development together by recording which commits are its parents. A tag object is used to attach a label to a specific commit object (e.g. a particular release).

These objects are enough to record the project history. A project can have more than one line of development, and these are called branches. The latest commit in each line of development is called the head of the branch, and a repository keeps track of the heads of currently active branches by recording their commit object names.

To keep track of what is being worked on in the user’s working tree, another data structure, called the index, is used. It associates pathnames with object names, and is used as the staging area for building the next tree to be committed.

When preparing an index to build the next tree, it also records the stat(2) information from the working tree files to optimize common operations. For example, when listing the set of files that are different between the index and the working tree, git does not have to inspect the contents of files whose cached stat(2) information matches the current working tree.

The core git system consists of many relatively low-level commands (often called Plumbing) and a set of higher level scripts (often called Porcelain) that use Plumbing commands. Each Plumbing command is designed to do a specific task and only that task. The Plumbing commands to move information between the recorded history and the working tree are:

• Files to index. git-update-index records the contents of the working tree files to the index; to write out the blob recorded in the index to the working tree files, git-checkout-index is used; and git-diff-files compares what is recorded in the index and the working tree files. With these, an index is built that records a set of files in the desired state to be committed next.

• Index to recorded history. To write out the contents of the index as a tree object, git-write-tree is used; git-commit-tree takes a tree object, zero or more commit objects as its parents, and the commit log message, and creates a commit object. git-diff-index compares what is recorded in a tree object and the index, to serve as a preview of what is going to be committed.

• Recorded history to index. The directory structure recorded in a tree object is read by git-read-tree into the index.

• Index to files. git-checkout-index writes the blobs recorded in the index to the working tree files.

By tying these low-level commands together, Porcelain commands give usability to the whole system for the end users. For example, git-commit provides a UI to ask for the commit log message, create a new commit object (using git-write-tree and git-commit-tree), and record the object name of that commit as the updated topmost commit in the current line of development.
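
To make the Plumbing/Porcelain split concrete, the following is a rough sketch of a hand-rolled commit built from the Plumbing commands named above (the file name and log message are illustrative):

    git-update-index --add hello.c          # record the working tree file in the index
    tree=$(git-write-tree)                   # write the index out as a tree object
    parent=$(git-rev-parse HEAD)             # current tip of the branch
    commit=$(echo 'Teach hello.c to greet' | git-commit-tree $tree -p $parent)
    git-update-ref HEAD $commit              # advance the branch head

A Porcelain command such as git-commit is essentially this sequence plus argument handling, safety checks, and editor support.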

2 Design Goals

From the beginning, git was designed specifically to support the workflow of the Linux kernel project, and its design was heavily influenced by the common types of operations in the kernel project. The statistics quoted below are for the 10-month period between the beginning of May 2005 and the end of February 2006.

• The source tree is fairly large. It has approximately 19,000 files spread across 1,100 directories, and it is growing.

• The project is very active. Approximately 20,000 changes were made during the 10-month period. At the end of the examined period, around 75% of the lines are from the version from the beginning of the period, and the rest are additions and modifications.

• The development process is highly distributed. The development history leading from v2.6.12-rc2 to v2.6.16 contains changes by more than 1,800 authors that were committed by a few dozen people.

• The workflow involves many patch exchanges through the mailing list. Among 20,000 changes, 16,000 were committed by somebody other than the original author of the change.

• Each change tends to touch only a handful of files. The source tree is highly modular and a change is often very contained to a small part of the tree. A change touches only three files on average, and modifies about 160 lines.

• Tree reorganization by addition and deletion is not uncommon, but often happens over time, not as a single rename with some modifications. 4,500 files were added or deleted, but fewer than 600 were renamed.

• The workflow involves frequent merges between subsystem trees and the mainline. About 1,500 changes are merges (7%).

• A merge tends to be straightforward. The median number of paths involved in the 1,500 merges was 185, and among them, only 10 required manual inspection of content-level merges.

Initial design guidelines came from the above project characteristics.

• A few dozen people playing the integrator role have to handle work by 2,000 contributors, and it is paramount to make it efficient for them to perform common operations, such as patch acceptance and merging.


• Although the entire project is large, individual changes tend to be localized. Supporting patch application and merging with a working tree with local modifications, as long as such local modifications do not interfere with the change being processed, makes the integrators’ job more efficient.

• The application of an e-mailed patch must be very fast. It is not uncommon to feed more than 1,000 changes at once during a sync from the -mm tree to the mainline.

• When existing contents are moved around in the project tree, renaming of an entire file (with or without modification at the same time) is not in the majority. Other content movements happen more often, such as consolidating parts of multiple files into one new file or splitting an existing file into multiple new files. Recording file renames and treating them specially does not help much.

• Although the merge plays an important role in building the history of the project, clever merge algorithms do not make much practical difference, because the majority of merges are trivial; nontrivial cases need to be examined carefully by humans anyway, and the maintainer can always respond, “This does not apply, please rework it based on the latest version and resubmit.” A faster merge is more important, as long as it does not silently merge things incorrectly.

• Frequent and repeated merges are the norm. It is important to record what has already been merged in order to avoid having to resolve the same merge conflicts over and over again.

• Two revisions close together tend to have many common directories unchanged between them. Tree comparison can take advantage of this to avoid descending into subdirectories that are represented by the same tree object while examining changes.

3 Evolution

The very initial version of git, released by Linus Torvalds on April 7, 2005, had only a handful of commands to:

• initialize the repository;

• update the index to prepare for the next tree;

• create a tree object out of the current index contents;

• create a commit object that points at its tree object and its parent commits;

• print the contents of an object, given its object name;

• read a tree object into the current index;

• show the difference between the index and the working tree.

Even with only this limited set of commands, it was capable of hosting itself. It needed scripting around it, even for “power users.”

By the end of the second week, the Plumbing level already had many of the fundamental data structures of today’s git, and the initial commit of the modern Linux kernel history hosted on git (v2.6.12-rc2) was created with this version. It had commands to:

• read more than one tree object to process a merge in the index;


• perform content-level merges by iterating over an unmerged index;

• list the commit ancestry and find the most recent common commit between two lines of development;

• show differences between two tree objects, in the raw format;

• fetch from a remote repository over rsync and merge the results.

When two lines of development meet, git uses the index to match corresponding files from the common ancestor (merge base) and the tips of the two branches. If one side changed a file while the other side didn’t, which often happens in a big project, the merge algorithm can take the updated version without looking at the contents of the file itself. The only case that needs a content-level merge is when both sides changed the same file. This tree merge optimization is one of the foundations of today’s git, and it was already present there. The first ever true git merge in the Linux kernel repository was made with this version on April 17, 2005.

By mid-May 2005, it had commands to:

• fetch objects from remote repositories over HTTP;

• create tags that point at other objects;

• show differences between the index and a tree, and between working tree files and a tree, in addition to the original two-tree comparison commands—both raw and patch format output were supported;

• show the commit ancestry along with the list of changed paths.

By the time Linus handed the project over to the current maintainer in late July 2005, the core part was more or less complete. Added during this period were:

• packed archive for efficient storage, access, and transfer;

• the git “native” transfer protocol, and the git-daemon server;

• exporting commits into the patch format for easier e-mail submission;

• application of e-mailed patches;

• rename detection by diff commands;

• more “user friendliness” layer commands, such as the git-add and git-diff wrappers.

The evolution of git up to this point primarily concentrated on supporting the people in the integrator role better. Support for individual developers who feed patches to integrators was there, but providing developers with more pleasant user experiences was left to third-party Porcelains, most notably Cogito and StGIT.

4 Features and Strengths

This section discusses a few examples of how the implementation achieves the design goals stated earlier.

4.1 Patch flows

There are two things git does to help developers with the patch-based workflow.


Generating a patch out of a git-managed history is done by using the git-diff-tree command, which knows how to look at only subtrees that are actually different in two trees for efficient patch generation.

The diff commands in git can optionally be told to detect file renames. When a file is renamed and modified at the same time, with this option, the change is expressed as a diff between the file under the old name in the original tree and the file under the new name in the updated tree. Here is an example taken from the kernel project:

5e7b83ffc67e15791d9bf8b2a18e4f5fd0eb69b8
diff --git a/arch/um/kernel/sys_call_table....
similarity index 99%
rename from arch/um/kernel/sys_call_table.c
rename to arch/um/sys-x86_64/sys_call_table.c
index b671a31..3f5efbf 100644
--- a/arch/um/kernel/sys_call_table.c
+++ b/arch/um/sys-x86_64/sys_call_table.c
@@ -16,2 +16,8 @@
 #include "kern_util.h"
+#ifdef CONFIG_NFSD
+#define NFSSERVCTL sys_nfsservctl
+#else
+#define NFSSERVCTL sys_ni_syscall
+#endif
+#define LAST_GENERIC_SYSCALL __NR_keyctl

This is done to help reviewing such a change by making it easier than expressing it as a deletion and a creation of two unrelated files.
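
Such a patch, complete with rename detection, can be generated directly from the history; a sketch using the commit shown above:

    git-diff-tree -p -M 5e7b83ffc67e15791d9bf8b2a18e4f5fd0eb69b8

The -p option asks for patch output and -M enables rename detection; the first line of the output is the commit’s object name, as in the example.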

The committer (typically the subsystem maintainer) keeps the tip of the development branch checked out, and applies e-mailed patches to it with the git-apply command. The command checks to make sure that the patch applies cleanly to the working tree, that paths affected by the patch in the working tree are unmodified, and that the index does not have modifications from the tip of the branch. These checks ensure that after applying the patch to the working tree and the index, the index is ready to be committed, even when there are unrelated changes in the working tree. This allows the subsystem maintainer to be in the middle of doing his own work and still accept patches from outside.

Because the workflow git supports should not require all participants to use git, it understands both patches generated by git and traditional diffs in unified format. It does not matter how the change was prepared, and this does not negatively affect contributors who manage their own patches using other tools, such as Andrew Morton’s patch-scripts, or quilt.
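
On the maintainer’s side, the check-then-apply step described above might look like the following sketch (the patch file name is illustrative):

    git-apply --index --check feature.patch   # verify it applies cleanly to tree and index
    git-apply --index feature.patch           # apply to the working tree and the index
    git-commit -m 'Apply feature from the mailing list'

Because --index updates both the working tree and the index, the result can be committed immediately even if unrelated local changes exist elsewhere in the tree.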

4.2 Frequent merges

A highly distributed development process involves frequent merges between different branches. Git uses its commit ancestry relation to find common ancestors of the branches being merged, and uses a three-way merge algorithm at two levels to resolve them. Because merges tend to happen often, and the subprojects are highly modular, most of the merges tend to deal with cases where only one branch modifies paths that are left intact by the other branch. This common case is resolved within the index, without even having to look at the contents of files. Only paths that need content-level merges are given to an external three-way merge program (e.g. “merge” from the RCS suite) to be processed. Similarly to the patch-application process, the merge can be done as long as the index does not have modifications from the tip of the branch and there is no change to the working tree files that are involved in the merge, to allow the integrator to have local changes in the working tree.
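
The index-level merge described here corresponds roughly to the following Plumbing sequence (a sketch, assuming the current branch is master and a branch named topic is being merged in):

    base=$(git-merge-base master topic)        # most recent common ancestor
    git-read-tree -m -u $base master topic     # trivial cases collapse in the index
    git-merge-index -o git-merge-one-file -a   # content-level merge for the rest

Only the paths left unmerged by git-read-tree are handed to git-merge-one-file, which in turn calls an external three-way merge program when both sides changed the file.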

While merging, if one branch renamed a file from the common ancestor while the other branch kept it at the same location (or renamed it to a different location), the merge algorithm notices the rename between the common ancestor and the tips of the branches, and applies a three-way merge algorithm to merge the renames (i.e. if one branch renamed the file but the other kept it the same, the file is renamed). The experience of users with this “merging renamed paths” feature is mixed. When merges are done frequently, it is more likely that the difference in the contents between the common ancestor and the tip of the branch is small enough that the automated rename detector notices it.

When two branches are merged frequently with each other, there can be more than one closest common ancestor, and depending on which ancestor is picked, the three-way merge is known to produce different results. The merge algorithms git uses notice this case and try to be safe. The faster “resolve” algorithm leaves the resolution to the end-user, while the more careful “recursive” algorithm first attempts to merge the common ancestors (recursively—hence its name) and then uses the result as the merge base of the three-way merge.

4.3 Following changes

The project history is represented as a parent-child ancestry relation of commit objects, and the Plumbing command git-rev-list is used to traverse it. Because detecting subdirectory changes between two trees is a very cheap operation in git, it can be told to ignore commits that do not touch certain parts of the directory hierarchy by giving it optional pathnames. This allows the higher-level Porcelain commands to efficiently inspect only “interesting” commits more closely.
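
For example, a history traversal limited to one area of the tree is just a matter of passing a pathname (the path here is illustrative):

    git-rev-list HEAD -- drivers/net/     # only commits that touch drivers/net/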

The traversal of the commit ancestry graph is also done while finding a regression, and is used by the git-bisect command. This traversal can also be told to omit commits that do not touch a particular area of the project directory; this speeds up the bug-hunting process when the source of the regression is known to be in a particular area.

4.4 Interoperating with other SCM

A working tree checked out from a foreign SCM system can be made into a git repository. This allows an individual participant of a project whose primary SCM system is not git to manage his own changes with git. Typically this is done by using two git branches per upstream branch, one to track the foreign SCM’s progress, another to hold his own changes based on that. When changes are ready, they are fed back by the project’s preferred means, be it committing into the foreign SCM system or sending out a series of patches via e-mail, without affecting the workflow of the other participants of the project. This allows not just distributed development but a distributed choice of SCM. In addition, there are commands to import commits from other SCM systems (as of this writing, the supported systems are GNU arch, CVS, and Subversion), which helps to make this process smoother.

There also is an emulator that makes a git repository appear as if it is a CVS repository to remote CVS clients that come over the pserver protocol. This allows people more familiar with CVS to keep using it while others work in the same project that is hosted on git.

4.5 Interoperating among git users

Due to the clear separation of the Plumbing and Porcelain layers, it is easy to implement higher-level commands to support different workflows on top of the core git Plumbing commands. The Porcelain layer that comes with the core git requires the user to be fairly familiar with how the tools work internally, especially how the index is used. In order to make effective use of the tool, the users need to be aware that there are three levels of entities: the histories recorded in commits, the index, and the working tree files.

An alternative Porcelain, Cogito, takes a different approach by hiding the existence of the index from the users, to give them a more traditional two-level world model: recorded histories and the working tree files. This may fall down at times, especially for people playing the integrator role during merges, but gives a more familiar feel to new users who are used to other SCM systems.

Another popular tool based on git, StGIT, is designed to help a workflow that depends more heavily on the exchange of patches. While the primary way to integrate changes from different tracks of development is to make merges in the workflow git and Cogito primarily target, StGIT supports a workflow that builds up piles of patches to be fed upstream, and re-syncs with the upstream, when some or all of the patches are accepted, by rebuilding the patch queue.

While different Porcelains can be used by different people with different work habits, the development histories recorded by different Porcelains are eventually made by the common Plumbing commands in the same underlying format, and are therefore compatible with each other. This allows people with different workflows and choices of tools to cooperate on the same project.

5 Weaknesses and Future Work

There are various missing features and unfinished parts in the current system that require further improvements. Note that the system is still evolving at a rapid pace and some of the issues listed here may have already been addressed when this paper is published.

5.1 Partial history

The system operates on commit ancestry chains to perform many of the interesting things it does, and most of the time it only needs to look at the commits near the tip of the branches. An obvious example is looking at recent development histories. Merging branches needs access to the commits on the ancestry chain down to the latest common ancestor commit, and no earlier history is required. One thing that is often desired but not currently supported is to make a “shallow” clone of a repository that records only the recent history, and later deepen it by retrieving older commits.

Synchronizing two git repositories is done by comparing the heads of branches on both repositories, finding common ancestors, and copying the commits and their associated tree and blob objects that are missing from one end to the other. This operation relies on an invariant that all history behind commits that are recorded as the heads of branches is already in the repository. Making a shallow clone that has only commits near the tip of the branch violates this invariant, and a later attempt to download older history would become a no-operation. To support “shallow” cloning, this invariant needs to be conditionally lifted during the “history deepening” operation.

5.2 Subprojects

Multiple projects overlaid in a single directory are not supported. Different repositories can be stored along with their associated working trees in separate subdirectories, but currently there is no support for tying the versions from the different subprojects together.

There have been discussions on this topic and two alternative approaches were proposed, but there was not enough interest to cause either approach to materialize in the form of concrete code yet.

5.3 Implications of not recording renames

In an early stage of the development, we decided not to record rename information in trees or commits. This was both practical and philosophical.

Files are not renamed that often, and it was observed that moving file contents around without moving the file itself happened just as often. Not having to record renames specially, but always recording the state of the whole tree in each revision, was easier to implement from a practical point of view.

When examining the project history, the question “where did this function come from, and how did it get into the current form?” is far more interesting than “where did this file come from?” and when the former question is answered properly (i.e. “it started in this shape in file X, but later assumed that shape and migrated to file Y”), the latter becomes a narrow special case, and we did not want to only support the special case. Instead, we wanted to solve the former problem in a way general enough to make the latter a non-issue. However, deliberately not recording renames often contradicts people’s expectations.

Currently we have a merge strategy that looks at the common ancestor and the two branch heads being merged to detect file renames and tries to merge the contents accordingly. If the modifications made to the file in question across renames are too big, the rename detection logic will not notice that they are related. It is possible for the merge algorithm to inspect all commits along the ancestry chain to make the rename detection more precise, but this would make merges more expensive.

6 Conclusion

Git started as a necessity to have a minimally usable system, and during its brief development history, it has quickly become capable of hosting one of the most important free software projects, the Linux kernel. It is now used by projects other than the kernel (to name a few: Cairo, Gnumeric, Wine, xmms2, the X.org X server). Its simple model and tool-based approach allow it to be enhanced to support different workflows by scripting around it and still be compatible with other people who use it. The development community is active and is growing (about 1,500 postings are made to the mailing list every month).


Reducing fsck time for ext2 file systems

Val Henson, Intel, Inc.

[email protected]

Zach Brown, Oracle, Inc.

[email protected]

Theodore Ts’o, IBM, Inc.

[email protected]

Arjan van de Ven, Intel, Inc.

[email protected]

Abstract

Ext2 is fast, simple, robust, and fun to hack on. However, it has fallen out of favor for one major reason: if an ext2 file system is not cleanly unmounted, such as in the event of a kernel crash or power loss, it must be repaired using fsck, which takes minutes or hours to complete, during which time the file system is unavailable. In this paper, we describe some techniques for reducing the average fsck time on ext2 file systems. First, we avoid running fsck in some cases by adding a filesystem-wide dirty bit indicating whether the file system was being actively modified at the time it crashed. The performance of ext2 with this change is close to that of plain ext2, and quite a bit faster than ext3. Second, we propose a technique called linked writes which uses dependent writes and a list of dirty inodes to allow recovery of an active file system by only repairing the dirty inodes and avoiding a full file system check.

1 Introduction

The Second Extended File System, ext2, was implemented in 1993 by Remy Card, Theodore Ts’o, and Stephen Tweedie, and for many years was the file system of choice for Linux systems. Ext2 is similar in on-disk structure to the Berkeley FFS file system [14], with the notable exception of sub-block size fragments [2]. In recent years, ext2 has been overtaken in popularity by the ext3 [6, 16] and reiser3 [4] file systems, both journaling file systems. While these file systems are not as fast as ext2 in some cases [1], and are certainly not as simple, their recovery after a crash is very fast as they do not have to run fsck.

Like the original Berkeley FFS, ext2 file system consistency is maintained on a post hoc basis, by repair after the fact using the file system checker, fsck [12]. Fsck works by traversing the entire file system and building up a consistent picture of the file system metadata, which it then writes to disk. This kind of post hoc data repair has two major drawbacks. One, it tends to be fragile. A new set of test and repair functions had to be written for every common kind of corruption. Often, fsck had to fall back to manual mode—that is, asking the human to make decisions about repairing the file system for it. As ext2 continued to be used and new tests and repairs were added to the fsck code base, this occurred less and less often, and now most users can reasonably expect fsck to complete unattended after a system crash.

The second major drawback to fsck is total running time. Since fsck must traverse the entire file system to build a complete picture of allocation bitmaps, number of links to inodes, and other potentially incorrect metadata, it takes anywhere from minutes to hours to complete. File system repair using fsck takes time proportional to the size of the file system, rather than the size of the ongoing update to the file system, as is the case for journaling file systems like ext3 and reiserfs. The cost of the system unavailability while fsck is running is so great that ext2 is generally only used in niche cases, when high ongoing performance is worth the cost of occasional system unavailability and a possibly greater chance of data loss.

On the other hand, ext2 is fast, simple, easy to repair, uses little CPU, performs well with multi-threaded reads and writes, and benefits from over a decade of debugging and fine tuning. Our goal is to find a way to keep these attributes while reducing the average time it takes to recover from crashes—that is, reducing the average time spent running fsck. Our target use case is a server with many users, infrequent writes, lots of read-only file system data, and tolerance for a possibly greater chance of data loss.

Our first approach to reducing fsck time is to implement a filesystem-wide dirty bit. While writes are in progress, the bit is set. After the file system has been idle for some period of time (one second in our implementation), we force out all outstanding writes to disk and mark the file system as clean. If we crash while the file system is marked clean, fsck knows that it does not have to do a full fsck. Instead, it does some minor housekeeping and marks the file system as valid. Orphan inodes and block preallocation added some interesting twists to this solution, but overall it remains a simple change. While this approach does not improve worst-case fsck time, it does improve average fsck time. For comparison purposes, recall that ext3 as usually installed runs a full fsck on the file system every 30 mounts.

Our second approach, which we did not implement, is an attempt to limit the data fsck needs to examine to repair the file system to a set of dirty inodes and their associated metadata. If we add inodes to an on-disk dirty inode list before altering them and correctly order metadata writes to the file system, we will be able to correct allocation bitmaps, directory entries, and inode link counts without rebuilding the entire file system, as fsck does now.

Some consistency issues are difficult to solve without unsightly and possibly slow hacks, such as keeping the number of links consistent for a file with multiple hard links during an unlink() operation. However, they occur relatively rarely, so we are considering combining this approach with the filesystem-wide dirty bit. When a particular operation is too ugly to implement using the dirty inode list, we simply mark the file system as dirty for the duration of the operation. It may be profitable to merely narrow the window during which a crash will require a full fsck rather than to close the window fully. Whether this can be done and still preserve the properties of simplicity of implementation and high performance is an open question.

2 Why ext2?

Linux has a lot of file systems, many of which have better solutions for maintaining file system consistency than ext2. Why are we working on improving crash recovery in ext2 when so many other solutions exist? The answer is a combination of useful properties of ext2 and drawbacks of existing file systems.


First, the advantages of ext2 are simplicity, robustness, and high performance. The entire ext2 code base is about 8,000 lines of code; most programmers can understand and begin altering the codebase within days or weeks. For comparison, most other file systems come in anywhere from 20,000 (ext3 + jbd) to 80,000 (XFS) lines of code. Ext2 has been in active use since 1993, and benefits from over a decade of weeding out bugs and repairing obscure and seldom seen failure cases. Ext2 performance is quite good overall, especially considering its simplicity, and definitely superior to ext3 in most cases.

The main focus of file systems development in Linux today is ext3. On-disk, ext3 is almost identical to ext2; both file systems can be mounted as either ext2 or ext3 in most cases. Ext3 is a journalled file system; updates to the file system are first written as compact entries in the on-disk journal region before they are written to their final locations. If a crash occurs during an update, the journal is replayed on the next mount, completing any unfinished updates.

Our primary concern with ext3 is lower performance from writing and sharing the journal. Work is being done to improve performance, especially in the area of multi-threaded writes [9], but it is hard to compete in performance against a file system which has few or no restrictions in terms of sharing resources or write ordering. Our secondary concern is complexity of code. Journaling adds a whole layer of code to open transactions, reserve log space, and bail out when an error occurs. Overall, we feel that ext3 is a good file system for laptops, but not very good for write-intensive loads.

The reiser3 [4] file system is the default file system for the SuSE distribution. It is also a journaling file system, and is especially good for file systems with many small files because it packs files together, saving space. The performance of reiser3 is good and in some cases better than ext2. However, reiser3 was developed outside the mainstream Linux community and never attracted a community developer base. Because of this and the complexity of the implementation, it is not a good base for file system development. Reiser4 [4] has less developer buy-in, more code, worse performance in many cases [1], and may not be merged into the mainline Linux tree at all [3].

XFS is another journaling file system. It has many desirable properties, and is ideal for applications requiring thousands or millions of files in one directory, but it also suffers from complexity and lack of a developer community. Performance of more common case operations (such as file create) suffers for the benefit of fast lookups in directories with many entries [1].

Other techniques for maintaining file system consistency are soft updates [13] and copy-on-write [11]. Without a team of full-time programmers and several years to work on the problem, we did not feel we could implement either of these techniques. In any case, we did not feel we could maintain the simplicity or the benefits of more than a decade of testing of ext2 if we used these techniques.

Given these limitations, we decided to look for “90% solutions” to the file system consistency problem, starting with the ext2 code base. This paper describes one technique we implemented, the filesystem-wide dirty bit, and one we are considering implementing, linked writes.

3 The fsck program

Cutting down crash recovery time for an ext2 file system depends on understanding how the file system checker program, fsck, works. After Linux has finished booting the kernel, the root file system is mounted read-only and the kernel executes the init program. As part of normal system initialization, fsck is run on the root file system before it is remounted read-write, and on other file systems before they are mounted. Repair of the file system is necessary before it can be safely written.

When fsck runs, it checks to see if the ext2 file system was cleanly unmounted by reading the state field in the file system superblock. If the state is set as VALID, the file system is already consistent and does not need recovery; fsck exits without further ado. If the state is INVALID, fsck does a full check of the file system integrity, repairing any inconsistencies it finds. In order to check the correctness of allocation bitmaps, file nlinks, directory entries, etc., fsck reads every inode in the system, every indirect block referenced by an inode, and every directory entry. Using this information, it builds up a new set of inode and block allocation bitmaps, calculates the correct number of links of every inode, and removes directory entries to unreferenced inodes. It does many other things as well, such as sanity-checking inode fields, but these three activities fundamentally require reading every inode in the file system. Otherwise, there is no way to find out whether, for example, a particular block is referenced by a file but is marked as unallocated on the block allocation bitmap. In summary, there are no back pointers from a data block to the indirect block that points to it, or from a file to the directories that point to it, so the only way to reconstruct reference counts is to start at the top level and build a complete picture of the file system metadata.

Unsurprisingly, it takes fsck quite some time to rebuild the entirety of the file system metadata, approximately O(total file system size + data stored). The average laptop takes several minutes to fsck an ext2 file system; large file servers can sometimes take hours or, on occasion, days! Straightforward tactical performance optimizations such as requesting reads of needed blocks in sequential order and read-ahead requests can only improve the situation so much, given that the whole operation will still take time proportional to the entire file system. What we want is file system recovery time that is O(writes in progress), as is the case for journal replay in journaling file systems.

One way to reduce fsck time is to eliminate the need to do a full fsck at all if a crash occurs when the file system is not being changed. This is the approach we took with the filesystem-wide dirty bit.

Another way to reduce fsck time is to reduce the amount of metadata we have to check in order to repair the file system. We propose a method of ordering updates to the file system in such a way that full consistency can be recovered by scanning a list of dirty inodes.

4 Implementation of filesystem-wide dirty bit

Implementing the fs-wide dirty bit seemed at first glance to be relatively simple. Intuitively, if no writes are going on in the file system, we should be able to sync the file system (make sure all outstanding writes are on disk), reset the machine, and cleanly mount the unchanged file system. Our intuition is wrong on two major points: orphan inodes, and block preallocation. Orphan inodes are files which have been unlinked from the file system, but are still held open by a process. On crash and recovery, the inode and its blocks need to be freed. Block preallocation speeds up block allocation by preallocating a few more blocks than were actually requested. Unfortunately, as implemented, preallocation alters on-disk data, which needs to be corrected if the file is not cleanly closed. First we’ll describe the overall implementation, then our handling of orphan inodes and preallocated blocks.

4.1 Overview of dirty bit implementation

Our first working patch implementing the fs-wide dirty bit included the following high-level changes:

• Per-mount kernel thread to mark file system clean

• New ext2_mark_*_dirty() functions

• Port of ext3 orphan inode list

• Port of ext3 reservation code

The ports of the ext3 orphan inode list and reservation code were not frivolous; without them, the file system would be in an inconsistent state even when no writes were occurring.

4.2 Per-mount kernel thread

The basic outline of how the file system is marked dirty or clean by the per-mount kernel thread is as follows:

• Mark the file system dirty whenever metadata is altered.

• Periodically check the state of the file system.

• If the file system is clean, sync the file system.

• If no new writes occurred during the sync, mark the file system clean.

The file system is marked clean or dirty by updating a field in the superblock and submitting the I/O as a barrier write so that no writes can pass it and hit the disk before the dirty bit is updated. The update of the dirty bit is done asynchronously, so as to not stall during the first write to a clean file system (since it is a barrier write, waiting on it will not change the order of writes to disk anyway).
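
As a rough illustration of the outline above, the per-mount thread might look like the following sketch (this is not the actual patch; the s_fs_dirty and s_write_count fields and the helper functions are assumed names):

    #include <linux/fs.h>
    #include <linux/kthread.h>
    #include <linux/sched.h>
    #include "ext2.h"                       /* struct ext2_sb_info, EXT2_SB() */

    static int ext2_cleaner_thread(void *data)
    {
            struct super_block *sb = data;
            struct ext2_sb_info *sbi = EXT2_SB(sb);

            while (!kthread_should_stop()) {
                    unsigned long seen;

                    /* wake up roughly once per second */
                    schedule_timeout_interruptible(HZ);

                    if (!sbi->s_fs_dirty)           /* assumed in-memory flag */
                            continue;               /* already clean on disk */

                    seen = sbi->s_write_count;      /* assumed write counter */
                    ext2_sync_fs_writes(sb);        /* assumed: flush outstanding writes */

                    /* if nothing new arrived while syncing, mark the
                       superblock clean with an asynchronous barrier write */
                    if (sbi->s_write_count == seen)
                            ext2_mark_fs_clean(sb); /* assumed helper */
            }
            return 0;
    }

Such a thread would be started at mount time with kthread_run() and stopped with kthread_stop() at unmount.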

In order to implement asynchronous update of the dirty bit in the superblock, we needed to create an in-memory copy of the superblock. Updates to the superblock are written to the in-memory copy; when the superblock is ready to be written to disk, the superblock is locked, the in-memory superblock is copied to the buffer for the I/O operation, and the I/O is submitted. The code implementing the superblock copy is limited to the file ext2/super.c and one line in ext2/xattr.c.

One item on our to-do list is integration with the laptop mode code, which tries to minimize the number of disk spin-up and spin-down events by concentrating disk write activity into batches. Marking the file system clean should probably be triggered by the timeout for flushing dirty data in laptop mode.

4.3 Marking the file system dirty

Before any metadata changes are scheduled to be written to disk, the file system must first be marked dirty. Ext2 already uses the functions mark_inode_dirty(), mark_buffer_dirty(), and mark_buffer_dirty_inode() to mark changed metadata for write-out by the VFS and I/O subsystems. We created ext2-specific versions of these functions which first mark the file system dirty and then call the original function.
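
A wrapper of this form is straightforward; a sketch (ext2_set_fs_dirty() stands in for whatever the patch actually names its helper):

    static inline void ext2_mark_inode_dirty(struct inode *inode)
    {
            /* note the pending metadata change in the superblock first ... */
            ext2_set_fs_dirty(inode->i_sb);     /* assumed helper */

            /* ... then hand the inode to the VFS for write-out as before */
            mark_inode_dirty(inode);
    }

The ext2_mark_buffer_dirty() and ext2_mark_buffer_dirty_inode() variants would follow the same pattern.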

4.4 Orphan inodes

The semantics of UNIX file systems allow an application to create a file, open it, unlink the file (removing any reference to it from the file system), and keep the file open indefinitely. While the file is open, the file system can not delete the file. In effect, this creates a temporary file which is guaranteed to be deleted, even if the system crashes. If the system does crash while the file is still open, the file system contains an orphan inode—an inode which is marked as in use, but is not referenced by any directory entry. This behavior is very convenient for application developers and a real pain in the neck for file system developers, who wish they would all use files in tmpfs instead.

In order to clean up orphan inodes after a crash, we ported the ext3 orphan inode list to ext2. The orphan inode list is an on-disk singly linked list of inodes, beginning in the orphan inode field of the superblock. The i_dtime field of the inode, normally used to store the time an inode was deleted, is (ab)used as the inode number of the next item in the orphan inode list. When fsck is run on the file system, it traverses the linked list of orphan inodes and frees them. Fortunately for us, the code in fsck that does this runs regardless of whether the file system is mounted as ext2 or ext3.
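
Linking an inode into that list is simple; a sketch of the idea (locking and error handling omitted, and this is not the ported code itself):

    static void ext2_orphan_add(struct inode *inode)
    {
            struct ext2_sb_info *sbi = EXT2_SB(inode->i_sb);
            struct ext2_super_block *es = sbi->s_es;

            /* i_dtime is reused as the "next orphan" pointer */
            EXT2_I(inode)->i_dtime = le32_to_cpu(es->s_last_orphan);
            es->s_last_orphan = cpu_to_le32(inode->i_ino);

            mark_inode_dirty(inode);
            mark_buffer_dirty(sbi->s_sbh);      /* superblock now heads the list */
    }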

Our initial implementation followed the ext3 practice of writing out orphan inodes immediately in order to keep the orphan inode list as up-to-date as possible on disk. This is expensive, and an up-to-date orphan inode list is superfluous except when the file system is marked clean. We modified the orphan inode code to only maintain the orphan inode list in memory, and write it out to disk on file system sync. We will need to add a patch to keep fsck from complaining about a corrupted orphan inode list.

4.5 Preallocated blocks

The existing code in ext2 for preallocating blocks unfortunately alters on-disk metadata, such as the block group free and allocated block counts. One solution was to simply turn off preallocation. Fortunately, Mingming Cao implemented new block preallocation code for ext3 which reserves blocks without touching on-disk data, and is superior to the ext2 preallocation code in several other ways. We chose to port Mingming Cao’s reservation code to ext2, which in theory should improve block allocation anyway. Ext2 and ext3 were similar enough that we could complete the port quickly, although porting some parts of the new get_blocks() functionality was tricky.

4.6 Development under User-mode Linux

We want to note that the implementation of the filesystem-wide dirty bit was tested almost entirely on User-mode Linux [10], a port of Linux that runs as a process in Linux. UML is well suited to file system development, especially when the developer is limited to a single laptop for both development host and target platform (as is often the case on an airplane). With UML, we could quickly compile, boot, crash, and reboot our UML instance, all without worrying about corrupting any important file systems. When we did corrupt the UML file system, all that was necessary was to copy a clean file system image back over the file containing the UML file system image. The loopback device made it easy to mount, fsck, or otherwise examine the UML file system using tools on the host machine. Only one bug required running on a non-UML system to discover, which was the lack of support for suspend in the dirty bit kernel thread.

However, getting UML up and running and working for file system development was somewhat non-intuitive and occasionally baffling. Details about running UML on recent 2.6 kernels, including links to a sample root file system and a working .config file, can be found here:

http://www.nmt.edu/~val/uml_tips.html

5 Performance

We benchmarked the filesystem-wide dirty bit implementation to find out if it significantly impacted performance. On the face of it, we expected a small penalty on the first write, due to issuing an asynchronous write barrier the first time the file system is written.

The benchmarks we ran were kuntar, postmark, and tiobench [7]. Kuntar simply measures the time to extract a cached uncompressed kernel tarball and sync the file system. Postmark creates and deletes many small files in a directory and is a metadata intensive workload. We ran it with numbers = 10000 and transactions = 10000. We also added a sync() system call to postmark before the final timing measurement was made, in order to measure the true performance of writing data all the way to the disk. Tiobench is a benchmark designed to measure multi-threaded I/O to a single file; we ran it mainly as a sanity check since we didn’t expect anything to change in this workload. We ran tiobench with 16 threads and a 256MB file size.

The file systems we benchmarked were ext2, ext2 with the reservations-only patch, ext2 with the reservation patch but mounted with reservations turned off, ext2 with the fs-wide dirty bit patch and reservations, ext3 with defaults, and ext3 with data=writeback mode. All file systems used 4KB blocks and were mounted with the noatime option. The kernel was 2.6.16-mm1. The machine had two 1533 MHz AMD Athlon processors and 1GB of memory. We recorded elapsed time, sectors read, sectors written, and kernel ticks. The results are in Table 1.

The results are somewhat baffling, but overall positive for the dirty bit implementation. The times for the fs-wide dirty bit are within 10% of those of plain ext2 for all benchmarks except postmark. For postmark, writes increased greatly for the fs-wide dirty bit; we are not sure why yet. The results for the reservations-only versions of ext2 are even more puzzling; we suspect that our port of reservations is buggy or suboptimal. We will continue researching the performance issues.

We would like to briefly discuss the noatime option. All file systems were mounted with the noatime option, which turns off updates to the “last accessed time” field in the inode. We turned access time updates off not only because they would prevent the fs-wide dirty bit from being effective when a file system is under read activity, but also because this is a common technique for improving performance. noatime is widely regarded as the correct behavior for most file systems, and in some cases is shipped as the default behavior by distributions. While a correct access time is sometimes useful or even critical, such as in tracing which files an intruder read, in most cases it is unnecessary and only adds unnecessary I/O to the system.

6 Linked writes

                   ext2     ext2r    ext2rnor  ext2fw    ext3     ext3wb
kuntar     secs    20.32    21.03    19.06     18.87     20.99    32.02
           read    5152     5176     5176      5176      168      168
           write   523272   523272   523288    523304    523256   544160
           ticks   237      269      357       277       413      402
krmtar     secs    9.79     10.92    9.99      10.90     55.64    9.74
           read    20874    20842    20874     20874     20866    20874
           write   5208     5176     5208      5960      36296    10560
           ticks   61       61       62        61        7943     130
postmark   secs    33.98    49.34    42.93     50.46     43.48    41.82
           read    2568     2568     2568      2568      56       48
           write   168312   168392   168392    240720    260704   173936
           ticks   641      650      838       674       1364     1481
tiobench   secs    37.48    35.22    33.68     33.57     35.16    36.69
           read    32       32       32        32        24       112
           write   64       64       64        72        136      136
           ticks   441      450      456       463       452      463

kuntar: expanding a cached uncompressed kernel tarball and syncing
krmtar: rm -rf on a cold untarred kernel tree, then sync
postmark: postmark + sync() patch, numbers = 10000, transactions = 10000
tiobench: 16 threads, 256MB file size
ext2: ext2
ext2r: ext2, reservations
ext2rnor: ext2, reservations, -o noreservation option
ext2fw: ext2, reservations, fs-wide dirty bit
ext3: ext3, 256MB journal
ext3wb: ext3, 256MB journal, data=writeback

Table 1: Benchmark results

Our second idea for reducing fsck time is to order writes to the file system such that the file system can be repaired to a consistent state after processing a short list of dirty inodes. Before an operation begins, the relevant inodes are added to an on-disk list of dirty inodes. During the operation, we only overwrite references to data (such as indirect blocks or directory entries) after we have finished all updates that require that information (such as updating allocation bitmaps or link counts). If we crash half-way through an operation, we examine each inode on the dirty inode list and repair any inconsistencies in the metadata it points to. For example, if we were to crash half-way through allocating a block, we would check whether each block was marked as allocated in the block allocation bitmap. If it was not, we would free that block from the file (and all blocks that it points to). We call this scheme linked writes—a write erasing a pointer is linked to, or dependent on, the write of another block completing first.

Some cases are ambiguous as to what operation was in progress, such as truncating and extending a file. In these cases, we will take the safest action. For example, in an ambiguous truncate/extend, we would assume a truncate operation was in progress, because if we were wrong, the new block would contain uninitialized data, resulting in a security hole. It might be possible to indicate which operation was in progress using other metadata, such as inode size, but if that is not possible or would harm performance, we have this option as a fail safe. The difference between restoring one or the other of two ambiguous operations is the difference between restoring the file as of a short time before the crash versus restoring it as of after the completion of the operation in progress at the time of crash. Either option is allowed; only calling sync() defines what state the file is in on-disk at any particular moment.

Some operations may not be recoverable only by ordering writes. Consider removing one hard link to a file with multiple hard links from different directories. The only inodes on the dirty inode list are the inode for the directory we are removing the link from, and the file inode—not the inodes for the other directories with hard links to this file. Say we decrement the link count for the inode, and then crash. In the one-link case, when we recover, we will find an inode with a link count equal to 0, and a directory with an entry pointing to this inode. Recovery is simple; free the inode and delete the directory entry. But if we have multiple hard links to the file, and the inode has a link count of one or more, we have no way of telling whether the link count was already decremented before we crashed or not. A solution to this is to overwrite the directory entry with an invalid directory entry containing a magic record with the inode’s correct link count, which is only replayed if the inode has not already been updated. This regrettably adds yet another linked write to the process of deleting an entry. On the other hand, adding or removing links to files with link counts greater than one is painful but blessedly uncommon. Typically only directories have a link count greater than one, and in modern Linux, directory hard links are not allowed, so a directory’s link count can be recalculated simply by scanning the directory itself.

Another problem is circular dependencies between blocks that need to be written out. Say we need to write some part of block A to disk before we write some part of block B. We update the buffers in memory and mark them to be written out in order A, B. But then something else happens, and now we need to write some part of block B to disk before some part of block A. We update the buffers in memory—but now we can’t write either block A or block B. Linked writes do not run into this problem because (a) every block contains only one kind of metadata, and (b) the order in which different kinds of metadata must be written is the same for every operation. This is equivalent to the lock ordering solution to the deadlock problem; if you define the order for acquiring locks and adhere to it, you can’t get into a deadlock.

Ordinarily, writing metadata in the same order according to type for all operations would not be possible. Consider the case of creating a file versus deleting it. In the create case, we must write the directory entry pointing to the inode before updating the bitmap in order to avoid leaking an inode. In the delete case, we must write the bitmap before we delete the entry to avoid leaking an inode. What gets us out of this circular dependency is the dirty inode list. If we instead put the inode to be deleted on the dirty inode list, then we can delete the directory entry before the bitmap, since if we crash, the inode’s presence on the dirty inode list will allow us to update the bitmap correctly. This allows us to define the dependency order “write bitmaps before directory entries.” The order of metadata operations for each operation must be carefully defined and adhered to.

When writing a buffer to disk, we need to be sure it does not change in flight. We have two options for accomplishing this: either lock the buffer and stall any operations that need to write to it while it is in flight, or clone the buffer and send the copy to disk. The first option is what soft updates uses [13]; surprisingly, performance is quite good, so it may be an option. The second option requires more memory but would seem to have better performance.

Another issue is the reuse of freed blocks or inodes before the referring inode is removed from the dirty inode list. If we free a block, then reuse it before the inode referring to it is removed from the dirty list, it could be erroneously marked as free again at recovery time. To track this, we need a temporary copy of each affected bitmap showing which items should not be allocated, in addition to the items marked allocated in the main copy of the bitmap. Overall, we occasionally need three copies of each active bitmap in memory. The required memory usage is comparable to that of journaling, copy-on-write, or soft updates.

6.1 Implementing write dependencies

Simply issuing write barriers when we write the first half of a linked write would be terribly inefficient, as the only method of implementing this operation that is universally supported by disks is: (1) issue a cache flush command; (2) issue the write barrier I/O; (3) wait for the I/O to complete; (4) issue a second cache flush command. (Even this implementation may be an illusion; reports of IDE disks which do not correctly implement the cache flush command abound.) This creates a huge bubble in the I/O pipeline. Instead, we want to block only the dependent write. This can be implemented using asynchronous writes which kick off the linked write from the I/O completion handler.
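
A minimal sketch of this approach is shown below. It is illustrative only and not taken from any existing implementation: the linked_write structure and helper names are hypothetical, although submit_bh() and end_buffer_write_sync() are existing buffer-layer primitives that such code could build on.

#include <linux/buffer_head.h>
#include <linux/slab.h>

/* Hypothetical sketch: write the first block, and only when its I/O
 * completes successfully, submit the write of the dependent block. */
struct linked_write {
	struct buffer_head *first;	/* e.g. the bitmap block */
	struct buffer_head *second;	/* e.g. the directory block */
};

static void linked_write_end_io(struct buffer_head *bh, int uptodate)
{
	struct linked_write *lw = bh->b_private;

	end_buffer_write_sync(bh, uptodate);
	if (uptodate)
		submit_bh(WRITE, lw->second);	/* kick off the dependent write */
	kfree(lw);
}

static void submit_linked_write(struct linked_write *lw)
{
	lw->first->b_private = lw;
	lw->first->b_end_io = linked_write_end_io;
	submit_bh(WRITE, lw->first);
}

Only the dependent buffer is held back; unrelated I/O continues to flow, so no pipeline-wide bubble is created.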

6.2 Comparison of linked writes

Linked writes bear a strong resemblance to soft updates [13]. Indeed, linked writes can be thought of as soft updates from the opposite direction. Soft updates takes the approach of erring on the side of marking things allocated when they are actually free, and then recovering leaked inodes and blocks after mount by running a background fsck on the file system. Linked writes err on the side of marking things unallocated when they are still referenced by the file system, and repair inconsistencies by reviewing a list of dirty inodes. Soft updates handles circular buffer dependencies (where block A must be written out before block B and vice versa) by rolling back the dependent data before writing the block out to disk. Linked writes handle circular dependencies by making them impossible.

Linked writes can also be viewed as a form of journaling in which the journal entries are scattered across the disk in the form of inodes and directory entries, and linked together by the dirty inode list. The advantages of linked writes over journaling are that changes are written once, no journal space has to be allocated, writes aren't throttled by journal size, and there are no seeks to a separate journal region.

6.3 Reinventing the wheel?

Why bother implementing a whole new method of file system consistency when we have so many available to us already? Simply put, frustration with code complexity and performance. The authors have had direct experience with the implementation of ZFS [8], ext3 [6], and ocfs2 [5] and were disappointed with the complexity of the implementation. Merely counting lines of code for reiser3 [4], reiser4 [4], or XFS [15] incites dismay. We have not yet encountered anyone other than the authors of the original soft updates [13] implementation who claims to understand it well enough to re-implement it from scratch. Yet ext2, one of the smallest, simplest file systems out there, continues to be the target for performance on general-purpose workloads.

In a sense, ext2 is cheating, because it does not attempt to keep the on-disk data structures intact. In another sense, ext2 shows us what our rock-bottom performance expectations for new file systems should be, as relatively little effort has been put into optimizing ext2.

With linked writes, we hope for a file system a little more complex, a lot more consistent, and with nearly the same performance as ext2.

6.4 Feasibility of linked writes implementation

We estimate that implementing linked writes would take on the order of half the effort necessary to implement ext3. Adjusting for programmer capability and experience (translation: I'm no Kirk McKusick or Greg Ganger), we estimate that implementing linked writes would take one fifth the staff-years required by soft updates.

We acknowledge that the design of linked writes is half-finished at best and may turn out to have fatal flaws; nor do we expect our design to survive implementation without major changes—"There's many a slip 'twixt cup and lip."

7 Failed ideas

Linked writes grew out of our original idea to implement per-block-group dirty bits. We wanted to restrict how much of the file system had to be reviewed by fsck after a crash, and dividing it up by block groups seemed to make sense. In retrospect, we realized that the only checks we could do in this case would start with the inodes in this block group and check file system consistency based on the information they point to. On the other hand, given a block allocation bitmap, we can't check whether a particular block is correctly marked unless we rebuild the entire file system by reading all of the inodes. In the end, we realized that per-block-group dirty bits would basically be a very coarse hash of which inodes need to be checked. It may make sense to implement some kind of bitmap showing which inodes need to be checked rather than a linked list; otherwise this idea is dead in the water.

Another idea for handling orphan inodes was to implement a set of "in-memory-only" bitmaps that record inodes and blocks which are allocated only for the lifetime of this mount—in other words, orphan inodes and their data. However, these bitmaps would in the worst case require two blocks per cylinder group of unreclaimable memory. A workaround would be to allocate space on disk to write them out under memory pressure, but we abandoned this idea quickly.

8 Availability

The most recent patches are available from:

http://www.nmt.edu/~val/patches.html

9 Future work

The filesystem-wide dirty bit seems worthwhile to polish for inclusion in the mainline kernel, perhaps as a mount option. We will continue to work on improving performance and testing correctness.

Implementing linked writes will take a significant amount of programmer sweat and may not be considered, shall we say, business-critical to our respective employers. We welcome discussion, criticism, and code from interested third parties.

10 Acknowledgments

Many thanks to all those who reviewed and commented on the initial patches. Our work was greatly reduced by being able to port the orphan inode list from ext3, written by Stephen Tweedie, as well as the ext3 reservation patches by Mingming Cao.

11 Conclusion

The filesystem-wide dirty bit feature allows ext2 file systems to skip a full fsck when the file system was not being actively modified at the time of a crash. The performance of our initial, untuned implementation is reasonable and will be improved. Our proposal for linked writes outlines a strategy for maintaining file system consistency with less overhead than journaling and a simpler implementation than copy-on-write or soft updates.

We take this opportunity to remind file system developers that ext2 is an attractive target for innovation. We hope that developers rediscover the possibilities inherent in this simple, fast, extendable file system.

References

[1] Benchmarking file systems part II. LG #122, http://linuxgazette.net/122/piszcz.html.

[2] Design and implementation of the second extended filesystem. http://e2fsprogs.sourceforge.net/ext2intro.html.

[3] Linux: Reiser4 and the mainline kernel. http://kerneltrap.org/node/5679.

[4] Namesys. http://www.namesys.com/.

[5] OCFS2. http://oss.oracle.com/projects/ocfs2/.

[6] Red Hat's new journaling file system: ext3. http://www.redhat.com/support/wpapers/redhat/ext3/.

[7] Threaded I/O tester. http://sourceforge.net/projects/tiobench.

[8] ZFS at OpenSolaris.org. http://www.opensolaris.org/os/community/zfs/.

[9] Mingming Cao, Theodore Y. Ts'o, Badari Pulavarty, Suparna Bhattacharya, Andreas Dilger, and Alex Tomas. State of the art: Where we are with the ext3 filesystem. In Ottawa Linux Symposium 2005, July 2005.

[10] Jeff Dike. User-mode Linux. In Ottawa Linux Symposium 2001, July 2001.

[11] Dave Hitz, James Lau, and Michael A. Malcolm. File system design for an NFS file server appliance. In USENIX Winter, pages 235–246, 1994.

[12] T. J. Kowalski and Marshall K. McKusick. Fsck - the UNIX file system check program. Technical report, Bell Laboratories, March 1978.

[13] Marshall K. McKusick and Gregory R. Ganger. Soft updates: A technique for eliminating most synchronous writes in the fast filesystem. In USENIX Annual Technical Conference, FREENIX Track, pages 1–17. USENIX, 1999.

[14] Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A fast file system for UNIX. ACM Trans. Comput. Syst., 2(3):181–197, 1984.

[15] Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Michael Nishimoto, and Geoff Peck. Scalability in the XFS file system. In Proceedings of the 1996 USENIX Technical Conference, 1996.

[16] Stephen Tweedie. Journaling the Linux ext2fs filesystem. In LinuxExpo '98, 1998.


Native POSIX Threads Library (NPTL) Support for uClibc

Steven J. Hill
Reality Diluted, Inc.

[email protected]

Abstract

Linux continues to gain market share in embedded systems. As embedded processing power increases and more demanding applications in need of multi-threading capabilities are developed, Native POSIX Threads Library (NPTL) support becomes crucial. The GNU C library [1] has had NPTL support for a number of years on multiple processor architectures. However, the GNU C library is more suited to workstation and server platforms than to embedded systems due to its size. uClibc [2] is a POSIX-compliant C library designed for size and speed, but currently lacking NPTL support. This paper will present the design and implementation of NPTL support in uClibc. In addition to the design overview, benchmarks, limitations, and comparisons between glibc and uClibc will be discussed. NPTL for uClibc is currently only supported for the MIPS processor architecture.

1 The Contenders

Every usable Linux system has applications built atop a C library run-time environment. The C library is at the core of user space and provides all the necessary functions and system calls for applications to execute. Linux is fortunate in that there are a number of C libraries available for varying platforms and environments. Whether for an embedded system, high-performance computing, or a home PC, there is a C library to fit each need.

The GNU C library, known also as glibc [1], and uClibc [2] are the most common Linux C libraries in use today. There are other C libraries like Newlib [3], diet libc [4], and klibc [5] used in embedded systems and small root file systems. We list them only for completeness, yet they are not considered in this paper. Our focus will be solely on uClibc and glibc.

2 Comparing C Libraries

To understand the need for NPTL in uClibc, we first examine the goals of both the uClibc and glibc projects. We will quickly examine the strengths and weaknesses of both C implementations. It will then become evident why NPTL is needed in uClibc.

2.1 GNU C Library Project Goals

To quote from the main GNU C Library web page [1], "The GNU C library is primarily designed to be a portable and high performance C library. It follows all relevant standards (ISO C 99, POSIX.1c, POSIX.1j, POSIX.1d, Unix98, Single Unix Specification). It is also internationalized and has one of the most complete internationalization interfaces known." In short, glibc aims to be the most complete C library implementation available. It succeeds, but at the cost of size and complexity.

2.2 uClibc Project Goals

Let us see what uClibc has to offer. Again, quoting from the main page for uClibc [2], "uClibc (a.k.a. µClibc, pronounced yew-see-lib-see) is a C library for developing embedded Linux systems. It is much smaller than the GNU C Library, but nearly all applications supported by glibc also work perfectly with uClibc. Porting applications from glibc to uClibc typically involves just recompiling the source code. uClibc even supports shared libraries and threading. It currently runs on standard Linux and MMU-less (also known as µClinux) systems. . . " Sounds great for embedded systems development. Obviously, uClibc is going to be missing some features, since its goal is to be small in size. However, uClibc has its own strengths as well.

2.3 Comparing Features

Table 1 shows the important differentiating features between glibc and uClibc.

It should be obvious from the table that glibc certainly has better POSIX compliance, backwards binary compatibility, and networking services support. uClibc shines in that it is much smaller (how much smaller will be covered later), more configurable, supports more processor architectures, and is easier to build and maintain.

The last two features in the table warrant additional explanation. glibc recently removed linuxthreads support from its main development tree. It was moved into a separate ports tree. It is maintained on a volunteer basis only. uClibc will maintain both the linuxthreads and nptl thread models actively. Secondly, glibc only supports a couple of primary processor architectures. The rest of the architectures were also recently moved into the ports tree. uClibc continues to actively support many more architectures by default. For embedded systems, uClibc is clearly the winner.

3 Why NPTL for Embedded Systems?

uClibc supports multiple thread library models. What are the shortcomings of linuxthreads? Why is nptl better, or worse? The answer lies in the requirements, software and hardware, of the embedded platform being developed. We need to first compare linuxthreads and nptl to choose the thread library that best meets the needs of our platform. Table 2 lists the key features the two thread libraries have to offer.

Using the table above, the nptl model is useful in systems that do not have severe memory constraints but need threads to respond quickly and efficiently. The linuxthreads model is useful mostly for resource-constrained systems still needing basic thread support. As mentioned at the beginning of this paper, embedded systems with faster processors and greater memory resources are being required to do more, with less. Using NPTL in conjunction with the already small C library provided by uClibc creates a well-balanced embedded Linux system that has size, speed, and high-performance multi-threading.


Feature                                   glibc           uClibc
LGPL                                      Y               Y
Complete POSIX compliance                 Y               N
Binary compatibility across releases      Y               N
NSS Support                               Y               N
NIS Support                               Y               N
Locale support                            Y               Y
Small disk storage footprint              N               Y
Small runtime memory footprint            N               Y
Supports MMU-less systems                 N               Y
Highly configurable                       N               Y
Simple build system                       N               Y
Built-in configuration system             N               Y
Easily maintained                         N               Y
NPTL Support                              Y               Y
Linuxthreads Support                      N (see below)   Y
Support many processor architectures      N (see below)   Y

Table 1: Library Feature Comparison

Feature               Description                                               LinuxThreads       NPTL
Storage Size          The actual amount of storage space consumed in the       Smallest           Largest
                      file system by the libraries.
Memory Usage          The actual amount of RAM consumed at run-time by the     Smallest           Largest
                      thread library code and data. Includes both kernel
                      and user space memory usage.
Number of Threads     The maximum number of threads available in a process.    Hard-coded value   Dynamic
Thread Efficiency     Rate at which threads are created, destroyed,            Slowest            Fastest
                      managed, and run.
Per-Thread Signals    Signals are handled on a per-thread basis and not        No                 Yes
                      per-process.
Inter-Thread          Threads can share synchronization primitives like        No                 Yes
Synchronization       mutexes and semaphores.
POSIX.1 Compliance    Thread library is compliant.                             No                 Yes

Table 2: Thread Library Features


4 uClibc NPTL Implementation

The following sections outline the major technical components of the NPTL implementation for uClibc. For the most part, they should apply equally to glibc's implementation except where noted. References to additional papers and information are provided should the reader wish to delve deeper into the inner workings of various components.

4.1 TLS—The Foundation of NPTL

The first major component needed for NPTL on Linux systems is Thread Local Storage (TLS). Threads in a process share the same virtual address space. Usually, any static or global data declared in the process is visible to all threads within that process. TLS allows threads to have their own local static and global data. An excellent paper, written by Ulrich Drepper, covers the technical details of implementing TLS for the ELF binary format [6]. Supporting TLS required extensive changes to binutils [7], GCC [8], and glibc [1]. We cover the changes made in the C library necessary to support TLS data.

4.1.1 The Dynamic Loader

The dynamic loader, also affectionately known as ld.so, is responsible for the run-time linking of dynamically linked applications. The loader is the first piece of code to execute before the main function of the application is called. It is responsible for loading and mapping in all required shared objects for the application.

For non-TLS applications, the process of loading shared objects and running the application is trivial and straightforward. TLS data types complicate dynamic linking substantially. The loader must detect any TLS sections, allocate initial memory blocks for any needed TLS data, perform initial TLS relocations, and later perform additional TLS symbol look-ups and relocations during the execution of the process. It must also deal with the loading of shared objects containing TLS data during program execution and properly allocate and relocate their data. The TLS paper [6] provides an ample overview of how these mechanisms work. The document does not, however, currently cover the specifics of MIPS-specific TLS storage and implementation. MIPS TLS information is available from the Linux/MIPS website [9].

Adding TLS relocation support into uClibc's dynamic loader required changes to close to 2200 lines of code. The only functionality not available in the uClibc loader is the handling of TLS variables in the dynamic loader itself. It should also be noted that statically linked binaries using TLS/NPTL are not currently supported by uClibc. Static binaries will not be supported until all processor architectures capable of supporting NPTL have working shared library support. In reality, shared library support for NPTL is a prerequisite for debugging static NPTL support. Thank you to Daniel Jacobowitz for pointing this out.

4.1.2 TLS Variables in uClibc

There are four TLS variables currently used in uClibc. The noticeable difference that can be observed in the source code is that they have an additional type modifier of __thread. Table 3 lists the TLS variables in detail. There are 15 additional TLS variables for locale support that were not ported from glibc. Multi-threaded locale support with NPTL is currently not supported with uClibc.
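
As a minimal illustration of what the __thread modifier provides (an example written for this paper, not code taken from uClibc), each thread that touches such a variable gets its own copy in its TLS block:

#include <pthread.h>
#include <stdio.h>

static __thread int my_errno;	/* hypothetical per-thread variable */

static void *worker(void *arg)
{
	my_errno = (int)(long)arg;	/* writes this thread's copy only */
	printf("my copy: %d\n", my_errno);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, worker, (void *)1L);
	pthread_create(&t2, NULL, worker, (void *)2L);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;	/* each worker printed its own value */
}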


Variable    Description
errno       The number of the last error set by system calls and some functions in
            the library. Previously it was thread-safe, but still shared by threads
            in the same process. For NPTL, it is a TLS variable and thus each
            thread has its own instantiation.
h_errno     The error return value for network database operations. This variable
            was also previously thread-safe. It is only available internally to
            uClibc.
_res        The resolver context state variable for host name look-ups.
RPC_VARS    Pointer to the internal Remote Procedure Call (RPC) structure for
            multi-threaded applications.

Table 3: TLS Variables in uClibc

4.2 Futexes

Futexes [10] [11] are fast user-space mutexes. They are an important part of the locking necessary for a responsive pthreads library implementation. They are supported by Linux 2.6 kernels, and no code porting was necessary for them to be usable in uClibc other than adding prototypes in a header file. glibc also uses them extensively for the file I/O functions. Futexes were ported for use in uClibc's I/O functions as well, although this is not strictly required by NPTL. Futex I/O support is configurable for uClibc and can be selected with the UCLIBC_HAS_STDIO_FUTEXES option.
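
As a rough sketch of the underlying mechanism (illustrative only; this is not uClibc code, just the bare futex system call that such locking builds on):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Block until *uaddr changes away from val (or a wake-up arrives). */
static int futex_wait(int *uaddr, int val)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL);
}

/* Wake up to nwake threads blocked on uaddr. */
static int futex_wake(int *uaddr, int nwake)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAKE, nwake);
}

The fast path of a futex-based mutex never enters the kernel at all; the system call is made only when there is contention.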

4.3 Asynchronous Thread Cancellation

A POSIX-compliant thread library contains the function pthread_cancel, which cancels the execution of a thread. Please see the official definition of this function at The Open Group website [12]. Cancellation of a thread must be done carefully and only at certain points in the library. There are close to 40 functions where thread cancellation must be checked for and possibly handled. Some of these are heavily used functions like read, write, open, close, lseek, and others. Extensive changes were made to uClibc's C library core in order to support thread cancellation. A list of these is available on the uClibc NPTL development site [13].
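
A minimal example of the behaviour those changes have to support (written for illustration; it is not part of the uClibc test suite):

#include <pthread.h>
#include <unistd.h>
#include <stdio.h>

static void *worker(void *arg)
{
	char buf[64];

	for (;;)
		read(STDIN_FILENO, buf, sizeof(buf));	/* a cancellation point */
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	sleep(1);
	pthread_cancel(t);	/* honoured when the worker next hits a cancellation point */
	pthread_join(t, NULL);
	printf("worker cancelled\n");
	return 0;
}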

4.4 Threads Library

The code in the nptl directory of glibc was initially copied verbatim from a snapshot dated 20050823. An entirely new set of build files was created in order to build NPTL within uClibc. Almost all of the original files remain, with the exception of Asynchronous I/O (AIO) related code. All backwards binary compatibility code and functions have been removed. Any code that was surrounded with #ifdef SHLIB_COMPAT or associated with those code blocks was also removed. The uClibc NPTL implementation should be as small as possible and not constrained by old thread compatibility code. This also means that any files in the root nptl directory with the prefix old_pthread_ were also removed. Finally, there were minor header file changes and some functions were renamed.


4.5 POSIX Timers

In addition to the core threads library, there were also code changes for POSIX timers. These changes were integrated into the librt library in uClibc. These changes were not in the original statement of work from Broadcom, but were implemented for completeness. All the timer tests for POSIX timers associated with NPTL do pass, but were not required.

For those interested in further details of the NPTL design, please refer to Ulrich Drepper's design document [14].

5 uClibc NPTL Testing

There were a total of four test suites that the uClibc NPTL implementation was tested against. Hopefully, with all of these tests, uClibc NPTL functionality should be at or near the production quality of glibc's implementation.

5.1 uClibc Testsuite

uClibc has its own test suite distributed with the library source. While there are many tests verifying the inner workings of the library and the various subsystems, pthreads tests are minimal. There are only 7 of them, with another 5 for the dynamic loader. With the addition of TLS data and NPTL, these were simply not adequate for testing the new functionality. They were, however, useful as regression tests. The test suite can be retrieved with uClibc from the main site [2].

5.2 glibc Testsuite

The author would first like to convey immense appreciation to the glibc developers for creating such a comprehensive and usable test suite for TLS and NPTL. Had it not been for the tests distributed with their code, I would have released poor code that would have resulted in customer support nightmares. The tests were very well designed and extremely helpful in finding holes in the uClibc NPTL implementation. 182 selected NPTL tests and 15 TLS tests were taken from glibc and passed successfully with uClibc. There were a number of tests not applicable to uClibc's NPTL implementation that were omitted. For further details concerning the tests, please visit the uClibc NPTL project website [13].

5.3 Linux Test Project

The Linux Test Project (LTP) suite [15] is used to "validate the reliability, robustness, and stability of Linux." It has 2900+ tests that not only test pthreads, but also act as a large set of regression tests to make sure uClibc is still functioning properly as a whole.

5.4 Open POSIX Test Suite

To quote from the website [16], "The POSIX Test Suite is an open source test suite with the goal of performing conformance, functional, and stress testing of the IEEE 1003.1-2001 System Interfaces specification in a manner that is agnostic to any given implementation." This suite of tests is the most important indicator of how correct the uClibc NPTL implementation actually is. It tests pthreads, timers, asynchronous I/O, message queues, and other POSIX-related APIs.

5.5 Hardware Test Platform

All development and testing was done with an AMD Alchemy DBAu1500 board graciously donated by AMD.


Source                  Version
binutils                2.16.1
gcc                     4.1.0
glibc                   20050823
uClibc-nptl             20060318
Linux Kernel Headers    2.6.15
LTP                     20050804
Open POSIX Test Suite   20050804 (Distributed w/LTP)
buildroot               20060328
crosstool               0.38
Linux/MIPS Kernel       2.6.15

Table 4: Source and Tool Versions

Broadcom also provided their own hardware, but it was not ready for use until later in the software development cycle.

The DBAu1500 development board is designed around a 400 MHz 32-bit MIPS Au1500 processor core. The board has 64 MB of 100 MHz SDRAM and 32 MB of AMD MirrorBit Flash memory. The Au1500 utilizes the MIPS32 instruction set and has a 16 KB instruction cache and a 16 KB data cache. The board also provides two 10/100 Mbit Ethernet ports, a USB host controller, a PCI 2.2-compliant host controller, and other peripherals.

5.6 Software Versions

Table 4 lists the versions of all the sources used in the development and testing of uClibc NPTL. buildroot [17] was the build system used to create both the uClibc and glibc root filesystems necessary for running the test suites. crosstool [18] was used for building the glibc NPTL toolchain.

The actual Linux kernel version used on the AMD development board for testing is the released Linux/MIPS 2.6.15 kernel. The 2.6.16 release is not currently stable enough for testing and development; a number of system calls appear to be broken, along with serial text console responsiveness. The root filesystem was mounted over NFS.

glibc         uClibc
53m 8.983s    21m 33.129s

Table 5: Toolchain Build Times


6 uClibc NPTL Test Results

6.1 Toolchain Build Time

Embedded system targets do not usually have the processor and/or memory resources available to host a complete development environment (compiler, assembler, linker, etc.). Usually development is done on an x86 host and the binaries are cross-compiled for the target using a cross development toolchain. crosstool [18] was used to build the x86-hosted MIPS NPTL toolchain using glibc, and buildroot [17] was used to build the MIPS NPTL toolchain using uClibc.

Building a cross development toolchain is a time-consuming process. Not only are such toolchains difficult to get working properly, they also take a long time to build. glibc itself usually must be built twice in order to get internal paths and library dependencies correct. uClibc, on the other hand, need only be built once due to its simpler design and reduced complexity of the build system. The toolchains were built and hosted on a dual Opteron 248 system with 1 GB of RAM and Ultra 160 SCSI and SATA hard drives. Times include the actual extraction of the source from the tarballs for the toolchain components. See Table 5. The toolchains above compile both C and C++ code for the MIPS target processor. Not only are uClibc's libraries smaller (as you will see shortly), but creating a development environment is much simpler and less time consuming.

6.2 Library Code Size

Throughout our discussion, we have stressed the small size of uClibc as compared to glibc. Tables 6 and 7 show these comparisons for the static libraries and shared objects.

The size difference between the C libraries is dramatic: uClibc is better than 2 times smaller than glibc. uClibc's libdl.a is larger because some TLS functions used only for shared objects are being included in the static library. This is due to a problem in the uClibc build system that will be addressed. The libnsl.a and libresolv.a libraries are dramatically smaller for uClibc only because they are stub libraries; the functions usually present in the corresponding glibc libraries are contained inside uClibc itself. Finally, libm.a for uClibc is much smaller due to reduced math functionality in uClibc as compared to glibc. Most functions for handling double data types are not present in uClibc.

The shared objects of most interest are libc.so, ld.so, and libpthread.so. uClibc's main C library is over 2 times smaller than glibc's. The dynamic loader for uClibc is 4 times smaller. glibc's dynamic loader is complex and larger, but it has to be in order to handle binary backwards compatibility. Additionally, uClibc's dynamic loader cannot be executed as an application like glibc's can. Although the NPTL pthread library code was ported almost verbatim from glibc, uClibc's library is 30% smaller. Why? The first reason is that any backward binary compatibility code was removed. Secondly, the comments in the nptl directory of glibc say that the NPTL code should be compiled with the -O2 compiler option. For uClibc, the -Os option was used to reduce the code size, and NPTL still functioned perfectly. This optimization worked for MIPS, but other architectures may not be able to use it.

6.3 glibc NPTL Library Test Results

The developers of NPTL for glibc created a large and comprehensive test suite for testing its functionality. 182 tests were taken from glibc and tested with uClibc's TLS and NPTL implementation. All of these tests passed with uClibc NPTL. For a detailed overview of the selected tests, please visit the uClibc NPTL project website.

6.4 Linux Test Project (LTP) Results

Out of the more than 2900 tests executed, there were only 31 failed tests with uClibc. glibc also failed 31 tests. A number of tests that passed with uClibc failed with glibc, and the converse was also true. These differences will be examined at a later date. However, passing all the tests in the LTP is a goal for uClibc. Detailed test logs can be obtained from the uClibc NPTL project website.

6.5 Open POSIX Testsuite Results

Table 8 shows the results of the test runs for both libraries.

The first discrepancy observed is the total number of tests. glibc has a larger number of tests available because of Asynchronous I/O support and the sigqueue function, which are not currently available in uClibc. Had these features been present in uClibc, the totals would have most likely been the same.


Static Library              glibc [bytes]    uClibc [bytes]
libc.a                      3 426 208        1 713 134
libcrypt.a                  29 154           15 630
libdl.a                     10 670           36 020
libm.a                      956 272          248 598
libnsl.a                    161 558          1 100
libpthread.a                281 502          250 852
libpthread_nonshared.a      1 404            1 288
libresolv.a                 111 340          1 108
librt.a                     79 368           29 406
libutil.a                   11 464           9 188
TOTAL                       5 068 940        2 306 324

Table 6: Static Library Sizes (glibc vs. uClibc)

Shared Object     glibc [bytes]    uClibc [bytes]
libc.so           1 673 805        717 176
ld.so             148 652          35 856
libcrypt.so       28 748           13 676
libdl.so          16 303           13 716
libm.so           563 876          80 040
libnsl.so         108 321          5 032
libpthread.so     120 825          97 189
libresolv.so      88 470           5 036
librt.so          45 042           14 468
libutil.so        13 432           9 320
TOTAL             2 807 474        991 509

Table 7: Shared Object Sizes (glibc vs. uClibc)


RESULT         glibc    uClibc
TOTAL          1830     1648
PASSED         1447     1373
FAILED         111      83
UNRESOLVED     151      95
UNSUPPORTED    22       29
UNTESTED       92       60
INTERRUPTED    0        0
HUNG           1        3
SEGV           5        5
OTHERS         1        0

Table 8: OPT Results (glibc vs. uClibc)

The remaining results are still being analyzed and will be presented at the Ottawa Linux Symposium in July, 2006. The complete test logs are available from the uClibc NPTL project website.

7 Conclusions

NPTL support in uClibc is now a reality. A fully POSIX-compliant threads library in uClibc is a great technology enabler for embedded systems developers who need fast multi-threading capability in a small memory footprint. The results from the Open POSIX Test Suite need to be analyzed in greater detail in order to better quantify what POSIX support, if any, is missing.

8 Future Work

There is still much work to be done for the uClibc NPTL implementation. Below is a list of the important items:

• Sync NPTL code in the uClibc tree with the latest glibc mainline code.

• Implement NPTL for other processor architectures.

• Get static libraries working for NPTL.

• Merge uClibc-NPTL branch with uClibctrunk.

• Implement POSIX message queues.

• Implement Asynchronous I/O.

• Implement sigqueue call.

• Fix outstanding LTP and Open POSIX Test Suite failures.

9 Acknowledgements

I would first like to acknowledge and thank God for getting me through this project. It has been a 9-1/2 month journey full of difficulty and frustration at times. His strength kept me going. Secondly, my wife Jennifer was a constant encourager and supporter of me. I spent many weekends and evenings working while she kept the household from falling apart. Obviously, I would like to thank Broadcom Corporation for supporting this development effort and providing the code back to the community. Thanks to Erik Andersen from Code Poet Consulting, who was my partner in this endeavor, handling all the contracts and legal issues for me. Thank you to AMD for supplying me multiple MIPS development boards free of charge. Special thanks to Mathieu Chouinard for formatting my paper in the final hours before submittal. Finally, I would like to dedicate this paper and entire effort to my first child, Zachary James Hill, who just turned a year old on March 5th, 2006. I love you son.


References

[1] GNU C Library at http://www.gnu.org/software/libc/ and http://sourceware.org/glibc/

[2] uClibc at http://www.uclibc.org/

[3] Newlib at http://sourceware.org/newlib/

[4] Diet Libc at http://www.fefe.de/dietlibc/

[5] Klibc at ftp://ftp.kernel.org/pub/linux/libs/klibc/

[6] ELF Handling for Thread Local Storage at http://people.redhat.com/drepper/tls.pdf

[7] Binutils at http://www.gnu.org/software/binutils/

[8] GCC at http://gcc.gnu.org/

[9] NPTL Linux/MIPS at http://www.linux-mips.org/wiki/NPTL

[10] Hubertus Franke, Matthew Kirkwood, Rusty Russell. Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux. In Proceedings of the Ottawa Linux Symposium, pages 479–494, June 2002.

[11] Futexes Are Tricky at http://people.redhat.com/drepper/futex.pdf

[12] pthread_cancel function definition at http://www.opengroup.org/onlinepubs/007908799/xsh/pthread_cancel.html

[13] uClibc NPTL Project at http://www.realitydiluted.com/nptl-uclibc/

[14] The Native POSIX Thread Library for Linux at http://people.redhat.com/drepper/nptl-design.pdf

[15] Linux Test Project at http://ltp.sourceforge.net/

[16] Open POSIX Test Suite at http://posixtest.sourceforge.net/

[17] buildroot at http://buildroot.uclibc.org/

[18] crosstool at http://www.kegel.com/crosstool/


Playing BlueZ on the D-Bus

Marcel Holtmann
BlueZ Project

[email protected]

Abstract

The integration of the Bluetooth technology into the Linux kernel and the major Linux distributions has progressed really fast over the last two years. The technology is present almost everywhere. All modern notebooks and mobile phones are shipped with built-in Bluetooth. The use of Bluetooth with a Linux-based system is easy and in most cases only needs a one-time setup, but all the tools are still command-line based. In general this is not so bad, but for greater success it is necessary to seamlessly integrate the Bluetooth technology into the desktop. There have been approaches for the GNOME and KDE desktops. Both have been quite successful and made the use of Bluetooth easy. The problem, however, is that both implemented their own framework around the Bluetooth library and its daemons, and there was no way for programs from one system to talk to the other. With the final version of the D-Bus framework and its adoption into the Bluetooth subsystem of Linux, it will be simple to make all applications Bluetooth-aware.

The idea is to establish one central Bluetooth daemon that takes care of all tasks that can't or shouldn't be handled inside the Linux kernel. These jobs include PIN code and link key management for authentication and encryption, caching of device names and services, and also central control of the Bluetooth hardware. All possible tasks and configuration options are accessed via the D-Bus interface. This will allow the internals of GNOME and KDE applications to be abstracted from any technical details of the Bluetooth specification. Even other applications will get access to the Bluetooth technology without any hassle.

1 Introduction

The Bluetooth specification [1] defines a clear abstraction layer for accessing different Bluetooth hardware options. It is called the Host Controller Interface (HCI) and is the basis of all Bluetooth protocol stacks (see Figure 1).

This interface consists of commands and events that provide support for configuring the local device and creating connections to other Bluetooth devices. The commands are split into six different groups:

• Link Control Commands

• Link Policy Commands

• Host Controller and Baseband Commands

• Informational Parameters

• Status Parameters

• Testing Commands


[Figure 1: Simple Bluetooth stack — a layered diagram showing Radio, Baseband, and Link Manager at the bottom, HCI above them, then L2CAP, SDP, RFCOMM, and OBEX, with the applications and profiles on top.]

With the Link Control Commands it is possible to search for other Bluetooth devices in range and to establish connections to other devices. This group also includes commands to handle authentication and encryption. The Link Policy Commands control the established connections between two or more Bluetooth devices. They also control the different power modes. All local settings of a Bluetooth device are modified with commands from the Host Controller and Baseband Commands group. This includes, for example, the friendly name and the class of device. For detailed information about the local device, the commands from the Informational Parameters group can be used. The Status Parameters group provides commands for detailed information from the remote device. This includes the link quality and the RSSI value. With the Testing Commands group the device provides commands for Bluetooth qualification testing. All commands are answered by an event that returns the requested value or information. Some events can also arrive at any time, for example to request a PIN code or to notify of a changed power state.

Every Bluetooth implementation must implement the Host Controller Interface, and for Linux a specific set of commands has been integrated into the Linux kernel. Another set of commands is implemented through the Bluetooth library, and some of the commands are not implemented at all. This is because they are not needed or because they have been deprecated by the latest Bluetooth specification. The range of commands implemented in the kernel mostly deals with Bluetooth connection handling. The commands in the Bluetooth library are for configuration of the local device and handling of authentication and encryption.

While the Host Controller Interface is a clean hardware abstraction, it is not a clean or easy programming interface. The Bluetooth library provides an interface to HCI, and an application programmer has to write a lot of code to get Bluetooth-specific tasks done via HCI. To make things easy for application programmers and also end users, a task-based interface to Bluetooth has been designed. The definition of these tasks has been done from an application perspective, and they are exported through D-Bus via methods and signals.

2 D-Bus integration

The hcid daemon is the main daemon when running Bluetooth on Linux. It handles all device configuration and authentication tasks. All configuration is done via a simple configuration file, and the PIN code is handled via a PIN helper script. This means that every time a configuration option needed to be changed, it was necessary to edit the configuration file (/etc/bluetooth/hcid.conf) and to restart hcid. The configuration file still configures the basic and also the default settings of hcid, but with the D-Bus integration all other settings are configurable through the D-Bus API. The current API consists of three interfaces:

• org.bluez.Manager

• org.bluez.Adapter

• org.bluez.Security

The Manager interface provides basic methods for listing all attached adapters and getting the default adapter. In the D-Bus API terms, an adapter is the local Bluetooth device. In most cases this might be a USB dongle or a PCMCIA card. The Adapter interface provides methods for configuring the local device, searching for remote devices, and handling remote devices. The Security interface provides methods to register passkey agents. These agents can provide fixed PIN codes, dialog boxes, or wizards for specific remote devices. Bluetooth applications using the D-Bus API don't have to worry about any Bluetooth-specific details or details of the Linux-specific implementation (see Figure 2).

Besides the provided methods, every interface also contains signals to broadcast changes or events from the HCI. This allows passive applications to get the information without actively interacting with any Bluetooth-related task. An example of this would be an applet that changes its icon depending on whether the local device is idle, connected, or searching for other devices.

Every local device is identified by its path. For the first Bluetooth adapter, this would be /org/bluez/hci0, and this path will be used for all methods of the Adapter interface. The best way to get this path is to call DefaultAdapter() from the Manager interface. This will always return the current default adapter, or an error if no Bluetooth adapter is attached. With ListAdapters() it is possible to get a complete list of paths of the attached adapters.

If the path is known, it is possible to use the full Adapter interface to configure the local device or handle tasks like pairing or searching for other devices. An example task would be the configuration of the device name. With GetName() the current name can be retrieved, and with SetName() it can be changed. Changing the name results in storing it on the filesystem and changing the name with an appropriate HCI command. If the local device already supports the Bluetooth Lisbon specification, then the Extended Inquiry Response will also be modified.

With the DiscoverDevices() method it is possible to start the search for other Bluetooth devices in range. This method call doesn't actually return any remote devices; it only starts the inquiry procedure of the Bluetooth chip, and every found device is returned via the RemoteDeviceFound signal. This allows all applications to handle new devices even if the discovery procedure has been initiated by a different application.
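
A sketch of how an application might consume that signal, in the same Python style as Figure 3 (this listing is illustrative only; a GLib main loop is assumed, and the signal arguments shown here (address, class of device, RSSI) are an assumption rather than a quotation of the API):

#!/usr/bin/python

import dbus
import dbus.glib
import gobject

def device_found(address, cls, rssi):
    print 'found %s (class 0x%06x, rssi %d)' % (address, cls, rssi)

bus = dbus.SystemBus()

obj = bus.get_object('org.bluez', '/org/bluez')
manager = dbus.Interface(obj, 'org.bluez.Manager')

obj = bus.get_object('org.bluez', manager.DefaultAdapter())
adapter = dbus.Interface(obj, 'org.bluez.Adapter')

adapter.connect_to_signal('RemoteDeviceFound', device_found)
adapter.DiscoverDevices()

gobject.MainLoop().run()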

3 Current status

The methods and signals for the D-Bus API for Bluetooth were chosen very carefully. The goal was to design it with current application needs in mind. It also aims to fulfill the needs of currently established desktop frameworks like the GNOME Bluetooth subsystem and the KDE Bluetooth framework. So it covers the common tasks and, on purpose, not everything that might be possible. The API can be divided into the following sections:

[Figure 2: D-Bus API overview. The hcid daemon (Core Manager, Adapter Manager, Passkey Manager, Security Manager, and Bluetooth Core) exports the org.bluez.Manager, org.bluez.Adapter, and org.bluez.Security interfaces to applications and talks to the Bluetooth drivers in the kernel through the Host Controller Interface.]

• Local

  – version, revision, manufacturer
  – mode, name, class of device

• Remote

  – version, revision, manufacturer
  – name, class of device
  – aliases
  – device discovery
  – pairing, bondings

• Security

– passkey agent

With these methods and signals, all standard tasks are covered. The Manager, Adapter, and Security interfaces are feature-complete at the moment.

4 Example application

The big advantage of the D-Bus framework is that it has bindings for multiple programming languages. With the integration of D-Bus into the Bluetooth subsystem, the use of Bluetooth from various languages becomes reality. Figure 3 shows an example of changing the name of the local device to My Bluetooth dongle using the Python programming language.

The example in Python is straightforward and simple. Using the D-Bus API within a C program is a little bit more complex, but it is still easier than using the native Bluetooth library API. Figure 4 shows an example of how to get the name of the local device.

5 Conclusion

The integration of a D-Bus API into the Bluetooth subsystem makes it easy for applications to access the Bluetooth technology. The current API is a big step in the right direction, but it is still limited. The Bluetooth technology is complex, and the Bluetooth services need to be extended with an easy-to-use D-Bus API.


#!/usr/bin/python

import dbus

bus = dbus.SystemBus()

obj = bus.get_object('org.bluez', '/org/bluez')
manager = dbus.Interface(obj, 'org.bluez.Manager')

obj = bus.get_object('org.bluez', manager.DefaultAdapter())
adapter = dbus.Interface(obj, 'org.bluez.Adapter')

adapter.SetName('My Bluetooth dongle')

Figure 3: Example in Python

The next steps would be the integration of D-Bus into the Bluetooth mouse and keyboard service. Another goal is seamless integration into the Network Manager. This would allow connecting to Bluetooth access points like any other WiFi access point.

The current version of the D-Bus API for Bluetooth will be used in the next generation of the Maemo platform, which is the basis for the Nokia 770 Internet tablet.

References

[1] Special Interest Group Bluetooth: Bluetooth Core Specification Version 2.0 + EDR, November 2004.

[2] freedesktop.org: D-BUS Specification, Version 0.11.

#include <stdio.h>
#include <stdlib.h>

#include <dbus/dbus.h>

int main(int argc, char **argv)
{
	DBusConnection *conn;
	DBusMessage *msg, *reply;
	const char *name;

	conn = dbus_bus_get(DBUS_BUS_SYSTEM, NULL);

	msg = dbus_message_new_method_call("org.bluez",
			"/org/bluez/hci0",
			"org.bluez.Adapter", "GetName");

	reply = dbus_connection_send_with_reply_and_block(
			conn, msg, -1, NULL);

	dbus_message_get_args(reply, NULL,
			DBUS_TYPE_STRING, &name,
			DBUS_TYPE_INVALID);

	printf("%s\n", name);

	dbus_message_unref(msg);
	dbus_message_unref(reply);
	dbus_connection_close(conn);

	return 0;
}

Figure 4: Example in C


FS-Cache: A Network Filesystem Caching Facility

David Howells
Red Hat UK Ltd

[email protected]

Abstract

FS-Cache is a kernel facility by which a network filesystem or other service can cache data locally, trading disk space to gain performance improvements for access to slow networks and media. It can be used by any filesystem that wishes to use it, for example AFS, NFS, CIFS, and ISOFS. It can support a variety of backends: different types of cache that have different trade-offs.

FS-Cache is designed to impose as little overhead and as few restrictions as possible on the client network filesystem using it, whilst still providing the essential services.

The presence of a cache indirectly improves performance of the network and the server by reducing the need to go to the network.

1 Overview

The FS-Cache facility is intended for use with network filesystems, permitting them to use persistent local storage to cache data and metadata, but it may also be used to cache other sorts of media such as CDs.

The basic principle is that some media are effectively slower than others—either because they are physically slower, or because they must be shared—and so a cache on a faster medium can be used to improve general performance by reducing the amount of traffic to or across the slower media.

Another reason for using a cache is that the slower media may be unreliable for some reason—for example, a laptop might lose contact with a wireless network, but the working files might still need to be available. A cache can help with this by storing the working set of data and thus permitting disconnected operation (offline working).

1.1 Organisation

FS-Cache is a thin layer (see Figure 1) in the kernel that permits client filesystems (such as NFS, AFS, CIFS, ISOFS) on one side to request caching services without knowing what sort of cache is attached, if any.

[Figure 1: Cache architecture — client filesystems such as NFS, AFS, and ISO9660 sit above the FS-Cache layer, which passes their requests on to cache backends such as CacheFS and CacheFiles.]


On the other side, FS-Cache farms those requests off to the available caches, be they CacheFS, CacheFiles, or whatever (see section 4)—or the request is gracefully denied if there isn't an available cache.

FS-Cache permits caches to be shared between several different sorts of netfs (the client filesystems will be referred to generically as the netfs in this document), though it does not in any way associate two different views of the same file obtained by two separate means. If a file is read by both NFS and CIFS, for instance, two copies of the file will end up in the cache (see section 1.7).

It is possible to have more than one cache available at one time. In such a case, the available caches have unique tags assigned to them, and a netfs may use these to bind a mount to a specific cache.

1.2 Operating Principles

FS-Cache does not itself require that a netfs file be completely loaded into the cache before that file may be accessed through the cache. This is because:

1. it must be practical to operate without a cache;

2. it must be possible to open a remote file that's larger than the cache;

3. the combined size of all open remote files—including mapped libraries—must not be limited to the size of the cache; and

4. the user should not be forced to download an entire file just to do a one-off access of a small portion of it (such as might be done with the file program).


FS-Cache makes no use of the i_mapping pointer on the netfs inode, as this would force the filesystems using the cache either to be bimodal in implementation (operating very differently with a cache and without one) or to always require a cache for operation, with the files completely downloaded before use—none of which is acceptable for filesystems such as NFS.

FS-Cache is built instead around the idea that data should be served out of the cache in pages as and when requested by the netfs using it. That said, the netfs may, if it chooses, download the whole file and install it in the cache before permitting the file to be used—rejecting the file if it won't fit. All FS-Cache would see is a reservation (see section 1.4) followed by a stream of pages to entirely fill out that reservation.

Furthermore, FS-Cache is built around the principle that the netfs's pages should belong to the netfs's inodes, and so FS-Cache reads and writes data directly to or from those pages.

Lastly, files in the cache are accessed by sequences of keys, where keys are arbitrary blobs of binary data. Each key in a sequence is used to perform a lookup in an index to find the next index to consult or, finally, the file to access.
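
As an illustration (a hypothetical layout only, since the actual index tree is whatever the netfs chooses to propose), an NFS-like netfs might arrange its key sequence as follows:

    FS-Cache
      +-- "NFS" netfs index
            +-- key { server address, port }     -> per-server index
                  +-- key { NFS file handle }    -> data object
                        +-- pages 0..n of the cached file

Each key is an opaque blob supplied by the netfs; FS-Cache simply uses it to look up the next index or, finally, the data object holding the file's pages.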

1.3 Facilities Provided

FS-Cache provides the following facilities:

1. More than one cache can be used at once. Caches can be selected explicitly by use of tags.

2. Caches can be added or removed at any time.



3. The netfs is provided with an interface that allows either party to withdraw caching facilities from a file (required for point 2). See section 5.

4. The interface to the netfs returns as few errors as possible, preferring rather to let the netfs remain oblivious. This includes I/O errors within the cache, which are hidden from the netfs. See section 5.8.

5. Cookies are used to represent indices, data files and other objects to the netfs. See sections 3 and 5.1.

6. Cache absence is handled gracefully; the netfs doesn't really need to do anything as the FS-Cache functions will just observe a NULL pointer—a negative cookie—and return immediately. See section 5.2.

7. Cache objects can be "retired" upon release. If an object is retired, FS-Cache will mark it as obsolete, and the cache backend will delete the object—data and all—and recursively retire all that object's children. See section 5.5.

8. The netfs is allowed to propose—dynamically—any index hierarchy it desires, though it must be aware that the index search function is recursive, stack space is limited, and indices can only be children of other indices. See section 3.2.

9. Data I/O is done on a page-by-page basis. Only pages which have been stored in the cache may be retrieved. Unstored pages are passed back to the netfs for retrieval from the server. See section 5.7.

10. Data I/O is done directly to and from the netfs's pages. The netfs indicates that page A is at index B of the data file represented by cookie C, and that it should be read or written. The cache backend may or may not start I/O on that page, but if it does, a netfs callback will be invoked to indicate completion. The I/O may be either synchronous or asynchronous.

11. A small piece of auxiliary data may be stored with each object. The format and usage of this data is entirely up to the netfs. The main purpose is for coherency management.

12. The netfs provides a "match" function for index searches. In addition to saying whether or not a match was made, this can also specify that an entry should be updated or deleted. This should make use of auxiliary data to maintain coherency. See section 5.4.

1.4 Disconnected Operation

Disconnected operation (offline working) requires that the set of files required for operation is fully loaded into the cache, so that the netfs can provide their contents without having to resort to the network. Not only that, it must be possible for the netfs to save changes into the cache and keep track of them for later synchronisation with the server when the network is once again available.

FS-Cache does not, of itself, provide disconnected operation. That facility is left up to the netfs to implement—in particular with regard to synchronisation of modifications with the server.

That said, FS-Cache does provide three facilities to make the implementation of such a facility possible: reservations, pinning and auxiliary data.

Reservations permit the netfs to reserve a chunk of the cache for a file, so that the file can be loaded or expanded up to the specified limit.


Pinning permits the netfs to prevent a file from being discarded to make room in the cache for other files. The offline working set must be pinned in the cache to make sure it will be there when it's needed. The netfs would have to provide a way for the user to nominate the files to be saved, since they, and not the netfs, know what their working set will be.

Auxiliary data permits the netfs to keep track of a certain amount of writeback control information in the cache. The amount of primary auxiliary data is limited, but more can be made available by adding child objects to a data object to hold the extra information.

To implement potential disconnected operation for a file, the netfs must download all the missing bits of a file and load them into the cache in advance of the network going away.

Disconnected operation could also be of use with regard to ISOFS: the contents of a CD or DVD could be loaded into the cache for later retrieval without the need for the disc to be in the drive.

1.5 File Attributes

Currently, arbitrary file attributes (such as extended attributes or ACLs) can be retained in the cache in one of two ways: either they can be stored in the auxiliary data (which is restricted in size—see section 1.4) or they can be attached to objects as children of a special object type (see section 3).

Special objects are data objects of a type that isn't one of the two primary types (index and data). How special objects are used is at the discretion of the netfs that created them, but special objects behave otherwise exactly like data objects.

Optimisations may be provided later to permit cache file extended attributes to be used to cache file attributes—especially with the possibility of attribute sharing on some backing filesystems. This will improve the performance of attribute-heavy systems such as those that use SE Linux.

1.6 Performance Trade-Offs

The use of a local cache for remote filesystems requires some trade-offs be made in terms of client machine performance:

• File lookup time: this will be INCREASED by checking the cache before resorting to the network, and also by making a note of a looked-up object in the cache. It should be DECREASED by local caching of metadata.

• File read time: this will be INCREASED by checking the cache before resorting to the network, and by copying the data obtained back to the cache. It should be DECREASED by local caching of data, as a local disk should be quicker to read.

• File write time: this could be DECREASED by doing writeback caching using the disk. Write-through caching should be more or less neutral, since it's possible to write to both the network and the disk at once.

• File replacement time: this will be INCREASED by having to retire an object or tree of objects from the disk.

The performance of the network and the server are also affected, of course, since the use of a local cache should hopefully reduce network traffic by satisfying from local storage some of the requests that would have otherwise been committed to the network. This may to some extent counter the increases in file lookup time and file read time due to the drag of the cache.

1.7 Cache Aliasing

As previously mentioned, through the interaction of two different methods of retrieving a file (such as NFS and CIFS), it is possible to end up with two or more copies of a remote file stored locally. This is known as cache aliasing.

Cache aliasing is generally considered bad for a number of reasons: it requires extra resources to maintain multiple copies, the copies may become inconsistent, and the process of maintaining consistency may cause the data in the copies to bounce back and forth. It's generally up to the user to avoid cache aliasing in such a situation, though the netfs can help by keeping the number of aliases down.

The current NFS client can also suffer from cache aliasing with respect to itself. If two mounts are made of different directories on the same server, then two superblocks will be created, each with its own set of inodes. Yet some of the inodes may actually represent the same file on the server, and would thus be aliases. Ways to deal with this are being examined.

FS-Cache deals with the possibility of cache aliasing by refusing multiple acquisitions of the same object (be it an index object or a data object). It is left up to the netfs to multiplex objects.

1.8 Direct File Access

Files opened with O_DIRECT should not go through the cache. That is up to the netfs to implement, and FS-Cache shouldn't even see the direct I/O operations.

If a file is opened for direct file access when there's data for that file in the cache, the cache object representing that file should be retired, and a new one not created until the file is no longer open for direct access.

1.9 System Administration

Use of the FS-Cache facility by a netfs does not require anything special on the part of the system administrator, unless the netfs designer wills it. For instance, the in-kernel AFS filesystem will use it automatically if it's there, whilst the NFS filesystem currently requires an extra mount option to be passed to enable caching on that particular mount.

Whilst the exact details are subject to change, it should not be a problem to use the cache with automounted filesystems as there should be no need to wrap the mount call or issue a post-mount enabler.

2 Other Caching Schemes

Some network filesystems that can be used on Linux already have their own caching facilities built in individually, including Coda and OpenAFS. In addition, other operating systems have caching facilities, such as Sun's CacheFS.

2.1 Coda

Coda [1] requires a cache. It fully downloads the target file as part of the open process and stores it in the cache. The Coda file operations then redirect the various I/O operations to the equivalents on the cache file, and i_mapping is used to handle mmap() on a Coda file (this is required as Coda inodes do not have their own pages). i_mapping is not required with FS-Cache as the cache does I/O directly to the netfs's pages, and so mmap() can just use the netfs inode's pages as normal.

All the changes made to a Coda file are stored locally, and the entire file is written back when a file is either flushed on close() or fsync().

All this means that Coda may not handle a set of files that won't fit in its cache, and Coda can't operate without a cache. On the other hand, once a file has been downloaded, it operates pretty much at normal disk-file speeds. But imagine running the file program on a file of 100MB in size... Probably all that is required is the first page, but Coda will download all of it—that's fine if the file is then going to be used; but if not, that's a lot of bandwidth wasted.

This does, however, make Coda good for doing disconnected operation: you're guaranteed to have to hand the entirety of any file you were working with.

And it does potentially make Coda bad at handling sparse files, since Coda must download the whole file, holes and all, unless the Coda server can be made to pass on information about the gaps in a file.

2.2 OpenAFS

OpenAFS [2] can operate without a cache. It downloads target files piecemeal as the appropriate bits of the file are accessed, and places the bits in the cache if there is one.

No use is made of i_mapping; instead OpenAFS inodes own their own pages, and the contents are exchanged with pages in the cache files at appropriate times.

OpenAFS's caching operates using the main model assumed for FS-Cache. OpenAFS, however, locates its cache files by invoking iget() on the cache superblock, passing what it believes to be the cache file's inode number as a parameter.

2.3 Sun’s CacheFS

Modern Solaris [3] variants have their own filesystem caching facilities available for use with NFS (CacheFS). The mounting protocol is such that the cache must manually be attached to each NFS mount after the mount has been made.

FS-Cache does things a little differently: the netfs declares an interest in using caching facilities when the netfs is mounted, and the cache will be automatically attached either immediately if it's already available, or at the point it becomes available.

It would also be possible to get a netfs to request caching facilities after it has been mounted, though it might be trickier from an implementation point of view.

3 Objects and Indexing

Part of FS-Cache can be viewed as an object storage interface. The objects it stores come in two primary types: index objects and data objects, but other special object types may be defined on a per-parent-object basis as well.

Cache objects have certain properties:


• All objects apart from the root index object—which is inaccessible on the netfs side of things—have a parent object.

• Any object may have as many child objects as it likes.

• The children of an object do not all have to be of the same type.

• Index objects may only be the children of other index objects.

• Non-index objects3 may carry data as well as children.

• Non-index objects have a file size set beyond which pages may not be accessed.

• Index objects may not carry data.

• Each object has a key that is part of a keyspace associated with its parent object.

• Child keyspaces from two separate objects do not overlap—so two objects with equivalent binary blobs as their keys but with different parent objects are different objects.

• Each object may carry a small blob of netfs-specific auxiliary metadata that can be used to manage cache consistency and coherence.

• An object may be pinned in the cache, preventing it from being culled to make space.

• A non-index object may have space reserved in the cache for data, thus guaranteeing a minimum amount of page storage.

Note that special objects behave exactly like data objects, except in two cases: when they're being looked up, the type forms part of the key; and when the cache is being culled, special objects are not automatically culled, but they are still removed upon request or when their parent object goes away.

3 Data objects and special objects.

3.1 Indices

Index objects are very restricted objects as they may only be the children of other indices and they may not carry data. However, they may exist in more than one cache if they don't have any non-index children, and they may be bound to specific caches—which binds all their children to the same cache.

Index object instantiation within any particular cache is deferred until an index further down the branch needs a non-index type child object instantiating within that cache—at which point the full path will be instantiated in one go, right up to the root index if necessary.

Indices are used to speed up file lookup by splitting up the key to a file into a sequence of logical sections, and can also be used to cut down keys that are too long to use in one lump. Indices may also be used to define a logical group of objects so that the whole group can be invalidated in one go.

Records for index objects are created in the virtual index tree in memory whether or not a cache is available, so that cache binding information can be stored for when a cache is finally made available.

3.2 Virtual Indexing Tree

FS-Cache maintains a virtual indexing tree in memory for all the active objects it knows about. There's an index object at the root of the tree for FS-Cache's own use. This is the root index.


The children of the root index are keyed on the name of the netfs that wishes to use the offered caching services. When a netfs requests caching services, an index object specific to that service will be created if one does not already exist (see Figure 2).

[Figure 2: Primary Indices — the .fsdef root index with NFS, AFS and ISOFS primary indices as its children.]

Each of these is the primary index for the named netfs, and each can be used by its owner netfs in any way it desires. AFS, for example, would store per-cell indices in its primary index, using the cell name as the key.

Each primary index is versioned. Should a netfs request a primary index of a version other than the one stored in the cache, the entire index subtree rooted at that primary index will be scrapped, and a new primary index will be made.

Note that the index hierarchy maintained by a netfs will not normally reflect the directory tree that that netfs will display to the VFS and the user. Data objects generally are equivalent to inodes, not directory entries, and so hardlink and rename maintenance is not normally a problem for the cache.

For instance, with NFS the primary index might be used to hold an index per server—keyed by IP address—and each server index used to hold a data object per inode—keyed by NFS filehandle (see Figure 3).

The inode objects could then have child objects of their own to represent extended attributes or directory entries (see Figure 4).

[Figure 3: NFS Index Tree — the .fsdef root index, the NFS primary index, per-server indices, and per-inode data objects.]

[Figure 4: NFS Inode Attributes — inode objects with xattr and dir child objects.]

Note that the in-memory index hierarchy may not be fully representative of the union of the on-disk trees in all the active caches on a system. FS-Cache may discard inactive objects from memory at any time.

3.3 Data-Containing Objects

Any data object may contain quantities of pages of data. These pages are held on behalf of the netfs. The pages are accessed by index number rather than by file position, and the object can be viewed as having a sparse array of pages attached to it.

Holes in this array are considered to represent pages as yet unfetched from the netfs server, and if FS-Cache is asked to retrieve one of these, it will return an appropriate error rather than just returning a block full of zeros.

Special objects may also contain data in exactly the same way as data objects can.

4 Cache Backends

The job of actually storing and retrieving data is the job of a cache backend. FS-Cache passes the requests from the netfs to the appropriate cache backend to actually deal with them.

There are currently two candidate cache backends:

• CacheFS

• CacheFiles

CacheFS is a quasi-filesystem that permits a block device to be mounted and used as a cache. It uses the mount system call to make the cache available, and so doesn't require any special activation interface. The cache can be deactivated simply by unmounting it.

CacheFiles is a cache rooted in a directory in an already mounted filesystem. This is of more use where an extra block device is hard to come by, or re-partitioning is undesirable. It uses the VFS/VM filesystem interfaces to get another filesystem (such as Ext3) to do the requisite I/O on its behalf.

Both of these are subject to change in the future in their implementation details, and neither is fully complete at the time of writing this paper. See section 6 for information on the state of these components, and section 6.1 for performance data at the time of writing.

5 The Netfs Kernel Interface

The netfs kernel interface is documented in:

Documentation/filesystems/caching/netfs-api.txt

The in-kernel client support can be obtained by including:

linux/fscache.h

5.1 Cookies

The netfs and FS-Cache talk to each other by means of cookies. These are elements of the virtual indexing tree that FS-Cache maintains, but they appear as opaque pointers to the netfs. They are of type:

struct fscache_cookie *

A NULL pointer is considered to be a negative cookie and represents an uncached object.

A netfs receives a cookie from FS-Cache when it registers. This cookie represents the primary index of this netfs. A netfs can acquire further cookies by asking FS-Cache to perform a lookup in an object represented by a cookie it already has.

When a cookie is acquired by a netfs, an object definition must be supplied. Object definitions are described using the following structure:

struct fscache_object_def

This contains the cookie name; the object type; and operations to retrieve the object key and auxiliary data, to validate an object read from disk by its auxiliary data, to select a cache, and to manage netfs pages.

Note that a netfs's primary index is defined by FS-Cache, and is not subject to change.


5.2 Negative Cookies

A negative cookie is a NULL cookie pointer. Negative cookies can be used anywhere that non-negative cookies can, but with the effect that the FS-Cache header file wrapper functions return an appropriate error as fast as possible.

Note that attempting to acquire a new cookie from a negative cookie will simply result in another negative cookie. Attempting to store or retrieve a page using a negative cookie as the object specifier will simply result in ENOBUFS being issued.

FS-Cache will also issue a negative cookie if an error such as ENOMEM or EIO occurred, a non-index object's parent has no backing cache, the backing cache is being withdrawn from the system, or the backing cache is stopped due to an earlier fatal error.

5.3 Registering The Netfs

Before the netfs may access any of the caching facilities, it must register itself by calling:

fscache_register_netfs()

This is passed a pointer to the netfs definition.

The netfs definition doesn't contain a lot at the moment: just the netfs's name and index structure version number, and a pointer to a table of per-netfs operations, which is currently empty.

After a successful registration, the primary index pointer in the netfs definition will have been filled in with a pointer to the primary index object of the netfs.

The registration will fail if it runs out of memory or if there's another netfs of the same name already registered.

When a netfs has finished with the caching facilities, it should unregister itself by calling:

fscache_unregister_netfs()

This is also passed a pointer to the netfs definition. It will relinquish the primary index cookie automatically.
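As a rough illustration, registration and unregistration for an imaginary "examplefs" might look like the sketch below. The structure layout and field names are assumptions inferred from the description above (a name, an index structure version, an as-yet-empty operations table, and a primary index pointer filled in on registration), not a definitive statement of the interface.

#include <linux/fscache.h>

/* Hypothetical netfs definition for an imaginary "examplefs" filesystem;
 * the per-netfs operations table is omitted as it is currently empty. */
static struct fscache_netfs examplefs_cache_netfs = {
        .name    = "examplefs",
        .version = 0,            /* index structure version number */
};

static int examplefs_cache_register(void)
{
        /* On success, examplefs_cache_netfs.primary_index is assumed to
         * have been filled in with the primary index cookie. */
        return fscache_register_netfs(&examplefs_cache_netfs);
}

static void examplefs_cache_unregister(void)
{
        /* Relinquishes the primary index cookie automatically. */
        fscache_unregister_netfs(&examplefs_cache_netfs);
}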

5.4 Acquiring Cookies

A netfs can acquire further cookies by passing a cookie it already has, along with an object definition and a private datum, to:

fscache_acquire_cookie()

The cookie passed in represents the object that will be the parent of the new one.

The private datum will be recorded in the cookie (if one is returned) and passed to the various callback operations listed in the object definition.

The cache will invoke those operations in the cookie definition to retrieve the key and the auxiliary data, and to validate the auxiliary data associated with an object stored on disk.

If the object requested is of non-index type, this function will search the cache to which the parent object is bound to see if the object is already present. If a match is found, the owning netfs will be asked to validate the object. The validation routine may request that the object be used, updated or discarded.

If a match is not found, an object will be created if sufficient disk space and memory are available; otherwise a negative cookie will be returned.

If the parent object is not bound to a cache, then a negative cookie will be returned.
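Continuing the hypothetical examplefs sketch from section 5.3, acquiring a data-file cookie for an inode under an already-acquired server index cookie might look roughly like this. The object definition fields, the type constant, and the three-argument form of fscache_acquire_cookie() are assumptions based on the description above; the key and auxiliary-data callbacks are elided.

/* Assumed object definition for examplefs data files; the operations to
 * supply the key and auxiliary data described above would be filled in here. */
static struct fscache_object_def examplefs_file_object_def = {
        .name = "examplefs.file",
        .type = FSCACHE_COOKIE_TYPE_DATAFILE,  /* constant name is an assumption */
};

struct examplefs_inode {
        struct fscache_cookie *cache_cookie;
        /* ... the rest of the netfs's per-inode state ... */
};

static void examplefs_cache_init_inode(struct examplefs_inode *ei,
                                        struct fscache_cookie *server_cookie)
{
        /* Parent cookie, object definition, private datum.  A NULL return
         * is a negative cookie: the inode simply goes uncached. */
        ei->cache_cookie = fscache_acquire_cookie(server_cookie,
                                                  &examplefs_file_object_def,
                                                  ei);
}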


Cookies may not be acquired twice without being relinquished in between. A netfs must itself deal with potential cookie multiplexing and aliasing—such as might happen with multiple mounts off the same NFS server.

5.5 Relinquishing Cookies

When a netfs no longer needs the object attached to a cookie, it should relinquish the cookie:

fscache_relinquish_cookie()

When this is called, the caller may also indicate that they wish the object to be retired permanently—in which case the object and all its children, its children's children, etc. will be deleted from the cache.

Prior to relinquishing a cookie, a netfs must have uncached all the pages read or allocated to that cookie, and all the child objects acquired on that cookie must themselves have been relinquished.

The primary index should not be relinquished directly. This will be taken care of when the netfs definition is unregistered.
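A matching teardown path for the examplefs sketch might then look like the following (again a hedged illustration; the retire flag follows the description above, and the netfs is assumed to have already uncached its pages and relinquished any child cookies):

static void examplefs_cache_exit_inode(struct examplefs_inode *ei, int retire)
{
        /* A non-zero retire argument asks the cache to delete the object
         * and all of its children rather than just releasing it. */
        fscache_relinquish_cookie(ei->cache_cookie, retire);
        ei->cache_cookie = NULL;
}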

5.6 Control Operations

There are a number of FS-Cache operations that can be used to control the object attached to a cookie.

fscache_set_i_size()
This is used to set the maximum file size on a non-index object. Error ENOBUFS will be obtained if an attempt is made to access a page beyond this size. This is provided to allow the cache backend to optimise the on-disk cache to store an object of this size; it does not imply that any storage will be set aside.

fscache_update_cookie()
This can be used to demand that the auxiliary data attached to an object be updated from a netfs's own records. The auxiliary data may also be updated at other times, but there's no guarantee of when.

fscache_pin_cookie()
fscache_unpin_cookie()
These can be used to request that an object be pinned in the cache in which it currently resides, and to unpin a previously pinned object.

fscache_reserve_space()
This can be used to reserve a certain amount of disk space in the cache for a data object to store data in. The reservation will be extended to include any metadata required to store the reserved data. A reservation may be cancelled by reducing the reservation size to zero.

The pinning and reservation operations may both issue error ENOBUFS to indicate that an object is unbacked, and error ENOSPC to indicate that there's not enough disk space to set aside some for pinning and reservation.

Both reservation and pinning persist beyond the cookie being released unless the cookie or one of its ancestors in the tree is also retired.

5.7 Data Operations

There are a number of FS-Cache operations that can be used to store data in the object attached to a cookie and then to retrieve it again. Note that FS-Cache must be informed of the maximum data size of a non-index object before an attempt is made to access pages in that object.

fscache_alloc_page()
This is used to indicate to the cache that a netfs page will be committed to the cache at some point, and that any previous contents may be discarded without being read.

fscache_read_or_alloc_page()
This is used to request that the cache attempt to read the specified page from disk, and otherwise allocate space for it if it is not present, as it will be fetched shortly from the server.

fscache_read_or_alloc_pages()
This is used to read or allocate several pages in one go. This is intended to be used from the readpages address space operation.

fscache_write_page()
This is used to store a netfs page to a previously read or allocated cache page.

fscache_uncache_page()
fscache_uncache_pagevec()
These are used to release the reference put on a cache page or a set of cache pages by a read or allocate operation.

The allocate, read, and write operations will issue error ENOBUFS if the cookie given is negative or if there's no space on disk in the cache to honour the operation. The read operation will issue error ENODATA if asked to retrieve data it doesn't have but for which it can reserve space.

The read and write operations may complete asynchronously, and will make use of the supplied callback in all cases where I/O is started to indicate to the netfs the success or failure of the operation. If a read operation failed on a page, then the netfs will need to go back to the server.
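As a hedged example of how these operations might fit together, a netfs read path could consult the cache first and fall back to the server on ENODATA or ENOBUFS. The helper examplefs_read_from_server() is hypothetical, and the exact argument lists of the FS-Cache calls and of the completion callback are assumptions for illustration only.

#include <linux/fscache.h>
#include <linux/pagemap.h>

static int examplefs_read_from_server(struct file *file, struct page *page);

static void examplefs_cache_read_done(struct page *page, void *context,
                                      int error)
{
        /* Completion callback for asynchronous cache reads. */
        if (!error)
                SetPageUptodate(page);
        unlock_page(page);
        /* On error the netfs would refetch this page from the server. */
}

static int examplefs_readpage_cached(struct fscache_cookie *cookie,
                                     struct file *file, struct page *page)
{
        int ret;

        ret = fscache_read_or_alloc_page(cookie, page,
                                         examplefs_cache_read_done, NULL,
                                         GFP_KERNEL);
        switch (ret) {
        case 0:         /* cache I/O started; the callback signals completion */
                return 0;
        case -ENODATA:  /* a cache page was allocated, but holds no data yet */
        case -ENOBUFS:  /* no usable cache for this page */
        default:
                /* Fetch from the server; on -ENODATA the result could then
                 * be pushed into the cache with fscache_write_page(). */
                return examplefs_read_from_server(file, page);
        }
}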

5.8 Error Handling

FS-Cache handles as many errors as it can internally and never lets the netfs see them, preferring to translate them into negative cookies or ENOBUFS as appropriate to the context.

Out-of-memory errors are normally passed back to the netfs, which is then expected to deal with them appropriately, possibly by aborting the operation it was trying to do.

I/O errors in a cache are more complex to deal with. If an I/O error happens in a cache, then the cache will be stopped. No more cache transactions will take place, and all further attempts to do cache I/O will be gracefully failed.

If the I/O error happens during cookie acquisition, then a negative cookie will be returned, and all caching operations based on that cookie will simply give further negative cookies or ENOBUFS.

If the I/O error happens during the reading of pages from the cache, then any pages as yet unprocessed will be returned to the caller if the fscache reader function is still in progress; and any pages already committed to the I/O process will either complete normally, or will have their callbacks invoked with an error indication. In the latter case, the netfs should fetch the page from the server again.

If the I/O error happens during the writing of pages to the cache, then either the fscache write will fail with ENOBUFS or the callback will be invoked with an error. In either case, it can be assumed that the page has not been safely written into the cache.

5.9 Data Invalidation And Truncation

FS-Cache does not provide data invalidation and truncation operations per se. Instead the object should be retired (by relinquishing it with the retirement option set) and acquired anew. Merely shrinking the maximum file size down is not sufficient, especially as representations of extended attributes and suchlike may not be expunged by truncation.


6 Current State

The FS-Cache facility and its associated cache backends and netfs interfaces are not, at the time of writing, upstream. They are under development at Red Hat at this time. The states of the individual components are as follows:

• FS-Cache: at this time FS-Cache is stable. New features may be added, but none are planned.

• CacheFS: CacheFS is currently stalled. Although the performance numbers obtained are initially good, after a cache has been used for a while, read-back performance degrades badly due to fragmentation. There are ways planned to ameliorate this, but they require implementation.

• CacheFiles: CacheFiles has been prototyped and is under development at the moment in preference to CacheFS, as it doesn't require a separate block device to be made available, but can instead run on an already mounted filesystem. Currently only Ext3 is being used with it.

• NFS: the NFS interface is sufficiently complete to give read/write access through the cache. It does, however, suffer from local cache aliasing problems that need sorting out.

• AFS: the AFS interface is complete as far as the in-kernel AFS filesystem is currently able to go. AFS does not suffer from cache aliasing locally, but the filesystem itself does not yet have write support.

6.1 Current Performance

The caches have been tested with NFS to get some idea of the performance. CacheFiles was benchmarked on Ext3 with 1K and 4K block sizes, and also on CacheFS. The two caches and the block device raw tests were run on the same partition on the client's disk.

The client test machine contains a pair of 200MHz PentiumPro CPUs, 128MB of memory, an Ethernet Pro 100 NIC, and a Fujitsu MPG3204AT 20GB 5400rpm hard disk drive running in MDMA2 mode.

The server machine contains an Athlon64-FX51 with 5GB of RAM, an Ethernet Pro 100 NIC, and a pair of RAID1'd WDC WD2000JD 7200rpm SATA hard disk drives running in UDMA6 mode.

The client is connected through a pair of 100Mbps switches to the server, and the NFS connection was NFS3 over TCP. Before doing each test, the files on the server were pulled into the server's pagecache by copying them to /dev/null. Each test was run several times, rebooting the client between iterations. The lowest number for each case was taken.

Reading a 100MB file:

Cache state   CacheFiles (1K Ext3)   CacheFiles (4K Ext3)   CacheFS
None          26s                    26s                    26s
Cold          44s                    35s                    27s
Warm          19s                    14s                    11s

Reading 100MB of raw data from the same block device used to host the caches can be done in 11s.

And reading a 200MB file:


Cache state   CacheFiles (1K Ext3)   CacheFiles (4K Ext3)   CacheFS
None          46s                    46s                    46s
Cold          79s                    62s                    47s
Warm          37s                    29s                    23s

Reading 200MB of raw data from the same block device used to host the caches can be done in 22s.

As can be seen, a freshly prepared CacheFS gives excellent performance figures, but these numbers don't show the degradation over time for large files.

The performance of CacheFiles will degrade over time as the backing filesystem does, if it does—but CacheFiles's biggest problem is that it currently has to bounce the data between the netfs pages and the backing filesystem's pages. This means it does a lot of page-sized memory-to-memory copies. It also has to use bmap to probe for holes when retrieving pages, something that can be improved by implementing hole detection in the backing filesystem.

The performance of CacheFiles could possibly be improved by using direct I/O as well—that way the backing filesystem really would read and write directly from/to the netfs's pages. That would obviate the need for backing pages and would reduce the large memory copies.

Note that CacheFiles is still being implemented, so these numbers are very preliminary.

7 Further Information

There’s a mailing list available for FS-Cachespecific discussions:

mailto:[email protected]

Patches may be obtained from:

http://people.redhat.com/~dhowells/cachefs/

and:

http://people.redhat.com/~steved/cachefs/

The FS-Cache patches add documentation into the kernel sources here:

Documentation/filesystems/caching/

References

[1] Information about Coda can be found at:

http://www.coda.cs.cmu.edu/

[2] Information about OpenAFS can be found at:

http://www.openafs.org/

[3] Information about Sun’s CacheFS facilitycan be found in their onlinedocumentation:

http://docs.sun.com/

Solaris 9 12/02 System AdministratorCollection » System Administration Guide:Basic Administration » Chapter 40 UsingThe CacheFS File System (Tasks)

http://docs.sun.com/app/docs/

doc/816-4552/6maoo3121?a=view


Why Userspace Sucks—Or 101 Really Dumb Things Your App Shouldn't Do

Dave Jones
Red Hat

<[email protected]>

Abstract

During the development of Fedora Core 5 I found myself asking the same questions day after day:

• Why does it take so long to boot?

• Why does it take so long to start X?

• Why do I get the opportunity to go fetch a drink after starting various applications and waiting for them to load?

• Why does idling at the desktop draw so much power?

• Why does it take longer to shut down than it does to boot up?

I initially set out to discover if there was something the kernel could do better to speed up booting. A number of suggestions have been made in the past, ranging from better read-ahead, to improved VM caching strategies, to better on-disk block layout. I did not get that far however, because what I found in my initial profiling was disturbing.

We have an enormous problem with applications doing unnecessary work, causing wasted time, and more power-drain than necessary.

This talk will cover a number of examples of common applications doing incredibly wasteful things, and will also detail what can be, and what has been, done to improve the situation.

I intend to show by example numerous applications doing incredibly dumb things, from silliness such as reloading and re-parsing XML files 50 times each run, to applications that wake up every few seconds to ask the kernel to change the value of something that has not changed since it last woke up.

I created my tests using patches [1] to the Linux kernel, but experimented with other approaches using available tools like strace and systemtap. I will briefly discuss the use of these tools, as they apply to the provided examples, in later sections of this paper.

Our userspace sucks. Only through better education of developers about "really dumb things not to do" can we expect to resolve these issues.

1 Overview

A large number of strange things are happening behind the scenes in a lot of userspace programs. When their authors are quizzed about these discoveries, the responses range from "I had no idea it was doing that," to "It didn't do that on my machine." This paper hopes to address the former by shedding light on several tools (some old, some new) that enable userspace programmers to gain some insight into what is really going on. It addresses the latter by means of showing examples that may shock, scare, and embarrass their authors into writing code with better thought out algorithms.

2 Learning from read-ahead

Improving boot-up time has been a targeted goal of many distributions in recent years, with each vendor resorting to a multitude of different tricks in order to shave off a few more seconds between boot and login. One such trick employed by Fedora is the use of a read-ahead tool, which, given a list of files, simply reads them into the page cache, and then exits. During the boot process there are periods of time when the system is blocked on some non-disk I/O event such as waiting for a DHCP lease. Read-ahead uses this time to read in files that are used further along in the boot process. By seeding the page cache, the start-up of subsequent boot services will take less time, provided that there is sufficient memory to prevent it from being purged by other programs starting up during the time between the read-ahead application preloading it, and the real consumer of the data starting up.

The read-ahead approach is a primitive solution, but it works. By amortising the cost of disk IO during otherwise idle periods, we shave off a significant amount of time during boot/login. The bootchart [2] project produced a number of graphs that helped visualise progress during the early development of this tool, and later went on to provide a rewritten version for Fedora Core 5 which improved on the bootup performance even further.
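For illustration only, a minimal tool of this sort could look something like the sketch below: it walks a newline-separated list of paths and asks the kernel to pull each file into the page cache with readahead(2). The file-list format and the complete lack of error reporting are simplifying assumptions; the real Fedora tool is rather more involved.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void prefetch(const char *path)
{
        struct stat st;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return;
        if (fstat(fd, &st) == 0)
                readahead(fd, 0, st.st_size);   /* seed the page cache */
        close(fd);
}

int main(int argc, char **argv)
{
        char path[4096];
        FILE *list;

        if (argc < 2 || !(list = fopen(argv[1], "r")))
                return 1;
        while (fgets(path, sizeof(path), list)) {
                path[strcspn(path, "\n")] = '\0';
                if (path[0])
                        prefetch(path);
        }
        fclose(list);
        return 0;
}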

The only remaining questions are: what files do we want to prefetch, and how do we generate a list of them? When the read-ahead service was first added to Fedora, the file list was created using a kernel patch that simply printk'd the filename of every file open()'d during the first five minutes of uptime. (It was necessary to capture the results over a serial console, due to the huge volume of data overflowing the dmesg ring buffer very quickly.)

This patch had an additional use however, which was to get some idea of just what IO patterns userspace was creating.

During Fedora Core 5 development, I decided to investigate these patterns. The hope was that instead of the usual approach of 'how do we make the IO scheduler better', we could make userspace more intelligent about the sort of IO patterns it creates.

I started by extending the kernel patch to log all file IO, not just open()s. With this new patch, the kernel reports every stat(), delete(), and path_lookup(), too.

The results were mind-boggling.

• During boot-up, 79576 files were stat()'d, 26769 were open()'d, and 1382 commands were exec'd.

• During shutdown, 23246 files were stat()'d, and 8724 files were open()'d.

2.1 Results from profiling

Picking through a 155234-line log took some time, but some of the things found were truly spectacular.

Some of the highlights included:

• HAL Daemon.


– Reread and reparsed dozens of XML files during startup. (In some cases, it did this 54 times per XML file.)

– Read a bunch of files for devices that were not even present.

– Accounted for a total of 1918 open()'s, and 7106 stat()'s.

• CUPS

– Read in ppd files describing every printer known to man. (Even though there was not even a printer connected.)

– Responsible for around 2500 stat()'s, and around 500 open()'s.

• Xorg

A great example of how not to do PCI bus scanning.

– Scans through /proc/bus/pci/ in order.

– Guesses at random bus numbers, and tries to open those devices in /proc/bus/pci/.

– Sequentially probes for devices on busses 0xf6 through 0xfb (even though they may not exist).

– Retries entries that it has already attempted to scan, regardless of whether they succeeded or not.

Aside from this, when it is not busy scanning non-existent PCI busses, X really likes to stat and reopen a lot of files it has already opened, like libGLcore.so. A weakness of its dynamic loader perhaps?

• XFS

– Was rebuilding the font cache every time it booted, even if no changes had occurred in the fonts directories.

• gdm / gnome-session.

– Tried to open a bunch of non-existent files with odd-looking names like /usr/share/pixmaps/Bluecurve/cursors/00000000000000000000000000

– Suffers from font madness (see below).

2.2 Desktop profiling

Going further, removing the "first 5 minutes" check of the patch allowed me to profile what was going on at an otherwise idle desktop.

• irqbalance.

– Wakes up every 10 seconds to re-balance interrupts in a round-robin manner. Made a silly mistake where it was re-balancing interrupts where no IRQs had ever occurred. A three-line change saved a few dozen syscalls.

– Was also re-balancing, every 10 seconds, interrupts where an IRQ had not occurred in some time.

– Did an open/write/close of each /proc/irq/n/smp_affinity file each time it rebalanced, instead of keeping the fd's open and doing 1/3rd of the syscalls (see the sketch after this list).

Whilst working with /proc files does not incur any I/O, it does trigger a transition to and from kernel space for each system call, adding up to a lot of unneeded work on an otherwise 'idle' system.

• gamin

– Was stat()'ing a bunch of gnome menu files every few seconds for no apparent reason.


% time     seconds  usecs/call     calls    errors  syscall
 32.98    0.003376         844         4            clone
 27.87    0.002853           4       699         1  read
 23.50    0.002405          32        76            getdents
 10.88    0.001114           0      7288        10  stat
  1.38    0.000141           0       292            munmap
  1.31    0.000134           0       785       382  open

Figure 1: strace -c output of gnome-terminal with lots of fonts.

• nautilus

– Was stat’ing $HOME/Templates,/usr/share/applications,and $HOME/.local/share/

applications every few secondseven though they had not changed.

• More from the unexplained department. . .

– mixer_applet2 did a real_lookup on libgstffmpegcolorspace.so for some bizarre reason.

– Does trashapplet really need to stat the svg for every size icon when it is rarely resized?
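To illustrate the irqbalance point above: the fix amounts to opening each /proc/irq/N/smp_affinity file once, keeping the descriptor, and rewriting it on every rebalance pass, rather than doing an open/write/close triple each time. This is only a sketch of the idea, not irqbalance's actual code; the fixed IRQ limit and the mask format are simplifying assumptions.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_IRQS 224

static int affinity_fd[MAX_IRQS];

static void open_affinity_files(void)
{
        char path[64];
        int irq;

        for (irq = 0; irq < MAX_IRQS; irq++) {
                snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
                affinity_fd[irq] = open(path, O_WRONLY); /* -1 if no such IRQ */
        }
}

static void set_affinity(int irq, unsigned int cpumask)
{
        char buf[16];
        int len;

        if (irq < 0 || irq >= MAX_IRQS || affinity_fd[irq] < 0)
                return;
        len = snprintf(buf, sizeof(buf), "%x", cpumask);
        /* One lseek+write per rebalance instead of open+write+close. */
        lseek(affinity_fd[irq], 0, SEEK_SET);
        write(affinity_fd[irq], buf, len);
}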

2.3 Madness with fonts

I had noticed through reviewing the log that a lot of applications were stat()'ing (and occasionally open()'ing) a bunch of fonts, and then never actually using them. To try to make problems stand out a little more, I copied 6000 TTFs to $HOME/.fonts, and reran the tests. The log file almost doubled in size.

Lots of bizarre things stood out.

• gnome-session stat()’d 2473 and open()’d2434 ttfs.

• metacity open()’d another 238.

• Just to be on the safe side, wnck-appletopen()’d another 349 too.

• Nautilus decided it does not want to be leftout of the fun, and open()’d another 301.

• mixer_applet rounded things off byopen()ing 860 ttfs.

gnome-terminal was another oddball. Itopen()’ed 764 fonts and stat()’d another 770including re-stat()’ing many of them multi-ple times. The vast majority of those fontswere not in the system-wide fonts prefer-ences, nor in gnome-terminals private pref-erences. strace -c shows that gnome-terminal spends a not-insignificant amount ofits startup time, stat()’ing a bunch of fonts thatit never uses. (See Figure 1.)

Another really useful tool for parsing huge strace logs is Morten Welinder's strace-account [3], which takes away a lot of the tedious parsing, and points out some obvious problem areas in a nice, easy-to-read summary.

Whilst having thousands of fonts is a somewhat pathological case, it is not uncommon for users to install a few dozen (or, in the case of arty types, a few hundred). The impact of this defect will be less for most users, but it is still doing a lot more work than it needs to.

After my initial experiments were over, Dan Berrange wrote a set of systemtap scripts [4] to provide similar functionality to my tracing kernel patch, without the need to actually patch and rebuild the kernel.

3 Learn to use tools at your disposal

Some other profiling techniques are not so intrusive as to require kernel modifications, yet remarkably, they remain under-utilised.

3.1 valgrind

For some unexplained reason, there are developers that still have not tried (or in many cases, have not heard of) valgrind [5]. This is evident from the number of applications that still output lots of scary warnings during runtime.

Valgrind can find several different types of problems, ranging from memory leaks to the use of uninitialised memory. Figure 2 shows an example of mutt running under valgrind.

The use of uninitialised memory can be detected without valgrind, by setting the environment variable MALLOC_PERTURB_ [6] to a value that will cause glibc to poison memory allocated with malloc() to the value the variable is set to, without any need to recompile the program.

Since Fedora Core 5 development, I have run with this flag set to $RANDOM in my .bashrc. It adds some overhead to some programs which call malloc() a lot, but it has also found a number of bugs in an assortment of packages. A gdb backtrace is usually sufficient to spot the area of code where the author intended to use a calloc() instead of a malloc(), or in some cases, had an incorrect memset call after the malloc returns.
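A tiny, contrived example of the class of bug this flag tends to flush out: the code below assumes malloc() returns zeroed memory, which often appears to work on a fresh heap, but breaks as soon as glibc poisons new allocations (or the block is a recycled one).

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        int i;
        int *counters = malloc(16 * sizeof(int)); /* should have been calloc() */

        if (!counters)
                return 1;
        for (i = 0; i < 16; i++)
                counters[i]++;  /* increments whatever garbage was there */
        printf("counters[0] = %d\n", counters[0]);
        free(counters);
        return 0;
}

With MALLOC_PERTURB_ set to a non-zero value, the garbage becomes deterministic and the wrong output is obvious; valgrind will typically flag the use of the uninitialised values directly.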

3.2 oprofile

Perceived by many as complicated, oprofile is actually remarkably trivial to use. In a majority of cases, simply running

opcontrol --start
(do application to be profiled)
opcontrol --shutdown
opreport -l

is sufficient to discover the functions where time is being spent. Should you be using a distribution which strips symbols out to separate packages (for example, Fedora/RHEL's -debuginfos), you will need to install the relevant -debuginfo packages for the applications and libraries being profiled in order to get symbols attributed to the data collected.

3.3 Heed the warnings

A lot of developers ignore, or even suppress, warnings emitted by the compiler, proclaiming "They are just warnings." On an average day, the Red Hat package-build system emits around 40–50,000 warnings as part of its daily use, giving some idea of the scale of this problem.

Whilst many warnings are benign, there are several classes of warnings that can have undesirable effects. For example, an implicit declaration warning may still compile and run just fine on your 32-bit machine, but if the compiler assumes the undeclared function has int arguments when it actually has long arguments, unusual results may occur when the code is run on a 64-bit machine. Leaving warnings unfixed makes it easier for real problems to hide amongst the noise of the less important warnings.


==20900== Conditional jump or move depends on uninitialised value(s)
==20900==    at 0x3CDE59E76D: re_compile_fastmap_iter (in /lib64/libc-2.4.so)
==20900==    by 0x3CDE59EBFA: re_compile_fastmap (in /lib64/libc-2.4.so)
==20900==    by 0x3CDE5B1D23: regcomp (in /lib64/libc-2.4.so)
==20900==    by 0x40D978: ??? (color.c:511)
==20900==    by 0x40DF79: ??? (color.c:724)
==20900==    by 0x420C75: ??? (init.c:1335)
==20900==    by 0x420D8F: ??? (init.c:1253)
==20900==    by 0x422769: ??? (init.c:1941)
==20900==    by 0x42D631: ??? (main.c:608)
==20900==    by 0x3CDE51D083: __libc_start_main (in /lib64/libc-2.4.so)

Figure 2: Mutt under valgrind

Bonus warnings can be enabled with the compiler options -Wall and -Wextra (the latter option used to be -W in older gcc releases).

For the truly anal, static analysis tools such as splint [7] and sparse [8] may turn up additional problems.

In March 2006, a security hole was found in Xorg [9] by the Coverity Prevent scanner [10]. The code looked like this:

if (getuid() == 0 || geteuid != 0)

No gcc warnings are emitted during this compilation, as it is valid code, yet it does completely the wrong thing. Splint, on the other hand, indicates that something is amiss here with the warning:

Operands of != have incompatible types
([function (void) returns __uid_t], int): geteuid != 0
Types are incompatible.
(Use -type to inhibit warning)

Recent versions of gcc also allow programs to be compiled with -D_FORTIFY_SOURCE=2, which enables various security checks in various C library functions. If the size of memory passed to functions such as memcpy is known at compile time, warnings will be emitted if the len argument overruns the buffer being passed.

Additionally, use of certain functions without checking their return code will also result in a warning. Some 30–40 or so C runtime functions have had such checks added to them.

It also traps a far-too-common1 bug: memset with size and value arguments transposed. Code that does this:

memset(ptr, sizeof(foo), 0);

now gets a compile time warning which lookslike this:

warning: memset used with

constant zero length parameter;

this could be due to transposed

parameters

Even the simpler (and deprecated) bzero function is not immune from screwups of the size parameter, it seems, as this example shows:

bzero(pages + npagesmax, npagesmax - npagesmax);

Another useful gcc feature that was deployed in Fedora Core 5 was the addition of a stack overflow detector. With all applications compiled with the flags

1 Across 1282 packages in the Fedora tree, 50 of them had a variant of this bug.


-fstack-protector --param=ssp-buffer-size=4

any attempt at overwriting an on-stack buffer results in the program being killed with the following message:

*** stack smashing detected ***: ./a.out terminated
Aborted (core dumped)
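A trivial way to see the detector in action, assuming a build with the flags above, is to deliberately overflow a small on-stack buffer:

#include <string.h>

int main(int argc, char **argv)
{
        char buf[4];

        /* argv[0] is almost always longer than three characters plus NUL,
         * so this copy tramples the stack canary and the process is killed
         * with the "stack smashing detected" message above. */
        strcpy(buf, argv[0]);
        return 0;
}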

This turned up a number of problems during development, which were usually trivial to fix up.

4 Power measurement

The focus of power management has traditionally been aimed at mobile devices; lower power consumption leads to longer battery life. Over the past few years, we have seen increased interest in power management from data centers, too. There, lowering power consumption has a direct effect on the cost of power and cooling. The utility of power savings is not restricted to costs though, as it will positively affect up-time during power outages, too.

We did some research into power usage during Fedora Core 5 development, to find out exactly how good/bad a job we were doing at being idle. To this end, I bought a 'kill-a-watt' [11] device (and later borrowed a 'Watts-up' [12], which allowed serial logging). The results showed that a completely idle EM64T box (Dell Precision 470) sucked a whopping 153 Watts of power. At its peak, doing a kernel compile, it pulled 258W, over five times as much power as its LCD display. By comparison, a VIA C3 Nehemiah system pulled 48 Watts whilst idle. The lowest power usage I measured on modern hardware was 21W idle on a mobile-Athlon-based Compaq laptop.

Whilst vastly lower than the more heavyweight systems, it was still higher than I had anticipated, so I investigated further as to where the power was being used. For some time, people have been proclaiming the usefulness of the 'dynamic tick' patch for the kernel, which stops the kernel waking up at a regular interval to check if any timers have expired, instead idling until the next timer in the system expires.

Without the patch, the Athlon XP laptop idled at around 21W. With dynticks, after settling down for about a minute, the idle routine auto-calibrates itself and starts putting off delays. Suddenly, the power meter started registering... 20, 21, 19, 20, 19, 20, 18, 21, 19, 20, 22, changing about once a second. Given that the overall average power use went down below its regular idle power use, the patch does seem like a win. (For reference, Windows XP does not do any tricks similar to the dynticks patch, and idles at 20W on the same hardware.) Clearly the goal is to spend longer in the lower states, by not waking up so often.

Another useful side-effect of the dyntick patch was that it provides a /proc file that allows you to monitor which timers are firing, and their frequency. Watching this revealed a number of surprises. Figure 3 shows the output of this file. (The actual output is slightly different; I munged it to include the symbol name of the timer function being called.)

• Kernel problems. Whilst this paper focuses on userspace issues, for completeness, I will also enumerate the kernel issues that this profiling highlighted.

– USB. Every 256ms, a timer was firing in the USB code. Apparently the USB 2.0 spec mandates this timer, but if there are no USB devices connected (as was the case when I measured), it does call into question what exactly it is doing. For the purposes of testing, I worked around this with a big hammer, and rmmod'd the USB drivers.


peer_check_expire          181  crond
dst_run_gc                 194  syslogd
rt_check_expire            251  auditd
process_timeout            334  hald
it_real_fn                 410  automount
process_timeout            437  kjournald
process_timeout           1260
it_real_fn                1564  rpc.idmapd
commit_timeout            1574
wb_timer_fn               1615  init
process_timeout           1652  sendmail
process_timeout           1653
process_timeout           1833
neigh_periodic_timer      1931
process_timeout           2218  hald-addon-stor
process_timeout           3492  cpuspeed
delayed_work_timer_fn     4447
process_timeout           7620  watchdog/0
it_real_fn                7965  Xorg
process_timeout          13269  gdmgreeter
process_timeout          15607  python
cursor_timer_handler     34096
i8042_timer_func         35437
rh_timer_func            52912

Figure 3: /proc/timertop


– Keyboard controller. At HZ/20, the i8042 code polls the keyboard controller to see if someone has hot-plugged a keyboard/mouse or not.

– Cursor blinking. Hilariously, at HZ/5 we wake up to blink the cursor. (Even if we are running X, and not sat at a VT.)

• gdm

– For some reason, gdm keeps getting scheduled to do work, even when it is not the active tty.

• Xorg

– X is hitting it_real_fn a lot, even if it is not the currently active VT. Ironically, this is due to X using its 'smart scheduler', which hits SIGALRM regularly, to punish X clients that are hogging the server. Running X with -dumbsched made this completely disappear. At the time it was implemented, itimer was considered the fastest way of getting a timer out of the kernel. With advances from recent years speeding up gettimeofday() through the use of vsyscalls, this may no longer be the most optimal way for it to go about things.

• python

– The python process that kept waking up belonged to hpssd.py, a part of hplip. As I do not have a printer, this was completely unnecessary.

By removing the unneeded services and kernel modules, power usage dropped another watt. Not a huge amount, but significant enough to be measured.

Work is continuing in this area for Fedora Core 6 development, including providing better tools to understand the huge amount of data available. Current gnome-power-manager CVS even has features to monitor /proc/acpi files over time to produce easy-to-parse graphs.

5 Conclusions.

The performance issues discussed in this paper are not typically reported by users. Or, if they are, the reports lack sufficient information to root-cause the problem. That is why it is important to continue to develop tools such as those outlined in this paper, and to run these tools against the code base moving forward. The work is far from over; rather, it is a continual effort that should be engaged in by all involved parties.

With increased interest in power management, not only for mobile devices, but for desktops and servers too, a lot more attention needs to be paid to applications to ensure they are "optimised for power." Further development of monitoring tools such as the current gnome-power-manager work is key to understanding where the problem areas are.

Whilst this paper pointed out a number of specific problems, the key message to be conveyed is that the underlying problem does not lie with any specific package. The problem is that developers need to be aware of the tools that are available, and be informed of the new tools being developed. It is through the use of these tools that we can make Linux not suck.

References

[1] http://people.redhat.com/davej/filemon

[2] http://www.bootchart.org

[3] http://www.gnome.org/~mortenw/files/strace-account

[4] http://people.redhat.com/berrange/systemtap/bootprobe

[5] http://valgrind.org

[6] http://people.redhat.com/drepper/defprogramming.pdf

[7] http://www.splint.org

[8] http://www.codemonkey.org.uk/projects/git-snapshots/sparse

[9] http://lists.freedesktop.org/archives/xorg/2006-March/013992.html
http://blogs.sun.com/roller/page/alanc?entry=security_hole_in_xorg_6

[10] http://www.coverity.com

[11] kill-a-watt: http://www.thinkgeek.com/gadgets/electronic/7657

[12] watts-up: http://www.powermeterstore.com/plug/wattsup.php
