overviewcourses.cs.vt.edu/.../Virtualization/Nemesis-Overview.pdfFairbairns, Eoin Hyden, Ian Leslie, Derek McAuley and Timothy Roscoe. The ﬁrst version of Nemesis was written from

Overview

Editors: Dickon Reed and Robin Fairbairns

May 20, 1997

Contents

Preface iv

1 The Structure of Nemesis 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 The Design Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2.1 Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2.2 Nemesis Trusted Supervisor Code (NTSC) . . . . . . . . . . . . . . . . . . . 3

1.3 Virtual Processor Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.3.1 Activation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3.2 Processor Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3.3 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Inter-Domain Communication (IDC) . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.4.1 Shared Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.4.2 Remote Procedure Call (RPC) . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.4.3 Rbufs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.5 Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.5.1 Inter-Process Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.5.2 Intra-Process Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.6 Device Driver Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121.6.1 Hardware Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121.6.2 Kernel Critical Sections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.7 Nemesis and I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 Architecture overview 152.1 Types and MIDDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.2 Objects and Constructors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.2.1 Pervasives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.2.2 Memory Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.3 Linkage Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.3.1 Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.3.2 Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.3.3 Address Space Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.4 Naming and Runtime Typing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.4.1 Context Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.4.1.1 Ordered Merges of Contexts . . . . . . . . . . . . . . . . . . . . . . 222.4.1.2 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.4.2 Run Time Type System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222.4.3 CLANGER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.5 Domain bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232.5.1 The Builder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

i

3 Structural overview 273.1 NTSC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.1.1 Virtual Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.2 Primal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.3 The Nemesis Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.3.1 Stretch Allocator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.3.2 Binder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.3.3 Interrupt Allocator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.3.4 Console . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.3.5 CallPriv allocator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.3.6 Type system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.4 Shared library services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.4.1 Domain Creation services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.4.1.1 Domain Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.4.1.2 Builder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.4.1.3 Plumber . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.4.2 Inter Domain Communicaton . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.4.2.1 Object Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.4.2.2 Marshalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.4.2.3 Transports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.4.2.4 Stubs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303.4.2.5 Gatekeeper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.4.3 Naming Contexts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303.4.4 Heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303.4.5 IO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.4.5.1 FIFOs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303.4.5.2 IO transports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.4.5.3 IO Entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.4.5.4 Quality of Service Entries . . . . . . . . . . . . . . . . . . . . . . . . 31

3.4.6 Readers and Writers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.4.6.1 Redirectable readers and writers . . . . . . . . . . . . . . . . . . . . 31

3.4.7 Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.4.7.1 Threads Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.4.7.2 Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.4.7.3 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.4.7.4 SRC Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.4.7.5 ANSWare/RT Tasks and Entries . . . . . . . . . . . . . . . . . . . . 32

3.4.8 Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.4.8.1 Loader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.4.8.2 Exec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.4.9 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.4.10 libc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.5 The Trader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4 Multimedia applications 344.1 Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344.2 Audio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.3 Conference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364.4 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

ii

5 Graphics 395.1 S3 Frame Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.1.1 Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395.1.2 Blitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.2 WSSvr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405.2.1 Talking to the Window Manager . . . . . . . . . . . . . . . . . . . . . . . . . 405.2.2 Dealing with the frame buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . 405.2.3 Locking updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405.2.4 Mouse and Keyboard events . . . . . . . . . . . . . . . . . . . . . . . . . . . 405.2.5 Passing events to clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.3 SWM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415.3.1 Window Borders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415.3.2 Moving, Raising and Lowering windows . . . . . . . . . . . . . . . . . . . . 415.3.3 Editing window clip regions . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

6 Communications Support in Nemesis 426.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426.2 Device Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6.2.1 Receive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436.2.2 Transmission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436.2.3 Driver scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

6.3 Flow Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446.4 Application Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

6.4.1 Moving the copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456.5 Current Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

6.5.1 Flow Manager Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 456.5.2 Application Protocol Implementation . . . . . . . . . . . . . . . . . . . . . . 46

6.5.2.1 Constructing a new stack . . . . . . . . . . . . . . . . . . . . . . . . 466.5.3 Device drivers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

7 Memory Management 487.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

7.2.1 Simple . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497.2.2 Current . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

8 Scheduler Accounting 508.1 Scheduler Accounting mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508.2 NFS dump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518.3 Loadbars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518.4 Loadgraph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528.5 Quality of Service control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

9 Build mechanism 539.1 Tree structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

9.1.1 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

Bibliography 57

iii

Preface

The Nemesis operating system has been developed at the University of Cambridge under theægis of the Pegasus1 [Leslie 93] and Pegasus II projects2. Nemesis runs on a number of platformsincluding Pentium and upwards PC architecture machines, the DEC 3000/AXP series of work-stations [DEC 94], DECchip EB164 and DECchip EB64 Alpha evaluation boards [DEC 93], DEC5000/25 (Maxine) [Voth 91], the Fairisle FPC3 Port Controller [Hayter 94] and the DEC SystemsResearch Center IT board.

Development of Nemesis has reached a reasonably stable state, and the present document hasbeen developed in order to expedite use of Nemesis at sites other than the Cambridge ComputerLaboratory: while some hardy groups have managed to use Nemesis in its early instantiations,the strain on all concerned is not sustainable in the large � � �

Nemesis was designed in Cambridge by a team that included Paul Barham, Richard Black, RobinFairbairns, Eoin Hyden, Ian Leslie, Derek McAuley and Timothy Roscoe.

The first version of Nemesis was written from scratch over a 2 year period by Timothy Roscoe,David Evers and Paul Barham. Other related systems are Nemo [Hyden 94] and Fawn [Black 94],both of which had some influence on the original implementation. This first version of Nemesisconstituted a deliverable of the original Pegasus project.

Since then, Nemesis has been further developed by Paul Barham, Richard Black, Steven Hand,Dickon Reed, Austin Donnelly, Stephen Early, Neil Stratford, Paul Menage and KookJin Nam.A variety of deliverables of the Pegasus II project constitute ports of Nemesis, or developmentsof Nemesis, but the project has also to deliver a variety of things that are, in Nemesis parlance,applications. Nemesis is moving out of the laboratory!

Chapters 1 and 2 derive (respectively) from [Barham 96] and [Roscoe 95b]. Other chapters havebeen written specifically for the purposes of this document.

Comments, either pointing out problems or correcting errors, will be welcomed by the editors.Such comments should be posted to the newsgroup nemesis�misc3, or mailed direct to one ofthe editors, rf�cl�cam�ac�uk or dr��cl�cam�ac�uk

1Pegasus was funded by the European Commission under the Esprit programme, as project BRA 68652Pegasus II is funded by the European Commission under the Esprit programme, as project LTR 21 9173The newsgroup is available to Pegasus II partners only

iv

Chapter 1

The Structure of Nemesis

1.1 Introduction

The purpose of an operating system is to multiplex shared resources between applications. Tra-ditional operating systems have presented physical resources to applications by virtualisation,e.g. UNIX applications run in virtual time on a virtual processor – most are unaware of the pas-sage of real time and that they often do not receive the CPU for prolonged periods. The operatingsystem proffers the illusion that they are exclusive users of the machine.

Multimedia applications tend to be sensitive to the passage of real time. They need to knowwhen they will have access to a shared resource and for how long. In the past, it has been consid-ered sufficient to implement little more than access-control on physical resources. It is becomingincreasingly important to account, schedule and police shared resources so as to provide someform of Quality of Service (QoS) guarantee.

Whilst it is necessary to provide the mechanisms for multiplexing resources, it is important that theamount of policy hard-wired into the operating system kernel is kept to an absolute minimum.That is, applications should be free to make use of system-provided resources in the mannerwhich is most appropriate. At the highest level, a user may wish to impose a globally consistentpolicy, but in the Nemesis model this is the job of a QoS-Manager agent acting on the user’sbehalf and under the user’s direction. This is analogous to the use of a window manager to allowthe user to control the decoration, size and layout of windows on the screen, but which does nototherwise constrain the behaviour of each application.

1.2 The Design Principles

Nemesis was designed to provide QoS guarantees to applications. In a microkernel environment,an application is typically implemented by a number of processes, most of which are servers per-forming work on behalf of more than one client. This leads to enormous difficulty in accountingresource usage to the application. The guiding principle in the design of Nemesis was to struc-ture the system in such a way that the vast majority of functionality comprising an applicationcould execute in a single process, or domain. As mentioned previously, this leads to a vertically-structured operating system (figure 1.1).1

The Nemesis kernel consists of a scheduler (one version was less than 250 instructions) and asmall amount of code known as the Nemesis Trusted Supervisor Code (NTSC), used for Inter-

1This diagram is derived from one in [Hyden 94].

1

Stubs

Hardware

NTSC

SystemDomains

DeviceDriver

Domains

ApplicationDomains

UserMode

Kernel Mode

Figure 1.1: The Structure of a Nemesis System

Domain Communication (IDC) and to interact with the scheduler. The kernel also includes theminimum code necessary to initialise the processor immediately after booting and handle proces-sor exceptions, memory faults, unaligned accesses, TLB misses and all other low-level processorfeatures. The Nemesis kernel bears a striking resemblance to the original concept of an operatingsystem kernel or nucleus expressed in [Brinch-Hansen 70].

The kernel demultiplexes hardware interrupts to the stage where a device specific first-level in-terrupt handler may be invoked. First-level interrupt handlers consist of small stubs which maybe registered by device-drivers. These stubs are entered with all interrupts disabled and withthe minimal number of registers saved and usually do little more than send notification to theappropriate device driver.

1.2.1 Domains

The term domain is used within Nemesis to refer to an executing program and can be thoughtof as analogous to a UNIX process – i.e. a domain encapsulates the execution state of a Nemesisapplication. Each domain has an associated scheduling domain (determining CPU time allocation)and protection domain (determining access rights to regions of the virtual address space).

Nemesis is a Single Address Space (SAS) operating system i.e. any accessible region of physicalmemory appears at the same virtual address in each protection domain. Access rights to a regionof memory, however, need not be the same. The use of a single address space allows the useof pointers in shared data-structures and facilitates rich sharing of both program text and dataleading to significant reduction in overall system size.

All operating system interfaces are written using an Interface Definition Language (IDL) knownas MIDDL which provides a platform and language independent specification. Modules, withinterfaces written in MIDDL,2 are used to support the single address space and allow operatingsystem code to be migrated into the application. These techniques are discussed in detail in[Roscoe 95b].

2Files containing MIDDL interfaces in Nemesis by convention have a �if suffix, e.g. Activation�if

2

Unprivileged DomainsName Purpose

ntsc rfa Return from activation.ntsc rfa resume Return from activation, restoring a context.ntsc rfa block Return from activation and block.

ntsc block Block awaiting an event.ntsc yield Relinquish CPU allocation for this period.ntsc send Send an event.

Privileged DomainsName Purpose

ntsc swpipl Change interrupt priority level.ntsc entkern Enter kernel mode.

ntsc leavekern Leave kernel mode.ntsc regstub Register an interrupt stub.ntsc kevent Send an event from an interrupt stub.

ntsc rti Return from an interrupt stub.

Table 1.1: Alpha NTSC Call Interface.

1.2.2 Nemesis Trusted Supervisor Code (NTSC)

The NTSC is the low level operating system code within the Nemesis kernel which may be in-voked by user domains. Its implementation is potentially architecture and platform specific; forexample, the NTSC is implemented almost entirely in PALcode on Alpha platforms [Sites 92],whilst, on MIPS [Kane 88], Intel and ARM [ARM 91] platforms, the NTSC is invoked using thestandard system call mechanisms.

The NTSC interface may be divided into two sections – those calls which may be invoked by anydomain and those which may only be invoked by a privileged domain. Unprivileged calls areused for interaction with the kernel scheduler and to send events to other domains. Privilegedcalls are used to affect the processor mode and interrupt priority level and to register first-levelinterrupt stubs. As an example, table 1.1 lists the major NTSC calls for the Alpha architecture.

The NTSC interacts with domains and the kernel scheduler via a per-domain area of shared mem-ory known as the Domain Control Block (DCB). Portions of the DCB are mapped read-write intothe domain’s address-space, whilst others are mapped read-only to prevent modification of priv-ileged state. The read-only DCB contains scheduling and accounting information used by thekernel, the domain’s privilege level, read-only data structures used for implementing IDC chan-nels and miscellaneous other information. The read-write section of the DCB contains an arrayof processor-context save slots and user-writable portions of the IDC channel data structures.

1.3 Virtual Processor Interface

Nemesis presents the processor to domains via the Virtual Processor Interface (VP�if). This in-terface specifies a platform independent abstraction for managing the saving and restoring ofCPU context, losing and regaining the real processor and communicating with other domains. Itdoes not however attempt to hide the multiplexing of the underlying processor(s). The virtualprocessor interface is implemented over the NTSC calls described in section 1.2.2.

3

1.3.1 Activation

Whenever a domain is given the CPU, it is upcalled via a vector in the DCB known as the activa-tion handler, in a similar manner to Scheduler Activations [Anderson 92]. A flag is set disablingfurther upcalls until the domain leaves activation mode allowing code on the activation vectorto perform atomic operations with little or no overhead. Information is made available to theactivation handler including an indication of the reason why the domain has been activated, thetime when it last lost the real processor and the current system time. The purpose of this upcall isto afford QoS-aware applications an opportunity to assess their progress and make application-specific policy decisions so as to make most efficient usage of the available resources.

When a domain is handed the processor, it is informed whether it is currently running on guar-anteed time, or merely being offered use of some of the slack-time in the system. QoS-aware ap-plications must take account of this before deciding to adapt to apparent changes in system load.This may be used to prevent QoS feedback mechanisms from reacting to transient improvementsin resource availability.

1.3.2 Processor Context

When a virtual processor loses the CPU, its context must be saved. The DCB contains an arrayof context-save slots for this purpose. Two indices into this array specify the slots to use when inactivation mode and when in normal mode, based on the current state of the activation flag.

When a domain is preempted it will usually be executing a user-level thread. The context ofthis thread is stored in the save slot of the DCB and may be reloaded by the activation handlerof the domain when it is next upcalled. If a domain is preempted whilst in activation mode, theprocessor context is saved in the resume slot and restored transparently when the domain regainsthe CPU rather than the usual upcall.

1.3.3 Events

The only means of communication directly provided by the Nemesis kernel is the event. Eachdomain has a number of channel-endpoints which may be used either to transmit or to receiveevents. A pair of endpoints may be connected by a third party known as the Binder, to providean asynchronous simplex communications channel.

This channel may be used to transmit a single 64-bit value between two domains. The eventmechanism was designed purely as a synchronisation mechanism for shared memory commu-nication, although several simple protocols have been implemented which require nothing morethan the event channel itself, e.g. the TAP protocol, described in [Black 94], used for start-of-daycommunication with the binder. Unlike message-passing systems such as Mach [Accetta 86] orChorus [Rozier 90], the kernel is not involved in the transfer of bulk data between two domains.

Nemesis also separates the act of sending an event and that of losing the processor. Domains mayexploit this feature to send a number of events before being preempted or voluntarily relinquish-ing the CPU. For bulk data transports such as the Rbufs mechanism, described in section 1.4.3,pipelined execution is usually desirable and the overheads of repeatedly blocking and unblock-ing a domain may be avoided. For more latency-sensitive client-server style communication adomain may choose to cause a reschedule immediately in order to give the server domain achance to execute.

4

IDC Transport ServerCode

DCB DCB

ObjectTbl

Binder Domain

Client Domain Server Domain

Binder ObjectTbl

ClientStubs

Client

Offer

Export

ServerStubs Main

2

1

3 BindIDC Offer IDC Service

Figure 1.2: Offering an RPC service

1.4 Inter-Domain Communication (IDC)

Various forms of IDC have been implemented on top of the Nemesis event mechanism. Some ofthe most commonly used are described below.

1.4.1 Shared Data Structures

Since Nemesis domains share a single address space, the use of shared memory for communi-cation is relatively straightforward. Data structures containing pointers are globally valid andthe only further requirement is to provide some synchronisation mechanism to allow the datastructure to be updated atomically and to prevent readers from seeing an inconsistent state. Verylightweight locking primitives may easily be built on top of the kernel-provided event mecha-nism.

1.4.2 Remote Procedure Call (RPC)

Same-machine Remote Procedure Call (RPC) [Birrell 84] is widely used within Nemesis. Al-though the majority of operating system functionality is implemented within the application,there are many out-of-band operations which require interaction with a server in order to updatesome shared state.

When a server wishes to offer a service to other domains, it locates an IDC transport module andrequests a new offer (step 1 in figure 1.2) from it. It then places it into the Object Table of the serverdomain (2). The object table maintains a mapping between offers and interfaces.

Before RPC can take place, the offer must be given to the client by the server. Typically, the serverwill place an offer into a traded namespace shared between all domains on the system and theclient will retrieve it from there but it could just be passed directly.

When the client obtains the offer, it invokes a bind operation on the offer (step 3 in figure 1.2,step 1 in figure 1.3). This causes the offered IDC service to be added to the client domain’sobject table. A third party domain, the binder, then establishes the event channels necessary for

5

Server

DCB DCB

ObjectTbl

Binder Domain


Binder ObjectTbl

SharedMemory

GateKeeper GateKeeper

Client

Connect Request

Import

IDC Offer ClientStubs

ServerStubs

IDC Service

Bind

2

1

3 4

Request5

Figure 1.3: Binding to an RPC service

invocations between the client and the server. The binder domain also causes the connectiondetailed to be looked up in to the object table of the server and thus the server side connectionstate to be established.

The default RPC transport is based on an area of shared memory and a pair of event channelsbetween the client and server domains. To make an invocation (step 1 in figure 1.4), the clientmarshalls an identifier for the call and the invocation arguments into the shared memory (step 2)and sends an event to the server domain (step 3). The server domain receives the event, unmar-shalls the arguments (step 4) and performs the required operation (step 5). The results of thecall, or any exception raised are then marshalled into the shared memory (step 6) and an eventsent back to the client (step 7). The client unmarshalls the result (step 8) and returns from theclient stub. Marshalling code and the client and server stubs are generated automatically fromthe MIDDL interface definition and loaded as shared libraries.

The average cost of a user-thread to user-thread null-RPC between two Nemesis domains, usingthe default machine-generated stubs and the standard user-level threads package, was measuredat just over ��s on the Sandpiper platform [Roscoe 95b].

It is worth noting that the above complexity is largely hidden by the use of standard macros.Typical code on the server side looks like:

ServerType server�

IDC�EXPORTsvc�myservice� ServerType� server �

And, on the client side:

ServerType server�

server � IDC�OPENsvc�myservice� ServerType �

result � Server�Methodserver� args �

1.4.3 Rbufs

Whilst RPC provides a natural abstraction for out-of-band control operations and transactionstyle interactions, it is ill-suited to transfer of bulk data. The mechanism adopted by Nemesis fortransfer of high volume packet-based data is the Rbufs scheme detailed in [Black 94]. The trans-

6

Server

DCB DCB


SharedMemory

Client

ClientStubs

ClientBinding

Invoke

Send Request

2PerformOperation5

ServerStubs

ServerBinding

Threads

VP

Marshal Arguments

UnmarshalResults

1

8

3

4UnmarshalArguments

6MarshalResults

Send Reply7

Threads

VP

Figure 1.4: Invoking an RPCservice

7

Data Area

Data Data Data

Control Area(circular buffer)

headtail

2 1

Event Counts

(a) Control Area Usage

FullEmpty

Unused

Control Area (B to A)

Domain BDomain A Data Area

Control Area (A to B

Event Channelsheadtail

headtail

headtail

headtail

(b) Overview

Figure 1.5: High Volume I/O Using Rbufs

8

port mechanism is once again implemented using Nemesis event-channels and shared memory.Three areas of shared memory are required as shown in figure 1.5. One contains the data tobe transferred and the other two are used as FIFOs to transmit packet descriptors between thesource and sink. The head and tail pointers of these FIFOs are communicated by Nemesis event-channels.

Packets comprised of one or more fragments in a large pool of shared memory are described by asequence of �base� length� pairs known as iorecs. Figure 1.5a shows iorecs describing two packets,one with two fragments and the other with only a single fragment. Rbufs are highly suited toprotocol processing operations since they allow simple addition or removal of both headers andtrailers and facilitate segmentation and reassembly operations.

In receive mode, the sink sends iorecs describing empty buffer space to the source, which fillsthe buffers and updates the iorecs accordingly before returning them to the sink. In transmitmode, the situation is the converse. The closed loop nature of communication provides back-pressure and feedback to both ends of the connection when there is a disparity between the ratesof progress of the source and sink.

The intended mode of operation relies on the ability to pipeline the processing of data in order toamortise the context-switch overheads across a large number of packets. Sending a packet on anRbufs connection does not usually cause a domain to lose the CPU. Figure 1.6 shows the MIDDLinterface type for the Rbufs transport.

1.5 Scheduling

Scheduling can be viewed as the process of multiplexing the CPU resource between computa-tional tasks. The schedulable entity of an operating system often places constraints both on thescheduling algorithms which may be employed and the functionality provided to the application.

The recent gain in popularity of multi-threaded programming due to languages such as Modula-3 [Nelson 91] has led many operating system designers to provide kernel-level thread supportmechanisms [Accetta 86, Rozier 90]. The kernel therefore schedules threads rather than pro-cesses. Whilst this reduces the functionality required in applications and usually results in moreefficient processor context-switches, the necessary thread scheduling policy decisions must alsobe migrated into the kernel. As pointed out in [Barham 96], this is highly undesirable.

Attempts to allow applications to communicate thread scheduling policy to the kernel scheduler[Coulson 93, Coulson 95] lead to increased complexity of the kernel and the possibility for unco-operative applications to misrepresent their needs to the operating system and thereby gain anunfair share of system resources. For example, in the above systems user processes are requiredto communicate the earliest deadline of any of their threads to the kernel thread scheduler.

Nemesis allows domains to employ a split-level scheduling regime with the multiplexing mech-anisms being implemented at a low level by the kernel and the application-specific policy deci-sions being taken at user-level within the application itself. Note that the operating system onlymultiplexes the CPU resource once. Most application domains make use of a threads packageto control the internal distribution of CPU resource between a number of cooperating threads ofexecution.

1.5.1 Inter-Process Scheduling

Inter-process scheduling in Nemesis is performed by the kernel scheduler. This scheduler isresponsible for controlling the exact proportions of bulk processor bandwidth allocated to each

9

IO � LOCAL INTERFACE �NEEDS IDC�

BEGIN

�� An �IO� channel has to be one of two kinds� a �Read�er or�� Write�r� Readers remove valid packets from the channel with�� GetPkt� and send back empty �IORec�s with �PutPkt�� Writers�� send valid packets with �PutPkt� and acquire empty �IORec�s by�� calling �GetPkt��Kind � TYPE � Read� Write ��

�� The values passed through �IO� channels are �IORecs�� essentially �base� � �length� pairs describing the data�Rec � TYPE � RECORD

base � ADDRESS�len � WORD

��

�� PutPkt� sends a vector of �IORec�s down the channel� The�� operation sends �nrecs� records in a vector starting at �recs� in�� memory� Of these� the first �validrecs� are declared as holding�� useful data�PutPkt � PROC nrecs � CARDINAL�

recs � REF Rec�validrecs � CARDINAL �

RETURNS �� Send a vector of I�O records down the channel� or release them�� at the receiving end�

�� GetPkt� acquires a maximum of �maxrecs� �IORec�s� which are�� copied into memory at address �recs�� At the receive end these�� typically constitute a packet� which uses the first �validrecs�� for pointing to its data� The total number of records read is�� returned in �nrecs��GetPkt � PROC maxrecs � CARDINAL�

recs � REF Rec�OUT validrecs � CARDINAL �

RETURNS nrecs � CARDINAL �� Pull a vector of I�O records out of the channel�

�� PutPktNoBlock� sends a packet assuming that the client has�� already determined that �PutPkt� would not block�PutPktNoBlock � PROC nrecs � CARDINAL�

recs � REF Rec�validrecs � CARDINAL �

RETURNS �� Guaranteed non�blocking �PutPkt��

�� GetPktNoBlock� checks whether it would block� and returns�� False� if this is the case�GetPktNoBlock � PROC maxrecs � CARDINAL�

recs � REF Rec�OUT nrecs � CARDINAL�OUT validrecs � CARDINAL �

RETURNS avail � BOOLEAN �� As �GetPkt�� but fails rather than block�

�� GetPoolInfo� returns information about the pool used to send�� data�GetPoolInfo � PROC OUT buf� IDC�Buffer � RETURNS ��

�� Return the main pool buffer�

Slots � PROC � RETURNS ns� CARDINAL �� Return the number of slots of the tx fifo�

Dispose � PROC � RETURNS ��

END�

Figure 1.6: MIDDL interface for Rbufs (IO�if)

10

domain according to a set of QoS parameters in the DCB. Processor bandwidth requirements arespecified using a tuple of the form �p� s� x� l� with the following meaning:

p The period of the domain in ns.

s The slice of CPU time allocated to the domain every period in ns.x A flag indicating willingness to accept extra CPU time.l A latency hint to the kernel scheduler in ns.

The p and s parameters may be used both to control the amount of processor bandwidth andthe smoothness with which it is provided. The latency hint parameter is used to provide thescheduler with an idea as to how soon the domain should be rescheduled after unblocking.

The kernel scheduler interacts with the event mechanism allowing domains to block until theynext receive an event, possibly with a timeout. When a domain blocks it loses any remainingCPU allocation for its current period – it is therefore in the best interest of a domain to completeas much work as possible before giving up the processor.

The current kernel scheduler employs a variant of the Earliest Deadline First (EDF) algorithm[Liu 73] where the deadlines are derived from the QoS parameters of the domain and are purelyinternal to the scheduler. The scheduler is capable of ensuring that all guarantees are respectedprovided that

Pisipi� � and is described in detail in [Roscoe 95b]. Despite the internal use of

deadlines, this scheduler avoids the inherent problems of priority or deadline based schedulingmechanisms which focus on determining who should be allocated the entire processor resourceand provide no means to control the quantity of resource handed out.

In order to provide fine-grained timeliness guarantees to applications which are latency sensi-tive, higher rates of context-switching are unavoidable. The effects of context-switches on cacheand memory-system performance are analysed in [Mogul 91]. Mogul showed that a high rateof context switching leads to excessive numbers of cache and translation lookaside buffer (TLB)misses reducing the performance of the entire system. The use of a single address space in Neme-sis removes the need to flush a virtually addressed cache on a context switch and the process-IDfields present in most TLBs can be used to reduce the number of TLB entries which need to beinvalidated. The increased sharing of both code and data in a SAS environment also helps toreduce the cache-related penalties of context-switches.

1.5.2 Intra-Process Scheduling

Intra-process scheduling in a multimedia environment is an entirely application-specific area.Nemesis does not have a concept of kernel threads for this reason. A domain may use a user-levelscheduler to internally distribute the CPU time provided by the kernel scheduler using its ownpolicies. The application specific code for determining which context to reload is implementedin the domain itself.

The activation mechanism described in section 1.3.1 provides a convenient method for imple-menting a preemptive user-level threads package. The current Nemesis distribution providesboth preemptive and non-preemptive threads packages as shared library code.

The default thread schedulers provide lightweight user-level synchronisation primitives such asevent counts and sequencers [Reed 79] and the mutexes and condition variables of SRC threads[Birrell 87]. The implementation of various sets of synchronisation primitives over the top ofevent counts and sequencers is discussed in [Black 94].

It is perfectly possible for a domain to use an application specific threads package, or even torun without a user-level scheduler. A user-level threads package based on the ANSAware/RT[ANSA 95] concepts of Tasks and Entries has been developed as part of the DCAN project at

11

the Computer Laboratory.3 The ANSAware/RT model maps naturally onto the Nemesis VirtualProcessor interface.

1.6 Device Driver Support

In order to present shared I/O resources to multiple clients safely, device-drivers are necessary.The driver is responsible for ensuring that clients are protected from each other and that the hard-ware is not programmed incorrectly. This often involves context-switching the hardware betweenmultiple concurrent activities. The exact nature of the hardware dictates the methods employedand therefore the level of abstraction at which a device may be presented to applications.

Device drivers typically require access to hardware registers which can not safely be made ac-cessible directly to user-level code. This can be achieved by only mapping the registers into theaddress space of the device driver domain.

Some hardware registers are inherently shared between multiple device drivers, e.g. interruptmasks and bus control registers. The operating system must provide a mechanism for atomic up-dates to these registers. In kernel-based operating systems this has traditionally been performedby use of a system of interrupt-priority levels within the kernel. On most platforms, Nemesisprovides similar functionality via privileged NTSC calls.

In the design of Nemesis it was considered essential that it was possible to limit the use of sys-tem resources by device driver code so that the behaviour of the system under overload couldbe controlled. For this reason, Nemesis device drivers are implemented as privileged domainswhich are scheduled in exactly the same way as other domains, but have access to additionalNTSC calls.

1.6.1 Hardware Interrupts

The majority of I/O devices have been designed with the implicit assumption that they can asyn-chronously send an interrupt to the operating system which will cause appropriate device-drivercode to be scheduled immediately with absolute priority over all other tasks. Indeed, failureto promptly service interrupt requests from many devices can result in serious data loss. It isironic that serial lines, the lowest bit-rate I/O devices on most workstations, often require themost timely processing of interrupts due to the minimal amounts of buffering and lack of flow-control mechanisms in the hardware. [Barham 96] describes how this phenomenon influencesDMA arbitration logic on the Sandpiper.

More recently designed devices, particularly those intended for multimedia activities, are moretolerant to late servicing of interrupts since they usually have more internal buffering and areexpected to cope with transient overload situations.

In order to effectively deal with both types of device, Nemesis allows drivers to register smallsections of code known as interrupt-stubs to be executed immediately when a hardware interruptis raised. These sections of code are entered with a minimal amount of saved context and with allinterrupts disabled. They thus execute atomically. In the common case, an interrupt-stub will dolittle more than send an event to the associated driver causing it to be scheduled at a later date,but for devices which are highly latency sensitive it is possible to include enough code to preventerror conditions arising. The unblocking latency hint to the kernel scheduler is also useful forthis purpose.

3This work was performed by Timothy Roscoe and David Evers.

12

This technique of decoupling interrupt notification from interrupt servicing is similar to thescheme used in Multics which is described in [Reed 76], but the motivation in Nemesis is to al-low effective control of the quantity of resources consumed by interrupt processing code, ratherthan for reasons of system structure. [Dixon 92] describes a situation where careful adjustmentof the relative priorities of interrupt processing threads led to increased throughput under highloads when drivers were effectively polling the hardware and avoiding unnecessary interruptoverhead. The Nemesis mechanisms are more generic and have been shown to provide betterthroughput on the same hardware platform [Black 94].

1.6.2 Kernel Critical Sections

The majority of device-driver code requires no privilege, but small regions of device driver codeoften need to execute in kernel mode. For example, performing I/O on a number of processorsrequires the use of instructions only accessible within a privileged processor mode. Nemesisprovides a lightweight mechanism for duly authorised domains to switch between kernel anduser mode.4

Although the current implementation requires explicit calls to enter and exit kernel mode, analternative would be to register these code sections (ranges of PC values) in advance and performthe switch to kernel mode when the processor detects a privilege violation. The PC range tablesgenerated by many compilers to enable efficient language-level exception mechanisms may beused for this purpose. Although this support is expected soon, the version of gcc currently in usedoes not include these features.5

1.7 Nemesis and I/O

Although Nemesis is intended as an operating system for a personal multimedia workstation,much of the previous experimental work has been evaluated using workloads which are unrep-resentative of those found in multimedia systems – i.e. processes which are entirely CPU-bound.The Nemesis approach to QoS provision has been proven to be highly successful in environmentswhere the bottleneck resource for all applications is the CPU.

The work described in [Roscoe 95b] approaches the QoS-crosstalk problem by migrating oper-ating system code into user domains. Whilst this solution works well for code which does notrequire to run with elevated privilege, such as protocol-processing code, it cannot be used in sit-uations which require write access to shared state or access to hardware registers. A multimediasystem by definition deals with a large volume of I/O which invariably involves privileged op-erations at the lowest levels. Since these operations must therefore be performed by a privilegeddomain and Nemesis provides no low-level mechanisms for resource transfer, some degree ofQoS-crosstalk is unavoidable.

The problem of effective control over I/O resources is tackled more convincingly in [Black 94].Although a number of useful mechanisms for streamlining I/O in a connection-oriented environ-ment are presented, the prototype system known as Fawn was designed for the port-controller ofan Asynchronous Transfer Mode (ATM) switch and so multimedia activities were restricted. Theonly high-bandwidth I/O device available was an ATM interface which required use of the CPUon a per-cell basis.

Explicitly scheduling the activities of the ATM device driver as a user-level domain, rather thanperforming the cell-forwarding function in an interrupt handler with no resource accounting,

4The implementation takes approximately 16 PALcode instructions on the Alpha.5Version 2.7.2.

13

was demonstrated both to improve overall throughput and to allow QoS firewalls to be in-troduced protecting various other activities such as connection management. This scheduling,however, was only effective due to the the lack of DMA support causing the CPU resource to bethe system bottleneck. Provision of QoS guarantees during concurrent use of the ATM deviceby multiple clients was not addressed, but would require high-level QoS management functionsand more sophisticated intra-process scheduling mechanisms within the device driver.

1.8 Summary

Nemesis as described in [Roscoe 95b] provides highly effective mechanisms for multiplexing theCPU resource between a number of concurrent activities according to QoS contracts negotiatedout-of-band with a QoS manager.

For these guarantees to be meaningful, the majority of in-band operations traditionally per-formed by the operating system are performed by unprivileged code in shared libraries formingpart of the application. Only infrequent out-of-band operations are performed by trusted serversrequired to maintain shared state in a consistent manner.

The CPU, however, is only one of a number of resources required by a second-generation multi-media application. Effective partitioning of other system resources, particularly those involvedin I/O, has not been previously addressed.

14

Chapter 2

Architecture overview

The programming model of Nemesis is a framework for describing how programs are structured;in a sense, it is how a programmer thinks about an application. In particular, it is concerned withhow components of a program or subsystem interact with one another.

The goal of the programming model is to reduce complexity for the programmer. This is particu-larly important in Nemesis where applications tend to be more complex as a consequence of thearchitecture. The model is independent of programming language or machine representation,though its form has been strongly influenced by the model of linkage to be presented in section2.3.

In systems, complexity is typically managed by the use of modularity: decomposing a com-plex system into a set of components which interact across well-defined interfaces. In softwaresystems, the interfaces are often instances of abstract data types (ADTs), consisting of a set ofoperations which manipulate some hidden state. This approach is used in Nemesis.

Although the model is independent of representation, it is often convenient to describe it in termsof the two main languages used in the implementation of Nemesis: the interface definition lan-guage MIDDL and a stylised version of the programming language C.

2.1 Types and MIDDL

Nemesis, like Spring, is unusual among operating systems in that all interfaces are stronglytyped, and these types are defined in an interface definition language. It is clearly important,therefore, to start with a good type system, and [Evers 93] presents a good discussion of the issuesof typing in a systems environment. As in many RPC systems, the type system used in Nemesisis a hybrid: it includes notions both of the abstract types of interfaces and of concrete data types.It represents a compromise between the conceptual elegance and software engineering benefitsof purely abstract type systems such as that used in Emerald [Raj 91], and the requirements of ef-ficiency and inter-operability: the goal is to implement an operating system with few restrictionson programming language.

Concrete types are data types whose structure is explicit. They can be predefined (such asbooleans, strings, and integers of various sizes) or constructed (as with records, arrays, etc). Thespace of concrete types also includes typed references to interfaces1.

1The term interface reference is sometimes used to denote a pointer to an interface. Unfortunately, this can lead toconfusion when the reference and the interface are in different domains or address spaces. [Roscoe 95b] gives a betterdefinition of an interface reference. In the local case described in this chapter, interfaces references can be thought of aspointers to interfaces.

15

Interfaces are instances of ADTs. Interfaces are rarely static: they can be dynamically created andreferences to them passed around freely. The type system includes a simple concept of subtyping.An interface type can be a subtype of another ADT, in which case it supports all the operationsof the supertype, and an instance of the subtype can be used where an instance of the supertypeis required.

The operations supported by interfaces are like procedure calls: they take a number of argumentsand normally return a number of results. They can also raise exceptions, which themselves cantake arguments. Exceptions in Nemesis behave in a similar way to those in Modula-3 [Nelson 91].

Interface types are defined in an IDL called MIDDL [Roscoe 94b]. MIDDL is similar in function-ality to the IDLs used in object-based RPC systems, with some additional constructs to handlelocal and low-level operating system interfaces. A MIDDL specification defines a single ADTby declaring its supertype, if any, and giving the signatures of all the operations it supports. Aspecification can also include declarations of exceptions, and concrete types. Figure 2.1 shows atypical interface specification, the (slightly simplified) definition of the Context interface type.

Context � LOCAL INTERFACE �

NEEDS Heap�

NEEDS Type�

BEGIN

��

�� Interface to a naming context�

��

Exists � EXCEPTION ��

�� Name is already bound�

�� Type used for listing names in a context�

Names � TYPE � SEQUENCE OF STRING�

�� List returns all the names bound in the context�

List � PROC ��

RETURNS � nl � Names �

RAISES Heap�NoMemory�

�� Get maps pathnames to objects�

Get � PROC � IN name � STRING�

OUT o � Type�Any �

RETURNS � found � BOOLEAN ��

�� Add binds an object to a pathname�

Add � PROC � name � STRING� obj � Type�Any �

RETURNS ��

RAISES Exists�

�� Remove deletes a binding�

Remove � PROC � name � STRING � RETURNS ��

END�

Figure 2.1: MIDDL specification of the Context interface type

16

2.2 Objects and Constructors

The word object in Nemesis denotes what lies behind an interface: an object consists of state andcode to implement the operations of the one or more interfaces it provides. A class is a set ofobjects which share the same underlying implementation, and the idea of object class is distinctfrom that of type, which is a property of interfaces rather than objects.

This definition of an object as hidden state and typed interfaces may be contrasted with the useof the term in some object-oriented programming languages like C++ [Stroustrup 91]. In C++there is no distinction between class and type, and hence no clear notion of an interface2. Thetype of an interface is always purely abstract: it says nothing about the implementation of anyobject which exports it. It is normal to have a number of different implementations of the sametype.

When an operation is invoked upon an object across one of its interfaces, the environment inwhich the operation is performed depends only on the internal state of the object and the argu-ments of the invocation. There are no global symbols in the programming model. Apart from thebenefits of encapsulation this provides, it facilitates the sharing of code described in section 2.3.

An object is created by an invocation on an interface, which returns a set of references to theinterfaces exported by the new object. As in Emerald, constructors are the basic instantiationmechanism rather than classes. By removing the artificial distinction between objects and themeans used to create them, creation of interfaces in the operating system can be more flexible thanthe ‘institutionalised’ mechanisms of language runtime systems. This is particularly importantin the lower levels of an operating system, where a language runtime is not available.

2.2.1 Pervasives

The programming model described so far enforces strict encapsulation of objects: the environ-ment in which an interface operation executes is determined entirely by the operation argumentsand the object state. Unfortunately, there are cases where this is too restrictive from a practicalpoint of view. Certain interfaces provided by the operating and runtime systems are used sopervasively by application code that it is more natural to treat them as part of the thread contextthan the state of some object. These include:

� Exception handling

� Current thread operations

� Domain control

� Default memory allocation heap

Many systems make these interfaces ‘well-known’, and hardwired into programs either as part ofthe programming language or as procedures linked into all images. This approach was rejected inNemesis: the objects concerned have domain-specific state which would have to be instantiatedat application startup time. This conflicts with the needs of the linkage model (section 2.3), inparticular, it severely restricts the degree to which code and data can be shared. Furthermore, thesimplicity of the purely object-based approach allows great flexibility, for example in running thesame application components simultaneously in very different situations.

However, passing references to all these interfaces as parameters to every operation is ugly andcomplicates code. The references could be stored as part of the object state, but this still requiresthat they be passed as arguments to object constructors, and complicates the implementation

2C++ abstract classes often contain implementation details, and were added as an afterthought [Stroustrup 94, p. 277].

17

of objects which would otherwise have no mutable state (and could therefore be shared amongdomains as is).

Pervasive interfaces are therefore viewed as part of the context of the currently executing thread.As such they are always available, and are carried across an interface when an invocation ismade. This view has a number of advantages:

� The references are passed implicitly as parameters.

� Pervasives are context switched with the rest of the thread state.

� If necessary, particular interfaces can be replaced for the purposes of a single invocation.

2.2.2 Memory Allocation

The programming model has to address the problem of memory allocation. An invocation acrossan interface can cause the creation of a concrete structure which occupies an area of memory.There needs to be a convention for determining:

� where this memory is allocated from, and

� how it may be freed.

In many systems the language runtime manages memory centrally (to the domain) and all ob-jects may be allocated and freed in the same way. Some systems provide a garbage collector forautomatic management of storage.

Unfortunately, Nemesis does not provide a central garbage collector3 and a domain typically hasa variety of pools to allocate memory from, each corresponding to an interface of type Heap (mul-tiple heaps are used to allocate shared memory from areas with different access permissions).Moreover, it is desirable to preserve a degree of communication transparency: wherever pos-sible, a programmer should not need to know whether a particular interface is exported by anobject local to the domain or is a surrogate for a remote one.

Network-based RPC systems without garbage collection use conventions to decide when the RPCruntime has allocated memory for unmarshalling large or variable-sized parameters. Usually thismemory is allocated by the language heap, although some RPC systems have allowed callers tospecify different heaps at bind time (for example, [Roscoe 94c]). To preserve transparency, in allcases the receiver of the data is responsible for freeing it. This ensures that the application codeneed not be aware of whether a local object or the RPC run time system has allocated memory.

In systems where caller and object are in different protection domains but share areas of memory,the situation is complicated because of the desire to avoid unnecessary memory allocation anddata copies. Ideally, the conventions used should accommodate both the cases where the callerallocates space for the results in advance, and where the callee allocates space on demand fromcaller memory during the invocation.

Nemesis uses parameter passing modes to indicate memory allocation policy: each parameter ina MIDDL operation signature has an associated mode, which is one of the following:

IN Memory is allocated and initialised by client.Client does not alter parameter during invocation.Server may only access parameter during invocation, and cannot alter parameter.

IN OUT Memory is allocated and initialised by client.Client does not alter parameter during invocation.Server may only access parameter during invocation, and may alter parameter.

3The problems of garbage collection in an environment where most memory is shared between protection domains isbeyond the scope of this thesis. This issue is touched upon in the ‘conclusions’ chapter of [Roscoe 95b].

18

OUT Memory is allocated but not initialised by client.Server may only access parameter during invocation, and is expected to initialise it.

RESULT Memory is allocated by server, on client pervasive heap, and result copied into it.Pointer to this space is returned to the client.

The OUT mode allows results to be written by a local object into space already allocated by theclient (in the stack frame, for example). In the remote case, it is more efficient than the IN OUT

mode because the value does not need to be transmitted to the server; it is only returned.

These modes are all implemented on the Alpha processor using call by reference, except RESULT,which returns a pointer to the new storage. For values small enough to fit into a machine word,IN is coded as call by value and RESULT returns the value itself rather than a reference to it.

These conventions have been found to cover almost all cases encountered in practice. As a lastresort, MIDDL possesses a REF type constructor which allows pointers to values of a particulartype to be passed explicitly.

2.3 Linkage Model

The linkage model concerns the data structures used to link program components, and their inter-pretation at runtime. An early version of the linkage mechanism was described in [Roscoe 94a].Its goal is twofold:

1. To support and implement the Programming Model.

2. To reduce the total size of the system image through sharing of code and data.

A stub compiler is used to map MIDDL type definitions to C language types. The compiler,known as middlc processes an interface specification and generates a header file giving C typedeclarations for the concrete types defined in the interface together with special types used torepresent instances of the interface.

2.3.1 Interfaces

An interface is represented in memory as a closure: a record of two pointers, one to an array offunction pointers and one to a state record (figure 2.2).

To invoke an operation on an interface, the client calls through the appropriate element of theoperation table, passing as first argument the address of the closure itself. The middlc compilergenerates appropriate C data types so that an operation can be coded as, for example:

b � ctxt��op��Get ctxt� modules�DomainMgr� �dma �

In this case, ctxt is the interface reference. middlc generates C preprocessor macros so one mayuse the CLU-like syntax:

b � Context�Get ctxt� modules�DomainMgr� �dma �

2.3.2 Modules

A Nemesis module is a unit of loadable code, analogous to an object file in an UNIX system. Allthe code in Nemesis exists in one module or another. These modules are quite small, typicallyabout 10 kilobytes of text and about the same of constant data. The use of constructor interfacesfor objects rather than explicit class structures makes it natural to write a module for each kind

19

List

. . . op

st

per-instance

state

Text and read-onlydata implementing

List method

c : IREF Context

Contextinterface operation

table

Figure 2.2: Context interface closure

of object, containing code to implement both the object’s interfaces and its constructors. Suchmodules are similar to CLU clusters [Liskov 81], though ‘own’ variables are not permitted.

Modules are created by running the UNIX ld linker on object files. The result is a file which has nounresolved references, a few externally visible references, and no uninitialised or writable data4.

All linkage between modules is performed via pointers to closures. A module will export one ormore fixed closures (for example, the constructors) as externally visible symbols, and the systemloader installs these in a name space (see section 2.4) when the module is loaded. To use the codein a module, an application must locate an interface for the module, often by name lookup. Inthis sense linking modules is entirely dynamic.

If a domain wishes to create an object with mutable state, it must invoke an operation on anexisting interface which returns an interface reference of the required type and class.

Figure 2.3 shows an example where a domain has instantiated a naming context by calling the Newoperation of an interface of type ContextMod. The latter is implemented by a module with no mu-table state, and has instantiated an object with two interfaces, of types Context and Debugging.The module has returned pointers to these in the results c and d. The state of the object includesa heap interface reference, passed as a parameter to the constructor and closed over.

2.3.3 Address Space Structure

The use of interfaces and modules in Nemesis permits a model where all text and data occupiesa single address space, since there is no need for data or text to be at well-known addresses ineach domain. The increasing use of 64-bit processors with very large virtual address spaces (theAlpha processor on which Nemesis runs implements 43 bits of a 64-bit architectural addressingrange) makes the issue of allocating single addresses to each object in the system relatively easy.

It must be emphasised that this in no way implies a lack of memory protection between domains.The virtual address translations in Nemesis are the same for all domains, while the protection rightson a given page may vary. Virtual address space in Nemesis is divided into segments (sometimescalled stretches) which have access control lists associated with them. What it does mean is that

4In other words, there is no bss and the contents of the data segment are constant.

20

Dump

...

List

...

New

op

st

op

st

op

st

op

st

per-instance

state("object")

code implementingContext module

Debugging closure

Heap closure

ContextModclosure

Contextoperations

Debugging operations

text andread-only data

shared betweendomains

c: IREF Context

IREF ContextMod

Contextclosure

h: IREF Heap

d: IREF Debugging

Figure 2.3: Module and instantiated object

any area of memory in Nemesis can be shared, and addresses of memory locations do not changebetween domains.

2.4 Naming and Runtime Typing

While simple addresses in the single address space suffice to identify any interface (or other datavalue) in the system, a more structured system of naming is also required.

The name space in Nemesis is completely independent of the rest of the operating system. Whilesome operating system components do implement part of the name space, most naming contextsare first-class objects: they can be created at will and are capable of naming any value which hasa MIDDL type.

There are few restrictions on how the name space is structured. The model followed is that of[Saltzer 79]: a name is a textual string, a binding is an association of a name with some value,and a context is a collection of bindings. Resolving a name is the process of locating the valuebound to it. Name resolution requires that a context be specified.

2.4.1 Context Interfaces

Naming contexts are represented by interfaces which conform to the type Context. Operationsare provided to bind a name to any value, resolve the name in the context and delete a bindingfrom the context. The values bound in a context can be of arbitrary type, in particular they canbe references to other interfaces of type Context. Naming graphs can be constructed in this way,and a pathname may be presented to a context in place of a simple name. A pathname consistsof a sequence of names separated by ‘�‘ distinguished characters. To resolve such a pathname,the context object examines the first component of the name. If this name resolves to a context,this second context is invoked to resolve the remainder of the pathname.

21

...

Heap

...

Modules

IDC

Services

A

B

C

D

...

ContextMod

ThreadsPackage

DomMgrMod

TimerMod

StretchAlloc

FramesMod

TypeSystem

TypeSystem

...

Timer

DomainMgr

Figure 2.4: Example name space

2.4.1.1 Ordered Merges of Contexts

The MergedContext interface type is a subtype of Context, modelled after a similar facility inSpring [Radia 93]. An instance of MergedContext represents a composition of naming contexts;when the merge is searched, each component context is queried in turn to try and resolve the firstelement of the name. Operations are provided to add and remove contexts from the merge.

2.4.1.2 An Example

Figure 2.4 illustrates part of a naming graph created by the Nemesis system at boot time. ContextA is the first to be created. Since one must always specify a context in which to resolve a name,there is no distinguished root. However A serves as a root for the kernel by convention. Context Bholds local interfaces created by the system, thus ‘Services�DomainMgr’ is a name for the DomainManager service, relative to context A. Any closures exported by loaded modules are stored incontext C (‘Modules’), and are used during domain bootstrapping.

Context D has two names relative to the root, ‘Services�TypeSystem’and ‘Modules�TypeSystem’.This context is not in fact implemented in the usual way, but is part of the runtime type system,described in the next section.

2.4.2 Run Time Type System

The Type System is a system service which adds a primitive form of dynamic typing, similar to[Rovner 85]. Each MIDDL type is assigned a unique Type�Code, and the Type�Any type providessupport for data values whose type is not known at compile time. The TypeSystem interfaceprovides the operations IsType, to determine whether a Type�Any conforms to a particular type,

22

and Narrow, which converts a Type�Any to a specified type if the type equivalence rules permit.A major use of Type�Any is in the naming interfaces: values of this type are bound to names inthe name space.

The Type System data structures are accessible at run time through a series of interfaces whosetypes are subtypes of Context. For example, an operation within an interface is represented by aninterface of type Operation, whose naming context includes all the parameters of the operation.Every MIDDL interface type is completely represented in this way.

2.4.3 CLANGER

A good example of how the programming and linkage models work well in practice is CLANGER5

[Roscoe 95a] , a novel interpreted language for operating systems. CLANGER relies on the follow-ing three system features:

� a naming service which can name any typed value in the system,

� complete type information available at runtime, and

� a uniform model of interface linkage.

In CLANGER a variable name is simply a pathname relative to a naming context specified whenthe interpreter was instantiated. All values in the language are represented as Type�Anys. Thelanguage allows operations to be invoked on variables which are interface references by synthe-sising C call frames. Access to the Type System allows the interpreter to type-check and narrowthe arguments to the invocation, and select appropriate types for the return values.

The invocation feature means that the language can be fully general without a complex inter-preter or the need to write interface ‘wrappers’ in a compiled language. This capability waspreviously only available in development systems such as Oberon [Gutknecht ] and not in ageneral-purpose, protected operating system. CLANGER can be used for prototyping, debug-ging, embedded control, operating system configuration and as a general purpose programmablecommand shell.

2.5 Domain bootstrapping

The business of starting up a new domain in Nemesis is of interest, partly because the system isvery different from UNIX and partly because it gives an example of the use of a single addressspace to simplify some programming problems.

The traditional UNIX fork primitive is not available in Nemesis. The state of a running domainconsists of a large number of objects scattered around memory, many of which are specific to thedomain. Duplicating this information for a child domain is not possible, and would create muchconfusion even if it were, particularly for domains with communication channels to the parent.In any case, fork is rarely used for producing an exact duplicate of the parent process, rather it isa convenient way of bootstrapping a process largely by copying the relevant data structures. InNemesis, as in other systems without fork such as VMS, this can be achieved by other means.

The kernel’s view of a domain is limited to a single data structure called the Domain ControlBlock, or DCB. This contains scheduling information, communication end-points, a protectiondomain identifier, an upcall entry point for the domain, and a small initial stack. The DCB isdivided into two areas. One is writable by the domain itself, the other is readable but not writable.

5CLANGER has been implemented by Steven Hand.

23

Parent domain Domain Manager Child domain

Create remainingdomain state

InstantiateDomainEntryPoint

closure & state

CreateDomain Control Block

Figure 2.5: Creation of a new domain

A privileged service called the Domain Manager creates DCBs and links them into the schedulerdata structures.

The arguments to the Domain Manager are a set of QoS parameters for the new domain, to-gether with a single closure pointer of type DomainEntryPoint. This closure provides the initialentry point to the domain in its sole operation (called Go), and the state record should containeverything the new domain needs to get going.

The creation of this DCB is the only involvement the operating system proper has in the process.Everything else is performed by the two domains involved: the parent creates the initial closurefor the domain, and the child on startup locates all the necessary services it needs which have notbeen provided by the parent. Figure 2.5 shows the process.

The DomainEntryPoint closure is the equivalent of main in UNIX, with the state record taking theplace of the command line arguments. By convention the calling domain creates the minimumnecessary state, namely:

� A naming context.

� A heap for memory allocation.

� The runtime type system (see below).

From the name space, a domain can acquire all the interface references it needs to execute. Oneuseful consequence of this is that an application can be debugged in an artificial environment bypassing it a name space containing bindings to debugging versions of modules. The type systemis needed to narrow types returned from the name space. The heap is used to create the initialobjects needed by the new domain.

2.5.1 The Builder

To save a programmer the tedium of writing both sides of the closure initialisation code, a moduleis provided called the Builder. The Builder takes a ThreadClosure6 which represents the initialthread of a multi-threaded domain to be created. The Builder instantiates an initial heap for thenew domain. Most heap implementations in Nemesis use a single area of storage for both theirinternal state and the memory they allocate, so the parent domain can create a heap, allocateinitial structures for the child within it, and then hand it over in its entirety to the new domain.

The Builder returns a DomainEntryPoint closure which can be passed to the Domain Manager.When started up, the new domain executes code within the Builder module which carries outconventional initialisation procedures, including instantiating a threads package. The main thread

6The user-level threads equivalent of a domain entry point.

24

entry point is also Builder code, creating the remaining state before entering the thread procedureoriginally specified. Figure 2.6 shows the sequence of events.

Builder createsnew domain state

within heap

Builder invokesHeap module to

create initial heap

Domain Managercreates and

activates DCB

Child threadentered

Builder createsThreads package

Heap object nowexecutes in

child domain

Entry point inBuilder calls

heap to createdata structures

Parent invokesDomain Manager

to create newdomain

Parent invokesBuilder to build

new domainclosure

Heap Moduleconstructs heap

Parent Builder Heap DomainManager

Child

Modules

DomainManager

ChildDomain

ParentDomain

Figure 2.6: Action of the Builder

This illustrates two situations where a module executes in two domains. In the first, part ofthe Builder executes the parent and another part executes in the child. In the second, a singleheap object is instantiated and executes in the parent, and is then handed off to the child, whichcontinues to invoke operations upon it. In both cases the programmer need not be aware ofthis, and instantiating a new domain with a fully functional runtime system is a fairly painlessoperation.

2.6 Summary

Nemesis programs and subsystems are composed of objects which communicate across typedinterfaces. Interface types are ADTs defined in the MIDDL interface definition language, which

25

also supports definitions of concrete data types and exceptions. Objects are constructed by invo-cations across interfaces. There is no explicit notion of a class. There are no well-known interfaces,but a number of pervasive interfaces are regarded as part of a thread context.

Interfaces are implemented as closures. The system is composed of stateless modules whichexport constant closures. All linking between modules is dynamic, and the system employs asingle address space to simplify organisation and enhance sharing of data and text.

A uniform, flexible and extensible name service is implemented above interfaces, together witha run time type system which provides dynamic types, type narrowing, and information ontype representation which can be used by the command language to interact with any systemcomponents.

The single-address space aspect of Nemesis together with its programming model based on ob-jects rather than a single data segment prohibit the use of a fork-like primitive to create domains.Instead, runtime facilities are provided to instantiate a new domain specified by an interfaceclosure. The single address space enables the parent domain to hand off stateful objects to thechild.

The performance overhead of using closures for linkage is small, roughly equivalent to the useof virtual functions in C++. However, it is clear that the cache design of the machines on whichNemesis currently runs presents serious obstacles to the measurement of any performance bene-fits of small code size.

26

Chapter 3

Structural overview

Every significant module in the core of Nemesis will be described in this chapter.

3.1 NTSC

The Nemesis Trusted Supervisor Code provides a low level layer over the machine hardwarecontaining only:

� The domain scheduler. This is driven by tables constructed by the domain manager. Itonly schedules domains; it knows nothing about any threads packages some domains mayhave installed. The current implementation, Atropos, is a modified earliest-deadline-firstscheduler.

� A small set of system calls, of which only 6 are available in general.

� Interrupt dispatch code to allow device driver domains to install interrupt stubs.

� Start of day routines. The NTSC start of day routine is executed when the system boots andprovides just enough of an environment for Primal to run.

� Timer support for the scheduler.

� Kernel console routines, used only for the few messages produced by the NTSC in extraor-dinary circumstances.

In general, the user-domain-visible interface exported by the NTSC is identical across platforms.

3.1.1 Virtual Processor

The NTSC communicates with domains through the Virtual Processor interface. By dealing withthe Virtual Processor interface domains may set the way they are activated by the NTSC.

3.2 Primal

After the NTSC has completed processor specific initialisation control is passed to Primal. Itprovides enough of an environment for ordinary domains to run. When that has been achievedscheduling begins with just the Nemesis domain. Primal also contains a cut down serial driver

27

to provide output during initialisation. By placing nearly all of the start of day code that needs tobe executed before starting the first domain in primal, that code may be executed in user mode.Thus primal uses standard Nemesis module code.

3.3 The Nemesis Domain

The Nemesis domain is responsible for allocating basic system services. It communicates withthe kernel via the shared kernel state and privileged NTSC calls. The Domain Manager, StretchAllocator, Binder, Interrupt Allocator and Console run as threads within this domain.

3.3.1 Stretch Allocator

The Nemesis single virtual address space is divided up into different stretches. A stretch is a con-tiguous region of virtual address space containing the same protection throughout. The StretchAllocator manages the system’s memory. In addition to allocating stretches to domains it main-tains different protection domains.

3.3.2 Binder

Event channels between domains are set up and torn down by the Binder. In addition, it providesa mechanism by which the creation of most domains is blocked until a number of Boot Domainshave signalled their readiness to the binder.

3.3.3 Interrupt Allocator

Device drivers install interrupt stubs by negotiation with the Interrupt Allocator. The InterruptAllocator then manages the data structures used by the NTSC interrupt dispatcher to cause in-terrupts to be routed to the appropiate stub.

3.3.4 Console

The Console redirects the output and error streams of each domain to appropiate drivers, suchas the serial driver. It provides a mechanism for the console output to be switched around oncerecipients for the console output have been set up.

3.3.5 CallPriv allocator

CallPrivs provide a mechanism for privileged domains to install small pieces for code as priv-ileged operations of application domains. The CallPriv allocator manages the use of CallPrivsand installation of CallPrivs into the NTSC.

3.3.6 Type system

A single instance of the Typesystem exists. It is an extension of the Context interface that allowsthe type system database to be queried to obtain details about any type declared in a MIDDLinterface. Furthermore, it is able to narrow types.

28

3.4 Shared library services

All these services run as code in user level domains.

3.4.1 Domain Creation services

3.4.1.1 Domain Manager

The Domain Manager runs as a thread in the Nemesis domain. It creates, sets scheduling pa-rameters and destroys domains. It is only concerned with kernel-level scheduling. The DomainManager is responsible for maintaining the tables that drive the NTSC scheduler.

3.4.1.2 Builder

The Builder is a library module that establishes most of the state most domains require. Useof the Builder is not compulsory to the creation of domains. For example, the builder creates apervasives record and builds in to the pervasives record connections to system services and somenew closures.

3.4.1.3 Plumber

The Plumber deals with event channels and delivery at the lowest possible level. It is a privilegedlibrary which may only be invoked by system-privileged code (typically the binder).

3.4.2 Inter Domain Communicaton

3.4.2.1 Object Table

The Object Table maintains a mapping from offers to interfaces for both imported server servicesand exported services being provided. The Binder uses the Object Table to translate request foroffers to requests for services. The service requested is then invoked to create the connection.

3.4.2.2 Marshalling

Marshalling code converts client arguments and server results to a form where they may becommunicated between domains.

One implementation of marhsalling causes arguments to be placed into shared memory. Thecode for this is automatically generated by the middlc compiler from MIDDL interfaces.

3.4.2.3 Transports

Each IDC transport provides a particular mechanism by which domains may communicate. Thetransport module in use is invoked by servers wishing to IDC export services to create offers.The offers are then passed around and client domains invoked to obtain bindings (connections)to the service being offered.

29

3.4.2.4 Stubs

When an IDC A client stub is invoked by the client domain and causes the marshalling andtransport to take place. The server stub is invoked by the transport and causes a server routine tobe invoked. Client and server stubs are automatically generated from MIDDL interfaces by themiddlc compiler.

3.4.2.5 Gatekeeper

The Gatekeeper provides a convenient way for domains to manage memory shared betweendomains for inter domain communication.

3.4.3 Naming Contexts

A naming context is a mapping from strings to objects. These objects may be other NamingContexts, allowing arbitrary contexts to be built up. Naming Contexts may be merged. NamingContexts may be active- that is, code is invoked to return an object given a string.

3.4.4 Heap

The heap module provides traditional malloc style heaps. Heaps may be emptied and destroyed.Heap modules may be created per stretch.

3.4.5 IO

Bulk IO between protection domains is managed using the IO�if interface to access an Rbufsimplementation. An Rbufs channel is a simplex communications channel, managed by a pair ofcontrol FIFOs. The FIFOs are used to send packet descriptors; one FIFO is used for descriptorsto full packets, the other is used to return descriptors to empty packets. A packet descriptorconsists of a header followed by (base address, length) pairs1. The header gives the number ofthese IO�Recs following. It also carries an extra value field, which might be used to hold successor error codes, timestamps, etc. The base addresses point into a well defined data transfer area.This allows addresses to be easily checked for validity, and in the case of device drivers doingDMA, to be pinned in memory.

The channel can be run in one of two ways: Transmit Master Mode (TMM) or Receive MasterMode (RMM). In TMM, the data source (transmitter) formats the packet descriptors. In RMM,the transmitter receives packet descriptors formatted by the receiver, and writes its data into thedata memory pointed to by the base addresses. Once it has filled the data areas, it sends thedescriptor back to the receiver. Note that the master mode describes which end is generatingthe descriptors: in all cases the other end merely sends the same descriptor back after havingprocessed the described contents in some way.

3.4.5.1 FIFOs

FIFOs are implementations of pipes of fixed sizes. They are implemented using event counts tomanage the data flow.

1A pair is called an IO�Rec

30

3.4.5.2 IO transports

Transports create IDC offers for IO closures. These IDC offers may then be passed between do-mains.

3.4.5.3 IO Entries

An IO entry is the multiplexing point for IO channels. IO entries allow multiple threads to blockwaiting to service packets arriving on multiple IO channels by calling IOEntry�Rendezvous.When an IO channel becomes active, a thread will be unblocked and given a reference to theIO which it is to service. Once it has performed all the work it can on the IO, the thread canthen call block until more work is available (through the use of the IO interface’s Rendezvousoperation).

An IO entry is similar to the select system call under Unix.

3.4.5.4 Quality of Service Entries

A Quality of Service entry is a particular IO Entry. It performs the same functions of blockingthreads until some work is available. The current implementation is not thread safe: only asingle thread should use the entry. The service thread should call QoSEntry�Charge at somestage when it is performing work in order to inform the entry how much time it should chargefor the work done. The IREF QoSEntry�if is then in a position to schedule the IO work to bedone in a manner described by scheduler parameters very similar to the Nemesis CPU scheduler:each IO channel has a period and slice within that period associated with it. Service on timescalesbelow the period length is probabilistic, however over the period the particular slice requestedwill have been spent.

3.4.6 Readers and Writers

The Reader and Writer interfaces are pretty much a direct port of Modula-3’s ideas on readersand writers. They are used to do low-bandwidth character- and line-based IO. They includeroutines for reading and writing single characters, lines, NULL-terminated strings, and (length,base) buffer descriptors. Some implementations of readers and writers are buffered and someare seekable.

3.4.6.1 Redirectable readers and writers

It is sometimes useful to be able to have clients use a reader or writer without knowing where theinput is coming from or output going to. In this case a redirectable reader or writer can be used2.When the controlling thread wants a redirection to take place, it calls WrRedir�SetWr to set thenew output destination. Clearly, redirecting readers and writers don’t do any actual reading orwriting. They also add the cost of an extra procedure call overhead to each operation, as youmight expect from the extra level of indirection which they provide.

2At this point in time, redirectable readers have not been implemented

31

3.4.7 Threads

3.4.7.1 Threads Packages

Threads pacakges provide user level schedulers, so that domains may contain multiple threads.There may be many different threads packages available that implement the Threads interfaces.

3.4.7.2 Exceptions

The Exceptions module provides run time support for exceptions. The current exception mecha-nism in particular uses setjmp and longjmp.

3.4.7.3 Events

Event counts are used throughout Nemesis as the basic concurrency control mechanism. Threadspackages will often use events as the underlying concurrency control mechanism within a do-main. They implement services directly on top of the Virtual Processor interface.

3.4.7.4 SRC Threads

The SRC Threads package provides mutexes and condition variables on top of the event countmechanism.

3.4.7.5 ANSWare/RT Tasks and Entries

Support for the ANSWare/RT computation and engineering models are provided on top of thethreads interface.

3.4.8 Loading

3.4.8.1 Loader

To load a module or module with state, the loader reads in data, creates stretches for each segmentand relocates the code being loaded. It returns a set of segments and a set of symbols exportedfrom the code that has been loaded.

3.4.8.2 Exec

To start a new domain, exec loads a module by invoking the loader. Then, by invoking theBuilder, a new domain is created and registered with the Domain Manager.

3.4.9 Tables

A variety of tables are provided and used throughout the operating system. Tables indexed bycardinals, strings and machine words are currently available.

32

3.4.10 libc

A large subset of the ANSI C library is provided. When a module references a ANSI C libraryroutine, a tiny stub is linked against that module. Each stub jumps at the shared implementationcode. The stubs are collectively known as a veneer. The maths, stdio, stdlib and string facilitiesare currently provided in this manner.

3.5 The Trader

An ANSA style trader is provided. This is central to the Nemesis namespace system, providinga server to facilitate sharing portions of the namespace. It is used by service providing domainsto advertise their services.

33

Chapter 4

Multimedia applications

Several domains are involved in executing mulitmedia applications under Nemesis. These in-clude:

� The Nemesis domain and the trader. These provide the scaffolding used to setup applica-tions.

� The Clanger domain executes scripts that cause the other domains to be started and config-ured.

� The high bandwidth device drivers, such as the Nicstar ATM device driver and the frame-buffer device driver. These provide a single lowest possible level multiplexing point for thedata streams involved in multimedia applications.

� The application domains. These exchange data between the device drivers, processingthem, converting them and making decisions based on the data.

4.1 Video

RawVideoPlay is an application domain that converts data between a particular video formatgenerated by Fore Systems AVA-200 and AVA-300 units and a format suitable for display on a15-bit linear framebuffer.

Its actions are as follows:

1. Fetch the ATM connection particulars from the namespace of that domain. Fetch the displayand video stream particulars from the same namespace.

2. Establish an IDC connection to the ATM device driver.

3. Obtain a connection to the local windowing system and open a window on the local display

4. Obtain a DFS16 format update stream to the frame buffer from the windowing system.

5. Start a thread to handle window system update events.

6. Open an IO connection from the ATM device driver to the application

7. Repeatedly:

(a) Suck packets from the ATM device driver

(b) Convert packets to video tiles

34

(c) Blit tiles to the framebuffer, using the CallPriv mechanism.

4.2 Audio

A domain, NetworkAudio, receives any number of streams through IO channels. It exports anIDC interface (AudioSink) that enables audio connections to be established into it.

When more than one stream is connected to it, it processes every stream that is connected to it.It calculates the volume of each stream, and plays the loudest. Furthermore, the stream whichit is currently playing is globally readable. It is this functionality that enables floor control to beperformed.

A window may optionally be displayed, showing the volume levels of each stream currentlybeing received.

The primary data structure of this application is a table of audio streams being played, with thecurrent volume of each stream.

In detail, the actions of this domain are as follows:

1. Install a preemptive thread scheduler

2. Export the current stream condition variable as a globally readable variable referenced fromthe root namespace.

3. Query the domain’s namespace to find the parameters of the audio processing to be pe-formed.

4. Obtain an IDC connection to the Sound Blaster driver.

5. Obtain a Play Control from the Sound Blaster.

6. IDC export an AudioSink interface.

After this point, other operations can call AudioSink Open and Close operations. These are pro-cessed and a table of audio streams updated under mutual exclusion.

7. Optionally open a window using the client rendering library.

8. Repeatedly:

(a) Optionally plot the volume of each stream

(b) Make a decision as to which stream to play.

(c) Wait for a short time

As new audio streams are registered, new threads are started:

1. Update the streams table

2. Start a thread to process the audio stream

3. In that thread, obtain an IO connection to the source of the stream

4. Ensure the Sound Blaster driver play control is set to be able to play the new stream as wellas the other streams in the table

5. Prime the IO connection from the audio source

6. Repeatedly:

(a) Take a packet from the audio source

35

(b) Update the running volume calculation

(c) Check the packet sequence number for missing packets and deal with these drops in asensible manner

(d) If the main thread has decided to play this thread, then encapsulate the data for theSound Blaster and send the packet to the driver

4.3 Conference

Conference is a modified version of RawVideoPlay that can be configured to receive data frommultiple video sources and display one at a time according to the actions of the NetworkAudiodomain. The other data streams are dropped.

The result of running this application with the NetworkAudio domain is to provide video con-ferencing with floor control. A number of pairs of audio and video streams, one pair per partyin the video conference should be established. Then, whenever someone speaks all users of thevideo conference will see that person speaking.

The domains actions are as follows:

1. Fetch all the ATM connection particulars for both audio and video from the namespace ofthat domain. Fetch the display and video stream particulars from the same namespace.

2. For each audio stream, invoke the IDC exported AudioSink interface from the NetworkAu-dio domain to inform the NetworkAudio domain of that audio stream.

3. Establish an IDC connection to the ATM device driver.

4. Obtain a connection to the local windowing system and open a window on the local display

5. Obtain a DFS16 update stream to the frame buffer from the windowing system.

6. Start a thread to handle window system update events.

7. Start a thread for each video stream. In each thread:

8. Open an IO connection from the ATM device driver to the application

9. Repeatedly:

(a) Suck packets from the ATM device driver

(b) If the NetworkAudio domain is playing the audio stream corresponding to this audiostream, convert packets to video tiles and blit them to the framebuffer.

4.4 Setup

Clanger is a scripting language developed specially for Nemesis. It is able to express Nemesistypes such as contexts. It understands the Nemesis namespace scheme.

“Start of day” setup is handled by a clanger script containing most of the parameters necessary tostart up a demonstration. It should be noted that this scheme could be replaced by a sophisticateddistributed conference setup system with facilities such as admission control straightforwardly.

Clanger is able to start up domains with a particular Quality of Service parameters, particularmemory requirements and environments containing typed data. It can be used interactively orexecuted from an “rc” file that is loaded via NFS when a Nemesis workstation starts.

36

Fore System’s AVA-200 and AVA-300 are currently controlled via their managers and distributedcontrol mechanism. Nemesis be controlled by this mechanism. Nemesis is also able to controlAVA-200 and AVA-300 units remotely.

We now give an RC file that may be used to start up a video conference with a number of addi-tional services running. First, the Windowing System is started up:

wscpu � �� qos�cpu � �� p � �ms� s��us� l��ms� x�true� neps��

qos�mem � �� sHeapBytes � �� heapBytes � ��

run modules�WSSvr wscpu

pause ��ms

Some quality of service parameters are specified for later use:

minimalcpu � �� p � ��ms� s��us� l��ms� x�false ��

lotsofcpu � �� p � ��ms� s��us� l��ms� x�false ��

droppycpu � �� p � ��ms� s��us� l��ms� x�false ��

The NetworkAudio application is started:

envnetau � �� qos�cpu � �� p � ��ms� s � ��us� l� ��ms� x�true ��

run modules�NetworkAudio envnetau

pause ��ms

A spinning graphic application is started up:

envcarnage � �� priv�root�x�� priv�root�y�� qos�cpu � minimalcpu ��

run modules�Carnage envcarnage

pause ��ms

A processor bound application is started up:

envcrend � �� priv�root�x�� priv�root�y�� qos�cpu � minimalcpu ��

run modules�CRendTest envcrend

pause ��ms

Loadbar accounting is started:

lb � �� priv�root� �� bars�true� graph�true� dump�false�

latency��lead��

qos�cpu�� p � ��ms� s��ms� l��ms� x�false ��

run modules�Loadbars lb

pause ��ms

An RawVideoPlay application is started:

env� � �� priv�root � �� x�� y�� height�� width��

qos�cpu � droppycpu ��

run modules�RawVideoPlay env�

pause ��ms

And finally, the conference application is started and a video conference begins:

envconf � �� qos�cpu � lotsofcpu�

37

priv�root � ��

x �� y�� width � �� height� ��

parties � ��

�� audio�� vci�� pack��

video�� vci��

�� audio�� vci�� pack��

video�� vci��

��

��

��

run modules�Conference envconf

38

Chapter 5

Graphics

Currently there exists an implementation of the Nemesis framebuffer device (FB) running overS3 graphics cards, a generic windowing system server (WSSvr) and SWM, the Simple WindowManager.

5.1 S3 Frame Buffer

5.1.1 Control

The S3 Frame Buffer fully implements the FB interface, giving the window system completecontrol over the window layout of the screen.

Window updates can either be done synchronously, by calling UpdateWindows (or a combina-tion of ExposeWindow, MoveWindow and ResizeWindow), or asynchronously using a NemesisIO channel.

5.1.2 Blitting

The S3 Frame Buffer supports asynchronous update streams via a standard Nemesis IO channel,and synchronous update streams using FBBlit mapped over Nemesis Device Privileged Sections(callprivs).

It fully supports clipping as pixel level, and supports 8 and 16 bit update streams using tiledRGB, block RGB and block YUV.

Internally it maintains an array of tags, one for each screen pixel, used as the clipmask for clippingthe screen. In response to expose requests, it sets the tag for each pixel in the expose region to thetag of the window being exposed.

When drawing to the screen, each blit method only draws over pixels whose tag matches the tagof the window into which the drawing is being done. How the clipmask is actually used variesbetween different blit methods. Block YUV, for example, is drawn using the hardware blittingand colour space conversion available on the S3 — in this case, the clipmask is converted, 64pixels at a time, into a 64 bit bitmask indicating on or off and passed to the hardware clipper.When drawing Block RGB, this check is done in software (the hardware is only used for YUVsince converting YUV�RGB and dithering is expensive in software).

39

5.2 WSSvr

WSSvr presents to clients a unified interface encompassing the frame buffer, keyboard and mouse.Clients see a view of the world involving a set of windows, receiving events from the mouse andkeyboard when their window has the focus. WSSvr’s responsibility is to ensure that the orderedstack of windows is mapped correctly into a flat clipmask for the frame buffer, and that eventsfrom the mouse and keyboard drivers are sent to the correct windows (although the WindowManager can redirect these events if it wishes to).

5.2.1 Talking to the Window Manager

The window manager (WM) is informed whenever a new client connects to WSSvr. It is passedthe WSF closure that controls the client’s connection, and passes back a WS closure that is passedto the client. Thus all calls into WSSvr from clients are redirected through the WM, giving theWM freedom to implement any policy for controlling the display, without having to worry aboutimplementing the mechanism. The WM can create windows “owned” by a particular client bymaking calls on the WSF closure passed to it when the client connected.

5.2.2 Dealing with the frame buffer

Whenever a client action causes a change to the arrangement of windows in the display (e.g.opening, closing or moving a window) a conservative approximation to the affected region isproduced. When moving a window, for example, only pixels in the union of the old and new clipregions of the window can have their access rights changed, so this region is a safe approximationto the affected region. In reality, some of the pixels in this region might not be affected, since thewindow being moved might have been obscured by another window in some places.

Once the affected region has been selected, WSSvr travels down the stack of windows portioningout any unclaimed areas of the region to each window in turn. Once the region has been por-tioned out, any changes are passed to the Frame Buffer, to cause it to update its clipmask, and tothe clients, to cause them to refresh their windows.

5.2.3 Locking updates

When the WM locks updates on the WSSvr, instead of dealing with client actions immediately,WSSvr merges the affected region with an accumulator region. When the WM unlocks updates,the entire accumulated region is dealt with at once. This enables the WM to perform multipleactions atomically on the WSSvr, without flickering occuring between each action.

5.2.4 Mouse and Keyboard events

Separate threads wait for events from the mouse and keyboard drivers, and convert them into WSMouse, KeyPress, KeyRelease, EnterNotify and LeaveNotify events. Successive mouse eventswith the same button state in the same window are coalesced if WSSvr or its clients are notkeeping up with the event stream.

Mouse events also cause WSSvr to call the frame buffer to move the mouse cursor.

40

5.2.5 Passing events to clients

All client events are queued up and dealt with by a dedicated thread. This prevents WSSvr frombecoming unresponsive to mouse movements if the WM is taking too long to process events.

The window manager can take any desired action on receiving an event, including passing it toa client by calling WSF$DeliverEvent(). This causes the event to be inserted into the client’s IOchannel, provided that they have requested a channel and there are empty event records availablein the channel. Events which cannot be passed to the client are thrown away.

5.3 SWM

SWM is the Simple Window Manager. It implements several usability enhancements over the rawWSSvr.

5.3.1 Window Borders

When a client requests a new window, SWM creates the window for the client, and also creates awindow of its own to act as a decoration. This decoration contains in the title bar the domain IDand name, and optionally a window title if the client has supplied one. The rest of the decorationis a simple 3D shaded border. The decoration window is given a clipmask that fits around thewindow with which it is associated. The focused window has a highlighted border.

5.3.2 Moving, Raising and Lowering windows

The left, middle and right mouse buttons, when used on SWM’s decoration windows, are mappedto the functions of Raise, Move and Lower. When a client window is raised, moved or lowered, itsdecoration window is moved with it. These operations are all carried out atomically by lockingupdates on WSSvr.

5.3.3 Editing window clip regions

Within a client window, the mouse can be used to add or subtract rectangles from the clip re-gion of the window, allowing users to paint and erase areas to create irregular shaped windows.Because the decoration windows are ‘hollow’, lower windows will show through in any areaswhere the clip region of the client’s window has been erased.

41

Chapter 6

Communications Support inNemesis

6.1 Introduction

The Nemesis networking environment is quite different from that traditionally found in Unix.The design of the network sub-system follows closely the Nemesis model of reducing the use ofshared servers to the bare minimum, while doing as much processing as possible in the client.

The upshot of this is that clients do all protocol processing themselves, producing wire-readyEthernet frames. These frames are then passed to a minimal device driver which performs a fewsecurity checks on the data before enqueuing it for transmission.

For receive, the device driver exists only to demultiplex the packet to the correct network connec-tion: it performs no protocol checking itself. The upshot of this is that all communications mustbe connection-oriented, however, by classifying IP traffic into “flows”, this is easily achievable.

These flows are set up by a “Flow Manager” – a domain trusted to connect user applicationsto device drivers and install corresponding protection / flow information in the device drivercorrectly. Figure 6.1 gives an idea of how these different components interact.

The figure shows four domains (two applications, the Flow Manager, and a device driver), eachhaving its own protection domain (shown as dotted rectangles). Each solid arrow represents asimplex Rbuf I/O channel; there are two for each duplex flow.

These actions and requirements are considered in greater depth in the sections below.

6.2 Device Support

Network adapters be classified into two categories, self-selecting and non-self-selecting. Self-selectinghardware is able to demultiplex received data to its (hopefully final) destination directly; 3Com’spopular 3c509 Ethernet board is an example of non-self-selecting hardware; it delivers all incom-ing data to a small on-card circular buffer, leaving the device driver or (more commonly) theprotocol stack to perform the demultiplexing.

Some protocols stacks have sufficiently simple (de)multiplexing that it is reasonable to make useof such straightforward support directly in the hardware. Obvious examples include basic AAL5

42

sharedprotocol

code

FlowManager

applicationcode

network card driver

packet filter demux

applicationcode

shar

ed p

roto

col c

ode

shar

ed p

roto

col c

ode

flow setuprequests

filterinstallation

Figure 6.1: Overview of interaction of components in the network subsystem

access over either the U-net modified Fore Systems ATM interface, or the Nemesis ATMWorksdriver.

For other protocols (such as the IP protocol family), things are somewhat more tricky and canrarely be done in hardware. For receive it is at least necessary to perform some form of packet-filter operation on the packets.

6.2.1 Receive

In the current architecture, all network adapter cards are abstracted into ones capable of self-selection. This requires drivers for non-self-selecting cards to include some packet filtering inorder to demultiplex received data. This must be done before the memory addresses that thedata is to be placed in can be known, and will usually require that the data be placed in someprivate memory belonging to the device driver where it can be examined. Some interfaces useioports1 to present their hardware to the system (e.g. 3c509). For such systems the device driverneed only copy the maximal sized header such that the packet may be demultiplexed, leaving thepayload still in the on-card memory or FIFO. Once the packet’s final resting place is discovered,only the header needs copying again: the payload can come direct from the card.

For cards which support DMA (e.g. the de4x5 familly), the entire packet must be received intodriver private memory and copied to user buffers later – the QoSeffects of this extra copy can bemitigated by use of the techniques discussed in section 6.4.1.

6.2.2 Transmission

For transmission the device driver has a similar though slightly simpler procedure to perform.The header of the outgoing packet must be checked to ensure compliance with security. This islike packet filtering except that for almost all protocols this can be done using a simple compare

1Or strangely fragmented memory regions.

43

and mask which is extremely efficient; there is no demultiplexing on the fields and much infor-mation is pre-computable. Since the I/O channel remains concurrently writable by the client, theheader of the packet must be copied to some device-driver private memory either before or aspart of the checking process.

6.2.3 Driver scheduling

Since the bandwidth available for transmission on an Ethernet is limited, being able to rate-limiteach domain’s access to it is a necessary part of providing QoSguarantees to user applications.To this end, the device driver uses the QoSEntry�if interface to charge transmissions to I/Oconnections. Each I/O connection may have a different QoSspecification. Currently, we use a“slice out of period” scheme very similar to the CPU scheduler to limit the average transmitbandwidth.

Receive bandwidth is limited by how fast the user application can read data; if it does not sendempty buffers back to the device driver fast enough its packets are dropped as early as possible,either in the device driver or in the hardware (better).

6.3 Flow Setup

Flow setup and close is handled by a Flow Manager server process. The Flow Manager is respon-sible for keeping host- and interface-wide state. This includes: arbitrating access to resourcessuch as port numbers, ARP cache maintenance, and holding the routeing tables. Once a con-nection has been set up, the Flow Manager takes no further part in handling the actual data forclient applications; the Flow Manager is on the control path only. All protocols (including data-gram protocols such as UDP) are treated as based on a flow between a local and a remote ServiceAccess Point (SAP).2

Since the flow manager is a trusted system process, it is possible for it to synthesize native ma-chine code dynamically either to do the packet filtering on receive, or the packet checking ontransmit, and to have the device drivers simply execute this code. Currently, it generates a Berke-ley Packet Filter (BPF)specification for the protocols to allow, and downloads it to the devicedriver.

6.4 Application Stack

The application contains a vertically structured protocol stack for each IP flow. By this, we meanthat it never interacts with stacks for other flows. The bottom of the stack is directly connected tothe device driver using two Rbufs channels. Under Nemesis, Rbufs channels are implemented bythe (IO�if) interface. Since Rbufs provide simplex channels, this explains the need to have twoof them for a duplex protocol – clearly simplex protocols need have only the one.

An application is free to implement its protocol stack as it wishes, since due to the simple checksperformed by the driver, it cannot affect the rest of the system. The concept of Integrated LayerProcessing (ILP) which may be used for efficiency reasons within the application’s stack is com-pletely orthogonal to the mechanism by which the stacks are separated.

In practice, most protocol stacks created by an application will use a standard shared library ofprotocol processing code, as described in section 6.5.2.

2This is effectively required by RFC1122 (Host requirements) section 4.1.3.4, though many other networking stacksignore this.

44

6.4.1 Moving the copy

Earlier we described a problem for QoS in which the data copy required in the device driver forDMA-based non-self-selecting interfaces causes CPU time not to be attributed to the applicationbut to the device driver.

Our solution is an extremely lightweight trap directly into some driver-provided code. Thiscode runs in a tightly restricted environment, and is trusted not to affect the rest of the systemundesirably. This is called a “call-priv”, a name not dissimilar to the Alpha architecture’s “call-pal” instruction, since we think of these call-privs much like an extra CPU instruction.

The driver export a call-priv to allow user applications controlled access to the private drivermemory packets are DMAed to. The call-priv should (of course) check the client is allowed toaccess the buffer in question before copying the data. The whole operation is charged to the userapplication.

Note that the call-priv is very different from the thread migration model, since the executionenvironment during the changed privilege state is very restricted.

6.5 Current Implementation

This section describes our current (simple) implementation of the described design. It was in-tended to provide NFS client access to the other research groups using Nemesis as soon as possi-ble. Although under active development it is nevertheless still useful.

6.5.1 Flow Manager Implementation

The ConnMan�if interface documents the services offered by the Flow Manager. Applicationswishing to use the network make a flow setup request to the Flow Manager using the Graft

method; if the local port requested is free the request succeeds. The Flow Manager constructs atransmit filter, extends the receive packet filter, and installs them within the device driver domain.The Flow Manager completes the binding between the two domains and returns IDC offers forthe Rbufs I/O channels to the application.

Although applications are free to use whatever protocol processing code they consider most ap-propriate, all currently use a standard shared library, available as a set of modules implementingsubtypes of the Protocol�if interface.

The protocol interfaces use information provided by the Flow Manager to configure themselvesto format packets correctly for that particular flow. This configuration information includes suchdetails as the local and remote IP address and port numbers, but also includes the source anddestination3 Ethernet addresses. All of these details are provided by the Flow Manager to theapplication to enable it to generate correctly formatted packets. However the guarantee that theyare correct comes from the transmit filter downloaded from the flow manager into the devicedriver.

Our current implementation of the Flow Manager is used primarily for connection setup; its otherfunctions as described in section 6.3 are not yet fully implemented.

3Or next hop.

45

6.5.2 Application Protocol Implementation

As previously mentioned, an application builds a stack however it chooses, but in order to allowus to experiment with a wide variety of protocol compositions, we have have implemented ahighly modular stack. In this particular library, each protocol layer is separately manipulatable:layers may be created, configured, composed and destroyed dynamically.

Each protocol layer (Protocol�if) presents two Rbuf-like interfaces by which data can be sent toor received from the network. These IO�if interfaces can be retrieved by calling GetTX and GetRX

on the protocol layer of interest. Because each protocol layer looks just like a direct connection tothe device driver, layers can be composed with ease. Note that although the semantics of an Rbufcommunications channel are used between layers, there is no need to actually have a real RbufI/O channel present: no protection domain is being crossed. Instead, ordinary procedure callsare used to pass iorecs from layer to layer, leading to an efficient yet flexible implementation.

Since iorecs are used to describe packets in transit, a naıve way of adding a header to a packetwould be to re-write the iorec to have an additional entry for the new header (similar to addingan mbuf on the front of a chain in Unix). This is opposed to the Rbuf conventions, and inefficientas it increases the length of the iorec, and adds embedded store management. Instead the flow-based nature of the communications is used to pre-calculate the total size of header required forall layers. Whilst this is standard practice in many fixed composition systems, it is rarer whenprotocols can be composed arbitrarily on a per-flow basis.

For receive, it is necessary for the user application to supply empty buffers for the device driverto receive packets into. Sending these empty buffers down is known as “priming the pipe”, sincethey are buffered in a FIFO manner before being presented to the device driver. Failure to primethe pipe results in no packets being received: the driver must assume that the user application isnot keeping up with the network traffic, so if it find no empty client buffers, it drops the packetsilently.

When a packet arrives, the device driver sends it up the RX I/O. In this manner, a continuousloop of buffers is formed: empty buffers “owned” by the client are send down the RX I/O withIO�PutPkt, are consumed by the device driver, and are eventually sent back up the RX I/O onpacket receive, where they are consumed by the client using IO�GetPkt. Once the client hasprocessed the packet’s payload to its satisfaction, the buffer is once more free, and the cyclerepeats.

This idea of a continuous loop of packet buffers also holds for the transmit side of the stack.Packet buffers full of data to be transmitted are sent down the TX I/O using IO�PutPkt by theclient application. They are picked up by the device driver, and once they have been put on thewire, the empty buffer is sent up to the client application where it is recovered using IO�GetPkt

on the TX I/O.

6.5.2.1 Constructing a new stack

This is the current way of building a stack: we realise it has many deficiencies – it is currentlyundergoing a re-design.

First the Flow Manager is contacted. The source Ethernet address for this machine is found byasking the Flow Manager to route the requested local IP address to an interface.

The Flow Manager is also asked to ARP for the remote IP address – it should return the MACaddress of the gateway out of the subnet for the case of non-local addresses, but doesn’t.

The Graft function of the Flow Manager is called, to set up the I/O channels to the device driver.

A heap suitable for sharing with the device driver domain is requested from the Gatekeeper.

46

The user-application stack is finally grown, from the bottom up: Ethernet encapsulation, then IP,and finally UDP. Finally, the bottom of the user stack is connected to the driver I/O offers usingBaseProtocol�Go.

The final stack is probed to find out how large its headers are, its MTU etc. Buffer space forreceive and transmit is allocated, and the receive pipe is primed.

6.5.3 Device drivers

There currently exist device drivers for a the 3c509 and the de4x5 family of cards. Various supportlibraries exist to allow transformation of the iorecs into physical addresses for DMA, and forperforming the checks required by the Flow Manager.

The following describes the new scheme for implementing device driver will work: at this stageit is approximately half-implemented. It is described because some of the ideas presented are stillvalid for the old driver organisation.

Device drivers share a large portion of policy-setting code in the form of the Netif�if interface.This is the interface to the Flow Manager, allowing flows to be set up.

The card-specific portion of a driver presents a Netcard�if interface to the bottom of Netif�if– the Netcard�if interface is thus just an internal interface within the device driver domain.Similarly, NetifCallback�if is how the card-specific portion calls up to the common code.

On packet arrival, the card driver calls up to Netif with a descriptor to the header of the packet,asking where the packet is to go. Netif makes a policy decision (currently using BPF, but almostcertainly something more sophisticated later). The I/O channel to send the received packet upis returned, along with a descriptor to some memory available for future headers to be receivedinto.

DMA capable cards receive the entire packet into these buffers, while cards which can deliverjust the headers can do so. In this way, DMA capable cards are supported just as efficiently asthose with polled-mode ioport access.

For transmit, the Netcard�if interface defines a method to enqueue a packet for transmission.The card-driver calls Netif back asynchronously, once the packet buffer has been sent, and isfree.

47

Chapter 7

Memory Management

This chapter describes the Intel memory management scheme.

7.1 Background

The Nemesis operating system runs in a single virtual address space. This means that the trans-lation from virtual to physical addresses will be the same no matter which domain is running.The access rights that different parts of the system have for each address may differ.

All of the allocated virtual addresses are covered by stretches. A stretch is a contiguous areaof the virtual address space, with the same accessibility throughout the range. Accessibility isexpressed in terms of four rights: ‘Read’, ‘Write’, ‘Execute’ and ‘Meta’.

The ‘Execute’ right is supposed to control whether code can be executed in the stretch. Unfortu-nately it is impossible to support this on Intel architecture machines; Execute access is equivalentto Read access. The ‘Meta’ right indicates whether the permissions on the stretch can be modified.

The accessibility of a stretch is determined by a combination of two things: the permissions forthe stretch in the current protection domain and the global permissions of the stretch. The globalpermissions specify a minimum level of access that all domains share.

Each domain has a corresponding protection domain. Two or more domains may share the sameprotection domain; the protection domain of a domain is set when the domain is created. When-ever the processor is allocated to a domain, the corresponding protection domain is switchedto.

The aim of this system of memory protection is to enable lightweight sharing of data between do-mains. Having a single virtual address space means that domains may store and pass pointers toany other domain’s data. It should be possible to switch between protection domains much morequickly than in a system with multiple virtual address spaces, because none of the translationsin the translation lookaside buffer (TLB) should be invalid at context switch time.

(footnote? The translation lookaside buffer is a cache of virtual to physical address translations.On every memory reference the processor looks up the virtual address in this cache; if the trans-lation is present in the cache then the appropriate physical address is used. If the translation isnot present then a ‘TLB miss’ has occurred and the processor must spend some time finding thecorrect translation. The TLB also caches access rights for virtual addresses.)

48

7.2 Implementation

Implementations of this memory protection scheme on architectures like the Alpha are reason-ably easy; a TLB miss handler can be provided that looks up the rights for the current stretch inthe current protection domain and in the table of global access rights.

The Intel architecture does not provide for a user-supplied TLB miss handler routine. Instead, theprocessor handles TLB misses by walking a page table data structure in memory. The structure ofthis table is fixed by the processor architecture. A user-supplied routine (the page fault handler)will only be called if the page table data structure indicates that the requested page is unavailable,or the task requesting the page is not allowed to access it.

7.2.1 Simple

A simple implementation of the Nemesis memory management model could be made by build-ing a page table structure for each protection domain. These page tables all have exactly the sametranslations, but different access rights for each page. Whenever the current protection domain ischanged, the register in the processor is changed to point to the appropriate page table structure.

This implementation has a number of problems. It is inefficient in terms of memory: each pagetable structure will take up at least 16k. Whenever a virtual to physical address translation ischanged, all of the page tables must be updated. All of the page tables must be updated in theevent of a global protection change as well.

Possibly the greatest problem is that whenever the page table base register is changed, the proces-sor flushes out all of the entries in the TLB. This leads to poor context switch performance, whichis unfortunate in an operating system like Nemesis that provides fine-grain processor scheduling.

7.2.2 Current

The current implementation of the Nemesis memory management model under Intel is reason-ably efficient. Only one page table structure is present in the system; the processor’s page tablebase register points at it permanently.

The ‘natural state’ of the page table is to contain the global access rights for each page. While do-mains only need global access rights to memory no further intervention is needed; the processordoes all the work of putting entries in the TLB.

If a domain needs more than the global rights to access a page of memory, a page fault exceptionis raised. The page fault handler routine checks in the current protection domain whether thedomain is allowed to perform the memory access that it has requested. If the memory access is tobe allowed, the page fault handler modifies the processor’s page table to indicate that the accessis allowed. At the same time, it puts the page into a list of altered pages.

On exit from the page fault handler the processor retries the memory access, which should nowsucceed. Further accesses to the page should also succeed immediately, using either the rightscached in the TLB or the new rights present in the page table.

When a protection domain switch occurs, the ‘expanded’ access rights must be removed fromthe processor’s page table and TLB. This is done by scanning the list of altered pages which wasbuilt up earlier, restoring the global access rights to the page table and flushing just those TLBentries which are affected. The list is then emptied.

49

Chapter 8

Scheduler Accounting

One of the Nemesis domains schedulers, Atropos, provides processor time quality of serviceguarantees to applications. Scheduling policy for an application is described by three parameters:

� Period. The time scale over which domains are scheduled.

� Slice. The amount of time that the processor can potentially use the processor during aperiod.

The processor will not necessarily be given to the domain for this time; typically, it willreceive it over a number of smaller schedules that will add up to a figure less than or equalto the slice of the domain.

It should be noted that having a non-zero slice does not guarantee that a domain will receivethe processor for that time each period. Rather, it guarantees that if the domain does notblock it will receive in total that slice. There is no facility for returning a proportion of aslice in the same period for a domain that has squandered its guarantee by blocking.

� Extra Flag. If this flag is set, the domain may receive any extra processor time available.The scheduling of extra time is not fair; it does however tend towards that.

8.1 Scheduler Accounting mechanism

The scheduler may optionally be compiled to potentially use an accounting module. There is aclosure pointer in the kernel state which if non-null and if support for accounting is compiledinto the scheduler will cause an Account interface operations to be invoked at certain events inthe scheduler. These operations are described under Account.if in the interface manual.

With each event, the following information is recorded in a large circular log:

� The nature of the event.

� The processor cycle counter when the event occurred.

� A parameter.

The most common entries in the accounting log are:

� End of period events. These occur when a domain has finished it’s period and is beingallocated a new period. The parameter field is not used.

50

� Deschedule events. These occur whenever the scheduler is entered. They can be subdi-vided into deschedules of domains running in guaranteed time and domains running inextra time. The parameter is the time, in cycles, since the last event.

8.2 NFS dump

A domain can install a domain accounting module in to the NTSC and then wait for the corre-sponding accounting log to fill up and then dump the data across NFS to a remote machine.

Typically, a Unix box will be used as a NFS server to collect this data. The dump produced is inASCII. Each scheduler event is listed as a line containing first the processor cycle counter of theevent time, in hexadecimal. Then, the event type is indicated by one of:

� “F” for a period finished event

� “X” for an extra time deschedule event

� “C” for a contracted (ie guaranteed) time deschedule event

� “B” for a block event

� “U” for a unblock event

Next comes the ID of the event in hex. Finally, the parameter of the event is given in hex.

An extract from a log produced by a Nemesis Workstation:

�eca��ec�C�e��cc�

�edd��X��a

�edd��F�c��edd��

�ede�a��C�c�c�a�

�ee��X��a�a��

�ee��e�C��bd�

�ee��d�X��a�f

�ee��X��b��

�ef�b��f�F�d��ef�b��

�ef��C�d�bb��

�efa��a�C�b��

�f��b�fc�F�e��f��b�d�

�f��e��C�e��

�f��afeb�X��ca��

�f��e��C��f�

�f��f��X��c�

�f�fd��a�X��e�d�

�f��ca��X��fce

�f��c��X��d

�f��dad�F�c��f��d��

�f�a�b��C�c�baff

�f�fbdac�F�e��f�fbd��

8.3 Loadbars

A Nemesis application, Loadbars, has also been produced which represents the activity of eachdomain graphically. By querying the proc namespace, the names and Domain Control Blocks of

51

the current domains can be found. For each domain, Loadbars displays an animated bar of theproportion of processor time used by each domain.

These statistics are produced by working over the accounting log, and looking at a time window.In real time, the Loadbars work out what proportion of time is spent by each domain whenrunning in contracted and extra time.

Note that this time window is not necessarily lined up to deschedule events. All descheduleevents that could correspond to processor time being used within that window are examined.So, if a deschedule occurs within the window then only that proportion of time since the start ofthe window is accounted. If a deschedule occurs after the time window, any of the time in thatschedule inside the window is included.

The result from this processing of a the logs is a table of contracted and extra cycles used by eachdomain as well as the total number of cycles used in that time. This information is then plottedas horizontal “bars”. A full bar extending across the whole window indicates that all the time inthe time window was used by that bar’s domain. Contracted time is coloured green and extratime red.

8.4 Loadgraph

The same application is also able to produce a graph showing the same information plotted overtime, as a colour coded line graph, with one line per graph. It is possible to alter the line graphingstyle by disabling lines for particular domains or requesting different plotting styles. Severalalternative styles differentiate between guaranteed and extra time for instance.

The loadgraph can be used to observe behaviour of the system over times. Transitive behaviourof the scheduler when guarantees are changed can be observed, for instance. Medium terminteractions between domains also become apparent.

8.5 Quality of Service control

If the loadbars are given kernel privileges, they can be used to interactively vary the qualityof service parameters. A domain can be selected by clicking on it’s name or loadbar. Then,the middle mouse button toggles extra time on that domain. The left mouse button, if clickedsomewhere in the loadbar display, sets the slice of that domain to a value so that the domain, ifrunning without blocking, would reach that horizontal displacement.

(In the current implementation, no check is made to ensure that the scheduler is not over com-mitted by guaranteeing more than 100 percent of the processor time. This enables scheduleroverloads to be studied).

The slice is indicated by a thin blue horizontal line draw on top of the loadbars display for eachdomain. The guaranteed green proportion of a loadbar should never exceed a

The period of a domain may be set interactively by simply selecting a domain by clicking onit’s title or loadbar. Then, the period should be typed in to the window from the workstationkeyboard, as a number of digits followed a one of “M”, “U” or “N” followed by a “S”, to specifya number of milli, micro or nano seconds respectively.

It should be noticed that the scheduler is rather unstable in the face of this kind of subversion.

52

Chapter 9

Build mechanism

9.1 Tree structure

The Nemesis kernel tree is divided into eleven top-level sections.

mk This section holds Makefiles and scripts associated with building the tree.

if Interface files for the whole tree. Currently all interface files must go in this directory.

h Header files for the whole tree.

$MACHINE Platform specific header files.

$ARCH Architecture specific header files.

ntsc Nemesis trusted supervisor code.

generic Generic parts of the NTSC.

$ARCH Architecture specific parts of the NTSC.

lib Files that are linked with other parts of the tree.

veneer Assorted veneers, each within its own subdirectory.

static Static libraries. Each library will be within its own subdirectory.

mod Stateless shared module code, grouped by function. Some of the groups include:

nemesis Core nemesis modules.

net Networking related modules.

ws Window system related modules.

venimpl Jump library implementations that lurk behind veneers. To have the same direc-tory structure as lib/veneer.

sys Vital Nemesis system code. Any Nemesis system requires this code, no matter what archi-tecture or configuration it is.

dev Device driver programs. These are generally grouped by bus architecture.

app Application programs. There are system applications like the flow manager and the windowsystem server in this directory, as well as some other less vital programs.

53

boot Boot images and associated infrastructure. The final Nemesis kernel image is created some-where under this directory.

loader Machine or architecture specific loaders.

images Binary images, grouped by machine.

doc Documentation. Some documentation is automatically generated from the files in the ‘if’directory.

Any directory may contain a ‘contrib’ subdirectory. Code within ‘contrib’ directories is officiallyunsupported, and may not even compile. Code in ‘contrib’ directories is not compiled by default.

As far as possible, directory names within a particular directory are unique on the first letter. Thisenables quick navigation using shells with filename completion. Directory names are all in lowercase.

9.1.1 Configuration

Nemesis is very flexible; it can be configured for a wide variety of machines and environments. Asystem has been developed to make this configuration rather easier than editing large numbersof files all over the kernel tree.

The configuration system consists of a set of scripts in the mk�cfg directory. These scripts ask theuser a number of questions about the desired configuration and output two files, h�autoconf�hand mk�autoconf�mk. Together these two files control which parts of the tree are built, and whichparts are included in the final kernel image.

The details for each configuration option are held in two files; options�mk and nemesis�nbf. Thefiles indicate which parts of the system reside in which directories, and which interfaces need tobe built for those parts to compile.

Although the ifdirectory is flat at the moment, the configuration system imposes a structure on itand allows only the required interfaces to be built. This speeds up the build process considerably.

54

Bibliography

[Accetta 86] Mike Accetta, Robert Baron, William Bolosky, David Golub, Richard Rashid,Avadis Tevanian, and Michael Young. Mach: A New Foundation for UNIXDevelopment. In USENIX, pages 93–112, Summer 1986. (pp 4, 9)

[Anderson 92] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, andHenry M. Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. ACM Transactions on Computer Systems,10(1):53–79, February 1992. (p 4)

[ANSA 95] Architecture Projects Management Limited, Poseidon House, Castle Park,Cambridge, CB3 0RD, UK. ANSAware/RT 1.0 Manual, March 1995. (p 11)

[ARM 91] Advanced RISC Machines. ARM6 Macrocell Datasheet, 0.5 edition, November1991. (p 3)

[Barham 96] Paul R. Barham. Devices in a Multi-Service Operating System. PhD thesis, Uni-versity of Cambridge Computer Laboratory, July 1996. Available as Techni-cal Report No. 403. (pp iv, 9, 12)

[Bayer 79] R. Bayer, R. M. Graham, and G. Seegmuller, editors. Operating Systems: anAdvanced Course, volume 60 of LNCS. Springer-Verlag, 1979. (p 57)

[Birrell 84] Andrew D. Birrell and Bruce Jay Nelson. Implementing Remote Procedure Calls.ACM Transactions on Computer Systems, 2(1):39–59, February 1984. (p 5)

[Birrell 87] A.D. Birrell, J.V. Guttag, J.J. Horning, and R. Levin. Synchronisation Primi-tives for a Multiprocessor: A Formal Specification. Technical Report 20, DigitalEquipment Corporation Systems Research Centre, August 1987. (p 11)

[Black 94] Richard J. Black. Explicit Network Scheduling. PhD thesis, University of Cam-bridge Computer Laboratory, 1994. (pp iv, 4, 6, 11, 13)

[Brinch-Hansen 70] Per Brinch-Hansen. The Nucleus of a Multiprogramming System. Communica-tions of the ACM, 13(4):238–241,250, April 1970. (p 2)

[Coulson 93] G. Coulson, G. Blair, P. Robin, and D. Shepherd. Extending the Chorus Micro-kernel to support Continuous Media Applications. In Proceedings of the 4th In-ternational Workshop on Network and Operating System Support for Digi-tal Audio and Video, pages 49–60, November 1993. (p 9)

[Coulson 95] G. Coulson, A. Campbell, P. Robin, M. Papathomas G. Blair, and D. Shep-herd. The Design of a QoS-Controlled ATM-Based Communications System inChorus. IEEE Journal on Selected Areas In Communications, 13(4):686–699,May 1995. (p 9)

[DEC 93] Digital Equipment Corporation. DECchip 21064 Evaluation Board User’sGuide, May 1993. Order Number EC-N0351-72. (p iv)

55

[DEC 94] Digital Equipment Corporation. DEC3000 300/400/500/600/700/800/900 AXPModels: System Programmer’s Manual, 2nd edition, July 1994. Order NumberEK-D3SYS-PM.B01. (p iv)

[Dixon 92] Michael Joseph Dixon. System Support for Multi-Service Traffic. PhD thesis,University of Cambridge Computer Laboratory, January 1992. Available asTechnical Report no. 245. (p 12)

[Evers 93] David Evers. Distributed Computing with Objects. PhD thesis, University ofCambridge Computer Laboratory, September 1993. Available as TechnicalReport No. 332. (p 15)

[Gutknecht ] Jurg Gutknecht. The Oberon Guide. (version 2.2). (p 23)

[Hayter 94] M. Hayter and R. Black. Fairisle Port Controller Design and Ideas. In ATMDocument Collection 3 (The Blue Book), chapter 23. University of CambridgeComputer Laboratory, March 1994. (p iv)

[Hyden 94] Eoin Hyden. Operating System Support for Quality of Service. PhD thesis,University of Cambridge Computer Laboratory, February 1994. Available asTechnical Report No. 340. (pp iv, 1)

[Kane 88] Gerry Kane. MIPS RISC Architecture. Prentice-Hall, 1988. (p 3)

[Leslie 93] I. M. Leslie, D. R. McAuley, and S. J. Mullender. Pegasus—Operating SystemSupport for Distributed Multimedia Systems. ACM Operating Systems Review,27(1):69–78, January 1993. (p iv)

[Liskov 81] Barbara Liskov, Russell Atkinson, Toby Bloom, Eliot Moss, J. Craig Schaffert,Robert Scheifler, and Alan Snyder. CLU Reference Manual, volume 114 ofLNCS. Springer-Verlag, 1981. (p 20)

[Liu 73] C. L. Liu and James W. Layland. Scheduling Algorithms for Multiprogrammingin a Hard-Real-Time Environment. Journal of the Association for ComputingMachinery, 20(1):46–61, January 1973. (p 11)

[Mogul 91] Jeffrey C. Mogul and Anita Borg. The Effect of Context Switches on Cache Per-formance. In Proceedings of the 18th International Symposium on ComputerArchitecture, 1991. (p 11)

[Nelson 91] Greg Nelson, editor. Systems Programming With Modula-3. Prentice-Hall,Englewood Cliffs, NJ 07632, 1991. (pp 9, 16)

[Radia 93] Sanjay Radia, Michael N. Nelson, and Michael L. Powell. The SpringName Service. Technical Report 93-16, Sun Microsystems Laboratories, Inc.,November 1993. (p 22)

[Raj 91] Rajendra K. Raj, Ewan Tempero, Henry M. Levy, Andrew P. Black, Nor-man C. Hutchinson, and Eric Jul. Emerald: A General-Purpose ProgrammingLanguage. Software—Practice and Experience, 21(1):91–118, January 1991.(p 15)

[Reed 76] D. P. Reed. Processor Multiplexing in a Layered Operating System. PhD thesis,Massachusetts Institute of Technology Computer Science Laboratory, June1976. Available as Technical Report no 164. (p 12)

[Reed 79] David P. Reed and Rajendra K. Kanodia. Synchronization with Eventcountsand Sequencers. Communications of the ACM, 22(2):115–123, February 1979.(p 11)

[Roscoe 94a] Timothy Roscoe. Linkage in the Nemesis Single Address Space Operating System.ACM Operating Systems Review, 28(4):48–55, October 1994. (p 19)

56

[Roscoe 94b] Timothy Roscoe. The MIDDL Manual. Pegasus Working Document (4th Edi-tion), available from ftp��ftp�cl�cam�ac�uk�pegasus�Middl�ps�gz, Au-gust 1994. (p 16)

[Roscoe 94c] Timothy Roscoe, Simon Crosby, and Richard Hayton. The MSRPC2 UserManual. In ATM Document Collection 3 (The Blue Book), chapter 16. Uni-versity of Cambridge Computer Laboratory, March 1994. (p 18)

[Roscoe 95a] Timothy Roscoe. CLANGER : An Interpreted Systems Programming Language.ACM Operating Systems Review, 29(2):13–20, April 1995. (p 23)

[Roscoe 95b] Timothy Roscoe. The Structure of a Multi-Service Operating System. PhD thesis,University of Cambridge Computer Laboratory, April 1995. Available asTechnical Report No. 376. (pp iv, 2, 6, 11, 13, 14, 15, 18)

[Rovner 85] Paul Rovner. On Adding Garbage Collection and Runtime Types to a Strongly-Typed, Statically-Checked, Concurrent Language. Technical Report CSL-84-7,Xerox Corporation, Palo Alto Research Center, July 1985. (p 22)

[Rozier 90] M. Rozier, V. Abrossimov, F. Armand, I. Boule, M. Gien, M. Guillemont,F. Herrmann, C. Kaiser, S. Langlois, P. Leonard, and W. Neuhauser. Overviewof the CHORUS Distributed Operating Systems. Technical Report Technical Re-port CS-TR-90-25, Chorus Systemes, 1990. (p 4)

[Saltzer 79] J. H. Saltzer. Naming and Binding of Objects. In Bayer et al. [Bayer 79], chapter3.A, pages 100–208. (p 21)

[Sites 92] Richard L. Sites, editor. Alpha Architecture Reference Manual. Digital Press,1992. (p 3)

[Stroustrup 91] Bjarne Stroustrup. The C++ Programming Language. Addison-Wesley, 2ndedition, 1991. (p 17)

[Stroustrup 94] Bjarne Stroustrup. The Design and Evolution of C++. Addison-Wesley, 1994.(p 17)

[Voth 91] David Voth. MAXine System Module Functional Specification. Technical Re-port, Workstation Systems Engineering, Digital Equipment Corporation, 100Hamilton Avenue, Palo Alto, CA 94301, July 1991. revision 1.2. (p iv)

57

overviewcourses.cs.vt.edu/.../Virtualization/Nemesis-Overview.pdfFairbairns, Eoin Hyden, Ian Leslie, Derek McAuley and Timothy Roscoe. The ﬁrst version of Nemesis was written from

Documents