Virtual Memory and Address Translation. Virtual Addressing text data BSS user stack args/env kernel data virtual memory (big) physical memory (small)

Virtual Memory and Address TranslationVirtual Memory and Address Translation

Virtual AddressingVirtual Addressing

text

data

BSS

user stack

args/envkernel

data

virtualmemory

(big)

physicalmemory(small)

virtual-to-physical translations

User processes address memory through virtual

addresses.

The kernel and the machine collude to

translate virtual addresses to

physical addresses.

The kernel controls the virtual-physical translations in effect

for each space.

The machine does not allow a user process to access memory unless the kernel “says it’s OK”.

The specific mechanisms for memory management and address translation are

machine-dependent.

What’s in an Object File or Executable?What’s in an Object File or Executable?

int j = 327;char* s = “hello\n”;char sbuf[512];

int p() { int k = 0; j = write(1, s, 6); return(j);}

text

dataidata

wdata

header

symboltable

relocationrecords

Used by linker; may be removed after final link step and strip.

Header “magic number”indicates type of image.

Section table an arrayof (offset, len, startVA)

program sections

program instructionsp

immutable data (constants)“hello\n”

writable global/static dataj, s

j, s ,p,sbuf

The Program and the Process VASThe Program and the Process VAS

text

dataidatawdata

header

symboltable

relocationrecords

program

text

data

BSS

user stack

args/envkernel

data

process VAS

sections segments

BSS“Block Started by Symbol”(uninitialized global data)e.g., heap and sbuf go here.

Args/env strings copied in by kernel when the process is created.

Process text segment is initialized directly from program text

section.

Process data segment(s) are

initialized from idata and wdata sections.

Process stack and BSS (e.g., heap) segment(s) are

zero-filled.

Process BSS segment may be expanded at runtime with a system call (e.g., Unix sbrk) called by the heap manager

routines.

Text and idata segments may be write-protected.

Virtual Address TranslationVirtual Address Translation

VPN offset

29 013

Example: typical 32-bitarchitecture with 8KB pages.

addresstranslation

Virtual address translation maps a virtual page number (VPN) to a physical page frame number (PFN): the rest is easy.

PFN

offset

+

00

virtual address

physical address{

Deliver exception toOS if translation is notvalid and accessible inrequested mode.

Role of MMU Hardware and OSRole of MMU Hardware and OS

VM address translation must be very cheap (on average).

• Every instruction includes one or two memory references.

(including the reference to the instruction itself)

VM translation is supported in hardware by a Memory Management Unit or MMU.

• The addressing model is defined by the CPU architecture.

• The MMU itself is an integral part of the CPU.

The role of the OS is to install the virtual-physical mapping and intervene if the MMU reports that it cannot complete the translation.

The OS Directs the MMUThe OS Directs the MMU

The OS controls the operation of the MMU to select:

(1) the subset of possible virtual addresses that are valid for each process (the process virtual address space);

(2) the physical translations for those virtual addresses;

(3) the modes of permissible access to those virtual addresses;

read/write/execute

(4) the specific set of translations in effect at any instant.

need rapid context switch from one address space to another

MMU completes a reference only if the OS “says it’s OK”.MMU raises an exception if the reference is “not OK”.

The Translation Lookaside Buffer (TLB)The Translation Lookaside Buffer (TLB)

An on-chip hardware translation buffer (TB or TLB) caches recently used virtual-physical translations (ptes).

Alpha 21164: 48-entry fully associative TLB.

A CPU pipeline stage probes the TLB to complete over 99% of address translations in a single cycle.

Like other memory system caches, replacement of TLB entries is simple and controlled by hardware.

e.g., Not Last Used

If a translation misses in the TLB, the entry must be fetched by accessing the page table(s) in memory.

cost: 10-500 cycles

Care and Feeding of TLBsCare and Feeding of TLBs

The OS kernel carries out its memory management functions by issuing privileged operations on the MMU.

Choice 1: OS maintains page tables examined by the MMU.• MMU loads TLB autonomously on each TLB miss

• page table format is defined by the architecture

• OS loads page table bases and lengths into privileged memory management registers on each context switch.

Choice 2: OS controls the TLB directly.• MMU raises exception if the needed pte is not in the TLB.

• Exception handler loads the missing pte by reading data structures in memory (software-loaded TLB).

A Simple Page TableA Simple Page Table

PFN 0PFN 1

PFN i

page #i offset

user virtual address

PFN i+

offset

process page table

physical memorypage frames

In this example, each VPN j maps to PFN j,

but in practice any physical frame may be

used for any virtual page.

Each process/VAS has its own page table.

Virtual addresses are translated relative to

the current page table.

The page tables are themselves stored in memory; a protected

register holds a pointer to the current page table.

Page Tables (2)Page Tables (2)

32 bit address with 2 page table fields

Two-level page tables

Second-level page tables

Top-level page table

[from Tanenbaum]

A Page Table Entry (PTE)A Page Table Entry (PTE)

PFN

valid bit: OS uses this bit to tell the MMU if the translation is valid.

write-enable: OS touches this to enable or disable write access for this mapping.

reference bit: MMU sets this when a reference is made through the mapping.

dirty bit: MMU sets this when a store is completed to the page (page is modified).

This is (roughly) what a MIPS/Nachos page table entry (pte) looks like.

Page Tables (3)Page Tables (3)

Typical page table entry [from Tanenbaum]

Completing a VM ReferenceCompleting a VM Reference

raiseexception

probepage table

loadTLB

probe TLB

accessphysicalmemory

accessvalid?

pagefault?

signalprocess

allocateframe

page ondisk?

fetchfrom disk

zero-fillloadTLB

starthere

MMU

OS

Demand Paging and Page FaultsDemand Paging and Page Faults

OS may leave some virtual-physical translations unspecified.

mark the pte for a virtual page as invalid

If an unmapped page is referenced, the machine passes control to the kernel exception handler (page fault).

passes faulting virtual address and attempted access mode

Handler initializes a page frame, updates pte, and restarts.

If a disk access is required, the OS may switch to another process after initiating the I/O.

Page faults are delivered at IPL 0, just like a system call trap.

Fault handler executes in context of faulted process, blocks on a semaphore or condition variable awaiting I/O completion.

Where Pages Come FromWhere Pages Come From

text

data

BSS

user stack

args/envkernel

data

file volumewith

executable programs

Fetches for clean text or data are typically fill-from-file.

Modified (dirty) pages are pushed to backing store (swap) on eviction.

Paged-out pages are fetched from backing store when needed.

Initial references to user stack and BSS are satisfied by zero-fill on demand.

Processes and the KernelProcesses and the Kernel

data dataprocesses in private

virtual address spaces

system call traps...and upcalls (e.g.,

signals)

shared kernel code and data

in shared address space

Threads or processes enter the kernel for services.

The kernel sets up process execution contexts to

“virtualize” the machine.

CPU and devices force entry to the kernel to handle exceptional events.

The KernelThe Kernel• Today, all “real” operating systems have protected kernels.

The kernel resides in a well-known file: the “machine” automatically loads it into memory (boots) on power-on/reset.

Our “kernel” is called the executive in some systems (e.g., XP).

• The kernel is (mostly) a library of service procedures shared by all user programs, but the kernel is protected:

User code cannot access internal kernel data structures directly, and it can invoke the kernel only at well-defined entry points (system calls).

• Kernel code is like user code, but the kernel is privileged:

The kernel has direct access to all hardware functions, and defines the machine entry points for interrupts and exceptions.

Protecting Entry to the KernelProtecting Entry to the Kernel

Protected events and kernel mode are the architectural foundations of kernel-based OS (Unix, XP, etc).

• The machine defines a small set of exceptional event types.

• The machine defines what conditions raise each event.

• The kernel installs handlers for each event at boot time.

e.g., a table in kernel memory read by the machine

The machine transitions to kernel mode only on an exceptional event.

The kernel defines the event handlers.

Therefore the kernel chooses what code will execute in kernel mode, and when.

user

kernel

interrupt orexceptiontrap/return

Example: System Call TrapsExample: System Call Traps

User code invokes kernel services by initiating system call traps.

• Programs in C, C++, etc. invoke system calls by linking to a standard library of procedures written in assembly language.

the library defines a stub or wrapper routine for each syscall

stub executes a special trap instruction (e.g., chmk or callsys or int)

syscall arguments/results passed in registers or user stack

read() in Unix libc.a library (executes in user mode):

#define SYSCALL_READ 27 # code for a read system call

move arg0…argn, a0…an # syscall args in registers A0..ANmove SYSCALL_READ, v0 # syscall dispatch code in V0callsys # kernel trapmove r1, _errno # errno = return statusreturn

Alpha CPU architecture

FaultsFaultsFaults are similar to system calls in some respects:

• Faults occur as a result of a process executing an instruction.

Fault handlers execute on the process kernel stack; the fault handler may block (sleep) in the kernel.

• The completed fault handler may return to the faulted context.

But faults are different from syscall traps in other respects:

• Syscalls are deliberate, but faults are “accidents”.

divide-by-zero, dereference invalid pointer, memory page fault

• Not every execution of the faulting instruction results in a fault.

may depend on memory state or register contents

The Role of EventsThe Role of Events

A CPU event is an “unnatural” change in control flow.Like a procedure call, an event changes the PC.

Also changes mode or context (current stack), or both.

Events do not change the current space!

The kernel defines a handler routine for each event type.Event handlers always execute in kernel mode.

The specific types of events are defined by the machine.

Once the system is booted, every entry to the kernel occurs as a result of an event.

In some sense, the whole kernel is a big event handler.

CPU Events: Interrupts and ExceptionsCPU Events: Interrupts and Exceptions

An interrupt is caused by an external event.device requests attention, timer expires, etc.

An exception is caused by an executing instruction.CPU requires software intervention to handle a fault or trap.

unplanned deliberatesync fault syscall trapasync interrupt AST

control flow

event handler (e.g., ISR: Interrupt Service

Routine)

exception.cc

AST: Asynchronous System TrapAlso called a software interrupt or an Asynchronous or Deferred Procedure Call (APC or DPC)

Note: different “cultures” may use some of these terms (e.g., trap, fault, exception, event, interrupt) slightly differently.

Mode, Space, and ContextMode, Space, and Context

At any time, the state of each processor is defined by:

1. mode: given by the mode bitIs the CPU executing in the protected kernel or a user program?

2. space: defined by V->P translations currently in effectWhat address space is the CPU running in? Once the system is

booted, it always runs in some virtual address space.

3. context: given by register state and execution streamIs the CPU executing a thread/process, or an interrupt handler?

Where is the stack?

These are important because the mode/space/context determines the meaning and validity of key operations.

The Virtual Address SpaceThe Virtual Address Space

A typical process VAS space includes:• user regions in the lower half

V->P mappings specific to each process

accessible to user or kernel code

• kernel regions in upper halfshared by all processes, but accessible only

to kernel code

• NT (XP?) on x86 subdivides kernel region into an unpaged half and a (mostly) paged upper half at 0xC0000000 for page tables and I/O cache.

• Win95/98 uses the lower half of system space as a system-wide shared region.

text

data

BSS

user stack

args/env

0

data

kernel textand

kernel data

2n-1

2n-1

0x0

0xffffffff

A VAS for a private address space system (e.g., Unix, NT/XP) executing on a typical 32-bit system (e.g., x86).

sbrk()

jsr

Process and Kernel Address SpacesProcess and Kernel Address Spaces

data

0

2n-1-1

2n-1

2n-1

data

0x7FFFFFFF

0x80000000

0xFFFFFFFF

0x0

n-bit virtual address space

32-bit virtual address space

FaultsFaultsFaults are similar to system calls in some respects:

• Faults occur as a result of a process executing an instruction.

Fault handlers execute on the process kernel stack; the fault handler may block (sleep) in the kernel.

• The completed fault handler may return to the faulted context.

But faults are different from syscall traps in other respects:

• Syscalls are deliberate, but faults are “accidents”.

divide-by-zero, dereference invalid pointer, memory page fault

• Not every execution of the faulting instruction results in a fault.

may depend on memory state or register contents

Options for Handling a Fault (1)Options for Handling a Fault (1)

1. Some faults are handled by “patching things up” and returning to the faulted context.

Example: the kernel may resolve an address fault (virtual memory fault) by installing a new virtual-physical translation.

The fault handler may adjust the saved PC to re-execute the faulting instruction after returning from the fault.

2. Some faults are handled by notifying the process that the fault occurred, so it may recover in its own way.

Fault handler munges the saved user context (PC, SP) to transfer control to a registered user-mode handler on return from the fault.

Example: Unix signals or Microsoft NT user-mode Asynchronous Procedure Calls (APCs).

Options for Handling a Fault (2)Options for Handling a Fault (2)

3. The kernel may handle unrecoverable faults by killing the user process.

Program fault with no registered user-mode handler?

Destroy the process, release its resources, maybe write the memory image to a file, and find another ready process/thread to run.

In Unix this is the default action for many signals (e.g., SEGV).

4. How to handle faults generated by the kernel itself?Kernel follows a bogus pointer? Divides by zero? Executes an

instruction that is undefined or reserved to user mode?

These are generally fatal operating system errors resulting in a system crash, e.g., panic()!

Issues for Paged Memory ManagementIssues for Paged Memory Management

The OS tries to minimize page fault costs incurred by all processes, balancing fairness, system throughput, etc.

(1) fetch policy: When are pages brought into memory?

prepaging: reduce page faults by bring pages in before needed

clustering: reduce seeks on backing storage

(2) replacement policy: How and when does the system select victim pages to be evicted/discarded from memory?

(3) backing storage policy:

Where does the system store evicted pages?

When is the backing storage allocated?

When does the system write modified pages to backing store?

Virtual Memory as a CacheVirtual Memory as a Cache

text

dataidatawdata

header

symboltable, etc.

programsections

text

data

BSS

user stack

args/envkernel

data

processsegments

physicalpage frames

virtualmemory

(big)

physicalmemory(small)

executablefile

backingstorage

virtual-to-physical translations

pageout/eviction

page fetch

Rationale for I/O Cache StructureRationale for I/O Cache Structure

Goal: maintain K slots in memory as a cache over a collection of m items on secondary storage (K << m).

1. What happens on the first access to each item?Fetch it into some slot of the cache, use it, and leave it there to

speed up access if it is needed again later.

2. How to determine if an item is resident in the cache?Maintain a directory of items in the cache: a hash table.

Hash on a unique identifier (tag) for the item (fully associative).

3. How to find a slot for an item fetched into the cache?Choose an unused slot, or select an item to replace according to

some policy, and evict it from the cache, freeing its slot.

Mechanism for Cache Eviction/ReplacementMechanism for Cache Eviction/Replacement

Typical approach: maintain an ordered free/inactive list of slots that are candidates for reuse.

• Busy items in active use are not on the list.

E.g., some in-memory data structure holds a pointer to the item.

E.g., an I/O operation is in progress on the item.

• The best candidates are slots that do not contain valid items.

Initially all slots are free, and they may become free again as items are destroyed (e.g., as files are removed).

• Other slots are listed in order of value of the items they contain.

These slots contain items that are valid but inactive: they are held in memory only in the hope that they will be accessed again later.

Replacement PolicyReplacement Policy

The effectiveness of a cache is determined largely by the policy for ordering slots/items on the free/inactive list.

defines the replacement policy

A typical cache replacement policy is Least Recently Used.

• Assume hot items used recently are likely to be used again.

• Move the item to the tail of the free list on every release.

• The item at the front of the list is the coldest inactive item.

Other alternatives:

• FIFO: replace the oldest item.

• MRU/LIFO: replace the most recently used item.

The Page Caching ProblemThe Page Caching Problem

Each thread/process/job utters a stream of page references.

• reference string: e.g., abcabcdabce..

The OS tries to minimize the number of faults incurred.

• The set of pages (the working set) actively used by each job changes relatively slowly.

• Try to arrange for the resident set of pages for each active job to closely approximate its working set.

Replacement policy is the key.

• On each page fault, select a victim page to evict from memory; read the new page into the victim’s frame.

• Most systems try to approximate an LRU policy.

VM Page Cache InternalsVM Page Cache Internals

HASH(memory object/segment, logical block)1. Pages in active use are mapped through the page table of one or more processes.

2. On a fault, the global object/offset hash table in kernel finds pages brought into memory by other processes.

3. Several page queues wind through the set of active frames, keeping track of usage.

4. Pages selected for eviction are removed from all page tables first.

Managing the VM Page CacheManaging the VM Page Cache

Managing a VM page cache is similar to a file block cache, but with some new twists.

1. Pages are typically referenced by page table (pmap) entries.

Must invalidate mappings before reusing the frame.

2. Reads and writes are implicit; the TLB hides them from the OS.

How can we tell if a page is dirty?

How can we tell if a page is referenced?

3. Cache manager must run policies periodically, sampling page state.

Continuously push dirty pages to disk to “launder” them.

Continuously check references to judge how “hot” each page is.

Balance accuracy with sampling overhead.

The Paging DaemonThe Paging Daemon

Most OS have one or more system processes responsible for implementing the VM page cache replacement policy.

• A daemon is an autonomous system process that periodically performs some housekeeping task.

The paging daemon prepares for page eviction before the need arises.

• Wake up when free memory becomes low.

• Clean dirty pages by pushing to backing store.

prewrite or pageout

• Maintain ordered lists of eviction candidates.

• Decide how much memory to allocate to file cache, VM, etc.

LRU Approximations for PagingLRU Approximations for Paging

Pure LRU and LFU are prohibitively expensive to implement.

• most references are hidden by the TLB

• OS typically sees less than 10% of all references

• can’t tweak your ordered page list on every reference

Most systems rely on an approximation to LRU for paging.

• periodically sample the reference bit on each page

visit page and set reference bit to zero

run the process for a while (the reference window)

come back and check the bit again

• reorder the list of eviction candidates based on sampling

FIFO with Second ChanceFIFO with Second Chance

Idea: do simple FIFO replacement, but give pages a “second chance” to prove their value before they are replaced.• Every frame is on one of three FIFO lists:

active, inactive and free

• Page fault handler installs new pages on tail of active list.

• “Old” pages are moved to the tail of the inactive list.Paging daemon moves pages from head of active list to tail of

inactive list when demands for free frames is high.

Clear the refbit and protect the inactive page to “monitor” it.

• Pages on the inactive list get a “second chance”.If referenced while inactive, reactivate to the tail of active list.

Illustrating FIFO-2CIllustrating FIFO-2C

activelist

inactivelist

freelist

Consume frames from the head ofthe free list (free).

If free shrinks below threshhold, kickthe paging daemon to start a scan (I, II).

2. Page has not been referenced? pmap_page_protect and place on tail of free list.

3. Page is dirty? Push to backing store and return it to inactive list tail (clean).

I. Restock inactive list by pulling pages fromthe head of the active list: clear the ref bit andplace on inactive list (deactivation).

II. Inactive list scan from head:1. Page has been referenced? Return to tail of active list (reactivation).

Paging daemon typically scans a few times persecond, even if not needed to restock free list.

FIFO-2C in Action (FreeBSD)FIFO-2C in Action (FreeBSD)

What Do the Pretty Colors Mean?What Do the Pretty Colors Mean?

This is a plot of selected internal kernel events during a run of a process that randomly reads/writes its virtual memory.

• x-axis: time in milliseconds (total run is about 3 seconds)

• y-axis: each event pertains to a physical page frame, whose PFN is given on the y-axis

The machine is an Alpha with 8000 8KB pages (64MB total)

The process uses 48MB of virtual memory: force the paging daemon to do FIFO-2C bookkeeping, but little actual paging.

• events: page allocate (yellow-green), page free (red), deactivation (duke blue), reactivation (lime green), page clean (carolina blue).

What to Look ForWhat to Look For

• Some low physical memory ranges are reserved to the kernel.

• Process starts and soaks up memory that was initially free.

• Paging daemon frees pages allocated to other processes, and the system reallocates them to the test process.

• After an initial flurry of demand-loading activity, things settle down after most of the process memory is resident.

• Paging daemon begins to run more frequently as memory becomes overcommitted (dark blue deactivation stripes).

• Test process touches pages deactivated by the paging daemon, causing them to be reactivated.

• Test process exits (heavy red bar).

page allocpage alloc

deactivatedeactivate

cleanclean

freefree

activateactivate

Questions for Paged Virtual MemoryQuestions for Paged Virtual Memory

1. How do we prevent users from accessing protected data?

2. If a page is in memory, how do we find it?Address translation must be fast.

3. If a page is not in memory, how do we find it?

4. When is a page brought into memory?

5. If a page is brought into memory, where do we put it?

6. If a page is evicted from memory, where do we put it?

7. How do we decide which pages to evict from memory?Page replacement policy should minimize overall I/O.

What You Should KnowWhat You Should Know

• Basics of paged memory management

• Typical address space layout

• Basics of address translation

• Architectural mechanisms to support paged memory

• Importance for kernel protection and process isolation

• Why the simple page table is inadequate

• Motivation for and structure of hierarchical tables

• Optional: motivation for and structure of hashed (inverted) tables

BackgroundBackground

Be sure you understand why page-based memory allocation is more memory-efficient than the old way: allocating contiguous physical memory for each address space (partitioning).

• Two partitioning strategies: fixed and variable

• How to make partitioning transparent to programs

• How to protect memory in a partitioned system

• Fragmentation: internal and external

• Fragmentation issues for each strategy

• Relevance to heap managers today

• Approaches to variable partitioning:

First fit, best fit, etc., and the role of compaction.

Memory Management 101Memory Management 101

Once upon a time...memory was called “core”, and programs (“jobs”) were loaded and executed one by one.

• load image in contiguous physical memory

start execution at a known physical location

allocate space in high memory for stack and data

• address text and data using physical addresses

prelink executables for known start address

• run to completion

Memory and MultiprogrammingMemory and Multiprogramming

One day, IBM decided to load multiple jobs in memory at once.

• improve utilization of that expensive CPU

• improve system throughput

Problem 1: how do programs address their memory space?load-time relocation?

Problem 2: how does the OS protect memory from rogue programs?

???

Base and Bound RegistersBase and Bound Registers

Goal: isolate jobs from one another, and from their placement in the machine memory.

• addresses are offsets from the job’s base address

stored in a machine base register

machine computes effective address on each reference

initialized by OS when job is loaded

• machine checks each offset against job size

placed by OS in a bound register

Base and Bound: Pros and ConsBase and Bound: Pros and Cons

Pro:

• each job is physically contiguous

• simple hardware and software

• no need for load-time relocation of linked addresses

• OS may swap or move jobs as it sees fit

Con:

• memory allocation is a royal pain

• job size is limited by available memory

Variable PartitioningVariable Partitioning

Variable partitioning is the strategy of parking differently sized carsalong a street with no marked parking space dividers.

Wasted space from externalfragmentation

Fixed PartitioningFixed Partitioning

Wasted space from internal fragmentation

Alpha Page Tables (Forward Mapped)Alpha Page Tables (Forward Mapped)

2110

POL3L2L1

base+

10 10 13

+

+

PFN

seg 0/1

three-level page table(forward-mapped)

sparse 64-bit address space(43 bits in 21064 and 21164)

offset at each level isdetermined by specific bits in VA

Virtual Memory and Address Translation. Virtual Addressing text data BSS user stack args/env kernel data virtual memory (big) physical memory (small)

Documents

Virtual Memory and Address Translation. Virtual Addressing text data BSS user stack args/env kernel data virtual memory (big) physical memory (small)