Virtual Memory and Address Translation
Virtual Addressing
[Figure: a (big) virtual memory holding the text, data, BSS, user stack, args/env, and kernel data segments is mapped onto a (small) physical memory by virtual-to-physical translations.]
User processes address memory through virtual
addresses.
The kernel and the machine collude to
translate virtual addresses to
physical addresses.
The kernel controls the virtual-physical translations in effect
for each space.
The machine does not allow a user process to access memory unless the kernel “says it’s OK”.
The specific mechanisms for memory management and address translation are
machine-dependent.
What’s in an Object File or Executable?
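#include <unistd.h>          /* declares write() */

int j = 327;                 /* writable global data: wdata */
char* s = "hello\n";         /* pointer s is in wdata; the string constant is in idata */
char sbuf[512];              /* uninitialized global data: BSS */

int p() {
    int k = 0;               /* local variable: lives on the stack, not in the file */
    j = write(1, s, 6);      /* write the 6 bytes of "hello\n" to file descriptor 1 */
    return(j);
}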
[Figure: layout of an object file or executable]
header: the “magic number” indicates the type of image; the section table is an array of (offset, len, startVA).
program sections:
• text: program instructions (p)
• idata: immutable data (constants), e.g., “hello\n”
• wdata: writable global/static data (j, s)
symbol table (j, s, p, sbuf) and relocation records: used by linker; may be removed after final link step and strip.
The Program and the Process VAS
[Figure: the sections of the program file (header, text, data (idata, wdata), symbol table, relocation records) map to the segments of the process VAS (text, data, BSS, user stack, args/env, kernel data).]
BSS: “Block Started by Symbol” (uninitialized global data), e.g., heap and sbuf go here.
Args/env strings copied in by kernel when the process is created.
Process text segment is initialized directly from program text
section.
Process data segment(s) are
initialized from idata and wdata sections.
Process stack and BSS (e.g., heap) segment(s) are
zero-filled.
Process BSS segment may be expanded at runtime with a system call (e.g., Unix sbrk) called by the heap manager
routines.
Text and idata segments may be write-protected.
Virtual Address Translation
Example: typical 32-bit architecture with 8KB pages.
[Figure: a virtual address is split into a virtual page number (VPN) and an offset; address translation replaces the VPN with a PFN, and the offset is carried over unchanged to form the physical address.]
Virtual address translation maps a virtual page number (VPN) to a physical page frame number (PFN): the rest is easy.
Deliver exception to OS if translation is not valid and accessible in requested mode.
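The VPN/offset split can be sketched in a few lines of C. This is a minimal illustration assuming the slide's example parameters (32-bit addresses, 8KB pages, so a 13-bit offset); the function names vpn_of, offset_of, and make_pa are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parameters from the slide's example: 32-bit addresses, 8KB pages. */
#define PAGE_SHIFT  13u                      /* log2(8KB) */
#define PAGE_SIZE   (1u << PAGE_SHIFT)       /* 8192 bytes */
#define OFFSET_MASK (PAGE_SIZE - 1u)         /* low 13 bits */

/* Virtual page number: the high-order 19 bits of a 32-bit address. */
uint32_t vpn_of(uint32_t va) { return va >> PAGE_SHIFT; }

/* Byte offset within the page: the low-order 13 bits. */
uint32_t offset_of(uint32_t va) { return va & OFFSET_MASK; }

/* "The rest is easy": put the PFN in front of the unchanged offset. */
uint32_t make_pa(uint32_t pfn, uint32_t va)
{
    assert(pfn < (1u << (32u - PAGE_SHIFT)));   /* PFN must fit in 19 bits */
    return (pfn << PAGE_SHIFT) | offset_of(va);
}
```

For example, virtual address 0x6ABC lies in virtual page 3 at offset 0xABC; if page 3 maps to frame 5, make_pa(5, 0x6ABC) yields 0xAABC.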
Role of MMU Hardware and OS
VM address translation must be very cheap (on average).
• Every instruction includes one or two memory references.
(including the reference to the instruction itself)
VM translation is supported in hardware by a Memory Management Unit or MMU.
• The addressing model is defined by the CPU architecture.
• The MMU itself is an integral part of the CPU.
The role of the OS is to install the virtual-physical mapping and intervene if the MMU reports that it cannot complete the translation.
The OS Directs the MMU
The OS controls the operation of the MMU to select:
(1) the subset of possible virtual addresses that are valid for each process (the process virtual address space);
(2) the physical translations for those virtual addresses;
(3) the modes of permissible access to those virtual addresses;
read/write/execute
(4) the specific set of translations in effect at any instant.
need rapid context switch from one address space to another
MMU completes a reference only if the OS “says it’s OK”.
MMU raises an exception if the reference is “not OK”.
The Translation Lookaside Buffer (TLB)
An on-chip hardware translation buffer (TB or TLB) caches recently used virtual-physical translations (ptes).
Alpha 21164: 48-entry fully associative TLB.
A CPU pipeline stage probes the TLB to complete over 99% of address translations in a single cycle.
Like other memory system caches, replacement of TLB entries is simple and controlled by hardware.
e.g., Not Last Used
If a translation misses in the TLB, the entry must be fetched by accessing the page table(s) in memory.
cost: 10-500 cycles
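A fully associative TLB can be modeled as a small array that is searched by VPN. The sketch below assumes 48 entries (as in the Alpha 21164 example) and uses round-robin replacement as a stand-in policy; the struct and function names are hypothetical, and real hardware compares all entries in parallel rather than in a loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 48              /* as in the Alpha 21164 example */

struct tlb_entry {
    bool     valid;
    uint32_t vpn;                   /* tag */
    uint32_t pfn;                   /* cached translation */
};

struct tlb {
    struct tlb_entry e[TLB_ENTRIES];
    unsigned next;                  /* hypothetical round-robin victim pointer */
};

/* Invalidate every entry, e.g. when the translations in effect change. */
void tlb_flush(struct tlb *t)
{
    for (unsigned i = 0; i < TLB_ENTRIES; i++)
        t->e[i].valid = false;
    t->next = 0;
}

/* Fully associative probe: any entry may hold any VPN, so check them all. */
bool tlb_probe(const struct tlb *t, uint32_t vpn, uint32_t *pfn_out)
{
    assert(t && pfn_out);
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        if (t->e[i].valid && t->e[i].vpn == vpn) {
            *pfn_out = t->e[i].pfn;
            return true;            /* hit */
        }
    }
    return false;                   /* miss: fetch the pte from the page table */
}

/* Install a translation after a miss. An entry already holding this VPN is
 * refreshed in place; otherwise the round-robin victim is replaced. */
void tlb_load(struct tlb *t, uint32_t vpn, uint32_t pfn)
{
    unsigned slot = t->next;
    for (unsigned i = 0; i < TLB_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].vpn == vpn)
            slot = i;
    t->e[slot] = (struct tlb_entry){ .valid = true, .vpn = vpn, .pfn = pfn };
    if (slot == t->next)
        t->next = (t->next + 1) % TLB_ENTRIES;
}
```

A miss path would call tlb_probe, fall back to the page table when it returns false, and then call tlb_load with the result.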
Care and Feeding of TLBs
The OS kernel carries out its memory management functions by issuing privileged operations on the MMU.
Choice 1: OS maintains page tables examined by the MMU.
• MMU loads TLB autonomously on each TLB miss
• page table format is defined by the architecture
• OS loads page table bases and lengths into privileged memory management registers on each context switch.
Choice 2: OS controls the TLB directly.
• MMU raises exception if the needed pte is not in the TLB.
• Exception handler loads the missing pte by reading data structures in memory (software-loaded TLB).
A Simple Page Table
[Figure: the page number i of a user virtual address indexes the process page table (entries PFN 0, PFN 1, ..., PFN i); PFN i plus the offset selects a location in the physical memory page frames.]
In this example, each VPN j maps to PFN j,
but in practice any physical frame may be
used for any virtual page.
Each process/VAS has its own page table.
Virtual addresses are translated relative to
the current page table.
The page tables are themselves stored in memory; a protected
register holds a pointer to the current page table.
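A lookup in a simple linear page table might look like the sketch below. It assumes 8KB pages and models the protected register as a base pointer plus a length; the pte layout and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  13u                         /* assumed 8KB pages */
#define OFFSET_MASK ((1u << PAGE_SHIFT) - 1u)

struct pte {
    bool     valid;
    uint32_t pfn;
};

/* Models the protected register: base and length of the current page table. */
struct page_table {
    const struct pte *base;
    size_t            length;                   /* VPNs 0..length-1 are covered */
};

/* Translate va relative to the current page table. Returns false where the
 * machine would raise an exception instead of completing the reference. */
bool translate(const struct page_table *cur, uint32_t va, uint32_t *pa_out)
{
    assert(cur && pa_out);
    uint32_t vpn = va >> PAGE_SHIFT;
    if (vpn >= cur->length || !cur->base[vpn].valid)
        return false;
    *pa_out = (cur->base[vpn].pfn << PAGE_SHIFT) | (va & OFFSET_MASK);
    return true;
}
```

Switching address spaces then amounts to pointing cur at a different process's table.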
Page Tables (2)
[Figure, from Tanenbaum: two-level page tables. A 32-bit address with 2 page table fields indexes the top-level page table, whose entries point to second-level page tables.]
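A two-level walk can be sketched as follows. The field widths (10-bit top-level index, 10-bit second-level index, 12-bit offset) are an assumption matching the usual textbook split of a 32-bit address; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed field widths: PT1 = 10 bits, PT2 = 10 bits, offset = 12 bits. */
#define OFFSET_BITS 12u
#define INDEX_BITS  10u
#define ENTRIES     (1u << INDEX_BITS)          /* 1024 entries per table */

struct pte { bool valid; uint32_t pfn; };

struct second_level { struct pte pte[ENTRIES]; };

/* A NULL slot means no second-level table exists for that 4MB region,
 * which is how the scheme saves space for sparse address spaces. */
struct top_level { struct second_level *table[ENTRIES]; };

bool walk(const struct top_level *top, uint32_t va, uint32_t *pa_out)
{
    assert(top && pa_out);
    uint32_t pt1 = va >> (OFFSET_BITS + INDEX_BITS);        /* top 10 bits */
    uint32_t pt2 = (va >> OFFSET_BITS) & (ENTRIES - 1u);    /* next 10 bits */
    uint32_t off = va & ((1u << OFFSET_BITS) - 1u);         /* low 12 bits */

    const struct second_level *second = top->table[pt1];
    if (second == NULL || !second->pte[pt2].valid)
        return false;                                       /* fault */
    *pa_out = (second->pte[pt2].pfn << OFFSET_BITS) | off;
    return true;
}
```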
A Page Table Entry (PTE)
[Figure: a page table entry holding the PFN and the control bits below.]
valid bit: OS uses this bit to tell the MMU if the translation is valid.
write-enable: OS touches this to enable or disable write access for this mapping.
reference bit: MMU sets this when a reference is made through the mapping.
dirty bit: MMU sets this when a store is completed to the page (page is modified).
This is (roughly) what a MIPS/Nachos page table entry (pte) looks like.
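The four control bits can be modeled with bit masks. The layout below (flags in the low bits, PFN starting at bit 12) is hypothetical and is not the actual MIPS or Nachos format; pte_access sketches what the MMU does with the bits on each reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: flags in the low bits, PFN above them. */
#define PTE_VALID     (1u << 0)   /* OS: translation is valid */
#define PTE_WRITE     (1u << 1)   /* OS: write access enabled */
#define PTE_REF       (1u << 2)   /* MMU: set when the mapping is referenced */
#define PTE_DIRTY     (1u << 3)   /* MMU: set when a store completes */
#define PTE_PFN_SHIFT 12u

typedef uint32_t pte_t;

/* The OS builds an entry: valid, optionally writable, not yet referenced. */
pte_t pte_make(uint32_t pfn, bool writable)
{
    assert(pfn < (1u << (32u - PTE_PFN_SHIFT)));
    return (pfn << PTE_PFN_SHIFT) | PTE_VALID | (writable ? PTE_WRITE : 0u);
}

uint32_t pte_pfn(pte_t p) { return p >> PTE_PFN_SHIFT; }

/* What the MMU does on a reference through the mapping. Returns false
 * when the reference must become an exception instead. */
bool pte_access(pte_t *p, bool is_store)
{
    if (!(*p & PTE_VALID))
        return false;                 /* invalid translation */
    if (is_store && !(*p & PTE_WRITE))
        return false;                 /* write not enabled */
    *p |= PTE_REF;
    if (is_store)
        *p |= PTE_DIRTY;
    return true;
}
```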
Page Tables (3)
[Figure: typical page table entry, from Tanenbaum]
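Completing a VM Reference
[Figure: flowchart of a VM reference, divided between the MMU and the OS.]
MMU: start here; probe TLB; on a miss, probe page table and load TLB; access valid? If so, access physical memory; if not, raise exception.
OS: page fault? If not, signal process. If so, allocate frame; page on disk? If so, fetch from disk; if not, zero-fill; then load TLB.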
Demand Paging and Page Faults
OS may leave some virtual-physical translations unspecified.
mark the pte for a virtual page as invalid
If an unmapped page is referenced, the machine passes control to the kernel exception handler (page fault).
passes faulting virtual address and attempted access mode
Handler initializes a page frame, updates pte, and restarts.
If a disk access is required, the OS may switch to another process after initiating the I/O.
Page faults are delivered at IPL 0, just like a system call trap.
Fault handler executes in context of faulted process, blocks on a semaphore or condition variable awaiting I/O completion.
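The handler's steps can be sketched as a toy simulation. Everything here is an assumption for illustration: a 4-page address space, 2 frames, an in-memory array standing in for the disk, and no eviction or blocking. A real handler would sleep during the I/O and evict a page when no frame is free.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE 8192
#define NPAGES    4        /* tiny address space for illustration */
#define NFRAMES   2        /* tiny physical memory */

enum backing      { ZERO_FILL, ON_DISK };
enum fault_result { FAULT_RESOLVED, FAULT_SIGNAL };

struct vpage {
    bool in_space;         /* is this page part of the process address space? */
    bool valid;            /* pte valid bit: translation installed */
    int  frame;            /* frame number, meaningful when valid */
    enum backing where;    /* how to initialize the page on first touch */
};

static struct vpage  pages[NPAGES];
static unsigned char frames[NFRAMES][PAGE_SIZE];
static unsigned char disk[NPAGES][PAGE_SIZE];    /* stand-in for backing store */
static int next_free_frame;                      /* no eviction in this sketch */

/* Called with the faulting virtual page number, as passed by the machine. */
enum fault_result page_fault(int vpn)
{
    if (vpn < 0 || vpn >= NPAGES || !pages[vpn].in_space)
        return FAULT_SIGNAL;                     /* illegal reference */
    if (next_free_frame >= NFRAMES)
        return FAULT_SIGNAL;                     /* a real OS would evict here */

    int f = next_free_frame++;                   /* allocate a frame */
    if (pages[vpn].where == ON_DISK)
        memcpy(frames[f], disk[vpn], PAGE_SIZE); /* fetch from disk */
    else
        memset(frames[f], 0, PAGE_SIZE);         /* zero-fill */

    pages[vpn].frame = f;                        /* update the pte ... */
    pages[vpn].valid = true;
    return FAULT_RESOLVED;                       /* ... and restart */
}
```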
Where Pages Come From
[Figure: the process segments (text, data, BSS, user stack, args/env, kernel data) beside a file volume with executable programs.]
Fetches for clean text or data are typically fill-from-file.
Modified (dirty) pages are pushed to backing store (swap) on eviction.
Paged-out pages are fetched from backing store when needed.
Initial references to user stack and BSS are satisfied by zero-fill on demand.
Processes and the Kernel
[Figure: processes in private virtual address spaces, each with its own data, enter the kernel through system call traps and receive upcalls (e.g., signals); shared kernel code and data live in a shared address space.]
Threads or processes enter the kernel for services.
The kernel sets up process execution contexts to
“virtualize” the machine.
CPU and devices force entry to the kernel to handle exceptional events.
The Kernel
• Today, all “real” operating systems have protected kernels.
The kernel resides in a well-known file: the “machine” automatically loads it into memory (boots) on power-on/reset.
Our “kernel” is called the executive in some systems (e.g., XP).
• The kernel is (mostly) a library of service procedures shared by all user programs, but the kernel is protected:
User code cannot access internal kernel data structures directly, and it can invoke the kernel only at well-defined entry points (system calls).
• Kernel code is like user code, but the kernel is privileged:
The kernel has direct access to all hardware functions, and defines the machine entry points for interrupts and exceptions.
Protecting Entry to the Kernel
Protected events and kernel mode are the architectural foundations of kernel-based OS (Unix, XP, etc).
• The machine defines a small set of exceptional event types.
• The machine defines what conditions raise each event.
• The kernel installs handlers for each event at boot time.
e.g., a table in kernel memory read by the machine
The machine transitions to kernel mode only on an exceptional event.
The kernel defines the event handlers.
Therefore the kernel chooses what code will execute in kernel mode, and when.
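The handler table can be modeled as an array of function pointers indexed by event type. This is only a sketch: the event names, the machine struct, and the sample timer handler are hypothetical, and on real hardware the deliver step is performed by the CPU, not by software.

```c
#include <assert.h>
#include <stddef.h>

enum event { EV_SYSCALL_TRAP, EV_PAGE_FAULT, EV_TIMER_INTERRUPT, EV_COUNT };
enum mode  { USER_MODE, KERNEL_MODE };

typedef void (*handler_fn)(void);

struct machine {
    enum mode  mode;
    handler_fn vector[EV_COUNT];   /* the table in kernel memory read by the machine */
};

static struct machine cpu;         /* zero-initialized: USER_MODE, empty table */
static int ticks;
static enum mode mode_seen_by_handler;

/* A sample kernel handler: counts timer events and records the CPU mode. */
void on_timer(void)
{
    ticks++;
    mode_seen_by_handler = cpu.mode;
}

/* Boot-time step: the kernel fills in the table. */
void install_handler(struct machine *m, enum event e, handler_fn h)
{
    assert((int)e >= 0 && e < EV_COUNT && h != NULL);
    m->vector[e] = h;
}

/* What the machine does on an event: enter kernel mode, run the handler
 * the kernel installed, then return to the interrupted mode. */
void deliver(struct machine *m, enum event e)
{
    assert((int)e >= 0 && e < EV_COUNT && m->vector[e] != NULL);
    enum mode saved = m->mode;
    m->mode = KERNEL_MODE;
    m->vector[e]();
    m->mode = saved;
}
```

Because user code cannot write the table, only code chosen by the kernel ever runs in kernel mode.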
[Figure: user and kernel modes, with transitions labeled “interrupt or exception” and “trap/return”.]
Example: System Call Traps
User code invokes kernel services by initiating system call traps.
• Programs in C, C++, etc. invoke system calls by linking to a standard library of procedures written in assembly language.
the library defines a stub or wrapper routine for each syscall
stub executes a special trap instruction (e.g., chmk or callsys or int)
syscall arguments/results passed in registers or user stack
read() in Unix libc.a library (executes in user mode), Alpha CPU architecture:
#define SYSCALL_READ 27            # code for a read system call
move arg0…argn, a0…an              # syscall args in registers A0..AN
move SYSCALL_READ, v0              # syscall dispatch code in V0
callsys                            # kernel trap
move r1, _errno                    # errno = return status
return
Faults
Faults are similar to system calls in some respects:
• Faults occur as a result of a process executing an instruction.
Fault handlers execute on the process kernel stack; the fault handler may block (sleep) in the kernel.
• The completed fault handler may return to the faulted context.
But faults are different from syscall traps in other respects:
• Syscalls are deliberate, but faults are “accidents”.
divide-by-zero, dereference invalid pointer, memory page fault
• Not every execution of the faulting instruction results in a fault.
may depend on memory state or register contents
The Role of Events
A CPU event is an “unnatural” change in control flow.
Like a procedure call, an event changes the PC.
Also changes mode or context (current stack), or both.
Events do not change the current space!
The kernel defines a handler routine for each event type.
Event handlers always execute in kernel mode.
The specific types of events are defined by the machine.
Once the system is booted, every entry to the kernel occurs as a result of an event.
In some sense, the whole kernel is a big event handler.
CPU Events: Interrupts and Exceptions
An interrupt is caused by an external event.
device requests attention, timer expires, etc.
An exception is caused by an executing instruction.
CPU requires software intervention to handle a fault or trap.
        unplanned    deliberate
sync    fault        syscall trap
async   interrupt    AST
[Figure: control flow diverted to an event handler (e.g., ISR: Interrupt Service Routine), labeled exception.cc.]
AST: Asynchronous System Trap
Also called a software interrupt or an Asynchronous or Deferred Procedure Call (APC or DPC)
Note: different “cultures” may use some of these terms (e.g., trap, fault, exception, event, interrupt) slightly differently.
Mode, Space, and Context
At any time, the state of each processor is defined by:
1. mode: given by the mode bit
Is the CPU executing in the protected kernel or a user program?
2. space: defined by V->P translations currently in effect
What address space is the CPU running in? Once the system is booted, it always runs in some virtual address space.
3. context: given by register state and execution stream
Is the CPU executing a thread/process, or an interrupt handler? Where is the stack?
These are important because the mode/space/context determines the meaning and validity of key operations.
The Virtual Address Space
A typical process VAS includes:
• user regions in the lower half
V->P mappings specific to each process
accessible to user or kernel code
• kernel regions in upper half
shared by all processes, but accessible only to kernel code
• NT (XP?) on x86 subdivides kernel region into an unpaged half and a (mostly) paged upper half at 0xC0000000 for page tables and I/O cache.
• Win95/98 uses the lower half of system space as a system-wide shared region.
[Figure: the user regions (text, data, BSS, user stack, args/env) occupy the lower half, from 0 (0x0) up to 2^(n-1); kernel text and kernel data occupy the upper half, up to 2^n - 1 (0xffffffff). The figure also carries the labels sbrk() and jsr.]
A VAS for a private address space system (e.g., Unix, NT/XP) executing on a typical 32-bit system (e.g., x86).
Process and Kernel Address Spaces
[Figure: data in the lower half and in the upper half of the address space, with these boundary addresses:]
n-bit virtual address space    32-bit virtual address space
0                              0x0
2^(n-1) - 1                    0x7FFFFFFF
2^(n-1)                        0x80000000
2^n - 1                        0xFFFFFFFF
Options for Handling a Fault (1)
1. Some faults are handled by “patching things up” and returning to the faulted context.
Example: the kernel may resolve an address fault (virtual memory fault) by installing a new virtual-physical translation.
The fault handler may adjust the saved PC to re-execute the faulting instruction after returning from the fault.
2. Some faults are handled by notifying the process that the fault occurred, so it may recover in its own way.
Fault handler munges the saved user context (PC, SP) to transfer control to a registered user-mode handler on return from the fault.
Example: Unix signals or Microsoft NT user-mode Asynchronous Procedure Calls (APCs).
Options for Handling a Fault (2)
3. The kernel may handle unrecoverable faults by killing the user process.
Program fault with no registered user-mode handler?
Destroy the process, release its resources, maybe write the memory image to a file, and find another ready process/thread to run.
In Unix this is the default action for many signals (e.g., SEGV).
4. How to handle faults generated by the kernel itself?
Kernel follows a bogus pointer? Divides by zero? Executes an instruction that is undefined or reserved to user mode?
These are generally fatal operating system errors resulting in a system crash, e.g., panic()!
Issues for Paged Memory Management
The OS tries to minimize page fault costs incurred by all processes, balancing fairness, system throughput, etc.
(1) fetch policy: When are pages brought into memory?
prepaging: reduce page faults by bringing pages in before they are needed
clustering: reduce seeks on backing storage
(2) replacement policy: How and when does the system select victim pages to be evicted/discarded from memory?
(3) backing storage policy:
Where does the system store evicted pages?
When is the backing storage allocated?
When does the system write modified pages to backing store?
Virtual Memory as a Cache
[Figure: the program sections of the executable file (header, text, data (idata, wdata), symbol table, etc.) and backing storage supply pages to the process segments (text, data, BSS, user stack, args/env, kernel data). Virtual memory (big) is mapped by virtual-to-physical translations onto the physical page frames of physical memory (small); arrows show page fetch and pageout/eviction.]
Rationale for I/O Cache Structure
Goal: maintain K slots in memory as a cache over a collection of m items on secondary storage (K << m).
1. What happens on the first access to each item?
Fetch it into some slot of the cache, use it, and leave it there to speed up access if it is needed again later.
2. How to determine if an item is resident in the cache?
Maintain a directory of items in the cache: a hash table.
Hash on a unique identifier (tag) for the item (fully associative).
3. How to find a slot for an item fetched into the cache?
Choose an unused slot, or select an item to replace according to some policy, and evict it from the cache, freeing its slot.
Mechanism for Cache Eviction/Replacement
Typical approach: maintain an ordered free/inactive list of slots that are candidates for reuse.
• Busy items in active use are not on the list.
E.g., some in-memory data structure holds a pointer to the item.
E.g., an I/O operation is in progress on the item.
• The best candidates are slots that do not contain valid items.
Initially all slots are free, and they may become free again as items are destroyed (e.g., as files are removed).
• Other slots are listed in order of value of the items they contain.
These slots contain items that are valid but inactive: they are held in memory only in the hope that they will be accessed again later.
Replacement Policy
The effectiveness of a cache is determined largely by the policy for ordering slots/items on the free/inactive list.
defines the replacement policy
A typical cache replacement policy is Least Recently Used.
• Assume hot items used recently are likely to be used again.
• Move the item to the tail of the free list on every release.
• The item at the front of the list is the coldest inactive item.
Other alternatives:
• FIFO: replace the oldest item.
• MRU/LIFO: replace the most recently used item.
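The LRU ordering described above can be sketched with a doubly linked free/inactive list. This is a minimal illustration with a fixed number of slots; the array-based list and the names lru_release, lru_remove, and lru_evict are hypothetical.

```c
#include <assert.h>

#define NSLOTS 8
#define NONE   (-1)

/* Doubly linked list threaded through arrays: head is the coldest
 * inactive slot, tail is the most recently released one. */
struct lru_list {
    int prev[NSLOTS], next[NSLOTS];
    int on_list[NSLOTS];
    int head, tail;
};

void lru_init(struct lru_list *l)
{
    l->head = l->tail = NONE;
    for (int i = 0; i < NSLOTS; i++) {
        l->prev[i] = l->next[i] = NONE;
        l->on_list[i] = 0;
    }
}

/* Unlink a slot, e.g. when its item becomes busy (in active use). */
void lru_remove(struct lru_list *l, int s)
{
    assert(s >= 0 && s < NSLOTS);
    if (!l->on_list[s])
        return;
    if (l->prev[s] != NONE) l->next[l->prev[s]] = l->next[s];
    else                    l->head = l->next[s];
    if (l->next[s] != NONE) l->prev[l->next[s]] = l->prev[s];
    else                    l->tail = l->prev[s];
    l->prev[s] = l->next[s] = NONE;
    l->on_list[s] = 0;
}

/* Release: the item was just used, so its slot goes to the tail. */
void lru_release(struct lru_list *l, int s)
{
    lru_remove(l, s);
    l->prev[s] = l->tail;
    l->next[s] = NONE;
    if (l->tail != NONE) l->next[l->tail] = s;
    else                 l->head = s;
    l->tail = s;
    l->on_list[s] = 1;
}

/* Evict: take the coldest slot from the head; NONE if the list is empty. */
int lru_evict(struct lru_list *l)
{
    int victim = l->head;
    if (victim != NONE)
        lru_remove(l, victim);
    return victim;
}
```

Changing only where lru_release inserts the slot, or which end lru_evict takes from, gives the FIFO and MRU alternatives.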
The Page Caching Problem
Each thread/process/job utters a stream of page references.
• reference string: e.g., abcabcdabce..
The OS tries to minimize the number of faults incurred.
• The set of pages (the working set) actively used by each job changes relatively slowly.
• Try to arrange for the resident set of pages for each active job to closely approximate its working set.
Replacement policy is the key.
• On each page fault, select a victim page to evict from memory; read the new page into the victim’s frame.
• Most systems try to approximate an LRU policy.
VM Page Cache Internals
[Figure: HASH(memory object/segment, logical block)]
1. Pages in active use are mapped through the page table of one or more processes.
2. On a fault, the global object/offset hash table in kernel finds pages brought into memory by other processes.
3. Several page queues wind through the set of active frames, keeping track of usage.
4. Pages selected for eviction are removed from all page tables first.
Managing the VM Page Cache
Managing a VM page cache is similar to a file block cache, but with some new twists.
1. Pages are typically referenced by page table (pmap) entries.
Must invalidate mappings before reusing the frame.
2. Reads and writes are implicit; the TLB hides them from the OS.
How can we tell if a page is dirty?
How can we tell if a page is referenced?
3. Cache manager must run policies periodically, sampling page state.
Continuously push dirty pages to disk to “launder” them.
Continuously check references to judge how “hot” each page is.
Balance accuracy with sampling overhead.
The Paging Daemon
Most OS have one or more system processes responsible for implementing the VM page cache replacement policy.
• A daemon is an autonomous system process that periodically performs some housekeeping task.
The paging daemon prepares for page eviction before the need arises.
• Wake up when free memory becomes low.
• Clean dirty pages by pushing to backing store.
prewrite or pageout
• Maintain ordered lists of eviction candidates.
• Decide how much memory to allocate to file cache, VM, etc.
LRU Approximations for Paging
Pure LRU and LFU are prohibitively expensive to implement.
• most references are hidden by the TLB
• OS typically sees less than 10% of all references
• can’t tweak your ordered page list on every reference
Most systems rely on an approximation to LRU for paging.
• periodically sample the reference bit on each page
visit page and set reference bit to zero
run the process for a while (the reference window)
come back and check the bit again
• reorder the list of eviction candidates based on sampling
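Reference-bit sampling can be sketched as a periodic pass over the pages. The idle counter and the function names are assumptions for illustration; the slides do not prescribe how the sampled bits are turned into an ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page {
    bool     ref;    /* reference bit, set by the MMU on any access */
    unsigned idle;   /* consecutive sampling windows with no reference seen */
};

/* One sampling pass: note whether each page was touched during the last
 * reference window, then clear the bit to open the next window. */
void sample_refbits(struct page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (pages[i].ref)
            pages[i].idle = 0;      /* hot: referenced during the window */
        else
            pages[i].idle++;        /* colder with every idle window */
        pages[i].ref = false;
    }
}

/* Pick the eviction candidate that has been idle for the most windows. */
size_t coldest_page(const struct page *pages, size_t n)
{
    assert(pages && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (pages[i].idle > pages[best].idle)
            best = i;
    return best;
}
```

A shorter window gives a more accurate ordering but costs more sampling passes.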
FIFO with Second Chance
Idea: do simple FIFO replacement, but give pages a “second chance” to prove their value before they are replaced.
• Every frame is on one of three FIFO lists:
active, inactive and free
• Page fault handler installs new pages on tail of active list.
• “Old” pages are moved to the tail of the inactive list.
Paging daemon moves pages from head of active list to tail of inactive list when demand for free frames is high.
Clear the refbit and protect the inactive page to “monitor” it.
• Pages on the inactive list get a “second chance”.
If referenced while inactive, reactivate to the tail of active list.
Illustrating FIFO-2C
[Figure: the active list, inactive list, and free list.]
Consume frames from the head of the free list (free).
If free shrinks below threshold, kick the paging daemon to start a scan (I, II).
I. Restock inactive list by pulling pages from the head of the active list: clear the ref bit and place on inactive list (deactivation).
II. Inactive list scan from head:
1. Page has been referenced? Return to tail of active list (reactivation).
2. Page has not been referenced? pmap_page_protect and place on tail of free list.
3. Page is dirty? Push to backing store and return it to inactive list tail (clean).
Paging daemon typically scans a few times per second, even if not needed to restock free list.
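The two scan phases can be sketched with three ring-buffer FIFOs. This is a simplified model, not FreeBSD's code: the frame count is fixed, a cleared dirty flag stands in for the push to backing store, and an unreferenced dirty page is cleaned before it can be freed (one reading of steps 2 and 3). All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NFRAMES 8

struct fifo { int q[NFRAMES]; int head, count; };

void fifo_push(struct fifo *f, int frame)       /* add at tail */
{
    assert(f->count < NFRAMES);
    f->q[(f->head + f->count) % NFRAMES] = frame;
    f->count++;
}

int fifo_pop(struct fifo *f)                    /* remove from head */
{
    assert(f->count > 0);
    int frame = f->q[f->head];
    f->head = (f->head + 1) % NFRAMES;
    f->count--;
    return frame;
}

struct vm {
    struct fifo active, inactive, free;
    bool ref[NFRAMES], dirty[NFRAMES];
};

/* I. Deactivation: move up to n frames from the head of the active list
 * to the tail of the inactive list, clearing the ref bit to monitor them. */
void deactivate(struct vm *vm, int n)
{
    while (n-- > 0 && vm->active.count > 0) {
        int f = fifo_pop(&vm->active);
        vm->ref[f] = false;
        fifo_push(&vm->inactive, f);
    }
}

/* II. One pass over the inactive list, starting from its head. */
void scan_inactive(struct vm *vm)
{
    for (int n = vm->inactive.count; n > 0; n--) {
        int f = fifo_pop(&vm->inactive);
        if (vm->ref[f]) {
            fifo_push(&vm->active, f);      /* reactivation */
        } else if (vm->dirty[f]) {
            vm->dirty[f] = false;           /* stands in for the pageout */
            fifo_push(&vm->inactive, f);    /* clean; reconsidered next scan */
        } else {
            fifo_push(&vm->free, f);        /* unreferenced and clean */
        }
    }
}
```

Each frame is on exactly one list at a time, so no queue can overflow its NFRAMES slots.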
FIFO-2C in Action (FreeBSD)
What Do the Pretty Colors Mean?
This is a plot of selected internal kernel events during a run of a process that randomly reads/writes its virtual memory.
• x-axis: time in milliseconds (total run is about 3 seconds)
• y-axis: each event pertains to a physical page frame, whose PFN is given on the y-axis
The machine is an Alpha with 8000 8KB pages (64MB total)
The process uses 48MB of virtual memory: force the paging daemon to do FIFO-2C bookkeeping, but little actual paging.
• events: page allocate (yellow-green), page free (red), deactivation (duke blue), reactivation (lime green), page clean (carolina blue).
What to Look For
• Some low physical memory ranges are reserved to the kernel.
• Process starts and soaks up memory that was initially free.
• Paging daemon frees pages allocated to other processes, and the system reallocates them to the test process.
• After an initial flurry of demand-loading activity, things settle down after most of the process memory is resident.
• Paging daemon begins to run more frequently as memory becomes overcommitted (dark blue deactivation stripes).
• Test process touches pages deactivated by the paging daemon, causing them to be reactivated.
• Test process exits (heavy red bar).
[Figure legend: page alloc, deactivate, clean, free, activate]
Questions for Paged Virtual Memory
1. How do we prevent users from accessing protected data?
2. If a page is in memory, how do we find it?
Address translation must be fast.
3. If a page is not in memory, how do we find it?
4. When is a page brought into memory?
5. If a page is brought into memory, where do we put it?
6. If a page is evicted from memory, where do we put it?
7. How do we decide which pages to evict from memory?
Page replacement policy should minimize overall I/O.
What You Should Know
• Basics of paged memory management
• Typical address space layout
• Basics of address translation
• Architectural mechanisms to support paged memory
• Importance for kernel protection and process isolation
• Why the simple page table is inadequate
• Motivation for and structure of hierarchical tables
• Optional: motivation for and structure of hashed (inverted) tables
Background
Be sure you understand why page-based memory allocation is more memory-efficient than the old way: allocating contiguous physical memory for each address space (partitioning).
• Two partitioning strategies: fixed and variable
• How to make partitioning transparent to programs
• How to protect memory in a partitioned system
• Fragmentation: internal and external
• Fragmentation issues for each strategy
• Relevance to heap managers today
• Approaches to variable partitioning:
First fit, best fit, etc., and the role of compaction.
Memory Management 101
Once upon a time...memory was called “core”, and programs (“jobs”) were loaded and executed one by one.
• load image in contiguous physical memory
start execution at a known physical location
allocate space in high memory for stack and data
• address text and data using physical addresses
prelink executables for known start address
• run to completion
Memory and Multiprogramming
One day, IBM decided to load multiple jobs in memory at once.
• improve utilization of that expensive CPU
• improve system throughput
Problem 1: how do programs address their memory space?
load-time relocation?
Problem 2: how does the OS protect memory from rogue programs?
???
Base and Bound Registers
Goal: isolate jobs from one another, and from their placement in the machine memory.
• addresses are offsets from the job’s base address
stored in a machine base register
machine computes effective address on each reference
initialized by OS when job is loaded
• machine checks each offset against job size
placed by OS in a bound register
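The check-and-add that the machine performs on each reference can be sketched as below. The register struct and function name are hypothetical; the bound is assumed to hold the job size in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The two registers the OS initializes when the job is loaded. */
struct relocation_regs {
    uint32_t base;     /* physical address where the job was placed */
    uint32_t bound;    /* job size in bytes */
};

/* Performed by the machine on every reference. Returns false where the
 * hardware would raise a fault instead of completing the access. */
bool bb_translate(const struct relocation_regs *r, uint32_t offset,
                  uint32_t *pa_out)
{
    assert(r && pa_out);
    if (offset >= r->bound)
        return false;              /* outside the job: protection fault */
    *pa_out = r->base + offset;    /* effective address */
    return true;
}
```

Moving a job only requires copying its memory and updating base; the addresses the program uses stay the same.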
Base and Bound: Pros and Cons
Pro:
• each job is physically contiguous
• simple hardware and software
• no need for load-time relocation of linked addresses
• OS may swap or move jobs as it sees fit
Con:
• memory allocation is a royal pain
• job size is limited by available memory
Variable Partitioning
Variable partitioning is the strategy of parking differently sized cars along a street with no marked parking space dividers.
Wasted space from external fragmentation
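External fragmentation is easy to reproduce with a first-fit allocator over a list of free holes. This is a minimal sketch under assumed names (struct hole, first_fit); it does not coalesce holes or model compaction.

```c
#include <assert.h>
#include <stddef.h>

/* A free region of physical memory: [start, start + len). */
struct hole { size_t start, len; };

/* First fit: carve the request out of the first hole that is large enough.
 * Returns the start address, or (size_t)-1 if no single hole fits. */
size_t first_fit(struct hole *holes, size_t nholes, size_t want)
{
    assert(holes && want > 0);
    for (size_t i = 0; i < nholes; i++) {
        if (holes[i].len >= want) {
            size_t addr = holes[i].start;
            holes[i].start += want;     /* shrink the hole from the front */
            holes[i].len   -= want;
            return addr;
        }
    }
    return (size_t)-1;
}
```

With holes of 100, 250, and 300 units, a request for 400 fails even though 650 units are free in total: the free space is there, but not contiguously.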
Fixed Partitioning
Wasted space from internal fragmentation
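Alpha Page Tables (Forward Mapped)
[Figure: three-level page table (forward-mapped). The virtual address is divided into the fields seg 0/1 (21 bits), L1 (10 bits), L2 (10 bits), L3 (10 bits), and PO (13 bits). Starting from a base register, the L1, L2, and L3 fields are added in turn as offsets into the table at each level, and the final entry yields the PFN.]
sparse 64-bit address space (43 bits in 21064 and 21164)
offset at each level is determined by specific bits in VA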
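Extracting the per-level indices from a virtual address can be sketched as follows, assuming the field widths shown on the slide (three 10-bit level fields above a 13-bit page offset, 43 bits in all). The struct and function names are hypothetical, and the upper seg 0/1 bits are ignored here.

```c
#include <assert.h>
#include <stdint.h>

#define PO_BITS    13u                          /* page offset: 8KB pages */
#define LEVEL_BITS 10u                          /* each of L1, L2, L3 */
#define LEVEL_MASK ((1u << LEVEL_BITS) - 1u)

struct va_fields { unsigned l1, l2, l3, po; };

/* Split the low 43 bits of a virtual address into its four fields.
 * Each level field is the index used at that level of the walk. */
struct va_fields split_va(uint64_t va)
{
    struct va_fields f;
    f.po = (unsigned)(va & ((1u << PO_BITS) - 1u));
    f.l3 = (unsigned)((va >> PO_BITS) & LEVEL_MASK);
    f.l2 = (unsigned)((va >> (PO_BITS + LEVEL_BITS)) & LEVEL_MASK);
    f.l1 = (unsigned)((va >> (PO_BITS + 2u * LEVEL_BITS)) & LEVEL_MASK);
    return f;
}
```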