Recall: Demand Paging Cost Model
• Since Demand Paging is like caching, we can compute an average access time! (“Effective Access Time”)
– EAT = Hit Rate x Hit Time + Miss Rate x Miss Time
– EAT = Hit Time + Miss Rate x Miss Penalty
• Example:
– Memory access time = 200 nanoseconds
– Average page-fault service time = 8 milliseconds
– Suppose p = Probability of miss, 1-p = Probability of hit
– Then, we can compute EAT as follows:
EAT = 200ns + p x 8ms
    = 200ns + p x 8,000,000ns
• If one access out of 1,000 causes a page fault, then EAT = 8.2 μs
– This is a slowdown by a factor of 40!
• What if we want a slowdown of less than 10%?
– EAT < 200ns x 1.1 ⇒ p < 2.5 x 10^-6
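The arithmetic above is easy to check; here is a minimal sketch (the constants come from the example, the variable names are just illustrative):

    #include <stdio.h>

    int main(void) {
        double hit_time_ns     = 200.0;   /* memory access time */
        double miss_penalty_ns = 8e6;     /* 8 ms page-fault service time, in ns */

        /* EAT for one fault per 1,000 accesses */
        double p   = 1.0 / 1000.0;
        double eat = hit_time_ns + p * miss_penalty_ns;
        printf("EAT = %.0f ns (slowdown x%.0f)\n", eat, eat / hit_time_ns);

        /* Largest p keeping slowdown under 10%:
           hit + p*penalty < 1.1*hit  =>  p < 0.1*hit/penalty */
        double p_max = 0.1 * hit_time_ns / miss_penalty_ns;
        printf("need p < %g, i.e. one fault per %.0f accesses\n",
               p_max, 1.0 / p_max);
        return 0;
    }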
[Figure: Clock algorithm — physical pages in a circle with a single clock hand that advances only on page fault, checking for pages not used recently and marking pages as not used recently]
• Clock Algorithm: Arrange physical pages in circle with single clock hand
– Approximate LRU (approximation to approximation to MIN)
– Replace an old page, not the oldest page
• Details:
– Hardware “use” bit per physical page (called “accessed” in Intel architecture):
» Hardware sets use bit on each reference
» If use bit isn’t set, means page not referenced in a long time
» Some hardware sets use bit in the TLB; it must be copied back to the page table entry when the TLB entry gets replaced
– On page fault:
» Advance clock hand (not real time)
» Check use bit: 1 → used recently; clear it and leave page alone. 0 → candidate for replacement
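A minimal sketch of that page-fault-time sweep, assuming a simple circular array of frames with software-visible use bits (all names here are invented for illustration):

    #include <stdbool.h>

    #define NFRAMES 64

    struct frame {
        bool use;               /* hardware-set "use"/"accessed" bit */
        /* ... other page-descriptor fields ... */
    };

    static struct frame frames[NFRAMES];
    static int hand = 0;        /* single clock hand; advances only on fault */

    /* Pick a victim frame: skip (and clear) recently used frames. */
    int clock_evict(void) {
        for (;;) {
            if (frames[hand].use) {
                frames[hand].use = false;   /* used recently: second chance */
                hand = (hand + 1) % NFRAMES;
            } else {
                int victim = hand;          /* not used since last sweep */
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
        }
    }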
• Keep set of free pages ready for use in demand paging
– Freelist filled in background by Clock algorithm or other technique (“Pageout daemon”)
– Dirty pages start copying back to disk when they enter the list
• Like VAX second-chance list
– If page needed before reused, just return to active set
• Advantage: faster for page fault
– Can always use page (or pages) immediately on fault
[Figure: Set of all pages in memory, with a single clock hand that advances as needed to keep the freelist full (“background”)]
• When evicting a page frame, how to know which PTEs to invalidate?
– Hard in the presence of shared pages (forked processes, shared memory, …)
• Reverse mapping mechanism must be very fast
– Must hunt down all page tables pointing at given page frame when freeing a page
– Must hunt down all PTEs when checking whether pages are “active”
• Implementation options:
– For every page descriptor, keep linked list of page table entries that point to it (a sketch follows)
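A hedged sketch of that linked-list option (types and helper names are invented, not Linux’s actual rmap structures; a real kernel also needs locking and TLB shootdown):

    #include <stddef.h>

    typedef unsigned long pte_t;

    /* One entry per PTE that maps a given physical frame. */
    struct pte_ref {
        struct pte_ref *next;
        pte_t          *pte;   /* pointer into some process's page table */
    };

    struct page_descriptor {
        struct pte_ref *rmap;  /* head of list of PTEs mapping this frame */
    };

    /* On eviction, walk the list and zap every mapping of this frame. */
    void invalidate_all(struct page_descriptor *pg) {
        for (struct pte_ref *r = pg->rmap; r != NULL; r = r->next)
            *r->pte = 0;       /* mark PTE invalid */
    }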
• Equal allocation (Fixed Scheme):
– Every process gets same amount of memory
– Example: 100 frames, 5 processes → each process gets 20 frames
• Proportional allocation (Fixed Scheme)
– Allocate according to the size of process
– Computation proceeds as follows:
s_i = size of process p_i, and S = Σ s_i
m = total number of physical frames in the system
a_i = allocation for p_i = (s_i / S) × m
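A minimal sketch of that computation (names and sizes are illustrative; note that integer division truncates, which a real allocator must account for):

    #include <stdio.h>

    /* Proportional allocation: a_i = (s_i / S) * m */
    void allocate(const unsigned long s[], int n, unsigned long m) {
        unsigned long S = 0;
        for (int i = 0; i < n; i++) S += s[i];
        for (int i = 0; i < n; i++)
            printf("process %d: %lu frames\n", i, s[i] * m / S);
    }

    int main(void) {
        unsigned long sizes[] = {10, 40};  /* process sizes, in pages */
        allocate(sizes, 2, 100);           /* 100 frames: prints 20 and 80 */
        return 0;
    }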
• Priority Allocation:
– Proportional scheme using priorities rather than size
» Same type of computation as previous scheme
– Possible behavior: If process p_i generates a page fault, select for replacement a frame from a process with lower priority number
• Perhaps we should use an adaptive scheme instead???
– What if some application just needs more memory?
What about Compulsory Misses?
• Recall that compulsory misses are misses that occur the first time that a page is seen
– Pages that are touched for the first time
– Pages that are touched after process is swapped out/swapped back in
• Clustering:
– On a page-fault, bring in multiple pages “around” the faulting page
– Since efficiency of disk reads increases with sequential reads, makes sense to read several sequential pages (a sketch follows this list)
• Working Set Tracking:
– Use algorithm to try to track working set of application
– When swapping process back in, swap in working set
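A hedged sketch of the clustering idea (CLUSTER, page_present(), and read_page_from_disk() are invented placeholders for the real paging machinery):

    #include <stdbool.h>

    #define CLUSTER 8   /* pages fetched per fault (tunable) */

    /* Hypothetical hooks into the paging system. */
    extern bool page_present(unsigned long vpn);
    extern void read_page_from_disk(unsigned long vpn);

    /* On a fault, bring in the whole aligned cluster around the faulting
       page: one disk seek is amortized over several sequential reads. */
    void handle_fault(unsigned long vpn) {
        unsigned long start = vpn - (vpn % CLUSTER);
        for (unsigned long v = start; v < start + CLUSTER; v++)
            if (!page_present(v))
                read_page_from_disk(v);
    }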
Pre-Meltdown Virtual Map (Details)
• Kernel memory not generally visible to user
– Exception: special VDSO (virtual dynamically linked shared objects) facility that maps kernel code into user space to aid in system calls (and to provide certain actual system calls such as gettimeofday())
• Every physical page described by a “page” structure
– Collected together in lower physical memory
– Can be accessed in kernel virtual space
– Linked together in various “LRU” lists
• For 32-bit virtual memory architectures:
– When physical memory < 896MB
» All physical memory mapped at 0xC0000000
– When physical memory >= 896MB
» Not all physical memory mapped in kernel space all the time
» Can be temporarily mapped with addresses > 0xCC000000
• For 64-bit virtual memory architectures:
– All physical memory mapped above 0xFFFF800000000000
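With all of physical memory mapped at one fixed offset, kernel address translation degenerates to an addition; a sketch loosely mirroring the spirit of Linux’s phys_to_virt()/virt_to_phys() helpers, using the base constant from the slide (details differ in real kernels):

    #include <stdint.h>

    #define DIRECT_MAP_BASE 0xFFFF800000000000ULL

    /* Physical memory appears at a fixed kernel-virtual offset, so
       translating between the two is a single add or subtract. */
    static inline void *phys_to_virt(uint64_t phys) {
        return (void *)(uintptr_t)(DIRECT_MAP_BASE + phys);
    }

    static inline uint64_t virt_to_phys(void *virt) {
        return (uint64_t)(uintptr_t)virt - DIRECT_MAP_BASE;
    }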
Post-Meltdown Memory Map
• Meltdown flaw (2018; Intel x86, IBM Power, ARM)
– Exploit speculative execution to observe contents of kernel memory:

    // Set up side channel (array flushed from cache)
    uchar array[256 * 4096];
    flush(array);                 // Make sure array is out of the cache
    try {                         // ... catch and ignore SIGSEGV (illegal access)
        uchar result = *(uchar *)kernel_address;  // Try access!
        uchar dummy = array[result * 4096];       // Leak info!
    } catch() {;}                 // Could use signal() and setjmp/longjmp
    // Scan through 256 array slots to determine which one was loaded

– Some details:
» Reason we skip 4096 bytes for each value: avoid hardware cache prefetch
» Note that the value is detected by the fact that exactly one cache line is loaded
» Catch and ignore page fault: set signal handler for SIGSEGV, can use setjmp/longjmp…
• Patch: Need different page tables for user and kernel
– Without PCID tag in TLB, must flush TLB twice on every syscall (800% overhead!)
– Need at least Linux v4.14, which utilizes the PCID tag on new hardware to avoid flushing when changing address space
• Fix: better hardware without timing side-channels
• But a parallel bus has many limitations
– Multiplexing address/data for many requests
– Slowest devices must be able to tell what’s happening (e.g., for arbitration)
– Bus speed is set to that of the slowest device
PCI Express “Bus”
• No longer a parallel bus
• Really a collection of fast serial channels or “lanes”
• Devices can use as many as they need to achieve a desired bandwidth
• Slow devices don’t have to share with fast ones
• One of the successes of device abstraction in Linux was the ability to migrate from PCI to PCI Express
– The physical interconnect changed completely, but the old API still worked
Operational Parameters for I/O
• Data granularity: Byte vs. Block
– Some devices provide single byte at a time (e.g., keyboard)
– Others provide whole blocks (e.g., disks, networks, etc.)
• Access pattern: Sequential vs. Random
– Some devices must be accessed sequentially (e.g., tape)
– Others can be accessed “randomly” (e.g., disk, cd, etc.)
» Fixed overhead to start transfers
• Transfer notification: Polling vs. Interrupts
– Some devices require continual monitoring
– Others generate interrupts when they need service
• Programmed I/O:
– Each byte transferred via processor in/out or load/store
– Pro: Simple hardware, easy to program
– Con: Consumes processor cycles proportional to data size
• Direct Memory Access:
– Give controller access to memory bus
– Ask it to transfer data blocks to/from memory directly
• Sample interaction with DMA controller (from OSC book):
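To make the contrast concrete, here is a hedged sketch against an imaginary memory-mapped device (all register addresses, names, and bit layouts are invented):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical memory-mapped device registers. */
    #define DEV_STATUS   (*(volatile uint32_t *)0xFE000000)
    #define DEV_DATA     (*(volatile uint32_t *)0xFE000004)
    #define STATUS_READY 0x1

    /* Programmed I/O: the CPU moves every word itself, so the cost
       grows linearly with the transfer size. */
    void pio_write(const uint32_t *buf, size_t nwords) {
        for (size_t i = 0; i < nwords; i++) {
            while (!(DEV_STATUS & STATUS_READY))
                ;                 /* spin until device can accept data */
            DEV_DATA = buf[i];
        }
    }

    /* DMA: tell the controller where and how much, then get out of
       the way; completion is signaled by an interrupt. */
    #define DMA_ADDR  (*(volatile uint64_t *)0xFE000010)
    #define DMA_COUNT (*(volatile uint32_t *)0xFE000018)
    #define DMA_GO    (*(volatile uint32_t *)0xFE00001C)

    void dma_write(const void *buf, uint32_t nbytes) {
        DMA_ADDR  = (uint64_t)(uintptr_t)buf;  /* physical address in reality */
        DMA_COUNT = nbytes;
        DMA_GO    = 1;    /* controller transfers while the CPU proceeds */
    }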
I/O Device Notifying the OS
• The OS needs to know when:
– The I/O device has completed an operation
– The I/O operation has encountered an error
• I/O Interrupt:
– Device generates an interrupt whenever it needs service
– Pro: handles unpredictable events well
– Con: interrupts have relatively high overhead
• Polling:
– OS periodically checks a device-specific status register
» I/O device puts completion information in status register
– Pro: low overhead
– Con: may waste many cycles on polling if I/O operations are infrequent or unpredictable
• Actual devices combine both polling and interrupts; a sketch of the pattern follows
– For instance, a high-bandwidth network adapter:
» Interrupt for first incoming packet
» Poll for following packets until hardware queues are empty
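A hedged sketch of that hybrid pattern (roughly the shape of Linux’s NAPI, but all names and hooks here are invented):

    #include <stdbool.h>

    /* Invented hooks into the driver and hardware. */
    extern bool rx_queue_empty(void);
    extern void process_one_packet(void);
    extern void disable_rx_interrupts(void);
    extern void enable_rx_interrupts(void);
    extern void schedule_poll(void);

    /* Interrupt fires once, for the first packet of a burst... */
    void rx_interrupt_handler(void) {
        disable_rx_interrupts();  /* switch to polling for the burst */
        schedule_poll();          /* defer the draining to a poll loop */
    }

    /* ...then we poll until the hardware queue is empty. */
    void poll_rx(void) {
        while (!rx_queue_empty())
            process_one_packet();
        enable_rx_interrupts();   /* burst over: back to interrupt mode */
    }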
Recall: Device Drivers
• Device Driver: Device-specific code in the kernel that interacts directly with the device hardware
– Supports a standard, internal interface
– Same kernel I/O system can interact easily with different device drivers
– Special device-specific configuration supported with the ioctl() system call
• Device Drivers typically divided into two pieces:
– Top half: accessed in call path from system calls
» Implements a set of standard, cross-device calls like open(), close(), read(), write(), ioctl(), strategy()
» This is the kernel’s interface to the device driver
» Top half will start I/O to device, may put thread to sleep until finished
– Bottom half: run as interrupt routine
» Gets input or transfers next block of output
» May wake sleeping threads if I/O now complete
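A hedged sketch of the “standard, internal interface” idea: a table of function pointers per driver, loosely in the spirit of Linux’s file_operations (all names here are invented):

    #include <stddef.h>
    #include <sys/types.h>

    /* Top-half entry points: one table per driver, filled in by the
       driver, called by the generic kernel I/O layer. */
    struct dev_ops {
        int     (*open) (int minor);
        int     (*close)(int minor);
        ssize_t (*read) (int minor, void *buf, size_t len);
        ssize_t (*write)(int minor, const void *buf, size_t len);
        int     (*ioctl)(int minor, unsigned cmd, void *arg);
    };

    /* The generic layer dispatches through the table, so the same
       kernel code works with any driver supplying these entry points. */
    ssize_t kernel_read(const struct dev_ops *ops, int minor,
                        void *buf, size_t len) {
        return ops->read(minor, buf, len);  /* may sleep until the bottom
                                               half signals completion */
    }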