Computer System
Chapter 10. Virtual Memory
Lynn Choi
Korea University
A System with Physical Memory Only
Examples: most Cray machines, early PCs, nearly all embedded systems, etc.
Addresses generated by the CPU correspond directly to bytes in physical memory
[Figure: the CPU issues physical addresses 0 .. N-1 directly to memory]
A System with Virtual Memory
Examples: workstations, servers, modern PCs, etc.
Address translation: hardware converts virtual addresses to physical addresses via an OS-managed lookup table (the page table)
[Figure: the CPU issues virtual addresses 0 .. N-1; the page table maps each one either to a physical address 0 .. P-1 in memory or to a location on disk]
Page Faults (like "Cache Misses")
What if an object is on disk rather than in memory?
The page table entry indicates that the virtual address is not in memory
An OS exception handler is invoked to move the data from disk into memory; the current process suspends while others can resume
The OS has full control over placement, etc.
[Figure: page table state before the fault (the entry points to disk) and after handling the fault (the entry points to physical memory)]
Servicing a Page Fault
(1) Processor signals controller: read a block of length P starting at disk address X and store it starting at memory address Y
(2) Read occurs via direct memory access (DMA), under control of the I/O controller
(3) I/O controller signals completion: it interrupts the processor, and the OS resumes the suspended process
[Figure: processor (with register file), cache, memory, and I/O controllers with disks on the memory-I/O bus: (1) initiate block read, (2) DMA transfer, (3) read done]
Memory Management
Multiple processes can reside in physical memory.
How do we resolve address conflicts? What if two processes access something at the same address?
[Figure: Linux/x86 process memory image, top to bottom: kernel virtual memory (invisible to user code); stack (%esp); memory-mapped region for shared libraries; runtime heap (via malloc), bounded above by the "brk" pointer; uninitialized data (.bss); initialized data (.data); program text (.text); forbidden region at address 0]
Solution: Separate Virtual Address Spaces
Virtual and physical address spaces are divided into equal-sized blocks
Blocks are called "pages" (both virtual and physical)
Each process has its own virtual address space; the operating system controls how virtual pages are assigned to physical memory
[Figure: virtual address spaces 0 .. N-1 for process 1 and process 2, each with virtual pages VP 1 and VP 2, mapped by address translation into a shared physical address space (DRAM) 0 .. M-1 at physical pages PP 2, PP 7, and PP 10 (e.g., read-only library code shared by both processes)]
Protection
A page table entry contains access-rights information
Hardware enforces this protection (trap into the OS if a violation occurs)

Page tables:

Process i:  Read? Write? Physical Addr
  VP 0:     Yes   No     PP 9
  VP 1:     Yes   Yes    PP 4
  VP 2:     No    No     XXXXXXX
  ...

Process j:  Read? Write? Physical Addr
  VP 0:     Yes   Yes    PP 6
  VP 1:     Yes   No     PP 9
  VP 2:     No    No     XXXXXXX
  ...
Address Translation Symbols
Virtual address components:
VPO: virtual page offset
VPN: virtual page number
TLBI: TLB index
TLBT: TLB tag
Physical address components:
PPO: physical page offset
PPN: physical page number
CO: byte offset within cache block
CI: cache index
CT: cache tag
Simple Memory System Example
Addressing:
14-bit virtual addresses
12-bit physical addresses
Page size = 64 bytes

Virtual address: bits 13-6 = VPN (virtual page number), bits 5-0 = VPO (virtual page offset)
Physical address: bits 11-6 = PPN (physical page number), bits 5-0 = PPO (physical page offset)
Simple Memory System Page Table (only the first 16 entries shown)

VPN  PPN  Valid     VPN  PPN  Valid
00   28   1         08   13   1
01   -    0         09   17   1
02   33   1         0A   09   1
03   02   1         0B   -    0
04   -    0         0C   -    0
05   16   1         0D   2D   1
06   -    0         0E   11   1
07   -    0         0F   0D   1
Simple Memory System TLB
16 entries, 4-way set associative (so 4 sets: TLBI = VPN bits 1-0, TLBT = VPN bits 7-2)

Set  Tag PPN Valid | Tag PPN Valid | Tag PPN Valid | Tag PPN Valid
0    03  -   0     | 09  0D  1     | 00  -   0     | 07  02  1
1    03  2D  1     | 02  -   0     | 04  -   0     | 0A  -   0
2    02  -   0     | 08  -   0     | 06  -   0     | 03  -   0
3    07  -   0     | 03  0D  1     | 0A  34  1     | 02  -   0
Simple Memory System Cache
16 lines, 4-byte line size, direct mapped (CO = PA bits 1-0, CI = PA bits 5-2, CT = PA bits 11-6)

Idx  Tag Valid B0 B1 B2 B3     Idx  Tag Valid B0 B1 B2 B3
0    19  1     99 11 23 11     8    24  1     3A 00 51 89
1    15  0     -  -  -  -      9    2D  0     -  -  -  -
2    1B  1     00 02 04 08     A    2D  1     93 15 DA 3B
3    36  0     -  -  -  -      B    0B  0     -  -  -  -
4    32  1     43 6D 8F 09     C    12  0     -  -  -  -
5    0D  1     36 72 F0 1D     D    16  1     04 96 34 15
6    31  0     -  -  -  -      E    13  1     83 77 1B D3
7    16  1     11 C2 DF 03     F    14  0     -  -  -  -
Address Translation Example #1
Virtual address 0x03D4
VPN ___  TLBI ___  TLBT ____  TLB hit? __  Page fault? __  PPN ____
Physical address
Offset ___  CI ___  CT ____  Hit? __  Byte ____
Address Translation Example #2
Virtual address 0x0B8F
VPN ___  TLBI ___  TLBT ____  TLB hit? __  Page fault? __  PPN ____
Physical address
Offset ___  CI ___  CT ____  Hit? __  Byte ____
Address Translation Example #3
Virtual address 0x0040
VPN ___  TLBI ___  TLBT ____  TLB hit? __  Page fault? __  PPN ____
Physical address
Offset ___  CI ___  CT ____  Hit? __  Byte ____
Program Start Scenario
Before starting the process:
Load the page directory into physical memory
Load the PDBR (page directory base register) with the base address of the page directory
Load the PC with the start address of the code
When the 1st reference to code triggers:
iTLB miss (translation failed for the instruction address); the exception handler looks up PTE1
dTLB miss (translation failed for PTE1); the exception handler looks up PTE2: look up the page directory, find PTE2, add PTE2 to the dTLB
dTLB hit, but page miss (PTE1 not in memory): load the page containing PTE1
Look up the page table, find PTE1, add PTE1 to the iTLB
iTLB hit, but page miss (code page not present in memory): load the instruction page
Cache miss, but memory returns the instruction
P6 Memory System
[Figure: processor package containing the instruction fetch unit, L1 i-cache and d-cache, instruction and data TLBs, and the L2 cache on the cache bus; the bus interface unit connects to DRAM over the external system bus (e.g., PCI)]
32-bit address space
4 KB page size
L1, L2, and TLBs: 4-way set associative
inst TLB: 32 entries, 8 sets
data TLB: 64 entries, 16 sets
L1 i-cache and d-cache: 16 KB, 32 B line size, 128 sets
L2 cache: unified, 128 KB to 2 MB
Overview of P6 Address Translation
[Figure: the 32-bit virtual address (VA) splits into VPN (20 bits) and VPO (12 bits). The VPN splits into TLBT (16) and TLBI (4) to index the TLB (16 sets, 4 entries/set); on a TLB miss, the VPN splits into VPN1 and VPN2 (10 bits each) for the page-table walk (PDBR, then PDE, then PTE). The resulting physical address (PA) is PPN (20) plus PPO (12), then splits into CT (20), CI (7), and CO (5) to access the L1 cache (128 sets, 4 lines/set); L1 misses go to L2 and DRAM, and the result is 32 bits]
P6 2-level Page Table Structure
Page directory:
1024 4-byte page directory entries (PDEs) that point to page tables
One page directory per process
The page directory must be in memory when its process is running, and is always pointed to by the PDBR
Page tables:
1024 4-byte page table entries (PTEs) that point to pages
Page tables can be paged in and out
[Figure: one page directory of 1024 PDEs pointing to up to 1024 page tables of 1024 PTEs each]
P6 Page Directory Entry (PDE)

Bits 31-12: page table physical base addr | 11-9: Avail | 8: G | 7: PS | 6: - | 5: A | 4: CD | 3: WT | 2: U/S | 1: R/W | 0: P=1

Page table physical base address: 20 most significant bits of the physical page table address (forces page tables to be 4 KB aligned)
Avail: bits available for system programmers
G: global page (don't evict from TLB on task switch)
PS: page size 4 KB (0) or 4 MB (1)
A: accessed (set by MMU on reads and writes, cleared by software)
CD: cache disabled (1) or enabled (0)
WT: write-through or write-back cache policy for this page table
U/S: user or supervisor mode access
R/W: read-only or read-write access
P: page table is present in memory (1) or not (0)

If P=0, bits 31-1 are available for the OS (e.g., the page table's location in secondary storage)
P6 Page Table Entry (PTE)

Bits 31-12: page physical base address | 11-9: Avail | 8: G | 7: 0 | 6: D | 5: A | 4: CD | 3: WT | 2: U/S | 1: R/W | 0: P=1

Page base address: 20 most significant bits of the physical page address (forces pages to be 4 KB aligned)
Avail: available for system programmers
G: global page (don't evict from TLB on task switch)
D: dirty (set by MMU on writes)
A: accessed (set by MMU on reads and writes)
CD: cache disabled or enabled
WT: write-through or write-back cache policy for this page
U/S: user/supervisor
R/W: read/write
P: page is present in physical memory (1) or not (0)

If P=0, bits 31-1 are available for the OS (e.g., the page's location in secondary storage)
How P6 Page Tables Map Virtual Addresses to Physical Ones
[Figure: the virtual address splits into VPN1 (10), VPN2 (10), and VPO (12). The PDBR holds the physical address of the page directory; VPN1 is the word offset into the page directory, selecting a PDE that (if P=1) holds the physical address of the page table base. VPN2 is the word offset into that page table, selecting a PTE that (if P=1) holds the physical address of the page base. The PPN (20) concatenated with PPO (12) = VPO forms the physical address; VPO/PPO is the word offset into both the virtual and physical page]
Representation of Virtual Address Space
Simplified example: 16-page virtual address space
Flags:
P: is the entry in physical memory?
M: has this part of the VA space been mapped?
[Figure: a page directory whose entries carry P/M flags points to page tables PT 0, PT 2, and PT 3; their PTEs (also P/M-flagged) map pages 0-15 either to memory addresses (in mem), to disk addresses (on disk), or to nothing (unmapped)]
P6 TLB Translation
[Figure: the same translation datapath as in the overview, highlighting the TLB lookup: TLBT (16) and TLBI (4) from the VPN index the TLB (16 sets, 4 entries/set); on a TLB hit the PPN is produced directly, and on a miss the VPN1/VPN2 walk through the page tables (PDBR, PDE, PTE) refills the TLB]
P6 TLB
TLB entry (not all documented, so this is speculative):
V (1 bit): indicates a valid (1) or invalid (0) TLB entry
PD (1 bit): is this entry a PDE (1) or a PTE (0)?
Tag (16 bits): disambiguates entries cached in the same set
PDE/PTE (32 bits): page directory or page table entry
Structure of the data TLB: 16 sets, 4 entries/set
[Figure: sets 0 through 15, four entries per set]
Translating with the P6 Page Tables (case 1/1)
Case 1/1: page table and page both present.
MMU action: the MMU builds the physical address and fetches the data word.
OS action: none.
[Figure: PDBR selects the page directory; the PDE (p=1) selects the page table; the PTE (p=1) selects the data page in memory; PPN + PPO are formed from VPN1/VPN2/VPO]
Translating with the P6 Page Tables (case 1/0)
Case 1/0: page table present but page missing.
MMU action: page fault exception. The handler receives the following arguments:
the VA that caused the fault
whether the fault was caused by a non-present page or a page-level protection violation
read/write
user/supervisor
[Figure: the PDE (p=1) and page table are in memory, but the PTE has p=0 and the data page is on disk]
Translating with the P6 Page Tables (case 1/0, continued)
OS action:
Check for a legal virtual address
Read the PTE through the PDE
Find a free physical page (swapping out the current page if necessary)
Read the virtual page from disk into the physical page
Restart the faulting instruction by returning from the exception handler
[Figure: after handling, the PTE has p=1 and the data page is in memory]
Translating with the P6 Page Tables (case 0/1)
Case 0/1: page table missing but page present.
This introduces a consistency issue: potentially every page-out requires an update of the on-disk page table.
Linux disallows this: if a page table is swapped out, its data pages are swapped out too.
[Figure: the PDE has p=0 with the page table on disk, while the data page is in memory]
Translating with the P6 Page Tables (case 0/0)
Case 0/0: page table and page both missing.
MMU action: page fault exception.
[Figure: the PDE has p=0; both the page table and the data page are on disk]
Translating with the P6 Page Tables (case 0/0, continued)
OS action: swap in the page table, then restart the faulting instruction by returning from the handler.
Like case 1/0 from here on.
[Figure: after the first fault, the PDE has p=1 and the page table is in memory; the data page is still on disk (PTE p=0)]
P6 L1 Cache Access
[Figure: the same datapath, highlighting the L1 access: CT (20), CI (7), and CO (5) from the physical address index the L1 cache (128 sets, 4 lines/set); misses go to L2 and DRAM, and the 32-bit result returns to the CPU]
Speeding Up L1 Access
Observation: the bits that determine CI are identical in the virtual and physical address
So we can index into the cache while address translation is taking place
Then check with the CT from the physical address
"Virtually indexed, physically tagged"
The cache is carefully sized to make this possible
[Figure: the VPO passes through translation unchanged, so CI (and CO) are available before the PPN; only the tag check waits for address translation]
Linux Organizes VM as a Collection of "Areas"
Area: a contiguous chunk of (allocated) virtual memory whose pages are related
Examples: code segment, data segment, heap, shared library segment, etc.
Any existing virtual page is contained in some area. Any virtual page that is not part of some area does not exist and cannot be referenced!
Thus, the virtual address space can have gaps; the kernel does not keep track of virtual pages that do not exist.
task_struct: the kernel maintains a distinct task structure for each process
It contains all the information that the kernel needs to run the process: PID, pointer to the user stack, name of the executable object file, program counter, etc.
mm_struct: one of the entries in the task structure; it characterizes the current state of virtual memory
pgd: base of the page directory table
mmap: points to a list of vm_area_struct
Linux Organizes VM as a Collection of "Areas" (cont.)
[Figure: task_struct points to mm_struct (pgd, mmap), whose mmap field heads a vm_next-linked list of vm_area_struct records (vm_start, vm_end, vm_prot, vm_flags), one per area of the process virtual memory: text at 0x08048000, data at 0x0804a020, shared libraries at 0x40000000]
vm_prot: read/write permissions for this area
vm_flags: shared with other processes or private to this process
Linux Page Fault Handling
[Figure: the vm_area_struct list (text r/o, data r/w, shared libraries r/o) over the process virtual memory, with three faulting accesses: (1) a read outside any area, (2) a write to the read-only text area, (3) a read of a valid but not-yet-resident page]
Is the VA legal? i.e., is it in an area defined by a vm_area_struct?
If not, signal a segmentation violation (e.g., (1))
Is the operation legal? i.e., can the process read/write this area?
If not, signal a protection violation fault (e.g., (2))
If OK, handle the page fault (e.g., (3))
Memory Mapping
Linux (also UNIX) initializes the contents of a virtual memory area by associating it with an object on disk
Create a new vm_area_struct and page tables for the area
Areas can be mapped to (i.e., get their initial values from) one of two types of objects:
Regular file on disk (e.g., an executable object file): the file is divided into page-sized pieces, and the initial contents of each virtual page come from one piece. If the area is larger than the file section, the area is padded with zeros.
Anonymous file (e.g., bss): an area can be mapped to an anonymous file, created by the kernel. The initial contents of these pages are zeros; they are also called demand-zero pages.
Key point: no virtual pages are copied into physical memory until they are referenced!
Known as "demand paging"
Crucial for time and space efficiency
User-Level Memory Mapping
void *mmap(void *start, int len, int prot, int flags, int fd, int offset)
Map len bytes starting at offset offset of the file specified by file descriptor fd, preferably at address start (usually 0 for don't care)
prot: PROT_EXEC, PROT_READ, PROT_WRITE
flags: MAP_PRIVATE, MAP_SHARED, MAP_ANON
MAP_PRIVATE indicates a private copy-on-write object
MAP_SHARED indicates a shared object
MAP_ANON with a NULL fd indicates an anonymous file (demand-zero pages)
Returns a pointer to the mapped area
int munmap(void *start, int len)
Delete the area starting at virtual address start with length len
Shared Objects
Why shared objects? Many processes need to share identical read-only text areas. For example:
each tcsh process has the same text area
standard library functions such as printf
It would be extremely wasteful for each process to keep duplicate copies in physical memory
An object can be mapped as either a shared object or a private object
Shared object:
Any write to that area is visible to any other processes that have also mapped the shared object; the changes are also reflected in the original object on disk.
A virtual memory area into which a shared object is mapped is called a shared area.
Private object:
Any write to that area is not visible to other processes, and the changes are not reflected back to the object on disk.
Private objects are mapped into virtual memory using copy-on-write: only one copy of the private object is stored in physical memory, and the page table entries for the private area are flagged as read-only.
Any write to some page in the private area triggers a protection fault; the handler creates a new copy of the page in physical memory and then restores write permission to the page.
After the handler returns, the process proceeds normally.
Shared Object
[Figure: before and after mapping; a single shared object in physical memory is mapped into both process 1's and process 2's virtual memory]
Private Object
[Figure: a private copy-on-write object mapped into both processes; when process 2 writes to a private copy-on-write page, the fault handler copies just that page, after which each process has its own copy]
Exec() Revisited
To run a new program p in the current process using exec():
Free the vm_area_struct's and page tables for the old areas
Create new vm_area_struct's and page tables for the new areas: stack, bss, data, text, shared libs
Text and data are backed by the ELF executable object file; bss and stack are initialized to zero (demand-zero)
Set the PC to the entry point in .text; Linux will swap in code and data pages as needed
[Figure: process VM after exec: kernel code/data/stack above 0xc0000000 (same for each process, in physical memory) plus process-specific data structures (page tables, task and mm structs) in kernel VM; user VM with the stack (%esp), the memory-mapped region for shared libraries (libc.so .text/.data), the runtime heap (via malloc, bounded by brk), demand-zero .bss, and .data/.text backed by p]
Fork() Revisited
To create a new process using fork():
Make copies of the old process's mm_struct, vm_area_struct's, and page tables. At this point the two processes are sharing all of their pages.
How do we get separate spaces without copying all the virtual pages from one space to another? The "copy-on-write" technique:
Make the pages of writeable areas read-only, and flag the vm_area_struct's for these areas as private "copy-on-write"
Writes by either process to these pages will cause page faults; the fault handler recognizes copy-on-write, makes a copy of the page, and restores write permissions
Net result: copies are deferred until absolutely necessary (i.e., when one of the processes tries to modify a shared page)
Dynamic Memory Allocation
Heap: an area of demand-zero memory that begins immediately after the bss area
Allocator: maintains the heap as a collection of variously sized blocks; each block is a contiguous chunk of virtual memory that is either allocated or free
An explicit allocator requires the application to both allocate and free space
E.g., malloc and free in C
An implicit allocator requires the application to allocate, but not to free, space
The allocator needs to detect when an allocated block is no longer being used
Implicit allocators are also known as garbage collectors, and the process of automatically freeing unused blocks is known as garbage collection
E.g., garbage collection in Java, ML, or Lisp
Heap
[Figure: process address space, top to bottom: kernel virtual memory (invisible to user code); stack (%esp); memory-mapped region for shared libraries; run-time heap (via malloc), with the "brk" pointer marking the top of the heap; uninitialized data (.bss); initialized data (.data); program text (.text); address 0]
Malloc Package
#include <stdlib.h>
void *malloc(size_t size)
If successful: returns a pointer to a memory block of at least size bytes, (typically) aligned to an 8-byte boundary so that any kind of data object can be contained in the block. If size == 0, returns NULL.
If unsuccessful (e.g., the request is larger than available virtual memory): returns NULL (0) and sets errno.
Two other variations: calloc (initializes the allocated memory to zero) and realloc
Internally, allocators use the mmap and munmap functions, or the sbrk function, to obtain heap memory
void *realloc(void *p, size_t size)
Changes the size of the block pointed to by p and returns a pointer to the new block. The contents of the new block are unchanged up to the minimum of the old and new sizes.
void free(void *p)
Returns the block pointed to by p to the pool of available memory; p must come from a previous call to malloc or realloc.
Malloc Example

void foo(int n, int m) {
    int i, *p;

    /* allocate a block of n ints */
    if ((p = (int *) malloc(n * sizeof(int))) == NULL) {
        perror("malloc");
        exit(0);
    }
    for (i = 0; i < n; i++)
        p[i] = i;

    /* add m ints to the end of the p block */
    if ((p = (int *) realloc(p, (n + m) * sizeof(int))) == NULL) {
        perror("realloc");
        exit(0);
    }
    for (i = n; i < n + m; i++)
        p[i] = i;

    /* print new array */
    for (i = 0; i < n + m; i++)
        printf("%d\n", p[i]);

    free(p); /* return p to the available memory pool */
}
Allocation Examples
[Figure: heap snapshots after p1 = malloc(4), p2 = malloc(5), p3 = malloc(6), free(p2), and p4 = malloc(2); p4 reuses part of the freed p2 block]
Requirements (Explicit Allocators)
Applications:
Can issue an arbitrary sequence of allocation and free requests
Free requests must correspond to an allocated block
Allocators:
Can't control the number or the size of allocated blocks
Must respond immediately to all allocation requests, i.e., can't reorder or buffer requests
Must allocate blocks from free memory, i.e., can only place allocated blocks in free memory
Must align blocks so they satisfy all alignment requirements (8-byte alignment for GNU malloc (libc malloc) on Linux boxes)
Can only manipulate and modify free memory
Can't move allocated blocks once they are allocated, i.e., compaction is not allowed
Goals of Allocators
Maximize throughput
Throughput: number of completed requests per unit time
Example: 5,000 malloc calls and 5,000 free calls in 10 seconds gives a throughput of 1,000 operations/second
Maximize memory utilization
Need to minimize "fragmentation" (holes of unused space)
There is a tradeoff between throughput and memory utilization; need to balance these two goals
Good locality properties
"Similar" objects should be allocated close together in space
Internal Fragmentation
Fragmentation causes poor memory utilization. It comes in two forms: internal and external fragmentation.
Internal fragmentation: for some block, internal fragmentation is the difference between the block size and the payload size
Caused by the overhead of maintaining heap data structures and by padding for alignment purposes
Any virtual memory allocation policy using fixed-sized blocks, such as paging, can suffer from internal fragmentation
[Figure: a block = internal fragmentation + payload + internal fragmentation]
External Fragmentation
Occurs when there is enough aggregate heap memory, but no single free block is large enough
[Figure: after p1 = malloc(4), p2 = malloc(5), p3 = malloc(6), and free(p2), the request p4 = malloc(6) fails (oops!) because the free space is chopped into pieces smaller than 6]
External fragmentation depends on the pattern of future requests, and thus is difficult to measure.
Implementation Issues
Free block organization: how do we know the size of a free block, and how do we keep track of the free blocks?
Placement: how do we choose an appropriate free block in which to place a newly allocated block?
Splitting: what do we do with the extra space after the placement?
Coalescing: what do we do with small blocks that have just been freed?
How do we know the size of a block?
Standard method: keep the length of the block in the word preceding the block
This word is often called the header field or header
Requires an extra word for every allocated block
Format of a simple heap block:
[Figure: a one-word header holding the block size in bits 31-3 with bit 0 = a (a = 1: allocated, a = 0: free; bits 2-1 are 0), followed by the payload (allocated blocks only) and optional padding. The block size includes the header, payload, and any padding; malloc returns a pointer to the beginning of the payload]
Example: p0 = malloc(4) writes a header with block size 5 (4 payload words plus the header) and returns p0 pointing at the data; free(p0) clears the allocated bit
Keeping Track of Free Blocks
Method 1: implicit list using lengths, linking all blocks
Method 2: explicit list among the free blocks, using pointers within the free blocks
Method 3: segregated free lists, with different free lists for different size classes
[Figure: a heap of blocks with sizes 5, 4, 2, 6; the implicit list threads through all of them by length, while the explicit list links only the free blocks]
Placement Policy
First fit: search the list from the beginning and choose the first free block that fits
Can take linear time in the total number of blocks (allocated and free)
(+) Tends to retain large free blocks at the end
(-) Leaves small free blocks at the beginning
Next fit: like first fit, but search the list starting from the end of the previous search
(+) Runs faster than first fit
(-) Worse memory utilization than first fit
Best fit: search the list and choose the free block with the closest size that fits
(+) Keeps fragments small; better memory utilization than the other two
(-) Will typically run slower; requires an exhaustive search of the heap
Splitting
Allocating in a free block may require splitting: since the allocated space might be smaller than the free space, we might want to split the block
[Figure: addblock(p, 2) on a free block of size 6 at p splits it into an allocated block of size 4 and a remaining free block of size 2]
Coalescing
When the allocator frees a block, there might be other free blocks that are adjacent. Such adjacent free blocks cause false fragmentation: there is enough free space, but it is chopped up into small, unusable pieces.
Need to coalesce the next and/or previous block if they are free
Coalescing with the next block is straightforward given the header
But how do we coalesce with the previous block?
[Figure: free(p) coalesces p's block with the free block that follows it, e.g., merging blocks of size 4 and 2 into one free block of size 6]
Bidirectional Coalescing: Boundary Tags [Knuth73]
Replicate the size/allocated word (called the footer) at the bottom of each block
Allows us to traverse the "list" backwards, but requires extra space
Important and general technique! Allows constant-time coalescing
[Figure: format of allocated and free blocks: a 1-word header (size, a), payload and padding, and a 1-word boundary-tag footer repeating (size, a); a = 1: allocated block, a = 0: free block; size: total block size; payload: application data (allocated blocks only). Example heap: blocks 4, 4, 6, 4, each with matching header and footer]
Constant Time Coalescing
[Figure: the four cases when freeing a block, by the allocation status of its neighbors: Case 1 = previous and next both allocated; Case 2 = previous allocated, next free; Case 3 = previous free, next allocated; Case 4 = both free]
Constant Time Coalescing (Case 1)
[Figure: before: m1 (allocated), n (the block being freed), m2 (allocated); after: only n's header and footer change from (n, 1) to (n, 0)]
Constant Time Coalescing (Case 2)
[Figure: before: m1 (allocated), n being freed, m2 (free); after: m1 is unchanged, and n and m2 merge into one free block of size n+m2, with the header at n's old header and the footer at m2's old footer]
Constant Time Coalescing (Case 3)
[Figure: before: m1 (free), n being freed, m2 (allocated); after: m1 and n merge into one free block of size n+m1, with the header at m1's old header and the footer at n's old footer]
Constant Time Coalescing (Case 4)
[Figure: before: m1 (free), n being freed, m2 (free); after: all three merge into one free block of size n+m1+m2]
Implicit Lists: Summary
Implementation is very simple
Allocate takes linear time in the worst case
Free takes constant time in the worst case, even with coalescing
Memory usage will depend on the placement policy (first fit, next fit, or best fit)
Not used in practice for malloc/free because of the linear-time allocate; used for special-purpose applications where the total number of blocks is known beforehand to be small
However, the concepts of splitting and boundary-tag coalescing are general to all allocators
Keeping Track of Free Blocks (recap)
Method 1: implicit list using lengths, linking all blocks
Method 2: explicit list among the free blocks, using pointers within the free blocks
Method 3: segregated free lists, with different free lists for different size classes
Explicit Free Lists
Use the data (payload) space of free blocks for the pointers
Typically doubly linked
Still need boundary tags for coalescing
[Figure: explicit free list linking free blocks A, B, and C via forward and back links.]
Format of Doubly-Linked Heap Blocks
[Figure: Allocated block - header (block size, a/f), payload, optional padding, footer (block size, a/f). Free block - header (block size, a/f), pred (predecessor) pointer, succ (successor) pointer, old payload, optional padding, footer (block size, a/f).]
Freeing With Explicit Free Lists
Insertion policy: Where in the free list do you put a newly freed block?
LIFO (last-in-first-out) policy
Insert the freed block at the beginning of the free list
(+) Simple, and freeing a block can be performed in constant time. If boundary tags are used, coalescing can also be performed in constant time.
Address-ordered policy
Insert freed blocks so that free-list blocks are always in address order
i.e. addr(pred) < addr(curr) < addr(succ)
(-) Freeing a block requires linear-time search
(+) Studies suggest address-ordered first fit enjoys better memory utilization than LIFO-ordered first fit.
Explicit List Summary
Comparison to implicit list:
Allocation takes time linear in the number of free blocks instead of the total number of blocks
Much faster allocation when most of the memory is full
Slightly more complicated allocate and free, since blocks need to be spliced in and out of the list
Extra space for the links (2 extra words needed per block). This results in a larger minimum block size, and can potentially increase the degree of internal fragmentation
The main use of linked lists is in conjunction with segregated free lists
Keep multiple linked lists of different size classes, or possibly for different types of objects
Keeping Track of Free Blocks
Method 1: Implicit list using lengths -- links all blocks
Method 2: Explicit list among the free blocks using pointers within the free blocks
Method 3: Segregated free list
Different free lists for different size classes
Can be used to reduce the allocation time compared to a single linked-list organization
[Figure: example heap with blocks of sizes 5, 4, 2, and 6.]
Segregated Storage
Partition the set of all free blocks into equivalence classes called size classes
The allocator maintains an array of free lists, with one free list per size class, ordered by increasing size
Example size classes: {1-2}, {3}, {4}, {5-8}, {9-16}, ...
Often have a separate size class for every small size (2, 3, 4, ...)
Classes with larger sizes typically have a size class for each power of 2
Variations of segregated storage
They differ in how they define size classes, when they perform coalescing, when they request additional heap memory from the OS, whether they allow splitting, and so on.
Examples: simple segregated storage, segregated fits
Simple Segregated Storage
Separate heap and free list for each size class
The free list for each size class contains same-sized blocks of the largest element size
For example, the free list for size class {17-32} consists entirely of blocks of size 32
To allocate a block of size n:
If the free list for size n is not empty, allocate the first block in its entirety
If the free list is empty, get a new page from the OS, create a new free list from all the blocks in the page, and then allocate the first block on the list
To free a block:
Simply insert the free block at the front of the appropriate free list
(+) Both allocating and freeing blocks are fast constant-time operations.
(+) Little per-block memory overhead: no splitting and no coalescing
(-) Susceptible to internal and external fragmentation
Internal fragmentation: since free blocks are never split
External fragmentation: since free blocks are never coalesced
Segregated Fits
Array of free lists, one for each size class
The free list for each size class contains potentially different-sized blocks
To allocate a block of size n:
Do a first-fit search of the appropriate free list. If an appropriate block is found:
Split the block (optionally) and place the fragment on the appropriate list
If no block is found, try the next larger class and repeat until a block is found. If none of the free lists yields a block that fits, request additional heap memory from the OS, allocate the block out of this new heap memory, and place the remainder in the largest size class
To free a block:
Coalesce and place on the appropriate list
(+) Fast
Searches are limited to part of the heap rather than the entire heap
However, coalescing can increase search times
(+) Good memory utilization
A simple first-fit search approximates a best-fit search of the entire heap
A popular choice for production-quality allocators such as GNU malloc
Garbage Collection
Garbage collector: a dynamic storage allocator that automatically frees allocated blocks that are no longer used
Implicit memory management: an application never has to free
Common in functional languages, scripting languages, and modern object-oriented languages:
Lisp, ML, Java, Perl, Mathematica, etc.
Variants (conservative garbage collectors) exist for C and C++
Cannot collect all garbage

void foo() {
    int *p = malloc(128);
    return; /* p block is now garbage */
}
Garbage Collection
How does the memory manager know when memory can be freed?
In general we cannot know what is going to be used in the future, since it depends on conditionals
But we can tell that certain blocks cannot be used if there are no pointers to them
Need to make certain assumptions about pointers
The memory manager needs to distinguish pointers from non-pointers
Garbage Collection
Garbage collectors view memory as a reachability graph and periodically reclaim the unreachable nodes
Classical GC Algorithms
Mark-and-sweep collection (McCarthy, 1960)
Does not move blocks (unless you also “compact”)
Reference counting (Collins, 1960)Does not move blocks (not discussed)
Copying collection (Minsky, 1963)Moves blocks (not discussed)
Memory as a Graph
Reachability graph: we view memory as a directed graph
Each block is a node in the graph
Each pointer is an edge in the graph
Locations not in the heap that contain pointers into the heap are called root nodes
e.g. registers, locations on the stack, global variables
[Figure: root nodes pointing into heap nodes; reachable nodes vs. not-reachable (garbage) nodes.]
A node (block) is reachable if there is a path from some root to that node.
Non-reachable nodes are garbage (never needed by the application)
Mark and Sweep Garbage Collectors
A mark-and-sweep garbage collector consists of a mark phase followed by a sweep phase
Uses an extra mark bit in the header of each block
When out of space:
Mark: start at the roots and set the mark bit on all reachable memory blocks
Sweep: scan all blocks and free the blocks that are not marked
[Figure: heap before mark; after mark (mark bits set on all reachable blocks); after sweep (unmarked blocks freed).]
Mark and Sweep (cont.)

ptr mark(ptr p) {
    if (!is_ptr(p)) return;          // do nothing if not a pointer
    if (markBitSet(p)) return;       // check if already marked
    setMarkBit(p);                   // set the mark bit
    for (i = 0; i < length(p); i++)  // mark all children
        mark(p[i]);
    return;
}
Mark using depth-first traversal of the memory graph
Sweep using lengths to find next block
ptr sweep(ptr p, ptr end) {
    while (p < end) {
        if (markBitSet(p))
            clearMarkBit(p);
        else if (allocateBitSet(p))
            free(p);
        p += length(p);
    }
}
Functions:
is_ptr(p): if p is a pointer to an allocated block, return a pointer b to the beginning of that block; return NULL otherwise
markBitSet(b): return true if block b is already marked
allocateBitSet(b): return true if block b is allocated
length(b): return the length of block b
Common Memory-Related Bugs in C
Dereferencing bad pointers
Reading uninitialized memory
Stack buffer overflow
Assuming pointers and the objects they point to are the same size
Making off-by-one errors
Referencing a pointer instead of the object
Misunderstanding pointer arithmetic
Referencing nonexistent variables
Freeing blocks multiple times
Referencing freed blocks
Memory leaks
Dereferencing Bad Pointers
Bad pointers
There are large holes in the virtual address space of a process that are not mapped to any meaningful data.
If we attempt to dereference a pointer into one of these holes, the process will cause a segmentation exception
The classic scanf bug
Read an integer from stdin into a variable
In the best case, the program terminates immediately with an exception
In the worst case, the content of val corresponds to some valid read/write area, and we overwrite memory, usually with disastrous consequences much later

Buggy:   scanf("%d", val);
Correct: scanf("%d", &val);
Reading Uninitialized Memory
Assuming that heap data is initialized to zero
While .bss sections are always initialized to zeros by the loader, this is not true for heap memory.
Should use calloc instead of malloc

/* return y = Ax */
int *matvec(int **A, int *x) {
    int *y = malloc(N * sizeof(int));   /* bug: y is never initialized */
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
    return y;
}
Stack Overflow
Buffer overflow
A program can run into a buffer overflow bug if it writes to a target buffer on the stack without examining the size of the input string
The gets function copies an arbitrary-length string into the buffer
To fix this, use fgets, which limits the size of the input string
Basis for classic buffer overflow attacks
1988 Internet worm
Modern attacks on Web servers
AOL/Microsoft IM war

void bufoverflow() {
    char buf[64];
    gets(buf);   /* no bound check: input longer than 64 bytes overruns buf */
    return;
}
Pointers and the Objects are Different in Size
Allocating the (possibly) wrong-sized object
Create an array of n pointers, each of which points to an array of m ints
If we run this code on an Alpha processor, where a pointer is larger than an int, the for loop will write past the end of the p array
Should use sizeof(int *) for the first malloc

int **p;
p = malloc(N * sizeof(int));   /* bug: should be sizeof(int *) */
for (i = 0; i < N; i++) {
    p[i] = malloc(M * sizeof(int));
}
Off-by-One Errors
Off-by-one error
Tries to initialize N+1 elements instead of N

int **p;
p = malloc(N * sizeof(int *));
for (i = 0; i <= N; i++) {     /* bug: <= writes one element past the end */
    p[i] = malloc(M * sizeof(int));
}
Pointer vs Object
Referencing a pointer instead of the object it points to
The two unary operators -- and * have the same precedence and associate right-to-left, so *size-- decrements the pointer itself and then dereferences it
The correct form is (*size)--

int *BinheapDelete(int **binheap, int *size) {
    int *packet;
    packet = binheap[0];
    binheap[0] = binheap[*size - 1];
    *size--;                     /* bug: decrements the pointer size, not *size */
    Heapify(binheap, *size, 0);
    return packet;
}
Pointer Arithmetic
Misunderstanding pointer arithmetic
p += sizeof(int) advances p by sizeof(int) ints, so the loop incorrectly scans every fourth integer in the array
The correct form is p++

int *search(int *p, int val) {
    while (*p && *p != val)
        p += sizeof(int);        /* bug: advances 4 ints at a time, not 1 */
    return p;
}
Referencing Nonexistent Variables
Forgetting that local variables disappear when a function returns
Later, if the program assigns some value to the pointer, it might modify an entry in another function's stack frame

int *foo() {
    int val;
    return &val;   /* bug: val's storage disappears when foo returns */
}
Referencing Freed Blocks
Evil! Referencing data in heap blocks that have already been freed!

x = malloc(N * sizeof(int));
<manipulate x>
free(x);
...
y = malloc(M * sizeof(int));
for (i = 0; i < M; i++)
    y[i] = x[i]++;   /* bug: x has been freed; its block may now belong to y */
Failing to Free Blocks (Memory Leaks)
Slow, long-term killer!
Memory leaks are particularly serious for programs such as daemons and servers, which by definition never terminate.

foo() {
    int *x = malloc(N * sizeof(int));
    ...
    return;   /* bug: x goes out of scope and the block is never freed */
}
Homework 7
Read Chapter 8 from the Computer Systems textbook
Exercises
9.11
9.13
9.15
9.17
9.19