7. Large and Fast: Exploiting Memory Hierarchy
The Big Picture: Where are We Now?

The Five Classic Components of a Computer

[Figure: the five classic components: input, output, memory, and the processor (datapath + control).]
Technology Trends
         Capacity         Speed (latency)
Logic:   2x in 3 years    2x in 3 years
DRAM:    4x in 3 years    2x in 10 years
Disk:    4x in 3 years    2x in 10 years

DRAM generations:

Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns

Capacity: 1000:1!   Cycle time: 2:1!
Who Cares About the Memory Hierarchy?

[Figure: the processor-DRAM memory gap (latency). Performance (log scale, 1 to 1000) vs. time, 1980 to 2000: µProc performance grows 60%/yr (2x per 1.5 years), while DRAM performance grows 9%/yr (2x per 10 years). The processor-memory performance gap grows about 50% per year.]
The Goal: illusion of large, fast, cheap memory

Fact: large memories are slow; fast memories are small.

How do we create a memory that is large, cheap, and fast (most of the time)?
• Hierarchy
Exploiting Memory Hierarchy

Users want large and fast memories! As of 2004:
• SRAM access times are 0.5-5 ns, at a cost of $4,000 to $10,000 per GB
• DRAM access times are 50-70 ns, at a cost of $100 to $200 per GB
• Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB

Try and give it to them anyway
• build a memory hierarchy

[Figure: the CPU sits above Level 1, Level 2, ..., Level n of the hierarchy; distance from the CPU in access time increases going down the levels, and so does the size of the memory at each level.]
Memory Hierarchy of a Modern Computer System
By taking advantage of the principle of locality:
• present the user with as much memory as is available in the cheapest technology
• provide access at the speed offered by the fastest technology

[Figure: processor (datapath, control, registers, on-chip cache) → second-level cache (SRAM) → main memory (DRAM) → secondary storage (disk) → tertiary storage (tape). Speed (ns): 1 → 10 → 100 → 10,000,000 (10 ms) → 10,000,000,000 (10 sec). Size (bytes): 100 → K → M → G → T.]
Memory Hierarchy: Why Does it Work? Locality!
• Spatial Locality (locality in space):
  => move blocks consisting of contiguous words to the upper levels
• Temporal Locality (locality in time):
  => keep the most recently accessed data items closer to the processor

[Figure: blocks Blk X and Blk Y moving between lower-level and upper-level memory, to and from the processor; probability of reference plotted over the address space 0 to 2^n - 1.]
Memory Hierarchy: Terminology
Hit: the data appears in some block in the upper level (example: Block X)
• Hit Rate: the fraction of memory accesses found in the upper level
• Hit Time: time to access the upper level, which consists of
  RAM access time + time to determine hit/miss

Miss: the data needs to be retrieved from a block in the lower level (Block Y)
• Miss Rate = 1 - (Hit Rate)
• Miss Penalty: time to replace a block in the upper level +
  time to deliver the block to the processor

Hit Time << Miss Penalty

[Figure: Blk X in upper-level memory and Blk Y in lower-level memory, with data moving to and from the processor.]
Memory Hierarchy Technology
Random access:
• "random" is good: access time is the same for all locations
• volatile memory
  • DRAM: Dynamic Random Access Memory
    – high density, low power, cheap, slow
    – dynamic: needs to be "refreshed" regularly
  • SRAM: Static Random Access Memory
    – low density, high power, expensive, fast
    – static: content lasts "forever" (until power is lost)
• non-volatile memory
  • ROM (Mask ROM, PROM, EPROM, E2PROM)
  • Flash memory, FRAM, MRAM

"Not-so-random" access technology:
• access time varies from location to location and from time to time
• examples: disk, CD-ROM

Sequential access technology:
• access time linear in location (e.g., tape)
Main Memory Background
Main memory is DRAM: Dynamic Random Access Memory
• 1 transistor and 1 capacitor (~2 transistors) per bit
• dynamic, since it needs to be refreshed periodically (every ~8 ms)
• addresses divided into 2 halves (memory as a 2D matrix):
  • row address, then column address
  • number of address pins cut in half
  • called "address multiplexing"

Cache uses SRAM: Static Random Access Memory
• no refresh
• 6 transistors per bit
• no address multiplexing
• SRAM is faster and more expensive than DRAM
  • size: SRAM/DRAM = 4-8
  • cost: SRAM/DRAM = 20-25 (1997)
  • access time: DRAM/SRAM = 5-12
Cache
Motivation
• the slow speed of DRAM main memory limits processor performance
• a smaller SRAM memory matches processor speed

Make the average access time near SRAM's
• if the large majority of memory references hit the cache

Reduce the bandwidth required of the large memory

[Figure: Processor ↔ Cache ↔ DRAM (the memory system).]
Cache Organization
The cache duplicates part of main memory
• we specify an address in main memory, to search whether a copy of that memory location resides in the cache
• need a mapping between main memory locations and cache locations

Direct-mapped cache
• each memory address maps to a UNIQUE cache location, determined by a simple modulo function
• the simplest implementation, because there is only one cache location to search
Memory Reference Sequence in Direct-Mapped Cache
Direct-Mapped Cache Lookup

For a cache with a block size of 4 bytes and a total capacity of 4 KB (1024 blocks):
• the 2 lowest address bits specify the byte within a block
• the next 10 address bits specify the block's index within the cache
• the 20 highest address bits are the unique tag for this memory block
• the valid bit specifies whether the block is an accurate copy of memory
(a sketch of this address split follows the figure below)
[Figure: address bit positions 31-12 form the 20-bit tag, bits 11-2 the 10-bit index, and bits 1-0 the byte offset. The index selects one of the 1024 (valid, tag, data) entries (0 to 1023); a comparator checks the stored 20-bit tag against the address tag, Hit = valid AND tags equal, and Data is the selected 32-bit word.]
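To make the split concrete, here is a minimal Python sketch (not from the slides) of the tag/index/offset decomposition for this 4 KB direct-mapped cache; the example address is the one used in the cache entry example on the next slide.

```python
# Minimal sketch: split a 32-bit address for a direct-mapped cache
# with 4-byte blocks and 1024 blocks (4 KB of data), as described above.
OFFSET_BITS = 2     # log2(4 bytes per block)
INDEX_BITS = 10     # log2(1024 blocks)

def decompose(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# The address from the entry example on the next slide:
tag, index, offset = decompose(0b10110110010110010101_1010111110_00)
print(f"tag={tag:020b} index={index:010b} offset={offset:02b}")
# -> tag=10110110010110010101 index=1010111110 offset=00
```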
Cache Entry Example

Memory addresses (tag 20 bits | index 10 bits | byte offset 2 bits); 1 byte = 8 bits, 1 word = 4 bytes = 32 bits:

1011 0110 0101 1001 0101 | 1010 1111 01 | 00
1011 0110 0101 1001 0101 | 1010 1111 01 | 01
1011 0110 0101 1001 0101 | 1010 1111 01 | 10
1011 0110 0101 1001 0101 | 1010 1111 01 | 11
1011 0110 0101 1001 0101 | 1010 1111 10 | 00   <- in cache
1011 0110 0101 1001 0101 | 1010 1111 10 | 01   <- in cache
1011 0110 0101 1001 0101 | 1010 1111 10 | 10   <- in cache
1011 0110 0101 1001 0101 | 1010 1111 10 | 11   <- in cache
1011 0110 0101 1001 0101 | 1010 1111 11 | 00
. . .

Cache contents:

Index          V   Tag                        Data
0000 0000 00   x   xxx                        xxx
0000 0000 01   x   xxx                        xxx
0000 0000 10   x   xxx                        xxx
. . .          x   xxx                        xxx
1010 1111 10   1   1011 0110 0101 1001 0101   1111 0000 1100 1100 1010 1010 0101 0101
1010 1111 11   x   xxx                        xxx
. . .          x   xxx                        xxx
1111 1111 11   x   xxx                        xxx
순천향대학교 정보기술공학부 이 상 정 17
Computer Architecture
7. Large and Fast: Exploiting Memory Hierarchy
Bits in a Cache
Total bits required for a direct-mapped cache with 4 KB of data and 1-word blocks, assuming a 32-bit address:
• block size = 1 word = 4 bytes
• # of blocks = 4 KB / 4 bytes = 1K blocks
• each block holds 4 bytes of data + a tag + a valid bit
• tag size = 32 bits (data address) - 10 bits (block index) - 2 bits (byte in block) = 20 bits
• total bits in the cache = 1K x (4 bytes + 20 bits + 1 bit) = 1K x 53 bits = 53 Kbits (= 6.625 KB); see the sketch below
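A minimal Python sketch reproducing the bit count above:

```python
# Storage for a 4 KB direct-mapped cache with 1-word (4-byte) blocks.
blocks = 4 * 1024 // 4            # 4 KB of data / 4 bytes per block = 1024
bits_per_block = 4 * 8 + 20 + 1   # 32 data bits + 20-bit tag + valid = 53
total = blocks * bits_per_block
print(total, total / 1024, total / 8 / 1024)
# -> 54272 bits, 53.0 Kbits, 6.625 KB
```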
Cache Blocks
Cache block (sometimes called a cache line)
• a cache entry that has its own cache tag
• the previous example uses 4-byte blocks

Larger cache blocks take advantage of spatial locality
• example: a 64 KB cache using 4-word (16-byte) blocks
[Figure: address bits 31-16 form the 16-bit tag, bits 15-4 the 12-bit index (4K entries), bits 3-2 the 2-bit block offset, and bits 1-0 the byte offset. Each entry holds a valid bit, a 16-bit tag, and 128 bits of data; the block offset drives a 4-to-1 mux that selects one of the four 32-bit words, and Hit = valid AND tags equal.]
Block Size Tradeoff

In general, a larger block size takes advantage of spatial locality, BUT:
• a larger block size means a larger miss penalty:
  • it takes longer to fill up the block
• if the block size is too big relative to the cache size, the miss rate will go up:
  • too few cache blocks

In general, Average Access Time
• = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate (see the sketch after the figure)
[Figure: three curves against block size. Miss penalty grows steadily with block size; miss rate first falls (exploits spatial locality) and then rises when too few blocks remain (compromises temporal locality); average access time therefore dips and then climbs with the increased miss penalty and miss rate.]
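A minimal sketch of the average-access-time formula above; the hit time, miss rate, and miss penalty values are illustrative assumptions, not numbers from the slides.

```python
# Average access time = hit time weighted by hits + penalty weighted by misses.
def avg_access_time(hit_time, miss_rate, miss_penalty):
    return hit_time * (1 - miss_rate) + miss_penalty * miss_rate

# Illustrative: 1-cycle hit, 5% miss rate, 40-cycle miss penalty.
print(avg_access_time(hit_time=1, miss_rate=0.05, miss_penalty=40))  # -> 2.95
```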
Block Size Tradeoff (cont.)
Data from simulating a direct-mapped cache. Note the miss rate trends as:
• capacity increases for a fixed block size
• block size increases for a fixed capacity

[Figure: miss rate (0% to 40%) vs. block size (16 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB.]
Hits vs. Misses

Read hits
• this is what we want!

Read misses
• stall the CPU, fetch the block from memory, deliver it to the cache, restart

Write hits:
• can replace the data in cache and memory (write-through)
• write the data only into the cache, and write it back to memory later (write-back)

Write misses:
• read the entire block into the cache, then write the word
Hardware Issues

Make reading multiple words easier by using banks of memory

It can get a lot more complicated...

[Figure: three memory organizations, each connecting CPU, cache, bus, and memory:
a. one-word-wide memory organization
b. wide memory organization (with a multiplexor between the cache and the CPU)
c. interleaved memory organization (memory banks 0 to 3 on a one-word-wide bus)]
Synchronous DRAM (SDRAM) Timing
Increasing Bandwidth - Interleaving
Access pattern without interleaving (CPU and a single memory):
• start the access for D1; wait out the full memory cycle until D1 is available; only then start the access for D2

Access pattern with 4-way interleaving (CPU and memory banks 0 to 3):
• access bank 0, then bank 1, bank 2, and bank 3 on successive cycles
• by the time bank 3 has been started, bank 0 can be accessed again
• the bank cycle time overlaps across banks, so only the access time of each word remains exposed
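A minimal timing sketch of the two patterns above; the cycle time, access time, and one-address-per-cycle bus are illustrative assumptions, not parameters from the slides.

```python
# Illustrative timings for reading 4 consecutive words.
CYCLE = 8    # cycles before the same bank can start another access
ACCESS = 6   # cycles from starting an access until the word is available
WORDS = 4

# Without interleaving every word goes to the same memory, so each
# access must wait out the full cycle time of the previous one.
serial = WORDS * CYCLE               # -> 32 cycles

# With 4-way interleaving consecutive words sit in different banks, so
# accesses start one cycle apart and their latencies overlap.
interleaved = (WORDS - 1) + ACCESS   # -> 9 cycles
print(serial, interleaved)
```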
Split Cache

Use split caches, because there is more spatial locality in code:
• two independent caches operating in parallel
• an instruction cache and a data cache
• used to increase cache bandwidth
  • i.e., the data rate between cache and processor
• the miss rate is slightly higher than that of a combined cache
  • e.g., for a total cache size of 32 KB:
    – split cache effective miss rate: 3.24%
    – combined cache miss rate: 3.18%
• the increased cache bandwidth easily overcomes the disadvantage of the slightly increased miss rate
• free from cache contention in instruction pipelining
More about Cache Write
A cache read is much easier to handle than a cache write
• a read does not change the value of the data

Cache write
• need to keep the data in the cache and memory consistent

Two options
• Write-through: write to both the cache and memory
  • control is simple
  • isn't memory too slow for this?
• Write-back: write to the cache only
  • write the cache block to memory when that cache block is being replaced on a cache miss
  • reduces the memory bandwidth required
  • keep a bit (called the dirty bit) per cache block to track whether the block has been modified
    – only modified blocks need to be written back
  • control can be complex
Write Buffer for Write Through
A write buffer is needed between the cache and memory
• processor: writes data into the cache and the write buffer
• memory controller: writes the contents of the buffer to memory

The write buffer is just a FIFO (first-in first-out) queue:
• typical number of entries: 4-8
• works fine if: "store" frequency << 1 / (DRAM write cycle)
• on write-buffer saturation, stall the processor to allow memory to catch up

[Figure: the processor and cache feed a write buffer, which drains to DRAM.]
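A minimal Python sketch of such a FIFO write buffer (an illustration under the assumptions above, not the slides' hardware design):

```python
from collections import deque

class WriteBuffer:
    """FIFO queue of pending stores between the cache and DRAM."""

    def __init__(self, entries=4):            # typical depth: 4-8 entries
        self.queue = deque()
        self.entries = entries

    def store(self, addr, data):               # processor side
        if len(self.queue) >= self.entries:
            raise RuntimeError("buffer saturated: stall the processor")
        self.queue.append((addr, data))

    def drain_one(self, dram):                 # memory controller side
        if self.queue:
            addr, data = self.queue.popleft()  # oldest store goes to DRAM
            dram[addr] = data
```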
Cache Performance
We can safely assume the cache access time (hit time) is a single clock cycle:
• CPU time with a perfect cache = CPU cycles x Clock cycle time
• CPU time with a real-world cache = (CPU cycles + Memory stall cycles) x Clock cycle time

The memory system affects
• memory stall cycles
  • cache miss stalls + write buffer stalls (in the case of a write-through cache)
• clock cycle time
  • since cache access often determines the clock speed of a processor

Memory stall cycles = Read stall cycles + Write stall cycles
• Read stall cycles = Read miss rate x #Reads x Read miss penalty
• for a write-back cache
  • Write stall cycles = Write miss rate x #Writes x Write miss penalty
  • can combine the read and write components:
    – Memory stall cycles = Miss rate x #Memory accesses x Miss penalty
• for write-through caches
  • add write buffer stalls
Cache Performance Example
Assume:
• miss rate for instructions = 5%
• miss rate for data = 8%
• data references per instruction = 0.4
• CPI with a perfect cache = 1.4
• miss penalty = 20 cycles

Find the performance relative to a perfect cache with no misses (same clock rate):
• misses/instruction = 0.05 (instruction misses) + 0.4 x 0.08 (data misses) = 0.082
• miss stall CPI = 0.082 x 20 = 1.64
• performance is the ratio of CPIs (instruction count and clock rate are the same)

  Performance no misses     CPI with misses   1.4 + 1.64
  ----------------------- = --------------- = ---------- ≈ 2.17
  Performance with misses    CPI no misses       1.4
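The same calculation as a minimal Python sketch, reproducing the numbers above:

```python
miss_rate_i = 0.05      # instruction miss rate
miss_rate_d = 0.08      # data miss rate
refs_per_inst = 0.4     # data references per instruction
cpi_perfect = 1.4
miss_penalty = 20       # cycles

misses_per_inst = miss_rate_i + refs_per_inst * miss_rate_d  # 0.082
stall_cpi = misses_per_inst * miss_penalty                   # 1.64
print((cpi_perfect + stall_cpi) / cpi_perfect)               # -> ~2.17
```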
Set-Associative Caches
Improve the cache hit ratio by allowing a memory location to be placed in more than one cache block
• an N-way associative cache allows placement in any block of a set with N elements
  • N is the set size
  • number of blocks = N x number of sets
  • the set number is selected by a simple modulo function of the address bits (the set number is also called the index); see the sketch below
• fully-associative cache
  • there is a single set, allowing a memory location to be placed in any cache block
• a direct-mapped organization can be considered a degenerate set-associative cache with set size = 1

For a fixed cache capacity, a larger set size leads to higher hit rates
• because more combinations of cache blocks can be present in the cache at the same time
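A minimal sketch (not from the slides) of that modulo placement rule; storing a set's blocks contiguously is an assumption for illustration:

```python
# Which cache blocks may hold a given block address, for set size n_way.
def candidate_blocks(block_addr, n_way, n_sets):
    set_index = block_addr % n_sets          # the simple modulo function
    return [set_index * n_way + way for way in range(n_way)]

# A 64-block cache organized three ways:
print(candidate_blocks(77, n_way=1, n_sets=64))   # direct mapped -> [13]
print(candidate_blocks(77, n_way=4, n_sets=16))   # 4-way -> [52, 53, 54, 55]
print(candidate_blocks(77, n_way=64, n_sets=1))   # fully associative -> all 64
```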
Set-Associative Cache Examples
[Figure: four organizations of an eight-block cache.
• One-way set associative (direct mapped): blocks 0-7, one (tag, data) pair each.
• Two-way set associative: sets 0-3, two (tag, data) pairs per set.
• Four-way set associative: sets 0-1, four (tag, data) pairs per set.
• Eight-way set associative (fully associative): one set of eight (tag, data) pairs.]
Implementation of a 4-Way Set-Associative Cache

[Figure: address bits 31-10 form the 22-bit tag and bits 9-2 the 8-bit index, which selects one of 256 sets (0 to 255). The four ways' (valid, tag, data) entries are read in parallel; four comparators match the 22-bit tag, a 4-to-1 multiplexor selects the hitting way's 32-bit data, and Hit is asserted when a valid way matches.]
Miss Rate vs. Set Size
Data is for gcc (the GNU C compiler) and spice on a DECStation 3100 with separate 64 KB instruction/data caches using 16-byte blocks.

In general, the benefit of increasing associativity beyond 2-4 is minimal: it has little further impact on the miss ratio.

Program   Associativity   Instruction Miss Rate   Data Miss Rate
gcc       1               2.0%                    1.7%
gcc       2               1.6%                    1.4%
gcc       4               1.6%                    1.4%
spice     1               0.3%                    0.6%
spice     2               0.3%                    0.6%
spice     4               0.3%                    0.6%
Miss Rate vs. Set Size
[Figure: miss rate (0% to 15%) vs. associativity (one-way, two-way, four-way, eight-way) for cache sizes of 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, and 128 KB.]

Data for SPEC92 on a combined instruction/data cache with 32-byte blocks.
Disadvantage of Set Associative Cache
N-way set-associative cache versus direct-mapped cache:
• N comparators vs. 1
• extra MUX delay for the data
• the data comes AFTER the hit/miss decision and set selection

In a direct-mapped cache, the cache block is available BEFORE hit/miss:
• possible to assume a hit and continue; recover later on a miss

Example: a 2-way set-associative cache

[Figure: the cache index selects one set; both ways' (valid, cache tag, cache data) entries are read in parallel, the address tag is compared against each stored tag, the OR of the two compares yields Hit, and select signals Sel1/Sel0 drive a mux that delivers the hitting way's cache block.]
Cache Block Replacement Policies
Direct-mapped cache
• each memory location maps to a single cache location
• no replacement policy is necessary
  • a new item replaces the previous item in that cache location

Set-associative caches
• N-way set-associative cache
  • each memory location has a choice of N cache locations
• cache miss handling for set-associative caches
  • bring in the new block from memory
  • identify a block in the selected set to replace, in case the set is full
  • need to decide which block to replace
Cache Block Replacement Policies (cont.)
Random replacement
• hardware randomly selects a cache block to replace

Optimal replacement
• replace the block that will be used furthest in the future

Least Recently Used (LRU)
• hardware keeps track of the access history
  • replace the entry that has not been used for the longest time
• simple for 2-way associative
  • a single bit in each set indicates which block was more recently used
• implementing LRU gets harder for higher degrees of associativity

In practice the replacement policy has a minor impact on the miss rate
• especially for high associativity
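To make LRU concrete, here is a minimal Python sketch of one set's bookkeeping (an illustration, not a hardware design; real hardware tracks recency with per-set state bits):

```python
from collections import OrderedDict

class LRUSet:
    """One set of an N-way cache; the first entry is least recently used."""

    def __init__(self, n_way):
        self.n_way = n_way
        self.blocks = OrderedDict()              # tag -> block data

    def access(self, tag, fetch_block):
        if tag in self.blocks:                   # hit: mark most recently used
            self.blocks.move_to_end(tag)
        else:                                    # miss:
            if len(self.blocks) >= self.n_way:
                self.blocks.popitem(last=False)  # evict the LRU block
            self.blocks[tag] = fetch_block(tag)  # bring in the new block
        return self.blocks[tag]
```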
Decreasing miss penalty with multilevel caches
Add a second-level cache:
• the primary cache is often on the same chip as the processor
  • primary cache, L1 cache, on-chip cache
• use SRAMs to add another cache above primary memory (DRAM)
  • L2 cache
• the miss penalty goes down if the data is in the 2nd-level cache
• on-die L2 cache
  • started to be integrated into the same die in late 1998, and is now the general trend

Example:
• CPI of 1.0 on a 5 GHz machine with a 2% miss rate and 100 ns DRAM access
• adding a 2nd-level cache with a 5 ns access time decreases the miss rate to 0.5%
• the performance gain is 2.8 (worked in the sketch below; refer to the textbook, pp. 505-506)

Using multilevel caches:
• try to optimize the hit time on the 1st-level cache
• try to optimize the miss rate on the 2nd-level cache
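A minimal sketch of that textbook calculation (assuming, as the textbook does, that L1 misses hitting in L2 pay only the L2 penalty):

```python
cycle_ns = 0.2                       # 5 GHz clock -> 0.2 ns per cycle
dram_penalty = 100 / cycle_ns        # 500 cycles to main memory
l2_penalty = 5 / cycle_ns            # 25 cycles to the L2 cache

cpi_l1_only = 1.0 + 0.02 * dram_penalty                        # 11.0
cpi_with_l2 = 1.0 + 0.02 * l2_penalty + 0.005 * dram_penalty   # 4.0
print(cpi_l1_only / cpi_with_l2)     # -> 2.75, i.e. the ~2.8x gain
```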
Cache Complexities
Not always easy to understand the implications of caches:

[Figures: theoretical behavior of radix sort vs. quicksort (left) and observed behavior of radix sort vs. quicksort (right), each plotted against size (K items to sort, 4 to 4096).]
Cache Complexities
Here is why:

[Figure: cache misses per item for radix sort vs. quicksort, against size (K items to sort, 4 to 4096).]

Memory system performance is often the critical factor
• multilevel caches and pipelined processors make it harder to predict outcomes
• compiler optimizations to increase locality sometimes hurt ILP

Difficult to predict the best algorithm: need experimental data
Summary: Improving Cache Performance
Cache performance is determined by
• Average memory access time = Hit time + Miss rate x Miss penalty

Use better technology
• use faster RAMs
• cost and availability are limitations

Decrease hit time
• make the cache smaller, but the miss rate increases
• use direct-mapped instead of set-associative, but the miss rate increases

Decrease miss rate
• make the cache larger, but that can increase hit time
• add associativity, but that can increase hit time
• increase the block size, but that increases miss penalty

Decrease miss penalty
• reduce the transfer-time component of the miss penalty
• add another level of cache (an L2 cache)
Another View of the Memory Hierarchy
[Figure: the hierarchy from the upper (faster) levels to the lower (larger) levels, with the unit of transfer at each boundary:

Level       Transfers
Registers   instructions and operands
Cache       blocks
L2 cache    blocks
Memory      pages
Disk        files
Tape

Thus far: the cache levels (blocks). Next: virtual memory.]
Memory Hierarchy Requirements
If the principle of locality allows caches to offer (close to) the speed of cache memory with the size of DRAM memory, then recursively, why not use the same idea at the next level to give the speed of DRAM memory with the size of disk memory?

Share memory between multiple processes but still provide protection: don't let one program read/write another program's memory.

Address space: give each program the illusion that it has its own private memory
• the compiler, linker, and loader are simplified because they see only the virtual address space, abstracted from physical memory allocation
Virtual Memory
Called "virtual memory"

Also allows the OS to share memory and protect programs from each other

Today, more important for protection than as just another level of the memory hierarchy

Each process thinks it has all the memory to itself

Historically, it predates caches
Virtual Memory
Addressable memory space vs. physical memory
• example:
  • a 32-bit memory address can specify 4 GB of memory
  • physical main memory = 16 MB ~ 512 MB

Distinguish between virtual and physical addresses
• a virtual address is used by the programmer to address memory within a process's address space
• a physical address is used by the hardware to access a physical memory location

"Virtual memory" provides the appearance of very large memory
• total memory of all jobs >> physical memory
• address space of each job > physical memory

Simplifies memory management for multi-processing systems
• each program operates in its own virtual address space as if it were the only program running in the system

Uses 2 storage levels
• primary (DRAM) and secondary (hard disk)

Exploits the hierarchy to reduce average access time, as in a cache
Virtual to Physical Address Translation
Each program operates in its own virtual address space
• as if it were the only program running in the system

Each program is protected from the others

The OS can decide where each program goes in memory

Hardware (HW) provides the virtual → physical mapping

[Figure: the program operates in its virtual address space; virtual addresses (instruction fetch, load, store) pass through the HW mapping to physical addresses (instruction fetch, load, store) in physical memory (incl. caches).]
Paged Virtual Memory
Most common form of address translation
• the virtual and physical address spaces are partitioned into blocks of equal size
• virtual address space blocks are called pages
• physical address space blocks are called frames (or page frames)

Placement
• any page can be placed in any frame (fully associative)

Pages are fetched on demand

[Figure: memory hierarchy reg → cache → memory → disk; a page moves between disk and a frame in memory.]
Paging Organization
Paging can map any virtual page to any physical frame

Data missing from main memory must be transferred from secondary memory (disk)
• misses (page faults) are handled by the operating system
• the miss time is very large, so the OS manages the hierarchy and schedules another process instead of stalling (context switching)

[Figure: address translation maps virtual addresses either to physical addresses in main memory or to disk addresses.]
Paging/Virtual Memory Multiple Processes
[Figure: User A and User B each have their own virtual memory image (code, static, heap, stack, starting at 0); the A page table and the B page table map both images into a single 64 MB physical memory.]
Address Translation
Programs use virtual addresses
• relocation: a program can be loaded anywhere in physical memory without recompiling or re-linking

Memory is accessed with physical addresses

Hardware (HW) provides the virtual → physical mapping
• need a translation table for each process

When a virtual address is missing from main memory, the OS handles the miss
• read the missing data, create the translation, and return to re-execute the instruction that caused the miss

[Figure: virtual address → address translation; if valid, the physical address goes to main memory. On a page fault, the page fault handler (in the OS) transfers the page from secondary memory (disk) to main memory.]
Address Mapping
[Figure: virtual memory address mapping with 1 KB pages, the page being both the unit of mapping and the unit of transfer from virtual to physical memory. Virtual pages 0 to 31 (VA 0, 1024, ..., 31744) map through the address translation MAP to physical frames 0 to 7 (PA 0, 1024, ..., 7168).

The virtual address splits into a page number and a displacement. The page number indexes into the page table (located in physical memory, reached via the page table base register); each entry holds V (valid), access rights, and a frame #. Frame # + displacement gives the physical memory address (actually, concatenation is more likely than addition).]
Address Translation Algorithm
If V = 1, the mapping is valid
• the CPU checks the permissions (R, R/W, X) against the access type
  • if the access is permitted, it generates the physical address and proceeds
  • if the access is not permitted, it generates a protection fault

If V != 1, the mapping is invalid
• the wanted page does not reside in main memory
• the CPU generates a page fault

Faults are exceptions handled by the OS
• page faults
  • the OS fetches the missing page, creates a map entry, and restarts the process
  • another user process is switched in to execute while the page is brought in from disk (context switching)
• protection faults
  • the OS checks whether it is a programming error or the permission needs to be changed
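A minimal Python sketch of this algorithm; the 4 KB page size and the dictionary-style page-table entries are illustrative assumptions, not the slides' design:

```python
PAGE_BITS = 12                               # assume 4 KB pages

class PageFault(Exception): pass
class ProtectionFault(Exception): pass

def translate(vaddr, page_table, access):    # access is "R", "W", or "X"
    vpn = vaddr >> PAGE_BITS                 # virtual page number
    offset = vaddr & ((1 << PAGE_BITS) - 1)  # displacement within the page
    entry = page_table[vpn]
    if not entry["valid"]:                   # V != 1: not in main memory,
        raise PageFault(vpn)                 # handled by the OS
    if access not in entry["rights"]:        # permission check
        raise ProtectionFault(vpn)
    return (entry["frame"] << PAGE_BITS) | offset
```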
Making VM Fast: TLB
If the page table is kept in memory
• every memory reference requires two accesses
  • one for the page table entry and one to get the actual data

Translation Lookaside Buffer (TLB)
• an additional cache for the page table only
• hardware maintains a cache of recently used page table translations
• look all accesses up in the TLB
  • a hit in the TLB gives the physical page number
  • a miss in the TLB => get the translation from the page table and reload
• the TLB is usually smaller than a cache (each entry maps a full page)
  • more associativity is possible and common
  • similar speed to a cache access
  • contains all the bits needed to translate the address and implement VM
• typical TLB entry:

  Valid | Virtual Address | Physical Address | Dirty | Access Rights
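A minimal Python sketch of the TLB lookup path (an illustration: the eviction choice and entry layout are assumptions, and PageFault is as in the previous sketch):

```python
class PageFault(Exception): pass

class TLB:
    """Small cache of recent translations, consulted before the page table."""

    def __init__(self, entries=64):
        self.entries = entries
        self.map = {}                           # vpn -> physical frame number

    def lookup(self, vpn, page_table):
        if vpn in self.map:                     # TLB hit: physical page number
            return self.map[vpn]
        entry = page_table[vpn]                 # TLB miss: read the page table
        if not entry["valid"]:
            raise PageFault(vpn)                # page fault -> OS
        if len(self.map) >= self.entries:       # reload the translation,
            self.map.pop(next(iter(self.map)))  # evicting some older entry
        self.map[vpn] = entry["frame"]
        return self.map[vpn]
```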
Virtual Memory and Cache
The OS manages the memory hierarchy between secondary storage and main memory
• it allocates physical memory to virtual memory and specifies the mapping to hardware through page tables
• hardware caches recently used page table entries in the TLB

[Figure: the CPU issues a virtual address to the address translation unit (the TLB); TLB misses go to the page tables, and page faults go to secondary storage. The resulting physical address goes to the cache, and cache misses go to physical memory (main memory / primary storage).]
TLBs and caches
[Flowchart: handling a memory access through the TLB and cache:
1. Virtual address → TLB access. On a TLB miss, raise a TLB miss exception; on a TLB hit, form the physical address.
2. Read (not a write): try to read the data from the cache. On a cache hit, deliver the data to the CPU; on a cache miss, stall while the block is read.
3. Write: if the write access bit is off, raise a write protection exception. Otherwise try to write the data to the cache: on a cache hit, write the data into the cache, update the dirty bit, and put the data and the address into the write buffer; on a cache miss, stall while the block is read.]
Page Replacement and Write Policies
When a page fault occurs, choose a page to replace
• fully associative, so any frame/page is a candidate
• choose an empty one if it exists
• otherwise choose either (just as we did for caches):
  • LRU
  • random

Write policy: always write-back
• keep a dirty bit
  • set to 1 if the page is modified
  • when a modified page is replaced, the OS writes it back to disk
Modern Systems
Things are getting complicated!