The Memory Hierarchy
Lecture # 30
Memory Systems that Support Caches
DRAMs are designed to increase density, not access time. To reduce the miss penalty we need to change the memory access design to increase throughput.
[Figure: wide memory: sequential access versus parallel access to all the words in a block.]
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways.
Memory Systems that Support Caches
One-word-wide organization (one-word-wide bus and one-word-wide memory)
Assume:
1 clock cycle (2 ns) to send the address
25 clock cycles (50 ns) for the DRAM cycle time
1 clock cycle (2 ns) to return a word of data
Memory-bus-to-cache bandwidth = number of bytes accessed from memory and transferred to the cache/CPU per clock cycle.
[Figure: CPU and on-chip cache connected to the DRAM memory by a bus carrying 32-bit data and a 32-bit address per cycle.]
One-Word-Wide Memory Organization
[Figure: CPU, on-chip cache, bus, memory.]
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
1 cycle to send the address
25 cycles to read the DRAM
1 cycle to return the data
27 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: 4/27 = 0.148 bytes per clock cycle.
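As a cross-check, here is a small Python model of the timing assumed on these slides. The function names, and the assumption that intermediate word returns overlap the next DRAM access, are inferred from the slide totals rather than stated there:

def miss_penalty(words, send=1, dram=25, ret=1, interleaved=False):
    """Miss penalty in clock cycles, following the slides' accounting:
    only the return of the last word is counted, since intermediate
    returns overlap the next DRAM access."""
    if interleaved:
        read = dram + (words - 1)  # banks read in parallel; words arrive 1 cycle apart
    else:
        read = words * dram        # one word at a time from a single DRAM
    return send + read + ret

def bandwidth(words, cycles, bytes_per_word=4):
    """Bytes transferred to the cache per clock cycle for a single miss."""
    return words * bytes_per_word / cycles

cycles = miss_penalty(1)                       # 1 + 25 + 1 = 27
print(cycles, round(bandwidth(1, cycles), 3))  # 27 0.148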
One-Word-Wide Memory Organization (four-word blocks)
What if the block size were four words? Each word is read in a separate 25-cycle DRAM access:
1 cycle to send the 1st address
4 x 25 = 100 cycles to read the DRAM
1 cycle to return the last data word
102 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/102 = 0.157 bytes per clock cycle.
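Reusing the small model sketched above reproduces these numbers:

cycles = miss_penalty(4)                       # 1 + 4*25 + 1 = 102
print(cycles, round(bandwidth(4, cycles), 3))  # 102 0.157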
Interleaved Memory Organization
[Figure: CPU, on-chip cache, and a bus to four memory banks (bank 0 to bank 3), each with a 25-cycle access time.]
For a block size of four words, the four banks are read in parallel and the words arrive one cycle apart:
1 cycle to send the 1st address
25 + 3 = 28 cycles to read the DRAM
1 cycle to return the last data word
30 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/30 = 0.533 bytes per clock cycle, i.e. about 4.27 bits per clock cycle.
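The same model, with interleaving switched on, reproduces the slide's figures:

cycles = miss_penalty(4, interleaved=True)     # 1 + (25 + 3) + 1 = 30
print(cycles, round(bandwidth(4, cycles), 3))  # 30 0.533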
Further Improvements to Memory Organization (DDR SDRAMs)
An external clock (e.g. 300 MHz) synchronizes memory accesses.
Example: a 4M-bit DRAM outputs one bit from the array; an entire internal row is read into 2048 column latches, and a multiplexor selects the bit to output.
An SDRAM is provided with the starting address and the burst length (2/4/8), so successive addresses need not be sent.
DDR (double data rate) SDRAM transfers data on both the rising and falling edges of the external clock.
In 1980 DRAMs were 64 Kbit, with a 150 ns column access to an existing row; by 2004 DRAMs were 1024 Mbit, with a 3 ns column access to an existing row.
Further Improvements – Two-Level Caches
The figure shows the AMD Athlon and Duron processor architecture.
Two-level caches allow the L1 cache to be smaller, which improves the hit time since smaller caches are faster.
The L2 cache is larger and its access time is less critical, so it can use larger block sizes.
L2 is accessed whenever a miss occurs in L1, which reduces the L1 miss penalty dramatically.
L2 is also used to store the contents of the "victim buffer": data evicted from the L1 cache when an L1 miss occurs.
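The benefit can be seen with the standard average-memory-access-time (AMAT) formula. A minimal sketch; the latencies and miss rates below are illustrative assumptions, not figures from the lecture:

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Assumed latencies (in cycles) and miss rates, for illustration only.
main_memory = 100
l1_hit, l1_miss_rate = 1, 0.05
l2_hit, l2_miss_rate = 10, 0.25   # miss rate among the accesses that reach L2

without_l2 = amat(l1_hit, l1_miss_rate, main_memory)
with_l2 = amat(l1_hit, l1_miss_rate, amat(l2_hit, l2_miss_rate, main_memory))
print(without_l2, with_l2)        # 6.0 vs 2.75 cycles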
Reducing Cache Misses through Associativity
Recall that a direct-mapped cache allows a memory location to map to only one block in the cache (identified by its tag), so it needs only one comparator.
In a fully associative cache, a block in memory can map to any block in the cache, so all entries in the cache must be searched. The search is done in parallel, with one comparator for each cache block, which is expensive in hardware; it works only for caches with a small number of blocks.
In between the two extremes are set-associative caches: a block in memory maps to exactly one set of blocks, but can occupy any position within that set.
Reducing Cache Misses through Associativity
An n-way set-associative cache has sets with n blocks each. All blocks in the selected set have to be searched, which reduces the number of comparators to n.
One-way set-associative (the same as direct mapped)
Two-way set-associative
Four-way set-associative
Eight-way set-associative (for the eight-block cache shown, the same as fully associative)
As associativity increases the miss rate decreases (10.3% data miss rate for 1-way vs. 8.1% for 8-way), but the hit time increases.
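As a quick illustration of the mapping (a sketch assuming an eight-block cache, as in the figures; the helper name is mine), a block address maps to set (block address) mod (number of sets):

def set_index(block_addr, num_blocks, assoc):
    """A memory block maps to set (block address mod number of sets)."""
    num_sets = num_blocks // assoc
    return block_addr % num_sets

for assoc in (1, 2, 4, 8):   # 1-way (direct mapped) up to 8-way (fully associative here)
    print(assoc, [set_index(b, 8, assoc) for b in (12, 20, 29)])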
Four-Way Set-Associative Cache
[Figure: a four-way set-associative cache; each set has four blocks, searched by four comparators against a 22-bit tag, selecting a 32-bit data word.]
For set-associative caches, each doubling of associativity decreases the number of index bits by one and increases the number of tag bits by one.
For a fully associative cache there are no index bits, since there is only one set.
Recall the Direct-Mapped Cache
[Figure: a direct-mapped cache with 1024 entries; the 32-bit address (bits 31 30 . . . 13 12 11 . . . 2 1 0) splits into a 20-bit tag, a 10-bit index, and a byte offset; each entry holds a valid bit, a tag, and a 32-bit data word that is driven out on a hit.]
It had 20 tag bits vs. 22 for the 4-way set-associative cache, and 10 index bits vs. 8 for the 4-way set-associative cache.
How many tag and index bits for an 8-way set-associative cache? 23 tag bits and 7 index bits.
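These bit counts are easy to verify in code. A small sketch, assuming the 4 KB of data and 4-byte blocks implied by the 1024-entry direct-mapped figure:

from math import log2

def index_and_tag_bits(cache_bytes, block_bytes, assoc, addr_bits=32):
    sets = cache_bytes // (assoc * block_bytes)
    index = int(log2(sets))
    offset = int(log2(block_bytes))        # byte-offset bits within a block
    tag = addr_bits - index - offset
    return tag, index

for assoc in (1, 4, 8):
    print(assoc, index_and_tag_bits(4096, 4, assoc))
# 1-way: (20, 10)   4-way: (22, 8)   8-way: (23, 7)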
Which Block to Replace in an Associative Cache?
The basic principle is "least recently used" (LRU): replace the block in the set that has gone unused for the longest time.
Keeping track of a block's "age" is done in hardware.
This is practical for small set-associativity (2-way or 4-way); for higher associativity LRU is either approximated or replacement is random.
For a 2-way set-associative cache, random replacement has about a 10% higher miss rate than LRU.
As caches become larger the miss rates of both strategies fall, and the difference between the two shrinks.
Exercise
Associativity usually improves the miss ratio, but not always. Give a short series of address references for which a 2-way set-associative cache with LRU replacement would experience more misses than a direct-mapped cache of the same size.
A 2-way set-associative cache has half the number of sets of a direct-mapped cache of the same size, so three addresses A, B, and C can all map to the same set.
In the direct-mapped cache, with A in one block and B and C conflicting in another, the sequence A, B, C, A, B, C, ... generates: miss, miss, miss, hit, miss, miss, hit, ...
In the 2-way set-associative cache, where A, B, and C all map to the same two-block set, the same sequence generates: miss, miss, miss, miss, miss, miss, ... (under LRU, every reference evicts exactly the block that will be needed next).
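A quick simulation of both caches on this reference stream (the particular placement of A, B, and C is an assumption consistent with the hit/miss patterns above):

# Direct mapped: A lives alone in block 0; B and C conflict in block 1.
def simulate_direct(refs, placement):
    blocks, outcome = {}, []
    for r in refs:
        slot = placement[r]
        outcome.append("hit" if blocks.get(slot) == r else "miss")
        blocks[slot] = r
    return outcome

# 2-way LRU: A, B, and C all fall in the same two-entry set.
def simulate_2way_lru(refs):
    set_, outcome = [], []       # front of the list = least recently used
    for r in refs:
        if r in set_:
            outcome.append("hit")
            set_.remove(r)
        else:
            outcome.append("miss")
            if len(set_) == 2:
                set_.pop(0)      # evict the LRU block
        set_.append(r)
    return outcome

refs = list("ABCABC") + ["A"]
print(simulate_direct(refs, {"A": 0, "B": 1, "C": 1}))  # m m m h m m h
print(simulate_2way_lru(refs))                          # all misses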
Exercise
Suppose a computer's address size is k bits (using byte addressing), the cache size is S bytes, the block size is B bytes, and the cache is A-way set-associative. Assume that B is a power of 2, so B = 2^b. Figure out the following quantities:
- the number of sets in the cache
- the number of index bits in the address
- the number of bits needed to implement the cache
Number of sets: sets/cache = (bytes/cache) / (bytes/set) = (bytes/cache) / [(blocks/set) x (bytes/block)] = S / (A x B)
Exercise - continued
Index bits: 2^(#index bits) = sets/cache = S / (A x B), so
#index bits = log2(S / (A x B)) = log2(S / (A x 2^b)) = log2(S/A) - log2(2^b) = log2(S/A) - b
Tag address bits = total address bits - index bits - block offset bits = k - [log2(S/A) - b] - b = k - log2(S/A)
Bits in tag memory/cache = (tag address bits/block) x (blocks/set) x (sets/cache) = [k - log2(S/A)] x A x S/(A x B) = (S/B) x [k - log2(S/A)]
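The same formulas, sketched as code (parameter names mirror the exercise; the example call uses an assumed 4 KB, 4-byte-block, 4-way cache):

from math import log2

def cache_geometry(k, S, B, A):
    """k: address bits, S: cache data bytes, B: block bytes, A: associativity."""
    b = int(log2(B))
    sets = S // (A * B)                 # S / (A*B) sets per cache
    index_bits = int(log2(sets))        # log2(S/A) - b
    tag_bits = k - index_bits - b       # k - log2(S/A)
    tag_memory = tag_bits * A * sets    # (S/B) * [k - log2(S/A)] bits in total
    return sets, index_bits, tag_bits, tag_memory

print(cache_geometry(32, 4096, 4, 4))   # (256, 8, 22, 22528)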