The Memory Hierarchy
Lecture # 30
Memory Systems that Support Caches
DRAMs are designed to increase density, not access time. To reduce the miss penalty we need to change the memory access design to increase throughput.
[Figure: wide memory: sequential access versus parallel access to all the words in a block.]
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways.
Memory Systems that Support Caches
One-word-wide organization (one-word-wide bus and one-word-wide memory)
Assume:
1 clock cycle (2 ns) to send the address
25 clock cycles (50 ns) for the DRAM cycle time
1 clock cycle (2 ns) to return a word of data
Memory-bus-to-cache bandwidth = number of bytes accessed from memory and transferred to the cache/CPU per clock cycle.
[Figure: CPU and on-chip cache connected to the DRAM memory by a bus carrying 32-bit data and a 32-bit address per cycle.]
One-Word-Wide Memory Organization
[Figure: CPU, on-chip cache, bus, memory.]
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
1 cycle to send the address
25 cycles to read the DRAM
1 cycle to return the data
27 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: 4/27 = 0.148 bytes per clock cycle.
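As a cross-check, here is a small Python model of the timing assumed on these slides. The function names, and the assumption that intermediate word returns overlap the next DRAM access, are inferred from the slide totals rather than stated there:

def miss_penalty(words, send=1, dram=25, ret=1, interleaved=False):
    """Miss penalty in clock cycles, following the slides' accounting:
    only the return of the last word is counted, since intermediate
    returns overlap the next DRAM access."""
    if interleaved:
        read = dram + (words - 1)  # banks read in parallel; words arrive 1 cycle apart
    else:
        read = words * dram        # one word at a time from a single DRAM
    return send + read + ret

def bandwidth(words, cycles, bytes_per_word=4):
    """Bytes transferred to the cache per clock cycle for a single miss."""
    return words * bytes_per_word / cycles

cycles = miss_penalty(1)                       # 1 + 25 + 1 = 27
print(cycles, round(bandwidth(1, cycles), 3))  # 27 0.148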
One-Word-Wide Memory Organization (four-word blocks)
What if the block size were four words? Each word is read in a separate 25-cycle DRAM access:
1 cycle to send the 1st address
4 x 25 = 100 cycles to read the DRAM
1 cycle to return the last data word
102 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/102 = 0.157 bytes per clock cycle.
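Reusing the small model sketched above reproduces these numbers:

cycles = miss_penalty(4)                       # 1 + 4*25 + 1 = 102
print(cycles, round(bandwidth(4, cycles), 3))  # 102 0.157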
Interleaved Memory Organization
[Figure: CPU, on-chip cache, and a bus to four memory banks (bank 0 to bank 3), each with a 25-cycle access time.]
For a block size of four words, the four banks are read in parallel and the words arrive one cycle apart:
1 cycle to send the 1st address
25 + 3 = 28 cycles to read the DRAM
1 cycle to return the last data word
30 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/30 = 0.533 bytes per clock cycle, i.e. about 4.27 bits per clock cycle.
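The same model, with interleaving switched on, reproduces the slide's figures:

cycles = miss_penalty(4, interleaved=True)     # 1 + (25 + 3) + 1 = 30
print(cycles, round(bandwidth(4, cycles), 3))  # 30 0.533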
Further Improvements to Memory Organization (DDR SDRAMs)
An external clock (e.g. 300 MHz) synchronizes memory accesses.
Example: a 4M-bit DRAM outputs one bit from the array; an entire internal row is read into 2048 column latches, and a multiplexor selects the bit to output.
An SDRAM is provided with the starting address and the burst length (2/4/8), so successive addresses need not be sent.
DDR (double data rate) SDRAM transfers data on both the rising and falling edges of the external clock.
In 1980 DRAMs were 64 Kbit, with a 150 ns column access to an existing row; by 2004 DRAMs were 1024 Mbit, with a 3 ns column access to an existing row.
Further Improvements – Two-Level Caches
The figure shows the AMD Athlon and Duron processor architecture.
Two-level caches allow the L1 cache to be smaller, which improves the hit time since smaller caches are faster.
The L2 cache is larger and its access time is less critical, so it can use larger block sizes.
L2 is accessed whenever a miss occurs in L1, which reduces the L1 miss penalty dramatically.
L2 is also used to store the contents of the "victim buffer": data evicted from the L1 cache when an L1 miss occurs.
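The benefit can be seen with the standard average-memory-access-time (AMAT) formula. A minimal sketch; the latencies and miss rates below are illustrative assumptions, not figures from the lecture:

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Assumed latencies (in cycles) and miss rates, for illustration only.
main_memory = 100
l1_hit, l1_miss_rate = 1, 0.05
l2_hit, l2_miss_rate = 10, 0.25   # miss rate among the accesses that reach L2

without_l2 = amat(l1_hit, l1_miss_rate, main_memory)
with_l2 = amat(l1_hit, l1_miss_rate, amat(l2_hit, l2_miss_rate, main_memory))
print(without_l2, with_l2)        # 6.0 vs 2.75 cycles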
Reducing Cache Misses through Associativity
Recall that a direct-mapped cache allows a memory location to map to only one block in the cache (identified by its tag), so it needs only one comparator.
In a fully associative cache, a block in memory can map to any block in the cache, so all entries in the cache must be searched. The search is done in parallel, with one comparator for each cache block, which is expensive in hardware; it works only for caches with a small number of blocks.
In between the two extremes are set-associative caches: a block in memory maps to exactly one set of blocks, but can occupy any position within that set.
Reducing Cache Misses through Associativity
An n-way set-associative cache has sets with n blocks each. All blocks in the selected set have to be searched, which reduces the number of comparators to n.
One-way set-associative (the same as direct mapped)
Two-way set-associative
Four-way set-associative
Eight-way set-associative (for the eight-block cache shown, the same as fully associative)
As associativity increases the miss rate decreases (10.3% data miss rate for 1-way vs. 8.1% for 8-way), but the hit time increases.
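As a quick illustration of the mapping (a sketch assuming an eight-block cache, as in the figures; the helper name is mine), a block address maps to set (block address) mod (number of sets):

def set_index(block_addr, num_blocks, assoc):
    """A memory block maps to set (block address mod number of sets)."""
    num_sets = num_blocks // assoc
    return block_addr % num_sets

for assoc in (1, 2, 4, 8):   # 1-way (direct mapped) up to 8-way (fully associative here)
    print(assoc, [set_index(b, 8, assoc) for b in (12, 20, 29)])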
Four-Way Set-Associative Cache
[Figure: a four-way set-associative cache; each set has four blocks, searched by four comparators against a 22-bit tag, selecting a 32-bit data word.]
For set-associative caches, each doubling of associativity decreases the number of index bits by one and increases the number of tag bits by one.
For a fully associative cache there are no index bits, since there is only one set.
Recall the Direct-Mapped Cache
[Figure: a direct-mapped cache with 1024 entries; the 32-bit address (bits 31 30 . . . 13 12 11 . . . 2 1 0) splits into a 20-bit tag, a 10-bit index, and a byte offset; each entry holds a valid bit, a tag, and a 32-bit data word that is driven out on a hit.]
It had 20 tag bits vs. 22 for the 4-way set-associative cache, and 10 index bits vs. 8 for the 4-way set-associative cache.
How many tag and index bits for an 8-way set-associative cache? 23 tag bits and 7 index bits.
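These bit counts are easy to verify in code. A small sketch, assuming the 4 KB of data and 4-byte blocks implied by the 1024-entry direct-mapped figure:

from math import log2

def index_and_tag_bits(cache_bytes, block_bytes, assoc, addr_bits=32):
    sets = cache_bytes // (assoc * block_bytes)
    index = int(log2(sets))
    offset = int(log2(block_bytes))        # byte-offset bits within a block
    tag = addr_bits - index - offset
    return tag, index

for assoc in (1, 4, 8):
    print(assoc, index_and_tag_bits(4096, 4, assoc))
# 1-way: (20, 10)   4-way: (22, 8)   8-way: (23, 7)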
Which Block to Replace in an Associative Cache?
The basic principle is "least recently used" (LRU): replace the block in the set that has gone unused for the longest time.
Keeping track of a block's "age" is done in hardware.
This is practical for small set-associativity (2-way or 4-way); for higher associativity LRU is either approximated or replacement is random.
For a 2-way set-associative cache, random replacement has about a 10% higher miss rate than LRU.
As caches become larger the miss rates of both strategies fall, and the difference between the two shrinks.
Exercise
Associativity usually improves the miss ratio, but not always. Give a short series of address references for which a 2-way set-associative cache with LRU replacement would experience more misses than a direct-mapped cache of the same size.
A 2-way set-associative cache has half the number of sets of a direct-mapped cache of the same size, so three addresses A, B, and C can all map to the same set.
In the direct-mapped cache, with A in one block and B and C conflicting in another, the sequence A, B, C, A, B, C, ... generates: miss, miss, miss, hit, miss, miss, hit, ...
In the 2-way set-associative cache, where A, B, and C all map to the same two-block set, the same sequence generates: miss, miss, miss, miss, miss, miss, ... (under LRU, every reference evicts exactly the block that will be needed next).
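A quick simulation of both caches on this reference stream (the particular placement of A, B, and C is an assumption consistent with the hit/miss patterns above):

# Direct mapped: A lives alone in block 0; B and C conflict in block 1.
def simulate_direct(refs, placement):
    blocks, outcome = {}, []
    for r in refs:
        slot = placement[r]
        outcome.append("hit" if blocks.get(slot) == r else "miss")
        blocks[slot] = r
    return outcome

# 2-way LRU: A, B, and C all fall in the same two-entry set.
def simulate_2way_lru(refs):
    set_, outcome = [], []       # front of the list = least recently used
    for r in refs:
        if r in set_:
            outcome.append("hit")
            set_.remove(r)
        else:
            outcome.append("miss")
            if len(set_) == 2:
                set_.pop(0)      # evict the LRU block
        set_.append(r)
    return outcome

refs = list("ABCABC") + ["A"]
print(simulate_direct(refs, {"A": 0, "B": 1, "C": 1}))  # m m m h m m h
print(simulate_2way_lru(refs))                          # all misses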
Exercise
Suppose a computer's address size is k bits (using byte addressing), the cache size is S bytes, the block size is B bytes, and the cache is A-way set-associative. Assume that B is a power of 2, so B = 2^b. Figure out the following quantities:
- the number of sets in the cache
- the number of index bits in the address
- the number of bits needed to implement the cache
Number of sets: sets/cache = (bytes/cache) / (bytes/set) = (bytes/cache) / [(blocks/set) x (bytes/block)] = S / (A x B)
Exercise - continued
Index bits: 2^(#index bits) = sets/cache = S / (A x B), so
#index bits = log2(S / (A x B)) = log2(S / (A x 2^b)) = log2(S/A) - log2(2^b) = log2(S/A) - b
Tag address bits = total address bits - index bits - block offset bits = k - [log2(S/A) - b] - b = k - log2(S/A)
Bits in tag memory/cache = (tag address bits/block) x (blocks/set) x (sets/cache) = [k - log2(S/A)] x A x S/(A x B) = (S/B) x [k - log2(S/A)]
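The same formulas, sketched as code (parameter names mirror the exercise; the example call uses an assumed 4 KB, 4-byte-block, 4-way cache):

from math import log2

def cache_geometry(k, S, B, A):
    """k: address bits, S: cache data bytes, B: block bytes, A: associativity."""
    b = int(log2(B))
    sets = S // (A * B)                 # S / (A*B) sets per cache
    index_bits = int(log2(sets))        # log2(S/A) - b
    tag_bits = k - index_bits - b       # k - log2(S/A)
    tag_memory = tag_bits * A * sets    # (S/B) * [k - log2(S/A)] bits in total
    return sets, index_bits, tag_bits, tag_memory

print(cache_geometry(32, 4096, 4, 4))   # (256, 8, 22, 22528)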