EECE476: Computer Architecture
Lecture 25: Chapter 7, Memory and Caches
The University of British Columbia
© 2005 Guy Lemieux

Posted: Dec 18, 2015
Page 1:

EECE476: Computer Architecture

Lecture 25: Chapter 7, Memory and Caches

The University of British Columbia © 2005 Guy Lemieux

Page 2:

Motivation for Caches: CPU vs. Memory Performance Gap

Memory is getting slower relative to CPU speeds (log scale!)

Goal: Make memory faster!

Page 3:

Importance of Cache Memory: Fast CPUs are Mostly Cache!

[Die photo labels: 64 kB Data Cache, 64 kB Instr. Cache, Load/Store, Execution Unit, Fetch/Scan/Align, Micro-code, Bus Unit, HyperTransport, DDR Memory Interface, 1 MB Unified Instruction/Data Level 2 Cache, Floating-Point Unit, Memory Controller]

Total Area: 193 mm²
– 42% 1 MB L2 Cache
– 4% Instr. Cache
– 4% Data Cache
  (50% is cache)
– 13% HyperTransport
– 10% DDR Memory
  (23% is I/O)
– 6% Fetch/Scan/etc.
– 4% Mem Controller
– 4% FPU
– 3% Exec Units
– 2% Bus Unit
  (only 20% is actually CPU!)

Page 4:

Main Memory

• What to use for Main Memory?
  – SRAM
  – DRAM
  – SDRAM
  – RAMBUS
  – FLASH
  – Disk

Page 5:

Memory Technology

• SRAM: Static RAM
  – 6 transistors per bit
    • Expensive
  – Transistors configured as 2 inverters in a loop
    • Stable: positive feedback holds the value strongly (static)
    • Actively drive bit value along bitlines to sense amps
  – Fast: can tune transistors and sense amps
    • Used to make cache memory!

• DRAM: Dynamic RAM
  – 1 transistor per bit
    • Inexpensive
  – Transistor holds charge (C)
    • Loses charge/value when driving the bitline (dynamic)
    • Transistor leaks charge over time (dynamic)
    • Must recharge transistor periodically (including after a data-read)
  – Slow
    • Transistors tiny, hold small charge
    • Sense amps must detect tiny change in voltage

[Circuit sketches: SRAM cell with word (row select) line and complementary bit/bit lines; DRAM cell with word (row select) line, a single bit line, and storage capacitor C]

Page 6:

Memory Technology

• SDRAM: Synchronous DRAM (not Static DRAM!)
  – New, around 1995-1996
  – Like DRAM, but pipelined (needs a clock!)
    • Pipeline register on Address inputs
    • Pipeline register on Data outputs
    • Sometimes additional registers in-between!
  – Multiple clock cycles to get data
    • Latency: CL = 2, 2.5, or 3 cycles
  – SDR vs DDR
    • Single data rate: one data word transferred per clock cycle (SDR)
    • Double data rate: two data words per clock cycle (DDR, both edges)
  – Clock rate
    • DDR: PC266, PC333, PC400 is 133 MHz, 167 MHz, 200 MHz
    • SDR: PC100, PC133 is 100 MHz, 133 MHz
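The module names above encode the transfer rate rather than the clock rate. A minimal sketch of that arithmetic (the function name is ours, not a standard API):

```python
# Transfers per second implied by a clock rate: DDR moves a data word on
# both clock edges, so its transfer rate is double the clock frequency.
def transfers_per_sec(clock_mhz, ddr=False):
    return clock_mhz * (2 if ddr else 1)  # in mega-transfers per second

# SDR modules are named for the clock rate itself:
assert transfers_per_sec(100) == 100            # PC100
assert transfers_per_sec(133) == 133            # PC133
# DDR modules are named for the doubled transfer rate:
assert transfers_per_sec(133, ddr=True) == 266  # PC266
assert transfers_per_sec(200, ddr=True) == 400  # PC400
```

PC333 works the same way: a 167 MHz clock gives 334 MT/s, rounded to 333 in the marketing name.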

Page 7:

Memory Technology

• RAMBUS
  – New, around SDRAM time
  – More complex than SDR, DDR SDRAM
  – Faster clock rates (800 MHz!)
    • Fancy signaling on circuit board
    • Narrow data width (16 bits)
    • Difficult to get working
    • Must license technology from Rambus Inc.
    • Rambus lawyers are costly, $$
  – Longer latency (eg, ten cycles)
  – Overall memory speed higher (not by a lot!)
  – Only used on high-end server PCs (too costly)

Page 8:

Memory Technology

• FLASH Memory
  – Different beast: non-volatile
    • Keeps its data even when the power is turned off!
  – 1 transistor per bit (sometimes 0.5)
    • Very cheap
  – Operation
    • Trap charge in floating (disconnected) gate of transistor (tunneling)
    • Floating gate keeps transistor turned on or off
    • Not leaky like DRAM
  – Not suitable for main memory
    • Physically wears out with use (~100,000 writes)
    • Writes are very slow, reads are slow (70 ns)

Page 9:

Memory Technology Trends

• Semiconductor manufacturing processes
  – SRAM & logic compatible
  – DRAM & logic incompatible
  – FLASH memory = logic process + extra masks + some tweaking

• Impact on CPU
  – On-chip SRAM feasible
    • Can get FAST memory! (but at high cost)
  – On-chip DRAM possible, but unlikely
    • Cannot get BIG memory
  – On-chip FLASH may be feasible
    • Can store some non-volatile information

Page 10:

Memory Technology Trends

Memory is getting slower relative to CPU speeds (log scale!)

Page 11:

Recent Impact of Memory Speed

• 1996
  – 100 MHz CPU clock rate (10 ns)
  – 80 ns memory access time
  – Memory read: 8 CPU clock cycles
  – Add 8 pipeline stages just to access data memory?
    • DF+DS+DT+DF+DF+DS+DS+DE ?

• 2003
  – 3 GHz CPU clock rate (0.33 ns = 330 ps)
  – PC400 DDR (200 MHz or 5 ns)
  – Memory read: 5 ns × 2 cycles = 10 ns = 30 CPU clock cycles
  – Add 30 pipeline stages? Impossible to keep up!
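The cycle counts on this slide are just the memory latency divided by the CPU cycle time. A quick sketch of that arithmetic:

```python
# Memory latency expressed in CPU clock cycles: latency / cycle_time.
def mem_cycles(mem_latency_ns, cpu_clock_ghz):
    cycle_time_ns = 1.0 / cpu_clock_ghz   # e.g. 3 GHz -> 0.333 ns
    return mem_latency_ns / cycle_time_ns

# 1996: 100 MHz CPU (10 ns cycle), 80 ns memory access -> 8 cycles
assert round(mem_cycles(80, 0.1)) == 8
# 2003: 3 GHz CPU (0.33 ns cycle), 10 ns memory read -> 30 cycles
assert round(mem_cycles(10, 3.0)) == 30
```

The ratio grew from 8 to 30 in seven years, which is the performance gap the rest of the lecture addresses.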

Page 12:

Memory Technology (1997)

Memory Technology   Access Time         Cost/MB
SRAM                5-25 ns             $100-$250
SDRAM               50-60 ns            $10-$20  (today: cheaper than DRAM)
DRAM                60-120 ns           $5-$10
Disk                10-20 million ns    $0.10-$0.20

Page 13:

Cache Memory

• Problem:
  – SRAM fast, but costly
  – DRAM cheap, but slow

• Solution: Cache
  – Small SRAM memory
  – Holds frequently-used data
  – Logically, insert between CPU and main memory

• Memory Hierarchy is born
  – Generally, use cheaper/bigger/slower memory as you move farther away from the CPU

• Question: How to access the cache SRAM?

Page 14:

Memory Hierarchy

[Figure: multiple levels of memory. The CPU sits above Level 1, Level 2, ..., Level n; access time increases and memory size grows with distance from the CPU]

Page 15:

Memory Hierarchy

[Figure: speed/size/cost pyramid. CPU registers: fastest, smallest, highest cost ($/bit); then SRAM; then SDRAM; then disk and/or tape: slowest, biggest, lowest cost]

Page 16:

Accessing a Cache

• Cache: "hide" in French; a safe place to hide things

• Important concept: transparent to user/software!
  – Wish to speed up ALL programs
    • Do not want to rewrite old programs
    • Do not want to write programs to specifically use the cache

• How to hide? Need a general cache management policy
  – CPU manages the cache itself (NOT managed by software)
  – Load data
    • If data is in the cache, retrieve it from the cache
    • Else, retrieve it from main memory and put a copy in the cache
  – Store data (write-through, no-alloc-on-write policy)
    • If data is in the cache, write to that cache location and to memory
    • Else, write the data to memory only

Page 17:

Using a Cache

• Problems
  – How to find the existing location of data in the cache?
  – How to find a new location for new cache data?
  – What if the cache is full?
    • Must find a location that is no longer needed
    • Must evict data presently in the cache

• Various solutions
  – Different styles of caches!

Page 18:

Associative Cache

• Choosing a location
  – Associative cache is very flexible
  – New data: any location is eligible
  – Find existing data: must search all locations
  – Difficult, but not impossible

• CAM: content-addressable memory
  – Searches all locations (addresses) in "1 cycle"
  – Reports the "match" location
  – Match location holds the data

• Cache is full?
  – Must throw out old data
  – Need a replacement or eviction policy
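The CAM behaviour can be mimicked in software with a sequential scan (a sketch only: a hardware CAM compares every entry in parallel, which is what makes the "1 cycle" search possible; the entries here are made-up examples):

```python
# Sequential stand-in for a CAM: find which entry holds a given tag.
entries = [(0x1A, "foo"), (0x2B, "bar"), (0x3C, "baz")]  # (tag, data) pairs

def cam_lookup(tag):
    for location, (stored_tag, data) in enumerate(entries):
        if stored_tag == tag:
            return location, data   # the "match" location holds the data
    return None                     # miss: tag is nowhere in the cache

assert cam_lookup(0x2B) == (1, "bar")   # hit at location 1
assert cam_lookup(0x4D) is None         # miss
```

In hardware, every stored tag has its own comparator, so the loop above collapses into one parallel comparison.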

Page 19:

Associative Cache: Replacement Policies

• Associative cache is full? Possible replacement policies:
  – Ideal
    • Non-causal: cannot predict what the CPU will do in the future!
    • CPU architects use simulation to find the performance of an ideal cache
  – Least Frequently Accessed
    • Count # of accesses, choose the one accessed the least
    • Problem: you will always choose to evict NEW DATA
  – Least Recently Used (LRU)
    • Timestamp every time you use data in the cache
    • Location with the oldest timestamp is evicted
  – Pseudo-LRU
    • Periodically "age" the contents of the cache
    • Flag data every time it is used
    • Location with "aged" status is evicted

• RANDOM works too! (LRU or pseudo-LRU is slightly better, so is commonly used)

Page 20:

Direct-mapped Cache

• Choosing a location
  – Much more restrictive than an associative cache
  – New data: only one eligible location
  – Find existing data: search one location only
  – Location: use the lower bits of the data address
  – Easy to use SRAM, fast access!

• Cache is full? Replacement is easy…
  – Only one eligible location
  – Must evict the old data there

Page 21:

Direct-mapped Cache

[Figure: memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 mapping into an 8-entry cache (locations 000-111). Each address in memory maps to only one location in a direct-mapped cache; the lowest 3 bits of the address determine the location]

Page 22:

Direct-mapped Cache

[Figure: 32-bit memory address split into Tag (bits 31-12, 20 bits), Index (bits 11-2, 10 bits), and Byte offset (bits 1-0). The 10-bit index selects one of 1024 cache entries (0, 1, 2, ..., 1021, 1022, 1023), each holding a Valid bit, a 20-bit Tag, and 32 bits of Data; a tag match on a valid entry signals a Hit]

Cache size: 1024 locations × 4 data bytes each = 4 kB cache

Overhead: 1024 locations × 21 bits (Tag + Valid) = 2.625 kB of tag bits (more than 50% overhead!)
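The size and overhead figures on this slide follow directly from how the address is split. A sketch of that arithmetic (variable names are ours):

```python
# 4 kB direct-mapped cache with 4-byte blocks and 32-bit addresses:
# 2 offset bits, 10 index bits (1024 locations), 20 tag bits.
address_bits = 32
locations = 1024
block_bytes = 4

offset_bits = (block_bytes - 1).bit_length()        # log2(4)  = 2
index_bits = (locations - 1).bit_length()           # log2(1024) = 10
tag_bits = address_bits - index_bits - offset_bits  # 32 - 10 - 2 = 20

data_kB = locations * block_bytes / 1024            # 4.0 kB of data
overhead_bits = locations * (tag_bits + 1)          # +1 for the valid bit
overhead_kB = overhead_bits / 8 / 1024              # 2.625 kB of tags

assert (offset_bits, index_bits, tag_bits) == (2, 10, 20)
assert data_kB == 4.0 and overhead_kB == 2.625
```

The overhead ratio is 2.625 / 4 ≈ 66%, which is why small blocks make the tag storage cost more than half as much again as the data it protects.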