Page 1

COMP 206: Computer Architecture and Implementation
Montek Singh
Wed., Oct. 23, 2002
Topic: Memory Hierarchy Design (HP3 Ch. 5)
(Caches, Main Memory and Virtual Memory)

Page 2: The Big Picture: Where are We Now?

The Five Classic Components of a Computer:
- Control
- Datapath
- Memory
- Input
- Output
(Control and Datapath together make up the Processor)

This lecture (and next few): Memory System

Page 3: The Motivation for Caches

Motivation:
- Large (cheap) memories (DRAM) are slow
- Small (costly) memories (SRAM) are fast

Make the average access time small:
- service most accesses from a small, fast memory
- reduce the bandwidth required of the large memory

[Figure: Processor connected to a memory system consisting of a Cache backed by DRAM]

Page 4: The Principle of Locality

Programs access a relatively small portion of the address space at any instant of time.
Example: 90% of time in 10% of the code

Two different types of locality:
- Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon
- Spatial Locality (locality in space): if an item is referenced, items close by tend to be referenced soon

[Figure: probability of reference plotted across the address space, 0 to 2^n]
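A small illustration of both kinds of locality (my own example, not from the slides). In a language with contiguous row-major arrays such as C the difference below is pronounced; in Python it is muted because lists hold references, but the access patterns still show the idea:

    # Sketch: same computation, different traversal order, different locality.
    N = 1024
    a = [[1] * N for _ in range(N)]      # N x N table, stored row by row

    def sum_row_major(m):
        total = 0                        # 'total' is reused every iteration: temporal locality
        for i in range(N):
            for j in range(N):           # consecutive accesses touch neighboring
                total += m[i][j]         # elements of one row: spatial locality
        return total

    def sum_column_major(m):
        total = 0
        for j in range(N):
            for i in range(N):           # consecutive accesses jump between rows:
                total += m[i][j]         # poor spatial locality
        return total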

Page 5: Levels of the Memory Hierarchy

  Level                  Capacity        Access Time   Cost/bit
  CPU Registers          500 Bytes       0.25 ns       ~$.01
  Cache (L1, L2, ...)    16K-1M Bytes    1 ns          ~$.0001
  Main Memory            64M-2G Bytes    100 ns        ~$.0000001
  Disk                   100 G Bytes     5 ms          10^-5 to 10^-7 cents
  Tape/Network           "infinite"      secs.         10^-8 cents

Staging/transfer unit between adjacent levels, and who manages it:

  Registers <-> Cache          Words  (1-8 bytes)      programmer/compiler
  Cache <-> Memory             Blocks (8-128 bytes)    cache controller
  Memory <-> Disk              Pages  (4-64K bytes)    OS
  Disk <-> Tape/Network        Files  (Mbytes)         user/operator

Moving toward the upper levels: faster. Moving toward the lower levels: larger.

Page 6: Memory Hierarchy: Principles of Operation

At any given time, data is copied between only 2 adjacent levels:
- Upper Level (Cache): the one closer to the processor
  - Smaller, faster, and uses more expensive technology
- Lower Level (Memory): the one further away from the processor
  - Bigger, slower, and uses less expensive technology

Block: the smallest unit of information that can either be present or not present in the two-level hierarchy

[Figure: block Blk X in the upper level (cache) and block Blk Y in the lower level (memory), transferred to/from the processor]

Page 7: Memory Hierarchy: Terminology

Hit: data appears in some block in the upper level (e.g., Block X in the previous slide)
- Hit Rate = fraction of memory accesses found in the upper level
- Hit Time = time to access the upper level
  = memory access time + time to determine hit/miss

Miss: data needs to be retrieved from a block in the lower level (e.g., Block Y in the previous slide)
- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: includes time to fetch a new block from the lower level
  = time to replace a block in the upper level from the lower level + time to deliver the block to the processor

Hit Time is significantly less than Miss Penalty.
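The formula is not stated on this slide, but these terms combine into the standard average memory access time (AMAT) expression. A quick sketch with made-up numbers:

    # Average memory access time from the terms defined above
    # (the example numbers are illustrative, not from the slides).
    def amat(hit_time, miss_rate, miss_penalty):
        """AMAT = hit time + miss rate * miss penalty."""
        return hit_time + miss_rate * miss_penalty

    # Example: 1 ns hit time, 5% miss rate, 100 ns miss penalty.
    print(amat(hit_time=1.0, miss_rate=0.05, miss_penalty=100.0))  # 6.0 ns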

Page 8: Cache Addressing

The cache is organized as a hierarchy of structures:
- Set 0 ... Set j-1: the cache contains j sets
- Block 0 ... Block k-1, plus replacement info: each set contains k blocks
- Sector 0 ... Sector m-1, plus a tag: each block contains m sectors
- Byte 0 ... Byte n-1, plus Valid, Dirty, and Shared bits: each sector contains n bytes

Block/line is the unit of allocation.
Sector/sub-block is the unit of transfer and coherence.
Cache parameters j, k, m, n are integers, and generally powers of 2.

Page 9: Examples of Cache Configurations

  # Sets (j)   # Blocks (k)   # Sectors (m)   # Bytes (n)   Name
  1            k              m               n             Fully associative
  j            1              m               n             Direct mapped
  j            k              1               n             A cache that is not sectored
  j            4              m               n             4-way set-associative cache
  64           8              2               32            PowerPC 601
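Reading the last row against the j, k, m, n parameters of the previous slide, the data capacity is simply j * k * m * n bytes. A quick check (the helper name is mine), which agrees with the 32 KB the PowerPC 601 is listed with on the next slide:

    # Data capacity in bytes from the Cache Addressing parameters.
    def cache_data_bytes(j, k, m, n):
        return j * k * m * n

    # PowerPC 601 row: 64 sets x 8 blocks x 2 sectors x 32 bytes
    print(cache_data_bytes(64, 8, 2, 32))   # 32768 bytes = 32 KB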

Page 10: Storage Overhead of Cache

Using the parameters of the Cache Addressing slide (tag bits per block, replacement-info bits per set, and 3 status bits per sector: Valid, Dirty, Shared):

\[
\frac{\text{Total number of bits}}{\text{Number of data bits}}
  = \frac{j\left[\,\mathrm{repl} + k\left(\mathrm{tag} + m\,(3 + 8n)\right)\right]}{8\,j\,k\,m\,n}
  = 1 + \frac{\mathrm{repl} + k\left(\mathrm{tag} + 3m\right)}{8\,k\,m\,n}
\]

The storage overhead reported in the table below is the second (non-data) term.

  System                 # Address bits   (j,k,m,n)      Cache size   Storage overhead
  IBM 360/85             24               (1,16,16,64)   16 KB        0.85%
  IBM 3033               32               (64,16,1,64)   64 KB        5.95%
  Motorola 68030         32               (24,4,2,2)     256 B        28.10%
  Intel i486             32               (128,4,1,16)   8 KB         19.90%
  DEC Alpha AXP 21064    34               (256,1,1,32)   8 KB         9.37%
  IBM PowerPC 601        32               (64,8,2,32)    32 KB        5.76%
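A sketch of the overhead term above, applied to the DEC Alpha AXP 21064 row. The tag-width calculation (address bits minus set-index bits minus block-offset bits) and repl = 0 for a direct-mapped cache are my assumptions rather than something spelled out on the slide, but the result agrees with the 9.37% in the table:

    from math import log2

    # Non-data bits per set divided by data bits per set, per the formula above.
    def storage_overhead(addr_bits, j, k, m, n, repl=0):
        tag = addr_bits - int(log2(j)) - int(log2(m * n))   # tag bits per block (assumption)
        extra_per_set = repl + k * (tag + 3 * m)            # repl + tags + 3 status bits/sector
        data_per_set = 8 * k * m * n                        # data bits per set
        return extra_per_set / data_per_set

    # DEC Alpha AXP 21064: 34 address bits, (j,k,m,n) = (256,1,1,32), direct mapped.
    print(f"{storage_overhead(34, 256, 1, 1, 32):.2%}")     # 9.38%, consistent with 9.37%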

Page 11: Cache Organization

Direct Mapped Cache
- Each memory location can be mapped to only 1 cache location
- No need to make any decision :-)
  - The current item replaces the previous item in that cache location

N-way Set Associative Cache
- Each memory location has a choice of N cache locations

Fully Associative Cache
- Each memory location can be placed in ANY cache location

Cache miss in an N-way Set Associative or Fully Associative Cache:
- Bring in the new block from memory
- Throw out a cache block to make room for the new block
- Need to decide which block to throw out!
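A sketch of how a direct-mapped cache picks its single location: the address is split into a byte offset, an index (which cache location), and a tag (which block currently lives there). The 256-set, 32-byte-block geometry is an assumption chosen for illustration (it happens to match the Alpha 21064 row above):

    # Direct-mapped address decomposition (geometry is an assumption).
    BLOCK_BYTES = 32          # bytes per block  -> 5 offset bits
    NUM_SETS    = 256         # cache locations  -> 8 index bits

    def split_address(addr):
        offset = addr % BLOCK_BYTES                   # byte within the block
        index  = (addr // BLOCK_BYTES) % NUM_SETS     # which cache location
        tag    = addr // (BLOCK_BYTES * NUM_SETS)     # identifies the block stored there
        return tag, index, offset

    # Two addresses with the same index but different tags conflict:
    # the later one replaces the earlier one in that cache location.
    print(split_address(0x00001234))   # (0, 145, 20)
    print(split_address(0x00003234))   # (1, 145, 20)  same index, different tag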

Page 12: Write Allocate versus Not Allocate

Assume that a 16-bit write to memory location 0x00 causes a cache miss.
Do we read in the block?
- Yes: Write Allocate
- No: Write No-Allocate

Page 13: Basics of Cache Operation

READ
- Hit: CPU reads from cache
- Miss: Allocate and load block from MM, then CPU reads from it

WRITE (write through)
- Hit: Write into cache plus write through into MM
- Miss: Write through into MM, with or without write allocate

WRITE (write back)
- Hit: Write into cache only and set dirty bit (so that on replacement, the block is written back to MM only if modified)
- Miss: Write allocate with write back

Page 14: Details of Simple Blocking Cache

Write Through
- READ hit: CPU reads cache
- READ miss: CPU detects miss, stalls; cache selects replacement block; new block loaded from MM; requested word sent to CPU; CPU resumes operation
- WRITE hit: CPU writes cache; CPU writes MM and stalls until write completes
- WRITE miss: CPU detects miss; CPU writes MM (cache also, if write allocate); stalls until write completes

Write Back
- READ hit: CPU reads cache
- READ miss: CPU detects miss, stalls; cache selects replacement block; new block loaded from MM; word sent to CPU; CPU resumes operation
- WRITE hit: CPU writes cache
- WRITE miss: CPU detects miss, stalls; cache selects replacement block; old block evicted from cache; new block loaded from MM (write allocate); CPU resumes operation
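The contrast above can be condensed into a few lines of code. This is a deliberately simplified model of my own (one level, whole blocks as the unit, no sectors, unlimited capacity, so replacement and write-back of evicted dirty blocks are not modeled); its only purpose is to show where the main-memory writes happen under each policy:

    class TinyCache:
        def __init__(self, write_back=True):
            self.write_back = write_back
            self.blocks = {}          # block address -> [data, dirty]
            self.memory_writes = 0    # writes that reach main memory (MM)

        def read(self, blk, mem):
            if blk not in self.blocks:                       # read miss:
                self.blocks[blk] = [mem.get(blk, 0), False]  # allocate and load from MM
            return self.blocks[blk][0]                       # then CPU reads from cache

        def write(self, blk, value, mem):
            if self.write_back:
                # Write back: write into cache only and set the dirty bit;
                # MM would be updated only when the dirty block is replaced.
                if blk not in self.blocks:                   # write allocate
                    self.blocks[blk] = [mem.get(blk, 0), False]
                self.blocks[blk] = [value, True]
            else:
                # Write through (with write no-allocate on a miss):
                # every store goes to MM; the cache is updated only on a hit.
                mem[blk] = value
                self.memory_writes += 1
                if blk in self.blocks:
                    self.blocks[blk] = [value, False]

    mem = {}
    wt, wb = TinyCache(write_back=False), TinyCache(write_back=True)
    for _ in range(100):
        wt.write(7, 42, mem)
        wb.write(7, 42, mem)
    print(wt.memory_writes, wb.memory_writes)   # 100 vs 0: write back cuts MM traffic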

Page 15: A-way Set-Associative Cache

A-way set associative: A entries for each cache index
- A direct-mapped caches operating in parallel

Example: Two-way set associative cache
- Cache Index selects a "set" from the cache
- The two tags in the set are compared in parallel
- Data is selected based on the tag result

[Figure: two direct-mapped banks, each with a valid bit, cache tag, and cache data per entry, indexed by the Cache Index; two comparators match the address tag against each bank's tag, their outputs are ORed to form Hit and drive the mux (SEL0/SEL1) that selects the cache block]
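A sketch of the two-way lookup just described: the index selects a set, both tags are checked (in parallel in hardware), and a hit in either way selects that way's data. The 128-set, 32-byte-block geometry and the data layout are assumptions for illustration:

    NUM_SETS, BLOCK_BYTES, WAYS = 128, 32, 2
    cache = [[{"valid": False, "tag": None, "data": None} for _ in range(WAYS)]
             for _ in range(NUM_SETS)]

    def lookup(addr):
        index = (addr // BLOCK_BYTES) % NUM_SETS      # Cache Index selects a set
        tag   = addr // (BLOCK_BYTES * NUM_SETS)
        # In hardware both tag comparisons happen in parallel; here we check
        # each way in turn and OR the results into a hit signal.
        for way in cache[index]:
            if way["valid"] and way["tag"] == tag:
                return True, way["data"]              # hit: mux selects this way
        return False, None                            # miss

    # Fill way 0 of set 5 with a block, then look it up.
    addr = (7 * NUM_SETS + 5) * BLOCK_BYTES           # tag 7, index 5, offset 0
    cache[5][0] = {"valid": True, "tag": 7, "data": b"\x00" * BLOCK_BYTES}
    print(lookup(addr)[0])                            # True: hit in way 0 of set 5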

Page 16: Fully Associative Cache

Push the set-associative idea to its limit!
- Forget about the Cache Index
- Compare the Cache Tags of all cache tag entries in parallel
- Example: Block Size = 32 B, we need N 27-bit comparators

[Figure: a 32-bit address split into a 27-bit Cache Tag (bits 31-5) and a Byte Select (bits 4-0, e.g., 0x01); every entry's valid bit and 27-bit tag is compared in parallel against the address tag, and the matching entry's 32-byte data block (Byte 0 ... Byte 31, Byte 32 ... Byte 63, ...) supplies the selected byte]
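A quick arithmetic check of the 27-bit figure, assuming the 32-bit addresses implied by the figure's 31..0 bit numbering:

    from math import log2
    addr_bits, block_bytes = 32, 32
    byte_select_bits = int(log2(block_bytes))   # 5 bits select a byte in a 32 B block
    tag_bits = addr_bits - byte_select_bits     # 27 bits, one comparator per entry
    print(byte_select_bits, tag_bits)           # 5 27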

Page 17: Cache Shapes

For a 16-block cache (associativity A, number of sets S):
- Direct-mapped (A = 1, S = 16)
- 2-way set-associative (A = 2, S = 8)
- 4-way set-associative (A = 4, S = 4)
- 8-way set-associative (A = 8, S = 2)
- Fully associative (A = 16, S = 1)

Page 18: Cache Block Replacement Policies

Random Replacement
- Hardware randomly selects a cache item and throws it out

Least Recently Used (LRU)
- Hardware keeps track of the access history
- Replace the entry that has not been used for the longest time
- For a 2-way set-associative cache, one bit suffices for LRU replacement

Example of a Simple "Pseudo" LRU Implementation (sketched in code below)
- Assume 64 fully associative entries (Entry 0 ... Entry 63) and a hardware replacement pointer that points to one cache entry
- Whenever an access is made to the entry the pointer points to: move the pointer to the next entry
- Otherwise: do not move the pointer
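A minimal sketch of the pointer scheme above for 64 fully associative entries. The slide does not say explicitly which entry is replaced on a miss; replacing the entry under the pointer is my reading of the scheme:

    NUM_ENTRIES = 64
    entries = [None] * NUM_ENTRIES      # entry index -> block address
    pointer = 0                         # hardware replacement pointer

    def access(block):
        global pointer
        if block in entries:                            # hit
            if entries.index(block) == pointer:
                # Accessed the entry the pointer points to: advance the pointer.
                pointer = (pointer + 1) % NUM_ENTRIES
            # Otherwise: do not move the pointer.
            return True
        # Miss (my assumption): replace the entry under the pointer; the miss
        # itself accesses that entry, so the pointer then advances per the rule.
        entries[pointer] = block
        pointer = (pointer + 1) % NUM_ENTRIES
        return False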

Page 19: Cache Write Policy

Cache reads are much easier to handle than cache writes
- Instruction cache is much easier to design than data cache

Cache write
- How do we keep data in the cache and memory consistent?

Two options (decision time again :-)
- Write Back: write to cache only; write the cache block to memory when that cache block is being replaced on a cache miss
  - Need a "dirty bit" for each cache block
  - Greatly reduces the memory bandwidth requirement
  - Control can be complex
- Write Through: write to cache and memory at the same time
  - What!!! How can this be? Isn't memory too slow for this?

Page 20: Write Buffer for Write Through

[Figure: Processor writes into the Cache and a Write Buffer; the Write Buffer drains into DRAM]

Write Buffer: needed between the cache and main memory
- Processor: writes data into the cache and the write buffer
- Memory controller: writes contents of the buffer to memory

Write buffer is just a FIFO (sketched in code below)
- Typical number of entries: 4
- Works fine if store frequency (w.r.t. time) << 1 / DRAM write cycle

Memory system designer's nightmare:
- Store frequency (w.r.t. time) > 1 / DRAM write cycle
- Write buffer saturation
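A sketch of the write buffer's two sides using a plain FIFO; the 4-entry depth comes from the slide, while the timing model and numbers are made up for illustration:

    from collections import deque

    class WriteBuffer:
        def __init__(self, entries=4):
            self.fifo = deque()
            self.entries = entries
            self.processor_stalls = 0

        def store(self, addr, data, dram):
            # Processor side: enqueue the store and keep executing. If the FIFO
            # is full, the processor must stall until one entry drains to DRAM.
            if len(self.fifo) == self.entries:
                self.processor_stalls += 1
                self.drain_one(dram)
            self.fifo.append((addr, data))

        def drain_one(self, dram):
            # Memory controller side: retire one buffered write per DRAM write cycle.
            if self.fifo:
                addr, data = self.fifo.popleft()
                dram[addr] = data

    dram = {}
    wb = WriteBuffer()
    for t in range(20):
        wb.store(t, t, dram)       # one store per CPU cycle ...
        if t % 3 == 0:
            wb.drain_one(dram)     # ... but DRAM retires an entry only every 3rd cycle
    print(wb.processor_stalls)     # > 0: store frequency exceeds 1 / DRAM write cycle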

Page 21: Write Buffer Saturation

Store frequency (w.r.t. time) > 1 / DRAM write cycle
- If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it
- CPU Cycle Time << DRAM Write Cycle Time

Solutions for write buffer saturation:
- Use a write back cache
- Install a second level (L2) cache

[Figure: Processor, Cache, and Write Buffer feeding DRAM; and the same arrangement with an L2 cache inserted between the write buffer and DRAM]

Page 22: Four Questions for Memory Hierarchy

- Where can a block be placed in the upper level? (Block placement)
- How is a block found if it is in the upper level? (Block identification)
- Which block should be replaced on a miss? (Block replacement)
- What happens on a write? (Write strategy)