Improving Cache Performance

Page 1:

Page 2:

Improving Cache Performance

• Four categories of optimisation:
– Reduce miss rate
– Reduce miss penalty
– Reduce miss rate or miss penalty using parallelism
– Reduce hit time

AMAT = Hit time + Miss rate × Miss penalty
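To make the formula concrete, a minimal sketch in C (the hit time, miss rate, and miss penalty below are illustrative numbers, not figures from the slides):

    #include <stdio.h>

    /* AMAT = hit time + miss rate * miss penalty, all in cycles. */
    static double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Illustrative: 1-cycle hit, 5% miss rate, 100-cycle penalty. */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 100.0));  /* 6.00 */
        return 0;
    }

Each of the four optimisation categories attacks one term of this sum.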

Page 3:

5.5. Reducing Miss Rate

• Three sources of misses:
– Compulsory
  • "Cold start" misses
– Capacity
  • Cache is full
– Conflict
  • Set is full/block is occupied

• Corresponding remedies:
– Increase block size
– Increase size of cache
– Increase degree of associativity

Page 4:

Larger Block Size

• Bigger blocks reduce compulsory misses
– Spatial locality

• BUT:
– Increased miss penalty
  • More data to transfer
– Possibly increased overall miss rate
  • More conflict and capacity misses as there are fewer blocks

Page 5:

Effect of Block Size

[Figure: three sketch graphs plotted against block size — miss rate, miss penalty (access time plus transfer time), and the resulting AMAT.]

Page 6:

Larger Caches

• Reduces capacity misses

• Increases hit time and cost

Page 7:

Higher Associativity

• Miss rates improve with higher associativity

• Two rules of thumb:
– 8-way set associative caches are almost as effective as fully associative
  • But much simpler!
– 2:1 cache rule
  • A direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2

Page 8:

Way Prediction

• Set-associative cache predicts which block will be needed on the next access to the set

• Only one tag check is done
– If mispredicted, the whole set must be checked

• E.g. Alpha 21264 instruction cache
– Prediction rate > 85%
– Correct prediction: 1-cycle hit
– Misprediction: 3 cycles
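As a rough worked figure from the numbers above (taking the prediction rate as exactly 85%): average hit time ≈ 0.85 × 1 + 0.15 × 3 = 1.3 cycles.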

Page 9:

Pseudo-Associative Caches

• Check a direct-mapped cache for a hit as usual

• If it misses, check a second block
– Invert the MSB of the index

• One fast and one slow hit time
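A minimal sketch of the index trick in C (the 8-bit index and 32-byte blocks are assumed parameters, not from the slides):

    #include <stdint.h>
    #include <stdio.h>

    #define INDEX_BITS  8                        /* 256 sets (assumed)      */
    #define INDEX_MASK  ((1u << INDEX_BITS) - 1u)
    #define OFFSET_BITS 5                        /* 32-byte blocks (assumed) */

    /* Index of the block checked first, as in a direct-mapped cache. */
    static uint32_t primary_index(uint32_t addr)
    {
        return (addr >> OFFSET_BITS) & INDEX_MASK;
    }

    /* On a miss, probe the block whose index has its MSB inverted. */
    static uint32_t secondary_index(uint32_t idx)
    {
        return idx ^ (1u << (INDEX_BITS - 1));
    }

    int main(void)
    {
        uint32_t idx = primary_index(0x12345678);
        printf("primary set %u, secondary set %u\n", idx, secondary_index(idx));
        return 0;
    }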

Page 10:

Compiler Optimisations

• Compilers can optimise code to minimise miss rates:
– Reordering procedures
– Aligning basic blocks with cache blocks
– Reorganising array element accesses
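As a sketch of the last item, loop interchange is the classic reorganisation (illustrative code, not from the slides); C stores arrays row-major, so making j the inner loop turns stride-N accesses into stride-1:

    #define N 1024
    double a[N][N];

    /* Poor locality: column-major traversal touches a new cache block
       on almost every iteration. */
    double sum_column_order(void)
    {
        double total = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                total += a[i][j];
        return total;
    }

    /* Interchanged loops: sequential accesses reuse each fetched block. */
    double sum_row_order(void)
    {
        double total = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                total += a[i][j];
        return total;
    }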

Page 11:

5.6. Reduce Miss Rate or Miss Penalty via Parallelism

• Three techniques that overlap instruction execution with memory access

Page 12:

Nonblocking caches

• Dynamic scheduling allows CPU to continue with other instructions while waiting for data

• Nonblocking cache allows other cache accesses to continue while waiting for data

Page 13:

Hardware Prefetching

• Fetch data/instructions before they are requested by the processor
– Either into cache or another buffer

• Particularly useful for instructions
– High degree of spatial locality

• UltraSPARC III
– Special prefetch cache for data
– Increases effectiveness by about four times

Page 14:

Compiler Prefetching

• Compiler inserts “prefetch” instructions

• Two types:
– Prefetch register value
– Prefetch data cache block

• Can be faulting or non-faulting

• Cache continues as normal while data is prefetched
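As one concrete (non-faulting) form of the idea, GCC and Clang expose a prefetch builtin that the compiler or programmer can insert; the prefetch distance below is an assumed tuning parameter:

    #include <stddef.h>

    void scale(double *a, size_t n, double k)
    {
        const size_t ahead = 16;   /* assumed distance, in elements */
        for (size_t i = 0; i < n; i++) {
            /* Request the block we will write ~16 iterations from now;
               args: address, 1 = prefetch for write, moderate locality. */
            if (i + ahead < n)
                __builtin_prefetch(&a[i + ahead], 1, 1);
            a[i] *= k;
        }
    }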

Page 15:

SPARC V9

• Prefetch:

    prefetch [%rs1 + %rs2], fcn
    prefetch [%rs1 + imm13], fcn

• fcn = prefetch function:
    0 = Prefetch for several reads
    1 = Prefetch for one read
    2 = Prefetch for several writes
    3 = Prefetch for one write
    4 = Prefetch page

Page 16:

5.7. Reducing Hit Time

• Critical
– Often affects CPU clock cycle time

Page 17:

Small, simple caches

• Small usually equals fast in hardware

• A small cache may reside on the processor chip
– Decreases communication
– Compromise: tags on chip, data separate

• Direct mapped
– Data can be read in parallel with tag checking

Page 18:

Avoiding address translation

• Physical caches
– Use physical addresses
– Address translation must happen before cache lookup

• Virtual caches
– Use virtual addresses
– Protection issues
– High context-switching overhead

Page 19:

Virtual caches

• Minimising context-switch overhead:
– Add a process-identifier tag to the cache

• Multiple virtual addresses may refer to a single physical address (aliasing)
– Hardware enforces anti-aliasing
– Software requires the less significant bits of aliases to be the same (page colouring)

Page 20:

Avoiding address translation (cont.)

• Choice of page size:
– Bigger than cache index + offset
– Address translation and tag lookup can then happen in parallel

[Figure: the address split into tag, index, and offset; the CPU's page offset supplies the cache index and block offset directly while the page number goes through VM translation to form the tag, so cache lookup and translation proceed in parallel.]
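A quick check of the size condition in C (cache and page parameters assumed for illustration):

    #include <stdio.h>

    int main(void)
    {
        unsigned offset_bits      = 5;   /* 32-byte blocks (assumed)        */
        unsigned index_bits       = 9;   /* 512 sets, 16 kB direct-mapped   */
        unsigned page_offset_bits = 14;  /* 16 kB pages (assumed)           */

        /* If index + offset fit inside the page offset, those address
           bits are untranslated, so cache indexing can start while the
           TLB translates the page number. */
        if (offset_bits + index_bits <= page_offset_bits)
            printf("cache lookup can overlap address translation\n");
        else
            printf("index uses translated bits: must translate first\n");
        return 0;
    }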

Page 21:

Pipelining cache access

• Split cache access into several stages

• Increases branch penalties and load delays

Page 22:

Trace caches

• Blocks follow program flow rather than spatial locality!

• The cache takes branch prediction into account

• Intel NetBurst microarchitecture

• Complicates address mapping

• Minimises wasted space within blocks

Page 23:

Cache Optimisation Summary

• Cache optimisation is very complex
– Improving one factor may have a negative impact on another

Page 24:

5.8. Main Memory

• Latency and bandwidth are both important

• Latency is composed of two factors:
– Access time
– Cycle time

• Two main technologies:
– DRAM
– SRAM

Page 25:

5.9. Virtual Memory

• Physical memory is divided into blocks
– Allocated to processes
– Provides protection
– Allows swapping to disk
– Simplifies loading

• Historically:
– Overlays
  • Programmer-controlled swapping

Page 26:

Terminology

• Block:
– Page
– Segment

• Miss:
– Page fault
– Address fault

• Memory mapping (address translation)
– Virtual address → physical address

Page 27:

Characteristics

• Block size: 4 kB – 64 kB

• Hit time: 50 – 150 cycles

• Miss penalty: 1 000 000 – 10 000 000 cycles

• Miss rate: 0.000 01% – 0.001%

Page 28:

Categorising VM Systems

• Fixed block size
– Pages

• Variable block size
– Segments
– Difficult replacement

• Hybrid approaches
– Paged segments
– Multiple page sizes (2^n × smallest)

Page 29:

Q1: Block placement?

• Anywhere in memory
– "Fully associative"
– Minimises miss rate

Page 30:

Q2: Block identification?

• Page/segment number gives the physical page address
– Paging: offset concatenated
– Segmentation: offset added

• Uses a page table
– One entry per page in the virtual address space
– To save space: inverted page table
  • One entry per page of physical memory
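A minimal paged translation in C (4 kB pages and a flat table are assumed for illustration), showing the page number indexing the table and the offset being concatenated:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12                 /* 4 kB pages (assumed) */
    #define NUM_PAGES 16

    static uint32_t page_table[NUM_PAGES] = { [3] = 7 };  /* VPN 3 -> PFN 7 */

    static uint32_t translate(uint32_t vaddr)
    {
        uint32_t vpn    = vaddr >> PAGE_BITS;              /* page number */
        uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1); /* page offset */
        return (page_table[vpn] << PAGE_BITS) | offset;    /* concatenate */
    }

    int main(void)
    {
        printf("VA 0x%x -> PA 0x%x\n", 0x32A4u, translate(0x32A4u)); /* 0x72a4 */
        return 0;
    }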

Page 31:

Q3: Block replacement?

• Least-recently used (LRU)
– Minimises miss rate
– Hardware provides a use bit or reference bit

Page 32:

Q4: Write strategy?

• Write back
– With a dirty bit

You won’t become famous by being the first to try write through!

Page 33:

Fast Address Translation

• Page tables are big
– Stored in memory themselves
– Two memory accesses for every datum!

• Principle of locality
– Cache recent translations
– Translation look-aside buffer (TLB), or translation buffer (TB)
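A sketch of the lookup a TLB performs (sizes assumed; a real TLB does this comparison in parallel in hardware, not with a loop):

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 8    /* assumed; fully associative */

    struct tlb_entry { uint32_t vpn, pfn; bool valid; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Return true on a hit, filling *pfn without touching the page
       table; on a miss the caller must walk the page table in memory. */
    static bool tlb_lookup(uint32_t vpn, uint32_t *pfn)
    {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *pfn = tlb[i].pfn;
                return true;
            }
        }
        return false;
    }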

Page 34:

Alpha 21264 TLB

Page 35:

Selecting a Page Size

• Big pages:
– Smaller page table
– Allow parallel cache access
– Efficient disk transfers
– Reduce TLB misses

• Small pages:
– Less memory wastage (internal fragmentation)
– Quicker process startup

Page 36:

Putting it ALL Together!

SPARC Revisited

Page 37:

Two SPARCs

• SuperSPARC
– 1992
– 32-bit superscalar design

• UltraSPARC
– Late 1990s
– 64-bit design
– Graphics support (VIS)

Page 38:

UltraSPARC

• Four-way superscalar execution

• Two integer ALUs

• FP unit– Five functional units

• Graphics unit

Page 39:

Pipeline

• 9 stages:
– Fetch
– Decode
– Grouping
– Execution
– Cache access
– Load miss
– Integer pipe wait (for FP/graphics pipelines)
– Trap resolution
– Writeback

Page 40:

Branch Handling

• Dynamic branch prediction
– Two-bit scheme
– Every second instruction in the cache has prediction bits (predicts up to 2048 branches)
– 88% success rate (integer)

• Target prediction
– Fetches from the predicted path
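The two-bit scheme is the standard saturating counter; a minimal sketch in C (the state encoding and initial value are the usual convention, assumed here):

    #include <stdbool.h>

    /* States 0,1 predict not-taken; 2,3 predict taken. One wrong outcome
       moves a strong state to weak rather than flipping the prediction. */
    static unsigned counter = 2;           /* start weakly taken (assumed) */

    static bool predict_taken(void) { return counter >= 2; }

    static void train(bool taken)
    {
        if (taken  && counter < 3) counter++;
        if (!taken && counter > 0) counter--;
    }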

Page 41:

FPU

• Five functional units:
– Add
– Multiply
– Divide/square root
– Two graphics units (add and multiply)

• Mostly fully pipelined (latency 3 cycles)
– Except divide and square root (not pipelined; latency is 22 cycles for 64-bit)

Page 42:

Memory Hierarchy

• On-chip instruction and data caches
– Data: 16 kB direct-mapped, write-through
– Instructions: 16 kB 2-way set associative
– Both virtually addressed

• External cache
– Up to 4 MB

Page 43:

Virtual Memory

• 64-bit virtual addresses → 44-bit physical addresses

• TLB
– 64-entry, fully associative cache

Page 44:

Multimedia Support (VIS)

• Integrated with the FPU

• Partitioned operations
– Multiple smaller values packed into 64 bits

• Video compression instructions
– E.g. a motion estimation instruction replaces 48 simple instructions for MPEG compression
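To illustrate what a partitioned operation does, a plain-C sketch of four 16-bit additions packed into one 64-bit word (a SWAR emulation of the idea, not actual VIS code):

    #include <stdint.h>

    /* Add four 16-bit lanes at once; the masking keeps carries from
       crossing lane boundaries. */
    static uint64_t padd16(uint64_t a, uint64_t b)
    {
        uint64_t sum  = (a & 0x7FFF7FFF7FFF7FFFull)
                      + (b & 0x7FFF7FFF7FFF7FFFull);
        uint64_t msbs = (a ^ b) & 0x8000800080008000ull;
        return sum ^ msbs;
    }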

Page 45:

The End!

Page 46: