Page 1:

CMP L2 Cache Management

Presented by: Yang Liu

CPS221 Spring 2008

Based on:

Optimizing Replication, Communication, and Capacity Allocation in CMPs, Z. Chishti, M. Powell, and T. Vijaykumar

ASR: Adaptive Selective Replication for CMP Caches, B. Beckmann, M. Marty, and D. Wood

Page 2:

Outline

Motivation

Related Work (1) – Non-uniform Caches

CMP-NuRAPID

Related Work (2) – Replication Schemes

ASR

Page 3:

Motivation

Two options for L2 caches in CMPs
Shared: high latency because of wire delay
Private: more misses because of replication

Need hybrid L2 caches

Keep in mind:
On-chip communication is fast
On-chip capacity is limited

Page 4:

NUCA

Non-Uniform Cache Architecture
Place frequently-accessed data closest to the core to allow fast access
Couple tag and data placement

Can only place one or two ways in each set close to the processor

Page 5:

NuRAPID

Non-uniform access with Replacement And Placement usIng Distance associativity

Decouple the set-associative way number from data placement
Divide the cache data array into d-groups
Use forward and reverse pointers (see the sketch below)
Forward: from tag to data
Reverse: from data to tag
One to one?
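
To make the decoupling concrete, here is a minimal Python sketch (not from the paper; every class, field, and method name is an assumption made for this example) in which each tag entry carries a forward pointer to a d-group frame and each frame carries a reverse pointer back to its owning tag:

    class TagEntry:
        def __init__(self, addr_tag, dgroup, frame):
            self.addr_tag = addr_tag
            self.dgroup = dgroup   # forward pointer: which d-group holds the data
            self.frame = frame     # forward pointer: which frame within that d-group

    class DataFrame:
        def __init__(self):
            self.data = None
            self.owner = None      # reverse pointer: (set_index, way) of the owning tag

    class NuRapidLikeCache:
        def __init__(self, num_sets, ways, num_dgroups, frames_per_dgroup):
            self.tags = [[None] * ways for _ in range(num_sets)]
            self.dgroups = [[DataFrame() for _ in range(frames_per_dgroup)]
                            for _ in range(num_dgroups)]

        def place(self, set_idx, way, addr_tag, data, dgroup, frame):
            # The way number no longer dictates where the data lives.
            self.tags[set_idx][way] = TagEntry(addr_tag, dgroup, frame)
            slot = self.dgroups[dgroup][frame]
            slot.data = data
            slot.owner = (set_idx, way)

        def read(self, set_idx, addr_tag):
            # Tag lookup first, then follow the forward pointer to the data frame.
            for entry in self.tags[set_idx]:
                if entry is not None and entry.addr_tag == addr_tag:
                    return self.dgroups[entry.dgroup][entry.frame].data
            return None  # miss

In this toy version the pointers are one-to-one; in CMP-NuRAPID's shared-data organization a tag in one core's array can point to data sitting in another core's d-group (the first-use case of controlled replication), so the mapping need not stay one-to-one, which is what the "one to one?" question is probing.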

Page 6:

CMP-NuRAPID - Overview

Hybrid organization: private tags, shared data

Controlled Replication (CR)
In-Situ Communication (ISC)
Capacity Stealing (CS)

Page 7:

CMP-NuRAPID – Structure

Need carefully chosen d-group preference

Page 8:

CMP-NuRAPID – Data and Tag Arrays

Tag arrays snoop on the bus to maintain coherence
The data array is accessed through a crossbar

Page 9:

CMP-NuRAPID – Controlled Replication

For read-only sharing (sketched below)
On the first use, make no copy and save capacity
On the second use, make a copy and reduce future access latency
Overall, avoid off-chip misses
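
A hedged Python sketch of that decision for read-only shared blocks follows; the counter, dictionary, and function names are assumptions, and a real design keeps this state in the tag arrays rather than in a software table:

    remote_use_count = {}   # block address -> remote uses observed so far

    def on_remote_read(addr, local_l2, remote_frame):
        count = remote_use_count.get(addr, 0) + 1
        remote_use_count[addr] = count
        if count == 1:
            # First use: make no copy; keep pointing at the existing copy in the
            # farther d-group, saving capacity at the price of a slower access.
            return remote_frame
        # Second and later uses: replicate into the local, faster d-group to cut
        # future access latency. Either way the block stays on chip, so off-chip
        # misses are avoided.
        local_l2[addr] = remote_frame
        return local_l2[addr]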

Page 10:

CMP-NuRAPID – Timing Issues

A read starts before an invalidation and ends after it:
mark the tag for the block being read from a farther d-group as busy

A read starts after the invalidation begins and ends before it completes:
put an entry in the queue that holds the order of bus transactions before sending a read request to a farther d-group

Page 11:

CMP-NuRAPID – In-situ Communication

For read-write sharing
Adds a Communication (C) state
Write-through for all C blocks in the L1 cache

Page 12:

CMP-NuRAPID – Capacity Stealing

Demote less-frequently-used data to unused frames in the d-groups closer to the cores with lower capacity demands

Placement and Promotion
Place all private blocks in the d-group closest to the initiating core
Promote a block directly to the d-group closest to the core

Page 13:

CMP-NuRAPID – Capacity Stealing

Demotion and Replacement
Demote the block to the next-fastest d-group
Replace in the order: invalid, then private, then shared (see the sketch below)

Doesn’t this kind of demotion pollute another core’s fastest d-group?
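
A small Python sketch of this victim choice and demotion; the dictionary-based frame representation and list-of-lists d-group model are assumptions for illustration only:

    def choose_victim(frames):
        # Prefer victims in the stated order: invalid first, then private, then shared.
        for wanted in ("invalid", "private", "shared"):
            for frame in frames:
                if frame["state"] == wanted:
                    return frame
        return None

    def demote(victim, dgroups, level):
        # Move a valid victim one d-group farther from the core (its next-fastest d-group).
        if victim["state"] != "invalid" and level + 1 < len(dgroups):
            dgroups[level].remove(victim)
            dgroups[level + 1].append(victim)

This is also where the pollution question comes from: the next-fastest d-group for one core may be the fastest d-group of a neighboring core.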

Page 14:

CMP-NuRAPID – Methodology

Simics
4-core CMP
8 MB, 8-way CMP-NuRAPID with 4 single-ported d-groups
Both multithreaded and multiprogrammed workloads

Page 15:

CMP-NuRAPID – Multithreaded

Page 16:

CMP-NuRAPID – Multiprogrammed

Page 17:

Replication Schemes

Cooperative Caching
Private L2 caches
Restricts replication under certain criteria

Victim Replication
Shared L2 cache
Allows replication under certain criteria

Both have static replication policies. What about a dynamic policy?

Page 18:

ASR - Overview

Adaptive Selective Replication

Dynamic cache block replication
Replicate blocks when the benefits exceed the costs
Benefit: lower L2 hit latency
Cost: more L2 misses

Page 19:

ASR – Sharing Types

Single Requestor: blocks are accessed by a single processor
Shared Read-Only: blocks are read, but not written, by multiple processors
Shared Read-Write: blocks are accessed by multiple processors, with at least one write
(a toy classifier is sketched below)

Focus on replicating shared read-only blocks
High locality
Require little capacity
Large portion of requests
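
For concreteness, a toy Python classifier for the three sharing types; the access-trace representation is an assumption made for this example:

    def sharing_type(accesses):
        # accesses: list of (processor_id, is_write) tuples observed for one block
        processors = {pid for pid, _ in accesses}
        written = any(is_write for _, is_write in accesses)
        if len(processors) == 1:
            return "single requestor"
        return "shared read-write" if written else "shared read-only"

    print(sharing_type([(0, False), (1, False), (2, False)]))  # shared read-only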

Page 20:

ASR - SPR

Selective Probabilistic Replication
Assumes private L2 caches and selectively limits replication on L1 evictions
Uses probabilistic filtering to make local replication decisions (see the sketch below)
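
A minimal sketch of that filter, assuming the current replication level simply selects a replication probability; the probability values, block representation, and function name are assumptions, not the paper's parameters:

    import random

    REPLICATION_PROB = [0.0, 1 / 64, 1 / 16, 1 / 4, 1.0]   # hypothetical per-level probabilities

    def on_l1_eviction(block, local_l2, level):
        # Local, per-eviction decision: no global coordination is needed.
        if block.get("sharing") != "shared read-only":
            return  # replication effort is focused on shared read-only blocks
        if random.random() < REPLICATION_PROB[level]:
            local_l2[block["addr"]] = block   # make a local replica
        # otherwise the block remains only in its owner's L2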

Page 21:

ASR – Balancing Replication

Page 22:

ASR – Replication Control

Replication levels
C: current
H: higher
L: lower

Cycles
H: hit cycles per instruction
M: miss cycles per instruction

Page 23:

ASR – Replication Control

Page 24:

ASR – Replication Control

Wait until there are enough events to ensure a fair cost/benefit comparison

Wait until four consecutive evaluation intervals predict the same change before changing the replication level (see the sketch below)
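
Putting the last two slides together, here is a hedged Python sketch of the controller: estimate hit and miss cycles per instruction at the current (C), higher (H), and lower (L) replication levels, propose a move only when the estimated total improves, and commit only after four consecutive intervals agree. The estimates are taken as plain inputs here; a real design would derive them from hardware monitors.

    def proposed_level(current, hit_cpi, miss_cpi):
        # hit_cpi / miss_cpi: dicts mapping 'C', 'H', 'L' to estimated cycles per instruction
        total = {k: hit_cpi[k] + miss_cpi[k] for k in ("C", "H", "L")}
        if total["H"] < total["C"] and total["H"] <= total["L"]:
            return current + 1   # more replication: hit-latency benefit outweighs extra misses
        if total["L"] < total["C"]:
            return current - 1   # less replication: extra misses cost more than the latency saved
        return current           # stay put (clamping to the valid level range is omitted)

    class ReplicationController:
        def __init__(self, level=2, required_streak=4):
            self.level = level
            self.required_streak = required_streak   # consecutive agreeing intervals required
            self.pending = None
            self.streak = 0

        def end_of_interval(self, hit_cpi, miss_cpi):
            proposal = proposed_level(self.level, hit_cpi, miss_cpi)
            if proposal == self.level:
                self.pending, self.streak = None, 0
            elif proposal == self.pending:
                self.streak += 1
            else:
                self.pending, self.streak = proposal, 1
            if self.streak >= self.required_streak:
                self.level = self.pending
                self.pending, self.streak = None, 0
            return self.level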

Page 25:

ASR – Designs Supported by SPR

SPR-VR
Add 1 bit per L2 cache block to identify replicas
Disallow replication when the local cache set is filled with owner blocks with identified sharers

SPR-NR
Store a 1-bit counter per remote processor for each L2 block
Remove the shared bus overhead (How?)

SPR-CC
Model the centralized tag structure using an idealized distributed tag structure

Page 26:

ASR - Methodology

Two CMP configurations: Current and Future
8 processors
Writeback, write-allocate caches
Both commercial and scientific workloads
Throughput used as the metric

Page 27:

ASR – Memory Cycles

Page 28:

ASR - Speedup

Page 29:

Conclusion

Hybrid is better
Dynamic is better

Tradeoffs are needed

How does it scale?