Slide 1 [footer on every slide: Improving Cache Locality for TLS, Steffan]
Improving Cache Locality for Thread-Level Speculation
Stanley Fung and J. Gregory Steffan
Electrical and Computer Engineering
University of Toronto
Slide 2
Chip Multiprocessors (CMPs) are Here!
IBM Power 5, AMD Opteron, Intel Yonah
Use CMPs to improve sequential program performance?
Slide 3
Exploiting CMPs: The Intuition
• CMPs have lots of distributed resources
  – Speculative threads execute similar code, so a given read miss predicts future read misses
  – i.e., other CPUs will likely read-miss that same line
• Broadcasting all read misses
  – Any read miss results in that line being pushed to all caches
    • Provided lines in speculative state are not evicted
  – Trivial to implement in a CMP with a bus interconnect
    • No extra traffic

Will such broadcasting result in cache pollution?
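The broadcast-on-read-miss idea above can be sketched as a toy simulation (hypothetical class and variable names; evictions are ignored, matching the "not evicted" proviso):

```python
class Cache:
    """Toy per-CPU cache: just a set of resident line addresses."""
    def __init__(self):
        self.lines = set()
        self.misses = 0

    def read(self, line, peers):
        if line not in self.lines:
            self.misses += 1
            self.lines.add(line)          # normal demand fill
            for peer in peers:            # broadcast: the bus pushes the
                peer.lines.add(line)      # filled line into every other cache

# Speculative threads tend to touch the same lines, so later CPUs hit.
caches = [Cache() for _ in range(4)]
for cache in caches:
    for line in range(8):
        cache.read(line, [c for c in caches if c is not cache])

print([c.misses for c in caches])   # [8, 0, 0, 0]: only CPU 0 misses
```

With broadcasting disabled (an empty peer list), every CPU in this sketch would take all 8 misses itself.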
Slide 25
Impact of Broadcasting All Read Misses (RB)

[Charts] Data cache misses: 27.7% reduction. Execution time: 7.3% speedup.

Simple broadcasting is effective.
• Attempts to throttle broadcasting reduced its benefits
  – Hence the resulting cache pollution is limited
Slide 26
Miss Patterns Observed

  Miss Pattern          Percentage
  L2 miss                    15.7%
  Read-based sharing         53.7%
  Write-based sharing        11.4%
  Strided                     6.2%
  Other                      13.0%

(Read-based sharing + write-based sharing + strided = 71.3%)
Slide 27
Exploiting Write-Based Sharing Patterns
• Note: caches extended for TLS are write-back
  – Modifications are not propagated before a thread commits
• Example: write-based sharing of a cache line
  – CPU0 writes then commits; then CPU1 reads
  – The read results in a miss, read-request, write-back, then fill
• Aggressive approach:
  – On commit, broadcast all modified lines
  – Too much traffic, too many superfluous copies
• A more selective approach:
  – Predict lines involved in write-based sharing

More general: predict the stores involved in WB sharing
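The baseline cost in the example above can be traced with a toy two-level model (hypothetical names; two private L1s over one shared L2):

```python
# Toy model: per-CPU L1s over a shared L2.
l2 = {}                 # line -> data visible to all CPUs
l1 = [dict(), dict()]   # private caches for CPU0 and CPU1

def commit(cpu):
    # TLS caches are write-back: in the baseline, nothing is
    # propagated at commit time; dirty lines stay private.
    pass

def read(cpu, line):
    if line in l1[cpu]:
        return "hit"
    # Miss: the read request forces the other CPU to write back its
    # dirty copy to L2, and only then can this L1 be filled.
    other = 1 - cpu
    if line in l1[other]:
        l2[line] = l1[other].pop(line)     # write-back
    l1[cpu][line] = l2[line]               # fill
    return "miss + write-back + fill"

l1[0]["A"] = "new value"   # CPU0 writes line A speculatively
commit(0)                  # CPU0 commits; line A is still only in its L1
print(read(1, "A"))        # CPU1's read pays the full round trip
```

The selective technique on the following slides aims to eliminate exactly this round trip by pushing the line at commit time instead of waiting for CPU1's demand miss.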
Slide 28
Predicting Stores & Lines Involved in WB Sharing

[Diagram] An address divides into tag, index, and offset; an extended tag (etag) identifies a cache line, and an RST index selects an entry in the Recent Store Table. Three small structures, 8 entries each (8 entries each is sufficient):
• Recent Store Table (RST): recent store PCs
• Invalidation PC List (IPCL): store PCs for lines that are written back
• Push Required Buffer (PRB): etags of lines to push on commit
Slide 29
Operation of the Write-Based Sharing Technique
• On a store:
  – Add the store PC to the Recent Store Table (RST)
  – If the store PC is in the Invalidation PC List (IPCL):
    • Add the line's etag to the Push Required Buffer (PRB)
• On a coherence request requiring a write-back:
  – Use the RST index to look up the PC in the RST; add that PC to the IPCL
• On commit:
  – For each extended tag in the PRB:
    • Write back, self-invalidate, and push the line to the next cache

Simple case: the next cache is chosen in round-robin order
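The three steps above can be sketched in software (a hypothetical simplification: Python dicts and sets stand in for the 8-entry hardware tables, and lines are keyed directly rather than through the RST index):

```python
class WBSharingPredictor:
    """Sketch of the write-based-sharing predictor: RST, IPCL, PRB."""
    def __init__(self):
        self.rst = {}      # Recent Store Table: line -> most recent store PC
        self.ipcl = set()  # Invalidation PC List: PCs whose lines were written back
        self.prb = set()   # Push Required Buffer: lines (etags) to push on commit

    def on_store(self, pc, line):
        self.rst[line] = pc              # remember the most recent store PC
        if pc in self.ipcl:              # this store previously caused a
            self.prb.add(line)           # write-back, so mark its line for pushing

    def on_writeback_request(self, line):
        pc = self.rst.get(line)          # look up the PC that stored to this line
        if pc is not None:
            self.ipcl.add(pc)            # learn: this store feeds another CPU

    def on_commit(self, next_cache):
        for line in self.prb:            # write back, self-invalidate, and
            next_cache.add(line)         # push each marked line onward
        self.prb.clear()

# First iteration: a store at PC 0x40 to line A; another CPU's read
# later triggers a write-back, which trains the IPCL.
pred = WBSharingPredictor()
pred.on_store(0x40, "A")
pred.on_writeback_request("A")

# Next iteration: the same store PC now hits in the IPCL, so line A is
# pushed to the next cache at commit instead of waiting for a demand miss.
pred.on_store(0x40, "A")
next_cache = set()
pred.on_commit(next_cache)
print(next_cache)
```

Here `next_cache` simply models the destination cache; choosing it in round-robin order matches the simple case noted above.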
Slide 30
Impact of the Write-Based Technique (WB)

[Charts] Data cache misses: 19.6% reduction. Execution time: 7.8% speedup.

Worth the cost of the small additional hardware.
Slide 31
Miss Patterns Observed

  Miss Pattern          Percentage
  L2 miss                    15.7%
  Read-based sharing         53.7%
  Write-based sharing        11.4%
  Strided                     6.2%
  Other                      13.0%

(Read-based sharing + write-based sharing + strided = 71.3%)
Slide 32
Exploiting Strided Miss Patterns
• Hardware stride-prefetcher [Fu et al., Baer et al.]
  – Each CPU has its own aggressive prefetcher
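A per-CPU stride prefetcher of the kind cited above can be sketched as a reference prediction table keyed by load PC (a hypothetical simplification of the Baer-style scheme):

```python
class StridePrefetcher:
    """Per-CPU stride prefetcher sketch: a reference prediction table
    keyed by load PC (hypothetical simplification)."""
    def __init__(self):
        self.table = {}   # pc -> (last_addr, stride, confident)

    def on_load(self, pc, addr):
        prefetch = None
        if pc in self.table:
            last, stride, confident = self.table[pc]
            new_stride = addr - last
            if new_stride == stride:
                # Stride seen twice in a row: prefetch the next address.
                prefetch = addr + stride
                self.table[pc] = (addr, stride, True)
            else:
                self.table[pc] = (addr, new_stride, False)
        else:
            self.table[pc] = (addr, 0, False)
        return prefetch

# A load at PC 0x20 walks an array with a 0x40-byte stride.
pf = StridePrefetcher()
for addr in (0x100, 0x140, 0x180):
    hint = pf.on_load(pc=0x20, addr=addr)
print(hex(hint) if hint else None)   # 0x1c0: the predicted next address
```

In the design above each CPU would own one such table, so speculative threads prefetch their own strided streams independently.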