
Adaptive Cache Compression for High-Performance Processors. Alaa R. Alameldeen and David A. Wood, Computer Sciences Department, University of Wisconsin-Madison.

Dec 21, 2015

Transcript
Page 1: Adaptive Cache Compression for High-Performance Processors. Alaa R. Alameldeen and David A. Wood, Computer Sciences Department, University of Wisconsin-Madison.

Adaptive Cache Compression for High-Performance Processors

Alaa R. Alameldeen and David A. Wood
Computer Sciences Department, University of Wisconsin-Madison

Page 2:

Outline

Introduction

Motivation

Adaptive Cache Compression

Evaluation Methodology

Reported Performance

Review Conclusion

Critiques/Suggestions

Page 3:

Introduction

The increasing performance gap between processors and memory calls for faster memory access.

Cache memories – reduce average memory latency

Cache Compression – improves the performance of cache memories

Adaptive Cache Compression – the theme of this discussion

Page 4:

Motivation

Cache compression can improve the effectiveness of cache memories (increase effective cache capacity).

Increasing effective cache capacity reduces the miss rate.

Performance will improve!

Page 5:

Adaptive Cache Compression: An Overview

Dynamically optimize cache performance

Use the past to predict the future

How likely is compression to help, hurt, or make no difference to the next reference? Feedback from previous compressions helps decide whether to compress the next write to the cache.

Page 6:

Adaptive Cache Compression: Implementation

2-level cache hierarchy

• L1 caches (data, instruction) are uncompressed
• L2 cache is unified and optionally compressed
• Decompression/compression is used or skipped as necessary

Pros: L1 cache performance is not affected
Cons: compression/decompression introduces latency
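The latency trade-off above can be illustrated with a toy read-path model. This is a sketch, not the paper's implementation; the cycle counts are illustrative assumptions:

```python
# Hypothetical read-path sketch for the two-level hierarchy above:
# L1 holds only uncompressed lines, so an L1 hit pays no compression cost.
# An L2 hit to a compressed line pays a decompression penalty before the
# line is installed in L1. Latencies below are assumed for illustration.
L1_LATENCY, L2_LATENCY, DECOMPRESSION_PENALTY = 2, 12, 5

def load_latency(in_l1, in_l2, l2_line_compressed):
    """Cycles to satisfy a load, ignoring the off-chip memory access."""
    if in_l1:
        return L1_LATENCY                       # L1 is never compressed
    if in_l2:
        extra = DECOMPRESSION_PENALTY if l2_line_compressed else 0
        return L1_LATENCY + L2_LATENCY + extra  # decompress only when needed
    raise NotImplementedError("L2 miss: go to memory")

# An L2 hit to a compressed line costs 5 extra cycles over an uncompressed one.
assert load_latency(False, True, False) == 14
assert load_latency(False, True, True) == 19
```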

Page 7:

Adaptive Cache Compression: L2 Cache Detail

8-way set associative

A compression information tag is stored with each address tag

32 segments (8 bytes each) in each set

An uncompressed line comprises 8 segments

(4 uncompressed lines max in each set)

Compressed lines are 1 to 7 segments in length

Max number of lines in each set = 8

Least recently used (LRU) lines are evicted

Compaction may be used to make room for a new line
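The set organization above can be sketched as a toy model. This is a hypothetical simplification, not the paper's hardware: compaction is treated as free repacking, so only total segment occupancy is tracked:

```python
# Sketch of one set of the compressed L2 cache described above:
# 32 data segments of 8 bytes, at most 8 lines, each line 1-8 segments.
SEGMENTS_PER_SET = 32
MAX_LINE_SEGMENTS = 8    # an uncompressed 64-byte line
MAX_LINES_PER_SET = 8    # 8-way set associativity

class CacheSet:
    def __init__(self):
        # Ordered most- to least-recently used: (tag, size_in_segments)
        self.lines = []

    def used_segments(self):
        return sum(size for _, size in self.lines)

    def allocate(self, tag, size):
        """Insert a line of `size` segments, evicting LRU lines as needed.

        Compaction is implicit: we track only total occupancy, as if the
        surviving lines were repacked contiguously after each eviction."""
        assert 1 <= size <= MAX_LINE_SEGMENTS
        while (self.used_segments() + size > SEGMENTS_PER_SET
               or len(self.lines) >= MAX_LINES_PER_SET):
            self.lines.pop()           # evict the least recently used line
        self.lines.insert(0, (tag, size))

s = CacheSet()
for t in range(4):
    s.allocate(t, 8)       # four uncompressed lines fill all 32 segments
s.allocate(4, 2)           # a 2-segment compressed line evicts one LRU line
```

After the last allocation the set holds four lines occupying 26 segments, showing how compression lets more than four lines coexist only when lines actually compress.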

Page 8:

Adaptive Cache Compression: To Compress or Not to Compress?

While compression eliminates L2 misses, it increases the latency of L2 hits (which are more frequent). However, the penalty for an L2 miss is usually large and the extra latency due to decompression is usually small. Compression helps if:

(avoided L2 misses) x (L2 miss penalty) > (penalized L2 hits) x (decompression penalty)

Example: for a 5-cycle decompression penalty and a 400-cycle L2 miss penalty, compression wins if it eliminates at least one L2 miss for every 400/5 = 80 penalized L2 hits.
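The break-even condition can be checked with a few lines of arithmetic; the 5-cycle and 400-cycle figures come from the slide's example:

```python
# Sketch of the break-even condition above: compression pays off when the
# cycles saved on avoided misses exceed the cycles spent on penalized hits.
def compression_helps(avoided_misses, penalized_hits,
                      miss_penalty=400, decompression_penalty=5):
    return avoided_misses * miss_penalty > penalized_hits * decompression_penalty

# Break-even ratio: penalized hits tolerated per avoided miss.
assert 400 // 5 == 80
assert compression_helps(1, 79)       # under the break-even ratio: helps
assert not compression_helps(1, 80)   # at the break-even ratio: a wash
```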

Page 9:

Adaptive Cache Compression: Classification of Cache References

Classifications of hits:
Unpenalized hit (e.g. reference to address A)
Penalized hit (e.g. reference to address C)
Avoided miss (e.g. reference to address E)

Classifications of misses:
Avoidable miss (e.g. reference to address G)
Unavoidable miss (e.g. reference to address H)
Evicted
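The five classes can be sketched as a simple classifier. The two `in_...` flags are an assumption introduced here for illustration: they stand in for the LRU stack-depth test in the paper's figure, indicating whether the block would be resident in a never-compressed and in an always-compressed cache of the same size:

```python
from enum import Enum, auto

class Ref(Enum):
    UNPENALIZED_HIT = auto()   # hit to an uncompressed line: no extra latency
    PENALIZED_HIT = auto()     # hit to a compressed line an uncompressed
                               # cache would also hold: pay decompression
    AVOIDED_MISS = auto()      # hit possible only because compression freed space
    AVOIDABLE_MISS = auto()    # miss that compressing everything would avoid
    UNAVOIDABLE_MISS = auto()  # miss under either policy

def classify(hit, line_compressed, in_uncompressed_cache,
             in_fully_compressed_cache):
    """Hypothetical classifier: the last two flags say whether the block would
    be resident in a never-compressed / always-compressed cache of equal size."""
    if hit:
        if not line_compressed:
            return Ref.UNPENALIZED_HIT
        return Ref.PENALIZED_HIT if in_uncompressed_cache else Ref.AVOIDED_MISS
    return (Ref.AVOIDABLE_MISS if in_fully_compressed_cache
            else Ref.UNAVOIDABLE_MISS)
```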

Page 10:

Adaptive Cache Compression: Hardware Used in Decision-Making

The Global Compression Predictor (GCP) estimates the recent cost or benefit of compression.

On a penalized hit, the controller biases against compression by decrementing the counter (subtracted value = decompression penalty).

On an avoided or avoidable miss, the controller increments the counter by the L2 miss penalty.

The controller uses the GCP when allocating a line in the L2 cache:
Positive value -> compression has helped, so compress.
Negative value -> compression has been penalizing, so don't compress.

The size of the GCP determines sensitivity to changes. In this paper, a 19-bit counter is used (saturates at 262143 or -262144).
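A minimal sketch of the GCP as a saturating counter, using the penalties and 19-bit bounds stated above; the class and method names are hypothetical:

```python
# Sketch of the 19-bit saturating Global Compression Predictor (GCP).
# Bounds and penalties follow the slides: saturates at 262143 / -262144,
# 400-cycle miss penalty, 5-cycle decompression penalty.
GCP_MAX, GCP_MIN = 2**18 - 1, -(2**18)
MISS_PENALTY, DECOMPRESSION_PENALTY = 400, 5

class GlobalCompressionPredictor:
    def __init__(self):
        self.counter = 0

    def _saturating_add(self, delta):
        self.counter = max(GCP_MIN, min(GCP_MAX, self.counter + delta))

    def penalized_hit(self):
        # Compression cost us a decompression: bias against it.
        self._saturating_add(-DECOMPRESSION_PENALTY)

    def avoided_or_avoidable_miss(self):
        # Compression saved (or would have saved) a miss: bias toward it.
        self._saturating_add(MISS_PENALTY)

    def should_compress(self):
        # Consulted when allocating a line in the L2 cache.
        return self.counter > 0

gcp = GlobalCompressionPredictor()
for _ in range(10):
    gcp.penalized_hit()            # counter drifts to -50
gcp.avoided_or_avoidable_miss()    # one avoided miss outweighs them: 350
```

One avoided miss outweighing 80 penalized hits falls out directly from the 400:5 ratio of the two update amounts.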

Page 11:

Adaptive Cache Compression: Sensitivity

Effectiveness depends on the workload's size, the cache's size, and latencies

Sensitive to L2 cache size (effective for small L2 caches)

Sensitive to L1 cache size (observe trade-offs)

Adapting to benchmark phase:
- changes in phase behaviour may hurt the adaptive policy
- takes time to adapt

Page 12:

Evaluation Methodology

Host system: dynamically-scheduled SPARC V9 uniprocessor

Target system: superscalar processor with out-of-order execution

Simulation Parameters:

Page 13:

Evaluation Methodology (continued)

Simulator: Simics full-system simulator, extended with a detailed processor simulator (TFSim) and a detailed memory system timing simulator.

Workloads: multi-threaded commercial workloads from the Wisconsin Commercial Workload Suite, plus eight of the SPECcpu2000 benchmarks:

Integer benchmarks (bzip, gcc, mcf, twolf)
Floating-point benchmarks (ammp, applu, equake, swim)

Workloads were selected to cover a wide range of compressibility properties, miss rates, and working-set sizes.

Page 14:

Evaluation Methodology (continued)

To understand the utility of adaptive compression, two extreme policies (never compress and always compress) were compared with it:

'Never' strives to reduce hit latency.

'Always' strives to reduce miss rate.

'Adaptive' strives to optimize between the two.

Page 15:

Reported Performance (Average Cache Capacity)

Figure: Average cache capacity during benchmark runs (4MB uncompressed)

Page 16:

Reported Performance (Cache Miss Rate)

Figure: L2 cache miss rate (normalized to “Never” miss rate)

Page 17:

Reported Performance (Runtime)

Figure: Runtime for the three compression alternatives (normalized to “Never”)

Page 18:

Reported Performance (Sensitivity of Adaptive Compression to Benchmark Phase Changes)

Top: temporal changes in Global Compression Predictor values

Bottom: effective cache size

Page 19:

Review Conclusion

Compressing all compressible cache lines only improves memory-intensive applications. Applications with low miss rates or low compressibility suffer.

Optimizations achieved by the adaptive scheme:
Up to 26% speedup (over the uncompressed scheme) for memory-intensive, highly-compressible benchmarks.
Performance degradation for other benchmarks < 0.4%.

Page 20:

Critiques/Suggestions

Data inconsistency: a 17% performance improvement for memory-intensive commercial workloads is claimed on page 2, but 26% is claimed on page 11.

Miscalculation on page 4: the sum of the compressed sizes at stack depths 1 through 7 totals 29, yet the paper states that the miss cannot be avoided because the sum of compressed sizes exceeds the total number of segments (i.e. 35 > 32).

All in all, the proposed technique doesn't seem to enhance performance significantly with respect to 'Always'.

Page 21:

Thank you!