Fair and High Throughput Cache Partitioning Scheme for CMPs
Shibdas Bandyopadhyay, Dept of CISE, University of Florida
Transcript
Page 1: Fair and High Throughput Cache Partitioning Scheme for CMPs

Shibdas Bandyopadhyay, Dept of CISE

University of Florida

Page 2: Outline

• Motivation and Problem Statement

• Existing Solutions

• What we propose to achieve

• Feasibility Study

• Potential Impact

• Conclusion

Page 3: Motivation

• More cores are integrated on a die (e.g. Intel Tera-Scale computing)
  - Multitasking becomes more common
  - Multiple applications run simultaneously
  - Virtualized workloads become mainstream; multiple VMs are consolidated onto the same platform

• Problems in platform resource management
  - Loss of efficiency due to the disparate behavior of simultaneously running applications
  - No fairness or determinism guaranteed
  - No effective prioritization

Page 4: Motivation – An Example

• We have a shared L2 cache

• The operating system wants to prioritize a certain front-end application that is directly visible to the user

• It usually does so by increasing the time slice of the application

• But another process with poor temporal locality is running on the system. It thrashes the L2 cache, and hence more of its lines are present in the shared L2

• The higher-priority process then spends most of its time evicting lines of the other process and re-fetching its own
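The effect described above can be illustrated with a small simulation (a sketch, not from the slides: process names, working-set sizes, and the fully associative LRU model are all illustrative). A process with a small, reused working set gets hits when it has the cache to itself, but loses them once a streaming process shares the same LRU-managed space:

```python
from collections import OrderedDict

def simulate(accesses, capacity):
    """Fully associative LRU cache: count hits per process id."""
    cache = OrderedDict()            # (pid, block) -> None, MRU at the end
    hits = {}
    for pid, block in accesses:
        hits.setdefault(pid, 0)
        key = (pid, block)
        if key in cache:
            cache.move_to_end(key)   # refresh recency on a hit
            hits[pid] += 1
        else:
            cache[key] = None        # miss: insert the new line
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the LRU line
    return hits
```

Running process "A" (4-block working set, reused cyclically) alone in an 8-line cache yields hits on every access after warm-up; interleaving a streamer "B" that touches two fresh blocks per "A" access evicts A's lines before each reuse, driving A's hits to zero.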

Page 5: Motivation (figure)

Page 6: Motivation (figure)

Page 7: Problem Statement

• We need to modify existing cache management to include a notion of fairness and throughput

• Throughput improvement has dominated cache-management protocols, since we traditionally aim only to reduce the total number of misses

• But increasing throughput should not raise the hit rate of a resource-hogging process to such a degree that the other processes suffer excessive cache thrashing

Page 8: Problem Statement

• Not a new problem, as this was also the case in traditional multi-tasking environments

• But the degree of multi-tasking has increased due to the growing number of cores and the sharing of part of the cache hierarchy amongst them

• A shared L2 cache is most useful when processes share data between them. If we do not share, we end up with many copies of the shared data in the private L2 caches

Page 9: Existing Solutions

• Profiling Based Approach

• Non-uniform cache architecture

• Partially shared cache hierarchy

• Marginal Gain based approach

• Fairness based approach

• Resource QoS based approach

Page 10: Profiling based approach

• Profile various applications for their cache accesses when run alone

• Determine the optimum cache size for each of them given the total cache size available

• Currently this only maximizes the sum of cache hits over all processes

• We need to incorporate fairness criteria and maximize them along with throughput
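Given profiled hit curves, picking per-process sizes that maximize the sum of hits is a small optimization problem. A minimal sketch (the curves, names, and way-granularity here are illustrative assumptions, not from the slides), using dynamic programming over cache ways:

```python
def best_partition(hit_curves, total_ways):
    """Split `total_ways` among processes to maximize total hits.
    hit_curves[p][s] = profiled hits of process p when given s ways."""
    # best[w] = (total_hits, allocation) achievable using exactly w ways
    best = {0: (0, [])}
    for curve in hit_curves:
        new = {}
        for used, (hits, alloc) in best.items():
            for s in range(total_ways - used + 1):
                cand = (hits + curve[s], alloc + [s])
                key = used + s
                if key not in new or cand[0] > new[key][0]:
                    new[key] = cand
        best = new
    return max(best.values())   # (best total hits, per-process ways)
```

This exhaustive split maximizes only the hit sum, mirroring the limitation the slide points out: a fairness term would have to be added to the objective.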

Page 11: Non-Uniform Cache Architecture

• Arises due to the large, wire-delay-dominated L2 cache, resulting in non-uniform access times for different parts of the cache depending on their distance from the processor.

• We need to place locally accessed blocks nearer to the processor, while placing shared blocks optimally with respect to the processors sharing them.

• Divides the cache banks according to the processors.

• Dynamically controlling the granularity provides performance improvements

Page 12: Partially Shared Cache

• Aims to combine the best of both worlds – private L2 cache and shared L2 cache.

• One can keep a fraction of the L2 cache private and the rest shared. This basically boils down to a private L2 and a shared L3 with equivalent access times for L2 and L3. Coherence protocols need to include a state for "shared" blocks.

• Evicted blocks from one private L2 can be placed in the private L2 of another processor, so that next time they can be fetched from that processor instead of from memory.

• It would be ideal to place such blocks on processors that might share them in the future.

Page 13: Marginal Gain based approach

• Based on the concept of reuse distance. Normally we can compute a stack-based profile after running the applications

• If the stack-based profile curve is convex, one can apply a resource-minimization procedure to find the optimal sizes of the individual partitions (note: we minimize the sum of all cache misses here)

• Can be implemented on-the-fly by computing the marginal gain (the increase in cache hits if the cache size is increased by one) and using an LRU stack and counters to determine the partitioning
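The on-the-fly variant can be sketched as a greedy loop over the hit counters attached to the LRU stack (the counter values below are made up for illustration; note the greedy step is only guaranteed optimal when each process's marginal gains are non-increasing, i.e. the hit curve is concave):

```python
def greedy_partition(stack_hit_counters, total_ways):
    """Greedy allocation: repeatedly grant one way to the process whose
    marginal gain is largest. stack_hit_counters[p][d] counts hits that
    process p would get at LRU stack depth d (so the marginal gain of
    growing p from k to k+1 ways is approximately counters[k])."""
    n = len(stack_hit_counters)
    alloc = [0] * n
    for _ in range(total_ways):
        gains = [counters[alloc[p]] if alloc[p] < len(counters) else 0
                 for p, counters in enumerate(stack_hit_counters)]
        winner = max(range(n), key=gains.__getitem__)
        alloc[winner] += 1
    return alloc
```

For example, with counters [10, 8, 6, 4, 2, 1] and [5, 5, 5, 1, 1, 1] and six ways, the first process wins the first three ways (gains 10, 8, 6) and the second wins the next three (gain 5 beats 4), giving a 3/3 split.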

Page 14: Fairness based approach

• Define fairness metrics based on the cache access pattern while running alongside other processes vis-à-vis when running independently

• During context switching, calculate the position of every process with respect to the fairness metric. Increase the partition size of processes having a low value of the fairness metric

• We need to develop a model showing that the metrics indeed lead to fairness, given an access stream with certain mathematical properties

Page 15: Resource QoS based approach

• Incorporate the notion of priority starting from the application down to every resource in the system

Page 16: What we propose to achieve

• Propose a strategy which maximizes both fairness and the total number of cache hits.

• Should not depend on the OS to provide priority data, as it should be implemented by the hardware cache controller. This implies we cannot enforce priority for a process, but we will aim to guarantee fairness.

• Should use counters and an LRU stack, and should not need to run the application a priori

• One interesting possibility is to profile the cache coherence protocol to understand the behavior of the processes running on different cores. This was not possible in multi-tasking systems with a single core (which have no coherence protocol)

Page 17: What we propose to achieve

• Mathematically, we have a reuse-distance curve for every process (figure)

Page 18: What we propose to achieve

• If we want to maximize throughput, we have to maximize the sum of the areas under these curves, given that the sum of the partition sizes is a constant

• We need to take account of fairness in this situation

• We can conceptualize the total cache space allocated to a process as consisting of exclusive cache space for that process plus shared cache space that is shared with all other processes

• Fairness is violated when the shared cache space is used unevenly by different processes
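A toy check of that fairness condition (the tolerance value and function name are illustrative assumptions): a configuration is flagged unfair when some process's occupancy of the shared region deviates too far from the equal share:

```python
def shared_usage_fair(occupancy, tolerance=0.25):
    """occupancy[p] = number of blocks process p holds in the shared
    region. Fair iff every process is within `tolerance` (fractional)
    of the equal share of the occupied shared space."""
    total = sum(occupancy)
    if total == 0:
        return True                     # empty shared region is trivially fair
    fair_share = total / len(occupancy)
    return all(abs(o - fair_share) <= tolerance * fair_share
               for o in occupancy)
```

An even occupancy like [10, 10, 10] passes, while a hogging pattern like [25, 3, 2] is flagged and would trigger repartitioning.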

Page 19: What we propose to achieve

• We start with a shared space assignment.

• We vary the shared space depending on the number of shared blocks present in the cache

• We can determine the degree of “sharedness” of the cache using the state of the cache blocks

• We need to maintain a counter which is updated on every block state change according to the cache coherence protocol
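A minimal sketch of such a counter, assuming MESI-style states (the state names and the per-transition update rule are illustrative; in real hardware the coherence controller would bump the counter):

```python
class SharednessCounter:
    """Count cached blocks in the Shared (S) coherence state; that count
    over all valid blocks approximates the cache's degree of sharedness."""

    def __init__(self):
        self.states = {}   # block -> 'M' | 'E' | 'S' | 'I'
        self.shared = 0

    def transition(self, block, new_state):
        # Called on every coherence state change for a block.
        old = self.states.get(block, 'I')
        if old == 'S' and new_state != 'S':
            self.shared -= 1
        if old != 'S' and new_state == 'S':
            self.shared += 1
        self.states[block] = new_state

    def sharedness(self):
        valid = sum(1 for s in self.states.values() if s != 'I')
        return self.shared / valid if valid else 0.0
```

For example, a block read exclusively ('E') then read by a second core (downgraded to 'S') raises the count; an invalidation ('I') lowers it again.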

Page 20: What we propose to achieve

• To maximize throughput we need the reuse-distance curve, which would imply running the application once to produce it. But this is not really required, as marginal gains approximate the curve

• We start with, say, an equal partition, and then after each block state change (or a batch of them) we recalculate the "sharedness" and "throughput" criteria and repartition
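Putting the pieces together, one repartitioning step might look like the sketch below. The proportional shared-region rule and the greedy marginal-gain split of the private region are assumptions about how the two criteria could be combined, not a design the slides commit to:

```python
def repartition(total_ways, sharedness, marginal_gains):
    """One periodic repartitioning step: reserve a shared region sized in
    proportion to the measured sharedness, then split the remaining
    private ways greedily by marginal gain (largest hit increase first).
    marginal_gains[p][k] = expected extra hits if process p grows from
    k to k+1 ways."""
    shared_ways = max(1, round(sharedness * total_ways))
    private = total_ways - shared_ways
    n = len(marginal_gains)
    alloc = [0] * n
    for _ in range(private):
        gains = [mg[alloc[p]] if alloc[p] < len(mg) else 0
                 for p, mg in enumerate(marginal_gains)]
        winner = max(range(n), key=gains.__getitem__)
        alloc[winner] += 1
    return shared_ways, alloc
```

With 16 ways, a measured sharedness of 0.25 reserves 4 shared ways, and the remaining 12 are split by the two processes' marginal-gain counters; rerunning this after every batch of state changes gives the adaptive loop the slide describes.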

Page 21: Thank You