Cache Replacement Algorithms with Nonuniform Miss Costs Jeong, J. and Dubois, M. IEEE Transactions on Computers Volume: 55 , Issue: 4 Pages: 353-365 April 2006 On seminar book: 152-164

Dec 20, 2015


Cache Replacement Algorithms with Nonuniform Miss Costs

Jeong, J. and Dubois, M. IEEE Transactions on Computers Volume: 55, Issue: 4 Pages: 353-365 April 2006 On seminar book: 152-164


Abstract

Cache replacement algorithms originally developed in the context of uniprocessors executing one instruction at a time implicitly assume that all cache misses have the same cost. However, in modern systems, some cache misses are more expensive than others. The cost may be latency, penalty, power consumption, bandwidth consumption, or any other ad hoc numeric property attached to a miss. We call the class of replacement algorithms designed to minimize a nonuniform miss cost function “cost-sensitive replacement algorithms.” In this paper, we first introduce and analyze an optimum cost-sensitive replacement algorithm (CSOPT) in the context of multiple nonuniform miss costs. CSOPT can significantly improve the cost function over OPT (the replacement algorithm minimizing miss count) in large regions of the design space. Although CSOPT is an offline and unrealizable replacement policy, it serves as a lower bound on the achievable cost by realistic cost-sensitive replacement algorithms.

Using the practical example of latency cost in CC-NUMA multiprocessors, we demonstrate that there is a lot of room left to improve current replacement algorithms in many situations beyond the promise of OPT. Next, we introduce three practical extensions of LRU inspired by CSOPT and we compare their performance to LRU, OPT and CSOPT. Finally, as a practical application, we evaluate these realizable cost-sensitive replacement algorithms in the context of the second-level caches of a CC-NUMA multiprocessor with superscalar processors, using the miss latency as the cost function. By applying simple replacement policies sensitive to the latency of misses, we can improve the execution time of some parallel applications by up to 18 percent.


What’s the Problem

Cache replacement algorithms widely used in modern systems aim to reduce the aggregate miss count and assume that miss costs are uniform.

However, the uniform-cost assumption has lost its validity in multiprocessor systems: the cost of a miss mapping to a remote memory is higher than the cost of a miss mapping to a local memory.

This motivates the exploration of replacement policies that minimize the aggregate miss cost under multiple nonuniform miss costs instead of the miss count.


Introduction

The Cost-Sensitive OPTimal replacement algorithm (CSOPT) is a cache replacement algorithm that reaches the minimum aggregate miss cost under multiple miss costs.

CSOPT is an extension of OPT, the classical replacement algorithm minimizing miss count.

CSOPT and OPT are identical under a uniform miss cost.

However, CSOPT and OPT require knowledge of future memory accesses and are unrealizable. Thus, we also introduce 3 realistic online cost-sensitive replacement algorithms by extending the LRU algorithm.

With multiple miss costs, CSOPT doesn’t always replace the block selected by OPT: CSOPT considers the option of keeping the block victimized by OPT in the cache until the next reference to it.


Replacement Algorithm Optimizing the Miss Count

OPTimal Replacement Algorithm (OPT): considers a uniform miss cost and minimizes the miss count.

The victim block selected by OPT is the block whose next reference is farthest away in the future among all blocks in the cache.

OPT can be implemented with a priority list. Consider a trace of block addresses X = x1, x2, x3, …, xL.

The forward distance to a block a at time t, wt(a), is defined as the position t’ in the trace xt+1, …, xt’, where xt’ is the first reference to block a after time t (i.e., the next reference time of block a). If block a is never referenced again after time t, wt(a) is set to L+1.

The priority list at time t, pt, holds the blocks in the cache ordered by their forward distance right before the reference xt is performed.

Initially, P1 contains null blocks whose forward distances are set to L+1. P is updated before each reference.

When a replacement is required, the victim is the block whose forward distance is largest in P, i.e., the block at the bottom of P.
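
The OPT rule above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' implementation: it assumes a single fully associative set, recomputes forward distances by scanning the trace instead of maintaining the priority list incrementally, and the names `forward_distance` and `opt_misses` are hypothetical.

```python
def forward_distance(trace, t, block):
    """Position (1-indexed) of the next reference to `block` after time t, or L+1 if none."""
    L = len(trace)
    for u in range(t + 1, L + 1):
        if trace[u - 1] == block:
            return u
    return L + 1


def opt_misses(trace, size):
    """Simulate OPT on one set holding `size` blocks and return the miss count."""
    cache, misses = set(), 0
    for t, x in enumerate(trace, start=1):
        if x in cache:
            continue  # hit: nothing to replace
        misses += 1
        if len(cache) == size:
            # Victim = block with the largest forward distance (bottom of the priority list).
            victim = max(cache, key=lambda b: forward_distance(trace, t, b))
            cache.remove(victim)
        cache.add(x)
    return misses
```

For example, `opt_misses(list("abcabdabcd"), 3)` replaces block c at the first reference to d, because c has the largest forward distance at that point.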


Optimum Replacement Policy With Multiple Miss Costs

Cost-Sensitive OPTimal replacement algorithm (CSOPT): let c(xt) be the miss cost of the memory reference with block address xt at time t, for a trace of memory references X = x1, x2, x3, …, xL.

The problem is to find a replacement algorithm such that the aggregate cost of the trace, C(X) = ∑ c(xt), is minimized, where a reference that hits in the cache contributes no cost.

Basic implementation of CSOPT: expand all possible replacement sequences in a search tree and pick the sequence with the least cost at the end.

One level of depth is added to the search tree on every reference.

At each replacement there are s possible blocks to replace, where s is the set size in blocks.

The procedure is extremely complex and infeasible.
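
To make the size of this search concrete, the exhaustive procedure can be sketched as a recursive search over victims. This is an illustrative sketch under simplifying assumptions (one set, static per-block costs passed in a hypothetical `cost` dictionary, no pruning); the function name `min_cost_exhaustive` is not from the paper.

```python
def min_cost_exhaustive(trace, cost, size):
    """Minimum aggregate miss cost over all replacement sequences (exponential time)."""

    def explore(t, cache):
        if t == len(trace):
            return 0
        x = trace[t]
        if x in cache:
            return explore(t + 1, cache)  # hit: no cost, no branching
        if len(cache) < size:
            return cost[x] + explore(t + 1, cache | {x})  # free slot: no victim needed
        # Full set: one child per possible victim, i.e. up to `size` branches per miss.
        return cost[x] + min(explore(t + 1, (cache - {v}) | {x}) for v in cache)

    return explore(0, frozenset())
```

With `size = 2`, the trace d, a, b, a, b, d and costs 5 for d and 1 for a and b, the cheapest sequence keeps d in the cache and takes two extra cheap misses, which is not the sequence OPT would follow.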


Exploiting OPT to Cut the Branch Factor

The basic idea: if all cache blocks have the same miss cost for their next reference, the victim can be selected by invoking OPT.

Improvement of CSOPT: consider the next miss cost at time t, ft(i), which is the miss cost to bring pt(i) into the cache at its next reference if pt(i) is replaced at time t.

If ft(s) ≤ ft(i) for every i < s, invoke OPT to select pt(s) as the victim.

If ft(i) < ft(s) for some i < s, the search branches. This is the only situation in which the search branches, and it has 2 replacement options:

(1) Pursuing OPT: still select pt(s) as the victim.

(2) Reserving pt(s) until its next reference by replacing one of the lower-cost blocks.

Reservations can be made for more than one block at a time, with up to s-1 reservation options.
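
The branching test can be written as a one-line predicate. This is a sketch only: it assumes the next miss costs ft(1), …, ft(s) are supplied as a Python list in priority order, and the name `search_branches` is hypothetical.

```python
def search_branches(f):
    """f holds ft(1)..ft(s) in priority order; f[-1] is ft(s), the cost of the OPT victim pt(s).

    Returns True when some block above pt(s) has a strictly lower next miss cost,
    which is the only case where reserving pt(s) is worth exploring.
    """
    return any(fi < f[-1] for fi in f[:-1])
```

When it returns False, OPT's choice is taken directly and the search tree gains no new branch.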


Illustration of CSOPT

Consider 2 static miss costs and assume c(d) = r, where r > 1, and the miss cost of all other blocks is 1.

The block at the bottom of p4 has the highest cost.

Consider the option to reserve for d by replacing b.

From t=4 to t=11, we reserve block d and apply OPT to the remaining 2 blocks.

The reservation option (RV) releases the hold on block d at t=11 (where it is first accessed after t=4).


The Final Replacement Option Made by CSOPT

Following the above illustration, at t = 14 the cache states under OPT and RV are identical, so we compare the costs of OPT and RV: COPT(x5,…,x14) = r and CRV(x5,…,x14) = 4.

If r > 4, RV yields lower cost and block b is replaced at t=4.

If r = 4, both options lead to the optimal cost.

If r < 4, OPT yields lower cost and block d is replaced at t=4.
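
The three cases amount to comparing the two partial costs. As a small sketch specific to this illustration (the function name `victim_at_t4` and its return labels are hypothetical):

```python
def victim_at_t4(r, cost_rv=4):
    """Compare COPT(x5..x14) = r with CRV(x5..x14) = 4 and name the block replaced at t=4."""
    cost_opt = r
    if cost_opt > cost_rv:
        return "b"       # RV is cheaper: reserve d and replace b
    if cost_opt < cost_rv:
        return "d"       # OPT is cheaper: replace d
    return "either"      # tie: both options reach the optimal cost
```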


Comparison Between OPT and CSOPT

Experimental methodology: we used the simulation trace of one processor in a multiprocessor system, with a simplified cost model.

A miss mapping to local memory is assigned a cost of 1. A miss mapping to remote memory is assigned a cost of r, the cost ratio of accessing remote and local memory.

Two data placement policies in main memory are considered.

Random placement: blocks are placed locally or remotely in a random fashion.

First-touch policy (the practical situation): if a processor is the first one to access the block, the block is mapped locally; otherwise it is mapped remotely.

Cache setting: caches are 4-way, 16KB, with 64-byte blocks.
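
As a rough sketch of how the first-touch policy turns into per-block miss costs for the traced processor, assuming the global access order is available as (processor, block) pairs; the function name `first_touch_costs` and its input format are assumptions made for illustration:

```python
def first_touch_costs(accesses, me, r):
    """Map each block to its miss cost for processor `me` under first-touch placement.

    accesses: iterable of (processor, block) pairs in global order.
    A block is local (cost 1) if `me` touched it first, otherwise remote (cost r).
    """
    home = {}
    for proc, block in accesses:
        home.setdefault(block, proc)  # only the first toucher is recorded
    return {block: (1 if owner == me else r) for block, owner in home.items()}
```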


Evaluation for Relative Cost Saving

The relative cost saving of CSOPT over OPT is calculated from the following quantities:

Mloc^R denotes the number of local misses using replacement algorithm R (OPT or CSOPT).

Mrem^R denotes the number of remote misses using replacement algorithm R (OPT or CSOPT).

r denotes the cost ratio (under the two static costs).
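
A sketch of one calculation consistent with these definitions: it assumes the aggregate cost under algorithm R is the local miss count plus r times the remote miss count (local misses cost 1, remote misses cost r), and that the saving is normalized by OPT's cost. Both the normalization and the function names are assumptions, not taken from the slide.

```python
def aggregate_cost(m_loc, m_rem, r):
    """Aggregate miss cost with local misses at cost 1 and remote misses at cost r."""
    return m_loc + r * m_rem


def relative_cost_saving(opt, csopt, r):
    """opt and csopt are (local_misses, remote_misses) pairs; result is a fraction of OPT's cost."""
    c_opt = aggregate_cost(*opt, r)
    c_csopt = aggregate_cost(*csopt, r)
    return (c_opt - c_csopt) / c_opt
```

With made-up counts, `relative_cost_saving((100, 50), (120, 30), 4)` trades 20 extra local misses for 20 fewer remote misses, lowering the aggregate cost from 300 to 240.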


Relative Cost Saving with Random Cost Mapping

The relative cost savings increase with r. The curve for r = inf shows the upper bound of cost savings.

As HAF (high-cost access fraction) varies from 0 to 1, the relative cost saving rises to a peak between HAF=0.1 and HAF=0.5.

It is easier to benefit from CSOPT when HAF < 0.5.

The relative cost savings of CSOPT over OPT are significant and consistent across all benchmarks.


Relative Cost Saving with First-Touch Cost Mapping

The relative cost savings of CSOPT over OPT are compared under first-touch and random cost mapping with the same HAF.

The cost savings achieved under first-touch cost mapping are consistently less than under random cost mapping. For LU in particular, the cost savings under the first-touch policy are very poor.


Online Cost-Sensitive Replacement Policies

The rationale of the LRU cost-sensitive replacement algorithms:

Let c[i] be the predicted miss cost of the block which occupies the ith position from the top of the LRU stack in a set of size s. c[1] is the predicted miss cost of the MRU block and c[s] is the predicted miss cost of the LRU block.

If c[s] ≤ c[i] for every i < s, replace the LRU block.

If c[i] < c[s] for some i < s, reserve the LRU block. When a reservation is active, victimize the first block in the LRU stack whose predicted miss cost is lower than the predicted miss cost of the reserved block.

Terminate the reservation by depreciating the predicted miss cost of the reserved LRU block. We explore 3 strategies to depreciate the predicted miss cost:

Basic Cost-Sensitive LRU Algorithm (BCL)

Dynamic Cost-Sensitive LRU Algorithm (DCL)

Adaptive Cost-Sensitive LRU Algorithm (ACL)


Basic Cost-Sensitive LRU Algorithm (BCL)

The Basic Cost-Sensitive LRU Algorithm (BCL) depreciates the cost of a reserved LRU block whenever a non-LRU block is victimized in its place.

BCL algorithm in an s-way set-associative cache:

To select a victim, BCL searches the LRU stack for a block such that c[i] < Acost and i is closest to the LRU position.

If BCL finds one, it reserves the LRU block by replacing the ith block. Otherwise, the LRU block is replaced.

When Acost reaches zero, the reserved LRU block becomes the prime candidate for replacement.

Whenever a block takes the LRU position, Acost is loaded with c[s] (the predicted miss cost of the new LRU block).
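
The victim selection above can be sketched as follows. The slide does not say by how much Acost is depreciated, so the amount here (a multiple of the victim's cost, controlled by the hypothetical `depreciation` parameter and clamped at zero) is an assumption, as is the function name `bcl_select_victim`.

```python
def bcl_select_victim(c, acost, depreciation=2):
    """Pick a victim in one set under BCL.

    c[0] is the MRU block's predicted miss cost, c[-1] the LRU block's.
    Returns (victim_index, new_acost).
    ASSUMPTION: Acost drops by depreciation * c[i] when block i is victimized
    in place of the LRU block; the slide does not state the amount.
    """
    s = len(c)
    # Scan from the position just above LRU toward MRU: closest to LRU wins.
    for i in range(s - 2, -1, -1):
        if c[i] < acost:
            return i, max(acost - depreciation * c[i], 0)
    # No cheaper block: replace the LRU block. The caller reloads Acost
    # with the cost of whichever block takes the LRU position next.
    return s - 1, acost
```

Once Acost has been depreciated to zero, no block satisfies c[i] < Acost, so the reserved LRU block is the one replaced.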


Dynamic Cost-Sensitive LRU Algorithm (DCL)

BCL’s weakness is that it assumes that LRU provides a perfect estimate of forward distances.

To correct for this weakness, the Dynamic Cost-Sensitive LRU Algorithm (DCL) depreciates the Acost of a reserved LRU block only when a non-LRU block victimized in its place is actually accessed before the reserved LRU block.

To do this, DCL keeps a record of every replaced non-LRU block in the Extended Tag Directory (ETD). ETD entries are attached to each set; initially all entries are invalid.

When a non-LRU block is replaced, an ETD entry is allocated and its valid bit is set.

On a miss in the cache that hits in the ETD, the cost is depreciated and the matching ETD entry is invalidated.

When an access hits on the LRU block, all ETD entries are invalidated.
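
The ETD bookkeeping can be sketched as a small per-set state object. This is an illustration under assumptions: the ETD is modelled as an unbounded dictionary (a valid entry is a key that is present), the depreciation amount is a hypothetical multiple of the recorded cost, and the class and method names are invented for this sketch.

```python
class DclSetState:
    """Per-set DCL state: Acost of the reserved LRU block plus the ETD."""

    def __init__(self, lru_cost):
        self.acost = lru_cost
        self.etd = {}  # tag -> predicted cost of a block victimized in place of the LRU block

    def on_non_lru_replaced(self, tag, cost):
        """Allocate an ETD entry when a non-LRU block is replaced."""
        self.etd[tag] = cost

    def on_miss(self, tag, depreciation=2):
        """On a cache miss, depreciate Acost only if the tag hits in the ETD.

        ASSUMPTION: the depreciation amount is depreciation * recorded cost, clamped at 0.
        Returns True on an ETD hit.
        """
        if tag in self.etd:
            self.acost = max(self.acost - depreciation * self.etd.pop(tag), 0)
            return True
        return False

    def on_lru_hit(self):
        """A hit on the LRU block invalidates all ETD entries."""
        self.etd.clear()
```

Compared with BCL, Acost is untouched at replacement time and changes only when a victimized block is referenced again before the reserved LRU block.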


Adaptive Cost-Sensitive LRU Algorithm (ACL)

The rationale behind ACL: the successes and failures of LRU block reservations are clustered in time. Thus, ACL implements an adaptive reservation activation scheme.

ACL automaton in each set: a counter is associated with each set to enable and disable reservations.

Initially, the counter is set to zero, disabling all reservations.

The counter increases or decreases when a reservation succeeds or fails.

When reservations are disabled, an LRU block enters the ETD when it is replaced and other blocks in the set have a lower cost. On a miss in the cache that hits in the ETD, all ETD entries are invalidated and reservations are enabled.
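
A minimal sketch of the per-set automaton follows. The slide gives neither the counter width nor the value it takes when reservations are re-enabled, so the saturation limit `max_value` and the re-enable value of 1 are assumptions, as are the class and method names.

```python
class AclCounter:
    """Per-set counter that enables reservations while it is above zero."""

    def __init__(self, max_value=3):  # ASSUMPTION: saturating counter, limit not given in the slides
        self.value = 0                # zero initially: all reservations disabled
        self.max_value = max_value

    @property
    def enabled(self):
        return self.value > 0

    def reservation_succeeded(self):
        self.value = min(self.value + 1, self.max_value)

    def reservation_failed(self):
        self.value = max(self.value - 1, 0)

    def etd_hit_while_disabled(self):
        """A miss that hits in the ETD while disabled re-enables reservations."""
        if not self.enabled:
            self.value = 1            # ASSUMPTION: re-enable at the lowest enabled value
```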


Evaluation Approach and Setup

To estimate the impact of the replacement policy on performance, we need to simulate the architecture in detail.

Costs associated with misses are multiple and dynamic.

Data is placed in main memory using the first-touch policy.

The 3 online cost-sensitive replacement algorithms are implemented in the second-level cache.

Target system configuration


Reduction of Execution Time Over LRU

The improvement in execution time by DCL is significant.

Compared to BCL, DCL yields reliable and significant improvement of execution time in every situation.

The performance of ACL is often slightly lower than that of DCL. However, ACL gives more reliable performance across applications.

In LU, the performance improvement of ACL exceeds that of DCL. This is because ACL effectively filters unnecessary reservations in LU.


Conclusions

We proposed a new optimum cache replacement policy, called CSOPT, which minimizes the aggregate miss cost under multiple miss costs rather than the miss count.

To facilitate the search involved in a huge search tree, CSOPT exploits the rationale behind OPT together with block reservation.

Although CSOPT is unrealizable in real systems, it gives useful hints and guidelines for improving existing cache replacement algorithms.

We have demonstrated significant performance benefits by developing 3 practical algorithms, called BCL, DCL and ACL.

The application domain of our algorithms is very broad: they are applicable to the management of various kinds of storage which involve multiple nonuniform cost functions.


Appendix – CSOPT Algorithm

On a cache miss, the prime candidate for replacement is the block whose forward distance is the largest and which is not reserved.

Initially, P has only one active node, which contains null blocks whose forward distances are set to L+1 and whose cost is zero.

(1) Pursuing OPT: still select pt(s) as the victim.

(2) Reserving pt(s) until its next reference by replacing one of the lower-cost blocks.