NUDA: Non-Uniform Directory Architecture for Scalable Chip Multiprocessors
Wei Shu and Nian-Feng Tzeng
Abstract—Chip multiprocessors (CMPs) incur directory storage overhead when cache coherence is realized via sharer tracking. This work proposes a novel framework dubbed non-uniform directory architecture (NUDA), which leverages our two insights: that the number of "active" directory entries required to stay on chip over a short execution time window is usually small due to high directory locality, and that the fraction of interrogated directory entries drops as the core count rises. Unlike earlier storage overhead reduction techniques that require all cached LLC blocks to have their directory entries fully on chip, NUDA dynamically buffers only the most active directory vectors (DVs) on chip while keeping the DVs of all LLC blocks in a backing store at a lower storage level. NUDA attains its superior efficiency via an inventive criticality-aware replacement policy (CARP) for on-chip buffer management and effective prefetching to pre-activate vectors (PAVE) for upcoming coherence interrogations. We have evaluated NUDA by gem5 simulation for 64-core CMPs under the PARSEC and SPLASH benchmarks, demonstrating that CARP and PAVE enhance on-chip directory storage efficiency significantly. NUDA with a small on-chip buffer for DVs exhibits negligible performance degradation (within 2.6 percent) compared to a full on-chip directory, while outperforming its previous counterparts for directory area reduction when the on-chip directory budget is scarcely provisioned for high scalability.
Index Terms—Chip multi-processors, coherence protocols, hash tables,
memory hierarchy, prefetching, sharer tracking directory
1 INTRODUCTION
CONTEMPORARY large-scale chip multiprocessors (CMPs) usually resort to hardware-oriented cache coherence for easy programmability and good performance. Recent directory reduction methods [7], [15], [21] commonly lower the on-chip directory capacity to improve scalability while requiring all cached blocks in cores to have their directory entries fully on chip. However, such a directory can yield either poor private cache hit rates due to coherence-induced invalidations [7], [15] or significant network interference due to mounting coherence traffic [7], [21].
In this work, we present Non-Uniform Directory Architecture (NUDA), a novel directory scheme that greatly reduces on-chip directory storage. Unlike conventionally adopted directory compression, NUDA keeps the directory vectors (i.e., bit-map vectors) of all LLC blocks in off-chip memory and dynamically buffers the active vectors on chip in a small directory buffer, resulting in a layered directory resident in the multi-level memory hierarchy, hence the name "non-uniform directory architecture". NUDA builds upon our two newly observed insights: (1) the number of required directory entries (DEs) over a short execution time window is low, and (2) the fraction of DEs that are interrogated decreases as the system scales up.
The first insight stems from the fact that during parallel program runs, the number of "active" DEs over a short time window is limited, as demonstrated in Fig. 1a. Active DEs may change from one time window to the next due to coherence interrogations on different LLC blocks. Given a cache block, its directory vector is altered under three coherence events: a private cache miss (i.e., the data block is shared by a new core, so the new sharer is added to the corresponding vector), a private cache replacement (i.e., the corresponding bit in the vector is reset, if the protocol is non-silent), or a private cache write (i.e., all vector bits are reset except the one for the core that incurred the write miss). If a block experiences none of the three events, its directory entry is inactive. Statically holding those inactive DEs fully on chip unnecessarily elevates chip area overhead.
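To make the three update rules concrete, here is a minimal sketch assuming a 64-bit bit-map DV (one bit per core, matching the 64-core configuration evaluated later); the type and function names are our own illustrative choices, not NUDA hardware interfaces.

#include <cstdint>

// A bit-map directory vector with one bit per core (64 cores assumed).
using DirVector = std::uint64_t;

// Private cache (read) miss: the requesting core becomes an additional sharer.
inline void onPrivateCacheMiss(DirVector &dv, unsigned core) {
    dv |= (DirVector{1} << core);           // add the new sharer's bit
}

// Non-silent private cache replacement: the evicting core stops sharing.
inline void onPrivateCacheReplacement(DirVector &dv, unsigned core) {
    dv &= ~(DirVector{1} << core);          // reset that sharer's bit
}

// Private cache write (miss): all bits are reset except the writer's.
inline void onPrivateCacheWrite(DirVector &dv, unsigned core) {
    dv = (DirVector{1} << core);            // the writer is the sole sharer
}

// A block whose vector sees none of these events within a time window has an
// inactive directory entry and need not occupy on-chip directory storage.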
The second insight results from the fact that the fraction of DEs that are active over a short time window decreases as the core count increases. For example, blackscholes in the PARSEC benchmark suite [2] yields an active-DE ratio of 13.8 percent when run on a 4-core CMP, but the ratio drops to 2.4 percent on a 64-core CMP, as shown in Fig. 1b. Such increased concentration of activity on a few DEs offers a great opportunity for NUDA to contain the size of the on-chip directory buffer, calling for a small on-chip directory area budget in support of high scalability. Hence, the use of an on-chip directory buffer to cache only coherence-active DEs makes NUDA particularly attractive for large-scale CMPs.
NUDA keeps the vectors of all LLC blocks in DRAM memory for precise sharer tracking, avoiding excessive directory-induced cache misses and mounting NoC traffic. Notice that the memory storage provisioned for directory vectors is proportional to the LLC size, irrespective of the memory size. This is in sharp contrast to earlier multi-level directory approaches targeting shared-memory multiprocessors [1], [8], [9], [11], because such an earlier approach requires one directory entry for each memory block frame (of 64 bytes) statically. As a result, the earlier approaches make shared-memory multiprocessors suffer from the notorious directory memory-scaling problem elaborated in Section 2.1. On the other hand, NUDA is designed for large-scale on-chip cache-coherent CMPs and requires an extremely low chip directory area budget.
NUDA intelligently promotes vector transfer between on-chip and off-chip storage to sustain execution performance via two novel mechanisms. First, NUDA adopts a unique criticality-aware replacement policy (CARP) to help retain performance-critical directory vectors on chip, preventing those vectors from being evicted undesirably. Second, NUDA reduces the average interrogation latency and increases the on-chip directory hit rate by means of an effective pre-activating vector (PAVE) technique, which prefetches the vectors that are likely to be involved in future directory interrogations, for sustained execution performance.
We use the gem5-simulated Linux environment [3] to evaluate NUDA under the PARSEC [2] and SPLASH [19] benchmark suites for 64-core CMPs. Evaluation results reveal that NUDA soundly outperforms previous approaches for on-chip directory storage reduction when the on-chip directory budget is kept fairly scarce for high system scalability. To sum up, this work makes the following contributions:
- We examine directory activities in many-core cache-coherent CMPs to show strong locality.
- We discover that such locality intensifies as the core count rises.
- We develop NUDA as a multi-layer directory architecture that makes it possible to have very low on-chip directory overhead for superior scalability.
- NUDA attains effective on-chip directory buffer management via the novel CARP and PAVE techniques.
- NUDA is shown to outperform its earlier counterparts for on-chip area overhead reduction.
The NUDA framework is orthogonal to the plethora of prior compression-based area overhead reduction techniques.
The authors are with the Center for Advanced Computer Studies (CACS), University of Louisiana at Lafayette, Lafayette, LA 70503-2014. E-mail: {wxs0569, tzeng}@louisiana.edu.
Manuscript received 13 May 2017; revised 3 Nov. 2017; accepted 8 Nov. 2017. Date of publication 13 Nov. 2017; date of current version 13 Apr. 2018. (Corresponding author: Wei Shu.) Recommended for acceptance by R. F. DeMara. Digital Object Identifier no. 10.1109/TC.2017.2773061
2 RELATED WORK
This section first discusses how NUDA differs from directory overhead reduction methods for distributed shared memory (DSM) systems. Earlier approaches to reducing chip directory area overhead are then highlighted.
2.1 Memory Directories for DSM Systems
Distributed shared memory (DSM) systems may keep their shared memory directories either in DRAM, such as Origin [8], DASH [9], and FLASH [11], or in memory controllers for better performance [1], [13]. Each memory data block in such a system is associated with a directory entry. NUDA does not face the infamous directory memory-scaling problem that DSM systems do, and it differs from the directory of a DSM system in three distinct aspects. First, NUDA avoids provisioning full directory entries to all LLC blocks on chip for storage reduction, whereas the DSM system equips every memory block with its directory information statically. Second, the DRAM storage reserved for holding LLC directory vectors under NUDA is constant, irrespective of the memory size, whereas the directory of a DSM system grows linearly with its memory size. Third, NUDA exploits a unique insight based on directory access locality to transfer sharer vectors effectively across the storage hierarchy.
Multi-Level DSM Directory Design. A multi-level directory approach has been considered earlier [1], [8], [9], [11], [13] for the memory directory of DSM systems. It holds the DEs of accessed cache blocks in an on-chip buffer. Each DE held in the buffer includes not only a directory vector but also the state bits (e.g., 5 bits under the MESI protocol) plus other control bits needed to manage the on-chip buffer. In contrast, NUDA's on-chip buffer holds only directory vectors, which vastly dominate directory area overhead, but not the state bits. Every LLC block under NUDA is equipped with its state bits plus other control bits statically. The presence of state bits for each LLC block avoids the performance degradation that would otherwise be caused by read misses to an LLC block whose directory vector is absent from the buffer.
2.2 Directory Area Overhead Reduction
Earlier methods for lowering cache-coherence directory area overhead generally fall into two classes: directory width reduction and directory height reduction. The first class of methods deals with various sharer representations, such as full bit-maps [7], coarse-grained bit-maps [8], imprecise bit-maps [21], or limited pointers [8], [15], [16], [21]. The second class of methods mainly exploits the fact that the number of directory entries need be no more than the total number of private cache lines [7], [15], and it can be further reduced by encoding directory entries [15] or by avoiding tracking via software modifications [4]. Such further directory area reduction may either lead to imprecise tracking [21] or involve invalidating all cached copies of the victimized block [7], [15]. Meanwhile, the tiny directory [17] employs a small pool of entries to keep the directory elements of selected LLC blocks that experience highly shared accesses. While achieving considerable area overhead reduction, such a directory limits its scalability to 512 cores (since it uses a 64-byte LLC data block to track sharers precisely), and it incurs one extra core-to-core trip for every interrogation on a data block converted for sharer tracking. Large-scale CMPs may avoid a full-fledged implementation of cache coherence via domain-aware coherence [6], [12]; however, this adds hardware for on-the-fly logical-to-physical core mappings.
Based on relinquishment coherence and compressed sharer tracking, ReCoST [16] achieves performance close to that of a CMP with a full-fledged directory; however, it still keeps all sharer patterns non-redundantly in an on-chip table. Being hardware-light and complementary to ReCoST, NUDA may work in conjunction with it to further lift on-chip directory area efficiency.
To the best of our knowledge, this is the first work that employs DRAM to back up the CMP directory for efficient cache coherence, avoiding excessive directory-induced invalidations and surging network interference.
3 MOTIVATION
NUDA is motivated by two newly discovered insights: (1) the number of active DEs is small over a short time window, and (2) the fraction of interrogated directory entries shrinks as the number of cores rises. These insights offer a unique opportunity to keep only a very small number of active DEs on chip, which would lower directory storage overhead substantially and promise superior CMP scalability.
To support our insights, we have collected evaluation results via the gem5-simulated Linux environment, in which various multi-threaded PARSEC [2] and SPLASH [19] benchmarks are executed for a range of core counts. A two-level inclusive cache hierarchy (i.e., L1 and LLC) under the MESI coherence protocol is adopted. CMP cores are organized in a meshed-NoC topology, with one thread assigned to each core. Details of the simulated environment are provided in Section 5.
For ease of description, the coverage (β) of a directory design is defined as the number of DEs provisioned over the number of aggregate private cache lines. A coverage of β = 1.0 signifies that each private cache line (be it an instruction or data line in our two-level cache hierarchy detailed in Table 2) is provisioned with one DE.
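Restating the definition just given in equation form (the 2-core figure from Section 3.1 is used purely as an arithmetic check, not as a new result):

\[
\beta \;=\; \frac{\text{number of DEs provisioned on chip}}{\text{number of aggregate private cache lines}},
\qquad \text{e.g., } \beta = 1.0 \;\Rightarrow\; 2 \times \frac{64\,\text{KB}}{64\,\text{B}} = 2048 \text{ DEs for a 2-core CMP with 64 KB private caches.}
\]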
3.1 Locality of DE Accesses
The maximum number of active DEs over a short time window (e.g., 1000 cycles) stays very low for the range of core counts shown in Fig. 1a, where the number of DEs under a directory coverage of β = 1.0 is also shown. The active DE counts normalized against the DE count under β = 1.0 are depicted in Fig. 1b. As an example, in a 2-core system with each core holding a 64 KB private cache, the directory contains 2 × (64 KB / 64 B) = 2K entries for β = 1.0. However, when executing X264, only 644 entries are found to be interrogated within the given time window (of 1K cycles), for an active-DE ratio of 32.2 percent. For a 64-core system, fewer than one thousand DEs are interrogated, signifying that active DEs account for only 1.2 percent of the DE count under β = 1.0. Other benchmarks all exhibit limited percentages of active DEs, as depicted in Fig. 1b. The maximum percentage of active DEs is below 6.2 percent (under canneal) for a 64-core CMP and is less than 33 percent (under X264) for a 2-core system.
Fig. 1. (a) Maximum number of distinct DEs interrogated in a 1K-cycle time window, with the DE count under β = 1.0 shown, where β represents the coverage fraction. (b) Active DE fraction for various core counts, measured as the number of distinct DEs over the DE count under β = 1.0.
Locality of DE accesses is also observed for longer time windows of 3K cycles and 10K cycles, as demonstrated in Fig. 2, where the results are averaged over multiple runs of each benchmark executed on the conventional 64-core CMP. The numbers creep up as the time window extends for a given benchmark, as expected. Nonetheless, they remain far smaller than the total number of private cache lines in the CMP, even for the time window of 10K cycles. It thus suffices to hold on chip those DEs interrogated for coherence enforcement (called "active DEs") within a short time window, with any other DE fetched from the backing store, possibly in a few hundred cycles.
Our observed directory locality may result from three facts. First, the multi-threaded runs are single-tasking, so all concurrent threads of a task share the same data in their working sets within a short time window, with shared variables contained in a few cache blocks and thus involving only a few DEs, irrespective of the number of sharers. Second, during typical task execution, a significant portion of data blocks is used privately without multiple sharers [4]. Hence, over an execution time window, most cache blocks are not involved in coherence events. Given a cache block, its directory vector is altered under one of the three coherence events: a private cache miss (i.e., the data block is shared by a new core, so its vector adds the new sharer), a private cache replacement (i.e., the corresponding bit in the vector is reset, if the protocol is non-silent), or a private cache write (i.e., all bits are reset except the one corresponding to the core that incurred the write miss). On the other hand, read/write hits to private blocks in any core involve no DE interrogation. Third, various cache placement/replacement policies [10] and OS locality explorations [14] all aim to improve private cache hit rates, thereby lowering the number of directory interrogations.
3.2 Decline Trend
The active DE fraction exhibits a declining trend as the system scales, as shown in Fig. 1b. Although the number of directory entries required to track all private cache lines grows linearly with the core count (as is well known [7], [15], [18]), the ratio of active DEs in fact drops monotonically as the core count rises. The most drastic decrease happens to x264, whose ratio drops from 32 percent for 2 cores to 1.6 percent for 64 cores. This is mainly because the aggregate private cache size of a 64-core CMP increases (by a factor of 32) while the number of referenced LLC lines rises much more slowly. In a short execution time window, cores tend to share/synchronize on a small set of data blocks, concentrating the associated interrogations on a limited number of DEs. Additionally, each DV is longer for a larger core count, able to track more sharers and hence driving the ratio further down.
Another reason is the increase in mean communication latency for a larger NoC. The increased latencies prolong the read/write and replacement operations on private caches, lowering the number of data blocks referenced within the given time window. Further communication latency hikes can result if the home of a data block happens to reside in a farther LLC tile while its sharers are close to each other [6], [12]. NUDA exploits this trend for chip directory area overhead reduction.
4 DESIGN AND IMPLEMENTATION DETAILS
Fig. 3 illustrates our proposed NUDA architecture for a many-core CMP. Every LLC block under NUDA is equipped with its state bits plus other control bits statically. A small on-chip directory vector (DV) buffer is incorporated. A full-fledged backing store is provisioned in memory, with each store entry designated statically for one DV of a corresponding LLC block.
Advantages. In Fig. 3, the DVs, which dominate storage overhead, reside in main memory instead of precious on-chip real estate. They are brought to the on-chip DV buffer dynamically according to coherence needs. Such a hierarchical design comes with other advantages. First, precise sharer tracking can be achieved, since a backing store is provisioned in memory to hold all LLC DVs. Second, privately cached data are tracked fully and explicitly. This is in contrast to prior work that requires switching between core pointers and vectors [15], [16], [18], [21] or using a software-managed mechanism [4] that triggers exceptions. Third, each DV is kept in one whole piece. Although breaking a DV into pieces as in prior designs [15], [17] permits directory compression, it may yield performance degradation caused by lengthened directory interrogations.
Design Issues. Three design issues are critical to NUDA performance: (1) locating DEs in the backing store straightforwardly, (2) managing the DV buffer efficiently, and (3) prefetching DVs from the backing store appropriately.
4.1 On-Chip Buffer Management
It is critical to manage the buffer effectively for scalability. For a given size, the buffer has a hit rate governed by such factors as the replacement policy and set-associativity (i.e., 16 for our current DV buffer design), among others. This work focuses on DV management and investigates a novel DV buffer replacement policy, which aims to keep as many active DVs in the on-chip buffer as possible according to DV criticality.
Fig. 2. Mean number of DEs interrogated within different time windows for 64 cores, where benchmarks along the X-axis are denoted by their abbreviations listed in Section 5.2.
Fig. 3. Proposed NUDA architecture for the inclusive cache hierarchy, with the full directory vectors of the LLC (shown by the dashed box) kept in the backing store.
Fig. 4a depicts the DV buffer allocation policy upon two different state transitions, from an exclusive (E) or modified (M) state to a shared (S) state. Notice that each LLC block is statically associated with its state bits, in order to permit fast protocol reactions (e.g., fulfilling read miss requests from private caches) without interrogating the backing store. Fig. 4b illustrates the DV buffer line format, which involves four fields: tag, ever-written (EW) flag, LRU control, and bit-map DV. The buffer entries are 16-way set-associative, with every entry tagged by its corresponding LLC line number (LL).
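As a concrete illustration, one DV buffer entry per Fig. 4b could be modeled as sketched below, with field widths taken from Table 1 (1-bit EW, 15-bit LRU, 22-bit tag, 64-bit DV); the struct names, the valid bit, and the exact packing are our own simplifications rather than the paper's hardware layout.

#include <cstdint>

// One entry of the 16-way set-associative on-chip DV buffer (Fig. 4b).
struct DVBufferEntry {
    std::uint32_t tag : 22;   // tag derived from the LLC line number (LL)
    std::uint32_t ew  : 1;    // ever-written flag, marking potential RAW criticality
    std::uint32_t lru : 15;   // LRU state, refreshed on read/write-miss interrogations
    std::uint64_t dv;         // 64-bit bit-map directory vector (one bit per core)
    bool valid = false;       // entry currently holds a DV fetched from the backing store
};

// One 16-way set of the DV buffer.
struct DVBufferSet {
    DVBufferEntry way[16];
};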
DV Buffer Entry Allocation. Any DV denoting just a single sharer (i.e., in an E or M state) is not allocated in the buffer. Instead, buffer entry allocation takes place when an LLC block transitions from the E or M state to the S state. This means that the DV of an LLC line cached for the first time by a core is not allocated in the DV buffer; the affected DV in the backing store is written through, in order to track this first privately-caching situation. Once a later coherence event from another core happens to the LLC line, its associated DV is fetched from the backing store and then kept in the on-chip DV buffer, accelerating subsequent directory interrogations.
This allocation method has two advantages. First, during program execution, memory blocks cached only for private use [4], [17] require no on-chip DVs. Second, if a block is for private use, its DV is never interrogated by other cores. Our allocation strategy is thus preferred over employing encoded core pointers [15] or software-assisted tracking-avoidance approaches [4].
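The allocation rule itself reduces to a simple predicate on the LLC block's state transition; a minimal sketch follows, where the MESI-style enum and the function name are hypothetical stand-ins for the buffer-allocation logic described above.

// MESI-style stable states of an LLC block (transient states omitted).
enum class LLCState { I, S, E, M };

// Allocate a DV buffer entry only when a block leaves single-sharer ownership
// (E or M) for the shared (S) state; a first-time privately cached block keeps
// its DV only in the backing store, updated by a write-through.
inline bool shouldAllocateDVEntry(LLCState oldState, LLCState newState) {
    const bool singleSharer = (oldState == LLCState::E || oldState == LLCState::M);
    return singleSharer && newState == LLCState::S;
}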
4.2 Criticality-Aware Replacement Policy (CARP)
To take criticality into consideration in DV buffer replacement, NUDA pursues a Criticality-Aware Replacement Policy (CARP) by identifying two categories of critical DVs: sharing-intensive and Read-After-Write (RAW) types. Sharing-intensive DVs are those frequently referred to, and replacing them is likely to be followed soon by calls for their return on chip. Hence, this type of DV is desired to be kept in the on-chip buffer. The LRU field, which is refreshed upon every interrogation caused by a read/write miss, suffices to signify their activeness and keep them on chip.
A DV that is interrogated due to a write miss refreshes its LRU field as well. In addition, the EW flag is raised to mark this ever-written (EW) history, as shown in Fig. 4b, aiming to record the Read-After-Write (RAW) sequence. As the second critical type, RAW criticality is captured by the EW flag to prevent the entry from being evicted undesirably within a predefined time span. For a modified block with its DV not held on chip, the latest block owner can be identified only after its DV is brought from the backing store into the buffer. The read miss is then subject to a prolonged latency because of the critical RAW sequence, which translates to performance degradation, since the read miss typically sits on the critical path. The EW flag is reset periodically to purge stale write history, so that its associated DV entry can later be replaced by another critical DV. A suitable reset period for the EW bits is design-specific and is affected by the system size.
Selecting the entry to replace in a DV set starts with checking the EW flags to exclude those vectors that are potentially under critical RAW sequences. The victim is then selected according to the LRU fields of the remaining entries in the set. In the extremely rare situation (for the 16-way sets in our evaluation) where all entries in a set have their EW flags set, the victim is determined by the LRU fields of all entries in the set.
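Our reading of the CARP victim selection can be sketched as below, reusing the hypothetical DVBufferEntry/DVBufferSet types from the earlier snippet and assuming a full set in which a smaller LRU value means an older entry; this is an interpretation of the policy, not the authors' exact replacement hardware.

#include <cstdint>
// Relies on DVBufferEntry/DVBufferSet from the DV buffer sketch above.

// CARP victim selection for one 16-way set: pick the LRU entry among ways whose
// EW flag is clear; only if every way is EW-marked, fall back to plain LRU.
int selectVictimCARP(const DVBufferSet &set) {
    int victim = -1;
    std::uint32_t oldest = 0;
    // Pass 1: consider only entries not under a potential RAW-critical sequence.
    for (int w = 0; w < 16; ++w) {
        const DVBufferEntry &e = set.way[w];
        if (e.ew == 0 && (victim < 0 || e.lru < oldest)) {
            victim = w;
            oldest = e.lru;
        }
    }
    if (victim >= 0) return victim;
    // Pass 2 (rare): all EW flags are set, so use the LRU fields of all entries.
    for (int w = 0; w < 16; ++w) {
        const DVBufferEntry &e = set.way[w];
        if (victim < 0 || e.lru < oldest) {
            victim = w;
            oldest = e.lru;
        }
    }
    return victim;
}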
Replaced DVs are temporarily stored in the write-back buffer of the memory controllers, alongside other data blocks, and they are written back to the backing store when that buffer is flushed. If later DV interrogations hit DVs held there, they are served directly from the memory write buffer. A DV read request is treated as a memory read request, albeit with a different target address in the backing store, which is a reserved DRAM region.
4.3 Prefetching for Pre-Activating Vectors (PAVE)
During program execution, active DVs may vary from one time window to the next, calling for prefetching DVs appropriately to boost the on-chip DV hit rate. Our proposed pre-activating vectors (PAVE) module predicts the DVs potentially required for future coherence. When a read/write miss interrogates a missing DV, the DV is fetched from the backing store, followed by prefetching of DVs anticipated in future coherence interrogations.
Unlike data prefetching, PAVE must deal with the DV accesses of all threads globally in the CMP, as illustrated in Fig. 5. Thread t1 accesses memory blocks a and a+d during the first time window, and then blocks a+3d and a+5d in the next time window, where d is the block size. Likewise, threads t2, t3, and t4 access their respective sequences of memory blocks. Such memory block access sequences are widespread and give PAVE an indication for prefetching the DVs anticipated by the execution of all threads.
PAVE History Table. PAVE faces a unique issue in prefetching DVs from the backing store: the LLC line number at which a given memory block resides is not known by the memory controller. With set-associative LLCs, PAVE must know the way number of the LLC set to which the memory block is mapped, so as to prefetch its corresponding DV properly from the backing store through address translation performed in a memory controller. To this end, PAVE employs a hash table to keep the recent LLC line number allocation history.
Fig. 4. (a) DV buffer allocation, and (b) proposed DV buffer entry format, where the EW and LRU fields are for replacement decisions and the LRU field is refreshed upon read/write misses.
Fig. 5. Example memory block access patterns by four threads, with d denoting the block size (i.e., 64 bytes typically).
Fig. 6. Memory block access history table kept in each memory controller for PAVE. Each history table entry includes a tag field, an 8-bit presence field, and eight way-fields, where the tag field keeps the first 39 bits of the 48-bit physical address, aligned at 8 × 64 bytes for the cache block size of 64 bytes.
As depicted in Fig. 6, each memory controller is equipped with one PAVE hash table and a collection of queues to serve as the prefetch buffer. Every entry of the hash table tracks a sequence of blocks existing in the LLC. It includes a tag field, a block-address present-bit vector (V-bit) to mark the memory blocks that have been accessed through the memory controller, and eight way-fields to indicate the way numbers in the LLC sets. Each way-field takes log2(a) bits, if the LLC is a-way set-associative.
History Recording. PAVE conducts history recording whenever a memory block is requested due to an LLC miss, as demonstrated in Fig. 6. Upon receiving the request for a missing block (step 1), say block A, the memory controller handles the memory block access and records this access in the PAVE history table (step 2), hashed with the key P = A[47:9]. The LLC controller provides the way number (i.e., 2, as stated in step 1, based on the LLC block mapping policy) in the LLC set for block A. Any block access within the eight consecutive memory blocks covered by the same key is recorded in the same table entry, with the accessed blocks encoded by the present-bit vector. Likewise, the block access patterns of the other three threads shown in Fig. 5 are recorded respectively by three history table entries in Fig. 6.
The PAVE history table records the recent history of memory block accesses during thread execution. It addresses the unique need for the LLC line number (LL) in PAVE by letting every entry track both the memory blocks fetched and their corresponding way numbers after they are brought into the LLC. The history table contains relatively few entries (say, 128), and it is the basis that enables PAVE to prefetch DVs effectively from the backing store.
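Under the stated parameters (48-bit physical addresses, 64-byte blocks, eight blocks per entry, 16-way LLC), a history table entry and its recording step might look like the sketch below; the key computation follows the P = A[47:9] hashing described above, while the 128-entry size is the paper's example and the direct-mapped indexing is our simplifying assumption.

#include <array>
#include <cstdint>

// One PAVE history table entry (Fig. 6): a 39-bit tag (A[47:9]), an 8-bit
// presence vector, and eight way-fields (4 bits suffice for a 16-way LLC).
struct PaveEntry {
    std::uint64_t tag = 0;              // A[47:9], i.e., a 512-byte (8-block) region
    std::uint8_t present = 0;           // which of the 8 blocks have been accessed
    std::array<std::uint8_t, 8> way{};  // LLC way number recorded for each block
    bool valid = false;
};

// Per-memory-controller history table; 128 entries as in the text, indexed
// direct-mapped by the low bits of the key (a simplifying assumption).
struct PaveHistoryTable {
    std::array<PaveEntry, 128> entries;

    // Record an LLC-missing block's address and the LLC way it was placed in.
    void record(std::uint64_t addr, std::uint8_t llcWay) {
        const std::uint64_t key = addr >> 9;       // P = A[47:9]
        const unsigned block = (addr >> 6) & 0x7;  // position within the 8-block region
        PaveEntry &e = entries[key & 0x7F];
        if (!e.valid || e.tag != key) {            // allocate or replace on tag mismatch
            e = PaveEntry{};
            e.tag = key;
            e.valid = true;
        }
        e.present |= static_cast<std::uint8_t>(1u << block);
        e.way[block] = llcWay;                     // way number supplied by the LLC controller
    }
};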
Vector Pre-Activation. When a DV under interrogation is not in the DV buffer, the buffer initiates an access to the DV in the backing store through a memory controller, as marked by step 1 in Fig. 7. Meanwhile, PAVE prefetching is triggered and carried out by the same memory controller. Prefetching is based on the memory block access history recorded in the PAVE hash table, at the entry hashed by A[47:9].
Given that the DV of LLC line number L (obtained using Set m and Way 3, as shown in Fig. 7) is not on chip, a request is created by a memory controller for fetching the DV from the backing store. Simultaneously, the PAVE history table is checked by hashing A[47:9]. If any subsequent block exists in the LLC, its way number is employed to generate a prefetching DV request, as denoted by step 2 in Fig. 7. The DV requests are kept in prefetching queues, and all prefetching queues together constitute the prefetch buffer, as depicted in Fig. 7. Each element in the prefetch buffer contains two fields, one for the determined prefetch address of a DV and the other for storing the DV fetched from the backing store.
The requests are placed in the prefetching queue determined by the last q bits of the tag stored in the hit hash table entry, if the prefetch buffer consists of 2^q queues. Each memory controller then issues prefetches to the addresses specified by the head elements of all non-empty queues at a time, as shown by step 3 in Fig. 7. Multiple DV prefetch requests are thus issued at the same time to support the coherence activities of multiple threads. The prefetching results are stored in their corresponding queue entries, as shown by step 4.
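The queue-selection and issue step could be sketched as follows, assuming 2^q prefetch queues per memory controller; the PrefetchElem fields mirror the two-field element described above, and issueMemoryRead is a hypothetical callback standing in for the controller's read port.

#include <cstdint>
#include <deque>
#include <vector>

// One prefetch buffer element: the DV's target address in the backing store
// and space for the vector once it returns from memory.
struct PrefetchElem {
    std::uint64_t dvAddr = 0;
    std::uint64_t dv = 0;
    bool ready = false;
};

// Prefetch buffer organized as 2^q queues per memory controller.
struct PrefetchBuffer {
    explicit PrefetchBuffer(unsigned q) : queues(1u << q), qBits(q) {}

    // Place a DV prefetch request in the queue selected by the last q bits of
    // the tag stored in the hit history-table entry.
    void enqueue(std::uint64_t hitEntryTag, std::uint64_t dvAddr) {
        const unsigned idx = static_cast<unsigned>(hitEntryTag) & ((1u << qBits) - 1);
        PrefetchElem e;
        e.dvAddr = dvAddr;
        queues[idx].push_back(e);
    }

    // Issue one prefetch per non-empty queue (the head element of each) at a time.
    template <typename IssueFn>
    void issueHeads(IssueFn issueMemoryRead) {
        for (auto &q : queues)
            if (!q.empty() && !q.front().ready)
                issueMemoryRead(q.front().dvAddr);
    }

    std::vector<std::deque<PrefetchElem>> queues;
    unsigned qBits;
};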
When an interrogated DV is not found in the DV buffer, the prefetch buffer is checked in case the DV resides there. If the DV is ready in a prefetch queue, it is promoted to the DV buffer in support of future directory interrogations.
It is expected that PAVE delivers better performance for a larger prefetch buffer. Further, higher performance results if more queues are used in a prefetch buffer of a given size, because fewer threads then share each prefetch queue. The evaluation results of PAVE are given in Section 6.
5 EVALUATION METHODOLOGY
We evaluate our proposed NUDA under the inclusive cache hierarchy, presenting evaluation results on the storage efficiency of NUDA and on the impacts of CARP and PAVE under various prefetch buffer sizes.
5.1 Directory Organizations
We compare NUDA against representative prior directory area reduction designs, including Sparse [7], SPACE [21], and SCD [15]. We evaluate all the designs across a range of on-chip directory area budgets, with their results normalized with respect to those of the conventional full bit-mapped (FBM) directory, in which each LLC block is associated with one DV statically. For a meaningful comparison, NUDA (including its history tables and prefetch queues) and each prior counterpart are allocated exactly identical on-chip directory area budgets. The area budget dictates the number of directory elements for Sparse and SCD, and the sharer pattern table entry count for SPACE. The configurations of the prior counterparts are partially listed in Table 1.
5.2 Simulation Setup and Workloads
We perform full-system simulation using the cycle-accurate gem5 simulator [3] to model a 64-core CMP. Table 2 lists the key system configuration parameters. We select ten applications from the PARSEC benchmark suite [2], i.e., blackscholes (BLK), bodytrack (BDY), canneal (CNL), dedup (DDP), ferret (FRT), fluidanimate (FLU), freqmine (FRQ), streamcluster (STR), swaptions (SWP), and X264, and four applications from the SPLASH benchmark suite [19], i.e., FFT, FMM, ocean_ncp (OCN), and radix (RDX), for multi-threaded execution. Each application spawns 64 threads, with each thread mapped to one core.
Fig. 7. DV transfer with effective prefetching by a memory controller, realized through the PAVE history table.
TABLE 1
Directory Configuration Parameters
Design | DV length | Control field | Major augmented logic
NUDA | 64 bits | 1-bit EW; 15-bit LRU; 22-bit tag | PAVE-assisted directory prefetching in memory controller
Sparse | 64 bits | 15-bit LRU; 22-bit tag | Invalidation-based precise tracking
SPACE | 64 bits | 20-bit counter | Pattern table pointer for each LLC line; imprecise tracking
SCD | 8 bits | 15-bit LRU; 21-bit tag; 4-bit index; 2-bit type | 2-D root-leaf hierarchy; invalidation-based precise tracking
TABLE 2
System Configuration Parameters
Processor | x86-64, 64 out-of-order cores, 3 GHz
Private I/D L1 $ | 32 KB, 2-way, pseudo-LRU, 2-cycle latency
Shared L2 $ | 2 MB per core, 16-way, pseudo-LRU, inclusive, 16-cycle latency
Cache block size | 64 bytes
Network | 8×8 2D mesh, VC router, X-Y routing, 2-cycle per-hop latency, 32-byte links
Coherence protocol | Directory-based MESI
Memory | DDR3-1600, 4 GB
6 RESULTS AND DISCUSSION
To ease explanation, NUDA with an on-chip directory area coverage of β is denoted by NUDA-β. We evaluate NUDA with a range of β from 1 to 1/32 to examine its performance at various coverage levels. It is found that NUDA-β exhibits no performance degradation for β ≥ 1/8 when compared to the conventional FBM (full bit-mapped) directory. As β shrinks from 1/16 to 1/32, NUDA starts to degrade its performance gradually (usually within 3–5 percent slowdowns). Those coverage levels coincide with the fractions of active directory elements discussed in Section 3. In this article, we demonstrate NUDA performance results under β = 1/16 and 1/32, which signify fairly scarce on-chip directory area budgets. The results shown in subsequent figures are normalized with respect to those of FBM, and they confirm that NUDA is indeed subject to negligible performance degradation even under fairly low coverage. We illustrate evaluation results of NUDA under β = 1/16 subsequently.
6.1 Benefit Potentials of CARP
The mean fraction of directory entries initiating a RAW sequence (which is marked by the EW bit, see Fig. 4b) is depicted in Fig. 8a. It reveals that, on average, 33 percent of the total directory entries are RAW-critical and may prolong later read misses. It is therefore beneficial to keep RAW-critical entries on chip, avoiding detrimental DV replacement. Comparative performance results of NUDA with and without CARP (the latter following only a naïve LRU-based replacement policy) for β = 1/16 are shown in Fig. 8b. They clearly demonstrate the marked gains of CARP, which contains performance slowdowns by up to 15× (from 5.0 percent down to 0.33 percent for SWP). While its counterpart (without CARP) suffers moderate performance degradation (5.2 percent on average), NUDA cuts the mean performance degradation down to 2.0 percent. Without CARP, FLU has the highest performance degradation (7.1 percent), possibly because it involves frequent synchronization operations [2]. With CARP tracking potential RAW sequences, NUDA lowers the performance degradation of FLU down to 2.0 percent. Note that for β = 1/32, the performance gain due to CARP is expected to expand. CARP indeed offers substantial performance benefits to NUDA, indispensable for improving directory efficiency. All following results for NUDA are with CARP enabled.
6.2 Impacts of PAVE
The impacts of PAVE on the overall performance of NUDA are evaluated under various prefetch buffer sizes, with PAVE-x denoting x prefetch entries in every memory controller for 64-core CMPs. The x entries are organized into x/4 queues, with the prefetch addresses indicated by queue head elements issued by a memory controller at a time. When the buffer size grows, the number of queues rises accordingly, so that fewer threads share a given queue, for higher performance.
The runtime ratios of NUDA-1/16 under various prefetch buffer sizes are demonstrated in Fig. 9. The outcomes when no prefetching exists are also included for comparison. NUDA without PAVE shows noticeable runtime increases (3.9 percent on average). With PAVE-16, NUDA experiences performance improvement, with its runtime ratios dropping by various degrees for the benchmarks evaluated (by up to 2.5 percent for RDX). Further, PAVE-32 boosts NUDA performance markedly for many benchmarks. For example, it shrinks the runtime slowdown to 2.2 percent (or 2.0 percent) for FLU (or DDP) from 5.8 percent (or 5.3 percent) when no prefetching exists. Overall, PAVE-32 cuts the runtime by 2.0 percent on average across all benchmarks examined, when compared to the situation without PAVE. Further expanding the prefetch buffer to PAVE-64 benefits a few benchmarks modestly (e.g., OCN and RDX by some 1.3 percent) but brings little improvement for most benchmarks. As a result, PAVE-32 is adequate for NUDA, and it is chosen as the prefetch buffer size for our following evaluation.
Execution performance is expected to correlate tightly with the hit rate of the DV buffer. Even with a very scarce on-chip budget, NUDA-1/16 exhibits respectable hit rates (exceeding 60 percent) for seven benchmarks even without PAVE, as demonstrated in Fig. 10. The mean hit rate over all benchmarks without PAVE stands at 58.4 percent, signifying that directory interrogations indeed concentrate on hot directory entries. This result validates our insight that the number of active directory entries is usually small within a short time window. The mean hit rate across all benchmarks under PAVE-16 rises to exceed 65 percent, as shown in Fig. 10, causing the runtime slowdowns of most benchmarks to shrink noticeably. PAVE-32 pushes the mean hit rate beyond 73 percent, as can be found in Fig. 10, so that all but three benchmarks (i.e., BDY, FRQ, OCN) have their runtimes extended by less than 3 percent, as can be found in Fig. 9.
Fig. 8. (a) Mean RAW ratio of the on-chip directory. (b) Normalized execution time of NUDA-1/16 for different configurations w.r.t. that of FBM.
Fig. 9. Execution time ratio with PAVE under various prefetch buffer sizes for NUDA-1/16.
NUDA involves DV transfer between on-chip LLC and the backing store, consuming memory bandwidth. For a quantitative measure, bandwidth utilization for β = 1/16 under NUDA with different prefetching configurations is plotted in Fig. 11. In general, normalized memory bandwidth utilization increases as the number of prefetch requests grows (i.e., with more prefetch buffer queues). Without PAVE, NUDA contains its average extra bandwidth utilization within 1.3 percent, with the largest utilization rise of up to 6.7 percent (for BDY), since BDY possibly has the largest active DE fraction and hence the biggest memory bandwidth overhead. Several other sharing-intensive benchmarks, such as CNL, FLU, FRQ, and SWP, increase bandwidth consumption by 1.2 to 4.5 percent. The sharing degrees of CNL, FLU, and SWP are relatively high, as stated earlier [2], and so are their directory activities shown in Fig. 1b.
It can be seen in Fig. 11 that PAVE-32 tends to have markedly lower memory bandwidth utilization than PAVE-64 for four of the benchmarks (i.e., BDY, CNL, DDP, and FRT). Meanwhile, those four benchmarks under PAVE-64 see insignificant reductions in their normalized runtimes when compared with those under PAVE-32, as demonstrated in Fig. 9. This indicates that PAVE-32 is more cost-effective, because aggressive prefetching can hike bandwidth utilization (by up to 14.0 percent for BDY) without noticeably lowering runtimes. Subsequent NUDA evaluation results for comparison with prior directory storage reduction techniques are gathered with the prefetch buffer sized at PAVE-32.
6.3 Comparative Outcomes
The prior designs of SCD [15], SPACE [21], and Sparse [7] are found to exhibit limited performance degradation when the on-chip directory area is richly provisioned, i.e., at a coverage level of β ≥ 1/2, under which NUDA presents negligible performance degradation. As the coverage level drops for a large core count, however, the prior designs suffer from marked performance degradation while NUDA still retains performance. Our comparison here focuses on very scarce directory area budgets (under β = 1/16 or even β = 1/32) for high CMP scalability.
The normalized performance outcomes of Sparse, SCD, SPACE, and NUDA for β = 1/16 are illustrated in Fig. 12, where the on-chip directory area budget for NUDA is kept exactly identical to those for Sparse, SCD, and SPACE. In addition, the outcomes of NUDA under β = 1/32 are included for comparison. It is observed that NUDA incurs negligible (or limited) performance degradation for a scarce (or an extremely scarce) on-chip directory area of β = 1/16 (or 1/32), as compared to the conventional FBM design, which provides each LLC block with a DV statically to track all privately cached contents precisely on chip. While achieving the best execution performance possible, FBM exceedingly over-provisions its on-chip directory and hence limits its scalability. In sharp contrast, NUDA-1/16 and NUDA-1/32 budget only 6.7 and 3.3 percent as much directory area, respectively, and yet exhibit just 2.1 and 5.9 percent execution slowdowns, on average, when compared with the FBM counterpart.
NUDA is seen in Fig. 12 to contain execution slowdowns far better than Sparse, SPACE, and SCD for all benchmarks examined under β = 1/16. For example, Sparse, SPACE, and SCD then degrade mean execution performance by 3.2×, 2.2×, and 1.9×, respectively, as opposed to only 2.1 percent for NUDA-1/16. Hence, NUDA lowers the mean execution slowdowns by 3.7× (or 1.9×) in comparison to SCD (or SPACE) under the scarce directory area of β = 1/16.
NUDA has the largest gains in slowdown restraint for swaptions, with NUDA-1/16 yielding gains of 5.9×, 4.0×, and 2.7× when compared respectively with Sparse, SPACE, and SCD. The reasons are two-fold. First, swaptions often involves repetitive accesses to privately cached, coherence-active data blocks. Under Sparse and SCD, the directory elements of those active LLC blocks may be evicted prematurely, causing substantial execution slowdowns. Second, threads of swaptions frequently compete for shared lock variables to enter critical sections, but the directory elements of those lock variables (without CARP to track their RAW sequences) can be evicted prematurely, extending the execution times significantly. On the other hand, NUDA always keeps the DVs of such synchronization-related data blocks on chip to achieve fast execution.
7 CONCLUSION
We have investigated the non-uniform directory architecture (NUDA) framework, aiming at CMP scalability improvement by drastically reducing the on-chip directory area. NUDA employs a multi-level directory configuration in the CMP, equipping a small on-chip buffer to keep only "active" directory vectors (DVs) for sustained execution performance. It employs a full-fledged backing store at a lower storage level to hold all of the LLC's DVs, and relies on the novel CARP (criticality-aware replacement policy) and PAVE (pre-activating vectors) mechanisms to effectively manage DV transfer between the DV buffer and the backing store. Evaluation results reveal that NUDA experiences negligible performance degradation under a scarcely provisioned on-chip directory area, when compared with a design that statically provisions every LLC block with an on-chip DV for the best possible performance. NUDA is also shown to outperform earlier compression-based techniques for on-chip directory area reduction, especially when the directory area budget is held very low for high scalability.
ACKNOWLEDGMENT
The authors wish to thank Xiyue Xiang for his constructive discussion. This work was supported in part by the National Science Foundation under Award Numbers CCF-1423302 and CNS-1527051.
Fig. 10. On-chip directory hit rates under various prefetch buffer sizes for NUDA-1/16.
Fig. 12. Normalized execution times of NUDA and its earlier counterparts with respect to those of FBM, with the coverage of Sparse, SCD, and SPACE all equal to 1/16.
Fig. 11. Normalized memory bandwidth utilization for NUDA-1/16.
REFERENCES
[1] M. E. Acacio, J. Gonzalez, J. M. Garcia, and J. Duato, “A two-level directory architecture for highly scalable cc-NUMA multiprocessors,” IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 1, pp. 67–79, Jan. 2005.
[2] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in Proc. 17th Int. Conf. Parallel Archit. Compilation Techn., Oct. 2008, pp. 72–81.
[3] N. Binkert, et al., “The gem5 simulator,” ACM SIGARCH Comput. Archit. News, vol. 39, pp. 1–7, 2011.
[4] B. Cuesta, A. Ros, M. E. Gomez, A. Robles, and J. F. Duato, "Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks," in Proc. 38th Int. Symp. Comput. Archit., Jun. 2011, pp. 93–104.
[5] N. Eisley, L.-S. Peh, and L. Shang, “In-network cache coherence,” in Proc. 39th IEEE/ACM Int. Symp. Microarchit., Dec. 2006, pp. 321–332.
[6] Y. Fu, T. M. Nguyen, and D. Wentzlaff, “Coherence domain restriction on large scale systems,” in Proc. 48th ACM/IEEE Int. Symp. Microarchit., Dec. 2015, pp. 686–698.
[7] A. Gupta, et al., “Reducing memory and traffic requirements for scalable directory-based cache coherence schemes,” in Proc. Int. Conf. Parallel Process., 1990, pp. 312–321.
[8] J. Laudon and D. Lenoski, "The SGI Origin: A ccNUMA highly scalable server," in Proc. 24th Annu. Int. Symp. Comput. Archit., Jun. 1997, pp. 241–251.
[9] D. Lenoski, et al., “The Stanford DASH multiprocessor,” IEEE Comput., vol. 25, no. 3, pp. 63–79, Mar. 1992.
[10] A. Jain and C. Lin, “Back to the future: Leveraging Belady’s algorithm for improved cache replacement,” in Proc. 43rd Int. Symp. Comput. Archit., 2016, pp. 78–89.
[11] J. Kuskin, et al., “The Stanford FLASH multiprocessor,” in Proc. Int. Symp. Comput. Archit., 1994, pp. 302–313.
[12] M. Marty and M. Hill, "Virtual hierarchies," IEEE Micro, vol. 28, no. 1, pp. 99–109, Jan. 2008.
[13] M. Michael and A. K. Nanda, “Design and performance of directory caches for scalable shared memory multiprocessors,” in Proc. IEEE 5th Int. Symp. High Performance Comput. Archit., Jan. 1999, pp. 142–151.
[14] J. Philbin, et al., “Thread scheduling for cache locality,” in Proc. 7th Int. Conf. Archit. Support Programming Languages Operating Syst., Oct. 1996, pp. 60–71.
[15] D. Sanchez and C. Kozyrakis, “SCD: A scalable coherence directory with flexible sharer set encoding,” in Proc. IEEE 18th Int. Symp. High Performance Comput. Archit., Feb. 2012, pp. 1–12.
[16] W. Shu and N. Tzeng, “Superior directory efficiency via relinquishment coherence and compressed sharer tracking for chip multiprocessors,” IEEE Trans. Comput., vol. 66, no. 11, pp. 1975–1981, Nov. 2017.
[17] S. Shukla and M. Chaudhuri, "Tiny directory: Efficient shared memory in many-core systems with ultra-low-overhead coherence tracking," in Proc. IEEE 23rd Int. Symp. High Performance Comput. Archit., Feb. 2017, pp. 205–216.
[18] D. J. Sorin, M. D. Hill, and D. A. Wood, A Primer on Memory Consistency and Cache Coherence. San Rafael, CA, USA: Morgan and Claypool, 2011, pp. 161–163.
[19] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The SPLASH-2 programs: Characterization and methodological considerations,” in Proc. 22nd Int. Symp. Comput. Archit., Jun. 1995, pp. 24–36.
[20] G. Zhang, W. Horn, and D. Sanchez, “Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems,” in Proc. IEEE/ACM Int. Symp. Microarchit., Dec. 2015, pp. 13–25.
[21] H. Zhao, et al., “SPACE: Sharing pattern-based directory coherence for multicore scalability,” in Proc. 19th Int. Conf. Parallel Archit. Compilation Tech., Oct. 2010, pp. 135–146.
" For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
IEEE TRANSACTIONS ON COMPUTERS, VOL. 67, NO. 5, MAY 2018 747