Temporal Prefetching Without the Off-Chip Metadata. MICRO-52, October 12–16, 2019, Columbus, OH, USA
Temporal Prefetching Without the Off-Chip Metadata
ABSTRACT
Temporal prefetching offers great potential, but this potential is difficult to achieve because of the need to store large amounts of prefetcher metadata off chip. To reduce the latency and traffic of off-chip metadata accesses, recent advances in temporal prefetching have proposed increasingly complex mechanisms that cache and prefetch this off-chip metadata. This paper suggests a return to simplicity: We present a temporal prefetcher whose metadata resides entirely on chip. The key insights are (1) only a small portion of prefetcher metadata is important, and (2) for most workloads with irregular accesses, the benefits of an effective prefetcher outweigh the marginal benefits of a larger data cache. Thus, our solution, the Triage prefetcher, identifies important metadata and uses a portion of the LLC to store this metadata, and it dynamically partitions the LLC between data and metadata.
Our empirical results show that when compared against spatial prefetchers that use only on-chip metadata, Triage performs well, achieving speedups on the irregular subset of SPEC2006 of 23.5%, compared to 5.8% for the previous state-of-the-art. When compared against state-of-the-art temporal prefetchers that use off-chip metadata, Triage sacrifices performance on single-core systems (23.5% speedup vs. 34.7% speedup), but its 62% lower traffic overhead translates to better performance in bandwidth-constrained 16-core systems (6.2% speedup vs. 4.3% speedup).
CCS CONCEPTS
· Computer systems organization → Processors and memory architectures.
KEYWORDS
Data prefetching, irregular temporal prefetching, caches, CPUs
ACM Reference Format:
Hao Wu, Krishnendra Nathella, Joseph Pusdesris, Dam Sunwoo, Akanksha Jain, and Calvin Lin. 2019. Temporal Prefetching Without the Off-Chip Metadata. In The 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-52), October 12–16, 2019, Columbus, OH, USA.
Second, off-chip metadata traffic consumes significant energy, since DRAM accesses consume
more power than on-chip operations. Third, off-chip metadata adds
hardware complexity because it requires (1) changes to the mem-
ory interface, (2) communication with the OS, and (3) methods for
managing the metadata, which can include both a metadata cache
replacement policy and a metadata prefetcher [47].
In this paper, we present a new temporal data prefetcher that addresses these issues by not maintaining any off-chip metadata. Our work is motivated by two observations. First, most of the coverage for state-of-the-art temporal prefetchers [24, 47] comes from a small number of metadata entries, so it is possible to get substantial coverage without storing megabytes of metadata (see Figure 1). Second, the marginal utility of the last-level cache (LLC) [34, 40] is typically outweighed by the benefits of an effective prefetcher. For example, for the irregular subset of SPEC2006, reducing the cache by 1 MB reduces performance by 7.4%, but a state-of-the-art irregular prefetcher with unlimited resources can improve performance by 41.7%. Therefore, if we can distinguish the important metadata
Figure 1: Metadata reuse distribution for the mcf benchmark: For an execution with 60K metadata entries, only 15% of metadata entries are reused more than 15 times. (Plot axes: Reuse Count, 0–600, vs. Metadata Entry, 1–60,001.)
from the unimportant metadata, it can be profitable to use large
portions of the LLC to store important prefetcher metadata.
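The measurement behind Figure 1 can be sketched as a small trace analysis: for each metadata entry, count how many times it is reused after it is first created. This is illustrative analysis code, not part of Triage; keying each entry by the address that triggers its correlation is a simplifying assumption about the metadata format.

```python
# Sketch of the measurement behind Figure 1: count how often each
# metadata entry is reused over an address trace. Illustrative
# analysis code only; keying entries by the triggering address is a
# simplifying assumption, not Triage's exact metadata format.
from collections import Counter

def reuse_distribution(trace):
    """Return a Counter mapping each metadata entry (keyed by the
    address that triggers its correlation) to its reuse count."""
    counts = Counter()
    created = set()
    prev = None
    for addr in trace:
        if prev is not None:
            if prev in created:
                counts[prev] += 1   # entry already exists: a reuse
            else:
                created.add(prev)   # first pair creates the entry
        prev = addr
    return counts
```

Sorting the resulting counts in descending order yields a skewed curve like Figure 1's, where a small fraction of entries accounts for most of the reuse.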
Thus, our prefetcher, which we call the Triage prefetcher, repurposes a portion of the LLC as a metadata store, and any metadata that cannot be kept in the metadata store is simply discarded. To identify important metadata, Triage uses the Hawkeye replacement policy [25], which provides significant performance benefits for small metadata stores because it identifies metadata that is frequently accessed over a long history. Of course, the ideal size of this metadata store varies by workload, so we also introduce a dynamic cache partitioning scheme that determines the portion of the LLC that should be provisioned for metadata entries.
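To make the partitioning idea concrete, here is a minimal counter-based sketch: grow the metadata partition while the misses that extra metadata would cover outnumber the data-cache hits it displaces, and shrink it otherwise. The class name, counters, and way granularity are our illustrative assumptions, not Triage's actual mechanism.

```python
# Toy sketch (not Triage's exact mechanism) of dynamically
# partitioning LLC ways between data and prefetcher metadata.
# Assumption: each epoch we compare the benefit of more metadata
# (misses it would cover) against its cost (data hits displaced).
class WayPartitioner:
    def __init__(self, total_ways=16, min_data_ways=4):
        self.total_ways = total_ways
        self.min_data_ways = min_data_ways
        self.metadata_ways = 0
        self.covered_misses = 0   # misses a larger metadata store would cover
        self.displaced_hits = 0   # data hits lost to metadata occupancy

    def record_covered_miss(self):
        self.covered_misses += 1

    def record_displaced_hit(self):
        self.displaced_hits += 1

    def adjust(self):
        """End an epoch: grow the metadata partition while prefetch
        coverage pays for the data hits it displaces; else shrink."""
        if (self.covered_misses > self.displaced_hits
                and self.total_ways - self.metadata_ways > self.min_data_ways):
            self.metadata_ways += 1
        elif self.covered_misses < self.displaced_hits and self.metadata_ways > 0:
            self.metadata_ways -= 1
        self.covered_misses = self.displaced_hits = 0
        return self.metadata_ways
```

A real design would track these signals with set sampling or shadow tags; the sketch only captures the cost/benefit comparison that drives the partition.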
By forsaking off-chip metadata and by intelligently managing on-chip metadata, Triage offers vastly different tradeoffs than state-of-the-art temporal prefetchers. For example, Triage reduces off-chip traffic overhead from 156.4% to 59.3%, it reduces energy consumption for metadata accesses by 4-22×, and it offers a much simpler hardware design. In Section 4, we show that in bandwidth-rich environments these benefits come at the cost of lower performance (due to limited prefetcher metadata), but they translate to better performance in bandwidth-constrained environments.
Triage's metadata organization has the added benefit of a simplified and compressed metadata representation. In particular, we find that without the need to store metadata off chip, tables [27] are the most compact data structure for tracking correlated addresses, because they have no redundancy. By contrast, previous solutions [24, 45] introduce varying degrees of metadata redundancy to facilitate off-chip metadata management. Since our metadata store competes for space in the LLC, this compactness has a direct performance benefit.
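As an illustration of why a simple table is compact, the following sketch stores exactly one correlated successor per (PC, address) pair with no duplicated history, in the spirit of Markov-style prediction tables [27]; it also shows PC localization, with streams separated by load PC. The class name, capacity, and LRU-style eviction are illustrative assumptions, not Triage's hardware layout.

```python
# Minimal sketch of a table-based (Markov-style [27]) correlation
# store with PC-localized streams. Illustrative only: names, the
# capacity, and LRU-style eviction are assumptions, not Triage's
# actual organization.
from collections import OrderedDict

class CorrelationTable:
    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.table = OrderedDict()   # (pc, addr) -> next addr in that PC's stream
        self.last_addr = {}          # pc -> most recently seen address

    def train(self, pc, addr):
        """Record that `addr` followed the previous address in pc's stream."""
        prev = self.last_addr.get(pc)
        if prev is not None:
            key = (pc, prev)
            if key in self.table:
                self.table.move_to_end(key)      # refresh recency
            elif len(self.table) >= self.capacity:
                self.table.popitem(last=False)   # discard coldest metadata
            self.table[key] = addr               # one successor, no redundancy
        self.last_addr[pc] = addr

    def predict(self, pc, addr):
        """Return the predicted next address for pc's stream, if known."""
        return self.table.get((pc, addr))
```

Because each (PC, address) pair maps to a single successor, the table holds no overlapping history segments, which is the compactness property the paragraph above relies on.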
To summarize, this paper makes several contributions:
• We introduce Triage, the first PC-localized¹ temporal data prefetcher that does not use off-chip metadata. Triage reuses a portion of the LLC for storing prefetcher metadata, and it includes a simple adaptive policy for dynamically provisioning the size of the metadata store.
¹PC localization is a method of creating more predictable reference streams by separating streams according to the address of the instruction that issued the load.
• We evaluate the Triage prefetcher using a highly accurate
proprietary simulator for single-core simulations and the
ChampSim simulator for multi-core simulations.
– On single-core systems running SPEC 2006 workloads, […]
[…] traffic overhead for Triage vs. 156.4% for MISB). This traffic reduction
translates to better speedup in bandwidth-constrained 16-core sys-
tems, where Triage outperforms MISB despite having access to a
metadata store that is orders of magnitude smaller. Triage’s traffic
overhead is comparable to state-of-the-art spatial prefetchers, such as the Best-Offset (BO) prefetcher [32], which speaks to its practicality. Overall, Triage provides
a new and attractive design point for temporal prefetchers with
vastly different tradeoffs than previous solutions.
ACKNOWLEDGMENTS
We thank Jaekyu Lee for his help in setting up the proprietary
simulation infrastructure. This work was funded in part by NSF
Grant CCF-1823546 and a gift from Intel Corporation through the
NSF/Intel Partnership on Foundational Microarchitecture Research.
REFERENCES
[1] 2015. 2nd Data Prefetching Championship (2015). http://comparch-conf.gatech.edu/dpc2
[2] 2017. 2nd Cache Replacement Championship (2017). http://crc2.ece.tamu.edu/
[3] Jean-Loup Baer and Tien-Fu Chen. 1995. Effective Hardware-Based Data Prefetching for High-Performance Processors. IEEE Trans. Comput. 44, 5 (May 1995), 609–623.
[4] Mohammad Bakhshalipour, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2018. Domino Temporal Data Prefetcher. In High Performance Computer Architecture (HPCA), 2018 IEEE 24th International Symposium on. 131–142.
[5] Shekhar Borkar. 2011. The Exascale Challenge. https://parasol.tamu.edu/pact11/ShekarBorkar-PACT2011-keynote.pdf
[6] Ioana Burcea, Stephen Somogyi, Andreas Moshovos, and Babak Falsafi. 2008. Predictor Virtualization. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIII). ACM, 157–167.
[7] Doug Burger, Thomas R. Puzak, Wei-Fen Lin, and Steven K. Reinhardt. 2001. Filtering Superfluous Prefetches Using Density Vectors. In ICCD '01: Proceedings of the International Conference on Computer Design: VLSI in Computers & Processors. 124–133.
[8] Chi F. Chen, Se-Hyun Yang, Babak Falsafi, and Andreas Moshovos. 2004. Accurate and Complexity-Effective Spatial Pattern Prediction. In Proceedings of the 10th International Symposium on High Performance Computer Architecture (HPCA '04). 276–288.
[9] Trishul M. Chilimbi. 2001. Efficient Representations and Abstractions for Quantifying and Exploiting Data Reference Locality. In SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 191–202.
[10] Yuan Chou. 2007. Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications. In MICRO. 301–313.
[11] Jamison Collins, Suleyman Sair, Brad Calder, and Dean M. Tullsen. 2002. Pointer Cache Assisted Prefetching. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO 35). 62–73.
[12] Robert Cooksey, Stephan Jourdan, and Dirk Grunwald. 2002. A Stateless, Content-Directed Data Prefetching Mechanism. SIGARCH Computer Architecture News 30, 5 (October 2002), 279–290.
[13] Arjun Deb, Paolo Faraboschi, Ali Shafiee, Naveen Muralimanohar, Rajeev Balasubramonian, and Robert Schreiber. 2016. Enabling Technologies for Memory Compression: Metadata, Mapping, and Prediction. In 2016 IEEE 34th International Conference on Computer Design (ICCD). IEEE, 17–24.
[14] Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. 2009. Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems. In HPCA. 7–17.
[15] Keith I. Farkas, Paul Chow, Norman P. Jouppi, and Zvonko Vranesic. 1997. Memory-System Design Considerations for Dynamically-Scheduled Processors. In ISCA '97: Proceedings of the 24th Annual International Symposium on Computer Architecture. 133–143.
[16] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. 2012. Clearing the Clouds: A Study of Emerging Scale-Out Workloads on Modern Hardware. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems. 37–48.
[17] Michael Ferdman, Thomas F. Wenisch, Anastasia Ailamaki, Babak Falsafi, and Andreas Moshovos. 2008. Temporal Instruction Fetch Streaming. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 1–10.
[18] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (September 2006), 1–17. https://doi.org/10.1145/1186736.1186737
[19] Seokin Hong, Prashant Jayaprakash Nair, Bulent Abali, Alper Buyuktosunoglu, Kyu-Hyoun Kim, and Michael Healy. 2018. Attache: Towards Ideal Memory Compression by Mitigating Metadata Bandwidth Overheads. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 326–338.
[20] Zhigang Hu, Margaret Martonosi, and Stefanos Kaxiras. 2003. TCP: Tag Correlating Prefetchers. In HPCA. 317–326.
[21] Ibrahim Hur and Calvin Lin. 2006. Memory Prefetching Using Adaptive Stream Detection. In Proceedings of the 39th International Symposium on Microarchitecture. 397–408.
[22] Yasuo Ishii, Mary Inaba, and Kei Hiraki. 2011. Access Map Pattern Matching for High Performance Data Cache Prefetch. In Journal of Instruction-Level Parallelism, Vol. 13. 1–24.
[23] Bruce Jacob, Spencer Ng, and David Wang. 2010. Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann.
[24] Akanksha Jain and Calvin Lin. 2013. Linearizing Irregular Memory Accesses for Improved Correlated Prefetching. In 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[25] Akanksha Jain and Calvin Lin. 2016. Back to the Future: Leveraging Belady's Algorithm for Improved Cache Replacement. In Proceedings of the International Symposium on Computer Architecture (ISCA).
[26] Teresa L. Johnson, Matthew C. Merten, and Wen-Mei W. Hwu. 1997. Run-Time Spatial Locality Detection and Optimization. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture. 57–64.
[27] Doug Joseph and Dirk Grunwald. 1997. Prefetching Using Markov Predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture. 252–263.
[28] Norman P. Jouppi. 1990. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In International Symposium on Computer Architecture (ISCA). 364–373.
[29] Jinchun Kim, Elvira Teran, Paul V. Gratz, Daniel A. Jiménez, Seth H. Pugsley, and Chris Wilkerson. 2017. Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 737–749.
[30] Sanjeev Kumar and Christopher Wilkerson. 1998. Exploiting Spatial Locality in Data Caches Using Spatial Footprints. SIGARCH Computer Architecture News 26, 3 (April 1998), 357–368.
[31] Snehasish Kumar, Hongzhou Zhao, Arrvindh Shriraman, Eric Matthews, Sandhya Dwarkadas, and Lesley Shannon. 2012. Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 376–388.
[32] Pierre Michaud. 2016. Best-Offset Hardware Prefetching. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 469–480.
[33] Kyle J. Nesbit and James E. Smith. 2005. Data Cache Prefetching Using a Global History Buffer. IEEE Micro 25, 1 (2005), 90–97.
[34] Anant Vithal Nori, Jayesh Gaur, Siddharth Rai, Sreenivas Subramoney, and Hong Wang. 2018. Criticality Aware Tiered Cache Hierarchy: A Fundamental Relook at Multi-Level Cache Hierarchies. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 96–109.
[35] Subbarao Palacharla and Richard E. Kessler. 1994. Evaluating Stream Buffers as a Secondary Cache Replacement. In Proceedings of the International Symposium on Computer Architecture (ISCA). 24–33.
[36] Leeor Peled, Shie Mannor, Uri Weiser, and Yoav Etsion. 2015. Semantic Locality and Context-Based Prefetching Using Reinforcement Learning. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). IEEE, 285–297.
[37] Seth H. Pugsley, Zeshan Chishti, Chris Wilkerson, Peng-fei Chuang, Robert L. Scott, Aamer Jaleel, Shih-Lien Lu, Kingsum Chow, and Rajeev Balasubramonian. 2014. Sandbox Prefetching: Safe Run-Time Evaluation of Aggressive Prefetchers. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on. IEEE.
[38] Amir Roth and Gurindar S. Sohi. 1999. Effective Jump-Pointer Prefetching for Linked Data Structures. In Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA). 111–121.
[40] Amna Shahab, Mingcan Zhu, Artemiy Margaritov, and Boris Grot. 2018. Farewell My Shared LLC! A Case for Private Die-Stacked DRAM Caches for Servers. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 559–572.
[41] Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. 2002. Automatically Characterizing Large Scale Program Behavior. ACM SIGOPS Operating Systems Review 36, 5 (2002), 45–57.
[42] A. J. Smith. 1978. Sequential Program Prefetching in Memory Hierarchies. IEEE Trans. Comput. 11, 12 (December 1978), 7–12.
[43] Yan Solihin, Jaejin Lee, and Josep Torrellas. 2002. Using a User-Level Memory Thread for Correlation Prefetching. In Proceedings of the 29th Annual International Symposium on Computer Architecture. 171–182.
[44] Stephen Somogyi, Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi, and Andreas Moshovos. 2006. Spatial Memory Streaming. In ISCA '06: Proceedings of the 33rd Annual International Symposium on Computer Architecture. 252–263.
[45] Thomas F. Wenisch, Michael Ferdman, Anastasia Ailamaki, Babak Falsafi, and Andreas Moshovos. 2009. Practical Off-Chip Meta-Data for Temporal Memory Streaming. In HPCA. 79–90.
[46] Thomas F. Wenisch, Michael Ferdman, Anastasia Ailamaki, Babak Falsafi, and Andreas Moshovos. 2010. Making Address-Correlated Prefetching Practical. IEEE Micro 30, 1 (2010), 50–59.
[47] Hao Wu, Krishnendra Nathella, Akanksha Jain, Dam Sunwoo, and Calvin Lin. 2019. Efficient Metadata Management for Irregular Data Prefetching. In the 46th International Symposium on Computer Architecture (ISCA).
[48] Vinson Young, Sanjay Kariyappa, and Moinuddin Qureshi. 2019. Enabling Transparent Memory-Compression for Commodity Memory Systems. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 570–581.