Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc
In the proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010
Paper presentation by Sankalp Shivaprakash
Motivation
• Hide memory latency through many-thread aware prefetching schemes
– Per-warp training and stride promotion
– Inter-thread prefetching
– Adaptive throttling
• Propose software and hardware prefetching mechanisms for a GPGPU architecture
– Scalable to a large number of threads
– Robust: feedback and throttling mechanisms avoid degraded performance
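The per-warp training and stride promotion ideas above can be sketched as a small table that learns one stride per (load PC, warp) and, once every warp agrees on the same stride, promotes it to a single shared entry. This is an illustrative sketch, not the paper's exact hardware structure; the class name, table layout, and the two-confirmation training rule are assumptions.

```python
# Hypothetical sketch of per-warp stride training with stride promotion.
# Table layout, names, and the "stride seen twice in a row" training rule
# are assumptions for illustration, not the paper's exact design.

class PerWarpStrideTable:
    """Tracks, per (load PC, warp), the last address and the detected stride."""

    def __init__(self):
        # (pc, warp_id) -> {"last": addr, "stride": s, "trained": bool}
        self.entries = {}

    def train(self, pc, warp_id, addr):
        """Update the entry for this (pc, warp) and, once the stride is
        stable, return the next address to prefetch (else None)."""
        e = self.entries.get((pc, warp_id))
        if e is None:
            self.entries[(pc, warp_id)] = {"last": addr, "stride": None,
                                           "trained": False}
            return None
        stride = addr - e["last"]
        e["trained"] = (e["stride"] == stride)  # same stride twice in a row
        e["stride"], e["last"] = stride, addr
        return addr + stride if e["trained"] else None

    def promote(self, pc):
        """Stride promotion: if every trained warp learned the same stride
        for this PC, one global entry can serve all warps (saving table
        space). Returns the promoted stride, or None."""
        strides = {e["stride"] for (p, _), e in self.entries.items()
                   if p == pc and e["trained"]}
        return strides.pop() if len(strides) == 1 else None
```

For example, feeding warp 0 the addresses 100, 108, 116 for one load PC trains an 8-byte stride and yields a prefetch for 124; once a second warp trains the same stride, `promote` returns 8.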
Memory Latency Hiding Techniques
• Multithreading
– Context switching at the thread level and at the warp level
• Utilization of the cache memory hierarchy
– Serving requests from L1, L2, and DRAM row buffers rather than accessing global memory each time
• However, contention and delay grow as the number of warps increases
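The inter-thread prefetching scheme listed in the motivation can be sketched as one warp prefetching on behalf of the warps behind it: because consecutive warps often touch consecutive memory regions, a miss by warp *w* can trigger a prefetch of the region warp *w+1* is likely to need. The warp size, element size, and address arithmetic below are illustrative assumptions, not the paper's exact mechanism.

```python
# Assumption-level illustration of inter-thread (inter-warp) prefetching:
# on a miss at `addr`, issue prefetches at the offsets the following warps
# are expected to access. Constants are assumed, not taken from the paper.

WARP_SIZE = 32    # threads per warp (NVIDIA-style)
ELEM_BYTES = 4    # assumed element size, e.g. a 32-bit float

def inter_thread_prefetch(addr, num_prefetches=1):
    """Return the addresses to prefetch on behalf of the next warps,
    spaced by one warp's data footprint per warp."""
    warp_footprint = WARP_SIZE * ELEM_BYTES  # bytes one warp consumes per access
    return [addr + k * warp_footprint for k in range(1, num_prefetches + 1)]
```

For example, with a 128-byte warp footprint, a miss at `0x1000` yields a prefetch for `0x1080`, the region the next warp would stream through.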
Useful vs. Harmful Prefetching
• Prefetch requests can become harmful due to:
– Queuing delays
– DRAM row-buffer conflicts
– Off-chip bandwidth wasted by early eviction
– Off-chip bandwidth wasted by inaccurate prefetches
Metrics for Adaptive Prefetch Throttling
• Early Eviction Rate
• Merge Ratio
Monitoring these avoids:
• Consumption of system bandwidth
• Delaying of demand requests
• Occupation of the cache by unnecessary prefetches
• Merged prefetch requests may arrive late, but that latency is compensated by context switching across warps
• The early eviction rate and the merge ratio are monitored continuously to drive the throttling decision
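A minimal sketch of how the two monitored metrics could drive the throttling decision: a high early eviction rate signals cache pollution and throttles the prefetch degree down, while accurate and timely prefetches let it ramp back up. The thresholds, degree bounds, and adjustment steps are illustrative assumptions, not the paper's tuned values.

```python
# Hedged sketch of adaptive prefetch throttling driven by the early
# eviction rate and the merge ratio. All thresholds are assumptions.

def early_eviction_rate(early_evicted, total_prefetches):
    """Fraction of prefetched blocks evicted before any demand hit."""
    return early_evicted / total_prefetches if total_prefetches else 0.0

def merge_ratio(merged, total_prefetches):
    """Fraction of prefetches merged with an in-flight demand request
    (late, but still partially useful)."""
    return merged / total_prefetches if total_prefetches else 0.0

def adjust_degree(degree, eer, mr, eer_high=0.4, mr_low=0.1,
                  min_degree=0, max_degree=4):
    """Throttle down on cache pollution (high early eviction rate);
    become more aggressive when prefetches are accurate and timely
    (low early eviction rate and low merge ratio); otherwise hold."""
    if eer > eer_high:
        return max(min_degree, degree - 1)   # polluting the cache
    if mr < mr_low:
        return min(max_degree, degree + 1)   # accurate and timely
    return degree                            # late-but-merged: hold steady
```

Note that a high merge ratio alone does not trigger throttling here, matching the observation above that merged (late) prefetches are compensated by warp context switching.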
Methodology
• The baseline processor modeled is NVIDIA's GeForce 8800GT
• Application traces for the simulator are generated using GPUOcelot, a binary translation framework for PTX
Results and Discussion
Conclusion
• The proposed throttling mechanism controls the aggressiveness of prefetching rather than curbing it entirely
• The chosen metrics are convincing: they capture cache pollution from early eviction and exploit memory-request merging, rather than relying on prefetch accuracy alone
• Scalability and robustness were given importance
• The study does not consider complex cache memory hierarchies
• The overhead of prefetching is not clearly substantiated