A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems Dimitris Kaseridis, Jeffery Stuecheli, Jian Chen and Lizy K. John Department of Electrical and Computer Engineering, The University of Texas at Austin, TX, USA, IBM Corp., Austin, TX, USA Reviewed by: Stanley Ikpe

Dec 28, 2015


Mary Fletcher


Overview
• Terminology
• Paper Breakdown
• Paper Summary
  • Objective
  • Implementation
  • Results
• General Comments
• Discussion Topics


Terminology
• Chip Multiprocessor (CMP): multiple processor cores on a single chip
• Throughput: measure of work done (successful messages delivered)
• Bandwidth (memory): rate at which data can be read/stored
• Quality of Service (QoS): ability to provide priority to applications
• Fairness: ability to allocate shared resources equitably among applications
• Resources: utilities used for work (cache capacity and memory bandwidth)
• Last Level Cache (LLC): largest (slowest) cache level, on or off chip


Paper Breakdown
• Motivation: CMP integration provides an opportunity for improved throughput. However, sharing resources can be hazardous to performance.
• Causes: parallel applications; each thread (core) puts different demands/requests on common (shared) resources.
• Effects: inconsistent performance and resource contention (unfairness).


Paper Breakdown
• So how do we fix this? Resource management: control the allocation and use of available resources.
• What are some of these resources?
  • Cache capacity
  • Available memory bandwidth


Paper Breakdown
• How do we go about resource management?
• Predictive work monitoring: infer which resources will be used, using a non-invasive (hardware) method of profiling resources (cache capacity and memory bandwidth).
• System-wide resource allocation and job scheduling: identify over-utilized CMPs (bandwidth) and reallocate work.


Baseline Architecture


Set-Associative Design

[3] www.utdallas.edu/~edsha/parallel/2010S/Cache-Overview.pdf


Objectives
• Create an algorithm to effectively project memory bandwidth and cache capacity requirements (per core).
• Implement it for system-wide optimization of resource allocation and job scheduling.
• Improve potential throughput for CMP systems.


Implementation
• Resource Profiling: prediction scheme to detect cache misses and bandwidth requirements
• Mattson’s stack distance algorithm (MSA): method for reducing the simulation time of trace-driven caches (Mattson et al. [2])
• MSA-based profiler for LLC misses: a K-way set-associative cache implies K+1 counters. A cache access that hits at stack position i increments counter i; a cache miss increments counter K+1.
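The counter scheme described above can be illustrated with a minimal software sketch for a single cache set under LRU replacement. The class name `MSAProfiler` and the method `projected_misses` are hypothetical names chosen for illustration; the slides describe a hardware profiler, not this code. The sketch relies on the LRU stack property: a hit at stack position i would have been a miss in any allocation smaller than i ways.

```python
class MSAProfiler:
    """Stack-distance histogram for one set of a K-way LRU cache (illustrative)."""

    def __init__(self, ways):
        self.ways = ways
        self.stack = []                    # tags ordered from MRU to LRU
        # counters[i - 1] counts hits at stack position i (1-based);
        # counters[ways] is the "K+1" counter that counts misses.
        self.counters = [0] * (ways + 1)

    def access(self, tag):
        if tag in self.stack:
            pos = self.stack.index(tag)    # 0-based stack position of the hit
            self.counters[pos] += 1
            self.stack.pop(pos)
        else:
            self.counters[self.ways] += 1  # miss in the full K-way cache
            if len(self.stack) == self.ways:
                self.stack.pop()           # evict the LRU tag
        self.stack.insert(0, tag)          # accessed tag becomes MRU

    def projected_misses(self, allocated_ways):
        """Misses this trace would see if only `allocated_ways` ways were given."""
        # Hits deeper than the allocation turn into misses.
        return sum(self.counters[allocated_ways:])


profiler = MSAProfiler(ways=4)
for tag in ["A", "B", "C", "A", "B", "C", "D", "A"]:
    profiler.access(tag)
for w in range(1, 5):
    print(w, "ways ->", profiler.projected_misses(w), "misses")
```

A single pass over the trace yields the miss count for every allocation from 1 to K ways, which is what makes the histogram usable as input to a partitioning decision.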


MSA-based profiler for LLC misses


Implementation
• MSA-based Profiler for Memory Bandwidth: projects the memory bandwidth required to read (due to cache fills) and write (due to dirty cache write-backs to main memory)
• Hits to dirty cache lines indicate write-back operations if cache capacity allocation < stack distance.
• Dirty Stack Distance is used to track the largest stack distance at which a dirty line was accessed.
• Dirty counter projects the write-back rate, and the Dirty bit marks the greatest stack distance of a dirty line.
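As a rough illustration of how the two projected quantities combine, the sketch below turns a projected fill count and a projected write-back count for one epoch into a bandwidth figure. The function name `projected_bandwidth`, its parameters, and the numbers in the usage lines are assumptions made for this example; the slides do not give the cache line size or epoch length.

```python
def projected_bandwidth(fills, writebacks, line_bytes, epoch_seconds):
    """Bytes per second moved between the LLC and memory during one epoch.

    fills         -- projected cache fills (read traffic caused by misses)
    writebacks    -- projected dirty-line write-backs (write traffic)
    line_bytes    -- cache line size in bytes (assumed, not from the slides)
    epoch_seconds -- length of the profiling epoch in seconds (assumed)
    """
    if epoch_seconds <= 0:
        raise ValueError("epoch_seconds must be positive")
    read_bytes = fills * line_bytes
    write_bytes = writebacks * line_bytes
    return (read_bytes + write_bytes) / epoch_seconds


# Hypothetical epoch: 1000 fills and 250 write-backs of 64-byte lines in 0.5 s.
print(projected_bandwidth(1000, 250, line_bytes=64, epoch_seconds=0.5))
```

Evaluating this for each candidate cache allocation gives a bandwidth-versus-capacity curve per core, since both the fill count and the write-back count depend on how many ways the core receives.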


Write-back pseudocode


Write-back Profiling Example


SPEC CPU 2006


Implementation
• Resource Allocation: compute Marginal-Utility for a given workload across a range of possible cache allocations, so that all possible allocations of unused capacity can be compared (n new elements, c already used elements).
• Intra-chip partitioning algorithm: Marginal-Utility is a figure of merit measuring the amount of utility provided (reduced cache misses) for a given amount of resource (cache capacity). The algorithm considers the ideal cache capacity and distributes specific cache ways per core.
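One way to read the Marginal-Utility description is as the miss reduction per added way: for a core that already holds c ways, the utility of n more ways is (misses(c) - misses(c + n)) / n. The greedy sketch below uses that reading to hand out cache ways; it is an assumption-laden illustration, not the algorithm from the paper's "Algorithm" slide, and the names `marginal_utility` and `partition_ways` are hypothetical. Each miss curve is a list indexed by number of ways, such as the projections an MSA histogram provides.

```python
def marginal_utility(miss_curve, c, n):
    """Miss reduction per way when growing an allocation from c to c + n ways."""
    return (miss_curve[c] - miss_curve[c + n]) / n


def partition_ways(miss_curves, total_ways, min_ways=1):
    """Greedily assign cache ways to cores by highest marginal utility."""
    alloc = [min_ways] * len(miss_curves)
    free = total_ways - sum(alloc)
    if free < 0:
        raise ValueError("not enough ways for the minimum allocation")
    while free > 0:
        best = None  # (utility, core index, ways to add)
        for core, curve in enumerate(miss_curves):
            max_n = min(free, len(curve) - 1 - alloc[core])
            # Looking ahead over several ways handles curves with a flat
            # region followed by a sharp drop.
            for n in range(1, max_n + 1):
                mu = marginal_utility(curve, alloc[core], n)
                if best is None or mu > best[0]:
                    best = (mu, core, n)
        if best is None:   # every curve is exhausted
            break
        _, core, n = best
        alloc[core] += n
        free -= n
    return alloc


# Hypothetical miss curves for two cores sharing an 8-way cache.
gradual = [100, 90, 80, 70, 60, 50, 40, 40, 40]      # steady benefit per way
cliff = [100, 100, 100, 100, 20, 20, 20, 20, 20]     # needs 4 ways to benefit
print(partition_ways([gradual, cliff], total_ways=8))
```

With these made-up curves the second core only benefits once it holds four ways, so comparing utilities over several added ways at once (rather than one way at a time) is what lets it win the allocation.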


Algorithm


Implementation
• Inter-chip partitioning algorithm: find an efficient workload schedule (below a threshold or bandwidth limit) on all available CMPs in the system. A global implementation is used to mitigate misdistribution of workload. The Marginal-Utility algorithm alongside bandwidth over-commit detection allows additional workload migration.
• Cache capacity: estimate the optimal resource assignment (marginal-utility) and the intra-chip partitioning assignment. The algorithm performs workload swapping so each core is below the bandwidth limit.
• Memory Bandwidth: the memory bandwidth over-commit algorithm finds workloads with high/low requirements and shifts them to undercommitted CMPs.
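The bandwidth over-commit step can be sketched as a simple swap loop: take the chip with the highest projected bandwidth and, if it exceeds the limit, exchange its most demanding workload with the least demanding workload on the least loaded chip. This is only a sketch of the general idea under stated assumptions; the function name `rebalance`, the `max_swaps` guard, and the swap rule are hypothetical and are not taken from the paper. Each chip is assumed to hold at least one workload.

```python
def rebalance(chips, limit, max_swaps=100):
    """Swap workloads between chips until none exceeds `limit`, if possible.

    chips -- list of chips, each a non-empty list of per-workload
             bandwidth demands
    limit -- per-chip bandwidth limit, in the same units as the demands
    """
    chips = [list(c) for c in chips]          # work on a copy
    for _ in range(max_swaps):
        loads = [sum(c) for c in chips]
        hot = max(range(len(chips)), key=loads.__getitem__)
        cold = min(range(len(chips)), key=loads.__getitem__)
        if loads[hot] <= limit:
            break                             # no chip is over-committed
        high = max(chips[hot])                # most demanding workload, hot chip
        low = min(chips[cold])                # least demanding workload, cold chip
        delta = high - low
        # Swap only when it helps: the cold chip must end up below the
        # load that the hot chip had before the swap.
        if delta <= 0 or loads[cold] + delta >= loads[hot]:
            break
        chips[hot].remove(high)
        chips[cold].remove(low)
        chips[hot].append(low)
        chips[cold].append(high)
    return chips


# Hypothetical demands: chip 0 is over-committed against a limit of 10.
print(rebalance([[8, 7], [2, 1]], limit=10))
```

Because workloads are swapped rather than moved, every chip keeps the same number of workloads, which matches a setup where each core runs one job.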


Algorithm


Example


Resource Management Scheme


Results
• LLC misses: 25.7% average reduction compared with static-even partitions (with an associated 1.4% storage overhead).
• The BW-aware algorithm shows improvement up to an 8-CMP implementation; beyond that, returns diminish.
• Miss rates are consistent across different cache sizes, with a slight improvement due to the increased number of possible cache ways and hence of potential workload swapping candidates.


Results
• Memory Bandwidth: reduction of the average worst-case chip memory bandwidth in the system (per epoch).
• The figure of merit used is the long memory latencies associated with overcommitted memory bandwidth requirements by specific CMPs.
• The UCP+ algorithm (Marginal-Utility/Intra-chip) shows an average 19% improvement over static-even. (The improvement also increases with the number of CMPs, due to random workload selection of average worst-case bandwidth.)


Results
• Simulated Throughput: used to measure the effectiveness of the implementation
  • Case 1: use of only UCP+
  • Case 2: addition of the Inter-chip (workload swapping) BW-aware algorithm
• Case 1 shows 8.6% IPC and 15.3 MPKI improvements on Chips 4 and 7 (swapping high memory bandwidth benchmarks for less demanding ones).
• Case 2 shows 8.5% IPC and 11% MPKI improvements due to workload migration of overcommitted chip 7.


Comments
• No detailed hardware implementation of “non-invasive” profilers
• “Large” CMP systems not demonstrated due to complexity
• Good implementation of resource management
• Design limited (additional cores)
• Cache designs (other than set-associative)


References
[1] D. Kaseridis, J. Stuecheli, J. Chen and L. K. John, “A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems”.

[2] R. L. Mattson, “Evaluation techniques for storage hierarchies”. IBM Systems Journal, 9(2):78-117, 1970.

[3] www.utdallas.edu/~edsha/parallel/2010S/Cache-Overview.pdf


Discussion Topics
• How can an inter-board partitioning algorithm be implemented? Is it necessary?
• What causes diminished returns beyond 8 CMP chips? Can it be circumvented?