Data Access Profiling & Improved Structure Field Regrouping in Pegasus Vas Chellappa & Matt Moore May 2, 2005 / Optimizing Compilers / Project Poster Session
Feb 25, 2016
Introduction
Structure definitions group fields by semantics, not access contemporaneity
Data access profiling can be used to improve cache performance by reordering for contemporaneity
In this context, contemporaneity is a measure of how close in time two data accesses to structure fields occur
Problem Statement
Obtaining contemporaneity information for structure fields
Exploiting this information to improve the ordering of the fields
Doing this within the CASH/Pegasus environment
Approach
Pegasus Implementation
Data access profiling to track contemporaneous field accesses and build the Field Affinity Graphs
Modify the Simulator interface to SimpleScalar (a 3rd-party cache simulator) to achieve this
Regrouping Algorithm
Field Affinity Graphs built by the modified Simulator are then used to recommend reorderings based on a new regrouping algorithm
Project Design
[Figure: toolchain block diagram. Blocks and labels, in the order shown: C Source; SUIF/C2DIL; .cir file (Pegasus representation); Simulator (load/dump); RFU Simulator; Tagged Pegasus IR; SimpleScalar (libcachesim); Memory Accesses; Regrouper (libregroup); Contemporaneous Accesses; End of Simulation; Output: Cycles, Regroupings. Legend: each block is marked Unmodified, New, or Modified.]
Design Overview
1. Build stage: Tag structure field accesses in the Pegasus IR
2. Simulation stage: Propagate tag information through SimpleScalar to the new regroup library
3. Final stage: Invoke regrouping algorithm to calculate reordering recommendations
Build Stage, Tagging Accesses
Objective: Identify and tag structure field accesses in the Pegasus IR
Not trivial, since SUIF/C2DIL do not preserve required type information during transformation to IR
Need to identify patterns that indicate structure field accesses
Field Accesses in Pegasus
[Figure: a "+" node adds the structure pointer (the structure's base address) and the field offset; the result feeds a memory op (load/store). Wire labels: Structure pointer (structure specific), int, Address, Field type.]
Typical structure field access made through a structure pointer
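[Figure: a "+" node adds the structure address and the field offset; the result feeds a memory op (load/store). Wire labels: Address (int), int, Address, Field type.]
Typical structure field access made to a structure variable on the stack
[Figure: "Optimized to" a memory op (load/store) fed directly by a single structure address. Wire labels: Address, Field type.]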
Because the base address and the offset are both constants, they are folded into a single constant address. Nothing then indicates that the memory op is a structure field access, and no wire carries the structure type any longer. The type information is lost and cannot be recreated.
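The pointer-based pattern described above can be sketched as a small matcher. This is a minimal illustration over a hypothetical, simplified IR: the `node`, `wire_type`, and `field_tag` types and the `match_field_access` function are invented for this example and are not the Pegasus data structures or API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified IR; NOT the real Pegasus data structures. */
enum node_kind { NODE_CONST, NODE_VALUE, NODE_ADD, NODE_LOAD, NODE_STORE };
enum wire_type { TYPE_INT, TYPE_ADDRESS, TYPE_STRUCT_PTR };

struct node {
    enum node_kind kind;
    enum wire_type type;       /* type carried on the node's output wire */
    long constant;             /* meaningful only when kind == NODE_CONST */
    int struct_id;             /* meaningful only when type == TYPE_STRUCT_PTR */
    const struct node *in[2];  /* ADD uses both; LOAD/STORE use in[0] as the address */
};

/* Tag attached to a memory op: which structure, and which byte offset. */
struct field_tag { int struct_id; long offset; };

/* Returns true and fills *tag when `op` is a load/store whose address is
 * (structure pointer + constant offset); returns false otherwise. */
bool match_field_access(const struct node *op, struct field_tag *tag)
{
    assert(tag != NULL);
    if (op == NULL || (op->kind != NODE_LOAD && op->kind != NODE_STORE))
        return false;

    const struct node *addr = op->in[0];
    if (addr == NULL || addr->kind != NODE_ADD)
        return false;

    /* Addition is commutative, so try both input orders. */
    for (int i = 0; i < 2; i++) {
        const struct node *base = addr->in[i];
        const struct node *off = addr->in[1 - i];
        if (base != NULL && off != NULL &&
            base->type == TYPE_STRUCT_PTR && off->kind == NODE_CONST) {
            tag->struct_id = base->struct_id;
            tag->offset = off->constant;
            return true;
        }
    }
    return false;
}
```

A load whose address has already been folded into one constant has no add node and no structure-typed wire, so the matcher rejects it, which mirrors the stack-variable case.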
Actual Pegasus Illustration
int foo(struct my_t stestfoo)
{
    int retval = stestfoo.f2;
    return retval;
}
Which wire here should have struct type?
int foo(struct my_t *stestfoo)
{
    return stestfoo->f2;
}
Which wire here has struct type?
Simulation Process
Tag info on loads and stores is propagated through SimpleScalar to the regrouping library that builds the field affinity graph (done online, during simulation)
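One way to build the field affinity graph online is to keep the last access time of each field and count two fields as contemporaneous when their accesses fall within a fixed cycle window. The poster does not state how contemporaneity is measured, so the window rule, the `affinity_graph` layout, and the function names below are assumptions made for illustration only.

```c
#include <assert.h>
#include <string.h>

#define MAX_FIELDS 8

/* Affinity state for one structure type (hypothetical layout). */
struct affinity_graph {
    int nfields;
    long window;                          /* max cycle distance counted as contemporaneous */
    long last_access[MAX_FIELDS];         /* cycle of most recent access, -1 if never */
    long weight[MAX_FIELDS][MAX_FIELDS];  /* symmetric edge weights */
};

void affinity_init(struct affinity_graph *g, int nfields, long window)
{
    assert(nfields > 0 && nfields <= MAX_FIELDS);
    memset(g, 0, sizeof *g);
    g->nfields = nfields;
    g->window = window;
    for (int i = 0; i < nfields; i++)
        g->last_access[i] = -1;
}

/* Called once per tagged load/store, in simulation order. */
void affinity_record(struct affinity_graph *g, int field, long cycle)
{
    assert(field >= 0 && field < g->nfields);
    for (int other = 0; other < g->nfields; other++) {
        if (other == field || g->last_access[other] < 0)
            continue;
        /* Strengthen the edge to every field touched recently enough. */
        if (cycle - g->last_access[other] <= g->window) {
            g->weight[field][other]++;
            g->weight[other][field]++;
        }
    }
    g->last_access[field] = cycle;
}
```

Because each call only touches one row of counters, the graph can be updated as the simulator issues memory operations, with no trace stored for later.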
Regrouping Stage
After simulation, analyze collected profiling data to produce reordering recommendation
Previous work used a greedy heuristic; a better heuristic is possible
An optimal regrouping cannot be computed efficiently in general (the problem is NP-hard)
Field Affinity Graph (one per structure):
Vertices: fields in a structure
Edge weights: degree of contemporaneity of accesses between the two fields
Matching Heuristic
Find a maximum weight matching in the field affinity graph
Fields that will not fit into a cache line together anyway are identified and ignored
Structure is reordered by placing matched fields together
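The steps above can be sketched for small structures with an exhaustive dynamic program over subsets of fields. This is not the Mathprog weighted matching code the project references; it is a stand-in that returns the same optimum for a handful of fields. The name `max_weight_matching`, the `MAX_FIELDS` limit, and the size-based fit check are assumptions for this example.

```c
#include <assert.h>

#define MAX_FIELDS 8

/* Returns the maximum total weight of a set of disjoint field pairs.
 * Pairs whose combined size exceeds line_size are ignored.
 * partner[i] receives the field paired with i, or -1 if i is unpaired. */
long max_weight_matching(int n, long w[MAX_FIELDS][MAX_FIELDS],
                         const int size[MAX_FIELDS], int line_size,
                         int partner[MAX_FIELDS])
{
    assert(n >= 0 && n <= MAX_FIELDS);
    long best[1u << MAX_FIELDS];  /* best[mask]: optimum using only fields in mask */
    int pick[1u << MAX_FIELDS];   /* partner chosen for the lowest field in mask */
    unsigned full = (1u << n) - 1u;

    best[0] = 0;
    for (unsigned mask = 1; mask <= full; mask++) {
        int i = 0;
        while (!(mask & (1u << i)))
            i++;                          /* lowest-numbered field in mask */
        unsigned rest = mask & ~(1u << i);
        best[mask] = best[rest];          /* option: leave field i unpaired */
        pick[mask] = -1;
        for (int j = i + 1; j < n; j++) {
            if (!(rest & (1u << j)))
                continue;
            if (size[i] + size[j] > line_size)
                continue;                 /* can never share a line */
            long cand = w[i][j] + best[rest & ~(1u << j)];
            if (cand > best[mask]) {
                best[mask] = cand;
                pick[mask] = j;
            }
        }
    }

    /* Walk the recorded choices to recover the pairs. */
    for (int i = 0; i < n; i++)
        partner[i] = -1;
    unsigned mask = full;
    while (mask) {
        int i = 0;
        while (!(mask & (1u << i)))
            i++;
        int j = pick[mask];
        mask &= ~(1u << i);
        if (j >= 0) {
            partner[i] = j;
            partner[j] = i;
            mask &= ~(1u << j);
        }
    }
    return best[full];
}
```

The subset table grows as 2^n, so this only suits structures with few fields; a polynomial weighted matching algorithm is the practical choice beyond that. Placing each field next to its `partner` gives the recommended ordering.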
Greedy vs. Matching
[Figure: field affinity graph over f1, f2, f3, f4 with edge weights 1000, 900, 1, 900; shown once for each ordering together with the resulting cache layout (8 byte lines).]
Matching-Based Field Ordering
struct foo { int f1; int f3; int f2; int f4; };
Greedy Field Ordering
struct foo { int f1; int f2; int f3; int f4; };
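The gap between the two orderings can be reproduced with a simple greedy pairing that always takes the heaviest remaining edge. This is a simplified stand-in for the greedy approach of earlier work, not a reimplementation of it. The assignment of weights to edges (f1-f2 = 1000, f1-f3 = 900, f2-f4 = 900, f3-f4 = 1) is inferred from the two field orderings in the example and should be read as an assumption, as should the name `greedy_pairing`.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FIELDS 8

/* Repeatedly pairs the two unplaced fields joined by the heaviest edge.
 * Returns the total weight of the chosen pairs; partner[i] is i's pair or -1. */
long greedy_pairing(int n, long w[MAX_FIELDS][MAX_FIELDS], int partner[MAX_FIELDS])
{
    assert(n >= 0 && n <= MAX_FIELDS);
    bool placed[MAX_FIELDS] = { false };
    long total = 0;

    for (int i = 0; i < n; i++)
        partner[i] = -1;

    for (;;) {
        int bi = -1, bj = -1;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (placed[i] || placed[j])
                    continue;
                if (bi < 0 || w[i][j] > w[bi][bj]) {
                    bi = i;
                    bj = j;
                }
            }
        }
        if (bi < 0)
            break;  /* fewer than two unplaced fields remain */
        placed[bi] = placed[bj] = true;
        partner[bi] = bj;
        partner[bj] = bi;
        total += w[bi][bj];
    }
    return total;
}
```

With the weights above, greedy takes f1-f2 (1000) first and is then forced into f3-f4 (1), for a total of 1001. Pairing f1-f3 and f2-f4 instead gives 900 + 900 = 1800, which is the matching-based ordering.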
NP-Hardness
NP-Hardness is shown by reducing graph coloring problem to regrouping problem
[Figure: Reduction from K-Coloring to Regrouping (K cache lines); the example graphs carry edge weights of 1 and -1.]
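The poster does not spell out the reduction, so the following is only one plausible shape of it, stated as an assumption. Treat each vertex of the coloring instance as a field and each of the K colors as a cache line, give every edge of the coloring graph weight -1, and score a regrouping by summing the weights of all pairs placed in the same line. Ignoring line capacity, a score of 0 is then reachable exactly when the graph is K-colorable. The function name `regrouping_score` is hypothetical.

```c
#include <assert.h>

#define MAX_FIELDS 8

/* Sum of edge weights over all pairs of fields assigned to the same cache line.
 * line_of[i] is the cache line (equivalently, the color) chosen for field i. */
long regrouping_score(int n, long w[MAX_FIELDS][MAX_FIELDS],
                      const int line_of[MAX_FIELDS])
{
    assert(n >= 0 && n <= MAX_FIELDS);
    long score = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (line_of[i] == line_of[j])
                score += w[i][j];
    return score;
}
```

For a triangle with all three edges at -1, three lines allow a score of 0, while any assignment to two lines puts some adjacent pair together and scores below 0, matching the fact that a triangle is 3-colorable but not 2-colorable.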
Results
Implemented successfully to handle structure field accesses done through pointers (ptr->fld)
So far, only small programs have been tested
The recommended reordering is applied to the source by hand, and the program is run through the simulator again to obtain a cycle count for comparison
Results - Example
Original:
struct my_t {
    int f1;
    int f2;
    char nu[4096];
    int f3;
    int f4;
};
int foo(struct my_t *elt)
{
    int i;
    elt->f1 = 2;
    elt->f4 = 100;
    for (i = 0; i < 50; i++) {
        elt->f1++;
        elt->f4--;
    }
    return elt->f1 + elt->f4;
}
750 Cycles per Call
Modified:
struct my_t {
    int f1;
    int f4;
    int f2;
    char nu[4096];
    int f3;
};
int foo(struct my_t *elt) is unchanged from the original
745 Cycles per Call (one fewer cache miss)
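The saved cache miss can be checked from the field offsets alone. The sketch below renames the two layouts so both can coexist in one file, and uses a hypothetical helper, `same_cache_line`, that assumes the structure starts on a cache line boundary. The line size is a parameter because the simulated line size is not stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The two layouts from the example, renamed so both fit in one file. */
struct my_t_original { int f1; int f2; char nu[4096]; int f3; int f4; };
struct my_t_modified { int f1; int f4; int f2; char nu[4096]; int f3; };

/* True when two field offsets fall in the same cache line, assuming the
 * structure itself starts on a line boundary. */
bool same_cache_line(size_t off_a, size_t off_b, size_t line_size)
{
    assert(line_size > 0);
    return off_a / line_size == off_b / line_size;
}
```

In the original layout the 4096-byte array sits between `f1` and `f4`, so `same_cache_line(offsetof(struct my_t_original, f1), offsetof(struct my_t_original, f4), 32)` is false; in the modified layout the two hot fields are adjacent and the same check is true.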
Conclusion
Performance improvements are achievable even on simple programs using reorganization recommendations
Propagation of full type information from the source through SUIF/C2DIL would be required to optimize non-pointer accesses
Languages that expose less of the memory layout to the programmer would make it quick and easy to apply the reordering recommendations automatically
References
Trishul M. Chilimbi, Bob Davidson, and James R. Larus, "Cache-Conscious Structure Definition," in Proceedings of the ACM SIGPLAN '99 Conference on Programming Language Design and Implementation, pages 13-24, May 1999.
Mathprog (Weighted Matching Algorithm) http://elib.zib.de/pub/Packages/mathprog/matching/weighted/
Pegasus: http://www-2.cs.cmu.edu/~phoenix/
SUIF: http://suif.stanford.edu/
SimpleScalar Tool set: http://www.cs.wisc.edu/~mscalar/simplescalar.html