Slide 1
Platform-Independent Cache Optimization by Pinpointing Low-Locality Reuse
Kristof Beyls and Erik D’Hollander
International Conference on Computational Science
June 2004
Overview
1. Introduction
2. Reuse Distance
3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer?
4. Visualization
5. Case Studies
6. Conclusion
1.a Introduction
The anti-law of Moore: by 2003, half of execution time is lost due to data cache misses.
[Figure: relative speed versus 1980 (log scale, 1 to 1000) for processor and memory, 1980 to 2000, showing the widening processor/memory speed gap.]
1.b Observation: Capacity misses dominate
3 kinds of cache misses (3 C’s): Cold, Conflict, Capacity
[Bar chart: percentage of capacity misses (0% to 100%) per cache size (8K to 1024K) and associativity (1, 2, 4, 8, full), for SPEC2000; data from Cantin and Hill.]
Overview
1. Introduction
2. Reuse Distance
3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer?
4. Visualization
5. Case Studies
6. Conclusion
2.a Reuse Distance
Definition: The reuse distance of a memory access is the number of unique memory locations accessed since the previous access to the same data.
Example (reuse distance of each access):
address:  A  B  B  A  C  B  A
distance: ∞  ∞  0  1  ∞  2  2
2.b Reuse Distance - property
Lemma: In a fully associative LRU cache with n lines, an access hits the cache if and only if its reuse distance d < n.
Consequence: In every cache with n lines, a cache miss with distance d is a:
• Cold miss: d = ∞
• Capacity miss: n ≤ d < ∞
• Conflict miss: d < n
2.c Reuse distance histogram Spec95fp
[Two histograms: number of accesses (billions) versus log2(reuse distance) for SPEC95fp, split into hits and misses; the second plot zooms the y-axis to 0 to 3 billion.]
2.d Classifying cache misses SPEC95fp
[Figure: on the reuse distance axis, misses at distances beyond the cache size are capacity misses; misses below it are conflict misses.]
2.e Reuse distance vs. cache hit probability
[Plot: hit percentage (0% to 100%) versus log2(reuse distance), for a direct-mapped and a fully associative cache.]
Overview
1. Introduction
2. Reuse Distance
3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer?
4. Visualization
5. Case Studies
6. Conclusion
3a. Removing Capacity misses
1. Hardware: enlarge the cache.
2. Compiler: loop tiling, loop fusion.
3. Algorithm.
In each case, the reuse distance must become smaller than the cache size (CS).
[Histogram: number of accesses (billions) versus log2(reuse distance), hits and misses, with cache-size markers along the distance axis.]
3.b Compiler optimizations: SGIpro for Itanium (spec95fp)
[Histogram: number of misses (0 to 3x10^9) versus log2(reuse distance), original versus after optimization, with the conflict and capacity regions marked.]
30% of conflict misses are eliminated, but only 1% of capacity misses.
Overview
1. Introduction
2. Reuse Distance
3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer?
4. Visualization
5. Case Studies
6. Conclusion
4.a Objectives for cache visualization
• Cache behavior is shown in the source code.
• Cache behavior is presented accurately and concisely.
• Independent of specific cache parameters (e.g. size, associativity, …).
Reuse distance makes it possible to meet all of the above objectives.
4.b Example: MCF
for ( ; arc < stop_arcs; arc += nr_group ) {
    if ( arc->ident > BASIC ) {
        red_cost = bea_compute_red_cost( arc );
        if ( (red_cost < 0 && arc->ident == AT_LOWER) ||
             (red_cost > 0 && arc->ident == AT_UPPER) ) {
            basket_size++;
            perm[basket_size]->a = arc;
            perm[basket_size]->cost = red_cost;
            perm[basket_size]->abs_cost = ABS(red_cost);
        }
    }
}
68% of capacity misses
4.c Example: MCF
[Screenshot of the visualization: the highlighted code accounts for 68.3% of capacity misses; sl = 21%.]
Overview
1. Introduction
2. Reuse Distance
3. Optimizing Cache Behavior: By Hardware,Compiler or Programmer?
4. Visualization
5. Case Studies
6. Conclusion
5.a Optimization: classification
1. Eliminate memory accesses with poor locality (+++)
2. Reduce reuse distance (keep data in cache between use and reuse) (++)
3. Increase spatial locality (++)
4. Hide latency by prefetching (+)
5.b 3 case studies
From SPEC2000, three programs with a large memory bottleneck:
• Mcf (90%): optimization of bus schedules
• Art (87%): simulation of a neural network
• Equake (66%): simulation of an earthquake
The percentage is the fraction of execution time the processor stalls waiting for data from memory and cache (Itanium 1, 733 MHz).
5.c Equake
Every time step performs:
• a sparse matrix-vector multiplication
• a vector rescaling
Optimizations:
1. Long reuse distance between consecutive time steps: shorten the distance by performing multiple time steps on a limited part of the matrix.
2. Eliminated memory accesses: K[Anext][i][j] (3 accesses) becomes K[Anext*N*9 + 3*i + j] (1 access).
5.d Art (neural network)
• Poor spatial locality (0% to 20%).
• A neuron is a C struct containing 8 fields; every loop updates one field for each neuron.
Before (array of structs):
typedef struct {
    double I;
    …
    double R;
} f1_neuron;
f1_neuron f1_layer[N];
Access: f1_layer[y].W

After (struct of arrays):
typedef struct {
    double* I;
    …
    double* R;
} f1_neurons;
f1_neurons f1_layer;
Access: f1_layer.W[y]
5.e Mcf
• Reordering of accesses is hard.
• Therefore: prefetching
#define PREFETCH_DISTANCE 8
for ( ; arc < stop_arcs; arc += nr_group ) {
    PREFETCH( arc + nr_group * PREFETCH_DISTANCE );
    if ( arc->ident > BASIC ) {
        red_cost = bea_compute_red_cost( arc );
        if ( (red_cost < 0 && arc->ident == AT_LOWER) ||
             (red_cost > 0 && arc->ident == AT_UPPER) ) {
            basket_size++;
            perm[basket_size]->a = arc;
            perm[basket_size]->cost = red_cost;
            perm[basket_size]->abs_cost = ABS(red_cost);
        }
    }
}
5.f Measurements
[Bar chart: speedup (up to 12x) of mcf, art, and equake on AthlonXP, Alpha, and Itanium, plus the average per platform.]

Cache hierarchies (size, associativity):
processor      L1            L2            L3
AthlonXP       64K, 2-way    256K, 16-way  -
Alpha 21264    64K, 2-way    8M, 1-way     -
Itanium        16K, 4-way    96K, 6-way    2M, 4-way

Compilers:
AthlonXP: icc -O3
Alpha: cc -O5
Itanium: ecc -O3
5.g Reuse Distance Histograms
Art: [histogram: number of accesses (billions, 0 to 6) versus log2(reuse distance) from 0 to 17, original versus optimized.]
Equake: [histogram: number of accesses (billions, 0 to 10) versus log2(reuse distance) from 0 to 21, original versus optimized.]
Overview
1. Introduction
2. Reuse Distance
3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer?
4. Visualization
5. Case Studies
6. Conclusion
6. Conclusion
• Reuse distance predicts cache behaviour accurately.
• Compiler optimizations are not powerful enough to remove a substantial portion of the capacity misses.
• The programmer often has a global overview of program behaviour; however, cache behavior is invisible in the source code. Visualisation makes it visible there.
• Mcf, Art, Equake: 3x faster on average, on different CISC/RISC/EPIC platforms, with identical source code optimisations.
• Visualization of reuse distance enables portable and platform-independent cache optimisations.
Questions?