Slide 1
Platform-Independent Cache Optimization by Pinpointing Low-Locality Reuse
Kristof Beyls and Erik D’Hollander
International Conference on Computational Science
June 2004
Overview
1. Introduction
2. Reuse Distance
3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer?
4. Visualization
5. Case Studies
6. Conclusion
1.a Introduction
The anti-law of Moore: by 2003, half of execution time is lost due to data cache misses.
[Figure: relative speed versus 1980 (log scale, 1 to 1000) for processor and memory, 1980 to 2000, showing the widening processor/memory speed gap.]
1.b Observation: Capacity misses dominate
3 kinds of cache misses (3 C’s): Cold, Conflict, Capacity
[Bar chart: percentage of capacity misses (0% to 100%) per cache size (8K to 1024K) and associativity (1, 2, 4, 8, full), for SPEC2000; data from Cantin and Hill.]
Overview
1. Introduction
2. Reuse Distance
3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer?
4. Visualization
5. Case Studies
6. Conclusion
2.a Reuse Distance
Definition: The reuse distance of a memory access is the number of unique memory locations accessed since the previous access to the same data.
Example (reuse distance of each access):
address:  A  B  B  A  C  B  A
distance: ∞  ∞  0  1  ∞  2  2
2.b Reuse Distance - property
Lemma: In a fully associative LRU cache with n lines, an access hits the cache if and only if its reuse distance d < n.
Consequence: In every cache with n lines, a cache miss with distance d is a:
• Cold miss: d = ∞
• Capacity miss: n ≤ d < ∞
• Conflict miss: d < n
2.c Reuse distance histogram Spec95fp
[Two histograms: number of accesses (billions) versus log2(reuse distance) for SPEC95fp, split into hits and misses; the second plot zooms the y-axis to 0 to 3 billion.]
2.d Classifying cache misses SPEC95fp
[Figure: on the reuse distance axis, misses at distances beyond the cache size are capacity misses; misses below it are conflict misses.]
2.e Reuse distance vs. cache hit probability
[Plot: hit percentage (0% to 100%) versus log2(reuse distance), for a direct-mapped and a fully associative cache.]
Overview
1. Introduction
2. Reuse Distance
3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer?
4. Visualization
5. Case Studies
6. Conclusion
3a. Removing Capacity misses
1. Hardware: enlarge the cache.
2. Compiler: loop tiling, loop fusion.
3. Algorithm.
In each case, the reuse distance must become smaller than the cache size (CS).
[Histogram: number of accesses (billions) versus log2(reuse distance), hits and misses, with cache-size markers along the distance axis.]
3.b Compiler optimizations: SGIpro for Itanium (spec95fp)
[Histogram: number of misses (0 to 3x10^9) versus log2(reuse distance), original versus after optimization, with the conflict and capacity regions marked.]
30% of conflict misses are eliminated, but only 1% of capacity misses.
Overview
1. Introduction
2. Reuse Distance
3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer?
4. Visualization
5. Case Studies
6. Conclusion
4.a Objectives for cache visualization
• Cache behavior is shown in the source code.
• Cache behavior is presented accurately and concisely.
• Independent of specific cache parameters (e.g. size, associativity, …).
Reuse distance makes it possible to meet all of the above objectives.
4.b Example: MCF
for ( ; arc < stop_arcs; arc += nr_group ) {
    if ( arc->ident > BASIC ) {
        red_cost = bea_compute_red_cost( arc );
        if ( (red_cost < 0 && arc->ident == AT_LOWER) ||
             (red_cost > 0 && arc->ident == AT_UPPER) ) {
            basket_size++;
            perm[basket_size]->a = arc;
            perm[basket_size]->cost = red_cost;
            perm[basket_size]->abs_cost = ABS(red_cost);
        }
    }
}
68% of capacity misses
4.c Example: MCF
[Screenshot of the visualization: the highlighted code accounts for 68.3% of capacity misses; sl = 21%.]
Overview
1. Introduction
2. Reuse Distance
3. Optimizing Cache Behavior: By Hardware,Compiler or Programmer?
4. Visualization
5. Case Studies
6. Conclusion
5.a Optimization: classification
1. Eliminate memory accesses with poor locality (+++)
2. Reduce reuse distance (keep data in cache between use and reuse) (++)
3. Increase spatial locality (++)
4. Hide latency by prefetching (+)
5.b 3 case studies
From SPEC2000, three programs with a large memory bottleneck:
• Mcf (90%): optimization of bus schedules
• Art (87%): simulation of a neural network
• Equake (66%): simulation of an earthquake
The percentage is the fraction of execution time the processor stalls waiting for data from memory and cache (Itanium 1, 733 MHz).
5.c Equake
Every time step performs:
• a sparse matrix-vector multiplication
• a vector rescaling
Optimizations:
1. Long reuse distance between consecutive time steps: shorten the distance by performing multiple time steps on a limited part of the matrix.
2. Eliminated memory accesses: K[Anext][i][j] (3 accesses) becomes K[Anext*N*9 + 3*i + j] (1 access).
5.d Art (neural network)
• Poor spatial locality (0% to 20%).
• A neuron is a C struct containing 8 fields; every loop updates one field for each neuron.
Before (array of structs):
typedef struct {
    double I;
    …
    double R;
} f1_neuron;
f1_neuron f1_layer[N];
Access: f1_layer[y].W

After (struct of arrays):
typedef struct {
    double* I;
    …
    double* R;
} f1_neurons;
f1_neurons f1_layer;
Access: f1_layer.W[y]
5.e Mcf
• Reordering of accesses is hard.
• Therefore: prefetching
#define PREFETCH_DISTANCE 8
for ( ; arc < stop_arcs; arc += nr_group ) {
    PREFETCH( arc + nr_group * PREFETCH_DISTANCE );
    if ( arc->ident > BASIC ) {
        red_cost = bea_compute_red_cost( arc );
        if ( (red_cost < 0 && arc->ident == AT_LOWER) ||
             (red_cost > 0 && arc->ident == AT_UPPER) ) {
            basket_size++;
            perm[basket_size]->a = arc;
            perm[basket_size]->cost = red_cost;
            perm[basket_size]->abs_cost = ABS(red_cost);
        }
    }
}
5.f Measurements
[Bar chart: speedup (up to 12x) of mcf, art, and equake on AthlonXP, Alpha, and Itanium, plus the average per platform.]

Cache hierarchies (size, associativity):
processor      L1            L2            L3
AthlonXP       64K, 2-way    256K, 16-way  -
Alpha 21264    64K, 2-way    8M, 1-way     -
Itanium        16K, 4-way    96K, 6-way    2M, 4-way

Compilers:
AthlonXP: icc -O3
Alpha: cc -O5
Itanium: ecc -O3
5.g Reuse Distance Histograms
Art: [histogram: number of accesses (billions, 0 to 6) versus log2(reuse distance) from 0 to 17, original versus optimized.]
Equake: [histogram: number of accesses (billions, 0 to 10) versus log2(reuse distance) from 0 to 21, original versus optimized.]
Overview
1. Introduction
2. Reuse Distance
3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer?
4. Visualization
5. Case Studies
6. Conclusion
6. Conclusion
• Reuse distance predicts cache behaviour accurately.
• Compiler optimizations are not powerful enough to remove a substantial portion of the capacity misses.
• The programmer often has a global overview of program behaviour; however, cache behavior is invisible in the source code. Visualisation makes it visible there.
• Mcf, Art, Equake: 3x faster on average, on different CISC/RISC/EPIC platforms, with identical source code optimisations.
• Visualization of reuse distance enables portable and platform-independent cache optimisations.
Questions?