1 Software Performance Optimisation Group Imperial College, London Improving the Performance of Morton Layout by Array Alignment and Loop Unrolling Reducing the Price of Naivety Jeyarajan Thiyagalingam Olav Beckmann and Paul H.J. Kelly Software Performance Optimisation Group, Imperial College, London
Improving the Performance of Morton Layout by Array Alignment and
Loop Unrolling
Reducing the Price of Naivety
Jeyarajan Thiyagalingam
Olav Beckmann and Paul H.J. Kelly
Software Performance Optimisation Group,
Imperial College, London
Motivation
• Consider two code variants of a matrix multiply
Morton storage layout is unbiased towards either row- or column-major traversal.
Row-major traversal
Block size    RM array    Morton array
32B           75%         50%
128B          93.7%       75%
8kB page      99.9%       96.87%

Column-major traversal
Block size    RM array    Morton array
32B           0%          50%
128B          0%          75%
8kB page      0%          96.87%
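The percentages in the two tables are consistent with a simple counting model, sketched here in C. The model rests on assumptions that the slide does not state: 8-byte doubles (so 32B, 128B and 8kB hold 4, 16 and 1024 elements), block-aligned arrays, an array side at least one block long, and an access counted as a hit when it falls in the same block as the previous access. All function names are illustrative.

```c
#include <stdint.h>

/* Spread the low 16 bits of x so that bit b moves to bit 2b. */
static uint32_t dilate(uint32_t x)
{
    x &= 0xFFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Morton offset with the bits of i in odd positions and of j in even ones. */
static uint32_t morton_offset(uint32_t i, uint32_t j)
{
    return (dilate(i) << 1) | dilate(j);
}

/* Fraction of accesses that land in the same block as the previous access.
 * n: side of the square array (power of two, at most 65535)
 * block: block size in elements
 * use_morton: 0 = row-major array, 1 = Morton array
 * by_column: 0 = row-major traversal, 1 = column-major traversal */
double hit_rate(uint32_t n, uint32_t block, int use_morton, int by_column)
{
    uint64_t hits = 0;
    uint32_t prev_block = UINT32_MAX;   /* no block seen yet */
    for (uint32_t outer = 0; outer < n; outer++) {
        for (uint32_t inner = 0; inner < n; inner++) {
            uint32_t i = by_column ? inner : outer;
            uint32_t j = by_column ? outer : inner;
            uint32_t offset = use_morton ? morton_offset(i, j) : i * n + j;
            uint32_t blk = offset / block;
            if (blk == prev_block)
                hits++;
            prev_block = blk;
        }
    }
    return (double)hits / ((double)n * (double)n);
}
```

With n = 1024, hit_rate(1024, 4, 0, 0) evaluates to 0.75 and hit_rate(1024, 4, 1, 0) to 0.5, in line with the 32B row of the row-major traversal table.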
So have we solved the problem?
• Unfortunately, the basic Morton Scheme often performs disappointingly.
• At least Morton does not seem to suffer from pathological drops in performance.
Alignment
The statement that Morton layout is unbiased turns out to rest on the assumption that a cache line maps to the start of a Morton block.
Offset  0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15
(i,j)   (0,0) (0,1) (1,0) (1,1) (0,2) (0,3) (1,2) (1,3) (2,0) (2,1) (3,0) (3,1) (2,2) (2,3) (3,2) (3,3)
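To make the alignment assumption concrete, the following C sketch counts how many distinct cache lines one row or column of the 4x4 example touches. The line size of 4 elements and the displacement `shift` of the base address (in elements) are illustrative assumptions, as are all names in the code.

```c
/* Morton offsets of the 4x4 example, indexed as [i][j]. */
static const unsigned morton4x4[4][4] = {
    { 0,  1,  4,  5},
    { 2,  3,  6,  7},
    { 8,  9, 12, 13},
    {10, 11, 14, 15},
};

/* Number of distinct cache lines touched by one row (by_column = 0) or one
 * column (by_column = 1) of the 4x4 example.
 * index: which row or column (0..3)
 * line:  cache line size in elements
 * shift: displacement of the base address in elements
 * Assumes (15 + shift) / line < 32 so that a line number fits the bit mask. */
unsigned lines_touched(unsigned index, unsigned line, unsigned shift,
                       int by_column)
{
    unsigned seen = 0;   /* bit b is set once line b has been counted */
    unsigned count = 0;
    for (unsigned t = 0; t < 4; t++) {
        unsigned offset = by_column ? morton4x4[t][index]
                                    : morton4x4[index][t];
        unsigned l = (offset + shift) / line;
        if (!(seen & (1u << l))) {
            seen |= 1u << l;
            count++;
        }
    }
    return count;
}
```

With lines of 4 elements and an aligned base, every row and every column touches 2 lines. Displacing the base by one element makes row 1 touch 3 lines, and displacing it by two elements makes column 0 touch 4 lines while row 0 still touches 2, so the layout is no longer unbiased.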
Alignment
• It turns out that Morton layout is unbiased only for even power-of-two cache line sizes.
• The same problems arise when the base address is misaligned.
Alignment
• We calculated miss-rates systematically for all levels of the memory hierarchy.
• In each case, we calculated the miss-rates for all possible alignments of the base address.
• The difference in miss-rates between the best and worst alignment of the base address of Morton arrays can be up to a factor of 1.5 for even power-of-two cache lines and a factor of 2 for odd power-of-two cache lines.
Alignment
The overall miss-rates drop exponentially with block size, but access times are generally assumed to increase geometrically with block size.
[Figure: Morton-order miss rates for row-major and column-major traversal. x-axis: block size in double words (4 to 1024); y-axis: miss rate (0 to 1.2); series: RM maximum, RM minimum, CM maximum, CM minimum.]
Alignment
• With canonical layouts, it is often necessary to pad the row or column length in order to avoid pathological behaviour. Finding the right amount of padding is not trivial.
• Theoretically, one should align the base address of Morton arrays to the largest significant block size in the memory hierarchy, i.e. the page size.
• Aliasing in the memory hierarchy can spoil the theory. For example, on the Pentium 4, the following aliasing patterns cause problems:
  2K: map to the same L1 cache line
  16K: aliases in store-forwarding logic
  32K: map to the same L2 cache line
  64K: indistinguishable in L1 cache
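The theoretical recommendation (align the base address to the page size) can be sketched with C11 `aligned_alloc`. The 8kB page size is taken from the earlier table and is an assumption about the target machine; `alloc_morton_array` and `ASSUMED_PAGE_BYTES` are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumption: 8kB pages, as in the earlier hit-rate table. */
#define ASSUMED_PAGE_BYTES ((size_t)8192)

/* Allocate an n x n array of doubles whose base address is page-aligned.
 * Returns NULL on failure; release the result with free(). */
double *alloc_morton_array(size_t n)
{
    size_t bytes = n * n * sizeof(double);
    /* Round the size up to a multiple of the alignment, which aligned_alloc
     * expects in its original C11 specification. */
    size_t rounded = (bytes + ASSUMED_PAGE_BYTES - 1)
                     / ASSUMED_PAGE_BYTES * ASSUMED_PAGE_BYTES;
    return aligned_alloc(ASSUMED_PAGE_BYTES, rounded);
}
```

This only implements the page-alignment rule. It does nothing about the aliasing patterns listed above, which may call for a different base address in practice.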
Address calculation
With lexicographic (aka canonical) layout, it's easy to calculate the offset S of A[i,j] in an N×M array A:
$S_{rm}(i,j) = N i + j \qquad S_{cm}(i,j) = i + M j$
(if N and M are powers of two, this is bit-wise concatenation of i and j)
• In loops, the multiplication is replaced by an increment.
• When unrolling loops, the address calculation can be strength-reduced.
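As a minimal illustration of replacing the multiplication by an increment, the two C functions below sum one column of a row-major array. The parameter `row_len` plays the role of the multiplier in the row-major formula; the function names and the choice of a column sum are illustrative only.

```c
#include <stddef.h>

/* Sum column j of a row-major array; the offset is recomputed with a
 * multiplication on every iteration. */
double column_sum_multiply(const double *a, size_t rows, size_t row_len,
                           size_t j)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; i++)
        s += a[row_len * i + j];
    return s;
}

/* Same traversal after strength reduction: the offset is advanced by the
 * row length instead of being recomputed. */
double column_sum_increment(const double *a, size_t rows, size_t row_len,
                            size_t j)
{
    double s = 0.0;
    size_t offset = j;
    for (size_t i = 0; i < rows; i++, offset += row_len)
        s += a[offset];
    return s;
}
```

Compilers usually perform this transformation themselves for lexicographic layouts, which is part of why the same question is harder for Morton offsets.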
How can we calculate the Morton offset?
Address calculation
• Morton indices can be calculated by applying the bit-concatenation idea of RM/CM for power-of-two arrays recursively.
• For a 2x2 array, if i and j are the indices, then the location is (i << 1) | j.
• Let $D_0(i) = i_n 0 \ldots i_1 0\, i_0 0$
• Let $D_1(i) = 0\, i_n \ldots 0\, i_1 0\, i_0$
• Then $S_{mz}(i,j) = D_0(i) \mid D_1(j)$
• Dilation is rather expensive for the inner loop
• Strength reduction (Wise et al.):
$D_0(i+1) = ((D_0(i) \mid \mathit{Ones}_0) + 1) \mathbin{\&} \mathit{Ones}_1$
$D_1(i+1) = ((D_1(i) \mid \mathit{Ones}_1) + 1) \mathbin{\&} \mathit{Ones}_0$
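The dilation functions and the increment rule can be sketched in C as follows. This assumes 32-bit offsets and indices below 65536, and it reads Ones0 as the mask with ones in the even bit positions and Ones1 as the mask with ones in the odd positions, which is the reading under which the two update formulas work. The function names are illustrative.

```c
#include <stdint.h>

#define ONES0 0x55555555u   /* assumed meaning of Ones0: ones in even bits */
#define ONES1 0xAAAAAAAAu   /* assumed meaning of Ones1: ones in odd bits  */

/* D1: spread the low 16 bits of i into the even bit positions. */
uint32_t d1(uint32_t i)
{
    i &= 0xFFFFu;
    i = (i | (i << 8)) & 0x00FF00FFu;
    i = (i | (i << 4)) & 0x0F0F0F0Fu;
    i = (i | (i << 2)) & 0x33333333u;
    i = (i | (i << 1)) & 0x55555555u;
    return i;
}

/* D0: the same bits moved to the odd positions. */
uint32_t d0(uint32_t i)
{
    return d1(i) << 1;
}

/* Increment a dilated index without undoing the dilation: fill the unused
 * positions with ones so the carry ripples across them, then mask them off. */
uint32_t d0_next(uint32_t d) { return ((d | ONES0) + 1u) & ONES1; }
uint32_t d1_next(uint32_t d) { return ((d | ONES1) + 1u) & ONES0; }

/* Morton offset of element (i, j). */
uint32_t smz(uint32_t i, uint32_t j)
{
    return d0(i) | d1(j);
}
```

For example, smz(2, 1) is 9, matching the entry for (2,1) in the 4x4 offset table, and d0_next(d0(i)) equals d0(i + 1).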
Address calculation
Idea: use lookup tables for $D_0(i)$ and $D_1(j)$
A[MortonTabEven[i] + MortonTabOdd[j]]
When can we do strength reduction? In general $S_{mz}(i,j+1)$ could be anywhere
$D_0(i + 1) = \;???$
$D_0(i + k) = D_0(i) + D_0(k)$ if i's and k's bits do not overlap.
We can do strength reduction
$D_0(i + k) = D_0(i) + D_0(k)$ as long as $i = 2^n$ and $k < 2^n$
With this, we can do loop unrolling
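A small C sketch of the table idea and of the additivity property follows. The table names `tab_d0` and `tab_d1` are hypothetical, chosen so as not to guess how MortonTabEven and MortonTabOdd are filled, and the table size of 256 is arbitrary. The property holds for any i that is a multiple of 2^n with k < 2^n, since the bits of i and k then cannot overlap; the sketch uses n = 2, i.e. an unroll factor of 4.

```c
#include <stdint.h>

#define TAB_SIZE 256u   /* hypothetical table size */

static uint32_t tab_d0[TAB_SIZE];   /* index bits in the odd positions  */
static uint32_t tab_d1[TAB_SIZE];   /* index bits in the even positions */

/* Fill both tables using the dilated-increment rule. */
void build_tables(void)
{
    uint32_t d0 = 0, d1 = 0;
    for (uint32_t i = 0; i < TAB_SIZE; i++) {
        tab_d0[i] = d0;
        tab_d1[i] = d1;
        d0 = ((d0 | 0x55555555u) + 1u) & 0xAAAAAAAAu;
        d1 = ((d1 | 0xAAAAAAAAu) + 1u) & 0x55555555u;
    }
}

/* Morton offset of (i, j): two lookups and an add. */
uint32_t morton_lookup(uint32_t i, uint32_t j)
{
    return tab_d0[i] + tab_d1[j];
}

/* Sum four consecutive elements of row i starting at column k, where k must
 * be a multiple of 4. Because the two low bits of k are clear, the dilated
 * value of k + u is the dilated value of k plus that of u for u < 4, and the
 * dilated values of 1, 2, 3 in the even positions are the constants 1, 4, 5. */
double sum4_in_row(const double *a, uint32_t i, uint32_t k)
{
    uint32_t base = tab_d0[i] + tab_d1[k];
    return a[base] + a[base + 1] + a[base + 4] + a[base + 5];
}
```

Only one pair of lookups is needed per group of four elements; the remaining offsets are compile-time constants added to `base`, which is what makes unrolling pay off.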
Unrolled Code with Strength-Reduction
double mmijk_unrolled(unsigned sz, FLOATTYPE *A, FLOATTYPE *B, FLOATTYPE *C)
{
  unsigned i, j, k;
  for (i = 0; i < sz; i++) {
    unsigned int t1i = MortonTabOdd[i];
    for (j = 0; j < sz; j++) {
      unsigned int t0j = MortonTabEven[j];
      for (k = 0; k < sz; k += 4) {
        unsigned int t0k = MortonTabEven[k];
        unsigned int t1k = MortonTabOdd[k];