How to Compute and Prove Lower and Upper Bounds on the Communication Costs of Your Algorithm
Part II: Geometric Embedding

Oded Schwartz
CS294, Fall 2011: Communication-Avoiding Algorithms, Lecture #3

Based on:
D. Irony, S. Toledo, and A. Tiskin: Communication lower bounds for distributed-memory matrix multiplication.
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz: Minimizing communication in linear algebra.

Last time: the models

Two kinds of costs:
- Arithmetic (FLOPs)
- Communication: moving data
  - between levels of a memory hierarchy (sequential case)
  - over a network connecting processors (parallel case)

[Figure: the sequential model (CPU, cache, RAM), the parallel model (CPUs with local RAM on a network), and a memory hierarchy M_1, M_2, M_3, ..., M_k]
Last time: Communication Lower Bounds

Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Proving that your algorithm/implementation is as good as it gets.

Last time: lower bounds for matrix multiplication

Bandwidth: BW = Ω(n^3 / M^{1/2})
- [Hong & Kung 81]: sequential
- [Irony, Toledo, Tiskin 04]: sequential and parallel
Latency: divide the bandwidth bound by M, since each message carries at most M words.
Last time: Reduction (1st approach) [Ballard, Demmel, Holtz, S. 2009a]

Thm: Cholesky and LU decompositions are (communication-wise) as hard as matrix multiplication.
Proof: by a reduction (from matrix multiplication) that preserves communication bandwidth, latency, and arithmetic.
Cor: any classical O(n^3) algorithm for Cholesky or LU decomposition requires
- Bandwidth: Ω(n^3 / M^{1/2})
- Latency: Ω(n^3 / M^{3/2})
(A similar corollary holds in the parallel model.)
Today: Communication Lower Bounds

Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Proving that your algorithm/implementation is as good as it gets.

Lower bounds for matrix multiplication using geometric embedding:
- [Hong & Kung 81]: sequential
- [Irony, Toledo, Tiskin 04]: sequential and parallel
Now: prove both, using the geometric embedding approach of [Irony, Toledo, Tiskin 04].
Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]

Matrix multiplication form:
  ∀(i,j) ∈ [n] x [n]:  C(i,j) = Σ_k A(i,k) · B(k,j)

Thm: if an algorithm agrees with this form (regardless of the order of computation), then
  BW = Ω(n^3 / M^{1/2}), and
  BW = Ω(n^3 / (P · M^{1/2})) in the P-parallel model.
[Figure: a run's timeline of reads, writes, and FLOPs, partitioned into segments S1, S2, S3, ... of M reads/writes each; example with M = 3]

For a given run (algorithm, machine, input):
1. Partition the computation into segments of M reads/writes each.
2. Any segment S then has at most 3M inputs/outputs available.
3. Show that the number of multiplications performed in a segment S is at most some k.
4. The total communication is then
   BW = (BW of one segment) · #segments ≥ M · (#mults / k).
Thm (Loomis & Whitney, 1949): for a box, V = xyz = (xz · zy · yx)^{1/2}; in general, the volume of a 3D set V satisfies
  V ≤ (area(A shadow) · area(B shadow) · area(C shadow))^{1/2},
where the three shadows are the set's projections onto the coordinate planes.

[Figure: a box with side lengths x, y, z, and a general 3D set V with its three shadows]

The connection to the matrix multiplication form: each multiplication A(i,k) · B(k,j) is a lattice point (i,j,k), and the three shadows of a set of such points are exactly the A, B, and C entries it touches.
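As a quick, self-contained sanity check of the inequality (not part of the lecture's proof; the set sizes and coordinate ranges below are arbitrary), one can sample random lattice sets and compare each set's size against the geometric mean bound of its shadows:

```python
# Brute-force check of Loomis-Whitney: |T| <= sqrt(Nx * Ny * Nz)
import math
import random

def loomis_whitney_holds(T):
    """T is a set of integer (x, y, z) lattice points (unit cubes)."""
    n = len(T)
    nx = len({(y, z) for (x, y, z) in T})  # shadow on the x = 0 plane
    ny = len({(x, z) for (x, y, z) in T})  # shadow on the y = 0 plane
    nz = len({(x, y) for (x, y, z) in T})  # shadow on the z = 0 plane
    return n <= math.sqrt(nx * ny * nz) + 1e-9

random.seed(0)
for trial in range(1000):
    T = {(random.randrange(6), random.randrange(6), random.randrange(6))
         for _ in range(random.randrange(1, 80))}
    assert loomis_whitney_holds(T)
print("Loomis-Whitney held on 1000 random lattice sets")
```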
Back to the segment argument: the multiplications performed in a segment form a 3D set whose shadows (the available A, B, and C entries) have total size at most 3M, so by Loomis & Whitney a segment performs at most k = (3M)^{3/2} multiplications. Hence

5. BW ≥ M · n^3 / (3M)^{3/2} = Ω(n^3 / M^{1/2}).
From Sequential Lower Bound to Parallel Lower Bound

We showed: any classical O(n^3) algorithm for matrix multiplication in the sequential model requires
- Bandwidth: Ω(n^3 / M^{1/2})
- Latency: Ω(n^3 / M^{3/2})

Cor: any classical O(n^3) algorithm for matrix multiplication on a P-processor machine (with balanced workload) requires, with a 2D layout (M = O(n^2 / P)):
- Bandwidth: Ω(n^3 / (P · M^{1/2})) = Ω(n^2 / P^{1/2})
- Latency: Ω(n^3 / (P · M^{3/2})) = Ω(P^{1/2})
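For intuition, here are these formulas as a tiny calculator (a sketch only; constants are suppressed exactly as in the Ω-notation, and the sample n, M, P are arbitrary):

```python
# Matmul communication lower bounds, sequential and parallel (2D layout)
def sequential_bounds(n, M):
    return {"bandwidth": n**3 / M**0.5, "latency": n**3 / M**1.5}

def parallel_bounds(n, P):
    M = n**2 / P                               # 2D layout: memory per processor
    return {"bandwidth": n**3 / (P * M**0.5),  # = n^2 / sqrt(P)
            "latency":   n**3 / (P * M**1.5)}  # = sqrt(P)

print(sequential_bounds(n=4096, M=2**20))  # e.g. a fast memory of 2^20 words
print(parallel_bounds(n=4096, P=64))
```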
Proof: observe one processor. By the balanced-workload assumption it performs n^3/P of the multiplications with local memory of size M; apply the sequential segment argument to it.

Is this always true? That is: let Alg be an algorithm with communication lower bound B = B(n, M). Does any parallel implementation of Alg have a communication lower bound B(n, M, P) = B(n, M)/P?

Proof of the Loomis-Whitney inequality
Let T be a 3D set of 1x1x1 cubes on the lattice, and N = |T| the number of cubes.
Let T_x be the projection of T onto the plane x = 0, and N_x = |T_x| the number of squares in T_x; similarly T_y, N_y, T_z, N_z.
Goal: N ≤ (N_x · N_y · N_z)^{1/2}.

Write T(x=i) for the subset of T with x = i, write T(x=i | y) for the projection of T(x=i) onto the plane y = 0, and let N(x=i) = |T(x=i)|, etc. Then:

N = Σ_i N(x=i)
  = Σ_i (N(x=i))^{1/2} · (N(x=i))^{1/2}
  ≤ Σ_i (N_x)^{1/2} · (N(x=i))^{1/2}                              [the cubes of a slab have distinct (y,z), so N(x=i) ≤ N_x]
  ≤ (N_x)^{1/2} · Σ_i (N(x=i | y) · N(x=i | z))^{1/2}             [N(x=i) ≤ N(x=i | y) · N(x=i | z)]
  = (N_x)^{1/2} · Σ_i (N(x=i | y))^{1/2} · (N(x=i | z))^{1/2}
  ≤ (N_x)^{1/2} · (Σ_i N(x=i | y))^{1/2} · (Σ_i N(x=i | z))^{1/2} [Cauchy-Schwarz]
  = (N_x)^{1/2} · (N_y)^{1/2} · (N_z)^{1/2}.

[Figure: the slab T(x=i) and its projections T(x=i | y) and T(x=i | z)]
Communication Lower Bounds

Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Proving that your algorithm/implementation is as good as it gets.

Next: how to generalize this lower bound.
Matrix multiplication form:
  ∀(i,j) ∈ [n] x [n]:  C(i,j) = Σ_k A(i,k) · B(k,j)

(1) Generalized form:
  ∀(i,j) ∈ S:  C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), ..., for k1, k2, ... ∈ S_ij; other arguments ),
where f_ij and g_{i,j,k} are nontrivial functions.

Notes:
- C(i,j) denotes any unique memory location; the same holds for A(i,k) and B(k,j). A, B, and C may overlap.
- The lower bound then holds for all reorderings of the computation, incorrect ones too.
- It does assume that each operand generates a load/store. It turns out that QR, eig, and SVD algorithms may all violate this assumption; they need a different analysis (not today).
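To make Form (1) concrete, here is the classical triple loop read as an instance of it (a sketch; the function and variable names are illustrative, not from the paper). Each inner operation is one g_{i,j,k}, f_ij is the sum that reduces them, and G is simply the number of g-operations:

```python
# Matmul as Form (1): g_{i,j,k} = A(i,k) * B(k,j); f_ij = sum; G = n^3
def matmul_form1(A, B, n):
    C = [[0.0] * n for _ in range(n)]
    G = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]  # one g_{i,j,k}, folded into f_ij
                G += 1
    return C, G

_, G = matmul_form1([[1.0] * 8] * 8, [[2.0] * 8] * 8, 8)
M = 16
print("G =", G, "; bandwidth lower bound ~ G / sqrt(M) =", G / M**0.5)
```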
Geometric Embedding (2nd approach)

Thm [Ballard, Demmel, Holtz, S. 2011a]: if an algorithm agrees with Form (1), then
  BW = Ω(G / M^{1/2}),  where G = |{g_{i,j,k} : (i,j) ∈ S, k ∈ S_ij}|,
and BW = Ω(G / (P · M^{1/2})) in the P-parallel model.
Example: application to Cholesky decomposition

Cholesky fits Form (1): in the classical algorithm the update A(i,j) := A(i,j) − A(i,k) · A(j,k) plays the role of g_{i,j,k} (here A, B, and C all overlap in the same array, which Form (1) permits), and f_ij combines the updates with a square root or a division. Counting the updates gives G = Θ(n^3).
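A sketch of that count, assuming the standard lower-triangular loop bounds (an illustration, not the lecture's derivation); it confirms G ≈ n^3/6, which is still Θ(n^3), so the Ω(n^3 / M^{1/2}) bandwidth bound applies to Cholesky as well:

```python
# Count the g_{i,j,k} updates A(i,j) -= A(i,k) * A(j,k) of classical Cholesky
def count_g_cholesky(n):
    G = 0
    for k in range(n):
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):  # lower-triangular update only
                G += 1                      # one g_{i,j,k} per update
    return G

n = 64
print(count_g_cholesky(n), "updates; n^3/6 =", n**3 / 6)
```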
From Sequential Lower Bound to Parallel Lower Bound

We showed: any algorithm that agrees with Form (1) requires, in the sequential model,
- Bandwidth: Ω(G / M^{1/2})
- Latency: Ω(G / M^{3/2})
where G is the number of g_{i,j,k} operations.

Cor: any algorithm that agrees with Form (1), on a P-processor machine where at least two processors each perform a Θ(1/P) fraction of the G operations, requires
- Bandwidth: Ω(G / (P · M^{1/2}))
- Latency: Ω(G / (P · M^{3/2}))
[Ballard, Demmel, Holtz, S. 2011a] Follows [Irony,Toledo,Tiskin
04], based on [Loomis & Whitney 49] Lower bounds: for
algorithms with flavor of 3 nested loops BLAS, LU, Cholesky, LDL T,
and QR factorizations, eigenvalues and SVD, i.e., essentially all
direct methods of linear algebra. Dense or sparse matrices In
sparse cases: bandwidth is a function NNZ. Bandwidth and latency.
Sequential, hierarchical, and parallel distributed and shared
memory models. Compositions of linear algebra operations. Certain
graph optimization problems [Demmel, Pearson, Poloni, Van Loan,
11], [Ballard, Demmel, S. 11] Tensor contraction For dense: 21 Do
Do conventional dense algorithms, as implemented in LAPACK and ScaLAPACK, attain these bounds? Mostly not.
Are there other algorithms that do? Mostly yes.
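Before the summary tables, here is what an attaining algorithm looks like in the simplest case: a minimal cache-blocked matmul sketch (block size and test sizes are illustrative). With block size b = sqrt(M/3), so that one block of each of A, B, and C fits in fast memory of size M, the traffic is O(n^3/b) = O(n^3 / sqrt(M)) words, matching the lower bound up to a constant:

```python
import numpy as np

def blocked_matmul(A, B, M):
    n = A.shape[0]
    b = max(1, int((M / 3) ** 0.5))  # block size: 3 * b^2 <= M
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # one block of each of A, B, C is "in fast memory" here
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B, M=3 * 16**2), A @ B)
```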
Dense Linear Algebra: Sequential Model

All lower bounds below (bandwidth Ω(n^3 / M^{1/2}), latency Ω(n^3 / M^{3/2})) follow from [Ballard, Demmel, Holtz, S. 11]; the listed algorithms attain them.

Algorithm | Attaining algorithm(s)
Matrix multiplication | [Frigo, Leiserson, Prokop, Ramachandran 99]
Cholesky | [Ahmad, Pingali 00], [Ballard, Demmel, Holtz, S. 09]
LU | [Toledo 97], [DGX08]
QR | [EG98], [DGHL08a]
Symmetric eigenvalues | [Ballard, Demmel, Dumitriu 10]
SVD | [Ballard, Demmel, Dumitriu 10]
(Generalized) nonsymmetric eigenvalues | [Ballard, Demmel, Dumitriu 10]
Dense 2D parallel algorithms

Assume n x n matrices on P processors, memory per processor M = O(n^2 / P). The ScaLAPACK figures assume the best block size b is chosen. (Many references; see the reports.)

Recall the lower bounds: #words_moved = Ω(n^2 / P^{1/2}) and #messages = Ω(P^{1/2}).

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX08] | log P | log P
LU | ScaLAPACK | log P | (N / P^{1/2}) · log P
QR | [DGHL08] | log P | log^3 P
QR | ScaLAPACK | log P | (N / P^{1/2}) · log P
Sym Eig, SVD | [BDD10] | log P | log^3 P
Sym Eig, SVD | ScaLAPACK | log P | N / P^{1/2}
Nonsym Eig | [BDD10] | log P | log^3 P
Nonsym Eig | ScaLAPACK | P^{1/2} · log P | N · log P

Relaxing the M = O(n^2 / P) assumption: 2.5D algorithms [Solomonik & Demmel 11].
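For the first table row, here is a serial numpy simulation of Cannon's algorithm (the grid size p = 4 and the matrix sizes are arbitrary; a real implementation distributes the blocks across processors). In each of the p = sqrt(P) rounds every "processor" multiplies its resident blocks and then passes one block of A left and one block of B up, which is how the algorithm attains both lower bounds with constant factor 1:

```python
import numpy as np

def blk(X, I, J, b):
    return X[I*b:(I+1)*b, J*b:(J+1)*b].copy()

def cannon_matmul(A, B, p):
    n = A.shape[0]; b = n // p
    # initial skew: processor (I, J) holds A(I, I+J) and B(I+J, J)
    Ablk = {(I, J): blk(A, I, (I + J) % p, b) for I in range(p) for J in range(p)}
    Bblk = {(I, J): blk(B, (I + J) % p, J, b) for I in range(p) for J in range(p)}
    Cblk = {(I, J): np.zeros((b, b)) for I in range(p) for J in range(p)}
    for _ in range(p):                                   # sqrt(P) rounds
        for I in range(p):
            for J in range(p):
                Cblk[I, J] += Ablk[I, J] @ Bblk[I, J]    # local multiply
        # communication step: shift A-blocks left, B-blocks up (one block each)
        Ablk = {(I, J): Ablk[I, (J + 1) % p] for I in range(p) for J in range(p)}
        Bblk = {(I, J): Bblk[(I + 1) % p, J] for I in range(p) for J in range(p)}
    C = np.zeros((n, n))
    for I in range(p):
        for J in range(p):
            C[I*b:(I+1)*b, J*b:(J+1)*b] = Cblk[I, J]
    return C

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
assert np.allclose(cannon_matmul(A, B, p=4), A @ B)
```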
Geometric Embedding (2nd approach)

[Figure: a run's timeline of reads, writes, and FLOPs, partitioned into segments S1, S2, S3, ... of M reads/writes each; example with M = 3]

The same segment argument, now for Form (1). For a given run (algorithm, machine, input):
1. Partition the computation into segments of M reads/writes each.
2. Any segment S has at most 3M inputs/outputs available.
3. Show that S performs at most k of the g_{i,j,k} evaluations.
4. The total communication is then
   BW = (BW of one segment) · #segments ≥ M · G / k, where G is the number of g_{i,j,k} operations.
As before [Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04], Loomis & Whitney supplies k: the g_{i,j,k} evaluated in a segment form a 3D set of lattice points whose shadows are the accessed entries of A, B, and C, and a segment has at most 3M of those.
5. By Loomis-Whitney, each segment performs at most k = (3M)^{3/2} of the g_{i,j,k}, so
   BW ≥ M · G / (3M)^{3/2} = Ω(G / M^{1/2}).
Applications

Recall: for any algorithm that agrees with Form (1),
  BW = Ω(G / M^{1/2}),  where G = |{g_{i,j,k} : (i,j) ∈ S, k ∈ S_ij}|, and
  BW = Ω(G / (P · M^{1/2})) in the P-parallel model.
But many algorithms just don't fit the generalized form! For example: Strassen's fast matrix multiplication.

Beyond 3 nested loops

How about the communication costs of algorithms that have a more complex structure?
Communication Lower Bounds (to be continued)

Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Proving that your algorithm/implementation is as good as it gets.
Further reduction techniques: imposing reads and writes

Example: computing ||A · B||, where each matrix element is a formula, computed only once.
Problem: the inputs and outputs do not agree with Form (1); the entries of A, B, and A · B need never touch slow memory.
Solution: impose writes/reads of the (computed) entries of A and B, and impose writes of the entries of C = A · B. The new algorithm agrees with Form (1), so it has lower bound Ω(n^3 / M^{1/2}); the imposed reads/writes add only O(n^2) communication. For the original algorithm this gives
  BW = Ω(n^3 / M^{1/2}) − O(n^2),  i.e., BW = Ω(n^3 / M^{1/2}) for M = O(n^2) (which we assume anyway).
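Here is what imposing writes means in code, a sketch using the Frobenius norm for concreteness (the slides leave the norm unspecified): the fused version never materializes C = A · B in slow memory, so Form (1) does not directly apply to it, while the imposed-writes version stores C explicitly at O(n^2) extra traffic:

```python
import numpy as np

def norm_of_product_fused(A, B):
    s = 0.0
    for i in range(A.shape[0]):
        row = A[i, :] @ B      # C(i, :) exists only transiently
        s += float(row @ row)  # consumed on the fly, never written back
    return s ** 0.5

def norm_of_product_imposed(A, B):
    C = A @ B                  # imposed writes: C's n^2 entries hit memory
    return float(np.linalg.norm(C))

A, B = np.random.rand(50, 50), np.random.rand(50, 50)
assert np.isclose(norm_of_product_fused(A, B), norm_of_product_imposed(A, B))
```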
Further reduction techniques: imposing reads and writes (cont.)

The previous example generalizes to other black-box uses of algorithms that fit Form (1).

Now consider a more general class of algorithms: some arguments of the generalized form may be computed on the fly and discarded immediately after use.
[Figure: a run's timeline partitioned into segments of M reads/writes each]

Recall how we generalized this lower bound. For a given run (algorithm, machine, input):
1. Partition the computation into segments of M reads/writes each.
2. Any segment S has at most 3M inputs/outputs.
3. Show that S performs at most G(3M) of the g_{i,j,k} FLOPs, for some function G(·).
4. The total communication is BW ≥ M · G / G(3M).
But now some operands inside a segment may be computed on the fly and discarded, so no read/write is performed for them, and step 2 no longer holds as stated.

How to generalize this lower bound: how to deal with operands generated on the fly
We need to distinguish the sources and destinations of each operand that appears in fast memory during a segment.

Possible sources:
- R1: already in fast memory at the start of the segment, or read during it; at most 2M such operands.
- R2: created during the segment; no bound without more information.

Possible destinations:
- D1: left in fast memory at the end of the segment, or written during it; at most 2M such operands.
- D2: discarded; no bound without more information.
There are at most 4M operands of types R1/D1, R1/D2, and R2/D1 in a segment. We need to assume, or prove, that there are not too many R2/D2 arguments; then we can apply Loomis-Whitney as before and obtain the lower bound of Form (1). Bounding the number of R2/D2 operands is sometimes quite subtle.
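A toy classifier for this taxonomy (all names and the trace format are hypothetical, purely for illustration): given what a segment started with, read, wrote, and ended with, it buckets each touched operand; R2/D2 is the one bucket no memory traffic witnesses.

```python
def classify(touched, at_start, reads, writes, at_end):
    buckets = {"R1/D1": 0, "R1/D2": 0, "R2/D1": 0, "R2/D2": 0}
    for x in touched:
        src = "R1" if x in at_start or x in reads else "R2"   # source of x
        dst = "D1" if x in at_end or x in writes else "D2"    # destination of x
        buckets[src + "/" + dst] += 1
    return buckets

print(classify(touched={"a", "b", "c", "d"},
               at_start={"a"}, reads={"b"},   # R1 operands: a, b
               writes={"b"}, at_end={"c"}))   # D1 operands: b, c
# a: R1/D2, b: R1/D1, c: R2/D1, d: R2/D2 (invisible to the traffic count)
```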
Composition of algorithms

Many algorithms and applications use compositions of other (linear algebra) algorithms. How do we compute lower and upper bounds for such cases?

Example: dense matrix powering. Compute A^n by repeated squaring (log n times): A → A^2 → A^4 → ... → A^n. Each squaring step agrees with Form (1). Do we get the sum of the steps' bounds, Ω((n^3 · log n) / M^{1/2}), or is there a way to reorder (interleave) the computations to reduce communication? (A sketch of repeated squaring follows.)
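A sketch of repeated squaring (binary exponentiation, so that it also covers n that is not a power of two; sizes below are arbitrary). Every matrix product line fits Form (1) and therefore inherits the Ω(n^3 / M^{1/2}) bound on its own:

```python
import numpy as np

def matrix_power(A, n):
    result = np.eye(A.shape[0])
    square = A.copy()
    while n > 0:
        if n & 1:
            result = result @ square  # classical matmul: fits Form (1)
        square = square @ square      # ~log2(n) squarings, each fits Form (1)
        n >>= 1
    return result

A = np.random.rand(4, 4) / 4
assert np.allclose(matrix_power(A, 13), np.linalg.matrix_power(A, 13))
```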
Communication hiding vs. communication avoiding

Q: The model assumes that computation and communication do not overlap. Is this a realistic assumption? Can't we gain time by overlapping them?
A: Right; this is called communication hiding. It is done in practice, and it is ignored in our model. It may save up to a factor of 2 in running time. Note that the speedup gained by avoiding (minimizing) communication is typically larger than a constant factor.
Two nested loops: when the input/output size dominates

Q: Do two-nested-loop algorithms fall into the paradigm of Form (1)? For example, what lower bound do we obtain for matrix-vector multiplication?
A: Yes, but the bound we obtain is Ω(n^2 / M^{1/2}), whereas just reading the input already costs Θ(n^2). More generally, the communication cost lower bound for algorithms that agree with Form (1) is
  BW = Ω(max(LW, #inputs + #outputs)),
where LW is the bound obtained from the geometric embedding, and #inputs + #outputs is the total size of the inputs and outputs. For some algorithms LW dominates; for others #inputs + #outputs dominates.
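The general bound as a one-liner, evaluated for matrix-vector (G = n^2) and matrix-matrix (G = n^3) multiplication (a sketch; the sample n and M are arbitrary, and constants are suppressed):

```python
def bw_lower_bound(G, io, M):
    return max(G / M**0.5, io)  # Omega(max(LW, #inputs + #outputs))

n, M = 4096, 2**20
print("mat-vec:", bw_lower_bound(G=n**2, io=n**2 + 2*n, M=M))  # I/O dominates
print("mat-mat:", bw_lower_bound(G=n**3, io=3 * n**2,   M=M))  # LW dominates
```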
Composition of algorithms

Claim: any implementation of A^n by repeated squaring (log n times) requires
  BW = Ω((n^3 · log n) / M^{1/2}).
Therefore we cannot reduce communication by more than a constant factor, compared to log n separate matrix multiplication calls, by reordering the computations.

Proof: impose reads/writes on each entry of every intermediate matrix. The total number of g_{i,j,k} operations is Θ(n^3 · log n), and the total number of imposed reads/writes is Θ(n^2 · log n). The lower bound for the original algorithm is therefore
  Ω((n^3 · log n) / M^{1/2}) − Θ(n^2 · log n) = Ω((n^3 · log n) / M^{1/2})
(assuming, as before, M = O(n^2)).
Composition of algorithms: when interleaving does matter

Example 1: Input: A, v_1, v_2, ..., v_n. Output: A·v_1, A·v_2, ..., A·v_n.
The phased solution (n separate matrix-vector multiplications) costs Θ(n^3). But we already know we can save an M^{1/2} factor: set B = (v_1, v_2, ..., v_n) and compute A · B; then the cost is O(n^3 / M^{1/2}) (see the sketch below). Other examples?
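The sketch promised above (sizes arbitrary): both schedules compute the same columns, and only the data-movement schedule differs; the batched call can reuse each block of A across all vectors:

```python
import numpy as np

n = 256
A = np.random.rand(n, n)
V = np.random.rand(n, n)  # columns v_1, ..., v_n

phased  = np.stack([A @ V[:, i] for i in range(n)], axis=1)  # n mat-vecs
batched = A @ V                                              # one mat-mul
assert np.allclose(phased, batched)
```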
Composition of algorithms: when interleaving does matter

Example 2: Input: A, B, t. Output: C^(k) = A · B^(k) for k = 1, 2, ..., t, where B^(k)(i,j) = B(i,j)^{1/k}.
Phased solution:
- Upper bound: O(t · n^3 / M^{1/2}), by adding up the BW costs of t matrix multiplication calls.
- Lower bound: Ω(t · n^3 / M^{1/2}) for the phased implementation, by imposing writes/reads between the phases.
Can we do better than the phased solution? Yes.

Claim: there exists an implementation of the above computation with asymptotically smaller communication cost, and with matching (tight) lower and upper bounds.
Proof idea:
- Upper bound: having both A(i,k) and B(k,j) in fast memory lets us do up to t evaluations of g_{i,j,k}, computing each operand B^(k)(k,j) = B(k,j)^{1/k} on the fly.
- Lower bound: the union of all these t · n^3 operations does not match Form (1), since the inputs B(k,j) cannot be indexed in a one-to-one fashion. We need a more careful argument, bounding the number of g_{i,j,k} operations in a segment as a function of the number of accessed elements of A, B, and the C^(k).
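A sketch of the upper-bound idea (block size, loop order, and the simplistic handling of the t output accumulators are illustrative; a real schedule must also respect the fast-memory limit for the C-blocks): while a block pair of A and B is resident, all t products consume it, with the B^(k) entries generated on the fly instead of re-streaming A and B once per k:

```python
import numpy as np

def all_powers_product(A, B, t, b=32):
    n = A.shape[0]
    C = [np.zeros((n, n)) for _ in range(t)]
    for i in range(0, n, b):
        for j in range(0, n, b):
            for kk in range(0, n, b):
                Ablk = A[i:i+b, kk:kk+b]   # resident block of A
                Bblk = B[kk:kk+b, j:j+b]   # resident block of B
                for k in range(1, t + 1):  # t g-evaluations per resident pair
                    # B^(k) block generated on the fly, never stored globally
                    C[k-1][i:i+b, j:j+b] += Ablk @ Bblk ** (1.0 / k)
    return C

A = np.random.rand(64, 64)
B = np.random.rand(64, 64) + 1.0           # positive entries for real roots
C = all_powers_product(A, B, t=3)
assert np.allclose(C[1], A @ np.sqrt(B))   # k = 2 gives B^(1/2)
```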
Composition of algorithms: when interleaving does matter

Can you think of natural examples where reordering/interleaving of known algorithms may improve the communication costs, compared to the phased implementation?
Summary

How to compute an upper bound on the communication costs of your algorithm? Typically straightforward; not always.

How to compute and prove a lower bound on the communication costs of your algorithm?
- By reduction: from another algorithm/problem, or from another model of computation.
- By using the generalized form (the flavor of 3 nested loops), imposing reads/writes black-box-wise, or bounding the number of R2/D2 operands.
- By carefully composing the lower bounds of the building blocks.
Next time: by graph analysis.
Open Problems

Find algorithms that attain the lower bounds:
- Sparse matrix algorithms.
- Algorithms that auto-tune or are cache-oblivious.
- Cache-oblivious algorithms for the parallel (distributed-memory) model; cache-oblivious parallel matrix multiplication? (Cilk++?)
Address complex heterogeneous hardware: lower bounds and algorithms [Demmel, Volkov 08], [Ballard, Demmel, Gearhart 11].
Thank you!