Vector: Data Layout

Vector: x[n], P processors. Assume n = r * p.
Notation: A[m:n] denotes for(i=m; i<=n; i++) A[i]; A[m:s:n] denotes for(i=m; i<=n; i=i+s) A[i].
Block distribution: the id-th processor (id = 0, 1, …, p-1) holds x[r*id : r*(id+1)-1].
Cyclic distribution: the id-th processor holds x[id : p : n-1].
Block cyclic distribution: x = [x1, x2, …, xN]^T, where each xj is a subvector of length n/N; the subvectors are dealt to the p processors cyclically. (Chunk size and stride: block — chunk r; cyclic — stride p; block cyclic — chunk n/N, stride P*n/N.)

1
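As a concrete illustration (a serial sketch with assumed example values n = 12, p = 3, N = 6, not part of the slides), the three distributions give the following index sets per processor:

```python
# Assumed example: n = 12 elements, p = 3 processors, so r = n/p = 4.
n, p = 12, 3
r = n // p

# Block: processor id owns x[r*id : r*(id+1)-1]
block = {pid: list(range(r * pid, r * (pid + 1))) for pid in range(p)}

# Cyclic: processor id owns x[id : p : n-1]
cyclic = {pid: list(range(pid, n, p)) for pid in range(p)}

# Block cyclic: x split into N subvectors of length b = n/N,
# subvectors dealt to the p processors round-robin.
N = 6                      # assumed number of subvectors
b = n // N                 # subvector length
block_cyclic = {pid: [i for j in range(pid, N, p) for i in range(b * j, b * (j + 1))]
                for pid in range(p)}

print(block[0])         # [0, 1, 2, 3]
print(cyclic[0])        # [0, 3, 6, 9]
print(block_cyclic[0])  # [0, 1, 6, 7]
```

Each distribution partitions the same n indices; they differ only in locality (block) versus load balance across index ranges (cyclic, block cyclic).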
Computing A*x with P = K^2 cpus, arranged as a 2D K x K mesh.
A: a K x K block matrix, each block (N/K) x (N/K). X: K x 1 blocks, each block (N/K) x 1.
Each block of A is distributed to one cpu; X is distributed to the K cpus in the last column.
The result A*X should also be distributed on the K cpus of the last column.
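The layout above can be traced serially (an assumed NumPy sketch, not the MPI code; the row-wise reduction onto the last column is simulated by a plain sum):

```python
import numpy as np

# Assumed example: K = 2 mesh, N = 4 matrix. Processor (i,j) holds block A_ij,
# receives subvector x_j, computes a partial product, and the partials in row i
# are summed onto the last-column processor, which then holds y_i.
K, N = 2, 4
nb = N // K
A = np.arange(N * N, dtype=float).reshape(N, N)
x = np.arange(N, dtype=float)

# partial products: what processor (i,j) would compute locally
partial = {(i, j): A[i*nb:(i+1)*nb, j*nb:(j+1)*nb] @ x[j*nb:(j+1)*nb]
           for i in range(K) for j in range(K)}

# row-wise reduction onto the last column: y_i = sum_j A_ij x_j
y_blocks = [sum(partial[(i, j)] for j in range(K)) for i in range(K)]
y = np.concatenate(y_blocks)
assert np.allclose(y, A @ x)
```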
8
Matrix-Vector Multiply: 2D Decomposition
9
Homework

Write an MPI program implementing the matrix-vector multiplication algorithm with 2D decomposition. Assume:
- Y = A*X, A an N x N matrix, X a vector of length N.
- Number of processors P = K^2, arranged as a K x K mesh in row-major fashion, i.e. cpus 0, …, K-1 in the first row, K, …, 2K-1 in the second row, etc. N is divisible by K.
- Initially, each cpu has the data for its own submatrix of A; the input vector X is distributed on the processors of the rightmost column, i.e. cpus K-1, 2K-1, …, P-1.
- In the end, the result vector Y should be distributed on the processors of the rightmost column.
- A[i][j] = 2*i+j, X[i] = i; make sure your result is correct using a small value of N.

Turn in:
- Source code + binary.
- Wall-time and speedup vs. number of cpus for 1, 4, 16 processors, for N = 1024.
10
Load Balancing: (Block) Cyclic

[Figure: a triangular (growing) workload dealt out to processors a, b, c. Assigning contiguous blocks of rows gives processor c far more work than a; a cyclic assignment interleaves short and long rows so each processor gets a balanced share.]

[Figure: y = A*x with x = (x1, …, x9), y = (y1, …, y9) on p = 3 processors. Under the cyclic distribution, processor 0 holds y1, y4, y7; processor 1 holds y2, y5, y8; processor 2 holds y3, y6, y9.]
11
Cyclic Distribution
Matrix-vector multiply, row-wise cyclic distribution of A and y; block distribution of x.
Initial data: id – cpu id; p – number of cpus; ids of left/right neighbors; n – matrix dimension, n = r*p; Aloc = A(id:p:n-1, :); yloc = y(id:p:n-1); xloc = x(id*r:(id+1)*r-1).

r = n/p
for t = 0:p-1
    send(xloc, left)
    s = (id+t)%p                          // xloc = x(s*r:(s+1)*r-1)
    for i = 0:r-1                         // local row i is global row id+i*p
        for j = 0:min(id+i*p-s*r, r-1)    // triangular bound (lower-triangular A)
            yloc(i) += Aloc(i, j+s*r)*xloc(j)
        end
    end
    recv(xloc, right)
end
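Ignoring the triangular bound (i.e. for a dense A), the rotating-x scheme can be traced serially; this is an assumed NumPy sketch in which the send/recv pair is replaced by recomputing which x block s each cpu holds at step t:

```python
import numpy as np

# Assumed example: n = 6, p = 3, so r = 2. Processor `pid` owns rows
# pid, pid+p, ... of A (cyclic) and the block x[pid*r:(pid+1)*r]; at step t it
# holds block s = (pid+t) % p and accumulates that column stripe into its y.
n, p = 6, 3
r = n // p
A = np.arange(n * n, dtype=float).reshape(n, n)
x = np.arange(n, dtype=float)

yloc = {pid: np.zeros(r) for pid in range(p)}
for t in range(p):
    for pid in range(p):
        s = (pid + t) % p                  # which x block this cpu holds now
        rows = list(range(pid, n, p))      # cyclic rows owned by cpu `pid`
        cols = slice(s * r, (s + 1) * r)   # column stripe matching block s
        yloc[pid] += A[rows, cols] @ x[cols]

# gather the cyclically distributed y back into one vector
y = np.empty(n)
for pid in range(p):
    y[pid::p] = yloc[pid]
assert np.allclose(y, A @ x)
```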
Matrix Multiply: 2D Decomposition (Hypercube-Ring)

Cpus: P = K^2. Matrices A, B, C: dimension N x N, viewed as K x K block matrices; each block (N/K) x (N/K).

Determine the coordinate (irow, icol) of the current cpu.
Set B' = B_local
for j = 0:K-1
    root_col = (irow+j)%K
    broadcast A' = A_local from root cpu (irow, root_col) to the other cpus in the row
    C_local += A'*B'
    shift B' upward one step
end
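A serial trace of this broadcast-A / shift-B scheme (an assumed NumPy sketch; `blk` is a helper introduced here, and the broadcast and upward shift are simulated by plain indexing):

```python
import numpy as np

# Assumed example: K = 2 mesh, N = 4 matrices. At step j, row irow uses the A
# block from column (irow + j) % K; B blocks shift upward one row per step.
K, N = 2, 4
nb = N // K
A = np.arange(N * N, dtype=float).reshape(N, N)
B = A.T.copy()

def blk(M, i, j):
    # view of block (i, j) of matrix M
    return M[i*nb:(i+1)*nb, j*nb:(j+1)*nb]

C = np.zeros((N, N))
Bcur = {(i, j): blk(B, i, j).copy() for i in range(K) for j in range(K)}
for step in range(K):
    for irow in range(K):
        root_col = (irow + step) % K
        for icol in range(K):
            # A block broadcast along the row; current B' held locally
            blk(C, irow, icol)[:] += blk(A, irow, root_col) @ Bcur[(irow, icol)]
    # shift B' upward one step (row i receives from row i+1, cyclically)
    Bcur = {(i, j): Bcur[((i + 1) % K, j)] for i in range(K) for j in range(K)}
assert np.allclose(C, A @ B)
```

At step t, cpu (irow, icol) multiplies A_{irow,k} by B_{k,icol} with k = (irow+t)%K, so the K steps together cover the full block inner product.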
[Figure: step 1 broadcasts the diagonal blocks of A along their rows; step 2 broadcasts A01, A12, A23, A30 along their rows while B shifts upward one step between steps. Broadcast A diagonals; shift B; C fixed.]
20
Matrix Multiply
Total ~K*log(K) communication steps, i.e. sqrt(P)*log(sqrt(P)) steps.
In contrast, the 1D decomposition needs P communication steps.
Can use at most N^2 processors for an N x N problem; the 1D decomposition can use at most N processors.
21
Matrix Multiply: Ring-Hypercube
[Figure: 4 x 4 block example. Step 1: the diagonal blocks B00, B11, B22, B33 are broadcast along their columns, and cpu (i,j) computes A_{ij}B_{jj}. Step 2: the A blocks have shifted left one column, the next B diagonal (B10, B21, B32, B03) is broadcast, and each cpu accumulates the next product (A01B10, A02B21, A03B32, A00B03, etc.). Shift A columns; broadcast B diagonals.]
Number of cpus: P = K^2. A, B, C: K x K block matrices, each block (N/K) x (N/K).

Determine the coordinate (irow, icol) of the current cpu.
Set A' = A_local
for j = 0:K-1
    root_row = (icol+j)%K
    broadcast B' = B_local from root cpu (root_row, icol) to the other cpus in the column
    C_local += A'*B'
    shift A' leftward one step
end

C fixed.
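The dual broadcast-B / shift-A scheme can be traced the same way (an assumed NumPy sketch; `blk` is a helper introduced here):

```python
import numpy as np

# Assumed example: K = 2 mesh, N = 4 matrices. At step j, column icol uses the
# B block from row (icol + j) % K; A blocks shift leftward one column per step.
K, N = 2, 4
nb = N // K
A = np.arange(N * N, dtype=float).reshape(N, N)
B = 2.0 * A + 1.0

def blk(M, i, j):
    # view of block (i, j) of matrix M
    return M[i*nb:(i+1)*nb, j*nb:(j+1)*nb]

C = np.zeros((N, N))
Acur = {(i, j): blk(A, i, j).copy() for i in range(K) for j in range(K)}
for step in range(K):
    for icol in range(K):
        root_row = (icol + step) % K
        for irow in range(K):
            # B block broadcast along the column; current A' held locally
            blk(C, irow, icol)[:] += Acur[(irow, icol)] @ blk(B, root_row, icol)
    # shift A' leftward one step (column j receives from column j+1, cyclically)
    Acur = {(i, j): Acur[(i, (j + 1) % K)] for i in range(K) for j in range(K)}
assert np.allclose(C, A @ B)
```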
22
[Figure: 4 x 4 block trace. Initially cpu (i,j) holds A_{ij} and B_{ij}. Broadcasting B's diagonal gives the products A_{ij}B_{jj}; compute, then shift A left one column; the next broadcast (B10, B21, B32, B03) gives products such as A01B10, A02B21; after another shift and broadcast, products such as A02B20, A03B31. The cycle is: broadcast B, compute, shift A.]
Matrix Multiply: Ring-Hypercube
23
Matrix Multiply: Systolic (Torus)
[Figure: 3 x 3 block trace of the systolic algorithm. Initially cpu (i,j) holds A_{ij} and B_{ij}; after the initial skew, cpu (i,j) holds A_{i,(i+j)%K} and B_{(i+j)%K,j}, so step 1 computes A00B00, A01B11, A02B22, A11B10, A12B21, A10B02, etc. Each step multiplies the resident blocks, then shifts rows of A leftward and columns of B upward. C fixed.]

Number of cpus: P = K^2. A, B, C: K x K block matrices, each block (N/K) x (N/K).
24
Matrix Multiply: Systolic

P = K^2 processors, arranged as a K x K 2D torus. A, B, C: K x K block matrices, each block (N/K) x (N/K). Each cpu computes one block: A_loc, B_loc, C_loc. Coordinate in the torus of the current cpu: (irow, icol); ids of the left, right, top, bottom neighboring processors.
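The systolic (torus) scheme can be traced serially (an assumed NumPy sketch; `blk` is a helper introduced here, and the skew and shifts are simulated by re-indexing dictionaries of blocks):

```python
import numpy as np

# Assumed example: K = 3 torus, N = 6 matrices. After the initial skew
# (row i of A shifted left by i, column j of B shifted up by j), each step
# multiplies the resident blocks, then shifts A left and B up by one.
K, N = 3, 6
nb = N // K
A = np.arange(N * N, dtype=float).reshape(N, N)
B = A.T.copy()

def blk(M, i, j):
    # view of block (i, j) of matrix M
    return M[i*nb:(i+1)*nb, j*nb:(j+1)*nb]

# initial skew
Acur = {(i, j): blk(A, i, (j + i) % K).copy() for i in range(K) for j in range(K)}
Bcur = {(i, j): blk(B, (i + j) % K, j).copy() for i in range(K) for j in range(K)}

C = np.zeros((N, N))
for step in range(K):
    for i in range(K):
        for j in range(K):
            blk(C, i, j)[:] += Acur[(i, j)] @ Bcur[(i, j)]
    # shift rows of A leftward, columns of B upward
    Acur = {(i, j): Acur[(i, (j + 1) % K)] for i in range(K) for j in range(K)}
    Bcur = {(i, j): Bcur[((i + 1) % K, j)] for i in range(K) for j in range(K)}
assert np.allclose(C, A @ B)
```

After the skew, cpu (i,j) always holds a matching pair A_{i,k}, B_{k,j} with k = (i+j+step)%K, so the K steps accumulate the full block inner product.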
P = K^3 processors, organized into a K x K x K 3D mesh.
A (N x N) can be considered as a q x q block matrix, each block (N/q) x (N/q).
Let q = K^(1/3), i.e. consider A as a K^(1/3) x K^(1/3) block matrix, each block (N/K^(1/3)) x (N/K^(1/3)). Similarly for B and C.
26
Matrix Multiply on K^3 CPUs
C_{rs} = sum_{t=1}^{K^(1/3)} A_{rt} B_{ts},   r,s = 1, 2, …, K^(1/3)
Total K^(1/3) * K^(1/3) * K^(1/3) = K block matrix multiplications.
Idea: perform these K matrix multiplications on the K different planes (levels) of the 3D mesh of processors.
Processor (i,j,k) (i,j = 1,…,K) belongs to plane k, and will perform the multiplication A_{rt}*B_{ts}, where k = (r-1)*K^(2/3) + (s-1)*K^(1/3) + t.
Within a plane, an (N/K^(1/3)) x (N/K^(1/3)) matrix multiply runs on K x K processors; use the systolic multiplication algorithm. Within plane k, A_{rt}, B_{ts} and C_{rs} are decomposed into K x K block matrices, each block (N/K^(4/3)) x (N/K^(4/3)).
27
Matrix Multiply
Initial data distribution
Initially, processor (i,j,1) has the (i,j) sub-blocks of all A_{rt} and B_{ts} blocks, for all r,s,t = 1,…,K^(1/3), i,j = 1,…,K.

[Figure: A, B, C drawn as K^(1/3) x K^(1/3) grids of blocks of dimension N/K^(1/3); each block A_{rt}, B_{ts} is further split into K x K sub-blocks on the K x K processors of a plane, with sub-block (i,j) on processor (i,j,1).]

A_{rt} is destined to levels k = (r-1)*K^(2/3) + (s-1)*K^(1/3) + t, for all s = 1,…,K^(1/3).
B_{ts} is destined to levels k = (r-1)*K^(2/3) + (s-1)*K^(1/3) + t, for all r = 1,…,K^(1/3).
28
Matrix Multiply

// Set up input data
On processor (i,j,1): read in the (i,j)-th block of matrices A_{rt} and B_{ts}, 1 <= r,s,t <= K^(1/3); pass the data on to processor (i,j,2).
On processor (i,j,m): make an own copy of A_{rt} if m = (r-1)*K^(2/3) + (s-1)*K^(1/3) + t for some s = 1,…,K^(1/3); make an own copy of B_{ts} if m = (r-1)*K^(2/3) + (s-1)*K^(1/3) + t for some r = 1,…,K^(1/3); pass the data onward to (i,j,m+1).

// Computation
On each plane m: compute A_{rt}*B_{ts} on the K x K processors using the systolic matrix multiplication algorithm; some initial data setup may be needed before the multiplication.

// Summation
Determine the (r0,s0) of the matrix the current processor (i,j,k) works on (integer division): r0 = (k-1)/K^(2/3) + 1; s0 = ((k-1) mod K^(2/3))/K^(1/3) + 1.
Do a reduction (sum) over processors (i,j,m), m = (r0-1)*K^(2/3) + (s0-1)*K^(1/3) + t, over all 1 <= t <= K^(1/3).
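The level arithmetic can be checked with a small script (an assumed sketch; K = 8 is chosen so that K^(1/3) = 2 is an exact integer, and the inverse formulas work on the 0-based offset k-1):

```python
# Assumed example: K = 8 planes, q = K^(1/3) = 2, so r, s, t each range 1..2.
K = 8
q = round(K ** (1 / 3))
assert q ** 3 == K

def level(r, s, t):
    # k = (r-1)*K^(2/3) + (s-1)*K^(1/3) + t
    return (r - 1) * q * q + (s - 1) * q + t

def rs_of_level(k):
    # recover (r0, s0) from the plane index k by integer division
    r0 = (k - 1) // (q * q) + 1
    s0 = ((k - 1) % (q * q)) // q + 1
    return r0, s0

# every (r, s, t) maps to a distinct level 1..K, and (r, s) is recovered
seen = set()
for r in range(1, q + 1):
    for s in range(1, q + 1):
        for t in range(1, q + 1):
            k = level(r, s, t)
            assert 1 <= k <= K and k not in seen
            seen.add(k)
            assert rs_of_level(k) == (r, s)
```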
29
Matrix Multiply
Communication steps: ~K, i.e. P^(1/3).
Maximum number of cpus: the block size N/K^(4/3) = 1 gives K = N^(3/4), i.e. P = K^3 = N^(9/4).
30
Matrix Multiply

If the number of processors is P = K*Q^2, arranged into a K x Q x Q mesh (K planes, each plane Q x Q processors), handle it similarly: decompose A, B, C into K^(1/3) x K^(1/3) blocks; run the different block matrix multiplications in different planes, K multiplications in total; each block multiplication is handled in one plane on Q x Q processors, using any favorable algorithm, e.g. the systolic one.
31
Processor Array in Higher Dimension
Processors P = K^4, arranged into a K x K x K x K mesh.
Similar strategy: divide A, B, C into K^(1/3) x K^(1/3) block matrices; the different multiplications (K in total) are computed on different levels of the first dimension; each block matrix multiplication is done on the K x K x K mesh at one level, repeating the above strategy.
For even higher dimensions, P = K^n, n > 4, handle similarly.
32
Matrix Multiply: DNS Algorithm
Assume:
A, B, C: dimension N x N
P = K^3 processors, organized into a K x K x K 3D mesh.

C_{rs} = sum_{t=1}^{K} A_{rt} B_{ts}

A, B, C are K x K block matrices, each block (N/K) x (N/K); K*K*K block matrix multiplications in total.
Idea: each block matrix multiplication is assigned to one processor. Processor (i,j,k) computes C_{ij} = A_{ik}*B_{kj}; a reduction (sum) over processors (i,j,k), k = 0,…,K-1, is then needed.
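The DNS idea can be traced serially (an assumed NumPy sketch; `blk` is a helper introduced here, and the k-direction reduction is simulated by a plain sum):

```python
import numpy as np

# Assumed example: K = 2, N = 4. "Processor" (i, j, k) computes the single
# block product A_ik * B_kj; C_ij is the reduction (sum) over k.
K, N = 2, 4
nb = N // K
A = np.arange(N * N, dtype=float).reshape(N, N)
B = 3.0 - A

def blk(M, i, j):
    # view of block (i, j) of matrix M
    return M[i*nb:(i+1)*nb, j*nb:(j+1)*nb]

# one block product per processor (i, j, k)
partial = {(i, j, k): blk(A, i, k) @ blk(B, k, j)
           for i in range(K) for j in range(K) for k in range(K)}

# reduction along the k-direction onto plane k = 0
C = np.zeros((N, N))
for i in range(K):
    for j in range(K):
        blk(C, i, j)[:] = sum(partial[(i, j, k)] for k in range(K))
assert np.allclose(C, A @ B)
```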
33
Matrix Multiply: DNS Algorithm
Initial data distribution: A_{ij} and B_{ij} at processor (i,j,0).
Need to transfer A_{ik} (i,k = 0,…,K-1) to processor (i,j,k) for all j = 0,1,…,K-1.
Two steps:
- Send A_{ik} from processor (i,k,0) to (i,k,k);
- Broadcast A_{ik} from processor (i,k,k) to processors (i,j,k).
34
Matrix Multiply
Send A_{ik} from (i,k,0) to (i,k,k).
Broadcast A_{ik} from (i,k,k) to (i,j,k), j = 0,…,K-1, i.e. along the j-direction.
35
Matrix Multiply
Final data distribution for A
A can also be considered to come in through the (i,k) plane, with a broadcast along the j-direction.
36
B Distribution
B distribution: initially B_{kj} is in processor (k,j,0); it must be transferred to processors (i,j,k) for all i = 0,1,…,K-1.
Two steps:
- First send B_{kj} from (k,j,0) to (k,j,k);
- Broadcast B_{kj} from (k,j,k) to (i,j,k) for all i = 0,…,K-1, i.e. along the i-direction.
37
B Distribution
[Figure: K = 4 mesh labeled by (k,j) coordinates (0,0) … (3,3), with axes i, j, k. Send B_{kj} from (k,j,0) to (k,j,k); then broadcast from (k,j,k) along the i-direction.]
38
Matrix Multiply
Final B distribution:
B can also be considered to come in through the (j,k) plane, then broadcast along the i-direction.
39
Matrix Multiply

A_{ik} and B_{kj} on cpu (i,j,k); compute C_{ij} locally; reduce (sum) C_{ij} along the k-direction. Final result: C_{ij} on cpu (i,j,0).
40
Matrix Multiply

The A matrix comes in through the (i,k) plane and is broadcast along the j-direction; the B matrix comes in through the (j,k) plane and is broadcast along the i-direction; the C matrix result goes to the (i,j) plane.