Vector: Data Layout

Vector: x[n], P processors. Assume n = r * p.
Notation: A[m:n] denotes for(i=m; i<=n; i++) A[i]; A[m:s:n] denotes for(i=m; i<=n; i=i+s) A[i].
Block distribution: the id-th processor (id = 0, 1, …, p-1) holds x[r*id : r*(id+1)-1].
Cyclic distribution: the id-th processor holds x[id : p : n-1].
Block cyclic distribution: x = [x1, x2, …, xN]^T, where each xj is a subvector of length n/N; the subvectors are dealt to the p processors cyclically. (Chunk size and stride: block — chunk r; cyclic — stride p; block cyclic — chunk n/N, stride P*n/N.)

1
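As a concrete illustration (a serial sketch with assumed example values n = 12, p = 3, N = 6, not part of the slides), the three distributions give the following index sets per processor:

```python
# Assumed example: n = 12 elements, p = 3 processors, so r = n/p = 4.
n, p = 12, 3
r = n // p

# Block: processor id owns x[r*id : r*(id+1)-1]
block = {pid: list(range(r * pid, r * (pid + 1))) for pid in range(p)}

# Cyclic: processor id owns x[id : p : n-1]
cyclic = {pid: list(range(pid, n, p)) for pid in range(p)}

# Block cyclic: x split into N subvectors of length b = n/N,
# subvectors dealt to the p processors round-robin.
N = 6                      # assumed number of subvectors
b = n // N                 # subvector length
block_cyclic = {pid: [i for j in range(pid, N, p) for i in range(b * j, b * (j + 1))]
                for pid in range(p)}

print(block[0])         # [0, 1, 2, 3]
print(cyclic[0])        # [0, 3, 6, 9]
print(block_cyclic[0])  # [0, 1, 6, 7]
```

Each distribution partitions the same n indices; they differ only in locality (block) versus load balance across index ranges (cyclic, block cyclic).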
Computing A*x with P = K^2 cpus, arranged as a 2D K x K mesh.
A: a K x K block matrix, each block (N/K) x (N/K). X: K x 1 blocks, each block (N/K) x 1.
Each block of A is distributed to one cpu; X is distributed to the K cpus in the last column.
The result A*X should also be distributed on the K cpus of the last column.
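The layout above can be traced serially (an assumed NumPy sketch, not the MPI code; the row-wise reduction onto the last column is simulated by a plain sum):

```python
import numpy as np

# Assumed example: K = 2 mesh, N = 4 matrix. Processor (i,j) holds block A_ij,
# receives subvector x_j, computes a partial product, and the partials in row i
# are summed onto the last-column processor, which then holds y_i.
K, N = 2, 4
nb = N // K
A = np.arange(N * N, dtype=float).reshape(N, N)
x = np.arange(N, dtype=float)

# partial products: what processor (i,j) would compute locally
partial = {(i, j): A[i*nb:(i+1)*nb, j*nb:(j+1)*nb] @ x[j*nb:(j+1)*nb]
           for i in range(K) for j in range(K)}

# row-wise reduction onto the last column: y_i = sum_j A_ij x_j
y_blocks = [sum(partial[(i, j)] for j in range(K)) for i in range(K)]
y = np.concatenate(y_blocks)
assert np.allclose(y, A @ x)
```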
8
Matrix-Vector Multiply: 2D Decomposition
9
Homework

Write an MPI program implementing the matrix-vector multiplication algorithm with 2D decomposition. Assume:
- Y = A*X, A an N x N matrix, X a vector of length N.
- Number of processors P = K^2, arranged as a K x K mesh in row-major fashion, i.e. cpus 0, …, K-1 in the first row, K, …, 2K-1 in the second row, etc. N is divisible by K.
- Initially, each cpu has the data for its own submatrix of A; the input vector X is distributed on the processors of the rightmost column, i.e. cpus K-1, 2K-1, …, P-1.
- In the end, the result vector Y should be distributed on the processors of the rightmost column.
- A[i][j] = 2*i+j, X[i] = i; make sure your result is correct using a small value of N.

Turn in:
- Source code + binary.
- Wall-time and speedup vs. number of cpus for 1, 4, 16 processors, for N = 1024.
10
Load Balancing: (Block) Cyclic

[Figure: a triangular (growing) workload dealt out to processors a, b, c. Assigning contiguous blocks of rows gives processor c far more work than a; a cyclic assignment interleaves short and long rows so each processor gets a balanced share.]

[Figure: y = A*x with x = (x1, …, x9), y = (y1, …, y9) on p = 3 processors. Under the cyclic distribution, processor 0 holds y1, y4, y7; processor 1 holds y2, y5, y8; processor 2 holds y3, y6, y9.]
11
Cyclic Distribution
Matrix-vector multiply, row-wise cyclic distribution of A and y; block distribution of x.
Initial data: id – cpu id; p – number of cpus; ids of left/right neighbors; n – matrix dimension, n = r*p; Aloc = A(id:p:n-1, :); yloc = y(id:p:n-1); xloc = x(id*r:(id+1)*r-1).

r = n/p
for t = 0:p-1
    send(xloc, left)
    s = (id+t)%p                          // xloc = x(s*r:(s+1)*r-1)
    for i = 0:r-1                         // local row i is global row id+i*p
        for j = 0:min(id+i*p-s*r, r-1)    // triangular bound (lower-triangular A)
            yloc(i) += Aloc(i, j+s*r)*xloc(j)
        end
    end
    recv(xloc, right)
end
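Ignoring the triangular bound (i.e. for a dense A), the rotating-x scheme can be traced serially; this is an assumed NumPy sketch in which the send/recv pair is replaced by recomputing which x block s each cpu holds at step t:

```python
import numpy as np

# Assumed example: n = 6, p = 3, so r = 2. Processor `pid` owns rows
# pid, pid+p, ... of A (cyclic) and the block x[pid*r:(pid+1)*r]; at step t it
# holds block s = (pid+t) % p and accumulates that column stripe into its y.
n, p = 6, 3
r = n // p
A = np.arange(n * n, dtype=float).reshape(n, n)
x = np.arange(n, dtype=float)

yloc = {pid: np.zeros(r) for pid in range(p)}
for t in range(p):
    for pid in range(p):
        s = (pid + t) % p                  # which x block this cpu holds now
        rows = list(range(pid, n, p))      # cyclic rows owned by cpu `pid`
        cols = slice(s * r, (s + 1) * r)   # column stripe matching block s
        yloc[pid] += A[rows, cols] @ x[cols]

# gather the cyclically distributed y back into one vector
y = np.empty(n)
for pid in range(p):
    y[pid::p] = yloc[pid]
assert np.allclose(y, A @ x)
```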
Matrix Multiply: 2D Decomposition (Hypercube-Ring)

Cpus: P = K^2. Matrices A, B, C: dimension N x N, viewed as K x K block matrices; each block (N/K) x (N/K).

Determine the coordinate (irow, icol) of the current cpu.
Set B' = B_local
for j = 0:K-1
    root_col = (irow+j)%K
    broadcast A' = A_local from root cpu (irow, root_col) to the other cpus in the row
    C_local += A'*B'
    shift B' upward one step
end
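A serial trace of this broadcast-A / shift-B scheme (an assumed NumPy sketch; `blk` is a helper introduced here, and the broadcast and upward shift are simulated by plain indexing):

```python
import numpy as np

# Assumed example: K = 2 mesh, N = 4 matrices. At step j, row irow uses the A
# block from column (irow + j) % K; B blocks shift upward one row per step.
K, N = 2, 4
nb = N // K
A = np.arange(N * N, dtype=float).reshape(N, N)
B = A.T.copy()

def blk(M, i, j):
    # view of block (i, j) of matrix M
    return M[i*nb:(i+1)*nb, j*nb:(j+1)*nb]

C = np.zeros((N, N))
Bcur = {(i, j): blk(B, i, j).copy() for i in range(K) for j in range(K)}
for step in range(K):
    for irow in range(K):
        root_col = (irow + step) % K
        for icol in range(K):
            # A block broadcast along the row; current B' held locally
            blk(C, irow, icol)[:] += blk(A, irow, root_col) @ Bcur[(irow, icol)]
    # shift B' upward one step (row i receives from row i+1, cyclically)
    Bcur = {(i, j): Bcur[((i + 1) % K, j)] for i in range(K) for j in range(K)}
assert np.allclose(C, A @ B)
```

At step t, cpu (irow, icol) multiplies A_{irow,k} by B_{k,icol} with k = (irow+t)%K, so the K steps together cover the full block inner product.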
[Figure: step 1 broadcasts the diagonal blocks of A along their rows; step 2 broadcasts A01, A12, A23, A30 along their rows while B shifts upward one step between steps. Broadcast A diagonals; shift B; C fixed.]
20
Matrix Multiply
Total ~K*log(K) communication steps, i.e. sqrt(P)*log(sqrt(P)) steps.
In contrast, the 1D decomposition needs P communication steps.
Can use at most N^2 processors for an N x N problem; the 1D decomposition can use at most N processors.
21
Matrix Multiply: Ring-Hypercube
[Figure: 4 x 4 block example. Step 1: the diagonal blocks B00, B11, B22, B33 are broadcast along their columns, and cpu (i,j) computes A_{ij}B_{jj}. Step 2: the A blocks have shifted left one column, the next B diagonal (B10, B21, B32, B03) is broadcast, and each cpu accumulates the next product (A01B10, A02B21, A03B32, A00B03, etc.). Shift A columns; broadcast B diagonals.]
Number of cpus: P = K^2. A, B, C: K x K block matrices, each block (N/K) x (N/K).

Determine the coordinate (irow, icol) of the current cpu.
Set A' = A_local
for j = 0:K-1
    root_row = (icol+j)%K
    broadcast B' = B_local from root cpu (root_row, icol) to the other cpus in the column
    C_local += A'*B'
    shift A' leftward one step
end

C fixed.
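The dual broadcast-B / shift-A scheme can be traced the same way (an assumed NumPy sketch; `blk` is a helper introduced here):

```python
import numpy as np

# Assumed example: K = 2 mesh, N = 4 matrices. At step j, column icol uses the
# B block from row (icol + j) % K; A blocks shift leftward one column per step.
K, N = 2, 4
nb = N // K
A = np.arange(N * N, dtype=float).reshape(N, N)
B = 2.0 * A + 1.0

def blk(M, i, j):
    # view of block (i, j) of matrix M
    return M[i*nb:(i+1)*nb, j*nb:(j+1)*nb]

C = np.zeros((N, N))
Acur = {(i, j): blk(A, i, j).copy() for i in range(K) for j in range(K)}
for step in range(K):
    for icol in range(K):
        root_row = (icol + step) % K
        for irow in range(K):
            # B block broadcast along the column; current A' held locally
            blk(C, irow, icol)[:] += Acur[(irow, icol)] @ blk(B, root_row, icol)
    # shift A' leftward one step (column j receives from column j+1, cyclically)
    Acur = {(i, j): Acur[(i, (j + 1) % K)] for i in range(K) for j in range(K)}
assert np.allclose(C, A @ B)
```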
22
[Figure: 4 x 4 block trace. Initially cpu (i,j) holds A_{ij} and B_{ij}. Broadcasting B's diagonal gives the products A_{ij}B_{jj}; compute, then shift A left one column; the next broadcast (B10, B21, B32, B03) gives products such as A01B10, A02B21; after another shift and broadcast, products such as A02B20, A03B31. The cycle is: broadcast B, compute, shift A.]
Matrix Multiply: Ring-Hypercube
23
Matrix Multiply: Systolic (Torus)
[Figure: 3 x 3 block trace of the systolic algorithm. Initially cpu (i,j) holds A_{ij} and B_{ij}; after the initial skew, cpu (i,j) holds A_{i,(i+j)%K} and B_{(i+j)%K,j}, so step 1 computes A00B00, A01B11, A02B22, A11B10, A12B21, A10B02, etc. Each step multiplies the resident blocks, then shifts rows of A leftward and columns of B upward. C fixed.]

Number of cpus: P = K^2. A, B, C: K x K block matrices, each block (N/K) x (N/K).
24
Matrix Multiply: Systolic

P = K^2 processors, arranged as a K x K 2D torus. A, B, C: K x K block matrices, each block (N/K) x (N/K). Each cpu computes one block: A_loc, B_loc, C_loc. Coordinate in the torus of the current cpu: (irow, icol); ids of the left, right, top, bottom neighboring processors.
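The systolic (torus) scheme can be traced serially (an assumed NumPy sketch; `blk` is a helper introduced here, and the skew and shifts are simulated by re-indexing dictionaries of blocks):

```python
import numpy as np

# Assumed example: K = 3 torus, N = 6 matrices. After the initial skew
# (row i of A shifted left by i, column j of B shifted up by j), each step
# multiplies the resident blocks, then shifts A left and B up by one.
K, N = 3, 6
nb = N // K
A = np.arange(N * N, dtype=float).reshape(N, N)
B = A.T.copy()

def blk(M, i, j):
    # view of block (i, j) of matrix M
    return M[i*nb:(i+1)*nb, j*nb:(j+1)*nb]

# initial skew
Acur = {(i, j): blk(A, i, (j + i) % K).copy() for i in range(K) for j in range(K)}
Bcur = {(i, j): blk(B, (i + j) % K, j).copy() for i in range(K) for j in range(K)}

C = np.zeros((N, N))
for step in range(K):
    for i in range(K):
        for j in range(K):
            blk(C, i, j)[:] += Acur[(i, j)] @ Bcur[(i, j)]
    # shift rows of A leftward, columns of B upward
    Acur = {(i, j): Acur[(i, (j + 1) % K)] for i in range(K) for j in range(K)}
    Bcur = {(i, j): Bcur[((i + 1) % K, j)] for i in range(K) for j in range(K)}
assert np.allclose(C, A @ B)
```

After the skew, cpu (i,j) always holds a matching pair A_{i,k}, B_{k,j} with k = (i+j+step)%K, so the K steps accumulate the full block inner product.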
P = K^3 processors, organized into a K x K x K 3D mesh.
A (N x N) can be considered as a q x q block matrix, each block (N/q) x (N/q).
Let q = K^(1/3), i.e. consider A as a K^(1/3) x K^(1/3) block matrix, each block (N/K^(1/3)) x (N/K^(1/3)). Similarly for B and C.
26
Matrix Multiply on K^3 CPUs
C_{rs} = sum_{t=1}^{K^(1/3)} A_{rt} B_{ts},   r,s = 1, 2, …, K^(1/3)
Total K^(1/3) * K^(1/3) * K^(1/3) = K block matrix multiplications.
Idea: perform these K matrix multiplications on the K different planes (levels) of the 3D mesh of processors.
Processor (i,j,k) (i,j = 1,…,K) belongs to plane k, and will perform the multiplication A_{rt}*B_{ts}, where k = (r-1)*K^(2/3) + (s-1)*K^(1/3) + t.
Within a plane, an (N/K^(1/3)) x (N/K^(1/3)) matrix multiply runs on K x K processors; use the systolic multiplication algorithm. Within plane k, A_{rt}, B_{ts} and C_{rs} are decomposed into K x K block matrices, each block (N/K^(4/3)) x (N/K^(4/3)).
27
Matrix Multiply
Initial data distribution
Initially, processor (i,j,1) has the (i,j) sub-blocks of all A_{rt} and B_{ts} blocks, for all r,s,t = 1,…,K^(1/3), i,j = 1,…,K.

[Figure: A, B, C drawn as K^(1/3) x K^(1/3) grids of blocks of dimension N/K^(1/3); each block A_{rt}, B_{ts} is further split into K x K sub-blocks on the K x K processors of a plane, with sub-block (i,j) on processor (i,j,1).]

A_{rt} is destined to levels k = (r-1)*K^(2/3) + (s-1)*K^(1/3) + t, for all s = 1,…,K^(1/3).
B_{ts} is destined to levels k = (r-1)*K^(2/3) + (s-1)*K^(1/3) + t, for all r = 1,…,K^(1/3).
28
Matrix Multiply

// Set up input data
On processor (i,j,1): read in the (i,j)-th block of matrices A_{rt} and B_{ts}, 1 <= r,s,t <= K^(1/3); pass the data on to processor (i,j,2).
On processor (i,j,m): make an own copy of A_{rt} if m = (r-1)*K^(2/3) + (s-1)*K^(1/3) + t for some s = 1,…,K^(1/3); make an own copy of B_{ts} if m = (r-1)*K^(2/3) + (s-1)*K^(1/3) + t for some r = 1,…,K^(1/3); pass the data onward to (i,j,m+1).

// Computation
On each plane m: compute A_{rt}*B_{ts} on the K x K processors using the systolic matrix multiplication algorithm; some initial data setup may be needed before the multiplication.

// Summation
Determine the (r0,s0) of the matrix the current processor (i,j,k) works on (integer division): r0 = (k-1)/K^(2/3) + 1; s0 = ((k-1) mod K^(2/3))/K^(1/3) + 1.
Do a reduction (sum) over processors (i,j,m), m = (r0-1)*K^(2/3) + (s0-1)*K^(1/3) + t, over all 1 <= t <= K^(1/3).
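The level arithmetic can be checked with a small script (an assumed sketch; K = 8 is chosen so that K^(1/3) = 2 is an exact integer, and the inverse formulas work on the 0-based offset k-1):

```python
# Assumed example: K = 8 planes, q = K^(1/3) = 2, so r, s, t each range 1..2.
K = 8
q = round(K ** (1 / 3))
assert q ** 3 == K

def level(r, s, t):
    # k = (r-1)*K^(2/3) + (s-1)*K^(1/3) + t
    return (r - 1) * q * q + (s - 1) * q + t

def rs_of_level(k):
    # recover (r0, s0) from the plane index k by integer division
    r0 = (k - 1) // (q * q) + 1
    s0 = ((k - 1) % (q * q)) // q + 1
    return r0, s0

# every (r, s, t) maps to a distinct level 1..K, and (r, s) is recovered
seen = set()
for r in range(1, q + 1):
    for s in range(1, q + 1):
        for t in range(1, q + 1):
            k = level(r, s, t)
            assert 1 <= k <= K and k not in seen
            seen.add(k)
            assert rs_of_level(k) == (r, s)
```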
29
Matrix Multiply
Communication steps: ~K, i.e. P^(1/3).
Maximum number of cpus: the block size N/K^(4/3) = 1 gives K = N^(3/4), i.e. P = K^3 = N^(9/4).
30
Matrix Multiply

If the number of processors is P = K*Q^2, arranged into a K x Q x Q mesh (K planes, each plane Q x Q processors), handle it similarly: decompose A, B, C into K^(1/3) x K^(1/3) blocks; run the different block matrix multiplications in different planes, K multiplications in total; each block multiplication is handled in one plane on Q x Q processors, using any favorable algorithm, e.g. the systolic one.
31
Processor Array in Higher Dimension
Processors P = K^4, arranged into a K x K x K x K mesh.
Similar strategy: divide A, B, C into K^(1/3) x K^(1/3) block matrices; the different multiplications (K in total) are computed on different levels of the first dimension; each block matrix multiplication is done on the K x K x K mesh at one level, repeating the above strategy.
For even higher dimensions, P = K^n, n > 4, handle similarly.
32
Matrix Multiply: DNS Algorithm
Assume:
A, B, C: dimension N x N
P = K^3 processors, organized into a K x K x K 3D mesh.

C_{rs} = sum_{t=1}^{K} A_{rt} B_{ts}

A, B, C are K x K block matrices, each block (N/K) x (N/K); K*K*K block matrix multiplications in total.
Idea: each block matrix multiplication is assigned to one processor. Processor (i,j,k) computes C_{ij} = A_{ik}*B_{kj}; a reduction (sum) over processors (i,j,k), k = 0,…,K-1, is then needed.
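The DNS idea can be traced serially (an assumed NumPy sketch; `blk` is a helper introduced here, and the k-direction reduction is simulated by a plain sum):

```python
import numpy as np

# Assumed example: K = 2, N = 4. "Processor" (i, j, k) computes the single
# block product A_ik * B_kj; C_ij is the reduction (sum) over k.
K, N = 2, 4
nb = N // K
A = np.arange(N * N, dtype=float).reshape(N, N)
B = 3.0 - A

def blk(M, i, j):
    # view of block (i, j) of matrix M
    return M[i*nb:(i+1)*nb, j*nb:(j+1)*nb]

# one block product per processor (i, j, k)
partial = {(i, j, k): blk(A, i, k) @ blk(B, k, j)
           for i in range(K) for j in range(K) for k in range(K)}

# reduction along the k-direction onto plane k = 0
C = np.zeros((N, N))
for i in range(K):
    for j in range(K):
        blk(C, i, j)[:] = sum(partial[(i, j, k)] for k in range(K))
assert np.allclose(C, A @ B)
```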
33
Matrix Multiply: DNS Algorithm
Initial data distribution: A_{ij} and B_{ij} at processor (i,j,0).
Need to transfer A_{ik} (i,k = 0,…,K-1) to processor (i,j,k) for all j = 0,1,…,K-1.
Two steps:
- Send A_{ik} from processor (i,k,0) to (i,k,k);
- Broadcast A_{ik} from processor (i,k,k) to processors (i,j,k).
34
Matrix Multiply
Send A_{ik} from (i,k,0) to (i,k,k).
Broadcast A_{ik} from (i,k,k) to (i,j,k), j = 0,…,K-1, i.e. along the j-direction.
35
Matrix Multiply
Final data distribution for A
A can also be considered to come in through the (i,k) plane, with a broadcast along the j-direction.
36
B Distribution
B distribution: initially B_{kj} is in processor (k,j,0); it must be transferred to processors (i,j,k) for all i = 0,1,…,K-1.
Two steps:
- First send B_{kj} from (k,j,0) to (k,j,k);
- Broadcast B_{kj} from (k,j,k) to (i,j,k) for all i = 0,…,K-1, i.e. along the i-direction.
37
B Distribution
[Figure: K = 4 mesh labeled by (k,j) coordinates (0,0) … (3,3), with axes i, j, k. Send B_{kj} from (k,j,0) to (k,j,k); then broadcast from (k,j,k) along the i-direction.]
38
Matrix Multiply
Final B distribution:
B can also be considered to come in through the (j,k) plane, then broadcast along the i-direction.
39
Matrix Multiply

A_{ik} and B_{kj} on cpu (i,j,k); compute C_{ij} locally; reduce (sum) C_{ij} along the k-direction. Final result: C_{ij} on cpu (i,j,0).
40
Matrix Multiply

The A matrix comes in through the (i,k) plane and is broadcast along the j-direction; the B matrix comes in through the (j,k) plane and is broadcast along the i-direction; the C matrix result goes to the (i,j) plane.