Part 2: Codes for distributed linear data processing in presence of straggling/faults/errors
1
Motivation: nonideal computing systems
M x V for 4 processors on Amazon EC2 cloud system [Ack: Jeremy Bai, CUHK]
Practitioners are already using redundancy to address straggling
2
Organization: How to perform these computations?
Motivation: the critical steps for many compute applications (machine learning: neural nets, LDA, PCA, regression, projections; scientific computing and physics simulations)
1. A x (A is M x N, x is N x 1)
2. A B
...efficiently, fast, in the presence of faults/straggling/errors
Rest of the tutorial is divided into two parts:
I. Big processors [Huang, Abraham '84]
II. Small processors [von Neumann '56]
3
Part I: Big processors (processor memory scales with problem size)
[Architecture: a master node distributes the job to processors 1, ..., P; a fusion node aggregates their outputs]
4
System metrics
[Architecture: master node, processors 1, ..., P, fusion node]
1. Per-processor computation cost: # operations/processor
2. Straggler tolerance (directly related to the "recovery threshold"): max # processors that can be ignored by the fusion node
3. Communication costs: number of bits exchanged between all processors; one can use more sophisticated metrics, see "Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems," Bruck, Ho, Kipnis, Upfal, Weathersby '97
5
I.1: Matrix-vector products (A is M x N, x is N x 1)
6
Parallelization for speeding up matrix-vector products
P processors (master node aggregates outputs)
[A split column-wise into blocks A1, A2, ..., AP of N/P columns each; x split correspondingly into x1, x2, ..., xP]
Operations/processor: MN/P (e.g., P = 3, each does 1/3rd of the computations)
In practice, processors can be delayed ("stragglers") or faulty
Recovery threshold = P, i.e., straggler tolerance = 0
Note: can parallelize by dividing the matrix horizontally as well (row blocks A1, ..., A_{M/P} of M/P rows each)
7
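A minimal sketch (not from the talk; plain NumPy) of this uncoded column split: worker i holds column block A_i and sub-vector x_i, computes A_i x_i, and the master sums the P partial results, so every worker's output is needed.

```python
import numpy as np

def uncoded_column_split_mxv(A, x, P):
    """Uncoded parallelization: worker i computes A_i @ x_i for the i-th column
    block of A; the master (fusion node) adds the P partial results."""
    M, N = A.shape
    assert N % P == 0, "for simplicity, assume P divides N"
    w = N // P
    # Each worker does an M x (N/P) by (N/P) product: MN/P operations.
    partials = [A[:, i * w:(i + 1) * w] @ x[i * w:(i + 1) * w] for i in range(P)]
    return sum(partials)  # all P partial results are needed: straggler tolerance 0

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 12))
x = rng.standard_normal(12)
assert np.allclose(uncoded_column_split_mxv(A, x, P=3), A @ x)
```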
Replication: repeat each job r times
[A split into P/r column blocks of rN/P columns each; every block (and the corresponding part of x) is assigned to r processors]
P processors
Straggler tolerance: r - 1
# operations/processor: rMN/P
Recovery threshold: P - r + 1
Also see: recent works of [Joshi, Soljanin, Wornell]
8
A coding alternative to replication: MDS compute codes ("ABFT")
Algorithm-Based Fault Tolerance [Huang, Abraham '84] [Lee, Lam, Pedarsani, Papailiopoulos, Ramchandran '16]
[Book: Herault and Robert (eds.), Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks series]
Example: P = 3, K = 2: split A row-wise into A1 and A2; the three processors store A1, A2, and A1 + A2, and each multiplies its block by x
Assumption: A known in advance
Can tolerate 1 straggler; # operations per processor = MN/2
In general, use a (P,K)-MDS code (K < M): recovery threshold = K, i.e., straggler tolerance = P - K; # operations/processor = MN/K (> MN/P in uncoded)
P processors
9
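A sketch of the P = 3, K = 2 example above, assuming (as in the slide) that A is split row-wise and the third worker stores A1 + A2; the decoder reconstructs Ax from any two returned results.

```python
import numpy as np

def mds_coded_mxv(A, x, finished):
    """(P=3, K=2) MDS-coded matrix-vector product from the slide's example.

    Workers store A1, A2, and A1 + A2 (A split row-wise into two halves).
    `finished` is the set of workers that returned; any 2 suffice."""
    M, N = A.shape
    A1, A2 = A[: M // 2], A[M // 2 :]
    jobs = {0: A1, 1: A2, 2: A1 + A2}          # each worker does (M/2)*N operations
    y = {i: jobs[i] @ x for i in finished}     # the straggler's result never arrives
    if {0, 1} <= finished:
        return np.concatenate([y[0], y[1]])
    if {0, 2} <= finished:                     # recover A2 x = (A1 + A2) x - A1 x
        return np.concatenate([y[0], y[2] - y[0]])
    if {1, 2} <= finished:                     # recover A1 x = (A1 + A2) x - A2 x
        return np.concatenate([y[2] - y[1], y[1]])
    raise ValueError("need results from at least K = 2 workers")

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 5))
x = rng.standard_normal(5)
for straggler in range(3):
    finished = {0, 1, 2} - {straggler}
    assert np.allclose(mds_coded_mxv(A, x, finished), A @ x)
```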
MDS coded computing of M x V outperforms replication
[Lee et al.]: MDS beats replication in expected time (exponential tail models)
Experiments on Amazon EC2 [Lee et al.]: 35% reduction [Fig courtesy R Pedarsani]
Can trade off # operations/processor for straggler tolerance. Codes for # operations/processor < N?
10
Short-Dot codes [Dutta, Cadambe, Grover '16] [Tandon, Lei, Dimakis, Karampatziakis '16]
[Illustration of the Short-Dot implementation: the matrix-vector product to be computed (a short and fat matrix A times a very long vector x); the parallel processing architecture (master node, processors 1, ..., P, fusion node); the coded matrix B; the values sent to processor 1]
Any sparsity pattern with an equal number of zeros in each row and in each column
Sparsity (i) allows a tradeoff between computation per processor and straggler tolerance; (ii) reduces communication to each processor
# operations/processor = s < N
Recovery threshold = K = P(1 - s/N) + M
11
Short-Dot codes: the construction
"Short-Dot": Computing Large Linear Transforms Distributedly Using Coded Short Dot Products [Dutta, Cadambe, Grover, NIPS 2016]
[B is a P x N matrix; x is N x 1]
Each processor computes a "short" dot product of x with one row of B
Given A, an M x N matrix, M < P, and a parameter K, M < K < P, an (s,K) Short-Dot code consists of a P x N matrix B satisfying:
1) A is contained in the span of any K rows of B
2) Every row of B is s-sparse
12
Achievability and outer bound
Achievability: For any M x N matrix A, an (s, K) Short-Dot code exists with s = (N/P)(P - K + M) (up to integer rounding)
...and the outputs of any K processors suffice, i.e., straggler tolerance = P - K
Outer bound: Any Short-Dot code satisfies s >= (N/P)(P - K + M) - o(N)
...for "sufficiently dense" A
Proof overviews in appendices of this talk
13
Short-Dot strictly and significantly outperforms uncoded/replication/ABFT (MDS)
Exponential tail models
The paper contains an expected completion time analysis for the exponential service time model, and experimental results. For N >> M, decoding complexity is negligible compared to the per-processor computation.
14
Related result: Gradient coding [Tandon, Lei, Dimakis, Karampatziakis '17]
What if some gradient-computing workers straggle?
[Figure: data partitions D1, ..., D9 assigned to workers 1-3; each worker receives the model β and computes a partial gradient g1, g2, g3; the master adds the gradients and updates the model. Figure courtesy A Dimakis]
Want to compute: Σ_i g_i = [1, 1, ..., 1] [g_1; g_2; ...; g_N], i.e., a known "matrix" times a vector that is computed distributedly
Solution: code the "matrix" A (i.e., [1 1 ... 1]) using a Short-Dot code
- introduce redundancy in the datasets consistent with the Short-Dot sparsity pattern
- compute the correct (redundant) gradients at each processor
Can also be viewed as a novel "distributed storage code for computation"
For v^T v, coding can beat replication only due to integer effects: no scaling-sense gain, at least in this coarse model, over replication (see also [Halbawi, Azizan-Ruhi, Salehi, Hassibi '17])
15
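To make the idea concrete, here is a toy gradient-coding instance in the spirit of [Tandon et al. '17], with 3 workers, 3 data partitions, and tolerance to any single straggler; the specific assignment and decoding coefficients below are chosen for illustration and are not taken from the paper.

```python
import numpy as np

# Toy gradient-coding instance: 3 workers, 3 data partitions, 1 straggler tolerated.
# Worker i returns a fixed linear combination of the gradients of its two partitions:
#   w1 = g1 + 2*g2,   w2 = g2 - g3,   w3 = g1 + 2*g3
# For every pair of surviving workers there exist coefficients recovering g1 + g2 + g3.
rng = np.random.default_rng(2)
g1, g2, g3 = (rng.standard_normal(4) for _ in range(3))
full_gradient = g1 + g2 + g3

worker_outputs = {1: g1 + 2 * g2, 2: g2 - g3, 3: g1 + 2 * g3}
decoding = {                      # survivors -> coefficients applied to their outputs
    (1, 2): {1: 1.0, 2: -1.0},
    (2, 3): {2: 1.0, 3: 1.0},
    (1, 3): {1: 0.5, 3: 0.5},
}
for survivors, coeffs in decoding.items():
    recovered = sum(coeffs[i] * worker_outputs[i] for i in survivors)
    assert np.allclose(recovered, full_gradient)
```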
Trend:
- V x V: offers some advantage over replication
- M x V: arbitrary gains over replication, MDS coding
- Next: M x M? Answer: arbitrarily large gains over M x V-type coding!
break!
16
A x B: matrix-matrix products
17
Uncoded parallelization (A and B both N x N)
Let's assume that each processor can store 1/m of A and 1/n of B
[A split into m row blocks A1, ..., Am; B split into n column blocks B1, ..., Bn; total mn processors]
The (i,j)-th processor receives Ai and Bj, computes Ai x Bj, and sends it to the fusion center
# operations/processor = N^3/mn (we'll keep this constant across strategies)
Recovery threshold = P; straggler tolerance = 0
18
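A sketch of this uncoded grid strategy with hypothetical parameters m = 2, n = 3: each of the mn workers multiplies one row block of A by one column block of B, and the fusion center tiles the results.

```python
import numpy as np

def uncoded_block_matmul(A, B, m, n):
    """Uncoded A x B on an m-by-n grid of workers (a sketch of the slide's setup).
    Worker (i, j) holds row block A_i and column block B_j and computes A_i @ B_j;
    the fusion center tiles the mn results. All mn workers are needed."""
    N = A.shape[0]
    row_blocks = np.split(A, m, axis=0)   # each worker stores 1/m of A ...
    col_blocks = np.split(B, n, axis=1)   # ... and 1/n of B
    # Each product A_i @ B_j costs (N/m) * N * (N/n) = N^3 / (mn) operations.
    tiles = [[row_blocks[i] @ col_blocks[j] for j in range(n)] for i in range(m)]
    return np.block(tiles)

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))
assert np.allclose(uncoded_block_matmul(A, B, m=2, n=3), A @ B)
```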
Strategy I: M x V coding applied to M x M
[A's m row blocks coded into T = P/n blocks A1, ..., AT with an MDS code; B split into n uncoded column blocks B1, ..., Bn]
Recovery threshold: P - P/n + m = Θ(P)
# operations/processor: N^3/mn
19
Algorithm-Based Fault Tolerance (ABFT) [Huang, Abraham '84] [Lee, Suh, Ramchandran '17]
[Excerpt from Huang and Abraham, IEEE Trans. Computers, June 1984: row and column checksum matrices are formed by appending summation vectors; the product of a column checksum matrix and a row checksum matrix is a full checksum matrix, and matrix addition, multiplication, scalar product, LU decomposition, and transpose all preserve the checksum property]
Each processor computes a product Ai x Bj
[Figure: A split into row blocks A1, ..., A4; B split into column blocks B1, ..., B4; the highlighted worker computes A1 B1]
# operations/processor: N^3/mn
Recovery threshold: K = 2(m-1)√P - (m-1)^2 + 1 = Θ(√P); straggler resilience: P - K [Lee, Suh, Ramchandran '17]
Next: Polynomial codes [Yu, Maddah-Ali, Avestimehr '17]
Recovery threshold: K = mn; # operations/processor: N^3/mn
20
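A minimal sketch of the Huang-Abraham checksum idea referenced above: append a column-sum row to A and a row-sum column to B; their product is then a full checksum matrix whose row and column checks can detect, and together locate, a corrupted entry.

```python
import numpy as np

def column_checksum(A):
    """Huang-Abraham column checksum matrix: Ac = [A; 1^T A]."""
    return np.vstack([A, A.sum(axis=0)])

def row_checksum(B):
    """Huang-Abraham row checksum matrix: Br = [B, B 1]."""
    return np.hstack([B, B.sum(axis=1, keepdims=True)])

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

# (column checksum of A) @ (row checksum of B) is a full checksum matrix whose
# information part is A @ B (Theorem 4.1 in Huang & Abraham '84).
Cf = column_checksum(A) @ row_checksum(B)
assert np.allclose(Cf[:-1, :-1], A @ B)
assert np.allclose(Cf[-1, :-1], (A @ B).sum(axis=0))  # last row = column sums
assert np.allclose(Cf[:-1, -1], (A @ B).sum(axis=1))  # last column = row sums

# A single corrupted entry violates exactly one column check and one row check,
# which locates it (and the checksums then allow it to be corrected).
Cf_bad = Cf.copy()
Cf_bad[1, 2] += 5.0
bad_col = np.flatnonzero(~np.isclose(Cf_bad[:-1, :-1].sum(axis=0), Cf_bad[-1, :-1]))
bad_row = np.flatnonzero(~np.isclose(Cf_bad[:-1, :-1].sum(axis=1), Cf_bad[:-1, -1]))
assert bad_row.tolist() == [1] and bad_col.tolist() == [2]
```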
Polynomial codes [Yu, Maddah-Ali, Avestimehr '17]
Intuition: forget matrices for this slide
Inputs {Ai}, i = 1..m, and {Bj}, j = 1..n, are distributed to PROC 1, ..., PROC P
GOAL: the decoder wants all products of the form Ai Bj
Constraints:
1) Can only send information of the size of one Ai and one Bj to each processor
2) A processor can only compute a product of its inputs
Solution: send processor p the polynomial evaluations Σ_i α_p^(i-1) Ai and Σ_j α_p^(m(j-1)) Bj, with distinct evaluation points α_p
21
Achievability
You can use random codes, but "polynomial codes" get you there with lower encoding/decoding complexity
Example: m = 2, n = 2
Processor i receives A1 + i·A2 and B1 + i^2·B2 and computes
C_i = (A1 + i·A2)(B1 + i^2·B2) = A1B1 + i·A2B1 + i^2·A1B2 + i^3·A2B2
The fusion center needs the outputs of only 4 such processors! E.g., from processors 1, 2, 3, 4:
[C1; C2; C3; C4] = [1^0 1^1 1^2 1^3; 2^0 2^1 2^2 2^3; 3^0 3^1 3^2 3^3; 4^0 4^1 4^2 4^3] [A1B1; A2B1; A1B2; A2B2]
Decoding = inverting a Vandermonde matrix
In general, recovery threshold = mn (attained using an RS-code-type construction)
22
Summary so far…
- V x V: coding offers little advantage over replication
- M x V: Short-Dot codes provide arbitrary gains over replication, MDS coding
- M x M: polynomial coding provides arbitrary gains over M x V codes
What additional costs come with coding?
- encoding and decoding complexity (skipped here for simplicity)
- Next: degradation is not graceful as you pull the deadline earlier
To see this, let's look at a problem with repeated M x V and slow convergence to the solution
23
Understanding a limitation of coding: coding for linear iterative solutions
Power iterations converge to the PageRank solution:
x^(l+1) = (1 - d) A x^(l) + d r
Converges to x* satisfying x* = (1 - d) A x* + d r
Subtracting, e^(l+1) = (1 - d) A e^(l), where e^(l) = x^(l) - x*
[Plot: convergence of PageRank using power iteration; average mean squared error vs. number of iterations]
Each iteration is an M x V computation whose input is the previous iteration's output
"Coding Method for Parallel Iterative Linear Solver," Y Yang, P Grover, S Kar, Submitted
Next: how to code multiple linear iterative problems in parallel
24
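A minimal sketch of the power iteration on this slide, with a randomly generated column-stochastic matrix standing in for the graph: the error contracts by at least a factor (1 - d) per iteration in the 1-norm, which is the geometric convergence shown in the plot.

```python
import numpy as np

def pagerank_power_iteration(A, r, d=0.15, iters=80):
    """Power iteration x^(l+1) = (1 - d) A x^(l) + d r for a column-stochastic A.
    Since e^(l+1) = (1 - d) A e^(l), the error shrinks by at least (1 - d) per
    iteration in the 1-norm."""
    x = r.copy()
    history = []
    for _ in range(iters):
        x = (1 - d) * (A @ x) + d * r
        history.append(x.copy())
    return x, history

rng = np.random.default_rng(6)
W = rng.random((50, 50))
A = W / W.sum(axis=0, keepdims=True)       # random column-stochastic "graph" matrix
r = np.zeros(50); r[7] = 1.0               # personalization (query) vector

x_star, _ = pagerank_power_iteration(A, r, iters=2000)   # essentially the fixed point
_, hist = pagerank_power_iteration(A, r, iters=40)
errors = [np.linalg.norm(x - x_star, 1) for x in hist]
ratios = [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]
assert max(ratios) <= (1 - 0.15) + 1e-9    # geometric decay, as in the slide's plot
```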
Solving multiple iterative problems in parallel: classical coded computation applied to personalized PageRank
(The solution is linear in the query vector r, so queries can be coded)
[Figure: k = 2 queries r1, r2 encoded into inputs s1, s2, s3 for 3 workers; each worker runs power iterations; the decoder applies G̃^{-1} to the collected outputs]
Initialize (Encoding): [s_1, ..., s_P] = [r_1, ..., r_k] · G, with G a k x P generator matrix
Parallel Computing: l_i power iterations at the i-th worker with input s_i; Y^(T_dl) (N x P) = [y_1^(l_1), ..., y_P^(l_P)]
Post Processing (Decoding): matrix inversion on the fastest k processors, X̂^T = G̃^{-1} (Y^(T_dl))^T
Is this invertible? Yes! Is this well conditioned? No!
25
Classical coded computation for personalized PageRank: errors blow up!
Decoding: X̂^T = G̃^{-1} (Y^(T_dl))^T
- e.g., 120 processors; 100 PageRank problems
- Decode using the fastest 100 processors (generator: the first 100 rows of a 120 x 120 DFT matrix)
- The decoding matrix is ill-conditioned w.h.p., so errors are blown up at small deadlines!
Experiments on CMU clusters
[Plot: Google Plus graph; average mean-squared error vs. computation deadline T_dl for ABFT (original coded method in [Lee et al.]) and a natural extension of ABFT (extension of the coded method in [Lee et al.])]
What is the effect of a poor condition number? Error blows up!
Similar issues arise in designing good "analog coding with erasures" [Haikin, Zamir ISIT'16] [Haikin, Zamir, Gavish '17]
26
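The conditioning problem can be seen numerically. Following the setup above (a (120, 100) code whose generator is the first 100 rows of a 120 x 120 DFT matrix), the sketch below draws random subsets of 100 "fastest" workers and reports the condition number of the resulting 100 x 100 decoding matrix; it is typically far above 1, so the matrix-inverse decode amplifies the workers' residual errors. This is the same phenomenon studied in the "analog coding with erasures" line of work cited above.

```python
import numpy as np

# (P, k) = (120, 100) code whose generator is the first k rows of a P x P DFT
# matrix. The "fastest k workers" are modeled here as a uniformly random subset.
rng = np.random.default_rng(7)
P, k = 120, 100
F = np.fft.fft(np.eye(P)) / np.sqrt(P)     # unitary DFT matrix
G = F[:k, :]                               # k x P generator

conds = []
for _ in range(200):
    S = rng.choice(P, size=k, replace=False)
    conds.append(np.linalg.cond(G[:, S]))  # condition number of the decoding matrix G_S
print(f"median cond(G_S) = {np.median(conds):.1e}, max = {np.max(conds):.1e}")
# The condition numbers are typically large (with a heavy tail), so the
# matrix-inverse decode amplifies whatever residual error the workers still have.
```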
A graceful degradation with time: coded computing with weighted least squares
Proposed algorithm: weighted combination of processor outputs
Initialize (Encoding): [s_1, ..., s_P] = [r_1, ..., r_k] · G
Parallel Computing: l_i power iterations at the i-th worker with input s_i; Y^(T_dl) = [y_1^(l_1), ..., y_P^(l_P)]
Post Processing (Decoding): X̂^T = (G Λ^{-1} G^T)^{-1} G Λ^{-1} (Y^(T_dl))^T, with Λ a per-worker weighting matrix
Similar to the "weighted least-squares" solution
27
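A sketch of inverse-based decoding vs. a weighted least-squares decode on synthetic data (an illustration of the idea, not the talk's algorithm verbatim): a Vandermonde generator stands in for the code, and per-worker error variances stand in for how far each worker's power iteration has converged by the deadline. Down-weighting the workers that are far from convergence is what gives the graceful degradation with the deadline.

```python
import numpy as np

rng = np.random.default_rng(8)
N, k, P = 200, 4, 6
X = rng.standard_normal((k, N))            # the k solutions we want (one per query)
# Hypothetical k x P generator (a Vandermonde matrix stands in for the code here).
G = np.vander(np.arange(1, P + 1, dtype=float), k, increasing=True).T

# Worker i returns its coded solution plus residual error of variance var[i]
# (a stand-in for "worker i only finished l_i power iterations by the deadline").
var = np.array([1e-8, 1e-8, 1e-2, 1e-8, 1e-2, 1e-8])
Y = G.T @ X + rng.standard_normal((P, N)) * np.sqrt(var)[:, None]

# (a) matrix-inverse decode on the "fastest" k workers only (say workers 0-3)
S = [0, 1, 2, 3]
X_inv = np.linalg.solve(G[:, S].T, Y[S])

# (b) weighted least squares over all returned workers, weight = 1 / error variance
W = np.diag(1.0 / var)
X_wls = np.linalg.solve(G @ W @ G.T, G @ W @ Y)

print("inverse decode, relative error:", np.linalg.norm(X_inv - X) / np.linalg.norm(X))
print("weighted LS,    relative error:", np.linalg.norm(X_wls - X) / np.linalg.norm(X))
```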
[Experimental figures (Yang, Grover, Kar): average mean-squared error vs. computation deadline T_dl on the Google Plus and Twitter graphs. Panels compare (i) the weighted least-squares decoder (Algorithm 1) against ABFT and a natural extension of ABFT, (ii) DFT, random binary, random sparse, and Gaussian generator matrices (the DFT code performs best), and (iii) uncoded, replication-based, and coded personalized PageRank for correlated queries]
Weighted least squares outperforms the competition and degrades gracefully with an early deadline
28
Summary thus far…
ABFT ≠ coded computation
New codes, new problems, new analyses, converses
But we need to be careful in lit-searching the ABFT literature
Next: small processors
29
Break!
Questions/comments? Your favorite computation problem?
Preview of Part II: Small processors
- Controlling error propagation with small processors/gates
- No central processor to distribute/aggregate
- Encoding/decoding also have errors
30
Part II: “Small processors”
has so far received relatively less attention
31
What are small processors?
1) Logic gates
2) Analog “Nanofunctions” and beyond CMOS devices
3) Processors with limited memory (i.e., ALL processors are small!) - can’t assume that processor memory increases with problem size
Synthesize large reliable computations using small processors?
[Figure panels from the graphene dot-product (GDOT) work: schematics of the GDOT operation, GDOT vs. CMOS simulations, % error and SNR vs. time constant τ, energy per operation vs. τ, wafer-scale fabrication, and measured output voltages]
e.g. Dot product “nanofunction” in graphene [Pop, Shanbhag, Blaauw labs ’15-’16]
32
What is fundamentally new in small processor computing?
1) Errors accumulate; information dissipates
Noisy circuits built with noisy gates
a) Info-dissipation in noisy circuits: a noisy circuit maps D binary inputs to a binary output with some error probability.
Chain X → Y → Z, where the link X → Y is a BSC(ε).
Classical Data-Processing Inequality: $\dfrac{I(X;Z)}{I(X;Y)} \le 1$
"Strong" Data-Processing Inequality: $\dfrac{I(X;Z)}{I(X;Y)} \le f(\epsilon) < 1$
[Pippenger '88] [Evans, Schulman '99] [Erkip, Cover '98] [Polyanskiy, Wu '14] [Anantharam, Gohari, Nair, Kamath '14] [Raginsky '14]
b) Distortion accumulation with quantization noise (e.g. in "data summarization", consensus, etc.)
[Diagram: in-network computation toward a sink. Leaf nodes forward quantized weighted observations $q(w_1x_1), q(w_2x_2), q(w_3x_3)$; intermediate nodes re-quantize partial sums, e.g. $\hat{y}_4 = q(w_1x_1) + q(w_2x_2) + q(w_3x_3 + w_4x_4)$ and $\hat{y}_0 = q(\hat{y}_4) + q(\hat{y}_5)$, so quantization distortion accumulates along the way.]
An application of cut-set bound [Cuff, Su, El Gamal '09]: for links $i$ crossing a cut $\mathcal{S}$,
$R_{i \to \mathcal{PN}(i)} \;\ge\; \tfrac{1}{2}\log_2 \dfrac{\sigma_i^2}{D_i}$
Incremental-distortion bound [Yang, Grover, Kar IEEE Trans IT'17]:
$R_{i \to \mathcal{PN}(i)} \;\ge\; \tfrac{1}{2}\log_2 \dfrac{\sigma_i^2}{\Delta D_i} - O\!\big(D_i^{1/2}\big)$
tighter by an unbounded factor
33
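One concrete instance of the strong data-processing inequality, stated here for illustration (this particular contraction coefficient is the standard one for the binary symmetric channel; it is not quoted from the slides):

For any Markov chain $W \to X \to Y$ in which the link $X \to Y$ is a $\mathrm{BSC}(\epsilon)$,
\[
I(W;Y) \;\le\; (1-2\epsilon)^2\, I(W;X),
\]
so a cascade of $n$ such noisy links retains at most a fraction $(1-2\epsilon)^{2n}$ of the original mutual information: information dissipates geometrically with depth.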
What is fundamentally new in small processor computing?
1) Errors accumulate; information dissipates
2) Decoding, and possibly encoding, also error prone
Error-prone decoding (often message-passing for LDPCs) [Taylor '67] [Hadjicostis, Verghese '05] [Vasic et al. '07-'13] [Varshney '11] [Grover, Palaiyanur, Sahai '10] [Huang, Yao, Dolecek '14] [Gross et al. '13] [Vasic et al. '16]
Error-prone encoding [Yang, Grover, Kar '14] [Dupraz et al. '15] - see also erasure version [Hachem, Wang, Fragouli, Diggavi '13]
Essential to analyze decoding/encoding costs in noisy computation: there may be no conceptual analog of Shannon capacity in computing problems [Grover et al. '07-'15] [Grover ISIT '14] [Blake, Kschischang '15, '16]
Can we compute M x V reliably using error-prone gates? Is it even possible?
We'll next discuss this for 1) Gates; 2) Processors
34
M x V on noisy gates: the basics
Linear transform (output = input × transform):
$[r_1, r_2, \ldots, r_K] \;=\; [s_1, s_2, \ldots, s_L]\,\underbrace{A}_{L\times K}$
Coded output (systematic generator matrix $G = [\,I_{K\times K}\mid P\,]$):
$[x_1, x_2, \ldots, x_N] \;=\; [s_1, s_2, \ldots, s_L]\,\underbrace{A}_{L\times K}\,\underbrace{[\,I_{K\times K}\mid P\,]}_{K\times N}$
$\tilde{G}$ ($= A\,G$): coded generator matrix - PRECOMPUTED NOISELESSLY
Note: rows of $\tilde{G}$ are also codewords of $G$!
Encoded computation: multiply $s$ with $\tilde{G}$. Decoding: use parity-check matrix $H$ for $G$.
35
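A minimal sketch of this pipeline over GF(2) (not from the slides): it assumes a toy systematic code $G=[I\mid P]$ with parity-check $H=[P^{\mathsf T}\mid I]$, precomputes $\tilde G = A\,G$ noiselessly, treats the multiplication by $\tilde G$ as the (potentially noisy) step, and checks the result with $H$.

import numpy as np

rng = np.random.default_rng(0)
L, K = 3, 4                      # input length, transform output length
A = rng.integers(0, 2, (L, K))   # the linear transform to be computed (over GF(2))

# Toy systematic (7,4) Hamming-like code: G = [I | P], H = [P^T | I]
P = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]])
G = np.hstack([np.eye(K, dtype=int), P]) % 2          # K x N generator
H = np.hstack([P.T, np.eye(P.shape[1], dtype=int)])   # (N-K) x N parity-check

G_tilde = (A @ G) % 2            # coded generator matrix, precomputed noiselessly

s = rng.integers(0, 2, L)        # input vector
x = (s @ G_tilde) % 2            # encoded computation (the noisy step in practice)

# Decoding side: syndrome check with the parity-check matrix of G
assert np.all((H @ x) % 2 == 0), "x should be a codeword of G"
r = x[:K]                        # systematic part = the desired output s @ A (mod 2)
assert np.array_equal(r, (s @ A) % 2)
print("recovered output:", r)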
A difficulty with this approach: error propagation
Naive computation of $x = s\,\tilde{G}$ requires computing
$x_i \;=\; \sum_j s_j\, g_{ji}$
as a chain $s_1 g_{1i} \to s_2 g_{2i} \to \cdots \to s_L g_{Li}$,
requiring L AND gates, L-1 XOR gates.
Error accumulates! As L → ∞, each $x_i$ approaches a random coin flip.
36
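To see the coin-flip limit quantitatively (a standard calculation, under the assumption that each of the roughly $2L-1$ gates independently flips its output with probability $\epsilon$ and the flips add into the running sum):
\[
\Pr[x_i \text{ is wrong}] \;=\; \frac{1 - (1-2\epsilon)^{2L-1}}{2} \;\xrightarrow[\;L\to\infty\;]{}\; \frac{1}{2},
\]
since the XOR of $m$ independent $\mathrm{Bernoulli}(\epsilon)$ error bits equals $1$ with probability $\tfrac{1}{2}\big(1-(1-2\epsilon)^m\big)$.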
Addressing error accumulation: a simple observation
$x \;=\; s\,\tilde{G} \;=\; [s_1, s_2, \ldots, s_k]\begin{bmatrix} \tilde{g}_1 \\ \tilde{g}_2 \\ \vdots \\ \tilde{g}_k \end{bmatrix} \;=\; s_1\tilde{g}_1 + s_2\tilde{g}_2 + \ldots + s_k\tilde{g}_k$
(codeword = source sequence × generator matrix)
A valid codeword. Can be corrected for errors.
Any correctly computed partial sum is a valid codeword
- possibly correct compute errors by embedding decoders inside encoder
- Use LDPC codes: utilize results on noisy decoding (we used [Tabatabaei, Cho, Dolecek '14])
37
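A small illustration of the observation (same assumed toy code as in the earlier sketch, $G=[I\mid P]$, $H=[P^{\mathsf T}\mid I]$): every correctly computed partial sum $s_1\tilde g_1 + \cdots + s_j\tilde g_j$ has zero syndrome, so an embedded decoder can clean up the running result after each accumulation step.

import numpy as np

rng = np.random.default_rng(1)
K = 4
P = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]])
G = np.hstack([np.eye(K, dtype=int), P]) % 2   # K x N systematic generator
H = np.hstack([P.T, np.eye(3, dtype=int)])     # parity-check matrix for G

# Rows of the coded generator matrix are themselves codewords of G
A = rng.integers(0, 2, (K, K))
G_tilde = (A @ G) % 2

s = rng.integers(0, 2, K)
partial = np.zeros(G.shape[1], dtype=int)
for j in range(K):
    partial = (partial + s[j] * G_tilde[j]) % 2   # accumulate one term
    syndrome = (H @ partial) % 2                  # zero for every partial sum
    print(f"after term {j + 1}: syndrome = {syndrome}")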
“ENCODED”: ENcoded COmputation with Decoders EmbeddeD (with decoding also being noisy)
[Figure: a chain of COMPUTE & CORRECT units. The inputs $s_1, s_2, s_3, \ldots, s_k$ enter successive units; the running result after the last unit is the CODEWORD $s_1\tilde{g}_1 + s_2\tilde{g}_2 + \cdots + s_k\tilde{g}_k$, a valid codeword that can be corrected for errors. Each COMPUTE & CORRECT unit is a NOISY COMPUTATION followed by NOISY DECODING; e.g. after two stages the register holds $s_1\tilde{g}_1 + s_2\tilde{g}_2$.]
Better yet: ENCODED-Tree
[Figure: the ENCODED Tree - a tree of C&C (compute-and-correct) units, each combining D codewords; the bit error rate stays between $p_{\text{reg}}$ and $p_{\text{lim}}$ as a function of stage index.]
Reflections of a converse [Evans, Schulman '99] in our achievability
[Excerpt from the ENCODED paper [Yang, Grover, Kar], reproduced on the slide:]
Fig. 1. (a) shows a compute-and-correct structure. (b) shows the noisy encoding tree. During the encoding process, the bit error rate is bounded between two constants p_reg and p_lim shown in (c). The bit error rate evolution in one embedded decoder is shown in (d).
... the number of computations required per bit C_b must be at least Ω(log(1/p_tar) / log(1/ϵ)), where ϵ is the error probability of noisy gates. Proof: see Appendix C.
Remark 1. Since each computation consumes a certain amount of power, the processing power cannot remain bounded when the error probability p_tar approaches zero. This claim is consistent with the recent result that the Shannon waterfall curve does not hold when decoding power is considered [13]. In the next section, we will build a noisy encoder with noisy decoders embedded into it, which achieves P_bit smaller than p_tar and complexity C_b of the order O(log(1/p_tar)). The achieved rate is also analyzed.
III. NOISY ENCODING WITH EMBEDDED DECODERS
In part III-A, we provide the construction of a noisy encoder that meets the lower bound in Theorem 1. Within our problem formulation, the encoder with embedded decoders may address limitations of some other natural strategies, which is explained in part III-B. The main theorem, which provides the encoder construction, is given in Section IV.
A. The Noisy Encoder Construction and Working Principles
The noisy encoder structure is shown in Fig. 1(b). We use a D-branch tree with depth M to do the encoding. The message s = (s_1, ..., s_K) is input from the leaf nodes. The output x = sG = (x_1, ..., x_N) is calculated from bottom to top and finally obtained at the root. In the M-th level, i.e., the bottom level, the first K leaf nodes have rows of the generator matrix g_1 to g_K stored in them. At the start of the encoding process, the k-th node of the first K nodes calculates s_k · g_k^T using N noisy AND gates and stores it as an intermediate result. The remaining D^{M-1} - K leaf nodes store all-zero codewords as intermediate results. In the upper levels, each node performs a component-wise XOR-operation of the D intermediate results from its D children-nodes. Therefore, if no gate errors occur, the root gets the binary sum of all s_k · g_k^T, which is the correct codeword x = sG. However, in order to deal with errors caused by noisy gates, each non-leaf tree node is changed to a compute-and-correct unit shown in Fig. 1(a), constituted by 2E D-fan-in noisy XOR gates, a 2E-bit register and a separate Gallager B decoder, where E is the number of edges in the LDPC bipartite graph. Each register stores 2E bits instead of N bits, because the number of messages during each Gallager B decoding operation is 2E. We store these 2E bits as intermediate results instead of storing the N-bit codeword; note that by storing these 2E bits, the corresponding codeword is stored as well. The noisy XOR gates are used to perform the component-wise XOR-operation of the intermediate results, which are D·2E bits from the D children nodes. The noisy Gallager B decoder is used to correct errors in the associated register after the XOR-operation. Before sending the output to the parent-node, each node performs C iterations of the message-passing decoding with the embedded decoder. We will show that it suffices to use C = 3. In the noisy decoder, the error probability follows the density evolution p^(i+1) < f(p^(i)), where i is the iteration index and the explicit expression of the function f(·) is given in Theorem 3. This evolution is illustrated in Fig. 1(d). In the encoding process, the XOR-operations introduce errors, while the Gallager B decoding process suppresses them.
Algorithm 1: Noisy Encoding
INPUT: Message bits s = (s_1, s_2, ..., s_K). OUTPUT: Channel input bits x = (x_1, x_2, ..., x_N).
INITIALIZE: Write s_l · g_l^T in the registers of v_M^l, 1 ≤ l ≤ K. Write 0^T in the registers of other nodes. Copy each vector to 2E bits and store them as the first layer of intermediate results x_M^l, 1 ≤ l ≤ D^{M-1}.
FOR m from M-1 to 1:
• Each node v_m^l calculates the XOR of the outputs from its D children-nodes and writes the result in the 2E-bit register: $x_m^l = \bigoplus_{v \in \mathcal{D}(v_m^l)} x_v$, 1 ≤ l ≤ D^{m-1};  (10)
• Each node v_m^l performs C iterations of the message-passing decoding.
END. Change the 2E-bit vector x_1^1 back to the N-bit codeword x_1^1. Output x_1^1.
Moral: can overcome info loss on each link by collecting info over many links
38
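A toy simulation of the "compute, then repeatedly correct" idea (this is not the construction above: a simple repetition code with majority voting stands in for the LDPC code with Gallager B decoding, and the noise model is an assumed independent per-bit flip probability). Each stage XORs in the next term through noisy gates; the embedded decoder suppresses accumulated errors before the next stage.

import numpy as np

rng = np.random.default_rng(2)
eps, k, n, rep = 0.02, 64, 64, 5   # gate error prob, #terms, base length, repetition factor

def noisy_xor(a, b):
    """XOR computed by 'noisy gates': each output bit flips independently w.p. eps."""
    flips = (rng.random(a.shape) < eps).astype(int)
    return (a + b + flips) % 2

def correct(cw):
    """Embedded decoder for the length-(n*rep) repetition code: per-position majority vote."""
    decoded = (cw.reshape(rep, n).sum(axis=0) > rep // 2).astype(int)
    return np.tile(decoded, rep)

g = rng.integers(0, 2, (k, n))
G_rep = np.tile(g, (1, rep))       # each row repeated rep times -> a repetition-code codeword
s = rng.integers(0, 2, k)

acc_plain = np.zeros(n * rep, dtype=int)   # accumulate with noisy gates, no correction
acc_enc = np.zeros(n * rep, dtype=int)     # accumulate with noisy gates + embedded decoder
for j in range(k):
    term = (s[j] * G_rep[j]) % 2
    acc_plain = noisy_xor(acc_plain, term)
    acc_enc = correct(noisy_xor(acc_enc, term))

truth = np.tile((s @ g) % 2, rep)
print("uncoded bit error rate  :", np.mean(acc_plain != truth))
print("with repeated correction:", np.mean(acc_enc != truth))

With these (assumed) parameters the uncorrected accumulator drifts toward coin-flip outputs while the corrected one stays close to the true codeword, which is the qualitative behavior the ENCODED construction achieves with far better rate.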
ENCODED vs Uncoded and Repetition
ENCODED provably requires fewer gates, and less energy than repetition in scaling sense [Yang, Grover, Kar IEEE Trans. Info Theory ’17]
[Plots: bit error ratio (Pe) vs. stage index in the tree structure, comparing "ENCODED" (using LDPC codes) - average bit error ratio from simulation, bracketed by theoretical lower and upper bounds and staying at the 10⁻³ level - with "Uncoded", whose bit error ratio is orders of magnitude larger.]
Moral: repeated error-correction can fight information dissipation
Theorem (Error correction with ENCODED-Tree) [Yang, Grover, Kar Allerton '14]: LDPC codes of sufficiently large girth can keep errors contained through repeated error suppression.
Using general device models, focusing specifically on spintronics.
Next: How do these insights apply to processors of limited memory (but > 1 gate)?
39
M x V on small (but reliable) processors
Let's first understand M x V on reliable processors. "SUMMA": Scalable Universal Matrix Multiplication Algorithm - a widely used algorithm [van de Geijn and Watts '95]
[Figure - Naive M x V computation (Ax): the matrix A is split row-wise into blocks A_1, ..., A_{P-1}, A_P, the entire input vector x is broadcast to all processors, and the outputs are aggregated to give Ax. SUMMA (general M-V multiplication): A is split into a grid of blocks A_11, ..., A_1c, ..., A_r1, ..., A_rc and x = [x_1, x_2, ..., x_c]; each processor multiplies its block by the corresponding piece of x, and the partial products are combined by tree aggregation.]
40
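A minimal sketch of the two layouts (plain numpy run serially as a stand-in for the P processors; grid and block sizes are illustrative choices, not from the slides):

import numpy as np

rng = np.random.default_rng(3)
M, N, r, c = 12, 6, 4, 3            # matrix size and a 4 x 3 processor grid
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

# Naive layout: each of the P = r*c processors holds a block of rows and the entire x
row_blocks = np.array_split(A, r * c, axis=0)
y_naive = np.concatenate([blk @ x for blk in row_blocks])

# SUMMA-style layout: processor (i, j) holds block A_ij and only the piece x_j;
# partial products along each block-row are then aggregated (tree aggregation in practice)
col_splits = np.array_split(np.arange(N), c)
y_summa = np.zeros(M)
for rows in np.array_split(np.arange(M), r):
    partial = np.zeros(len(rows))
    for cols in col_splits:
        partial += A[np.ix_(rows, cols)] @ x[cols]    # work of one grid processor
    y_summa[rows] = partial

assert np.allclose(y_naive, A @ x) and np.allclose(y_summa, A @ x)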
Coded SUMMA for M x V on error-prone processors [in prep.]
[Figure - General Coded Computing: the row blocks A_1, ..., A_{P-1}, A_P of matrix A are augmented with coded blocks B_{P+1}, ..., B_{P/R} (ABFT/MDS coding); the entire input vector is broadcast and outputs are aggregated. ENCODED (using LDPC): the blocked matrix A_11, ..., A_1c, ..., A_r1, ..., A_rc is augmented with coded blocks B_s1, ..., B_sc, and partial results are combined by tree aggregation with repeated error correction.]
41
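One concrete, classical instance of the ABFT/MDS idea mentioned above (illustrative only, not the in-prep construction): a single checksum block lets the fusion node detect that some processor's output is corrupted; with more parity blocks (MDS-style) the faulty block can also be located and corrected.

import numpy as np

rng = np.random.default_rng(4)
P, rows_per_block, N = 4, 3, 6
blocks = [rng.standard_normal((rows_per_block, N)) for _ in range(P)]
checksum_block = sum(blocks)               # ABFT-style parity: B_{P+1} = A_1 + ... + A_P
x = rng.standard_normal(N)

# Each processor computes its block-vector product; one output is silently corrupted
outputs = [blk @ x for blk in blocks] + [checksum_block @ x]
faulty = 2
outputs[faulty] = outputs[faulty] + rng.standard_normal(rows_per_block)

# Fusion node: the sum of the data outputs should match the parity output
residual = sum(outputs[:P]) - outputs[P]
print("fault detected:", not np.allclose(residual, 0))
# Row-wise checksums within each block, or additional MDS parity blocks,
# would additionally locate and correct the erroneous block.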
Summary of Part II.2
What is fundamentally new in small vs large processors?
0) Memory limitations: necessitate algorithms like SUMMA
1) Errors accumulate; information dissipates
2) Decoding also error prone
Embed (noisy) decoders to repeatedly suppress errors, limiting info dissipation
42
Coded Map-reduce - not covered in detail here, but belongs thematically [Li-Avestimehr-Maddah-Ali 2015]
Map-reduce: A widely used framework for parallelizing a variety of tasks
• Simple to learn, very scalable
Three phases: first phase Map( ); second phase, data exchange (usually called shuffle); third phase Reduce( )
Idea of coded map-reduce (a toy sketch follows below):
• Introduce redundancy in the map phase
• Exploit information theory ideas (a la coded caching) to minimize communication cost in data exchange
• Save on overall time-to-completion by tuning correctly
Lots of follow up work, exciting area of research!
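A toy, hand-worked instance of the coded-shuffle idea (3 nodes, 3 files, each file mapped at 2 nodes; file placement, value sizes, and the single-multicast schedule are illustrative assumptions, not the general scheme):

import numpy as np

rng = np.random.default_rng(5)
B = 8  # bits per intermediate value

# 3 nodes, 3 files; node i maps every file except file i (computation redundancy r = 2).
maps_of = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
# v[k][f]: intermediate value for reduce-key k produced from file f (random bit-vectors here).
v = [[rng.integers(0, 2, B) for f in range(3)] for k in range(3)]
# Node i reduces key i, so it is missing exactly v[i][i] (file i is the one it did not map).

# Uncoded shuffle: 3 unicast transmissions (one missing value per node).
# Coded shuffle: node 0 mapped files 1 and 2, so it holds both v[1][1] and v[2][2]
# and can serve nodes 1 and 2 with a single XOR multicast.
multicast = v[1][1] ^ v[2][2]
recovered_at_1 = multicast ^ v[2][2]   # node 1 mapped file 2, so it already knows v[2][2]
recovered_at_2 = multicast ^ v[1][1]   # node 2 mapped file 1, so it already knows v[1][1]
assert np.array_equal(recovered_at_1, v[1][1])
assert np.array_equal(recovered_at_2, v[2][2])
# v[0][0] still needs one plain transmission: 2 transmissions instead of 3.
# Splitting values into finer segments recovers the full factor-r communication saving.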
Broader view of coded distributed computing
Conventional "division of labor" approach:
- design a "good" algorithm with low Turing complexity
- engineer deals with real world costs and imperfections
This tutorial: an information-theoretic approach:
- model system costs and imperfections and,
- derive fundamental information-theoretic limits,
- obtain optimal strategies for these models
46
Our thanks to…
Funding sources:
center of the Semiconductor Research Corporation
Help with talk and slides:
Collaborators:
- Mohammad Ali Maddah Ali - Salman Avestimehr - Alex Dimakis - Gauri Joshi - Kangwook Lee - Ramtin Pedarsani
- Soummya Kar - Kishori Konwar - Nancy Lynch - Muriel Medard - Prakash N Moorthy - Peter Musial - Zhiying Wang
Student collaborators: - Rami Ali - Jeremy Bai - Malhar Chaudhari - Sanghamitra Dutta - Mohammad Fahim - Farzin Haddadpour - Haewon Jeong - Yaoqing Yang
National Science Foundation (NSF)
47
Appendices/Backup slides
48
Weak scaling: Number of processors scales with problem size
- constant computational workload per processor
Strong scaling: Problem size fixed!
- finding the "sweet-spot" in number of processors
- too many processors => high comm overhead
- too few => not enough parallelization
Related: gate-level errors - error/fault-tolerant computing
49
Related problem: Minimizing total power in communication systems
$P_{\text{total}} = P_T + P_{\text{enc}} + P_{\text{dec}}$
New goal: Design a $P_{\text{total}}$-efficient code
[Diagram: Transmitter → Channel (transmit power $P_T$) → Receiver; message $M$ in, estimate $\widehat{M}$ out. Errors only in the channel; encoding/decoding noiseless.]
Channel model: BSC with crossover probability $p_{ch} = Q\!\left(\sqrt{\tfrac{P_T}{N_0/2}}\right)$
Circuit implementation / energy model: "Information-Friction" [Grover, IEEE Trans IT 2015] [Blake, Ph.D. thesis UToronto, 2017] - by analogy with mechanical friction ($E_{\text{friction}} = \mu\, w\, d$ to drag a weight $w$ over a distance $d$), moving $B$ bits over a distance $d$ on chip costs $E_{\text{info-friction}} = \mu\, B\, d$.
50
Fundamental limits on total communication energy
Theorem [Grover, IEEE Trans. Info Theory '15]: for any code, and any encoding & decoding algorithm implemented in the circuit model,
$E_{\text{enc,dec per-bit}} \;\gtrsim\; \sqrt{\dfrac{\log\tfrac{1}{P_e}}{P_T}}$
builds on [El Gamal, Greene, Peng '84] [Grover, Woyach, Sahai '11] [Grover, Goldsmith, Sahai '12] [Grover et al. '07-15] [Thompson '80]
[Plots: $\log_{10}(P_e)$ vs. power (watts). Left: total power compared against the Shannon limit (Tx power alone) and against keeping the transmit power fixed. Right: $P_T$ and $P_{\text{total}}$ grow as $\sqrt[3]{\log\tfrac{1}{P_e}}$ when jointly optimized, while $P_{\text{total}}$ with bounded $P_T$ grows as $\sqrt{\log\tfrac{1}{P_e}}$.]
Straightforward extension to noisy computing of invertible linear transforms [Grover, ISIT'14]: don't aim for "Shannon capacity of noisy computing"!
51
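A quick back-of-the-envelope check of the cube-root scaling, treating the bound above as $E_{\text{enc,dec}} \approx c\sqrt{\log(1/P_e)/P_T}$ per bit and optimizing the split between transmit and circuit power (the constant $c$ is a stand-in, not from the slides):
\[
P_{\text{total}}(P_T) \;\approx\; P_T + c\sqrt{\frac{\log\tfrac{1}{P_e}}{P_T}},
\qquad
\frac{d P_{\text{total}}}{d P_T} = 1 - \frac{c}{2}\sqrt{\log\tfrac{1}{P_e}}\; P_T^{-3/2} = 0
\;\Rightarrow\;
P_T^\star = \left(\frac{c}{2}\sqrt{\log\tfrac{1}{P_e}}\right)^{2/3},
\]
so both $P_T^\star$ and $P_{\text{total}}(P_T^\star)$ scale as $\big(\log\tfrac{1}{P_e}\big)^{1/3}$, while holding $P_T$ bounded leaves $P_{\text{total}} = \Theta\big(\sqrt{\log\tfrac{1}{P_e}}\big)$.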
Short Dot Achievability
$\underbrace{B}_{P\times N}\;\underbrace{x}_{N\times 1}$: each row of B has at most s nonzeros; K = P − r + 1
$\underbrace{B}_{P\times N} \;=\; \underbrace{R}_{P\times K}\,\begin{bmatrix} A \\ Z \end{bmatrix}$, with $A$ of size $M\times N$ and $Z$ of size $(K-M)\times N$
R: any square submatrix invertible (e.g. gen matrix of MDS code; transposed)
Rows of A lie in the span of any K rows of B
i-th column of Z chosen to set zeroes in the i-th column of B
Equation/variable counting gives $s \;\le\; \dfrac{N}{P}\,(P - K + M)$
52
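A sketch of the recovery property only (the sparsity step of choosing Z to place zeros in B is omitted; here Z is random and R is a random Gaussian matrix, which has every K×K submatrix invertible with probability 1, standing in for a transposed MDS generator):

import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
P, N, M = 6, 8, 2
r = 2                      # stragglers tolerated
K = P - r + 1              # any K returned dot products suffice

A = rng.standard_normal((M, N))           # the product we actually want: A @ x
Z = rng.standard_normal((K - M, N))       # extra rows (in Short-Dot, chosen to create zeros in B)
F = np.vstack([A, Z])                     # K x N
R = rng.standard_normal((P, K))           # any K x K submatrix invertible w.p. 1
B = R @ F                                 # P x N; row i is processor i's dot-product vector

x = rng.standard_normal(N)
dots = B @ x                              # processor i returns <B_i, x>

# Fusion node: use any K of the P results to recover A @ x
for surviving in combinations(range(P), K):
    R_inv = np.linalg.solve(R[list(surviving)], np.eye(K))   # invert the K x K submatrix of R
    Fx = R_inv @ dots[list(surviving)]                       # recovers F @ x
    assert np.allclose(Fx[:M], A @ x)
print("A @ x recoverable from any", K, "of", P, "processors")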
Short Dot outer bound intuition
[Figure: B is P×N, each row with at most s nonzeros, multiplying x of length N]
Intuition: no column of B can be too sparse: it can't have K or more zeros (i.e., each column needs at least P − K + 1 nonzeros) - since A has to be recoverable from any K rows
This argument yields a looser converse. Any Short-Dot code satisfies:
$s \;\ge\; \dfrac{N}{P}\,(P - K + 1)$
Tighten by rank arguments (messy; happy to discuss offline). The counting step behind the looser bound is sketched below.
53
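Filling in the counting step for the looser bound (an assumed reading of how the slide's argument goes): if some column of $B$ had $K$ or more zeros, an adversarial choice of stragglers could leave $K$ surviving rows that are all zero in that column, and a nonzero entry of $A$ in that column (present for generic $A$) could not be recovered. Hence every column has at least $P-K+1$ nonzeros, and
\[
s \;\ge\; \max_i \operatorname{nnz}(B_{i,:}) \;\ge\; \frac{\operatorname{nnz}(B)}{P} \;\ge\; \frac{N\,(P-K+1)}{P}.
\]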