ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)
VLSI Signal Processing, Lecture 2: Unfolding Transformation

Dec 14, 2015


Rafael Prigge
VLSI Signal Processing

Lecture 2: Unfolding Transformation

Multiple-Data Processing

• Create a program whose loop body covers more than one iteration, e.g. by J-fold loop unrolling
• Example: loop unrolling + software pipelining

[Figure: two schedules plotting operations 1 to 3 against clock cycles 1 to 8]

Basic Ideas

• Parallel processing
• Pipelined processing

[Figure: two time charts for processors P1 to P4. Parallel processing: the rows are a1 a2 a3 a4, b1 b2 b3 b4, c1 c2 c3 c4, d1 d2 d3 d4. Pipelined processing: the rows are a1 b1 c1 d1, a2 b2 c2 d2, a3 b3 c3 d3, a4 b4 c4 d4. The horizontal axis is time.]

Data Dependence

• Parallel processing requires NO data dependence between processors
• Pipelined processing will involve inter-processor communication

[Figure: time charts for processors P1 to P4 under parallel and pipelined processing]

Parallel Processing

• In a J-unfolded system, each delay is J-slow. That is, if the input to a delay element is x(kJ+m), then its output is x((k-1)J+m) = x(kJ+m-J)

Parallel Processing

• Block processing
– The number of inputs processed in a clock cycle is referred to as the block size
– At the k-th clock cycle, three inputs x(3k), x(3k+1), and x(3k+2) are processed simultaneously to generate y(3k), y(3k+1), and y(3k+2)

[Figure: a SISO system mapping x(n) to y(n), and its 3-parallel MIMO equivalent, in which a serial-to-parallel converter turns x(n) into x(3k), x(3k+1), x(3k+2) and a parallel-to-serial converter turns y(3k), y(3k+1), y(3k+2) back into y(n)]
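The block-processing structure described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the MIMO system used in the example (`double_block`, a memoryless y = 2x) is made up for the demonstration and does not come from the lecture.

```python
from typing import Callable, List, Sequence


def serial_to_parallel(x: Sequence[float], block_size: int) -> List[List[float]]:
    """Group x(n) into blocks [x(Lk), ..., x(Lk+L-1)], where L = block_size."""
    if block_size < 1 or len(x) % block_size != 0:
        raise ValueError("block_size must be >= 1 and must divide len(x)")
    return [list(x[k:k + block_size]) for k in range(0, len(x), block_size)]


def parallel_to_serial(blocks: Sequence[Sequence[float]]) -> List[float]:
    """Flatten blocks [y(Lk), ..., y(Lk+L-1)] back into the stream y(n)."""
    return [sample for block in blocks for sample in block]


def run_blockwise(
    x: Sequence[float],
    block_size: int,
    mimo: Callable[[List[float]], List[float]],
) -> List[float]:
    """Feed one block per clock cycle k through a MIMO system."""
    blocks = serial_to_parallel(x, block_size)
    return parallel_to_serial([mimo(block) for block in blocks])


def double_block(block: List[float]) -> List[float]:
    """Hypothetical memoryless MIMO system (y = 2x), used only as an example."""
    return [2 * value for value in block]
```

With a block size of 3, `run_blockwise([0, 1, 2, 3, 4, 5], 3, double_block)` processes the blocks `[0, 1, 2]` and `[3, 4, 5]` in two clock cycles and returns `[0, 2, 4, 6, 8, 10]`.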

I/O Conversion

• Serial-to-parallel converter
• Parallel-to-serial converter

[Figure: 3-parallel converters built from delay elements (D). The serial-to-parallel converter takes x(n) at sampling period T/3 and delivers x(3k), x(3k+1), x(3k+2); the parallel-to-serial converter takes y(3k), y(3k+1), y(3k+2) and delivers y(n) at sampling period T/3]


General approach for block processing

Mathematical Formulation

• e.g. y(n) = ay(n-9) + x(n)
• 2-parallel:
  y(2k) = ay(2k-9) + x(2k)
  y(2k+1) = ay(2k-8) + x(2k+1)
• In the 2-parallel SDFG, one active clock edge handles two samples:
  y(2k) = ay(2(k-5)+1) + x(2k)
  y(2k+1) = ay(2(k-4)+0) + x(2k+1)
• A dependency that spans fewer sample delays than the level of parallelism can be implemented with internal routing
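As a quick check of the rewriting above, the following Python sketch computes y(n) = ay(n-9) + x(n) once sample by sample and once two samples per clock tick, using the index mapping y(2k-9) = y(2(k-5)+1) and y(2k-8) = y(2(k-4)). Zero initial conditions and the function names are assumptions made for the illustration.

```python
from typing import List, Sequence


def serial_iir(x: Sequence[float], a: float) -> List[float]:
    """y(n) = a*y(n-9) + x(n), assuming y(n) = 0 for n < 0."""
    y: List[float] = []
    for n, xn in enumerate(x):
        y.append(a * (y[n - 9] if n >= 9 else 0.0) + xn)
    return y


def two_parallel_iir(x: Sequence[float], a: float) -> List[float]:
    """Same recurrence, computing y(2k) and y(2k+1) in clock tick k."""
    if len(x) % 2 != 0:
        raise ValueError("len(x) must be even")
    even: List[float] = []  # even[k] holds y(2k)
    odd: List[float] = []   # odd[k] holds y(2k+1)
    for k in range(len(x) // 2):
        # y(2k) = a*y(2(k-5)+1) + x(2k): odd-phase output from 5 ticks earlier
        y_even = a * (odd[k - 5] if k >= 5 else 0.0) + x[2 * k]
        # y(2k+1) = a*y(2(k-4)) + x(2k+1): even-phase output from 4 ticks earlier
        y_odd = a * (even[k - 4] if k >= 4 else 0.0) + x[2 * k + 1]
        even.append(y_even)
        odd.append(y_odd)
    # Interleave the two phases back into the serial order y(0), y(1), ...
    return [value for pair in zip(even, odd) for value in pair]
```

Both functions perform the same arithmetic on the same operands, so their outputs agree sample for sample. The two different look-back distances (5 ticks and 4 ticks) are what the single 9-sample delay turns into after 2-unfolding.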

Unfolding the DFG

[Figure: the original DFG, with clock period T = Ts, and its J-unfolded version, with T = J Ts]

Not trivial, even for a simple graph

Block Processing for FIR Filter

• One form of vectorized parallel processing of DSP algorithms (not parallel processing in the most general sense)
• Block vector: [x(3k) x(3k+1) x(3k+2)]
• Clock cycle: can be 3 times longer
• Original (FIR filter):
  y(n) = a x(n) + b x(n-1) + c x(n-2)
• Rewrite 3 equations at a time:
  y(3k)   = a x(3k)   + b x(3k-1) + c x(3k-2)
  y(3k+1) = a x(3k+1) + b x(3k)   + c x(3k-1)
  y(3k+2) = a x(3k+2) + b x(3k+1) + c x(3k)
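The three block equations can be checked against the original 3-tap filter with a short Python sketch. It assumes x(n) = 0 for n < 0; the function names are illustrative and not taken from the lecture.

```python
from typing import List, Sequence


def fir_serial(x: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """y(n) = a*x(n) + b*x(n-1) + c*x(n-2), assuming x(n) = 0 for n < 0."""
    def sample(n: int) -> float:
        return x[n] if n >= 0 else 0.0
    return [a * sample(n) + b * sample(n - 1) + c * sample(n - 2)
            for n in range(len(x))]


def fir_block3(x: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Same filter with block size 3: three outputs per clock cycle k."""
    if len(x) % 3 != 0:
        raise ValueError("len(x) must be a multiple of 3")
    y: List[float] = []
    x_m1 = 0.0  # x(3k-1), carried over from the previous block
    x_m2 = 0.0  # x(3k-2), carried over from the previous block
    for k in range(len(x) // 3):
        x0, x1, x2 = x[3 * k], x[3 * k + 1], x[3 * k + 2]
        y.append(a * x0 + b * x_m1 + c * x_m2)  # y(3k)
        y.append(a * x1 + b * x0 + c * x_m1)    # y(3k+1)
        y.append(a * x2 + b * x1 + c * x0)      # y(3k+2)
        x_m1, x_m2 = x2, x1  # become x(3(k+1)-1) and x(3(k+1)-2)
    return y
```

For example, with x = [1, 2, 3, 4, 5, 6] and (a, b, c) = (1, 2, 3), both functions return [1, 4, 10, 16, 22, 28]. Only two input samples have to be stored between clock cycles, which correspond to the two delays of the original filter.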


Block Processing

Block Processing for IIR Digital Filter

• Original formulation:
  y(n) = a y(n-2) + x(n)
• Rewrite:
  y(2k) = a y(2k-2) + x(2k)
  y(2k+1) = a y(2k-1) + x(2k+1)
• Vector formulation:
  y(k) = a y(k-1) + x(k), with the block vectors x(k) = [x(2k), x(2k+1)]^T and y(k) = [y(2k), y(2k+1)]^T

n: sample period
k: processor period
Tsample ≠ Tclk
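A minimal Python sketch of the vector formulation follows, assuming zero initial state. The names `iir_serial` and `iir_block2` are hypothetical and exist only for this illustration.

```python
from typing import List, Sequence


def iir_serial(x: Sequence[float], a: float) -> List[float]:
    """y(n) = a*y(n-2) + x(n), assuming y(n) = 0 for n < 0."""
    y: List[float] = []
    for n, xn in enumerate(x):
        y.append(a * (y[n - 2] if n >= 2 else 0.0) + xn)
    return y


def iir_block2(x: Sequence[float], a: float) -> List[float]:
    """Vector form y(k) = a*y(k-1) + x(k), with y(k) = [y(2k), y(2k+1)]."""
    if len(x) % 2 != 0:
        raise ValueError("len(x) must be even")
    previous = (0.0, 0.0)  # y(k-1) = [y(2k-2), y(2k-1)]
    out: List[float] = []
    for k in range(len(x) // 2):
        current = (a * previous[0] + x[2 * k],      # y(2k)
                   a * previous[1] + x[2 * k + 1])  # y(2k+1)
        out.extend(current)
        previous = current
    return out
```

With x = [1, 1, 1, 1, 1, 1] and a = 2, both versions give [1, 3, 7] on each phase, interleaved as [1, 1, 3, 3, 7, 7]. The block version updates its state once per processor period k, while the serial version updates once per sample period n.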

Block IIR Filter

[Figure: 2-parallel block IIR filter. A serial-to-parallel (S/P) converter splits x(n) into x(2k) and x(2k+1); two adders (+), each fed back through a delay element (D) with y(2(k-1)) and y(2(k-1)+1), produce y(2k) and y(2k+1); a parallel-to-serial (P/S) converter merges them into y(n)]

Clock period not equal to sampling period

Timing Comparison

• Pipelining
• Block processing

[Figure: timing charts over clock cycles 1 to 8 for inputs x(1) to x(8) and outputs y(1) to y(8), showing MAC, Mul and Add activity; in the block-processing chart the odd samples x(1), x(3), x(5), x(7) and the even samples x(2), x(4), x(6), x(8) occupy separate rows]

Definitions

• Unfolding is the process of transforming a loop so that several iterations are unrolled into the same iteration.
• Also known as (a.k.a.)
– Loop unrolling (in compilers for parallel programs)
– Block processing
• Applications
– Reducing the sampling period to achieve the iteration bound (desired throughput rate) T∞.
– Parallel (block) processing to execute several iterations concurrently.
– Digit-serial or bit-serial processing

Unfolding the DFG

• y(n) = ay(n-9) + x(n)
• Rewrite the algorithm formulation:
  y(2k) = ay(2k-9) + x(2k)
  y(2k+1) = ay(2k-8) + x(2k+1)
  that is,
  y(2k) = ay(2(k-5)+1) + x(2k)
  y(2k+1) = ay(2(k-4)) + x(2k+1)
• After J-fold unfolding, the clock period is T = J Ts, where Ts is the data sampling period.

Timing Diagram

T = Ts:   y(0) y(1) y(2) y(3) y(4) y(5) y(6) y(7) y(8) y(9) y(10) y(11) y(12) y(13)

T = 2Ts:  y(0) y(2) y(4) y(6) y(8) y(10) y(12)
          y(1) y(3) y(5) y(7) y(9) y(11) y(13)

[Figure annotations: dependence spans of 9T, 4T and 5T are marked on the two schedules]

• The timing diagram is obtained assuming that the sampling period Ts remains unchanged. Thus, the clock period T is increased J-fold.
• Since 9/2 is not an integer, the outputs (y(0), y(1)) will be needed by two different future iterations, 4T and 5T later.

Another DFG Unfolding Example

[Figure: original DFG with nodes Q, R, S, T, edge delays 2D and 3D, labelled T = 3; and the node set of its 2-unfolded DFG (J = 2): Q0, R0, S0, T0, Q1, R1, S1, T1]

i   w   (i+w)%J   floor((i+w)/J)
0   0   0         0
0   2   0         1
0   3   1         1
1   0   1         0
1   2   1         1
1   3   0         2

Step 1. Make J copies of each node.

Another DFG Unfolding Example

[Figure: the same original DFG (nodes Q, R, S, T, edge delays 2D and 3D, T = 3) and 2-unfolded node set (J = 2), now with the zero-delay edges drawn]

Step 2. Add all edges with 0 delay on them.

Another DFG Unfolding Example

[Figure: the original DFG (nodes Q, R, S, T, edge delays 2D and 3D, T = 3) and the completed 2-unfolded DFG (J = 2, T = 6), whose delayed edges carry D, D, D and 2D]

Step 3. Use the table of (i+w)%J and floor((i+w)/J) values to figure out the edges with delays.

Unfolding Transformation

• For each node U in the original DFG, draw J nodes U0, U1, …, UJ-1
• For each edge U → V with w delays in the original DFG, draw the J edges Ui → V(i+w)%J with floor[(i+w)/J] delays, for i = 0, 1, …, J-1
• Unfolding an edge with w delays in the original DFG produces J-w edges with no delays and w edges with 1 delay in the J-unfolded DFG, for w < J
• Unfolding preserves the precedence constraints of a DSP algorithm
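The two drawing rules translate almost directly into code. The Python sketch below is a minimal illustration: the edge-list representation, the function name `unfold`, and the example edge U → V are assumptions made here, not part of the lecture.

```python
from typing import List, Tuple

Edge = Tuple[str, str, int]  # (source node, destination node, delay count w)


def unfold(edges: List[Edge], J: int) -> List[Edge]:
    """Return the edges of the J-unfolded DFG.

    Node U is represented by the copies "U0" ... "U{J-1}". Each edge U -> V
    with w delays becomes J edges U_i -> V_{(i+w) % J}, each carrying
    floor((i+w)/J) delays, for i = 0, 1, ..., J-1.
    """
    if J < 1:
        raise ValueError("the unfolding factor J must be at least 1")
    unfolded: List[Edge] = []
    for u, v, w in edges:
        for i in range(J):
            unfolded.append((f"{u}{i}", f"{v}{(i + w) % J}", (i + w) // J))
    return unfolded
```

For a hypothetical edge U → V with 3 delays and J = 2, `unfold([("U", "V", 3)], 2)` returns `[("U0", "V1", 1), ("U1", "V0", 2)]`. For an edge with w = 1 and J = 3, the result has J-w = 2 edges without delay and w = 1 edge with one delay, as stated in the third bullet.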


Precedence Preservation

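Delay Preservation

• Unfolding preserves the number of delays in a DFG:

\[
\left\lfloor \frac{w}{J} \right\rfloor + \left\lfloor \frac{w+1}{J} \right\rfloor + \cdots + \left\lfloor \frac{w+J-1}{J} \right\rfloor = w
\]

• Let $w = mJ + n$, where $m, n \in \mathbb{N}$ and $0 \le n \le J-1$. Then

\[
\begin{aligned}
\sum_{i=0}^{J-1} \left\lfloor \frac{w+i}{J} \right\rfloor
&= \left( \left\lfloor \frac{w}{J} \right\rfloor + \cdots + \left\lfloor \frac{w+J-n-1}{J} \right\rfloor \right)
 + \left( \left\lfloor \frac{w+J-n}{J} \right\rfloor + \cdots + \left\lfloor \frac{w+J-1}{J} \right\rfloor \right) \\
&= \left( \left\lfloor \frac{mJ+n}{J} \right\rfloor + \cdots + \left\lfloor \frac{mJ+J-1}{J} \right\rfloor \right)
 + \left( \left\lfloor \frac{mJ+J}{J} \right\rfloor + \cdots + \left\lfloor \frac{mJ+J+n-1}{J} \right\rfloor \right) \\
&= (J-n)\,m + n\,(m+1) = mJ + n = w
\end{aligned}
\]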
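The identity can also be checked numerically. The Python sketch below is a brute-force check over small values, not a proof; the function names are illustrative.

```python
from typing import List


def unfolded_edge_delays(w: int, J: int) -> List[int]:
    """Delays floor((i+w)/J), i = 0..J-1, on the J copies of an edge with w delays."""
    return [(i + w) // J for i in range(J)]


def delays_preserved(max_w: int, max_J: int) -> bool:
    """Check the delay-preservation identity for 0 <= w <= max_w, 1 <= J <= max_J."""
    for J in range(1, max_J + 1):
        for w in range(max_w + 1):
            m, n = divmod(w, J)  # w = m*J + n with 0 <= n <= J-1
            delays = unfolded_edge_delays(w, J)
            # J-n copies carry m delays and n copies carry m+1 delays.
            if delays.count(m) != J - n or delays.count(m + 1) != n:
                return False
            if sum(delays) != w:
                return False
    return True
```

For instance, `unfolded_edge_delays(3, 2)` gives `[1, 2]`, which sums to the original 3 delays.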

Example

• Unfold the following DFG using unfolding factors 2 and 5

[Figure: original DFG with nodes A, B, C, D, E and edge delays D, 2D, 3D and 7D; its 2-unfolded DFG with nodes A0 to E0 and A1 to E1 and edge delays D, D, D, D, 2D, 3D and 4D; and its 5-unfolded DFG with nodes A0 to E0 through A4 to E4, nine edges with delay D and two edges with delay 2D]

Properties of Unfolding

• Unfolding preserves the number of registers (delays) in a DFG
• A loop with w delays in a DFG that is unfolded J times leads to gcd(w, J) loops in the unfolded DFG, each containing w/gcd(w, J) delays and J/gcd(w, J) copies of each node that appears in the original loop.
• Unfolding a DFG with iteration bound T∞ results in a J-unfolded DFG with iteration bound J T∞.
• A path with w (< J) delays in a DFG leads to J-w paths with no delays and w paths with 1 delay each in the J-unfolded DFG.
• Any clock period that can be achieved by retiming a J-unfolded DFG can also be achieved by retiming the original DFG and then J-unfolding it.
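The loop property can be illustrated by tracing the unfolded copies of a single node. The Python sketch below assumes the loop is collapsed into one path A ~> A with w ≥ 1 delays, so that copy A_i leads to copy A_{(i+w) % J} through floor((i+w)/J) delays; the function name is hypothetical.

```python
from math import gcd
from typing import List, Tuple


def unfolded_loops(w: int, J: int) -> List[Tuple[int, int]]:
    """Trace the loops that a loop A ~> A with w delays produces after J-unfolding.

    Returns one (copies_of_A, delays) pair per loop in the unfolded DFG.
    """
    if w < 1 or J < 1:
        raise ValueError("w and J must both be at least 1")
    visited = set()
    loops: List[Tuple[int, int]] = []
    for start in range(J):
        if start in visited:
            continue
        i, copies, delays = start, 0, 0
        while True:
            visited.add(i)
            copies += 1
            delays += (i + w) // J  # delays on the path A_i ~> A_{(i+w) % J}
            i = (i + w) % J
            if i == start:
                break
        loops.append((copies, delays))
    return loops
```

With w = 6 and J = 4, gcd(w, J) = 2 and the function returns `[(2, 3), (2, 3)]`: two loops, each with J/gcd = 2 copies of A and w/gcd = 3 delays. With w = 9 and J = 2 it returns a single loop `[(2, 9)]`.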

When a Loop is Unfolded

• Consider a loop ℓ with w delays in a DFG
• Traversing the loop A ~> A p times also gives a loop, with pw delays
• In the J-unfolded DFG, consider the path Ai ~> A(i+pw)%J. It is a loop if i = (i+pw)%J, which implies that J | pw
• The smallest such p is J/gcd(J, w). That is, in the J-unfolded DFG, one can travel the loop A ~> A J/gcd(J, w) times.
• Recall that there are J copies of node A in total. Hence, there are J/(J/gcd(J, w)) = gcd(J, w) loops, and each loop contains w/gcd(J, w) delays.
• The iteration bound of the J-unfolded DFG is then

\[
T'_\infty = \max_{\ell} \left\{ \frac{\dfrac{J}{\gcd(w_\ell, J)}\, t_\ell}{\dfrac{w_\ell}{\gcd(w_\ell, J)}} \right\}
= J \max_{\ell} \left\{ \frac{t_\ell}{w_\ell} \right\} = J\,T_\infty
\]
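A small numeric check of this result is sketched below in Python. Each loop is described by a pair (t, w), its total computation time and delay count; the loop values used in the example are invented for illustration and do not come from any DFG in the lecture.

```python
from fractions import Fraction
from math import gcd
from typing import List, Tuple

Loop = Tuple[int, int]  # (computation time t, number of delays w) of one loop


def iteration_bound(loops: List[Loop]) -> Fraction:
    """T_inf = max over loops of t / w."""
    return max(Fraction(t, w) for t, w in loops)


def unfolded_iteration_bound(loops: List[Loop], J: int) -> Fraction:
    """Iteration bound of the J-unfolded DFG, computed loop by loop.

    Each unfolded loop holds J/gcd(w, J) copies of the original loop's nodes
    and w/gcd(w, J) delays.
    """
    bounds = []
    for t, w in loops:
        g = gcd(w, J)
        bounds.append(Fraction(t) * (J // g) / (w // g))
    return max(bounds)
```

With the hypothetical loops [(4, 2), (5, 3)], the original bound is 2 and the 3-unfolded bound is 6, in line with the factor J in the formula.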

When a Path is Unfolded

• If w < J, then a path containing w delays in a DFG leads to (J-w) paths with no delays and w paths with 1 delay in the J-unfolded DFG.
• If w ≥ J, then the path leads to J paths with one or more delays in the J-unfolded DFG. This implies that these paths are not critical.
• Assume that the critical path of the J-unfolded DFG is c. If D(U,V) ≥ c, then Wr(U,V) = W(U,V) + r(V) - r(U) ≥ J
• Any feasible clock cycle period that can be obtained by retiming the J-unfolded DFG can also be achieved by retiming the original DFG directly and then J-unfolding it.

When a Path is Unfolded

• Suppose r' is a legal retiming for the J-unfolded DFG, GJ, which leads to critical path c.
• Let r(U) = Σi r'(Ui), 0 ≤ i ≤ J-1.
– r is a feasible retiming for the original DFG, G.
– The retiming leads to a critical path c

Consider an edge U → V with w delays in G. Since r' is a legal retiming for GJ and leads to a critical path c, then

\[
\begin{aligned}
(1)\quad & r'(U_i) - r'(V_{(i+w)\%J}) \le \left\lfloor \frac{i+w}{J} \right\rfloor && \text{(feasible constraint)} \\
(2)\quad & r'(U_i) - r'(V_{(i+w)\%J}) \le \left\lfloor \frac{i+w}{J} \right\rfloor - 1, \ \text{if } D(U,V) \ge c && \text{(critical path constraint)}
\end{aligned}
\]

Summing over 0 ≤ i ≤ J-1 gives

\[
\begin{aligned}
(1)\quad & r(U) - r(V) \le w \\
(2)\quad & r(U) - r(V) \le W(U,V) - J
\end{aligned}
\]

Sample Period Reduction

• Case 1: A node in the DFG has a computation time greater than T∞
• Case 2: The iteration bound is not an integer
• Case 3: The longest node computation time is larger than the iteration bound T∞, and T∞ is not an integer

Case 1

• The critical path dominates, since a node computation time is more than the iteration bound
• Retiming cannot be used to reduce the sample period

Sample Period Reduction

• Rule of thumb: ⌈tU/T∞⌉-unfolding should be used, where tU is the computation time of the node that exceeds T∞

T∞ = 6, Tcritical = 6
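As a small worked instance of the Case 1 rule of thumb, the Python sketch below computes the unfolding factor J = ⌈tU/T∞⌉. The function name and the numbers in the example (a node time of 4 against an iteration bound of 3) are assumptions chosen for illustration; the original DFG for this slide is not recoverable from the transcript.

```python
from fractions import Fraction
from math import ceil


def case1_unfolding_factor(t_node, t_inf) -> int:
    """Unfolding factor J = ceil(t_U / T_inf) for a node with t_U > T_inf.

    Exact rational arithmetic avoids rounding surprises when the
    iteration bound is itself a fraction.
    """
    ratio = Fraction(t_node) / Fraction(t_inf)
    if ratio <= 1:
        raise ValueError("the rule applies only when t_U exceeds T_inf")
    return ceil(ratio)
```

With the assumed values, `case1_unfolding_factor(4, 3)` returns 2. The 2-unfolded DFG would then have iteration bound 2 · 3 = 6, which the longest node time of 4 no longer exceeds.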

Case 2

• The iteration period cannot achieve the iteration bound

Sample Period Reduction

Case 3

Parallel Processing

• Parallel processing can be performed by unfolding

Bit-Level Parallel Processing

Bit-Serial Adder

Unfolding of Switches

Example

Switches with Delays

If Wordlength is not a Multiple of J