VLSI DSP 2010 Y.T. Hwang 6-1
Chapter 6 Folding
Introduction (1)
Folding: a DSP architecture in which multiple operations are time-multiplexed onto a single function unit
Trading area for time in a DSP architecture
Reduce the number of function units by a factor of N at the expense of increasing the computing time by a factor of N
N: folding factor
Present a systematic way to derive the folded DSP architecture
Introduction (2)
Folding example: y(n) = a(n) + b(n) + c(n)
The two additions are time-multiplexed onto a single pipelined adder
Each input sample must remain valid for 2 clock cycles
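The folding example above can be sketched as a small cycle-by-cycle simulation. This is only an illustration under assumptions not stated on the slide: the adder has a single pipeline register, a(n) + b(n) is computed in cycle 2n, and the latched result is fed back and added to c(n) in cycle 2n+1. The function name `folded_three_input_add` is hypothetical.

```python
def folded_three_input_add(a, b, c):
    """Simulate y(n) = a(n) + b(n) + c(n) on one adder, 2 cycles per sample."""
    pipe_reg = 0   # pipeline register at the adder output (assumed: 1 stage)
    outputs = []
    for cycle in range(2 * len(a) + 1):
        n, phase = divmod(cycle, 2)
        if phase == 0 and cycle > 0:
            outputs.append(pipe_reg)          # y(n-1) was latched in the previous cycle
        if n < len(a):
            if phase == 0:
                pipe_reg = a[n] + b[n]        # cycle 2n: first addition
            else:
                pipe_reg = pipe_reg + c[n]    # cycle 2n+1: feed the result back, add c(n)
    return outputs

y = folded_three_input_add([1, 2], [10, 20], [100, 200])
print(y)  # [111, 222]
```

One adder is used instead of two, and each output takes two clock cycles, which is the area-for-time trade with folding factor N = 2.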
Introduction (3)
More on folding
May lead to an architecture using a large number of registers
The original DFG and the N-unfolded version of the folded DFG (synthesized with folding factor N) are retimed and/or pipelined versions of each other
An arbitrary DFG can be unfolded by a factor N and then folded again to generate a family of architectures
Register minimization in folding (1)
Lifetime analysis: computes the minimum number of registers required to implement a DSP algorithm in hardware
A data sample (variable) is live from the time it is produced (excluded) through the time it is consumed (included)
A variable whose lifetime has ended is called “dead”
The maximum number of live variables at each time unit is the minimum number of registers required to implement the DSP program
Register minimization in folding (2)
Example: assume 3 variables a, b, c
Lifetime of variable a: {1,2,3,4}
Lifetime of variable b: {2,3,4,5,6,7}
Lifetime of variable c: {5,6,7}
Number of live variables at time units 1 through 7: {1,2,2,2,2,2,2}
The maximum is 2, so 2 registers are needed to implement the DSP program
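The count in this example can be reproduced with a few lines of Python. This is a minimal sketch: the lifetimes are entered directly as the sets of time units given above, and the helper name `live_counts` is hypothetical.

```python
def live_counts(lifetimes):
    """Return {time unit: number of live variables}, sorted by time unit."""
    counts = {}
    for live_set in lifetimes.values():
        for t in live_set:
            counts[t] = counts.get(t, 0) + 1
    return dict(sorted(counts.items()))

# Lifetimes of the three variables from the example
lifetimes = {
    "a": {1, 2, 3, 4},
    "b": {2, 3, 4, 5, 6, 7},
    "c": {5, 6, 7},
}

counts = live_counts(lifetimes)
min_registers = max(counts.values())   # peak number of simultaneously live variables
print(list(counts.values()))  # [1, 2, 2, 2, 2, 2, 2]
print(min_registers)          # 2
```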
Linear lifetime chart (1)
When the iteration period N is less than the span of the schedule, successive iterations of the schedule overlap
The number of live variables at time instance n is the sum of the numbers of live variables at cycles n-kN, ∀k
(Figure: lifetime charts, non-overlapped versus overlapped with schedule period 6)
Linear lifetime chart (2)
Matrix transpose example
Assume row-wise access
Input time: Tinput
Zero latency output time: Tzlout
Tdiff = Tzlout – Tinput
Required latency Tlat = magnitude of the most negative value of Tdiff
Toutput = Tzlout + Tlat
\[
\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix}
\rightarrow
\begin{bmatrix} a & d & g \\ b & e & h \\ c & f & i \end{bmatrix}
\]
Linear lifetime chart (3)
Matrix transpose example (cont.): assume the iteration period of the DSP program is N = 9
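The timing quantities for the 3×3 matrix transpose can be worked out with a short script. This is a sketch under two assumptions not stated on the slides: one sample is input per cycle starting at cycle 0, and the transposed matrix is also read out row-wise. Live variables are counted with the produced-excluded, consumed-included convention and folded modulo N = 9.

```python
N = 9                          # iteration period from the example
rows = ["abc", "def", "ghi"]   # input matrix, accessed row-wise

# Row-wise input order, and zero-latency output order (the transpose read row-wise)
input_order = [v for row in rows for v in row]
output_order = [rows[r][c] for c in range(3) for r in range(3)]

t_input = {v: t for t, v in enumerate(input_order)}
t_zlout = {v: t for t, v in enumerate(output_order)}
t_diff = {v: t_zlout[v] - t_input[v] for v in input_order}

# Required latency: magnitude of the most negative Tdiff
t_lat = max(0, -min(t_diff.values()))
t_output = {v: t_zlout[v] + t_lat for v in input_order}

# A variable is live in cycles Tinput+1 .. Toutput; overlapping iterations fold modulo N
live = [0] * N
for v in input_order:
    for t in range(t_input[v] + 1, t_output[v] + 1):
        live[t % N] += 1

min_registers = max(live)
print(t_lat)          # 4
print(min_registers)  # 4
```

Under these assumptions every time partition holds 4 live variables, so 4 registers are the minimum for this schedule.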
Circular lifetime chart
Point i represents time partition i, that is, all time instances {Nl+i} for integer l
(Figure: linear versus circular lifetime chart)
Data allocation (1)
Forward-backward register allocation
Goal: achieve the minimum number of registers
Determines how variables are assigned to registers in the “allocation table”
Step 1: Determine the minimum number of registers using lifetime analysis
Step 2: Input each variable at the time step corresponding to the beginning of its lifetime
If multiple variables are input in a given cycle, they are allocated to multiple registers in descending order of lifetime
Data allocation (2)
Forward allocation: if register i holds a variable in the current cycle, then register i+1 holds the same variable in the next cycle
If register i+1 is not available, the variable is allocated to the first available forward register
Step 3: Each variable is allocated in a forward manner until it is dead or reaches the last register
Step 4: In periodic scheduling, the allocation of the current iteration repeats itself in subsequent iterations
If Rj is occupied by a variable in cycle l, hash out the position of Rj at time unit l+N to mark it as unavailable
Data allocation (3)
Step 5: A variable that reaches the last register and is not yet dead is allocated in a backward manner
If multiple registers are available, choose the one with the fewest forward registers that are still sufficient to complete the allocation
After a variable has been allocated backward, allocate it in a forward manner until it is dead or again reaches the last register
Step 6: Repeat steps 4 and 5 as required until the allocation is complete
Data allocation (4)
3×3 matrix transpose example with N = 9
Allocation table after completion of steps 1 to 4
(Figure: allocation table with hashed-out positions)
Data allocation (5)
Another example
Linear lifetime chart
Allocation table after completion of steps 1 to 4
Data allocation (6)
Architecture design after register allocation
Data allocation (7)
Architecture design after register allocation
Register minimization in folding
Goal: synthesize control circuits in folded architectures with the minimum number of registers
Procedure
Perform retiming for folding
Write folding equations
Use the folding equations to construct a lifetime table
Draw the lifetime chart and determine the required number of registers
Perform forward-backward register allocation
Draw the folded architecture that uses the minimum number of registers
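For the "write folding equations" step, the form usually given in the folding literature is D_F(U→V) = N·w(e) − P_U + v − u. Here w(e) is the number of delays on edge U→V, P_U is the number of pipelining stages of the unit executing U, and u and v are the folding orders of U and V. That equation is not spelled out in this part of the slides, so the sketch below is an illustration only. The function name `folding_delay` is hypothetical, and the edge weight w = 1 is an assumption chosen to be consistent with the value DF(1→2) = 1 quoted in the biquad example.

```python
def folding_delay(N, w, P_u, u, v):
    """Folded delay count for edge U->V: N*w - P_u + v - u.

    N   : folding factor (iteration period)
    w   : number of delays on the edge in the (retimed) DFG
    P_u : pipelining stages of the function unit executing U
    u, v: folding orders of nodes U and V
    """
    return N * w - P_u + v - u

# Edge 1->2 of the biquad example: N = 4, adder with P_u = 1,
# node 1 has folding order 3, node 2 has folding order 1, w = 1 (assumed)
d_12 = folding_delay(N=4, w=1, P_u=1, u=3, v=1)
print(d_12)  # 1
```

A negative result would mean the edge cannot be realized as written, which is why retiming for folding comes first in the procedure.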
Biquad filter example (1)
Original biquad filter design
Design after retiming
Biquad filter example (2)
Design without register minimization: a total of 6 external and 3 internal pipelining registers
(Figure: folding equations and folded architecture)
Biquad filter example (3)
Construct a lifetime table: each node with lifetime (Tinput→Toutput) corresponds to an entry in the lifetime table
Tinput = u + PU, where u is the folding order and PU is the number of pipelining stages of the function unit
Toutput = u + PU + maxV{DF(U→V)}
For node 1, the folding order is 3 and the adder’s PU is 1 ⇒ Tinput = 3+1 = 4
Toutput = u + PU + maxV{DF(U→V)} = 3+1+max{1,0,2,3,5} = 9
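The lifetime-table entry for node 1 can be checked with a short helper. This is a minimal sketch of the two formulas above; the function name `lifetime_entry` is hypothetical, and the folded delays are the values listed for node 1's outgoing edges.

```python
def lifetime_entry(u, P_u, folded_delays):
    """Return (Tinput, Toutput) for a node.

    u             : folding order of the node
    P_u           : pipelining stages of its function unit
    folded_delays : DF(U->V) for every outgoing edge U->V
    """
    t_input = u + P_u
    t_output = t_input + max(folded_delays)
    return t_input, t_output

# Node 1: folding order 3, adder with 1 pipelining stage
t_in, t_out = lifetime_entry(u=3, P_u=1, folded_delays=[1, 0, 2, 3, 5])
print(t_in, t_out)  # 4 9
```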
Biquad filter example (4)
Construct a lifetime table and lifetime chart: assume N (iteration period) is 4
Minimum number of registers required is 2
Biquad filter example (5)
Allocation table: only variables n1, n7 and n8, which have non-zero duration, are shown
Variable n1 is output in cycles 4, 5, 6, 7 and 9 (Tinput = 4 plus folded delays 0, 1, 2, 3 and 5); only the latest cycle, 9, is shown in the table
Biquad filter example (6)
Folded design with 2 registers: edge (1→2) has DF(1→2) = 1 delay
After 1 delay the variable n1 is located in R1
An edge from R1 to the adder is switched at 4l+1 because node 2 has folding order 1
Biquad filter example (7)
Folded design with 2 registers (cont.): edge (1→7) has DF(1→7) = 3 delays
After 3 delays the variable n1 is located in R2
An edge from R2 to the multiplier is switched at 4l+3 because node 7 has folding order 3 (n1 is produced at cycle 4, so after 3 delays it is consumed at cycle 7 = 4·1+3)
IIR filter example (1)
IIR filter before retiming: y(n) = ay(n-3) + by(n-5) + x(n)
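As a reference for what the folded hardware must compute, the difference equation can be evaluated directly. This is a plain behavioral sketch, not a model of the folded architecture; the coefficient values, the zero initial conditions and the function name `iir` are assumptions made for illustration.

```python
def iir(x, a, b):
    """Evaluate y(n) = a*y(n-3) + b*y(n-5) + x(n) with zero initial conditions."""
    y = []
    for n, xn in enumerate(x):
        y3 = y[n - 3] if n >= 3 else 0.0   # y(n-3), zero before the start
        y5 = y[n - 5] if n >= 5 else 0.0   # y(n-5), zero before the start
        y.append(a * y3 + b * y5 + xn)
    return y

# Impulse response for hypothetical coefficients a = 0.5, b = 0.25
h = iir([1.0] + [0.0] * 8, a=0.5, b=0.25)
print(h)  # [1.0, 0.0, 0.0, 0.5, 0.0, 0.25, 0.25, 0.0, 0.25]
```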