1 Processing Elements Design Shao-Yi Chien
101

Processing Elements Design

Apr 09, 2022

Page 1: Processing Elements Design

1

Processing

Elements Design

Shao-Yi Chien

Page 2: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 2

Introduction

◼ Implementation of basic arithmetic operations

◼ Number systems

Conventional number systems

Redundant number systems

Residue number systems

◼ Arithmetic

Bit-parallel arithmetic

Bit-serial arithmetic

Serial-parallel arithmetic

Division

Distributed arithmetic

CORDIC

Page 3: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 3

Conventional Number Systems

◼ Conventional number systems are nonredundant, weighted, positional number systems

◼ Fixed-point: the position of the binary point is fixed

◼ Floating point: signed mantissa and signed exponent

Nonredundant: one number has only one representation

W_d: word length

w_i: weights → weighted

w_i depends only on the position of the digit → positional

For fixed-radix systems, w_i = r^i

Page 4: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 4

Signed-Magnitude Representation

◼ Range

[-1+Q, 1-Q]

Q=(0.00..01)

◼ Complex for addition and subtraction

◼ Easy for multiplication and division

Page 5: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 5

One’s Complement

◼ Range

[-1+Q, 1-Q]

◼ Changing sign is easy

◼ Addition, subtraction, and multiplication are complex

Page 6: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 6

Two’s Complement

◼ Range

[-1, 1-Q]

◼ The most widely used representation

Page 7: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 7

Binary Offset Representation

◼ Range [-1,1-Q]

◼ The sequence of digits is equal to the two’s complement representation, except for the sign bit

Page 8: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 8

Redundant Number Systems (1/2)

◼ Redundant: one number has more than one representation

◼ Advantages

Simplify and speed up certain arithmetic operations

Addition and subtraction can be performed without carry (borrow) propagation paths

◼ Disadvantages

Increase the complexity of other operations, such as zero detection, sign detection, and sign conversion

Page 9: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 9

Redundant Number Systems

(2/2)

◼ Signed-digit code

◼ Canonic signed digit code

◼ On-line arithmetic

Page 10: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 10

Signed-Digit Code (1/4)

◼ Range: [-2+Q, 2-Q]

◼ Redundant

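$(15/32)_{10} = (0.01111)_{2C} = (0.1000\bar{1})_{SDC} = (0.01111)_{SDC}$

$(-15/32)_{10} = (1.10001)_{2C} = (0.\bar{1}0001)_{SDC} = (0.0\bar{1}\bar{1}\bar{1}\bar{1})_{SDC}$

(here $\bar{1}$ denotes the digit -1)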

Page 11: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 11

Signed-Digit Code (2/4)

◼ SDC number is not unique

◼ The following operations are difficult in SDC

Quantization

Comparison

Overflow check

Convert to a conventional number system for these operations

Page 12: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 12

Signed-Digit Code (3/4)

◼ Example of addition

$(1\bar{1}1\bar{1})_{SDC} = (5)_{10}$

$(0\bar{1}11)_{SDC} = (-1)_{10}$

◼ Rules for adding SDC numbers

◼ $s_i = z_i + c_{i+1}$

x_i y_i (or y_i x_i) | x_{i+1}, y_{i+1}    | c_i | z_i

 0  0                | --                  |  0  |  0

 0  1                | Neither is -1       |  1  | -1

 0  1                | At least one is -1  |  0  |  1

 0 -1                | Neither is -1       |  0  | -1

 0 -1                | At least one is -1  | -1  |  1

 1 -1                | --                  |  0  |  0

 1  1                | --                  |  1  |  0

-1 -1                | --                  | -1  |  0

Page 13: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 13

Signed-Digit Code (4/4)

◼ (0100)SDC=(4)10
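The addition rules can be checked in software. The following is a minimal Python sketch, assuming digits are stored most-significant first as the integers -1, 0, 1 and that index i+1 is the next less significant position (as in the rule table). The function names `sdc_value` and `sdc_add` are illustrative only and do not come from the slides.

```python
def sdc_value(digits):
    """Integer value of a signed-digit number, most significant digit first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def sdc_add(x, y):
    """Carry-free addition of two equal-length SDC numbers (MSB first).

    Returns len(x) + 1 digits; the extra leading digit is the carry out.
    """
    n = len(x)
    z = [0] * n        # interim sum digits z_i
    c = [0] * (n + 1)  # c[i]: carry generated at position i; c[n] = 0 feeds the LSB
    for i in range(n):
        pair = tuple(sorted((x[i], y[i])))
        # Does the next less significant position contain a -1 digit?
        below_has_neg = i + 1 < n and -1 in (x[i + 1], y[i + 1])
        if pair == (0, 1):
            c[i], z[i] = (0, 1) if below_has_neg else (1, -1)
        elif pair == (-1, 0):
            c[i], z[i] = (-1, 1) if below_has_neg else (0, -1)
        elif pair == (1, 1):
            c[i], z[i] = 1, 0
        elif pair == (-1, -1):
            c[i], z[i] = -1, 0
        else:  # (0, 0) or (1, -1)
            c[i], z[i] = 0, 0
    # s_i = z_i + c_{i+1}: each sum digit depends on two positions only
    return [c[0]] + [z[i] + c[i + 1] for i in range(n)]


print(sdc_add([1, -1, 1, -1], [0, -1, 1, 1]))  # [0, 0, 1, 0, 0], i.e. 5 + (-1) = 4
```

Each sum digit looks only at positions i and i+1, which is why no carry chain forms.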

Page 14: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 14

Canonic Signed Digit Code (1/3)

◼ Range: [-4/3+Q, 4/3-Q]

◼ CSDC is a special case of SDC having a

minimum number of nonzero digits

Page 15: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 15

Canonic Signed Digit Code (2/3)

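◼ Conversion of two's-complement to CSDC numbers

$(0.011111)_{2C} = (0.10000\bar{1})_{CSDC}$

Convert in an iterative manner

Step 1: $011\ldots1 \rightarrow 100\ldots\bar{1}$

Step 2: $(\bar{1},1) \rightarrow (0,\bar{1})$, $(0,1,1) \rightarrow (1,0,\bar{1})$

Ex: $(0.110101101101)_{2C} = (1.00\bar{1}0\bar{1}00\bar{1}0\bar{1}01)_{CSDC}$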
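The slide's example can be reproduced with a short Python sketch. Note that this uses the common right-to-left (LSB-first) recoding on integers rather than the string-replacement steps listed above; both yield the canonic form. The function name `to_csd` is hypothetical, and the fractional example is scaled by 2^12 so that it becomes the integer 3437.

```python
def to_csd(n):
    """Canonic signed-digit recoding of a non-negative integer, MSB first."""
    digits = []
    while n != 0:
        if n & 1:
            # n mod 4 == 1 -> digit +1; n mod 4 == 3 -> digit -1 (starts a run of 1s)
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits[::-1]


# (0.110101101101)_2C scaled by 2**12 is 0b110101101101 = 3437
print(to_csd(0b110101101101))
# [1, 0, 0, -1, 0, -1, 0, 0, -1, 0, -1, 0, 1]
```

The leading digit of the result has weight 2^12, which corresponds to the digit in front of the binary point in the slide's answer.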

Page 16: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 16

Canonic Signed Digit Code (3/3)

◼ Conversion of SDC to two's-complement numbers

Separate the SDC number into two parts

◼ One part holds the digits that are either 0 or 1

◼ The other part holds the -1 digits

Subtract the second part from the first
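As a small illustration of this conversion, the Python sketch below splits a signed-digit word into its positive and negative parts and subtracts them. It assumes integer-weighted digits stored MSB first; the name `sdc_to_int` is illustrative only.

```python
def sdc_to_int(digits):
    """Convert MSB-first signed digits (-1, 0, 1) to an ordinary integer."""
    pos = 0  # binary word holding the +1 digits
    neg = 0  # binary word holding the -1 digits (as 1 bits)
    for d in digits:
        pos = (pos << 1) | (1 if d == 1 else 0)
        neg = (neg << 1) | (1 if d == -1 else 0)
    return pos - neg  # one conventional subtraction


print(sdc_to_int([1, -1, 1, -1]))  # 0b1010 - 0b0101 = 5
print(sdc_to_int([0, -1, 1, 1]))   # 0b0011 - 0b0100 = -1
```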

Page 17: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 17

On-Line Arithmetic

◼ Number systems with the property that the i-th digit of the result can be computed using only the first (i+d) digits of the operands, where d is a small positive constant

◼ Favorable in recursive algorithm using numbers with very long word lengths

◼ SDC can be used for on-line addition and subtraction, d=1

Page 18: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 18

Residue Number Systems (1/2)

◼ For a given number x and moduli set {m_i}, i = 1, 2, …, p

x = q_i m_i + r_i

RNS representation: x = (r_1, r_2, …, r_p)

◼ Advantages

The arithmetic operations (+, -, *) can be performed for each residue independently

◼ Disadvantages

Hard for comparison, overflow detection, and quantization

Not easy to convert to other number systems

Page 19: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 19

Residue Number Systems (2/2)

◼ Example

Moduli set = {5, 3, 2}

Number range = 5*3*2 = 30

9 + 19 = (4,0,1)_RNS + (4,1,1)_RNS

= ((4+4) mod 5, (0+1) mod 3, (1+1) mod 2)_RNS = (3,1,0)_RNS = 28

8 * 3 = (3,2,0)_RNS * (3,0,1)_RNS

= ((3*3) mod 5, (2*0) mod 3, (0*1) mod 2)_RNS = (4,0,0)_RNS = 24
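The example can be reproduced with a few lines of Python. This is a sketch assuming pairwise coprime moduli; the conversion back to an integer uses the Chinese remainder theorem, which the slides do not spell out, and all function names are hypothetical. It needs Python 3.8 or later for `math.prod` and the modular inverse form of `pow`.

```python
from math import prod

MODULI = (5, 3, 2)


def to_rns(x, moduli=MODULI):
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    # Each residue channel is processed independently of the others
    return tuple((ra + rb) % m for ra, rb, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    return tuple((ra * rb) % m for ra, rb, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Chinese remainder theorem reconstruction (moduli pairwise coprime)."""
    big_m = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        mi = big_m // m
        x += r * mi * pow(mi, -1, m)  # pow(mi, -1, m): inverse of mi modulo m
    return x % big_m


print(rns_add(to_rns(9), to_rns(19)), from_rns(rns_add(to_rns(9), to_rns(19))))
# (3, 1, 0) 28
print(rns_mul(to_rns(8), to_rns(3)), from_rns(rns_mul(to_rns(8), to_rns(3))))
# (4, 0, 0) 24
```

The reconstruction step is the expensive part, which matches the listed disadvantage that conversion to other number systems is not easy.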

Page 20: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 20

Bit-Parallel Arithmetic (1/2)

◼ Addition and subtraction

Ripple carry adder (RCA) (carry propagation adder, CPA)

Carry-look-ahead adder (CLA)

Carry-save adder

Carry-select adder (CSA)

Carry-skip adder

Conditional-sum adder

Page 21: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 21

Bit-Parallel Arithmetic (2/2)

◼ Multiplication

Shift-and-add multiplication

Booth’s algorithm

Tree-based multipliers

Array multipliers

Look-up table techniques

Page 22: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 22

Ripple Carry Adder (RCA) (1/2)

◼ Also called carry propagation adder (CPA)

Full adder

Page 23: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 23

Ripple Carry Adder (RCA) (2/2)

◼ The speed of the RCA is determined by

the carry propagation time

Ripple-carry adder Ripple-carry adder/subtractor

Page 24: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 24

Carry-Look-Ahead Adder (CLA)

◼ Generate the carry with separate circuits

◼ Ci=Gi+Pi.Ci-1

◼ Gi=Ai.Bi

◼ Pi=Ai+Bi

*Different digit notation in this slide
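To make the recurrence C_i = G_i + P_i·C_{i-1} concrete, the Python sketch below expands it so that every carry is computed directly from the generate and propagate signals and the input carry, instead of waiting for the previous carry. The function name `cla_add` and the default 4-bit width are assumptions for illustration.

```python
def cla_add(a, b, width=4, c_in=0):
    """Add two unsigned `width`-bit numbers using expanded look-ahead carries.

    Returns (sum modulo 2**width, carry out).
    """
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    g = [x & y for x, y in zip(a_bits, b_bits)]  # G_i = A_i . B_i
    p = [x | y for x, y in zip(a_bits, b_bits)]  # P_i = A_i + B_i

    carries = []
    for i in range(width):
        # C_i = G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + ... + P_i ... P_0 c_in
        c = g[i]
        chain = p[i]
        for j in range(i - 1, -1, -1):
            c |= chain & g[j]
            chain &= p[j]
        c |= chain & c_in
        carries.append(c)

    total = 0
    for i in range(width):
        carry_in = c_in if i == 0 else carries[i - 1]
        total |= (a_bits[i] ^ b_bits[i] ^ carry_in) << i
    return total, carries[-1]


print(cla_add(0b1011, 0b0110))  # (1, 1): 11 + 6 = 17 = 16 + 1
```

In hardware each expanded carry is a two-level AND-OR expression, which is where the speed-up over a ripple-carry adder comes from; the inner loop here only enumerates the product terms.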

Page 25: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 25

Carry-Save Adder

◼ Used when adding three or more operands

◼ Reduce the number of operands by one for each stage

[Figure: four full adders (FA); bit position i (i = 3, 2, 1, 0) takes inputs x_i, y_i, cin_i and produces sum s_i and carry c_i]

*Different digit notation in this slide
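A bit-level view of the carry-save idea can be sketched in Python as follows. It assumes non-negative integers as operands, and the names `carry_save_step` and `multi_operand_add` are illustrative; the final two-operand sum stands in for a carry-propagation adder.

```python
def carry_save_step(x, y, z):
    """Reduce three operands to two (sum word, carry word) without carry propagation."""
    s = x ^ y ^ z                            # per-bit full-adder sum outputs
    c = ((x & y) | (x & z) | (y & z)) << 1   # per-bit carry outputs, weighted by 2
    return s, c


def multi_operand_add(operands):
    """Add several non-negative integers with carry-save stages plus one final adder."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = carry_save_step(ops.pop(), ops.pop(), ops.pop())
        ops += [s, c]      # each stage reduces the operand count by one
    return sum(ops)        # final carry-propagation addition


print(multi_operand_add([5, 9, 14, 3, 7]))  # 38
```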

Page 26: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 26

Carry-Select Adder (CSA)

*Different digit notation in this slide

Page 27: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 27

Carry-Skip Adder

*Different digit notation in this slide

Page 28: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 28

Conditional-Sum Adder

*Different digit notation in this slide

Carry-in = 0: $S_0 = A \oplus B$, $C_0 = A \cdot B$

Carry-in = 1: $S_1 = \overline{(A \oplus B)}$, $C_1 = A + B$

Page 29: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 29

Multiplication

◼ Bit-parallel multiplication

Page 30: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 30

Shift-and-Add Multiplication (1/2)

Page 31: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 31

Shift-and-Add Multiplication (2/2)

◼ The number of operations can be reduced with CSDC

◼ Can be used to design fixed-operand multipliers

Page 32: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 32

Booth's Algorithm (1/3)

◼ Used in modern general-purpose processors, such as MIPS R4000

Page 33: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 33

Booth’s Algorithm (2/3)

x_{2i-2}  x_{2i-1}  x_{2i} | x'_{2i-1} | Operation | Comments

0 0 0 |  0 | +0  | String of zeros

0 0 1 |  1 | +y  | Beginning of 1's

0 1 0 |  1 | +y  | A single 1

0 1 1 |  2 | +2y | Beginning of 1's

1 0 0 | -2 | -2y | End of 1's

1 0 1 | -1 | -y  | A single 0 (beginning/end of 1's)

1 1 0 | -1 | -y  | End of 1's

1 1 1 |  0 | -0  | String of 1's

Page 34: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 34

Booth’s Algorithm (3/3)

Z=X Y

Encoder

X’

Xi+1 Xi Xi-1

0 0 0 0

0 0 1 +Y (beginning of string)

0 1 0 +Y (isolated)

0 1 1 +2Y (beginning of string)

1 0 0 -2Y (end of string)

1 0 1 -Y (beginning / end of string)

1 1 0 -Y (end of string)

1 1 1 0
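The encoding table can be exercised with a small Python sketch of radix-4 (modified) Booth multiplication. It assumes two's-complement integers of an even bit width and scans the multiplier from the least significant end with an implicit 0 below bit 0; each bit triple (X_{i+1}, X_i, X_{i-1}) maps to the digit -2·X_{i+1} + X_i + X_{i-1}, as in the table. The function names are hypothetical.

```python
def booth_radix4_digits(x, width):
    """Recode a two's-complement multiplier into digits in {-2,...,2}, LS digit first."""
    assert width % 2 == 0
    ux = x & ((1 << width) - 1)   # two's-complement bit pattern of x
    digits = []
    prev = 0                      # implicit bit to the right of bit 0
    for i in range(0, width, 2):
        b0 = (ux >> i) & 1
        b1 = (ux >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(x, y, width):
    """Multiply using one partial product (0, +-y, +-2y) per pair of multiplier bits."""
    return sum(d * y * 4**k for k, d in enumerate(booth_radix4_digits(x, width)))


print(booth_radix4_digits(-3, 4))  # [1, -1]: 1 + (-1) * 4 = -3
print(booth_multiply(-3, 7, 4))    # -21
```

The number of partial products is half the multiplier width, which is the main benefit over plain shift-and-add.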

Page 35: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 35

Tree-Based Multipliers (Wallace Tree Multipliers)

Page 36: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 36

Array Multipliers (1/3)

◼ Baugh-Wooley multiplier

Page 37: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 37

Array Multipliers (2/3)

◼ Partial products

Page 38: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 38

Array Multipliers (3/3)

Page 39: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 39

Look-Up Table Techniques

◼ A multiplier A×B can be implemented with a large table with $2^{W_A + W_B}$ words

◼ Simplified method

$$xy = \frac{(x+y)^2}{4} - \frac{(x-y)^2}{4}$$

Can be implemented with one addition, two subtractions, and two table look-up operations
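The simplified method can be sketched in Python with a single table of quarter squares. This assumes unsigned integer operands; storing floor(n²/4) is still exact because x+y and x−y always have the same parity, so the truncated fractions cancel. The names `make_quarter_square_table` and `lut_multiply` are illustrative.

```python
def make_quarter_square_table(bits):
    """Table of floor(n*n/4) for every possible value of x + y."""
    max_sum = 2 * ((1 << bits) - 1)
    return [(n * n) // 4 for n in range(max_sum + 1)]


def lut_multiply(x, y, table):
    """x*y = table[x+y] - table[|x-y|]: one addition, two subtractions, two look-ups."""
    return table[x + y] - table[abs(x - y)]


table = make_quarter_square_table(4)   # 31 words instead of 2**8 = 256
print(lut_multiply(13, 11, table))     # 143
```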

Page 40: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 40

Bit-Serial Arithmetic

◼ Advantages

Significantly reduce chip area

◼ Eliminate wide bus

◼ Small processing elements

Higher clock frequency

Often superior to bit-parallel

◼ Disadvantages

Requires S/P and P/S (serial-to-parallel and parallel-to-serial) interfaces

Complicated clocking scheme

Page 41: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 41

Bit-Serial Addition and

Subtraction

Addition Subtraction

Page 42: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 42

Serial/Parallel Multiplier

◼ Use carry-save adders

◼ Need Wd+Wc-1 cycles to compute the

result

Page 43: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 43

Modified Serial/Parallel

Multiplier

Can be

implemented

with a half adder

Page 44: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 44

Transpose Serial/Parallel

Multiplier

Page 45: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 45

S/P Multiplier-Accumulator

◼ y=a*x+z

Page 46: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 46

S/P Multiplier with Fixed

Coefficients (1/3)

◼ Remove all AND gates

◼ Remove all FAs and corresponding D flip-flops, starting with the MSB in the coefficient, up to the first 1 in the coefficient

◼ Replace each FA that corresponds to a zero-bit in the coefficient with a feedthrough

Page 47: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 47

S/P Multiplier with Fixed

Coefficients (2/3)

Page 48: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 48

S/P Multiplier with Fixed

Coefficients (3/3)

◼ The number of FA = (the number of 1’s)-1

◼ The number of D flip-flops = the number of 1-bit positions between the first and last bit positions

Page 49: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 49

S/P Multiplier with CSDC

Coefficients

◼ a=(0.00111)2C=(0.0100-1)CSDC

Page 50: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 50

Minimum Number of Basic

Operations

Page 51: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 51

Division

◼ How to do binary division?

◼ In the following slides, we define

Dividend z = z_{2k-1} z_{2k-2} … z_1 z_0

Divisor d = d_{k-1} d_{k-2} … d_1 d_0

Quotient q = q_{k-1} q_{k-2} … q_1 q_0

Remainder s = [z - (d × q)] = s_{k-1} s_{k-2} … s_1 s_0

Major reference:

B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford, 2000.

Page 52: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 52

What’s Different?

◼ Added complication of requiring quotient digit selection or estimation

The terms to be subtracted from the dividend z are not known a priori but become known as the quotient digits are computed

The terms to be subtracted from the initial partial remainder must be produced from top to bottom

More difficult and slower than multiplication

Long critical path

Page 53: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 53

Division

◼ Bit-serial division (sequential division algorithm)

◼ Programmed division

◼ Restoring bit-serial hardware divider

◼ Nonrestoring bit-serial hardware divider

◼ Division by constants

◼ Array divider

Page 54: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 54

Bit-Serial Division (Sequential Division) Algorithm

◼ $s^{(j)} = 2 s^{(j-1)} - q_{k-j} (2^k d)$ with $s^{(0)} = z$ and $s^{(k)} = 2^k s$

Shift: $2 s^{(j-1)}$; Subtract: $q_{k-j} (2^k d)$

◼ Or

For j = 1 to k
{
    If (2 s^(j-1) >= 2^k d)
    {
        q_(k-j) = 1;
        s^(j) = 2 s^(j-1) - 2^k d;
    }
    Else
    {
        q_(k-j) = 0;
        s^(j) = 2 s^(j-1);
    }
}
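The recurrence translates almost directly into Python. This is a minimal sketch under the assumption of unsigned operands with z < 2^k·d (no quotient overflow); the function name `sequential_divide` is illustrative.

```python
def sequential_divide(z, d, k):
    """Shift-subtract division of a 2k-bit dividend z by a k-bit divisor d.

    Returns (quotient, remainder). Requires 0 <= z < d * 2**k.
    """
    assert d > 0 and 0 <= z < (d << k)
    dk = d << k            # 2^k d, the divisor aligned with the upper half of z
    s = z                  # s(0) = z
    q = 0
    for _ in range(k):
        s <<= 1            # shift: 2 s(j-1)
        if s >= dk:
            s -= dk        # subtract: quotient digit is 1
            q = (q << 1) | 1
        else:
            q <<= 1        # quotient digit is 0, keep the shifted remainder
    return q, s >> k       # s(k) = 2^k s


print(sequential_divide(117, 10, 4))  # (11, 7)
```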

Page 55: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 55

Programmed Division

Need more than 200 instructions

for a 32-bit division!!

Page 56: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 56

Restoring Bit-Serial Hardware

Divider (1/3)

◼ “Restoring division”

Assume q=1 first, do the trial difference

The remainder is restored to its correct value

if the trial subtraction indicates that 1 was not

the right choice for q

Page 57: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 57

Restoring Bit-Serial Hardware

Divider (2/3)

Page 58: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 58

Restoring Bit-Serial Hardware Divider (3/3)

Can be shared together

Critical path

Page 59: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 59

Nonrestoring Bit-Serial

Hardware Divider (1/4)

◼ Always store u - 2^k d back to the register

◼ If the value q in this stage is 1 ➔ correct!

Next stage: 2(u - 2^k d) - 2^k d = 2u - 3 × 2^k d

◼ If the value q in this stage is 0 ➔ incorrect!

Next stage should be: 2u - 2^k d

This is equal to 2(u - 2^k d) + 2^k d

◼ Always store the result of the trial difference

If q=1 ➔ use subtraction; if q=0 ➔ use addition

◼ Can reduce critical path
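The nonrestoring scheme can be sketched in Python as below. It assumes unsigned operands with z < 2^k·d, takes each quotient bit from the sign of the new partial remainder, and adds one final correction step when the last partial remainder is negative (a standard detail that the bullets above do not list). The function name is hypothetical.

```python
def nonrestoring_divide(z, d, k):
    """Nonrestoring division; returns (quotient, remainder) for 0 <= z < d * 2**k."""
    assert d > 0 and 0 <= z < (d << k)
    dk = d << k                 # 2^k d
    s = z
    q = 0
    for _ in range(k):
        if s >= 0:
            s = 2 * s - dk      # previous digit was 1 (or first step): subtract
        else:
            s = 2 * s + dk      # previous digit was 0: add instead of restoring
        q = (q << 1) | (1 if s >= 0 else 0)
    if s < 0:
        s += dk                 # final correction so the remainder is non-negative
    return q, s >> k


print(nonrestoring_divide(117, 10, 4))  # (11, 7)
print(nonrestoring_divide(36, 6, 4))    # (6, 0), exercises the final correction
```

Every iteration performs exactly one addition or subtraction, with no restore step in the loop.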

Page 60: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 60

Nonrestoring Bit-Serial

Hardware Divider (2/4)

Page 61: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 61

Nonrestoring Bit-Serial

Hardware Divider (3/4)

Critical path

Page 62: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 62

Nonrestoring Bit-

Serial Hardware

Divider (4/4)

Page 63: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 63

Division by Constants (1/2)

◼ Use lookup table + constant multiplier

◼ Exploit the following equations

Consider odd divisors only, since division by an even divisor can be performed by first dividing by an odd integer and then shifting the result

For an odd integer d, there exists an odd integer m such that d × m = 2^n - 1

Page 64: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 64

Division by Constants (2/2)

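$$\frac{1}{d} = \frac{m}{2^n - 1} = \frac{m}{2^n (1 - 2^{-n})} = \frac{m}{2^n} (1 + 2^{-n})(1 + 2^{-2n})(1 + 2^{-4n}) \cdots$$

For example, for 24-bit precision:

$$d = 5, \quad m = 3, \quad n = 4$$

$$\frac{z}{5} = \frac{3z}{2^4 - 1} = \frac{3z}{16 (1 - 2^{-4})} = \frac{3z}{16} (1 + 2^{-4})(1 + 2^{-8})(1 + 2^{-16})$$

Next term $(1 + 2^{-32})$ does not contribute anything to 24-bit precision

Easy for hardware implementation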
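The truncated product can be checked with exact rational arithmetic. The Python sketch below uses `fractions.Fraction` purely for verification (a hardware version would use shifts and adds); the function name `reciprocal_by_factors` is illustrative.

```python
from fractions import Fraction


def reciprocal_by_factors(m, n, terms):
    """Approximate 1/d, where d*m = 2**n - 1, by (m / 2**n) * prod(1 + 2**(-n * 2**i))."""
    value = Fraction(m, 2**n)
    for i in range(terms):
        value *= 1 + Fraction(1, 2 ** (n * 2**i))   # factors 2^-n, 2^-2n, 2^-4n, ...
    return value


approx = reciprocal_by_factors(m=3, n=4, terms=3)   # d = 5
error = Fraction(1, 5) - approx
print(error == Fraction(1, 5 * 2**32))              # True
```

With three factors the relative error is exactly 2^-32, which is why a further factor is not needed for 24-bit precision.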

Page 65: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 65

Array Divider (1/2)

◼ Restoring array divider

FS: full subtractor

The critical path

passes through all k2

cells

Page 66: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 66

Array Divider (2/2)

◼ Nonrestoring array divider

FA: full adder

The critical path

passes through all k2

cells

Page 67: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 67

Distributed Arithmetic (1/8)

◼ Most DSP algorithms involve sum-of-products (inner products)

◼ Distributed arithmetic (DA) is an efficient procedure for computing inner products between a fixed and a variable data vector

Fixed coefficient

Page 68: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 68

Distributed Arithmetic (2/8)

Put Fk in ROM

Page 69: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 69

Distributed Arithmetic (3/8)

◼ Example:

y = a1 x1 + a2 x2 + a3 x3

a1=2, a2=3, a3=4

x1k x2k x3k | Fk

 0   0   0  | 0

 0   0   1  | 4

 0   1   0  | 3

 0   1   1  | 7

 1   0   0  | 2

 1   0   1  | 6

 1   1   0  | 5

 1   1   1  | 9

y=?

When (x1, x2, x3)=(1, 2, 3)

x1 = 0 0 1

x2 = 0 1 0

x3 = 0 1 1
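The example can be worked through with a short Python sketch of distributed arithmetic. It assumes unsigned data words processed LSB first and weights each ROM output by shifting it left (a hardware shift-accumulator would shift the accumulator right instead); the function names are hypothetical.

```python
def build_da_rom(coeffs):
    """ROM contents F_k: entry at address (x1k x2k ... xNk) is the sum of selected coefficients."""
    n = len(coeffs)
    return [
        sum(c for j, c in enumerate(coeffs) if (addr >> (n - 1 - j)) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(coeffs, xs, wd):
    """Inner product sum(a_i * x_i) for unsigned wd-bit data, one bit plane per cycle."""
    rom = build_da_rom(coeffs)
    acc = 0
    for k in range(wd):                      # bit planes, LSB first
        addr = 0
        for x in xs:                         # x1 ends up as the address MSB
            addr = (addr << 1) | ((x >> k) & 1)
        acc += rom[addr] << k                # weight the table output by 2^k
    return acc


print(build_da_rom([2, 3, 4]))                     # [0, 4, 3, 7, 2, 6, 5, 9]
print(da_inner_product([2, 3, 4], [1, 2, 3], 3))   # 6*1 + 7*2 + 0*4 = 20
```

The bit planes address the table with 101, 011, and 000, giving 6, 7, and 0, so y = 2·1 + 3·2 + 4·3 = 20 without any multiplier.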

Page 70: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 70

Distributed Arithmetic (4/8)

◼ DA can be implemented with a ROM and a shift-accumulator

◼ The computation time: W_d cycles

◼ Word length of ROM: $W_{ROM} \le W_C + \log_2(N)$

Data input from LSB to MSB in bit-serial

Page 71: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 71

Distributed Arithmetic (5/8)

◼ Example

y=a1x1+a2x2+a3x3

a1=(0.0100001)2C

a2=(0.1010101)2C

a3=(1.1110101)2C

◼ (a) The table? (b) The word length of the

shift-accumulator?

Page 72: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 72

Distributed Arithmetic (6/8)

◼ Ans:

(a)

(b) Word length=7 bits + 1 bit (sign bit) +1 bit (guard bit) = 9 bits

Page 73: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 73

Distributed Arithmetic (7/8)

◼ Example: linear-phase FIR filter

Page 74: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 74

Distributed Arithmetic (8/8)

◼ Parallel implementation of distributed

arithmetic

Page 75: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 75

Shift-Accumulator (1/4)

◼ The number of cycles for one inner product is Wd+WROM

First Wd cycles: input data

Last WROM cycles: shift out the results

Page 76: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 76

Shift-Accumulator (2/4)

◼ Shift-accumulator augmented with two

shift registers

Page 77: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 77

Shift-Accumulator (3/4)

◼ Scheduling

◼ Clock cycle NCL=max{WROM, Wd}

[Timing diagram: LSP(0), MSP(0), LSP(1), MSP(1), LSP(2), MSP(2), ...; interval lengths W_d and W_ROM]

Page 78: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 78

Shift-Accumulator (4/4)

◼ Detailed architecture

Page 79: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 79

Reducing the Memory Size (1/4)

◼ Method 1: memory partition

$2 \cdot 2^{N/2} < 2^N$

Ex: $2 \cdot 2^5 = 64 < 2^{10} = 1024$

Page 80: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 80

Reducing the Memory Size (2/4)

◼ Method 2: memory coding

Page 81: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 81

Reducing the Memory Size (3/4)

Complement

Page 82: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 82

Reducing the Memory Size (4/4)

Page 83: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 83

CORDIC

◼ CORDIC (COordinate Rotation DIgital Computer)

An iterative arithmetic algorithm introduced by Volder in 1956

Can handle many elementary functions, such as trigonometric, exponential, and logarithmic functions, with only shift-and-add arithmetic

For these functions, a CORDIC-based architecture is much more efficient than a multiplier-and-accumulator (MAC) based architecture

Suitable for transformations and matrix-based filters

Major references:

[1] A.-Y. Wu, “CORDIC,” Slides of Advanced VLSI

[2] Y. H. Hu, “CORDIC-based VLSI architectures for digital signal processing,” IEEE Signal Processing Magazine, pp. 16-35, July 1992.

[3] J. E. Volder, “The Birth of CORDIC,” J. VLSI Signal Processing, vol. 25, pp. 101-105, 2000.

Page 84: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 84

The Birth of CORDIC

B-58 Supersonic Bomber

CORDIC I

CORDIC II

Page 85: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 85

Simple Concepts of CORDIC

(1/2)

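◼ CORDIC was originally invented to handle the rotation problem with shift-and-add arithmetic

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$$

[Figure: vector (x, y) rotated by θ to (x', y')]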

Page 86: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 86

Simple Concepts of CORDIC

(2/2)

◼ How to do it with shift-and-add?

◼ Decompose the desired rotation angle into small rotation angles (micro-rotations)

◼ Rotate a finite number of times, by “elementary angles” $\{a_i \mid 0 \le i \le n-1\}$, to achieve the desired rotation

[Figure: a rotation built from micro-rotations 1, 2, 3, 4]

Page 87: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 87

Conventional CORDIC

Algorithm (1/2)

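$$\begin{bmatrix} x(i+1) \\ y(i+1) \end{bmatrix} = \begin{bmatrix} \cos a_i & -\sin a_i \\ \sin a_i & \cos a_i \end{bmatrix} \begin{bmatrix} x(i) \\ y(i) \end{bmatrix}$$

$$= \cos a_i \begin{bmatrix} 1 & -\tan a_i \\ \tan a_i & 1 \end{bmatrix} \begin{bmatrix} x(i) \\ y(i) \end{bmatrix}$$

$$= \cos a_i \begin{bmatrix} 1 & -2^{-i} \\ 2^{-i} & 1 \end{bmatrix} \begin{bmatrix} x(i) \\ y(i) \end{bmatrix}$$

where $\tan a_i = 2^{-i}$ and $\cos a_i = \dfrac{1}{\sqrt{1 + 2^{-2i}}}$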

Page 88: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 88

Conventional CORDIC

Algorithm (2/2)

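$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = S \begin{bmatrix} 1 & -\mu_{n-1} 2^{-(n-1)} \\ \mu_{n-1} 2^{-(n-1)} & 1 \end{bmatrix} \cdots \begin{bmatrix} 1 & -\mu_i 2^{-i} \\ \mu_i 2^{-i} & 1 \end{bmatrix} \cdots \begin{bmatrix} 1 & -\mu_0 2^{-0} \\ \mu_0 2^{-0} & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$$

Scaling factor: $S = \prod_{i=0}^{n-1} \dfrac{1}{\sqrt{1 + 2^{-2i}}}$

Mode of rotation: $\mu_i \in \{1, -1\}$

The matrix product can be implemented with shift-and-add arithmetic; the factor S is the scaling

[Figure: vector (x, y) rotated to (x', y') by micro-rotations 0, 1, 2 with $\mu_0 = 1,\ a_0 = \tan^{-1} 2^{-0}$; $\mu_1 = -1,\ a_1 = \tan^{-1} 2^{-1}$; $\mu_2 = 1,\ a_2 = \tan^{-1} 2^{-2}$]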

Page 89: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 89

Generalized CORDIC (1/2)

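◼ Target: $\theta = \sum_{i=0}^{n-1} \mu_i a_m(i)$

◼ i-th elementary rotation angle is defined by

$$a_m(i) = \frac{1}{\sqrt{m}} \tan^{-1}\!\left(\sqrt{m}\, 2^{-s(m,i)}\right) = \begin{cases} 2^{-s(0,i)}, & m \to 0 \quad \text{(linear coordinate)} \\ \tan^{-1} 2^{-s(1,i)}, & m = 1 \quad \text{(circular coordinate)} \\ \tanh^{-1} 2^{-s(-1,i)}, & m = -1 \quad \text{(hyperbolic coordinate)} \end{cases}$$

$s(m,i)$: non-decreasing integer shift sequence

$\mu_i \in \{1, -1\}$: mode of rotation

The norm of a vector $[x\ y]^T$ is $\sqrt{x^2 + m y^2}$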

Page 90: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 90

Generalized CORDIC (2/2)

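[Figure, three panels, each showing successive vectors v(0), v(1), v(2), ...:

Circular Rotation (m=1): v(i)=[x(i) y(i)]^T moves along the circle x^2+y^2=1

Linear Rotation (m->0): v(i)=[x(0) y(i)]^T moves along the line x=1

Hyperbolic Rotation (m=-1): v(i)=[x(i) y(i)]^T moves along the hyperbola x^2-y^2=1, with asymptotes y=x and y=-x]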

Page 91: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 91

CORDIC Algorithm

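Initiation: Given x(0), y(0), z(0)

For i = 0 to n-1, Do

    /* CORDIC iteration equation */

$$\begin{bmatrix} x(i+1) \\ y(i+1) \end{bmatrix} = \begin{bmatrix} 1 & -m \mu_i 2^{-s(m,i)} \\ \mu_i 2^{-s(m,i)} & 1 \end{bmatrix} \begin{bmatrix} x(i) \\ y(i) \end{bmatrix}$$

    /* Angle updating equation */

$$z(i+1) = z(i) - \mu_i a_m(i)$$

End i-loop

/* Scaling operation (required for m = ±1 only) */

$$\begin{bmatrix} x_f \\ y_f \end{bmatrix} = \frac{1}{K_m(n)} \begin{bmatrix} x(n) \\ y(n) \end{bmatrix} = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + m\, 2^{-2 s(m,i)}}} \begin{bmatrix} x(n) \\ y(n) \end{bmatrix}$$

Remaining problems: $\mu_i$, $s(m,i)$, scaling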
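As an illustration of the algorithm, here is a minimal floating-point Python sketch for the circular case only (m = 1, s(1,i) = i), with the rotation direction chosen from the sign of z(i) so that the residual angle is driven toward zero. The function name `cordic_rotate`, the default n = 24, and the use of floats instead of fixed-point shifts are assumptions for illustration; the input angle must lie within the convergence range (about ±1.74 rad).

```python
import math


def cordic_rotate(x, y, theta, n=24):
    """Rotate (x, y) by theta using n CORDIC micro-rotations (circular mode, m = 1)."""
    z = theta
    for i in range(n):
        mu = 1 if z >= 0 else -1          # direction of this micro-rotation
        shift = 2.0 ** -i                 # 2^{-s(m,i)} with s(1, i) = i
        # CORDIC iteration equation: only shifts and additions in hardware
        x, y = x - mu * shift * y, y + mu * shift * x
        # Angle updating equation
        z -= mu * math.atan(shift)
    # Scaling operation: divide by K_1(n) = prod sqrt(1 + 2^{-2i})
    k = math.prod(math.sqrt(1 + 2.0 ** (-2 * i)) for i in range(n))
    return x / k, y / k


xr, yr = cordic_rotate(1.0, 0.0, math.pi / 6)
print(round(xr, 5), round(yr, 5))  # approximately cos(30 deg), sin(30 deg)
```

In a hardware realization the arctangent values would come from a small table and the final division would be replaced by one of the scaling schemes.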

Page 92: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 92

Mode of Operation (1/2)

◼ Vector rotation mode (θ is given)

Set $z(0) = \theta$

After n iterations, the total angle rotated is:

$$z(0) - z(n) = \theta - z(n) = \sum_{i=0}^{n-1} \mu_i a_m(i)$$

We want to make $|z(n)| \to 0$

$\mu_i$ = sign of $z(i)$

For many DSP problems, θ is known in advance, and the sequence $\{\mu_i\}$ can be stored instead

[Figure: angle z(0) reduced to z(n)]

Page 93: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 93

Mode of Operation (2/2)

◼ Angle accumulation mode (θ is not given)

The objective is to rotate the given initial vector $[x(0)\ y(0)]^T$ back to the x-axis

Set $z(0) = 0$

$\mu_i$ = $-$sign of $x(i) y(i)$

◼ Summary

Vector rotation mode: $\mu_i$ = sign of $z(i)$

Angle accumulation mode: $\mu_i$ = $-$sign of $x(i) y(i)$
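The angle accumulation mode can be sketched in Python in the same style. This is a floating-point illustration for the circular case (m = 1, shift sequence s(1,i) = i), assuming x(0) > 0 so that the vector converges to the positive x-axis; the function name `cordic_vector` is hypothetical. With z(0) = 0 and the update z(i+1) = z(i) - μ_i·a(i), z(n) approaches the angle of the input vector and the scaled x(n) approaches its length.

```python
import math


def cordic_vector(x, y, n=24):
    """Angle accumulation mode: returns (magnitude, angle) of (x, y), assuming x > 0."""
    z = 0.0                                   # set z(0) = 0
    for i in range(n):
        mu = -1 if x * y >= 0 else 1          # mu_i = -sign of x(i) y(i)
        shift = 2.0 ** -i
        x, y = x - mu * shift * y, y + mu * shift * x
        z -= mu * math.atan(shift)            # accumulates the rotated angle
    k = math.prod(math.sqrt(1 + 2.0 ** (-2 * i)) for i in range(n))
    return x / k, z


mag, ang = cordic_vector(3.0, 4.0)
print(round(mag, 5), round(ang, 5))  # approximately 5.0 and atan2(4, 3)
```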

Page 94: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 94

Shift Sequence

◼ Usually defined in advance

◼ Walther has proposed a set of shift sequences, one for each of the three coordinate systems

For m=0 or 1, s(m,i)=i

For m=-1, s(-1, i)=1, 2, 3, 4, 4, 5, …, 12, 13, 13, 14, …

Page 95: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 95

Scaling Operation

◼ Significant computation overhead of CORDIC

◼ Fortunately, since $|\mu_i| = 1$, and assuming $\{s(m,i)\}$ is given, $\dfrac{1}{K_m(n)}$ can be computed in advance

◼ Two approaches to compute scaling

CSD representation

$$\frac{1}{K_m(n)} = \sum_{p=1}^{P} \kappa_p 2^{-i_p}$$

Product of factors

$$\frac{1}{K_m(n)} = \prod_{q=1}^{Q} \left(1 + \kappa_q 2^{-i_q}\right)$$
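For the CSD approach, the Python sketch below precomputes 1/K_1(n) for the circular case and recodes a rounded fixed-point version of it into signed power-of-two terms (κ_p, i_p). The 16-bit fractional precision, the rounding step, and the function names are assumptions made for illustration; they are not taken from the slides.

```python
import math


def cordic_gain(n):
    """K_1(n) = prod_{i=0}^{n-1} sqrt(1 + 2^{-2i}) for the circular case with s(1, i) = i."""
    return math.prod(math.sqrt(1 + 2.0 ** (-2 * i)) for i in range(n))


def csd_terms(value, frac_bits):
    """Recode `value`, rounded to frac_bits fractional bits, as a list of (kappa, i) terms.

    Each term stands for kappa * 2**-i with kappa in {+1, -1}.
    """
    n = round(value * (1 << frac_bits))
    terms = []
    pos = frac_bits                 # current bit has weight 2**-pos
    while n:
        if n & 1:
            d = 2 - (n & 3)         # canonic signed digit: +1 or -1
            n -= d
            terms.append((d, pos))
        n >>= 1
        pos -= 1
    return terms


inv_k = 1 / cordic_gain(16)
terms = csd_terms(inv_k, 16)
approx = sum(kappa * 2.0 ** -i for kappa, i in terms)
print(round(inv_k, 6), len(terms), abs(approx - inv_k) <= 2.0 ** -17)
```

Scaling x(n) then costs one shift-and-add per term, so fewer nonzero CSD digits mean fewer scaling iterations.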

Page 96: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 96

Basic CORDIC Processor (1/3)

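[Block diagram, two parts:

For CORDIC Iteration and Scaling: X-Reg and Y-Reg, each followed by a barrel shifter (shift amount s(m,i)), with MUXes and adder/subtractors (+/-) computing x(i+1), y(i+1) from x(i), y(i)

For Angle Update: z-Reg with an adder/subtractor (+/-) computing z(i+1) from z(i), and a table of angles a(0), a(1), ..., a(n-1)]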

Page 97: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 97

Basic CORDIC Processor (2/3)

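◼ CORDIC Iteration

$$\begin{bmatrix} x(i+1) \\ y(i+1) \end{bmatrix} = \begin{bmatrix} 1 & -m \mu_i 2^{-s(m,i)} \\ \mu_i 2^{-s(m,i)} & 1 \end{bmatrix} \begin{bmatrix} x(i) \\ y(i) \end{bmatrix}$$

[Block diagram: X-Reg and Y-Reg, barrel shifters (shift amount s(m,i)), MUXes, and adder/subtractors (+/-) computing x(i+1), y(i+1) from x(i), y(i)]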

Page 98: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 98

Basic CORDIC Processor (3/3)

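◼ Scaling

[Block diagram: X-Reg and Y-Reg, barrel shifters (shift amount $i_p$ or $i_q$), MUXes, and adder/subtractors (+/-, sign $\kappa_p$ or $\kappa_q$) computing x(i+1), y(i+1) from x(i), y(i)]

I: $\dfrac{1}{K_m(n)} = \sum_{p=1}^{P} \kappa_p 2^{-i_p}$

II: $\dfrac{1}{K_m(n)} = \prod_{q=1}^{Q} \left(1 + \kappa_q 2^{-i_q}\right)$

Given $x'(0) = x(n)$, $y'(0) = y(n)$

Type I:

$$x'(p+1) = x'(p) + \kappa_p 2^{-i_p} x(n), \qquad y'(p+1) = y'(p) + \kappa_p 2^{-i_p} y(n)$$

Type II:

$$x'(q+1) = x'(q) + \kappa_q 2^{-i_q} x'(q), \qquad y'(q+1) = y'(q) + \kappa_q 2^{-i_q} y'(q)$$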

Page 99: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 99

Parallel and Pipelined Arrays

◼ n stages for CORDIC, and s stages for scaling

◼ Parallel

[Figure: x(0), y(0) → CORDIC Processor (1) → CORDIC Processor (2) → ... → CORDIC Processor (n+s) → xf, yf]

◼ Pipelined

[Figure: the same chain of n+s CORDIC processors from x(0), y(0) to xf, yf, with delay elements (D) between the processors]

Page 100: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 100

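Discrete Fourier Transform (DFT) with CORDIC (1/2)

◼ DFT

$$Y(k) = X(0) e^{-j 2\pi \cdot 0 \cdot k / N} + X(1) e^{-j 2\pi \cdot 1 \cdot k / N} + \cdots + X(N-1) e^{-j 2\pi (N-1) k / N}$$

◼ DFT with CORDIC

Initiation: Y(0, k) = 0 for 0 ≤ k ≤ N-1

For k = 0 to N-1, Do

    For m = 0 to N-1, Do

$$\begin{bmatrix} Y_r(m+1, k) \\ Y_i(m+1, k) \end{bmatrix} = \begin{bmatrix} Y_r(m, k) \\ Y_i(m, k) \end{bmatrix} + K_1(n) \begin{bmatrix} \cos\frac{2\pi m k}{N} & \sin\frac{2\pi m k}{N} \\ -\sin\frac{2\pi m k}{N} & \cos\frac{2\pi m k}{N} \end{bmatrix} \begin{bmatrix} x_r(m) \\ x_i(m) \end{bmatrix}$$

    End m-loop

    /* Scaling operation */

$$Y(k) = \frac{1}{K_1(n)} Y(N, k)$$

End k-loop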

Page 101: Processing Elements Design

DSP in VLSI Design Shao-Yi Chien 101

Discrete Fourier Transform

(DFT) with CORDIC (2/2)

[Block diagram: N vector-rotation units, each fed with xr(m), xi(m) for m = 0 → N-1 and followed by a buffer; the rotation angles are 0, ..., 2πkm/N, ..., 2π(N-1)m/N and the outputs are Y(0), ..., Y(k), ..., Y(N-1)]