
DSP lab 1

ד''בס

11/15/2014 1

Introduction to DSP processors

Presented by:


Contents:

The modern processor's classification;

The digital signal processing methods & algorithms;

The D(igital) S(ignal) P(rocessing) algorithms implementation;

The SHARC processor: architecture, data types & formats, C & Assembler;

Getting started.


The modern processor's classification.

Today chips fall into three groups:

FPGA & EPLD;

ASICs (Application-Specific Integrated Circuits);

Chips with hardware realization of data processing algorithms (microprocessors & microcontrollers).



Microprocessors & microcontrollers

DSP microprocessors.

These processors are intended for real-time digital signal processing systems.

General-purpose microprocessors.

This kind of processor is intended for computer systems: PCs, workstations & parallel supercomputers.

Microcontrollers.

Highly specialized processors intended for embedded systems and various household devices.


Review: Processor Classes

General Purpose - high performance – Pentiums, Alpha's, SPARC

– 64-128 bit word size

– Used for general purpose software

– Heavy weight OS - UNIX, NT

– Multiple layers of cache memory

– Workstations, PC's

Embedded processors and processor cores – ARM9, ARC, 486SX, Hitachi SH7000, NEC V800

– 32 bit word size

– Single program

– Lightweight, often real-time OS

– Code and Data memory cache, DSP support

– Cellular phones, consumer electronics (e. g. CD players)

DSP processors – SHARC, BlackFin, TMS320C55x, TMS320C67x, TMS320C64x

– 16-32 bit word size

– Single program

– Lightweight, often real-time OS

– Super Harvard Architecture Support, MAC, Circular buffer, Dual-Port RAM

– Audio, Image and Video processing, Coding and Decoding, Cellular Base Station, Adaptive Filtering, Real Time operations

Microcontrollers – PIC, AVR, HC11, ARM7, 8051, 80251

– Extremely cost sensitive

– Small word size - 8 bit common

– Highest volume processors by far

– Automobiles, toasters, thermostats, ...

[Figure: processor classes plotted by performance versus cost: microcontrollers, DSP, GP embedded, GP high performance]


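The digital signal processing methods & algorithms.

The analog signal processing example:

[Figure: active low-pass filter built around an operational amplifier with input resistor Ri, feedback resistor Rf and feedback capacitor Cf; input x(t), output y(t); magnitude response versus frequency f with cutoff frequency fc]

$$\frac{y(t)}{x(t)} = -\frac{R_f}{R_i}\cdot\frac{1}{1 + j\omega R_f C_f}$$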




The digital signal processing system:

x(t) → Anti-aliasing filter → x'(t) → A/D → x'(n) → Digital filter or digital transform → y'(n) → D/A → y'(t) → Smoothing filter → y(t)

$$y(n) = \sum_{k=0}^{N} C_k\, x(n-k)$$

[Figure: magnitude response |H(f)| with cutoff frequency fc]


The DSP vs. ASP

5th Order Analog Sallen Key Low-Pass Filter

5th Order Digital Filter of Direct Form II




Analog signal processing versus digital signal processing

Analog signal processing:

Cheaper;

More compact;

Lower power dissipation.

Digital signal processing:

More accurate;

More stable across different environments.




The basic concepts of DSP:

Time sampling:

x(nT) – one sample at time nT;

T – sampling period (time);

f_D – sampling frequency: $f_D = \frac{1}{T}$ or $T = \frac{1}{f_D}$;

Nyquist frequency: $f_D \ge 2F$ or $T \le \frac{1}{2F}$;

F – the highest frequency of the signal.

Amplitude quantization:

x(nT) ~ x(n);

Resolution: $R = 2^N$;

e_Q – quantization error: $e_{Q\max} = \frac{1}{2^N}$;

N – number of bits.

The basic concepts of DSP:

$T_{alg} \le T_{del}$, where $T_{alg}$ is the algorithm time and $T_{del}$ is the sample time;

$T_{alg} = N_{op}\,\tau_{op}$, with $\tau_{op} = \frac{1}{f_{CPU\,CLK}}$;

$N_{op}$ – number of operations in the algorithm;

$f_{CPU\,CLK}$ – processor clock frequency;

$\tau_{op}$ – operation time.
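As a rough illustration of the real-time budget, the sketch below computes how many operations fit into one sample period. The function names and the 48 kHz sampling rate are illustrative assumptions; the 100 MHz clock is the ADSP-21160 figure quoted in the feature list of these slides, and one operation per clock cycle is assumed.

```python
def max_ops_per_sample(f_clk_hz: int, f_sample_hz: int) -> int:
    """Largest N_op with N_op / f_clk <= 1 / f_sample (one op per cycle assumed)."""
    return f_clk_hz // f_sample_hz


def meets_real_time(n_op: int, f_clk_hz: int, f_sample_hz: int) -> bool:
    """True if the algorithm time N_op / f_clk does not exceed the sample period."""
    return n_op * f_sample_hz <= f_clk_hz


# Hypothetical case: 100 MHz core clock, 48 kHz audio sampling.
print(max_ops_per_sample(100_000_000, 48_000))
```

An algorithm that needs more operations per sample than this budget cannot keep up with the input stream on a single core.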


The D(igital) S(ignal) P(rocessing) algorithms implementation

The major tasks in DSP:

Filter design (linear filtering);

Speech detection, image recognition (spectral analysis);

Image & speech compression (time-frequency analysis);

Image & signal processing (adaptive filtering);

Coding, median filters (non-linear processing);

Interpolation & decimation (multirate processing).

The most commonly used DSP algorithms:

FIR filter

IIR filter

FFT

STD

Convolution

Correlation




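FIR filter (Finite Impulse Response)

$$y(n) = \sum_{i=0}^{N-1} b_i\, x(n-i)$$

[Figure: FIR filter block diagram: x(n) passes through a chain of unit delays Z^-1; the taps are weighted by b0, b1, b2 and summed to give y(n)]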
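The FIR difference equation can be sketched directly in a few lines. This is a minimal illustration, not an optimized implementation; the function name `fir_filter` and the three example coefficients are assumptions, and samples before n = 0 are taken to be zero.

```python
def fir_filter(b, x):
    """Return y(n) = sum_i b[i] * x(n - i), with x(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, coeff in enumerate(b):
            if n - i >= 0:
                acc += coeff * x[n - i]   # one multiply-accumulate per tap
        y.append(acc)
    return y


# A unit impulse at the input reproduces the coefficients at the output.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
print(fir_filter([0.5, 0.3, 0.2], impulse))
```

The impulse response ends after N samples, which is where the name "finite impulse response" comes from.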




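IIR filter (Infinite Impulse Response)

$$y(n) = \sum_{i=0}^{N-1} b_i\, x(n-i) - \sum_{k=1}^{M} a_k\, y(n-k)$$

[Figure: IIR filter block diagram: a feed-forward path with coefficients b0, b1, b2 and a feedback path through unit delays Z^-1 whose coefficients enter the summation with a negative sign]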
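A minimal sketch of the recursive (IIR) difference equation follows. It assumes the common sign convention in which the feedback terms are subtracted, a zero initial state, and a hypothetical function name `iir_filter`; the list `a` holds the feedback coefficients a_1 ... a_M.

```python
def iir_filter(b, a, x):
    """y(n) = sum_{i>=0} b[i] x(n-i) - sum_{k>=1} a[k-1] y(n-k); zero initial state."""
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[k - 1] * y[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        y.append(acc)
    return y


# One feedback coefficient a_1 = -0.5: the impulse response decays but never ends.
print(iir_filter([1.0], [-0.5], [1.0, 0.0, 0.0, 0.0]))
```

Because each output depends on earlier outputs, a single input impulse keeps producing non-zero samples, unlike the FIR case.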




Discrete Fourier Transform

$$X(k) = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\, e^{-i 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1$$

$$W_N^{kn} = e^{-i 2\pi k n / N}$$

THE COMPLEX DFT

Time Domain → DFT → Frequency Domain:

$$X(k) = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\, W_N^{nk}$$

Frequency Domain → INVERSE DFT → Time Domain:

$$x(n) = \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}$$




THE 8-POINT DFT:

$$X(k) = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\, W_N^{nk}$$

$X(0) = x(0)W_8^{0} + x(1)W_8^{0} + x(2)W_8^{0} + x(3)W_8^{0} + x(4)W_8^{0} + x(5)W_8^{0} + x(6)W_8^{0} + x(7)W_8^{0}$

$X(1) = x(0)W_8^{0} + x(1)W_8^{1} + x(2)W_8^{2} + x(3)W_8^{3} + x(4)W_8^{4} + x(5)W_8^{5} + x(6)W_8^{6} + x(7)W_8^{7}$

$X(2) = x(0)W_8^{0} + x(1)W_8^{2} + x(2)W_8^{4} + x(3)W_8^{6} + x(4)W_8^{8} + x(5)W_8^{10} + x(6)W_8^{12} + x(7)W_8^{14}$

$X(3) = x(0)W_8^{0} + x(1)W_8^{3} + x(2)W_8^{6} + x(3)W_8^{9} + x(4)W_8^{12} + x(5)W_8^{15} + x(6)W_8^{18} + x(7)W_8^{21}$

$X(4) = x(0)W_8^{0} + x(1)W_8^{4} + x(2)W_8^{8} + x(3)W_8^{12} + x(4)W_8^{16} + x(5)W_8^{20} + x(6)W_8^{24} + x(7)W_8^{28}$

$X(5) = x(0)W_8^{0} + x(1)W_8^{5} + x(2)W_8^{10} + x(3)W_8^{15} + x(4)W_8^{20} + x(5)W_8^{25} + x(6)W_8^{30} + x(7)W_8^{35}$

$X(6) = x(0)W_8^{0} + x(1)W_8^{6} + x(2)W_8^{12} + x(3)W_8^{18} + x(4)W_8^{24} + x(5)W_8^{30} + x(6)W_8^{36} + x(7)W_8^{42}$

$X(7) = x(0)W_8^{0} + x(1)W_8^{7} + x(2)W_8^{14} + x(3)W_8^{21} + x(4)W_8^{28} + x(5)W_8^{35} + x(6)W_8^{42} + x(7)W_8^{49}$
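The row-by-row expansion corresponds to a double loop over k and n. The sketch below is a direct, unoptimized DFT that keeps the 1/N factor used on these slides; the names `twiddle` and `dft` are illustrative assumptions.

```python
import cmath


def twiddle(N, exponent):
    """W_N^exponent = exp(-i * 2*pi * exponent / N)."""
    return cmath.exp(-2j * cmath.pi * exponent / N)


def dft(x):
    """Direct DFT with the 1/N factor: N*N complex multiplications."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) / N for k in range(N)]


# Unit impulse: every frequency bin has the same magnitude 1/8.
X = dft([1, 0, 0, 0, 0, 0, 0, 0])
print([round(abs(v), 6) for v in X])
```

For N = 8 the direct form already needs 64 complex multiplications, which is the cost the FFT is designed to reduce.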

Direct computation of the DFT is basically inefficient because it does not exploit the symmetry and periodicity properties of the phase factor $W_N$. In particular, these two properties are:

Symmetry property: $W_N^{k+N/2} = -W_N^{k}$

Periodicity property: $W_N^{k+N} = W_N^{k}$

$X(7) = x(0)W_8^{0} + x(1)W_8^{7} + x(2)W_8^{14} + x(3)W_8^{21} + x(4)W_8^{28} + x(5)W_8^{35} + x(6)W_8^{42} + x(7)W_8^{49}$

$X(7) = x(0)W_8^{0} + x(1)W_8^{7} + x(2)W_8^{6} + x(3)W_8^{5} + x(4)W_8^{4} + x(5)W_8^{3} + x(6)W_8^{2} + x(7)W_8^{1}$

[Figure: signal-flow graph of the 8-point FFT with twiddle factors W_N^0 … W_N^7; inputs in the order x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7); outputs X(0) … X(7)]

Partial sums on the way to X(7):

$x(0) + x(4)W_N^{4}$, $x(2) + x(6)W_N^{4}$, $x(1) + x(5)W_N^{4}$, $x(3) + x(7)W_N^{4}$

$x(0) + x(4)W_N^{4} + x(2)W_N^{6} + x(6)W_N^{10}$

$x(1) + x(5)W_N^{4} + x(3)W_N^{6} + x(7)W_N^{10}$

$X(7) = x(0)W_8^{0} + x(1)W_8^{7} + x(2)W_8^{6} + x(3)W_8^{13} + x(4)W_8^{4} + x(5)W_8^{11} + x(6)W_8^{10} + x(7)W_8^{17}$

$(W_8^{8} = W_8^{0},\ W_8^{13} = W_8^{5},\ W_8^{11} = W_8^{3},\ W_8^{10} = W_8^{2},\ W_8^{17} = W_8^{1})$

$X(7) = x(0)W_8^{0} + x(1)W_8^{7} + x(2)W_8^{6} + x(3)W_8^{5} + x(4)W_8^{4} + x(5)W_8^{3} + x(6)W_8^{2} + x(7)W_8^{1}$

[Figure: signal-flow graph of the 8-point FFT with twiddle factors W_N^0 … W_N^7; inputs in the order x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7); outputs X(0) … X(7)]

Partial sums on the way to X(2):

$x(0) + x(4)W_N^{0}$, $x(2) + x(6)W_N^{0}$, $x(1) + x(5)W_N^{0}$, $x(3) + x(7)W_N^{0}$

$x(0) + x(4)W_N^{0} + x(2)W_N^{4} + x(6)W_N^{4}$

$x(1) + x(5)W_N^{0} + x(3)W_N^{4} + x(7)W_N^{4}$

$X(2) = x(0)W_8^{0} + x(1)W_8^{2} + x(2)W_8^{4} + x(3)W_8^{6} + x(4)W_8^{8} + x(5)W_8^{10} + x(6)W_8^{12} + x(7)W_8^{14}$

$(W_8^{8} = W_8^{0},\ W_8^{10} = W_8^{2},\ W_8^{12} = W_8^{4},\ W_8^{14} = W_8^{6})$

$X(2) = x(0)W_8^{0} + x(1)W_8^{2} + x(2)W_8^{4} + x(3)W_8^{6} + x(4)W_8^{0} + x(5)W_8^{2} + x(6)W_8^{4} + x(7)W_8^{6}$

Bit reverse operations (sample index – position in the reordered input):

000 – 000
100 – 001
010 – 010
110 – 011
001 – 100
101 – 101
011 – 110
111 – 111
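The bit-reverse table can be generated programmatically. The sketch below is a plain software version of the reordering; the function names are illustrative assumptions, and a DSP would typically use a dedicated bit-reversed addressing mode instead of a loop like this.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index, e.g. 001 -> 100 for bits=3."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(x):
    """Reorder x (length must be a power of two) into bit-reversed index order."""
    bits = len(x).bit_length() - 1
    return [x[bit_reverse(i, bits)] for i in range(len(x))]


# For 8 samples the FFT input order becomes x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7).
print(bit_reversed_order(list(range(8))))
```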


Cross correlation is a measure of similarity of two waveforms as a function of a time-lag applied to one of them


Discrete correlation

$$y(n) = \frac{1}{N}\sum_{m=0}^{M} f(m)\, g(n+m)$$

Discrete autocorrelation

$$y(n) = \frac{1}{N}\sum_{m=0}^{M} f(m)\, f(n+m)$$


Convolution of two square pulses: the resulting waveform is a

triangular pulse. One of the functions (in this case g) is first reflected

about τ = 0 and then offset by t, making it g(t − τ). The area under the

resulting product gives the convolution at t. The horizontal axis

is τ for f and g, and t for f*g.


Convolution is a mathematical operation on two functions f and g,

producing a third function that is typically viewed as a modified

version of one of the original functions.


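Discrete convolution

$$y(n) = f(n) * g(n) = \sum_{m=-M}^{M} f(m)\, g(n-m) = \sum_{m=-M}^{M} f(n-m)\, g(m)$$

Circular discrete convolution

$$y(n) = \sum_{m=0}^{M-1}\left(\sum_{k=-N}^{N} f(m + kM)\right) g(n-m)$$

STD – standard deviation

$$S_N = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1}\left(x_i - \bar{x}\right)^2}$$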
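The three operations on this slide can be sketched for short finite sequences as follows. The function names are illustrative assumptions, the sequences are taken to start at index 0, and the circular version assumes both inputs have the same length M.

```python
import math


def convolve(f, g):
    """Linear convolution y(n) = sum_m f(m) g(n - m) of two finite sequences."""
    y = [0] * (len(f) + len(g) - 1)
    for m, fm in enumerate(f):
        for k, gk in enumerate(g):
            y[m + k] += fm * gk
    return y


def circular_convolve(f, g):
    """Circular convolution of two equal-length sequences (indices modulo M)."""
    M = len(f)
    return [sum(f[m] * g[(n - m) % M] for m in range(M)) for n in range(M)]


def std(x):
    """Population standard deviation: sqrt(1/N * sum (x_i - mean)^2)."""
    mean = sum(x) / len(x)
    return math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))


# Two square pulses convolve into a triangular pulse.
print(convolve([1, 1, 1], [1, 1, 1]))
```

The circular form wraps the tail of the linear result back onto the start, which is the behaviour obtained when convolution is computed through the DFT.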

The DSP processor's architecture

Requirement for DSP processors:

1. High speed input data, different interface devices;

2. Input data wide dynamic range;

3. ADD, MULT & SHIFT hardware implementation. Parallel processing;

4. Flexible processing (possibility to “jump” from one process to another);

5. Algorithm’s regularity (Operation “come back”);

DSP processors features

1. Various high-speed interface ports and timers;

2. Parallel access memory architecture;

3. Three mathematical units: ALU, barrel shifter and multiplier with fast MAC operation (MRB = MRB + Rx * Ry);

4. Cycles, branches & interrupt fast handling. Addressing special modes;

5. Circular buffer.
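To show how a circular buffer and the MAC operation work together, here is a minimal software sketch of one FIR step. The class and function names are hypothetical; on a DSP the modulo addressing is done by the address generators in hardware rather than by explicit `%` operations.

```python
class CircularBuffer:
    """Fixed-size delay line; the write index wraps around with modulo addressing."""

    def __init__(self, size):
        self.data = [0.0] * size
        self.head = 0                      # position of the newest sample

    def push(self, sample):
        self.head = (self.head + 1) % len(self.data)
        self.data[self.head] = sample      # overwrite the oldest sample

    def delayed(self, k):
        """Return x(n - k), the sample pushed k steps ago."""
        return self.data[(self.head - k) % len(self.data)]


def fir_step(buf, coeffs, sample):
    """Push one input sample and return one FIR output using MAC operations."""
    buf.push(sample)
    acc = 0.0
    for k, c in enumerate(coeffs):
        acc += c * buf.delayed(k)          # multiply-accumulate
    return acc


buf = CircularBuffer(3)
print([fir_step(buf, [0.5, 0.3, 0.2], s) for s in [1.0, 0.0, 0.0, 0.0, 0.0]])
```

No samples are ever moved in memory: only the index advances, which is why this structure suits per-sample real-time processing.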


Data types & formats.

Data types in DSP processors algorithms:

Integer (loop counters, coefficients and array indices);

Real (input & output data);

Complex (applications in frequency domain);

Logic (bitwise operation).

Data format in DSP processors :

Byte – 8 bit;

Short word – 16 bit;

Normal word – 32 bit;

Instruction word – 48 bit;

Extended normal word – 40 bit;

Long word – 64 bit.


Fixed-Point Design

Digital signal processing algorithms

– Often developed in floating point

– Later mapped into fixed point for digital hardware realization

Fixed-point digital hardware

– Lower area

– Lower power

– Lower per unit production cost

[Figure: design flow from Idea through Floating-Point Algorithm, Range Estimation and Quantization, and Fixed-Point Algorithm (algorithm level) to Code Generation and Target System (implementation level)]

Fixed-Point Representation

Fixed point type

– Wordlength

– Integer wordlength

Quantization modes

– Round

– Truncation

Overflow modes

– Saturation

– Saturation to zero

– Wrap-around

[Figure: bit layout of a fixed-point word: sign bit S followed by data bits X, with the wordlength spanning the whole word and the integer wordlength spanning its integer part]
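The quantization and overflow modes listed above can be compared with a small sketch. It assumes a signed two's complement format in which the integer wordlength includes the sign bit (so wordlength 8 with integer wordlength 1 covers the range [-1, 1)); the function name and mode strings are hypothetical.

```python
import math


def quantize(x, wl, iwl, rounding="round", overflow="saturate"):
    """Quantize x to a signed fixed-point value with wl total bits, iwl integer bits."""
    frac_bits = wl - iwl
    scaled = x * (1 << frac_bits)
    # "truncate" drops the low bits, i.e. rounds toward minus infinity
    raw = math.floor(scaled + 0.5) if rounding == "round" else math.floor(scaled)
    lo, hi = -(1 << (wl - 1)), (1 << (wl - 1)) - 1
    if raw < lo or raw > hi:
        if overflow == "saturate":
            raw = max(lo, min(hi, raw))            # clip to the nearest limit
        elif overflow == "zero":
            raw = 0                                # saturation to zero
        else:
            raw = (raw - lo) % (1 << wl) + lo      # wrap-around
    return raw / (1 << frac_bits)


print(quantize(0.7, 8, 1, "round"), quantize(0.7, 8, 1, "truncate"))
print(quantize(1.5, 8, 1, overflow="saturate"), quantize(1.5, 8, 1, overflow="wrap"))
```

Wrap-around turns a slightly too large positive value into a negative one, which is why saturation is usually preferred for signal data.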

Overflow Handling in Fixed Point Computations

Overflow handling is an important consideration when implementing signal processing algorithms. If overflow is not controlled appropriately it can lead to problems such as detection

errors, or poor quality audio output. Typical digital signal processing CPUs include hardware support for handling overflow. Some RISC processors may include these modes as well.

(In fact I helped define and implement such modes for the 32 bit MIPS processor core used in many Broadcom products). These processors often have a “saturating” mode that sets an

instruction result to a minimum or maximum value on an overflow condition. (The term “saturating” comes from analog electronics, in which an amplifier output will be limited, or

clipped, between fixed values when a large input is applied.) Commonly the CPU will limit the result to a 32 bit twos complement integer (0x7FFFFFFF or 0x80000000). For

unsigned operations, the result would be limited to 0xFFFFFFFF. There are a number of situations in which overflow can occur, and I will discuss some of them below.

Addition and Subtraction

Overflow with twos complement integers occurs when the result of an addition or subtraction is larger than the largest integer that can be represented, or smaller than the smallest integer.

In fixed point representation, the largest or smallest value depends on the format of the number. I will assume Q31 in a 32 bit register for any examples that follow. In this case, a CPU

with saturation arithmetic would set the result to -1 or (just below) +1 on an overflow, corresponding to the integer values 0x80000000 and 0x7FFFFFFF.

Overflow in addition can only occur when the sign of the two numbers being added is the same. Overflow in subtraction can occur only when a negative number is subtracted from a

positive number, or when a positive number is subtracted from a negative number.

Negation

There is one case where negation of a number causes an overflow condition. When the smallest negative number is negated, there is no way to represent the corresponding positive

value in twos complement. For example, the value -1 in Q31 is 0x80000000. When this number is negated (flip the bits and add one) the result is again -1. If the saturation mode is set,

then the CPU will set the result to 0x7FFFFFFF (just less than +1).
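The difference between wrap-around and saturating behaviour for Q31 addition and negation can be sketched as follows. Python integers are unbounded, so the 32-bit register is emulated explicitly; the helper names are hypothetical and do not correspond to any particular CPU instruction.

```python
Q31_MAX = 0x7FFFFFFF          # just below +1.0 in Q31
Q31_MIN = -0x80000000         # exactly -1.0 in Q31 (bit pattern 0x80000000)


def saturate32(value):
    """Clamp an integer to the signed 32-bit range."""
    return max(Q31_MIN, min(Q31_MAX, value))


def wrap32(value):
    """Two's complement wrap-around to 32 bits (no saturation)."""
    return (value + 0x80000000) % 0x100000000 - 0x80000000


def add_q31(a, b, saturating=True):
    return saturate32(a + b) if saturating else wrap32(a + b)


def neg_q31(a, saturating=True):
    return saturate32(-a) if saturating else wrap32(-a)


print(hex(add_q31(Q31_MAX, 1)), add_q31(Q31_MAX, 1, saturating=False))
print(hex(neg_q31(Q31_MIN)), neg_q31(Q31_MIN, saturating=False))
```

Without saturation, negating -1 gives -1 again and adding one LSB to the largest value flips the sign; with saturation both results stay at the nearest representable limit.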

Arithmetic Shift

Overflow can occur when shifting a number left by 1 to n bits. In fixed point computations, left shifting is used to multiply a fixed point value by a power of two, or to change the

format of a number (Q15 to Q31 for example). Again, many CPUs have saturation modes to set the output to the minimum or maximum 32 bit integer (depending on whether the

original number was positive or negative). Furthermore, a common feature is an instruction that counts the number of leading ones or zeros in a number. This helps the programmer

avoid overflow since the number of leading sign bits determines how large a shift can be done without causing overflow.

Overflow will not occur when right shifting a number.

Multiplication

Overflow doesn’t really occur during multiplication if the result register has enough bits (32 bits if two 16 bit numbers are multiplied). But it is partly a matter of interpretation. When

multiplying a fixed point value of -1 by -1 (0x8000 by 0x8000 using Q15 numbers), the result is +1. If the result is interpreted as a Q1.30 number (one integer bit and 30 fractional

bits) then there is no problem. If the result is to be a Q30 number (no integer bits) then an overflow condition has occurred. And if the number was to be converted to Q31 (by shifting

the result left by 1) then an overflow would occur during the left shift. The overall effect would be that -1 times -1 equals -1.

I have used a CPU that handles this special case with saturation hardware. Some CPUs have a multiplication mode that shifts the product left by one bit after a multiply operation. The

reason for doing so is to create a Q31 result when two Q15 numbers are multiplied. Then if a Q15 result is desired, it can be found by storing the upper 16 bits of the result register (if

the register is only 32 bits). The saturating mode automatically sets the result to 0x7FFFFFFF when the number 0x8000 is multiplied by itself, and the “shift left by one” multiplication

mode is enabled.
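The special case of -1 times -1 can be reproduced with a short sketch of a Q15 multiply that shifts the product left by one bit. The function names are hypothetical, and the 32-bit result register is emulated with explicit clamping or wrapping.

```python
def mul_q15_to_q31(a, b, saturating=True):
    """Multiply two Q15 integers and shift left by one to get a Q31 result."""
    product = (a * b) << 1                 # Q30 product shifted into Q31
    if saturating:
        return max(-0x80000000, min(0x7FFFFFFF, product))
    return (product + 0x80000000) % 0x100000000 - 0x80000000   # wrap-around


def q31_to_q15(value):
    """Keep the upper 16 bits of a Q31 value."""
    return value >> 16


minus_one = -0x8000                        # -1.0 in Q15 (bit pattern 0x8000)
print(hex(mul_q15_to_q31(minus_one, minus_one)))
print(mul_q15_to_q31(minus_one, minus_one, saturating=False))
```

With saturation the product is clipped to just below +1; without it the result wraps to the bit pattern of -1.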

A very often used operation in DSP algorithms is the “multiply accumulate” or “MAC”, where a series of numbers is multiplied and added to a running sum. I would recommend not

using the “left shift by one” mode if possible when doing MACs, since this only increases the chance for overflow. A better technique is to keep the result as Q1.30, and then handle

overflow if converting the final result to Q31 or Q15 (or whatever). This is also a good technique to use on CPUs without saturation modes, since the number of overflow checks can

be greatly reduced in some cases.

Division

Overflow in division can occur when the result would have more bits than was calculated. For example, if the magnitude of the numerator is several times larger than that of the

denominator, then the result must have enough bits to represent numbers larger than one. Overflow can be avoided by carefully considering the range of numbers being operated on,

and calculating enough bits for the result. I have not seen a CPU that implements a saturation mode for division.


Optimum Wordlength

Longer wordlength

– May improve application performance

– Increases hardware cost

Shorter wordlength

– May increase quantization errors and overflows

– Reduces hardware cost

Optimum wordlength

– Maximize application performance or minimize quantization error

– Minimize hardware cost

[Figure: cost c(w) and distortion d(w) [1/performance] plotted against wordlength (w); the optimum wordlength is marked]

Filter Implementation

Finite word-length effects (fixed point implementation)

• Coefficient quantization

• Overflow & quantization in arithmetic operations

• scaling to prevent overflow

• quantization noise statistical modeling

• limit cycle oscillations


Coefficient Quantization

The coefficient quantization problem :

Filter design in Matlab (e.g.) provides filter coefficients to 15 decimal digits (such that the filter meets specifications).

For implementation, we need to quantize the coefficients to the word length used for the implementation.

As a result, the implemented filter may fail to meet specifications… ??

PS: In present-day signal processors, this has become less of a problem (e.g. with 16 bits (=4 decimal digits) or 24 bits (=7 decimal digits) precision). In hardware design, with tight speed requirements, this is still a relevant problem.

Coefficient quantization effect on pole locations :

1. tightly spaced poles (e.g. for narrow band filters) imply high sensitivity of pole locations to coefficient quantization

2. hence preference for low-order systems (parallel/cascade)

Example: Implementation of a 12th-order band-pass IIR filter

Cascade structure with 16-bit coeff. Direct form with 16-bit coeff.

Coefficient quantization effect on pole locations :

example : 2nd-order system (e.g. for cascade realization)

$$H_i(z) = \frac{1 + \alpha_i z^{-1} + \beta_i z^{-2}}{1 + \gamma_i z^{-1} + \delta_i z^{-2}}$$

example (continued) :

with 5 bits per coefficient, all possible pole positions are...

[Figure: permissible pole positions in the z-plane; both axes run from -1.5 to 1.5]

% gamma_i, delta_i: denominator coefficients of the 2nd-order section
for delta_i = -1:0.0625:1
  for gamma_i = -2:0.1250:2
    plot(roots([1 gamma_i delta_i]))
  end
end

Low density of permissible pole locations at z=1, z=-1, hence problem for narrow-band LP and HP filters
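The uneven density of pole positions can be checked numerically. The sketch below enumerates the complex pole pairs of the denominator z^2 + gamma*z + delta on the same coefficient grid as the loop above and counts how many fall near z = 1 and near z = j. The function names and the radius 0.2 are illustrative assumptions.

```python
import cmath


def complex_poles(step_delta=0.0625, step_gamma=0.125):
    """Upper-half-plane poles of z^2 + gamma*z + delta on the quantized grid."""
    poles = []
    n_delta = round(1 / step_delta)
    n_gamma = round(2 / step_gamma)
    for i in range(-n_delta, n_delta + 1):
        delta = i * step_delta
        for j in range(-n_gamma, n_gamma + 1):
            gamma = j * step_gamma
            disc = gamma * gamma - 4 * delta
            if disc < 0:                       # complex-conjugate pole pair
                poles.append((-gamma + cmath.sqrt(disc)) / 2)
    return poles


def count_near(poles, centre, radius=0.2):
    """Number of poles closer than `radius` to the point `centre`."""
    return sum(1 for p in poles if abs(p - centre) < radius)


poles = complex_poles()
print(count_near(poles, 1), count_near(poles, 1j))
```

The count near z = 1 comes out much lower than the count near z = j, which is the reason narrow-band low-pass filters are hard to realize with coarsely quantized direct-form coefficients.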

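example (continued) :

possible remedy: `coupled realization’

poles are $\alpha \pm j\beta$, where $\alpha$, $\beta$ are realized/quantized

hence permissible pole locations are (5 bits)

[Figure: permissible pole positions in the z-plane for the coupled realization; both axes run from -1.5 to 1.5]

[Figure: coupled realization block diagram with input u[k] and output y[k]]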


Quantization of an FIR filter

Transfer function ΔH(z)

The effect of coefficient quantization on linear phase


FIR filter example

Passband attenuation 0.01, Radial frequency (0,0.4)

Stopband attenuation 0.001, Radial frequency (0.4, )


FIR filter example – 16bits


FIR filter example - 8bits


Arithmetic Operations

Finite word-length effects in arithmetic operations:

In linear filters, we have to consider additions & multiplications.

Addition:

if two B-bit numbers are added, the result has (B+1) bits.

Multiplication:

if a B1-bit number is multiplied by a B2-bit number, the result has (B1+B2-1) bits.

For instance, two B-bit numbers yield a (2B-1)-bit product.

Typically (especially so in an IIR (feedback) filter), the result of an addition/multiplication has to be represented again as a B’-bit number (e.g. B’=B). Hence we have to get rid of either most significant bits or least significant bits…


Arithmetic Operations

Option-1: Most significant bits

If the result is known to be upper bounded so that the most significant

bit(s) is(are) always redundant, it(they) can be dropped, without loss of

accuracy.

This implies we have to monitor potential overflow, and introduce scaling

strategy to avoid overflow.

Option-2 : Least significant bits

Rounding/truncation/… to B’ bits introduces quantization noise.

The effect of quantization noise is usually analyzed in a statistical manner.

Quantization, however, is a deterministic non-linear effect, which may give

rise to limit cycle oscillations.


Scaling

The scaling problem:

Finite word-length implementation implies maximum

representable number. Whenever a signal (output or

internal) exceeds this value, overflow occurs.

Digital overflow may lead (e.g. in 2’s-complement

arithmetic) to polarity reversal (instead of saturation

such as in analog circuits), hence may be very harmful.

Avoid overflow through proper signal scaling

Scaled transfer function may be c*H(z) instead of H(z)

(hence need proper tracing of scaling factors)


Scaling

Time domain scaling:

Assume input signal is bounded in magnitude

$$|u[k]| \le u_{\max}$$

(i.e. u-max is the largest number that can be represented in the `words’ reserved for the input signal)

Then output signal is bounded by

$$|y[k]| = \left|\sum_{i=0}^{\infty} h[i]\,u[k-i]\right| \le \sum_{i=0}^{\infty} |h[i]|\,|u[k-i]| \le u_{\max}\sum_{i=0}^{\infty} |h[i]| = u_{\max}\,\|h\|_1$$

To satisfy

$$|y[k]| \le y_{\max}$$

(i.e. y-max is the largest number that can be represented in the `words’ reserved for the output signal)

we have to scale H(z) to c.H(z), with

$$c = \frac{y_{\max}}{u_{\max}\,\|h\|_1}$$

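Scaling

Example:

[Figure: first-order recursive filter with input u[k], output y[k], an adder and a feedback multiplier 0.99]

$$H(z) = \frac{1}{1 - 0.99\,z^{-1}} \;\Rightarrow\; \|h\|_1 = 1 + 0.99 + 0.99^2 + \ldots = \frac{1}{1 - 0.99} = 100$$

assume u[k] comes from 12-bit A/D-converter

assume we use 16-bit arithmetic for y[k] & multiplier

$$c = \frac{2^{16}}{2^{12}\cdot\|h\|_1} = 0.16 \approx 2^{-3}$$

hence inputs u[k] have to be shifted by 3 bits to the right before entering the filter (=loss of accuracy!)

[Figure: the same filter with a shift block inserted at the input u[k]]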
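The numbers in this example can be reproduced with a short sketch. The function names are hypothetical, and the infinite impulse response is truncated to 5000 samples, which is an assumption that is harmless here because 0.99^5000 is negligible.

```python
import math


def impulse_response(a, length):
    """First `length` samples of h[i] = a**i for H(z) = 1 / (1 - a z^-1)."""
    return [a ** i for i in range(length)]


def l1_scaling(h, u_max, y_max):
    """Scaling factor c = y_max / (u_max * ||h||_1)."""
    return y_max / (u_max * sum(abs(v) for v in h))


h = impulse_response(0.99, 5000)           # truncated; the tail is negligible
c = l1_scaling(h, u_max=2 ** 12, y_max=2 ** 16)
shift = math.ceil(math.log2(1 / c))        # smallest right shift with 2**-shift <= c
print(round(c, 4), shift)
```

Rounding the shift up guarantees that the power-of-two scale factor is never larger than c, so the overflow bound still holds.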

Scaling

L2-scaling: (`scaling in L2 sense’)

Time-domain scaling is simple & guarantees that overflow will never occur, but is often over-conservative (=too small c)

$$c = \frac{y_{\max}}{u_{\max}\,\|h\|_1}$$

If an `energy upper bound’ for the input signal is known

$$\sqrt{\sum_{k=0}^{\infty} u[k]^2} \le E_{U,\max}$$

then L2-scaling uses

$$c = \frac{y_{\max}}{E_{U,\max}\,\|h\|_2}$$

where

$$\|h\|_2 = \sqrt{\sum_{i=0}^{\infty} h[i]^2}$$

…is an L2-norm (this leads to larger c)


Scaling

So far considered scaling of H(z), i.e. transfer function

from u[k] to y[k]. In fact we also need to consider

overflow and scaling of each internal signal, i.e. scaling of

transfer function from u[k] to each and every internal

signal !

This requires quite some thinking….

(but doable)

[Figure: 4th-order direct form realization with feedback coefficients -a1 … -a4, feed-forward coefficients b0 … b4, internal signals x1[k] … x4[k] and output y[k]]

Scaling

Something that may help: If 2’s-complement arithmetic is used, and

if the sum of K numbers (K>2) is guaranteed not to overflow, then

overflows in partial sums cancel out and do not affect the final

result (similar to `modulo arithmetic’).

Example:

if x1+x2+x3+x4 is guaranteed not to

overflow, then if in (((x1+x2)+x3)+x4)

the sum (x1+x2) overflows, this overflow

can be ignored, without affecting the

final result.

As a result (1), in a direct form realization,

eventually only 2 signals have to be

considered in view of scaling :

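[Figure: 4th-order direct form realization with feedback coefficients -a1 … -a4, feed-forward coefficients b0 … b4 and internal signals x1[k] … x4[k]; the two adder outputs that need scaling are marked]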
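The cancellation of intermediate overflows in two's complement arithmetic can be demonstrated with a small sketch that emulates a 16-bit accumulator. The helper names and the four example terms are hypothetical.

```python
def wrap16(value):
    """Two's complement wrap-around to a signed 16-bit integer."""
    return (value + 0x8000) % 0x10000 - 0x8000


def wrapped_sum(values):
    """Accumulate with wrap-around after every addition."""
    acc = 0
    for v in values:
        acc = wrap16(acc + v)
    return acc


terms = [30000, 20000, -25000, -15000]     # true sum is 10000, which fits in 16 bits
partial = wrap16(terms[0] + terms[1])      # 50000 overflows and wraps to a negative value
print(partial, wrapped_sum(terms))
```

The first partial sum is wrong, yet the final result is correct because all additions are carried out modulo 2^16. This only works with wrap-around arithmetic: saturating the partial sum would destroy the cancellation.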


Scaling

As a result (2), in a transposed direct form realization, eventually

only 1 signal has to be considered in view of scaling……….:

hence preference for transposed direct form over direct form.

[Figure: 4th-order transposed direct form realization with input u[k], coefficients -a1 … -a4 and b0 … b4, internal signals x1[k] … x4[k] and output y[k]]

Quantization Noise

The quantization noise problem :

If two B-bit numbers are added (or multiplied), the result is a B+1 (or 2B-1) bit number. Rounding/truncation/… to (again) B bits, to get rid of the least significant bit(s) introduces quantization noise.

The effect of quantization noise is usually analyzed in a statistical manner.

Quantization, however, is a deterministic non-linear effect, which may give rise to limit cycle oscillations.

PS: Will focus on multiplications only. Assume additions are implemented with sufficient number of output bits, or are properly scaled, or…


Quantization Noise

Quantization mechanisms:

            Rounding        Truncation                Magnitude Truncation
mean        0               (-0.5)LSB (biased!)       0
variance    (1/12)LSB^2     (1/12)LSB^2               (1/3)LSB^2

[Figure: for each mechanism, the output versus input characteristic and the probability density of the error]
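The mean and variance figures for rounding and truncation can be checked numerically. The sketch below evaluates the error on a fine, evenly spaced grid of inputs with LSB = 1, which stands in for the uniform-distribution assumption; the function name and grid parameters are illustrative choices, and truncation is modelled as rounding toward minus infinity (dropping low bits in two's complement).

```python
import math


def error_stats(quantizer, points=100_000, span=8.0):
    """Mean and variance of quantizer(x) - x for x on a fine uniform grid (LSB = 1)."""
    errors = []
    for k in range(points):
        x = -span + 2 * span * (k + 0.5) / points
        errors.append(quantizer(x) - x)
    mean = sum(errors) / points
    var = sum((e - mean) ** 2 for e in errors) / points
    return mean, var


def rounding(x):
    return math.floor(x + 0.5)


def truncation(x):
    return math.floor(x)          # two's complement truncation


print(error_stats(rounding))
print(error_stats(truncation))
```

Both mechanisms give a variance close to 1/12, but truncation shifts the mean by half an LSB, which appears as a DC offset at the filter output.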


Quantization Noise

Statistical analysis based on the following assumptions :

- each quantization error is random, with uniform probability

distribution function (see previous slide)

- quantization errors at the output of a given multiplier are

uncorrelated/independent (=white noise assumption)

- quantization errors at the outputs of different multipliers are

uncorrelated/independent (=independent sources assumption)

One noise source is inserted for each multiplier.

Since the filter is linear, the output noise generated by each noise source is added to the output signal.


Quantization Noise

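The effect on the output signal of noise generated at a particular point in the filter is computed as follows:

noise is e[k]. noise mean & variance are $\mu_e, \sigma_e^2$

transfer function from e[k] to filter output is G(z), g[k] (‘noise transfer function’)

Noise mean at the output is

$$\mu_e \cdot (\text{DC gain}) = \mu_e \cdot G(z)\big|_{z=1}$$

Noise variance at the output is (remember L2-norm!)

$$\sigma_e^2 \cdot (\text{`noise gain'}) = \sigma_e^2 \cdot \left(\frac{1}{2\pi}\int_{-\pi}^{\pi} \left|G(e^{j\omega})\right|^2 d\omega\right) = \sigma_e^2 \cdot \sum_{k=0}^{\infty} g[k]^2 = \sigma_e^2 \cdot \|g\|_2^2$$

Repeat procedure for each noise source…

[Figure: first-order recursive filter with input u[k], output y[k], feedback multiplier -.99 and a noise source e[k] added after the multiplier]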
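For a first-order section the noise gain has a simple closed form, which the sketch below compares against the summed squared impulse response. It assumes a noise transfer function G(z) = 1 / (1 - a z^-1) with a = -0.99 and a hypothetical 16-bit word; the function names are illustrative.

```python
def noise_gain_first_order(a, terms=5000):
    """Sum of g[k]^2 for g[k] = a**k, i.e. G(z) = 1 / (1 - a z^-1), |a| < 1."""
    return sum((a ** k) ** 2 for k in range(terms))


def output_noise(mu_e, var_e, a):
    """Output noise mean (mu_e * DC gain) and variance (var_e * noise gain)."""
    dc_gain = 1.0 / (1.0 - a)               # G(z) evaluated at z = 1
    return mu_e * dc_gain, var_e * noise_gain_first_order(a)


lsb = 2.0 ** -15                            # hypothetical 16-bit word
mean_out, var_out = output_noise(0.0, lsb ** 2 / 12, a=-0.99)
print(noise_gain_first_order(-0.99), 1 / (1 - 0.99 ** 2))
```

A pole this close to the unit circle amplifies the rounding noise variance by a factor of about 50, which illustrates why narrow-band recursive filters are sensitive to quantization noise.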


Quantization Noise

In a transposed direct realization all `noise transfer

functions’ are equal (up to delay), hence all noise

sources can be lumped into one equivalent source

etc...

[Figure: 4th-order transposed direct form realization (coefficients -a1 … -a4 and b0 … b4, internal signals x1[k] … x4[k]) with a single equivalent noise source e[k]]

Quantization Noise

In a direct realization all noise sources can be lumped into

two equivalent sources

etc...

[Figure: 4th-order direct form realization (coefficients -a1 … -a4 and b0 … b4, internal signals x1[k] … x4[k]) with two equivalent noise sources e1[k] and e2[k]]

Quantization Noise

PS: Quantization noise of A/D-converters can be

modeled/analyzed in a similar fashion.

Noise transfer function is filter transfer function H(z).


Limit Cycles

Statistical analysis is simple/convenient, but quantization

is truly a non-linear effect, and should be analyzed as a

deterministic process.

Though very difficult, such analysis may reveal odd

behavior:

Example: y[k] = -0.625.y[k-1]+u[k]

4-bit rounding arithmetic

input u[k]=0, y[0]=3/8

output y[k] = 3/8, -1/4, 1/8, -1/8, 1/8, -1/8, 1/8, -1/8,

1/8,..

Oscillations in the absence of input (u[k]=0) are called

`zero-input limit cycle oscillations’.
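The rounding example can be reproduced exactly with rational arithmetic. The sketch below assumes that "4-bit rounding" means rounding to the nearest multiple of 1/8, which matches the values listed; the function names are hypothetical.

```python
from fractions import Fraction
import math


def round_to_lsb(value, lsb):
    """Round to the nearest multiple of lsb (no exact ties occur in this example)."""
    return math.floor(value / lsb + Fraction(1, 2)) * lsb


def zero_input_response(a, y0, steps, lsb=Fraction(1, 8)):
    """Iterate y[k] = round(a * y[k-1]) with u[k] = 0, starting from y[0] = y0."""
    y = [y0]
    for _ in range(steps):
        y.append(round_to_lsb(a * y[-1], lsb))
    return y


print(zero_input_response(Fraction(-5, 8), Fraction(3, 8), 7))
```

With infinite precision the output would decay to zero; with rounding it gets stuck alternating between +1/8 and -1/8.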

Copyright Marc Moonen [1]


Limit Cycles

Example: y[k] = -0.625.y[k-1]+u[k]

4-bit truncation (instead of rounding)

input u[k]=0, y[0]=3/8

output y[k] = 3/8, -1/4, 1/8, 0, 0, 0,.. (no limit cycle!)

Example: y[k] = 0.625.y[k-1]+u[k]

4-bit rounding

input u[k]=0, y[0]=3/8

output y[k] = 3/8, 1/4, 1/8, 1/8, 1/8, 1/8,..

Example: y[k] = 0.625.y[k-1]+u[k]

4-bit truncation

input u[k]=0, y[0]=-3/8

output y[k] = -3/8, -1/4, -1/8, -1/8, -1/8, -1/8,..

Conclusion: weird, weird, weird,… ! Copyright Marc Moonen [1]


Limit Cycles

Limit cycle oscillations are clearly unwanted (e.g. may be

audible in speech/audio applications)

Limit cycle oscillations can only appear if the filter has

feedback. Hence FIR filters cannot have limit cycle

oscillations.

Mathematical analysis is very difficult.

Truncation often helps to avoid limit cycles (e.g. magnitude

truncation, where absolute value of quantizer output is

never larger than absolute value of quantizer input

(`passive quantizer’)).

Some filter structures can be made limit cycle free, e.g.

coupled realization, orthogonal filters (see below).

The DSP processor's architecture.

DSP processors with fixed and floating point.

Fixed versus Floating:

Fixed point arithmetic operations are simpler to realize in hardware;

A floating point DSP processor has more data types and commands;

Floating point advantages:

Higher accuracy;

Wide dynamic range;

Far fewer problems with data overflow;

Friendly for C compilers.

Fixed point advantages:

Cheaper;

More compact.

Data types & formats.

Dynamic range:

$$DynR = \frac{\text{max value}}{\text{min value} \ne 0}$$

or in [dB]:

$$DynR_{dB} = 20\log_{10}\frac{\text{max value}}{\text{min value} \ne 0}$$

maximum linearity error (b – data width):

$$2^{-b}$$

max. precision [bits]:

$$\log_2\frac{\text{max value}}{\text{max quantization error}}$$


C vs. Assembly

DSP programs are different from traditional software tasks in two important respects.

– First, the programs are usually much shorter, say, one hundred lines versus ten thousand lines.

– Second, the execution speed is often a critical part of the application.

If assembly is used at all, it is restricted to short subroutines that must run with the utmost speed.


C vs. Assembly

Programs in C are more flexible and quick to develop.

Programs in assembly often perform better: they run faster and use less memory, resulting in lower cost.


C vs. Assembly

Which language is best for your application?

– If you need flexibility and fast development, choose C.

– Use assembly if you need the best possible performance.

How complicated is the program?

– If it is large and intricate, use C.

– If it is small and simple, assembly may be a good choice.

Are you pushing the maximum speed of the DSP?

– If so, assembly will give you the last drop of performance from the device.

– For less demanding applications, assembly has little advantage, and you should consider using C.

How many programmers will be working together?

– If the project is large enough for more than one programmer, lean toward C and use in-line assembly only for time critical segments.

Which is more important, product cost or development cost?

– If it is product cost, choose assembly;

– If it is development cost, choose C.

What is your background?

– If you are experienced in assembly (on other microprocessors), choose assembly for your DSP.

– If your previous work is in C, choose C for your DSP.



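“Traditional” von Neumann architecture

[Figure: CPU connected to a single memory (data & instruction) by one address bus and one data bus]

Harvard architecture

[Figure: CPU connected to a program memory (instruction only) by the PM address bus and PM data bus, and to a data memory (data only) by the DM address bus and DM data bus]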

Super Harvard architecture

[Figure: CPU with an instruction cache, connected to a program memory (instruction only) by the PM address bus and PM data bus and to a data memory (data only) by the DM address bus and DM data bus; an I/O controller moves data to and from the data memory]

This is the SHARC DSP processor structure.



SHARC DSP processor structure



The ADSP-21160 hardware structure.

[Figure: ADSP-21160 block diagram.

Core processor: program sequencer, timer, instruction cache (32 x 48-bit), DAG1 and DAG2 (8 x 4 x 32 each), data register file (16 x 40-bit), multiplier, barrel shifter, ALU, bus connect (PX).

Dual-ported SRAM: two independent dual-ported memory blocks (block 0 and block 1), each with a processor port and an I/O port (address and data).

Buses: PM address bus (32), DM address bus (32), PM data bus (16/32/40/48/64), DM data bus (32/40/64).

External port: address bus mux, data bus mux, multiprocessor interface, host port.

I/O processor: IOP registers, DMA controller, serial ports (2), link ports (6).

JTAG test & emulation.]


ADSP-21160 Features

100 MHz, 600 MFLOPS SIMD core

1024-point complex FFT benchmark: 90 µs

4 Mbits on-chip SRAM

14 zero-overhead DMA channels

Sustained 700 Mbyte/s over IOP bus

Two 50 Mbit/s synchronous serial ports

Six 100 Mbyte/s link ports

64-bit synchronous external port

Cluster multiprocessing support

ד''בס

11/15/2014 72

The methods for computer performance measurement

Peak (technical) performance of a microprocessor: the maximum theoretical speed of the microprocessor under ideal conditions. It is defined by the number of computational operations performed per unit of time.

Real (sustained) performance of a microprocessor: the speed the microprocessor actually achieves under real conditions. It is measured by running popular benchmark programs (such as FIR, IIR or FFT).
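The two definitions can be illustrated with a small C sketch. It is only an illustration under stated assumptions: the figure of 6 floating-point operations per cycle (two processing elements, each doing a multiply, an add and a subtract) is inferred from the ADSP-21160 feature list, and the 5·N·log2(N) operation count for a complex FFT is a conventional estimate, not a number taken from these slides. The function names are hypothetical.

```c
#include <assert.h>

/* Peak rate in MFLOPS: clock frequency (MHz) times the number of
   floating-point operations the core can issue per cycle. */
double peak_mflops(double clock_mhz, int flops_per_cycle)
{
    assert(clock_mhz > 0.0 && flops_per_cycle > 0);
    return clock_mhz * flops_per_cycle;
}

/* Sustained rate in MFLOPS for one benchmark run: operations divided by
   time in microseconds (operations per microsecond equals MFLOPS). */
double sustained_mflops(double flop_count, double time_us)
{
    assert(time_us > 0.0);
    return flop_count / time_us;
}

/* Assumed operation count for an N-point complex radix-2 FFT:
   5 * N * log2(N). N must be a power of two. */
double fft_flop_estimate(unsigned n)
{
    unsigned log2n = 0;
    assert(n >= 2 && (n & (n - 1)) == 0);
    while ((1u << log2n) < n) {
        log2n++;
    }
    return 5.0 * n * log2n;
}
```

With a 100 MHz clock and 6 operations per cycle, peak_mflops gives 600. Feeding the 1024-point FFT estimate and the quoted 90 µs into sustained_mflops gives a somewhat lower figure, which is the usual relation between peak and sustained performance.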





Pipelined instruction execution:

Instruction fetching (a);

Decoding (b);

Execution (c).

n-1 operation   a b c
n operation       a b c
n+1 operation       a b c
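The benefit of overlapping the three stages can be counted with a short C sketch. It assumes an ideal pipeline with no stalls, which is a simplification; the function names are hypothetical.

```c
#include <assert.h>

/* Cycles needed when each instruction must finish all stages
   before the next one starts (no overlap). */
unsigned long sequential_cycles(unsigned long instructions, unsigned stages)
{
    assert(stages > 0);
    return instructions * stages;
}

/* Cycles needed in an ideal pipeline: the first instruction takes
   `stages` cycles, every following one completes one cycle later. */
unsigned long pipelined_cycles(unsigned long instructions, unsigned stages)
{
    assert(stages > 0);
    if (instructions == 0) {
        return 0;
    }
    return stages + instructions - 1;
}
```

For the three operations n-1, n and n+1 shown in the diagram, the pipelined count is 5 cycles instead of 9.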


SHARC instruction set

SHARC programming model.

SHARC assembly language.

SHARC data operations.

SHARC flow of control.


SHARC programming model

Register files:

R0-R15 (aliased as F0-F15 for floating point)

Status registers.

Loop registers.

Data address generator registers.


SHARC assembly language

Algebraic notation terminated by semicolon:

R1=DM(M0,I0), R2=PM(M8,I8); // comment

label: R3=R1+R2;

In the first line, R1=DM(M0,I0) is a data memory access and R2=PM(M8,I8) is a program memory access.


Simple ALU Instructions

Fixed-point                     Floating-point
Rn = Rx + Ry                    Fn = Fx + Fy
Rn = Rx - Ry                    Fn = Fx - Fy
Rn = Rx + Ry + CI (Carry In)    Fn = ABS(Fx + Fy)
Rn = Rx - Ry + CI - 1           Fn = ABS(Fx - Fy)
Rn = (Rx + Ry)/2                Fn = (Fx + Fy)/2
COMP(Rx, Ry)                    COMP(Fx, Fy)
Rn = Rx + CI - 1                Fn = -Fx
Rn = Rx + 1                     Fn = ABS Fx
Rn = Rx - 1                     Fn = PASS Fx
Rn = -Rx                        Fn = RND Fx
Rn = ABS Rx                     Fn = SCALB Fx BY Ry
Rn = PASS Rx                    Rn = MANT Fx
Rn = Rx AND Ry                  Rn = LOGB Fx
Rn = Rx OR Ry                   Rn = FIX Fx BY Ry
Rn = NOT Rx                     Fn = FLOAT Rx BY Ry
Rn = MIN(Rx, Ry)                Rn = TRUNC Fx
Rn = MAX(Rx, Ry)                Fn = RECIPS Fx


MAC instructions - mainly INTEGER

Multiply and Accumulate

Rn = Rx * Ry              MRF = Rx * Ry
MRB = Rx * Ry             Rn = MRF + Rx * Ry
Rn = MRB + Rx * Ry        MRF = MRF + Rx * Ry
MRB = MRB + Rx * Ry       Rn = MRF - Rx * Ry
Rn = MRB - Rx * Ry        MRF = MRF - Rx * Ry
MRB = MRB - Rx * Ry       Rn = SAT MRF
Rn = SAT MRB              MRF = SAT MRF
MRB = SAT MRB             Rn = RND MRF
Rn = RND MRB              MRF = RND MRF
MRB = RND MRB             MR = Rn
Rn = MR

Floating-point: Fn = Fx * Fy


Shifter Instructions - mainly integer

FPACK is a cast and means (32bit -> 16bit) Fx

FUNPACK is a cast and means (16bit -> 32bit) Rx

BUT WITH A LOT OF HIDDEN STUFF TOO!

Rn = LSHIFT Rx BY Ry/<data8>

Rn = Rn OR LSHIFT Rx BY Ry/<data8>

Rn = ASHIFT Rx BY Ry/<data8>

Rn = ROT Rx BY Ry/<data8>

Rn = BCLR Rx BY Ry/<data8>

Rn = BSET Rx BY Ry/<data8>

Rn = BTGL Rx BY Ry/<data8>

BTST Rx BY Ry/<data8>

Rn = Rn OR FDEP Rx BY Ry/<bit6>:<len6> (SE)

Rn = FEXT Rx BY Ry/<bit6>:<len6> (SE)

Rn = EXP Rx (EX)

Rn = LEFTZ Rx

Rn = LEFTO Rx

Rn = FPACK Fx

Fn = FUNPACK Rx
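The field deposit and field extract operations in the list take a bit position (bit6) and a field length (len6). A rough C model of the idea, assuming a plain 32-bit word and ignoring the (SE) sign-extension option and the 40-bit register details of the real shifter, could look like this. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `len` bits set (len may be 0..32). */
static uint32_t field_mask(unsigned len)
{
    return (len >= 32) ? 0xFFFFFFFFu : ((1u << len) - 1u);
}

/* Extract-like operation: take `len` bits of x starting at bit
   position `bit` and return them right-justified, zero-filled. */
uint32_t field_extract(uint32_t x, unsigned bit, unsigned len)
{
    assert(bit < 32 && len <= 32 - bit);
    return (x >> bit) & field_mask(len);
}

/* Deposit-like operation: take the low `len` bits of x and place
   them at bit position `bit`; all other result bits are zero. */
uint32_t field_deposit(uint32_t x, unsigned bit, unsigned len)
{
    assert(bit < 32 && len <= 32 - bit);
    return (x & field_mask(len)) << bit;
}
```

The "Rn = Rn OR FDEP ..." form then corresponds to OR-ing the deposited field into an existing word, for example `word | field_deposit(value, 16, 8)`.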


Flag operations

ALU operations set: AZ (zero), AN (negative), AV (overflow), AC (fixed-point carry), AI (floating-point invalid), AF (last ALU operation was floating-point).

Multiplier operations set: MN (negative), MV (overflow), MU (floating-point underflow), MI (floating-point invalid).

Shifter operations set: SV (overflow), SZ (zero), SS (sign).

Fixed-point: -1 + 1 = 0:

AZ = 1, AN = 0, AV = 0, AC = 1, AI = 0, AF = 0.

Fixed-point: -2*3=-6:

MN = 1, MV = 0, MU = 1, MI = 0.

LSHIFT 0x7fffffff BY 3: SV=1, SZ=0, SS=0.
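The fixed-point ALU example (-1 + 1 = 0) can be reproduced with a small C model of a 32-bit add. This is a sketch of how such flags are conventionally derived, not a description of the SHARC hardware; the type and function names are hypothetical.

```c
#include <stdint.h>

/* Result of a modeled 32-bit fixed-point add with its status flags. */
typedef struct {
    uint32_t result;
    int az;  /* result is zero          */
    int an;  /* result is negative      */
    int av;  /* signed overflow         */
    int ac;  /* carry out of bit 31     */
} AluResult;

AluResult alu_add(int32_t x, int32_t y)
{
    uint32_t ux = (uint32_t)x;
    uint32_t uy = (uint32_t)y;
    uint64_t wide = (uint64_t)ux + uy;   /* keep the carry in bit 32 */
    uint32_t r = (uint32_t)wide;
    AluResult out;

    out.result = r;
    out.az = (r == 0);
    out.an = (int)((r >> 31) & 1u);
    out.ac = (int)((wide >> 32) & 1u);
    /* Overflow: operands have the same sign, result has the other sign. */
    out.av = (int)(((~(ux ^ uy)) & (ux ^ r)) >> 31);
    return out;
}
```

Calling alu_add(-1, 1) yields AZ = 1, AN = 0, AV = 0 and AC = 1, which matches the flag values listed for the fixed-point example.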


Multifunction computations

The modified Harvard architecture allows multiple data fetches in a single instruction.

The most common instructions allow a memory reference and a computation to be performed at the same time.

Memory references can be done two at a time in many instructions, with each reference using a DAG.

Some computations can be issued in parallel:

– dual add-subtract;

– fixed-point multiply/accumulate and add, subtract;

– floating-point multiply and ALU operation.


SHARC DSP Architecture

The machine supports both memory parallelism and operation parallelism.

This reduces the number of instructions required for common operations.

For example, the basic operation in a dot product loop can be performed in one cycle that performs two fetches, a multiplication, and an addition.


Example Multi-Function Instruction

In a single cycle the SHARC performs:

1 (2) Multiply

1 (2) Addition

1 (2) Subtraction

1 (2) Memory Read

1 (2) Memory Write

2 Address Pointer Updates

Plus the I/O Processor Performs:

Active Serial Port Channels (2 Transmit, 2 Receive)

Active Link Ports (6)

Memory DMA

2 DMA Pointer Updates

f11=f1*f7, f3=f9+f14, f9=f9-f14, dm(i2,m0)=f13, f7=pm(i8,m8);


Parallelism

Restrictions apply to the sources of the operands when operations are combined.

The operands going to the multiplier must come from R0 through R7 (or, in the case of floating-point operands, F0 to F7), with one input coming from R0-R3/F0-F3 and the other from R4-R7/F4-F7.

The ALU operands must come from R8-R15/F8-F15, with one operand coming from R8-R11/F8-F11 and the other from R12-R15/F12-F15.

For example, the instruction R6 = R0 * R4, R9 = R8 + R12, R10 = R8 - R12; performs three operations.
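The operand restrictions above amount to four register-range checks. A minimal C sketch that encodes them, with register numbers 0 to 15 standing for R0-R15 or F0-F15 (the function name is hypothetical):

```c
#include <assert.h>

/* Returns 1 if the source registers of a combined multiply + ALU
   instruction satisfy the stated rules, 0 otherwise.
   mul_x, mul_y: multiplier inputs; alu_x, alu_y: ALU inputs. */
int multifunction_operands_ok(int mul_x, int mul_y, int alu_x, int alu_y)
{
    assert(mul_x >= 0 && mul_x <= 15 && mul_y >= 0 && mul_y <= 15);
    assert(alu_x >= 0 && alu_x <= 15 && alu_y >= 0 && alu_y <= 15);

    return (mul_x >= 0  && mul_x <= 3)     /* R0-R3   / F0-F3   */
        && (mul_y >= 4  && mul_y <= 7)     /* R4-R7   / F4-F7   */
        && (alu_x >= 8  && alu_x <= 11)    /* R8-R11  / F8-F11  */
        && (alu_y >= 12 && alu_y <= 15);   /* R12-R15 / F12-F15 */
}
```

The instruction R6 = R0 * R4, R9 = R8 + R12, R10 = R8 - R12 passes this check with the arguments (0, 4, 8, 12), while a multiply of R0 by R1 would be rejected.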


SHARC load/store

Load/store architecture: no memory-direct operations.

Two data address generators (DAGs): one for data memory, one for program memory.

Must set up DAG registers to control loads/stores.

Provide indexed, modulo, and bit-reverse indexing.

– Bit-reversal addressing can be performed only in I0 and I8, as controlled by the BR0 and BR8 bits in the MODE1 register.
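Bit-reverse indexing is mainly used to reorder FFT data. What the hardware does on the address lines can be mimicked in software by reversing the low bits of an index. The following C sketch shows that general index permutation only; it does not model the BR0/BR8 mode bits, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the lowest `bits` bits of `index`.
   Example with bits = 3: 001 -> 100, 011 -> 110. */
uint32_t bit_reverse(uint32_t index, unsigned bits)
{
    uint32_t reversed = 0;
    assert(bits >= 1 && bits <= 32);
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}
```

For an 8-point FFT (3 index bits) the access order 0, 1, 2, ..., 7 becomes 0, 4, 2, 6, 1, 5, 3, 7.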


BASIC addressing

Immediate value: r0 = DM(0x20000000);

Direct load: r0 = DM(_a); // Loads contents of _a

Direct store: DM(_a)= r0; // Stores R0 at _a

Base-Plus-Offset

R0 = DM(M1, I0); // Loads from location I0 + M1

//M and I are registers from the DAG register file




Circular buffer

A circular buffer is an array of n elements; when the (n+1)th element is referenced, the reference goes to buffer location 0, wrapping around from the end to the beginning of the buffer.

The L register is set to a positive, nonzero value (the length of the circular buffer), and the B register of the same number is loaded with the base address of the circular buffer.

[Figure: circular buffer with elements a0, a1, a2, ..., an-1, an; B1 marks the base, I1 the current index (I1 = B1 + M), and L1 = x.]


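Circular buffer

– circular buffer with N fixed-size slots

[Figure: the same circular buffer holding elements a1, a2, ..., an-1, an, an+1; B1 marks the base, I1 the current index (I1 = B1 + M), and L1 = x.]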


DAG registers

Index:   I0 I1 I2 I3 I4 I5 I6 I7
Modify:  M0 M1 M2 M3 M4 M5 M6 M7
Length:  L0 L1 L2 L3 L4 L5 L6 L7
Base:    B0 B1 B2 B3 B4 B5 B6 B7



I register holds start address.

M register/immediate holds modifier value.

r0 = DM(I3,M3); // Load

DM(I2,1) = r1; // Store

Circular buffer: the I register is the buffer start index, B is the buffer base address.

Two data values can be transferred to/from memory per cycle:

f0 = DM(I0,M0), f1 = PM(I9,M8);

The compiler allows the programmer to define which memory the values are stored in.



M6 = 1;

R0 = dm(I4, M6); // post-modify

// means: R0 = dm(I4), and then I4 = I4 + M6

// However:

R0 = dm(M6, I4); // offset index only

// means: R0 = dm(M6 + I4), and I4 is left unchanged



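Post-incrementing and Offset

B4 = 4000;

L4 = 0; // circular buffering disabled

I4 = 4002;

M6 = 1;

R0 = dm(M6, I4); // offset

R1 = dm(M6, I4); // offset

// means R0 = dm(4002 + 1) and R1 = dm(4002 + 1)

// with I4 = 4002 still unchanged at the end of the code

R0 = dm(I4, M6); // post-increment

R1 = dm(I4, M6); // post-increment

// means R0 = dm(4002) and R1 = dm(4003)

// with I4 = 4004 at the end of the code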



Circular buffer implementation

B4 = 4000;

L4 = 3; // buffer length

I4 = 4002;

M6 = 1;

R0 = dm(M6, I4); // offset

R1 = dm(M6, I4); // offset

// means R0 = dm(4002 + 1) and R1 = dm(4002 + 1)

// with I4 = 4002 still unchanged

R0 = dm(I4, M6); // post-increment

R1 = dm(I4, M6); // post-increment

// means R0 = dm(4002) with I4 = 4003,

// however R1 = dm(4000) {4003 – 3} with I4 = 4001
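The difference between offset (pre-modify) and post-increment addressing, with and without a circular buffer, can be modeled in a few lines of C. This is a simplified sketch: the struct and function names are hypothetical, the modifier is assumed to be smaller than the buffer length, and the wrap is applied at the moment the index register is updated, so the observable addresses match the comments in the slide.

```c
#include <assert.h>
#include <stdint.h>

/* One I/B/L register triple of a data address generator. */
typedef struct {
    int32_t i;  /* index register: current address        */
    int32_t b;  /* base register: start of the buffer     */
    int32_t l;  /* length register: 0 disables wrapping   */
} DagRegs;

/* Offset addressing, dm(M, I): address is I + M, I is not changed. */
int32_t dag_pre_modify(const DagRegs *d, int32_t m)
{
    return d->i + m;
}

/* Post-modify addressing, dm(I, M): address is I, then I = I + M,
   wrapped into [B, B + L) when L is nonzero. */
int32_t dag_post_modify(DagRegs *d, int32_t m)
{
    int32_t addr = d->i;
    int32_t next = d->i + m;

    assert(d->l >= 0);
    if (d->l > 0) {
        assert(m < d->l && m > -d->l);
        if (next >= d->b + d->l) {
            next -= d->l;
        } else if (next < d->b) {
            next += d->l;
        }
    }
    d->i = next;
    return addr;
}
```

With B = 4000, L = 3 and I = 4002, two post-modify accesses with M = 1 read addresses 4002 and 4000 and leave I at 4001. With L = 0 the same accesses read 4002 and 4003 and leave I at 4004.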


Example: C assignments

C: x = (a + b) - c;

Assembler: r0 = DM(_a); // Load a

r1 = DM(_b); // Load b

r3 = r0+r1;

r2 = DM(_c); // Load c

r3 = r3-r2;

DM(_x) = r3; // Store result in x



C: y = a*(b+c);

Assembler: r1 = DM(_b); // Load b

r2 = DM(_c); // Load c

r2 = r1 + r2;

r0 = DM(_a); // Load a

r2 = r2*r0;

DM(_y) = r2; // Store result in y



Shorter version using pointers:

// Load b, c

r2 = DM(I1,M5), r1 = PM(I8,M13);

// Load a in parallel with the addition

r0 = r2+r1, r12 = DM(I0,M5);

r8 = r12*r0;

DM(I0,M5)= r8; // Store in y



C: z = (a << 2) | (b & 15);

Assembler: r0 = DM(_a); // Load a

r0 = LSHIFT r0 by 2; // Left shift

r1 = DM(_b), r3 = 15;// Load immediate

r1 = r1 AND r3;

r0 = r1 OR r0;

DM(_z) = r0;


SHARC jump

Unconditional flow of control change:

JUMP label;

Three addressing modes:

– Direct (specifies a 24-bit address in immediate );

– Indirect (address supplied by the DAG2 data address generator);

– PC-relative (specifies an immediate value that is added to the

current PC).

All instructions may be executed conditionally:

– if EQ r1=pm(i15,0x11);

– if LE r0 = LSHIFT r0 by 2;

Conditions come from:

– arithmetic status (ASTAT)

– mode control 1 (MODE1)

– loop register


Example: C if statement

C: if (a > b)

y = c + d;

else y = c - d;

Assembler: // if condition

r0 = DM(_a);

r1 = DM(_b);

COMP(r0,r1); // Compare

IF GT JUMP label;

// False block

r0 = DM(_c);

r1 = DM(_d);

r1 = r0 - r1;

DM(_y)= r1;

JUMP other; // Skip true block

// True block

label: r0 = DM(_c);

r1 = DM(_d);

r1 = r0 + r1;

DM(_y) = r1;

other: // Code after if

True version    Description      Complement version
EQ              ALU = 0          NE
LT              ALU < 0          GE
LE              ALU ≤ 0          GT
AC              ALU carry        NOT AC
AV              ALU overflow     NOT AV
TF              Bit test flag    NOT TF


The best if implementation

C: if (a > b)

y = c + d;

else y = c - d;

Assembler: // Load values

r1 = DM(_a), r2 = PM(_b);

r3 = DM(_c), r4 = PM(_d);

// Compute both sum and difference

r12 = r3 + r4, r0 = r3 - r4;

// Choose which one to save

comp(r1,r2); // GT is true when a > b

if GT r0 = r12;

dm(_y) = r0; // Write to y
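The idea behind this version, computing both candidate results and then selecting one instead of branching, can be written in C as follows. This is only an illustration of the technique; whether a given compiler turns it into a conditional move rather than a branch is not guaranteed. The function name is hypothetical.

```c
/* Compute both the sum and the difference, then pick one.
   Mirrors: if (a > b) y = c + d; else y = c - d; */
int select_sum_or_diff(int a, int b, int c, int d)
{
    int sum  = c + d;   /* candidate for the true case  */
    int diff = c - d;   /* candidate for the false case */
    return (a > b) ? sum : diff;
}
```

The result for a == b is the difference, since the condition a > b is false in that case.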


DO UNTIL loops

DO UNTIL instruction provides efficient looping:

LCNTR = 30, DO label UNTIL LCE;

r0 = DM(I0,M0), f2 = PM(I8,M8);

r1 = r0 - r15;

label: f4 = f2 + f3;

In the DO instruction, LCNTR = 30 sets the loop length (16 bit), label marks the last instruction in the loop, and LCE is the termination condition.

The SHARC processor allows up to six nested loops.

Another version of the loop:

DO label UNTIL EQ;

R0 = R0-1;

label: comp(R0,R1);

Loop termination conditions:

LCE       Loop counter expired
NOT LCE   Loop counter not expired


Example: FIR filter

C: for (i=0, y=0; i<N; i++)

y = y + c[i]*x[i];


FIR filter assembler

// setup

I0 = _c; I8 = _x; // c[0] (DAG1), x[0] (DAG2)

r12 = 0; // y = 0;

M0 = 1; M8 = 1; // Set up increments

// Loop body

LCNTR = N, DO loopend UNTIL LCE;

// Use post-increment mode

r1 = DM(I0,M0), r2 = PM(I8,M8);

r8 = r1 * r2 (uui);

loopend: r12 = r12 + r8;
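For checking the assembly loop against a known result, a plain C reference of the same sum of products is useful. This is a minimal sketch with integer data; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reference computation for the loop above: y = sum of c[i] * x[i]
   over the first n elements of both arrays. */
int fir_dot(const int *c, const int *x, size_t n)
{
    int y = 0;
    assert(c != NULL && x != NULL);
    for (size_t i = 0; i < n; i++) {
        y += c[i] * x[i];   /* one multiply and one add per tap */
    }
    return y;
}
```

With coefficients {1, 2, 3, 4} and the first four samples of {1, 2, 3, 4, 5, 6, 7}, the expected result is 1 + 4 + 9 + 16 = 30.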


Example: C main + ASM function

C: int dm c[4] = {1,2,3,4};

int pm x[7] = {1,2,3,4,5,6,7};

int dm y;

extern int fir(int dm *,int pm *);

//main

void main()

{

y = fir(c,x);

}


Example: C main + ASM function

Assembler:

#include <asm_sprt.h>

.SEGMENT/PM seg_pmco;

.global _fir;

.extern _c, _x, _y;

_fir: entry;

// setup

I0=_c; I8=_x; // c[0] (DAG1), x[0] (DAG2)

// or I0 = r4, I8 = r8

r12 = 0; // y = 0;

M0=1; M8=1; // Set up increments

// Loop body

LCNTR = 4, DO loopend UNTIL LCE;

r1 = DM(I0,M0), r2 = PM(I8,M8);

r3 = r1 * r2 (ssi);

loopend: r12 = r12 + r3;

r0 = r12; // or dm(_y)=r12;

exit;

_fir.end:

.endseg;


Example: Using MAC operation

Assembler:

#include <asm_sprt.h>

.SEGMENT/PM seg_pmco;

.global _fir;

.extern _c, _x, _y;

_fir: entry;

//setup

I0=_c; I8=_x; // c[0] (DAG1), x[0] (DAG2)

//or I0 = r4, I8 = r8

MRF = 0; // clear the accumulator

M0=1; M8=1; // Set up increments

//Loop body

LCNTR = 4, DO loopend UNTIL LCE;

r1 = DM(I0,M0), r2 = PM(I8,M8);

loopend: MRF = MRF + r1 * r2 (ssi);

r0 = MR0F;

exit;

_fir.end:

.endseg;



(work with STACK)

int a,b,c,d,e,f;

extern int asm_proc( int a, int b, int c,

int d, int e );

void main()

{

a = 0xAAAAAA;

b = 0xBBBBBB;

c = 0xCCCCCC;

d = 0xDDDDDD;

e = 0xEEEEEE;

f = asm_proc(a,b,c,d,e);

}



(work with STACK)

#include "asm_sprt.h"

.SEGMENT/PM seg_pmco;
.GLOBAL _asm_proc;

_asm_proc:

start:

// m7 = -1 (compiler definition)

// m6 = 1 (compiler definition)

r15 = i6;

// i6 - save C sp (stack pointer)

// i7 - asm sp (stack pointer)

i2 = r15;

modify(i2,m6);

r0 = r4;

r1 = r8;

r2 = r12;

r3 = dm(i2,m6);

// C sp + 2 (fourth argument place)

r4 = dm(i2,m6);

// C sp + 3 (fifth argument place)

r5 =0x555555;

r0 = r0 + r5;

// r0 holds the return value

exit;

_asm_proc.end:

.endseg;


Special instructions to handle “C”

Cjump -- getting to “C” compatible subroutine

– Processor architecture customized for C

– Replaces 3 instructions for faster operations

– Difficult to use in ENCM515

• Will not be having assembly code calling other subroutines (95%) -- Why bother since slow!

RFRAME -- returning to “C” environment

– Processor architecture customized for C

– Part of MAGIC lines of code

– See reference card


“C” interface to assembly code

C/ASSEMBLY LANGUAGE INTERFACE

Special Purpose Registers – usage predetermined by compiler

I7 – C runtime stack pointer – next empty place – NOT last used

I6 – C runtime frame pointer – start of frame of current function (cdefines.i -- I7 = CTOPstack, I6 = FP)

L6/L7 – must remain as zero – controls stack memory characteristics

DAG1 registers – M5 (0), M6 (1), M7 (-1) – in “C” runtime header (cdefines.i -- zeroDM, plus1DM, minus1DM)

DAG2 registers – M13 (0), M14 (1), M15 (-1) – in “C” header (cdefines.i -- zeroPM, plus1PM, minus1PM)

LENGTH registers MUST RETURN to 0 – don’t touch L6/L7


Volatile registers when using “C”

R4 (INPAR1), R8 (INPAR2), R12 (INPAR3), R0 (retvalue)

Scratch or Volatile registers (cdefines.i definitions)

Don’t keep useful values in them across subroutine calls

R0, R1, R2 (cdefines.i -- retvalue, scratchR1, scratchR2)

R4, I4, M4 (cdefines.i -- INPAR1, scratchDMpt, scratchMDM)

R8 (cdefines.i -- INPAR2)

R12, I12, M12 (cdefines.i -- INPAR3, scratchPMpt, scratchMPM)


Important programming reminders

Registers for parameter passing: r4, r8, r12; return value: r0;

An interrupt does not occur until 2 instructions after a delayed branch (needs 2 NOPs);

Some DAG register transfers are disallowed in an assembler routine;

It is preferable not to use the following pairs in any combination: (M7,I6), (M14,I12), (M6,I5), (M5,I6).


SOS filters creation

% Normalize the cutoff frequencies by fs/2.
Wn = [200 1000]/(fs/2);

% Calculate coefficients for the bandpass filter
N = 5;
[b, a] = butter(N, Wn);

% Convert the zero-pole-gain filter parameters to second-order sections form
[z,p,k] = butter(N, Wn);
[sos,g] = zp2sos(z,p,k);

[Figure: the filter H(z), with input x(t) and output y(t), is realized as a cascade of second-order sections H1(z), H2(z), H3(z) with intermediate signals z1(t) and z2(t).]
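Once the second-order sections are available, the filter can be run on the DSP as a cascade of biquads. The following C sketch shows one common way to do this, the transposed direct form II structure, in double precision. It assumes each section is normalized so that a0 = 1 (as in the rows [b0 b1 b2 1 a1 a2] of a MATLAB sos matrix) and that the overall gain g is applied at the input; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One second-order section with its two state variables. */
typedef struct {
    double b0, b1, b2;  /* numerator coefficients            */
    double a1, a2;      /* denominator coefficients (a0 = 1) */
    double s1, s2;      /* filter state, start at 0          */
} Biquad;

/* Process one sample through one section (transposed direct form II). */
double biquad_step(Biquad *q, double x)
{
    double y = q->b0 * x + q->s1;
    q->s1 = q->b1 * x - q->a1 * y + q->s2;
    q->s2 = q->b2 * x - q->a2 * y;
    return y;
}

/* Process one sample through the whole cascade H1(z), H2(z), ... */
double sos_step(Biquad *sections, size_t count, double gain, double x)
{
    double v = gain * x;
    assert(sections != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        v = biquad_step(&sections[i], v);  /* output feeds next section */
    }
    return v;
}
```

Each call to sos_step consumes one input sample and produces one output sample, so it fits naturally into a per-sample interrupt routine. A fixed-point implementation would additionally need scaling between the sections.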

Getting started


Paths to examples

C:\Program Files\Analog Devices\VisualDSP\211xx\examples

C:\Program Files\Analog Devices\VisualDSP\21k\examples

or

D:\Program Files\Analog Devices\VisualDSP\211xx\examples

D:\Program Files\Analog Devices\VisualDSP\21k\examples


Introduction to DSP processors

The END


References

1. Marc Moonen, “Lecture 4: Filter implementation,” lecture slides. http://homes.esat.kuleuven.be/~rombouts/dspII/transp2005-2006/lecture4.ppt

2. Kyungtae Han, “Fixed-Point Wordlength Optimization and Its Applications to Broadband Wireless Demodulator Design,” Samsung Advanced Institute of Technology, Korea, Jun 24, 2004. http://signal.ece.utexas.edu/~khan/talks/2004/wlopt_sait2004.ppt