TMS320C54x DSP Processor

Class presentation for the Custom DSP Implementation course

Presented by: Shahab adin Rahmanian

ECE Department – University of Tehran

May 2005

This is a class presentation. All data are the copyright of their respective authors as listed in the references and have been used here for educational purposes only.

Outline

• Introduction
• Architecture
• Applications
• Features
• Instruction set and addressing
• FIR filtering
• Accelerating polynomial evaluation
• Numerical issues
• Write code in C
• Conclusion

Introduction

[2]

TMS320C54x

• A fixed-point digital signal processor (DSP) in the TMS320 family
• Low-power DSP: 0.54 mW/MIPS
• Acceleration for FIR and LMS filtering, code book search, polynomial evaluation, Viterbi decoding, and the fast Fourier transform

[4]

Some Typical Applications

• General-purpose
– Adaptive filtering
– Digital filtering
– Fast Fourier transforms
• Control
– Disk drive control
– Laser printer control
– Robotics control
• Military
– Missile guidance
– Radar processing
– Secure communication
• Telecommunications
– 1200- to 19200-bps modems
– Adaptive equalizers
– Cellular telephones
– Echo cancellation
– Video conferencing

Software Applications

• Circular buffers
• Single-instruction repeat (RPT) loops
• Extended-precision arithmetic
– Addition and subtraction
– Multiplication
– Division
– Square root
• Floating-point arithmetic
• Application-oriented operations
– Symmetric FIR filters
– Adaptive filtering
– Viterbi algorithm for channel decoding
• Fast Fourier transforms

Some key features

• CPU
– Advanced multi-bus architecture with three separate 16-bit data buses and one program bus
– 40-bit arithmetic logic unit (ALU), including a 40-bit barrel shifter and two independent 40-bit accumulators
– 17-bit × 17-bit parallel multiplier coupled to a 40-bit dedicated adder for non-pipelined single-cycle multiply/accumulate (MAC) operation
• Memory
– 192K words × 16-bit maximum addressable memory space (64K words program, 64K words data, and 64K words I/O)
– 28K words × 16-bit single-access on-chip ROM with 8K words configurable as program or data memory (’C541 only)

Some key features

• On-chip peripherals
– On-chip phase-locked loop (PLL) clock generator with internal oscillator or external clock source
– Two full-duplex serial ports to support 8- and 16-bit transfers (’C541 only)
– Time-division multiplexed (TDM) serial port (’C542/’C543 only)
– One 16-bit timer
• Speed: 25/20-ns execution time for a single-cycle fixed-point instruction (40 MIPS/50 MIPS) with 5-V power supply

C54x Addressing Modes

• Immediate
– Operand is part of the instruction
– Example: ADD #0FFh
• Absolute
– Address of operand is part of the instruction
– Example: LD *(LABEL), A
• Register
– Operand is specified in a register
– Example: READA DATA ; (data read from address in accumulator A)

C54x Addressing Modes

• Direct
– Address of operand is part of the instruction (added to implied memory page)
– Example: ADD 010h,A
• Indirect
– Address of operand is stored in a register
– Example: ADD *AR1
– Offset addressing: ADD *AR1(10)
– Register offset (AR1 + AR0): ADD *AR1+0
– Autoincrement/decrement: ADD *AR1+
– Bit-reversed addressing: ADD *AR1+0B
– Circular addressing: ADD *AR1+B
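The circular and bit-reversed modes are easier to picture as index arithmetic. The C sketch below models only the effect of those two modes on an index; it is not a description of how the address generation hardware works. The helper names `circ_advance` and `bit_reverse` are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Advance an index by `step` inside a circular buffer of `size` elements.
   Models the wrap-around that circular addressing provides (hypothetical helper). */
size_t circ_advance(size_t index, size_t step, size_t size)
{
    assert(size > 0 && index < size);
    return (index + step) % size;
}

/* Reverse the low `bits` bits of `index`. This is the sample ordering
   that bit-reversed addressing produces for an FFT (hypothetical helper). */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u); /* move the lowest bit across */
        index >>= 1;
    }
    return reversed;
}
```

For an 8-element buffer, stepping 3 past index 6 wraps to index 1, and in an 8-point FFT index 1 (binary 001) maps to index 4 (binary 100).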

C54x Instruction Set by Category

Logical: AND, BIT, BITF, CMPL, CMPM, OR, ROL, ROR, SFTA, SFTC, SFTL, XOR

Arithmetic: ADD, MAC, MAS, MPY, NEG, SUB, ZERO

Data Management: LD, MAR, MV(D,K,M,P), ST

Program Control: B, BC, CALL, CC, IDLE, INTR, NOP, RC, RET, RPT, RPTB, RPTZ, TRAP, XC

Application Specific: ABS, ABDST, DELAY, EXP, FIRS, LMS, MAX, MIN, NORM, POLY, RND, SAT, SQDST, SQUR, SQURA, SQURS

Notes: CMPL = complement; CMPM = compare memory; MAR = modify address register; MAS = multiply and subtract

Block FIR Filtering

• y[n] = h0 x[n] + h1 x[n-1] + ... + h(N-1) x[n-(N-1)]
– h stored as linear array of N elements (in prog. mem.)
– x stored as circular array of N elements (in data mem.)

; Addresses: a4 h, a5 N samples of x, a6 input buffer, a7 output buffer
; Modulo addressing prevents need to reinitialize regs each sample
; Moving filter coefficients from program to data memory is not shown
firtask: ld #firDP,dp            ; initialize data page pointer
         stm #frameSize-1,brc    ; compute 256 outputs
         rptbd firloop-1
         stm #N,bk               ; FIR circular buffer size
         ld *ar6+,a              ; load input value to accumulator a
         stl a,*ar4+%            ; replace oldest sample with newest
         rptz a,#(N-1)           ; zero accumulator a, do N taps
         mac *ar4+0%,*ar5+0%,a   ; one tap, accumulate in a
         sth a,*ar7+             ; store y[n]
firloop: ret
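For readers who want to check the data flow without a C54x toolchain, here is a minimal C model of the same idea: a circular delay line for x and a linear coefficient array for h. It is a sketch under assumptions, not a translation of the assembly. `FIR_TAPS`, `fir_state`, and `fir_step` are hypothetical names, and the arithmetic is plain integer rather than fractional.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIR_TAPS 4  /* hypothetical filter length N */

typedef struct {
    int16_t x[FIR_TAPS]; /* circular delay line */
    size_t newest;       /* index of the most recent sample */
} fir_state;

/* Push one input sample and return the sum of h[k] * x[n-k]. */
int32_t fir_step(fir_state *s, const int16_t h[FIR_TAPS], int16_t input)
{
    assert(s != NULL && h != NULL);
    s->newest = (s->newest + 1) % FIR_TAPS;    /* slot of the oldest sample */
    s->x[s->newest] = input;                   /* replace oldest with newest */

    int32_t acc = 0;
    size_t idx = s->newest;
    for (size_t k = 0; k < FIR_TAPS; k++) {
        acc += (int32_t)h[k] * s->x[idx];      /* one tap per iteration */
        idx = (idx + FIR_TAPS - 1) % FIR_TAPS; /* step back to x[n-k-1] */
    }
    return acc;
}
```

Feeding a unit impulse through a zero-initialized state returns the coefficients one per call, which is a quick sanity check on the indexing.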

Accelerating Symmetric FIR Filtering

(Figure: x in two circular buffers, h in program memory)

• Coefficients in linear phase filters are either symmetric or anti-symmetric
• Symmetric coefficients using 2 multiplies and 3 adds:
y[n] = h0 x[n] + h1 x[n-1] + h1 x[n-2] + h0 x[n-3]
y[n] = h0 (x[n] + x[n-3]) + h1 (x[n-1] + x[n-2])
• Accelerated by FIRS (FIR Symmetric) instruction

Accelerating Symmetric FIR Filtering

; Addresses: a6 input buffer, a7 output buffer
; a4 array with x[n-4], x[n-3], x[n-2], x[n-1] for N = 8
; a5 array with x[n-5], x[n-6], x[n-7], x[n-8] for N = 8
; Modulo addressing prevents need to reinitialize regs each sample
firtask: ld #firDP,dp                   ; initialize data page pointer
         stm #frameSize-1,brc           ; compute 256 outputs
         rptbd firloop-1
         stm #N/2,bk                    ; FIR circular buffer size
         ld *ar6+,b                     ; load input value to accumulator b
         mvdd *ar4,*ar5+0%              ; move old x[n-N/2] to new x[n-N/2-1]
         stl b,*ar4%                    ; replace oldest sample with newest
         add *ar4+0%,*ar5+0%,a          ; a = x[n] + x[n-N/2-1]
         rptz b,#(N/2-1)                ; zero accumulator b, do N/2-1 taps
         firs *ar4+0%,*ar5+0%,coeffs    ; b += a * h[i], do next a
         mar *+ar4(2)%                  ; to load the next newest sample
         mar *ar5+%                     ; position for x[n-N/2] sample
         sth b,*ar7+
firloop: ret
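The saving from symmetry can be checked in a few lines of C. The sketch below compares a direct sum with the folded form that adds each mirrored pair of samples before multiplying. It assumes an even number of taps and symmetric (not anti-symmetric) coefficients; `fir_direct` and `fir_symmetric` are hypothetical names, and `x[k]` is taken to hold x[n-k].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direct form: one multiply per tap. x[k] holds x[n-k]. */
int32_t fir_direct(const int16_t *h, const int16_t *x, size_t n_taps)
{
    int32_t acc = 0;
    for (size_t k = 0; k < n_taps; k++)
        acc += (int32_t)h[k] * x[k];
    return acc;
}

/* Folded form for symmetric h (h[k] == h[n_taps-1-k]) and even n_taps:
   add the two mirrored samples first, then multiply once per pair. */
int32_t fir_symmetric(const int16_t *h, const int16_t *x, size_t n_taps)
{
    assert(n_taps % 2 == 0);
    int32_t acc = 0;
    for (size_t k = 0; k < n_taps / 2; k++) {
        int32_t pair = (int32_t)x[k] + x[n_taps - 1 - k];
        acc += (int32_t)h[k] * pair;
    }
    return acc;
}
```

With h = {3, 5, 5, 3} and x = {1, 2, 3, 4}, both forms give 40, but the folded form uses two multiplies instead of four.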

Architecture - FIRS

Accelerating Polynomial Evaluation

• Function approximation and spline interpolation
• Fast polynomial evaluation (N coefficients)
– y(x) = c0 + c1 x + c2 x^2 + c3 x^3 (expanded form)
– y(x) = c0 + x (c1 + x (c2 + x (c3))) (Horner’s form)
– POLY reduces 2N cycles using MAC+ADD to N cycles

; ar2 contains address of array [c3 c2 c1 c0]
; poly uses temporary register t for multiplicand x
; first two times poly instruction executes gives
; 1. a = c(3) + x * 0 = c(3); b = c2
; 2. a = c(2) + x * c(3); b = c1
         ld *ar2+,16,b    ; b = c3 << 16
         ld *ar3,t        ; t = x (ar3 contains addr of x)
         rptz a,#3        ; a = 0, repeat next inst. 4 times
         poly *ar2+       ; a = b + x*a || b = c(i-1) << 16
         sth a,*ar4       ; store result (ar4 is addr of y)
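Horner's form maps directly onto a short loop with one multiply and one add per coefficient. The C sketch below uses floating point for readability, whereas the POLY instruction works on fixed-point values. The function name `horner` is hypothetical, and the coefficients are passed lowest order first, which is an assumption of this sketch (the assembly above stores them highest order first).

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) using Horner's rule.
   Coefficient order (lowest first) is an assumption of this sketch. */
double horner(const double *c, size_t n, double x)
{
    assert(c != NULL && n > 0);
    double y = c[n - 1];                 /* start from the highest-order term */
    for (size_t i = n - 1; i-- > 0; )    /* i runs n-2, n-3, ..., 0 */
        y = c[i] + x * y;                /* one multiply and one add per step */
    return y;
}
```

For c = {1, 2, 3, 4} and x = 2 the result is 1 + 4 + 12 + 32 = 49.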

Integer Multiplication

• Integer multiplication yields products larger than the inputs, as can be seen with single-digit decimal inputs: 9 × 9 = 81
• Does the user store the lower (1) or upper (8) result?
– Both must be kept, which costs additional resources (two cycles, words of code, and RAM locations) to complete the store.
– Worse, how can the double-sized result be used recursively as an input in later calculations, given that the multiplier inputs are single-width?

Fractional Multiplication

• Multiplication of fractions yields products that never exceed the range of a fraction, as can be seen with single-digit decimal fractions as inputs: .9 × .9 = .81
• Don’t we still have a double-sized result to store?
– In this case, we can store just the upper result (.8)
– This allows storage of the result with fewer resources
– Results may be used recursively
• Has accuracy been lost by dropping the lower accumulator value?
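The same idea in binary is the Q15 format, where a 16-bit word represents a fraction in the range [-1, +1). The C sketch below multiplies two Q15 values and keeps only the upper part of the product, so the result can feed the next multiply. The function name `q15_mul` is hypothetical, the clamp for the single overflow case is a choice made for this sketch, and the right shift of a negative value is assumed to be arithmetic, as it is on mainstream compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two Q15 fractions and keep a single-width Q15 result. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * b;   /* double-width Q30 product */
    /* (-1) * (-1) = +1 is the only product outside the Q15 range;
       this sketch clamps it to the largest positive fraction. */
    if (product == 0x40000000)
        return INT16_MAX;
    /* Drop the lower bits; assumes an arithmetic shift for negatives. */
    return (int16_t)(product >> 15);
}
```

Here 16384 represents 0.5, so `q15_mul(16384, 16384)` returns 8192, which is 0.25 in Q15.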

Accuracy vs. Precision

• Often the programmer wants to retain the fullest accuracy of a calculation, so dropping the 16 LSBs of the result in the previous example seems a bad choice.
• Note the inputs, though: how much accuracy do they offer?
• The product offers double precision, but its accuracy is based on the single-width inputs.
• Thus, storing a single-precision result is not only an efficient solution, but also represents the limit of the accuracy of the result.
• The accumulator is double-sized for two reasons:
– To allow for integer operations, which may require the LSBs for the result.
– So that sum-of-product operations accumulate noise at the 32nd bit rather than the 16th bit.

Redundant Sign Bit

• Multiplication of two signed numbers yields a product with two sign bits
• The extra sign bit causes problems if stored to memory as the result:
– Wastes space
– Creates an off-size Q format
• Solution: the fractional mode bit
• When FRCT (a mode bit in ST1) is set, the multiplier output is left-shifted by one
• For the 16-bit ’C54x: Q15 × Q15 = Q15
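The effect of the one-bit shift can be seen numerically. The C sketch below models the multiplier output with the fractional mode off and on, then takes the upper 16 bits as the stored result. `mul_frct` and `high_word` are hypothetical names, and `int64_t` merely stands in for the wider accumulator.

```c
#include <assert.h>
#include <stdint.h>

/* Model the multiplier output with FRCT clear (0) or set (1).
   int64_t stands in for the 40-bit accumulator. */
int64_t mul_frct(int16_t a, int16_t b, int frct)
{
    assert(frct == 0 || frct == 1);
    int64_t product = (int64_t)a * b;    /* Q30: two sign bits in 32 bits */
    return frct ? product * 2 : product; /* doubling equals a 1-bit left shift */
}

/* Upper 16 bits of a 32-bit result, the word that would be stored. */
int16_t high_word(int64_t acc)
{
    assert(acc >= INT32_MIN && acc <= INT32_MAX);
    /* Floor division by 2^16, written out so negatives are well defined. */
    int64_t hi = (acc >= 0) ? acc / 65536 : -((-acc + 65535) / 65536);
    return (int16_t)hi;
}
```

For 0.5 × 0.5 (both inputs 16384), the stored word is 4096 without the shift, which is 0.25 only if read as Q14. With the shift it is 8192, which is 0.25 in Q15, the same format as the inputs.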

Accumulation

• With fractions, we were able to guarantee that no multiplicative overflow could occur, i.e., F*F <= F.
• For addition, this rule does not apply, i.e., F+F > F.
• Therefore, we need additional measures to manage the possibility of overflow for accumulation. Two general methods apply:
– Guard bits: the ’C54x offers an 8-bit extension above the high accumulator to allow valid representation of the result of up to 256 summations.
– Non-gain systems: offer additional criteria that allow a simple solution for unlimited-length summations.

Guard Bits and saturation

• Saturation (SAT)
– The SAT instruction saturates a value exceeding the 32-bit range in the selected accumulator:
SAT A
SAT B
• Guard bits: the ’C54x offers an 8-bit extension above the high accumulator to allow valid representation of the result of up to 256 summations.
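A small C model makes the 256-summation figure concrete: 256 worst-case 32-bit terms still fit in a 40-bit signed value, and saturation then clamps the total back to 32 bits. `accumulate` and `sat32` are hypothetical names, and `int64_t` is used as a stand-in for the 40-bit accumulator.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a wide accumulator value to the 32-bit signed range. */
int64_t sat32(int64_t acc)
{
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return acc;
}

/* Sum `count` copies of `term` in a wide accumulator.
   int64_t stands in for the 40-bit accumulator with its 8 guard bits. */
int64_t accumulate(int32_t term, int count)
{
    assert(count >= 0 && count <= 256);
    int64_t acc = 0;
    for (int i = 0; i < count; i++)
        acc += term;
    return acc;
}
```

The sum of 256 copies of the largest 32-bit value stays below 2^39, the limit of a 40-bit signed number, so no information is lost before the final saturation.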

Non-gain Systems

• Many systems can be modeled to have no DC gain:
– Filters with low Q.
– Any system scaled by its maximum gain value.
• Input values from A/D converters are automatically fractions, if the limits of the A/D are presumed to be +/-1.
• Coefficient values can similarly be bounded by making the largest value the scaling factor for all other values.
• For these systems, it is known that the final value of the process is less than or equal to the input values.
• The accumulator can therefore be allowed to overflow temporarily, since the final result is known to be bounded by +/-1.
• Allows maximum usage of the selected A/D and D/A converters
– D/A bits for gain are more expensive than using analog components
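Temporary overflow is harmless because two's-complement addition wraps modulo 2^16 (or 2^32): if the true final sum fits, the wrapped sum equals it. The C sketch below demonstrates this with 16-bit values. `wrap_sum16` is a hypothetical name, and unsigned arithmetic is used so that the wraparound is well defined in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum 16-bit values with wraparound (modulo 2^16) arithmetic. */
int16_t wrap_sum16(const int16_t *terms, size_t count)
{
    assert(terms != NULL);
    uint16_t acc = 0;
    for (size_t i = 0; i < count; i++)
        acc = (uint16_t)(acc + (uint16_t)terms[i]);  /* wraps on overflow */
    /* Reinterpret the 16-bit pattern as a signed value. */
    return (acc < 0x8000u) ? (int16_t)acc
                           : (int16_t)((int32_t)acc - 0x10000);
}
```

With the terms 30000, 30000, -30000, -10000 the running sum passes through 60000, which does not fit in 16 signed bits, yet the final result is the correct 20000.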

Division

• The ’C54x does not have a single-cycle 16-bit divide instruction
– Divide is a rare function in DSP
– Division hardware is expensive
• The ’C54x does have a single-cycle 1-bit divide instruction: conditional subtract, or SUBC
– Preceded by RPT #15, a 16-bit divide is performed
– Is much faster than without SUBC
• The SUBC process operates only on unsigned operands, thus software must:
– Compare the signs of the input operands
• If they are alike, plan a positive quotient
• If they differ, plan to negate (NEG) the quotient
– Strip the signs of the inputs
– Perform the unsigned division
– Attach the proper sign based on the comparison of the inputs

Division Routine

– B = num*den (tells sign)
– Strip sign of numerator
– Strip sign of denominator
– 16 iterations of the 1-bit divide
– If result needs to be negative, invert sign
– Store negative result
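The steps above can be sketched in C. The conditional-subtract step follows the SUBC rule as it is commonly documented: subtract the divisor shifted left by 15, and if the result is non-negative keep it, shift left, and set the low bit; otherwise just shift the accumulator left. `subc_step` and `div16` are hypothetical names, `int64_t` stands in for the accumulator, and the most negative 16-bit value is excluded so the sign can be stripped safely.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One conditional-subtract step (model of a single SUBC). */
int64_t subc_step(int64_t acc, uint16_t den)
{
    int64_t trial = acc - ((int64_t)den << 15);
    return (trial >= 0) ? (trial * 2 + 1) : (acc * 2);
}

/* Signed 16-bit divide: strip signs, run 16 steps, restore the sign. */
int16_t div16(int16_t num, int16_t den)
{
    assert(den != 0 && num != INT16_MIN && den != INT16_MIN);
    int negative = (num < 0) != (den < 0);      /* signs differ */
    int64_t acc = abs(num);                     /* unsigned numerator */
    uint16_t d = (uint16_t)abs(den);            /* unsigned denominator */
    for (int i = 0; i < 16; i++)                /* like RPT #15 then SUBC */
        acc = subc_step(acc, d);
    int16_t quotient = (int16_t)(acc & 0xFFFF); /* low word holds the quotient */
    return negative ? (int16_t)-quotient : quotient;
}
```

After the 16 steps the low word of the accumulator holds the quotient and the upper part holds the remainder; for 100 / 7 that is 14 with remainder 2.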

Rounding

• The result of a multiplication can be rounded for MPY, MAC, and MAS operations. This is specified by appending an “R” suffix to the instruction.
• Example: MAC with rounding is MACR. Rounding consists of adding 2^15 to the result and then clearing the low accumulator.
• In a long sum-of-products, only the last MAC operation should specify rounding.
• Rounding can also be achieved with a load operation.
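The rounding rule (add 2^15, then clear the low 16 bits) is simple to model in C. The sketch below applies it to a 32-bit value; `round_acc` is a hypothetical name, and the low word is cleared with modular arithmetic so that negative values are handled without relying on implementation-defined shifts.

```c
#include <assert.h>
#include <stdint.h>

/* Round a 32-bit accumulator value to its upper 16 bits:
   add 2^15, then clear the low word. */
int32_t round_acc(int32_t acc)
{
    assert(acc <= INT32_MAX - 0x8000);            /* keep the result in range */
    int64_t sum = (int64_t)acc + 0x8000;          /* add 2^15 */
    int64_t low = ((sum % 65536) + 65536) % 65536; /* low word as 0..65535 */
    return (int32_t)(sum - low);                  /* clear the low word */
}
```

A value of 0x00018000 (1.5 in units of the high word) rounds up to 0x00020000, while 0x00017FFF rounds down to 0x00010000.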

Sign Extension (SXM)

Write code in C

• Inline assembly
– Allows direct access to assembly language from C
– Useful for operating on components not used by C
• Note: the first column after the leading quote is the label field
• Long operations should be written in ASM and called from C
– The main C file retains portability
– Yields more easily maintained structures
– Eliminates the risk of interfering with registers in use by C

Accessing MMRs from C

• Using pointers to access memory-mapped registers (MMRs):
– Create a pointer and set its value to the assigned memory address:
volatile unsigned int *SPC_REG = (volatile unsigned int *) 0x0022;
– Read and write to the register as with any other pointer:
*SPC_REG = 0xC8;
• Accessing I/O ports from C
– 1. Create the port:
ioport unsigned port8000;
– 2. Access the port:
x = port8000;
port8000 = y;

Summary and Conclusion

• The C54x is a conventional digital signal processor
– Separate data/program buses (3 reads and 1 write per cycle)
– Extended-precision accumulators
– Single-cycle multiply-accumulate
– Saturation and wraparound arithmetic
– Bit-reversed and circular addressing modes
• The C54x has instructions to accelerate algorithms
– Communications: FIR and LMS filtering, Viterbi decoding
– Speech coding: vector distances for code book search
– Interpolation: polynomial evaluation

References

[1] Texas Instruments, TMS320C54x DSP Design Workshop, May 1997
[2] TMS320C54x User’s Guide
[3] www.ti.com
[4] Signal and Image Processing on the TMS320C54x DSP, by Prof. Brian L. Evans
[5] TMS320C54x Assembly Language Tools
