TMS320C54x DSP Processor
Class Presentation of Custom DSP Implementation Course
Presented by: Shahab adin Rahmanian
ECE Department – University of Tehran
May 2005
This is a class presentation. All data are copyright of their respective authors as listed in the references and have been used here for educational purposes only.
Outline
• Introduction
• Architecture
• Applications
• Features
• Instruction Set and Addressing
• FIR Filtering
• Accelerating Polynomial Evaluation
• Numerical Issues
• Write code in C
• Conclusion
Introduction
[2]
TMS320C54x
• A fixed-point digital signal processor (DSP) in the TMS320 family
• Low-power DSP: 0.54 mW/MIPS
• Acceleration for FIR and LMS filtering, code book search, polynomial evaluation
– h stored as linear array of N elements (in prog. mem.)
– x stored as circular array of N elements (in data mem.)
; Addresses: ar4 h, ar5 N samples of x, ar6 input buffer, ar7 output buffer
; Modulo addressing prevents need to reinitialize regs each sample
; Moving filter coefficients from program to data memory is not shown
firtask: ld   #firDP,dp            ; initialize data page pointer
         stm  #frameSize-1,brc     ; compute 256 outputs
         rptbd firloop-1
         stm  #N,bk                ; FIR circular buffer size
         ld   *ar6+,a              ; load input value to accumulator a
         stl  a,*ar4+%             ; replace oldest sample with newest
         rptz a,#(N-1)             ; zero accumulator a, do N taps
         mac  *ar4+0%,*ar5+0%,a    ; one tap, accumulate in a
         sth  a,*ar7+              ; store y[n]
firloop: ret
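The circular-buffer FIR loop above can be sketched in portable C. This is only an illustration of the idea, not the presenter's code: the names `fir_state` and `fir_step`, the tap count `N = 4`, and the Q15 scaling are assumptions, and the `%` operator stands in for the hardware modulo addressing set up through `bk`.

```c
#include <stdint.h>

#define N 4  /* number of taps (illustrative value, an assumption) */

typedef struct {
    int16_t x[N];   /* circular delay line holding Q15 samples */
    int     newest; /* index of the most recent sample */
} fir_state;

/* Compute one output y[n] = sum_{k=0}^{N-1} h[k] * x[n-k] in Q15.
 * No saturation is applied; the caller is assumed to keep the sum in range.
 * Right-shifting a negative value is assumed to be an arithmetic shift. */
int16_t fir_step(fir_state *s, const int16_t h[N], int16_t in)
{
    int64_t acc = 0;                     /* wide accumulator, like A/B with guard bits */
    int idx;

    s->newest = (s->newest + 1) % N;     /* advance modulo N: this slot holds the oldest sample */
    s->x[s->newest] = in;                /* replace oldest sample with newest */

    idx = s->newest;
    for (int k = 0; k < N; k++) {        /* one multiply-accumulate per tap */
        acc += (int32_t)h[k] * s->x[idx];
        idx = (idx + N - 1) % N;         /* step back to the next older sample */
    }
    return (int16_t)(acc >> 15);         /* Q30 sum scaled back to Q15 */
}
```

Calling `fir_step` once per input sample plays the role of one pass through the `rptbd` block; the index arithmetic is what the `%` addressing suffix does for free on the 'C54x.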
Accelerating Symmetric FIR Filtering
x in two circular buffers
h in program memory
• Coefficients in linear phase filters are either symmetric or anti-symmetric
; Addresses: ar6 input buffer, ar7 output buffer
; ar4 array with x[n-4], x[n-3], x[n-2], x[n-1] for N = 8
; ar5 array with x[n-5], x[n-6], x[n-7], x[n-8] for N = 8
; Modulo addressing prevents need to reinitialize regs each sample
firtask: ld   #firDP,dp                ; initialize data page pointer
         stm  #frameSize-1,brc         ; compute 256 outputs
         rptbd firloop-1
         stm  #N/2,bk                  ; FIR circular buffer size
         ld   *ar6+,b                  ; load input value to accumulator b
         mvdd *ar4,*ar5+0%             ; move old x[n-N/2] to new x[n-N/2-1]
         stl  b,*ar4%                  ; replace oldest sample with newest
         add  *ar4+0%,*ar5+0%,a        ; a = x[n] + x[n-N/2-1]
         rptz b,#(N/2-1)               ; zero accumulator b, do N/2-1 taps
         firs *ar4+0%,*ar5+0%,coeffs   ; b += a * h[i], do next a
         mar  *+ar4(2)%                ; to load the next newest sample
         mar  *ar5+%                   ; position for x[n-N/2] sample
         sth  b,*ar7+
firloop: ret
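The saving from coefficient symmetry can be shown with a short C sketch. It assumes an even number of taps with h[k] = h[n_taps-1-k]; the function names and the plain integer (unscaled) accumulation are illustrative choices, not taken from the slides.

```c
#include <stdint.h>

/* Direct form: one multiply per tap. x[0] is the newest sample. */
int32_t fir_direct(const int16_t *x, const int16_t *h, int n_taps)
{
    int64_t acc = 0;
    for (int k = 0; k < n_taps; k++)
        acc += (int32_t)x[k] * h[k];
    return (int32_t)acc;
}

/* Symmetric form: pre-add the two samples that share a coefficient,
 * then multiply once per pair, so only n_taps/2 multiplies are needed. */
int32_t fir_symmetric(const int16_t *x, const int16_t *h_half, int n_taps)
{
    int64_t acc = 0;
    for (int k = 0; k < n_taps / 2; k++) {
        int32_t pair = (int32_t)x[k] + x[n_taps - 1 - k]; /* mirrored samples */
        acc += (int64_t)pair * h_half[k];                 /* one multiply per pair */
    }
    return (int32_t)acc;
}
```

The pre-add followed by a multiply-accumulate is the pattern that the `add` and `firs` instructions above carry out, with the two halves of x held in separate circular buffers.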
Architecture - FIRS
Accelerating Polynomial Evaluation
• Function approximation and spline interpolation
• Fast polynomial evaluation (N coefficients)
– y(x) = c0 + c1 x + c2 x^2 + c3 x^3 (expanded form)
– y(x) = c0 + x (c1 + x (c2 + x (c3))) (Horner’s form)
– POLY reduces 2N cycles using MAC+ADD to N cycles
; ar2 contains address of array [c3 c2 c1 c0]
; poly uses temporary register t for multiplicand x
; first two times poly instruction executes gives
; 1. a = c(3) + x * 0 = c(3); b = c2
; 2. a = c(2) + x * c(3); b = c1
         ld   *ar2+,16,b   ; b = c3 << 16
         ld   *ar3,t       ; t = x (ar3 contains addr of x)
         rptz a,#3         ; a = 0, repeat next inst. 4 times
         poly *ar2+        ; a = b + x*a || b = c(i-1) << 16
         sth  a,*ar4       ; store result (ar4 is addr of y)
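Horner's form maps onto a simple loop. The sketch below uses `double` for readability rather than the fixed-point arithmetic of the 'C54x, and the function name `horner` and the coefficient ordering (c[0] is the constant term) are assumptions made for this example.

```c
/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) with Horner's rule.
 * Each iteration is one multiply and one add, the same recurrence
 * as the poly comment above: a = b + x*a. */
double horner(const double *c, int n, double x)
{
    double acc = 0.0;                 /* starts at 0, like rptz a */
    for (int i = n - 1; i >= 0; i--)
        acc = c[i] + x * acc;         /* first pass gives c[n-1] + x*0 */
    return acc;
}
```

For c = {1, 2, 3, 4} and x = 2 the loop produces 4, 11, 24, 49 in turn, which matches 1 + 2·2 + 3·4 + 4·8 = 49.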
Integer Multiplication
• Integer multiplication yields products larger than the inputs, as can be seen with single digit decimal values as inputs (for example, 9 × 9 = 81):
• Does the user store the lower (1) or upper (8) result?
– Both must be kept, resulting in additional resources (two cycles, words of code, and RAM locations) to complete the store.
– Worse, how can the double-sized result be used recursively as an input in later calculations, given that the multiplier inputs are single-width?
Fractional Multiplication
• Multiplication of fractions yields products that never exceed the range of a fraction, as can be seen with single digit decimal fractions as inputs (for example, .9 × .9 = .81):
• Don’t we still have a double-sized result to store?
– In this case, we can store just the upper result (.8)
– This allows storage of the result with fewer resources
– Results may be used recursively
• Has accuracy been lost by dropping the lower accumulator value?
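The contrast between integer and fractional products can be sketched in C for 16-bit inputs. The function names and the choice of Q15 (value = integer / 32768) are assumptions for illustration, and right-shifting a negative value is assumed to be an arithmetic shift.

```c
#include <stdint.h>

/* Integer product: the result needs up to 32 bits, so both halves must be kept. */
int32_t mul_int(int16_t a, int16_t b)
{
    return (int32_t)a * b;
}

/* Q15 fractional product: the magnitude never exceeds 1.0, so the upper
 * part of the double-width product is a usable single-width result. */
int16_t mul_q15(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;   /* Q30 product */
    return (int16_t)(p >> 15);    /* back to Q15; low-order bits are dropped */
}
```

With 0.5 represented as 16384, `mul_q15(16384, 16384)` gives 8192, which is 0.25 in Q15, and the result can be fed straight back into another multiply. The single corner case (-1.0) × (-1.0) does not fit in Q15 and is not handled in this sketch.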
Accuracy vs. Precision
• Often the programmer wants to retain the fullest accuracy of a calculation, so dropping the 16 LSBs of the result in the previous example seems a bad choice.
• Note the inputs, though: how much accuracy do they offer?
• The product offers double precision, but its accuracy is limited by the single-width inputs.
• Thus, storing a single-precision result is not only an efficient solution, but also represents the limit of the accuracy of the result.
• The accumulator is double-sized for two reasons:
– To allow for integer operations, which may require the LSBs of the result.
– So that sum-of-product operations generate accumulated noise at the 32nd rather than the 16th bit.
Redundant Sign Bit
• Multiplication of two signed numbers yields a product with two sign bits
• The extra sign bit causes problems if stored to memory as the result:
– Wastes space
– Creates an off-size Q format
• Solution: fractional mode bit!
• When FRCT (mode bit in ST1) is set, the multiplier output is left-shifted by one
• For the 16-bit ‘C54x: Q15 * Q15 = Q15
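The effect of the one-bit left shift can be emulated in C. This is a sketch under assumptions: the function names are made up for the example, the shift is done in unsigned arithmetic to stay within defined C behaviour, and conversions back to signed types are assumed to wrap as on two's-complement targets.

```c
#include <stdint.h>

/* Q15 x Q15 without fractional mode: a Q30 product with two sign bits. */
int32_t mpy_plain(int16_t a, int16_t b)
{
    return (int32_t)a * b;
}

/* Fractional mode emulated: shift the product left by one to get Q31. */
int32_t mpy_frct(int16_t a, int16_t b)
{
    uint32_t p = (uint32_t)((int32_t)a * b);
    return (int32_t)(p << 1);   /* removes the redundant sign bit */
}

/* Upper 16 bits of a 32-bit accumulator value (the part a high-word store keeps). */
int16_t high_word(int32_t acc)
{
    return (int16_t)(acc >> 16);
}
```

For 0.5 × 0.5 (16384 × 16384) the plain product stores 4096 in the high word, which is 0.25 only if read as Q14; with the shift the high word is 8192, which is 0.25 in Q15, so the stored result has the same format as the inputs.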
Accumulation
• With fractions, we were able to guarantee that no multiplicative overflow could occur, i.e. F*F <= F.
• For addition, this rule does not apply, i.e. F+F > F.
• Therefore, we need additional measures to manage the possibility of overflow for accumulation. Two general methods apply:
– Guard bits: the ‘C54x offers an 8-bit extension above the high accumulator to allow valid representation of the result of up to 256 summations.
– Non-gain systems: offer additional criteria that allow a simple solution for unlimited-length summations.
Guard Bits and saturation
• Guard bits: the ‘C54x offers an 8-bit extension above the high accumulator to allow valid representation of the result of up to 256 summations.
• Saturation (SAT)
– SAT instruction saturates a value exceeding the 32-bit range in the selected accumulator:
SAT A
SAT B
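Guard bits and saturation can be modelled with a 64-bit integer standing in for the 40-bit accumulator. The helper names `fits_in_40_bits` and `sat32` are assumptions for this sketch, which only illustrates the ranges involved.

```c
#include <stdint.h>

/* True if v is representable in a signed 40-bit accumulator (32 bits + 8 guard bits). */
int fits_in_40_bits(int64_t v)
{
    return v >= -((int64_t)1 << 39) && v < ((int64_t)1 << 39);
}

/* Saturate a wide accumulator value to the signed 32-bit range. */
int32_t sat32(int64_t acc)
{
    if (acc > INT32_MAX) return INT32_MAX;   /* clamp positive overflow */
    if (acc < INT32_MIN) return INT32_MIN;   /* clamp negative overflow */
    return (int32_t)acc;
}
```

Adding the largest 32-bit value 256 times gives 2^39 - 256, which still fits in 40 bits; that is where the figure of 256 summations comes from. Saturating that sum yields the largest 32-bit value instead of a wrapped result.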
Non-gain Systems
• Many systems can be modeled to have no DC gain:
– Filters with low Q.
– Any system scaled by its maximum gain value.
• Input values from A/D converters are automatically fractions, if the limits of the A/D are presumed to be +/-1
• Coefficient values can similarly be bounded by making the largest value the scaling factor for all other values.
• For these systems, it is known that the final value of the process is less than or equal to the input values.
• The accumulator can therefore be allowed to overflow temporarily, since the final result is known to be bounded by +/-1.
• Allows maximum usage of selected A/D and D/A converters
– D/A bits for gain are more expensive than using analog components
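Why a temporary overflow is harmless can be seen with wraparound (modulo 2^16) arithmetic in C. The function name `wrap_sum` is an assumption, and the final conversion to a signed type is assumed to wrap as on two's-complement targets.

```c
#include <stdint.h>

/* Sum 16-bit values with wraparound arithmetic and no saturation. */
int16_t wrap_sum(const int16_t *v, int n)
{
    uint16_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = (uint16_t)(acc + (uint16_t)v[i]);  /* intermediate overflow just wraps */
    return (int16_t)acc;                         /* correct if the true sum fits in 16 bits */
}
```

For the inputs 30000, 20000, -25000 the partial sum 50000 exceeds the 16-bit signed range, yet the final result is 25000, the true sum. This only works when saturation is off and the final value is known to be in range.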
Division
• The ‘C54x does not have a single-cycle 16-bit divide instruction
– Divide is a rare function in DSP
– Division hardware is expensive
• The ‘C54x does have a single-cycle 1-bit divide instruction: conditional subtract, or SUBC
– Preceded by RPT #15, a 16-bit divide is performed
– Is much faster than without SUBC
• The SUBC process operates only on unsigned operands, thus software must:
– Compare the signs of the input operands
   • If they are alike, plan a positive quotient
   • If they differ, plan to negate (NEG) the quotient
– Strip the signs of the inputs
– Perform the unsigned division
– Attach the proper sign based on the comparison of the inputs
Division Routine
B = num*den (tells sign)
Strip sign of numerator
Strip sign of denominator
16 iterations of 1-bit divide
If result needs to be negative:
Invert sign
Store negative result
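The routine outlined above can be modelled in C. This is a sketch, not the original assembly: the function names are assumptions, the step function follows the usual description of conditional subtract (subtract the divisor shifted left by 15; if the result is non-negative keep it, shift left and set the new quotient bit, otherwise just shift), and the sign is taken from a comparison rather than from the product num*den.

```c
#include <stdint.h>

/* One conditional-subtract step on a 32-bit accumulator. */
static uint32_t subc_step(uint32_t acc, uint16_t den)
{
    uint32_t trial = (uint32_t)den << 15;
    if (acc >= trial)
        return ((acc - trial) << 1) + 1;  /* subtraction succeeded: quotient bit 1 */
    return acc << 1;                      /* subtraction failed: quotient bit 0 */
}

/* Unsigned 16-bit divide as 16 steps; den must be non-zero.
 * Afterwards the low word holds the quotient and the high word the remainder. */
uint16_t udiv16(uint16_t num, uint16_t den, uint16_t *rem)
{
    uint32_t acc = num;
    for (int i = 0; i < 16; i++)
        acc = subc_step(acc, den);
    if (rem)
        *rem = (uint16_t)(acc >> 16);
    return (uint16_t)(acc & 0xFFFFu);
}

/* Signed divide: compare signs, strip them, divide unsigned, reattach the sign. */
int16_t sdiv16(int16_t num, int16_t den)
{
    int negative = (num < 0) != (den < 0);
    uint16_t n = (uint16_t)(num < 0 ? -(int32_t)num : num);
    uint16_t d = (uint16_t)(den < 0 ? -(int32_t)den : den);
    uint16_t q = udiv16(n, d, 0);
    return (int16_t)(negative ? -(int32_t)q : q);
}
```

For example, `udiv16(100, 7, &r)` returns 14 with remainder 2, and `sdiv16(-100, 7)` returns -14. The 16 iterations of the loop correspond to the repeated single-cycle instruction.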
Rounding
• The result of a multiplication can be rounded for MPY, MAC, and MAS operations. This is specified by appending an “R” suffix to the instruction.
• Example: MAC with rounding is MACR. Rounding consists of adding 2^15 to the result and then clearing the low accumulator.
• In a long sum-of-products, only the last MAC operation should specify rounding.
• Rounding can also be achieved with a load operation.
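The rounding rule (add 2^15, then clear the low accumulator) is easy to check in C. The function name `round_acc` is an assumption, and the arithmetic is done in unsigned form to avoid signed-overflow issues in C.

```c
#include <stdint.h>

/* Round a 32-bit accumulator value to its upper 16 bits:
 * add 2^15, then clear the low 16 bits. */
int32_t round_acc(int32_t acc)
{
    uint32_t r = (uint32_t)acc + 0x8000u;  /* add half of one high-word LSB */
    r &= 0xFFFF0000u;                      /* clear the low accumulator */
    return (int32_t)r;
}
```

A value of 0x00017FFF rounds down to 0x00010000, while 0x00018000 rounds up to 0x00020000. Applying this only once, after the last accumulation, avoids adding a rounding error at every tap.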
Sign Extension (SXM)
Write code in C
• Inline Assembly
– Allows direct access to assembly language from C
– Useful for operating on components not used by C, e.g.:
• Note: first column after leading quote is label field
• Long operations should be written in ASM and called from C
– main C file retains portability
– yields more easily maintained structures
– eliminates risk of interfering with registers in use by C
Accessing MMRs from C
• Using pointers to access Memory-Mapped Registers:
– Create a pointer and set its value to the assigned memory address:
    volatile unsigned int *SPC_REG = (volatile unsigned int *) 0x0022;
– Read and write to the register as any other pointer:
    *SPC_REG = 0xC8;
• Accessing I/O Ports from C
– 1. create the port:
    ioport unsigned port8000;
– 2. access the port:
    x = port8000;
    port8000 = y;
Summary and Conclusion
• C54x is a conventional digital signal processor
– Separate data/program busses (3 reads & 1 write/cycle)
– Extended precision accumulators
– Single-cycle multiply-accumulate
– Saturation and wraparound arithmetic
– Bit-reversed and circular addressing modes
• C54x has instructions to accelerate algorithms
– Communications: FIR & LMS filtering, Viterbi decoding
– Speech coding: vector distances for code book search
– Interpolation: polynomial evaluation
References
[1] Texas Instruments, TMS320C54x DSP Design Workshop, May 1997
[2] TMS320C54x User’s Guide
[3] www.ti.com
[4] Signal and Image Processing on the TMS320C54x DSP, by Prof. Brian L. Evans
[5] TMS320C54x Assembly Language Tools