ECE 5655/4655 Real-Time DSP

Chapter 3: TMS320C6x Programming

Introduction

In this chapter programming the TMS320C6x in assembly, linear assembly, and C will be introduced. Preference will be given to explaining code development for the DSK memory map. The basis for the material presented in this chapter is the course notes from TI’s C6000 4-day design workshop [1].

Programming Alternatives

Source               Tool                 Efficiency*   Effort
C (with intrinsics)  Compiler optimizer   70 – 80%      Low
Linear ASM           Assembly optimizer   95 – 100%     Medium
ASM                  Hand optimize        100%          High

* Typical efficiency versus hand-optimized assembly; see TI benchmarks for more information.

1. TMS320C6000 DSP Design Workshop, Revision 4.0, June 2000.
Introduction to Assembly Language Programming

A Dot Product Example

• Recall the C6000 block diagram

[Figure: C6000 block diagram. CPU with function units .D1, .M1, .L1, .S1 and .D2, .M2, .L2, .S2, register files A0–A15 and B0–B15, and control registers; internal buses; program RAM; data RAM; EMIF to external memory (sync and async); DMA; serial port; host port; boot load; timers; power down]

• To motivate this introduction to assembly programming, consider a basic sum of products or dot product example

$$y = \sum_{n=1}^{40} a_n x_n \tag{3.1}$$

• Assembly instructions will initially be shown only with limited detail
• In a later section the details of putting together an actual assembly file will be given
• The core of this algorithm is multiplication and addition
• To multiply we use the .M (multiply) unit

    MPY  .M  a, x, prod

– As shown here MPY calls a 16-bit multiply which gives a 32-bit result
• To add or accumulate we use the .L (logical) unit

    MPY  .M  a, x, prod
    ADD  .L  Y, prod, Y

• Where are the variables stored?
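The multiply and accumulate steps above can be sketched in C. This is only an illustration of equation (3.1), not the assembly file developed in these notes; the function name dot_product and the use of the <stdint.h> types are assumptions made for the sketch.

```c
#include <stdint.h>

#define NUM_TERMS 40  /* number of multiply-accumulates in equation (3.1) */

/* Sum of products. Each pass performs one 16 x 16 -> 32-bit multiply
   (the role of MPY on a .M unit) and one 32-bit add (the role of ADD
   on a .L unit). */
int32_t dot_product(const int16_t *a, const int16_t *x, int count)
{
    int32_t y = 0;
    for (int n = 0; n < count; n++) {
        int32_t prod = (int32_t)a[n] * (int32_t)x[n];  /* MPY a, x, prod */
        y += prod;                                     /* ADD Y, prod, Y */
    }
    return y;
}
```

Calling dot_product(a, x, NUM_TERMS) on two 40-element arrays evaluates the full sum; in C the compiler decides where a, x, prod, and y live, which is exactly the question the assembly programmer must answer by hand.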
• Note that we need to store the working variables in a register file; the C6000 has two, but for now we will just use the A side
• We now rewrite the code to include the actual register names
• The original equation (3.1) specifies 40 multiply accumulates
– MVKH .S a,A5 ;will move the upper or high 16 bits without altering the lower 16 bits
– Use MVKL and MVKH in ordered combination to load constants greater than 16 bits, and MVK for constants of 16 bits or less
• What should appear above the code MVK .S 40,A2 is:
    MVKL .S a,A5 ;store lower half of a
    MVKH .S a,A5 ;store upper half of a
    MVKL .S x,A6 ;store lower half of x
    MVKH .S x,A6 ;store upper half of x
    MVKL .S y,A7 ;store lower half of y
    MVKH .S y,A7 ;store upper half of y
• To properly loop over the data, the pointers need to be incremented
• The C notation “++” can be used to pre- or post-increment registers being used as pointers, e.g., A5++ increments by one the address held in A5 after it is used
• Pointer incrementing is summarized in the following figure:
• Since there is another set of function units we should have specified the side, e.g., .S1 for side A, etc.
• In total, the processor has only about 48 instructions, and hence is considered to be a RISC device
• Before going any further in assembly programming we need to spend some time studying the pipeline
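The ordered MVKL/MVKH pairing used earlier to load 32-bit constants can be modeled in C. This is a rough sketch of the register effect only: the helper names mvkl, mvkh, and load_constant32 are hypothetical, and the model assumes that MVKL sign-extends its 16-bit constant while MVKH replaces only the upper half.

```c
#include <stdint.h>

/* MVKL: place the low 16 bits of the constant in the register,
   sign-extended to 32 bits. */
uint32_t mvkl(uint32_t constant)
{
    uint32_t low = constant & 0x0000FFFFu;
    return (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
}

/* MVKH: replace the upper 16 bits of the register with the upper 16 bits
   of the constant, leaving the lower 16 bits unchanged. */
uint32_t mvkh(uint32_t reg, uint32_t constant)
{
    return (constant & 0xFFFF0000u) | (reg & 0x0000FFFFu);
}

/* The ordered combination: MVKL first, then MVKH. */
uint32_t load_constant32(uint32_t constant)
{
    uint32_t reg = mvkl(constant);
    return mvkh(reg, constant);
}
```

The order matters in this model: after mvkl the upper half may hold sign-extension bits, and the following mvkh overwrites them with the intended upper half. Reversing the order would let the sign extension destroy the upper half.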
Introduction to the Pipeline

• DSP microprocessors rely heavily on the performance advantages of pipelining; the C6x is no exception
• It would be nice to never have to worry about pipeline issues, but some exposure will be helpful in future programming
• Getting code to work only requires a few basic guidelines, while full optimization of the eight function units is beyond the scope of this section of the notes
• The basic operations of the CPU are:
– (F) Fetch or Program Fetch (PF): get an instruction from memory
– (D) Decode: figure out what type of instruction it is (ADD, MPY)
– (E) Execute: actually perform the operation
Pipelined and Non-Pipelined
• Once the pipeline is full the multiple buses of the C6x can carry out the F, D, and E operations in parallel, all within the same clock cycle
• On the downside, when discontinuities such as program branching occur, the pipeline must be flushed, which results in added processor overhead
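The benefit of overlapping F, D, and E can be estimated with a short calculation. The sketch below is an idealized model, assuming one clock cycle per stage, no stalls, and no branches; the function names are made up for the illustration and the real C6x pipeline has more phases than three.

```c
/* Cycles to run n instructions when each instruction must complete every
   stage before the next one starts. */
unsigned long cycles_non_pipelined(unsigned long n, unsigned long stages)
{
    return n * stages;
}

/* Cycles to run n instructions when the stages overlap: the first
   instruction needs `stages` cycles to fill the pipeline, and each later
   instruction completes one cycle after its predecessor. */
unsigned long cycles_pipelined(unsigned long n, unsigned long stages)
{
    if (n == 0)
        return 0;
    return stages + (n - 1);
}
```

With three stages, nine instructions take 27 cycles without pipelining and 11 cycles with it under these assumptions. A branch that flushes the pipeline restarts the fill cost, which is the overhead noted above.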
Program Fetch Stage
• The program fetch stage actually is broken into four phases

e.g., MPYSP (1.4) means a single precision float multiply requires a single function unit latency and three delay slots.
C Programming

This section will focus on some of the uses of the C6x development tools and some of the compiler, assembler, and linker settings.

• As stated at the beginning of this chapter, the use of C code can achieve from 80–100% of the efficiency of hand assembly
– Further optimization, which is discussed in this section, will likely be required, but it is safe to say that C code is a good starting point for algorithm development
• Recall the basic code building tool layout is:

[Figure: code building tool flow. The Editor produces .c/.cpp, .sa, and .asm source files; .c/.cpp files pass through the Compiler and .sa files through the Asm Optimizer to give .asm; the Asm (assembler) produces .obj; the Linker combines .obj files under Link.cmd to produce .out]

• When the compiler tools are coupled with Code Composer Studio (CCS) we have a complete development environment:
• The output code can be controlled with a very large number of options that span the compiler, assembler, and linker

Option      Description                                       Options Tab
-mv6700     Generate ‘C6700 code (‘C6200 is default)          Compiler
-fr <dir>   Directory containing source files                 Compiler
-g          Enables src-level symbolic debugging              Comp/Asm
-s          Interlist C statements into assembly listing      Compiler
-k          Keep assembly file                                Compiler
-mg         Enables minimum debug to allow profiling          Compiler
-mt         No aliasing used                                  Compiler
-o3         Invoke optimizer (-o0, -o1, -o2/-o, -o3)          Compiler
-pm         Combine all C source files before compile         Compiler
-ms         Minimize code size (-ms0/-ms, -ms1, -ms2)         Compiler
-oi0        Disables automatic function inlining              Compiler
-l          Create assembler listing file (small -L)          Assembler
-s          Retain asm symbols for debugging                  Assembler
-o <file>   Output file name                                  Linker
-m <file>   Map file name                                     Linker
-c          Auto-init C variables (-cr turns off autoinit)    Linker
• The system software is broken into modules of code and data known as sections
• The sections as found in a typical C program are shown below:
• The above names seem reasonable, but the compiler uses names associated with the common object file format (COFF) developed many years ago by AT&T for use with C and Unix
• The real names used by the C6x compiler tools are the following:

Section Name   Description
.switch        Tables for switch instructions
.const         Global and static string literals
.far           Global and statics declared far
.sysmem        Memory for malloc fcns (heap)
.cio           Buffers for stdio functions
• A possible section placement solution for the C6201:
• A more generalized way of describing the memory sections is to use the terms initialized and uninitialized as opposed to ROM and RAM, i.e.,

Section Name   Description                             Memory Type
.text          Code                                    initialized
.switch        Tables for switch instructions          initialized
.const         Global and static string literals       initialized
.cinit         Initial values for global/static vars   initialized
.bss           Global and static variables             uninitialized
.far           Global and statics declared far         uninitialized
.sysmem        Memory for malloc fcns (heap)           uninitialized
.cio           Buffers for stdio functions             uninitialized
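To connect the section names to source code, the short C fragment below is annotated with the section each item would be expected to land in, following the table above. It is a hedged illustration in standard C only: the variable and function names are invented, and the actual placement depends on the compiler version and options.

```c
#include <stdlib.h>

int gain = 3;                  /* storage in .bss, initial value 3 in .cinit  */
int total;                     /* storage in .bss, no initial value recorded  */
const char *label = "norm";    /* the string literal "norm" goes to .const    */

/* The instructions of this function go to .text. */
int scale_and_store(int value)
{
    /* malloc() hands out memory from the heap, which is .sysmem */
    int *scratch = malloc(sizeof *scratch);
    if (scratch == NULL)
        return -1;             /* heap exhausted */

    *scratch = value * gain;
    total = *scratch;          /* result kept in the global in .bss */
    free(scratch);
    return total;
}
```

Initialized sections (.text, .const, .cinit) can live in ROM because their contents are fixed at link time, while uninitialized sections (.bss, .sysmem) only need RAM to be reserved.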
Embedded Systems with C
Memory Management
• We control the physical mapping of memory to program and data sections via a linker command file

[Figure: the Linker takes .obj files and the .cmd file and produces the .out file (-o) and the .map file (-m); the ‘C6x memory being mapped consists of ROM, RAM, and peripherals]

• The linker command file .cmd has two parts

    MEMORY
    {
        Memory Description
    }
    SECTIONS
    {
        Binding Code/Data Sections to Memory
    }
• In the memory description portion we create a description of both processor and system resources
• Each line is of the form
    name: origin = address, length = size-in-bytes
– Note that we can shorten origin to simply o or org, and length to simply len or l, i.e., consider the memory portion of the C6711 command file we have used thus far
    MEMORY
    {
        vecs:  org = 00000000h , len = 220h
        IRAM:  org = 00000220h , len = 0000fdc0h
        CE0:   org = 80000000h , len = 01000000h
        FLASH: org = 90000000h , len = 00020000h
    }
– Quantities may be specified in hex or decimal, but hex is preferred, e.g., 100h or 0x100
• Note: The vectors section must come first, so that following reset, initialization can occur
• The vecs space must be at least 200 hex bytes long since on the C6x there are a total of 16 interrupts, each requiring one fetch packet of 8, 32-bit instructions ($16 \times 32 = 200\text{h}$)
– Here the 220h leaves room for 32 bytes (20h) more
– There will be more discussion of interrupts later
• To understand the rest of the memory space assignments, recall the C6x11 memory map
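The size arithmetic for the memory description can be checked with a few lines of C. The sketch below copies the origin and length values from the C6711 MEMORY listing above; the MemRegion type and the helper functions are hypothetical names introduced only for this check.

```c
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t origin;   /* start address */
    uint32_t length;   /* size in bytes */
} MemRegion;

/* Values copied from the C6711 MEMORY description, in address order. */
static const MemRegion memory_map[] = {
    { "vecs",  0x00000000u, 0x00000220u },
    { "IRAM",  0x00000220u, 0x0000fdc0u },
    { "CE0",   0x80000000u, 0x01000000u },
    { "FLASH", 0x90000000u, 0x00020000u },
};
enum { NUM_REGIONS = sizeof memory_map / sizeof memory_map[0] };

/* Bytes needed by the vector table: 16 interrupts, one fetch packet each,
   8 instructions per packet, 4 bytes per instruction. */
uint32_t vector_table_bytes(void)
{
    return 16u * 8u * 4u;
}

/* First address past the end of a region. */
uint32_t region_end(const MemRegion *r)
{
    return r->origin + r->length;
}

/* Returns 1 when no region runs into the region that follows it. */
int regions_are_disjoint(void)
{
    for (int i = 0; i + 1 < NUM_REGIONS; i++) {
        if (region_end(&memory_map[i]) > memory_map[i + 1].origin)
            return 0;
    }
    return 1;
}
```

Here vector_table_bytes() gives 0x200, so the 0x220-byte vecs region has 0x20 bytes to spare, and IRAM begins exactly where vecs ends.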
• On the C6x13 DSK we frequently place all of the sections, program and data, in the internal RAM (IRAM)
• In the third tab of the project options dialog box, we set linker options
• The -o specifies the executable file, e.g., norm_sq_c.out
• The -m creates a map file which shows in detail how the linker has located everything in memory
• The -c option, run-time autoinitialization, invokes BOOT.C so that variables are autoinitialized, that is, initial values in the .cinit section are copied into the .bss section
– We can turn off autoinit by using -cr
• -stack sets the size of the stack, i.e., the .stack section; the default is 0x400
• -heap sets the size of the heap, which is actually the .sysmem section; it has a default value of 0x400
• -q suppresses the banner display and -w has the linker exhaustively read all libraries
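The idea behind run-time autoinitialization can be sketched in C as a table-driven copy. This is a simplified model and not the actual boot code or the real .cinit record layout: the InitRecord type and the auto_init function are hypothetical, chosen only to show initial values being copied into variables before main() would run.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified initialization record: where the variable
   lives (think .bss), how many bytes it occupies, and where its initial
   value is stored (think .cinit). */
typedef struct {
    void       *dest;
    size_t      size;
    const void *init_data;
} InitRecord;

/* Walk the table and copy every initial value into its variable. */
void auto_init(const InitRecord *table, size_t count)
{
    for (size_t i = 0; i < count; i++)
        memcpy(table[i].dest, table[i].init_data, table[i].size);
}
```

For a declaration such as int x = 7, w = 3; the table would hold one record per variable. With -cr no such copy happens at run time, so the values must already be in place when the program is loaded.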
Calling Assembly with C

Being able to call assembly routines from C is a powerful capability of the compiler tools. In this section we explore the main points.

• For more detail refer to spru187t or newer, TMS320C6000 Optimizing Compiler v 7.3: User's Guide
– Sections 7.4 & 7.5
• To begin with, all C labels are accessed in the assembly file with an underscore (_) character, e.g., sum --> _sum
• To call an assembly routine requires that we follow a few simple rules
• Things we would like to do are:
– Pass arguments in
– Return results
– Access C’s global variables in assembly
• More advanced issues, not dealt with here, are use of and access to the stack and optimal access to global variables

[Figure: main( ) in C branching (b) to the assembly label _asmFunction:]
• To find a function we have a global (inter-file) reference

Parent.C
    int child(int, int);
    int x = 7, y, w = 3;

    void main (void)
    {
        y = child(x, 5);
    }

Child.C
    int child(int a, int b)
    {
        return(a + b);
    }

Child.ASM
            .global _child
    _child:
            ...assembly code...
            ; end of subroutine

– Use _underscore
– Make label global

• To pass variables in, take a return value, and return to the parent code flow, we use a set of argument/register passing rules

Register   A file        B file
3                        ret addr
4          arg1/r_val    arg2
6          arg3          arg4
8          arg5          arg6
10         arg7          arg8
12         arg9          arg10

– Arguments are passed in registers as shown
– Return value in A4 and return to address in B3

• Accessing C’s global variables in assembly
– Declare global labels
– Use _underscore when accessing C variables (labels)
– Advantages of declaring variables in C?
– Declaring in C is easier
– Compiler does variable init ( int w = 3 )

Parent.C
    int child2(int, int);
    int x = 7, y, w = 3;

    void main (void)
    {
        y = child2(x, 5);
    }
• Registers A10–A15 and B10–B15 must be saved/preserved
• There is actually a bit more to this (see below), but more later
[Figure: register files A and B, registers 0–15. arg1/r_val, arg3, arg5, arg7, and arg9 are held in the A file; ret addr, arg2, arg4, arg6, arg8, arg10, DP, and SP are held in the B file. Registers 10–15 on both sides must be saved and restored if you use them in assembly. Extra arguments are passed on the stack, next to the prior stack contents.]
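The argument passing rules can be captured in a small C lookup, which may be handy when reading or writing an assembly routine. This is only a sketch of the convention described in this section (arg1 in A4, arg2 in B4, alternating sides and stepping by two registers, with registers 10–15 preserved by the called routine); the Reg type and function names are invented for the example.

```c
typedef struct {
    char file;    /* 'A' or 'B' register file */
    int  number;  /* register number 0..15    */
} Reg;

/* Fills *out with the register carrying argument number arg (1..10).
   Returns 0 on success, or -1 when the argument is not passed in a
   register (extra arguments go on the stack). */
int arg_register(int arg, Reg *out)
{
    if (arg < 1 || arg > 10)
        return -1;
    out->file   = (arg % 2) ? 'A' : 'B';     /* odd args on A, even on B */
    out->number = 4 + 2 * ((arg - 1) / 2);   /* 4, 6, 8, 10, 12          */
    return 0;
}

/* Returns 1 for A10-A15 and B10-B15, which an assembly routine must save
   and restore if it uses them. */
int is_callee_saved(Reg r)
{
    return r.number >= 10 && r.number <= 15;
}
```

For the call child(x, 5), arg_register reports A4 for x and B4 for the constant 5; the result comes back in A4 and the routine returns by branching to the address in B3.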
Linear Assembly and Assembly Optimization

Being able to call highly efficient linear assembly routines from C is another powerful capability of the compiler tools. In this section we explore the main points.

• Linear assembly has the ease of C programming (almost) and the efficiency approaching that of assembly, but without too many headaches, as the tools do a lot of the work
• The development flow for linear assembly modules
• Features of linear assembly for subroutines include:
– Pass parameters
– Return results
– Use symbolic variable names
– Ignore pipeline issues (delay slots)
– Automatically return to the calling function
– Call other functions written in C or linear assembly
• Linear assembly can also call another subroutine
Linear Assembly Compiler Settings
• Specific assembly optimizer options are:
– Use -g -s for algorithm verification
– Use -k -mgt -o3 -pm for software pipelining
Example: Vector Norm Squared

In this example we will be computing the squared length of a vector using 16-bit (short) signed numbers. In mathematical terms we are finding

$$\|\mathbf{A}\|^2 = \sum_{n=1}^{N} A_n^2$$
• The solution will be obtained in three different ways:
– Conventional C programming
– C6x assembly
– C6x linear assembly
• Optimization is not a concern at this point
• The focus here is to see, by way of a simple example, how to call a C routine from C (obvious), how to call an assembly routine from C, and how to call and write a simple linear assembly routine from C
C Version
• We implement this simple routine in C using a declared vector length N and vector contents in the array A
• The C source, which includes the called function norm_sq, is given below
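A minimal C sketch of such a routine is shown here for reference. It is not the original course listing: the int return type and the parameter names v and n are assumptions, and only the function name norm_sq comes from the text.

```c
/* Squared length of a vector of 16-bit values: the sum of v[i] * v[i]. */
int norm_sq(const short *v, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += v[i] * v[i];   /* each short is promoted to int first */
    return sum;
}
```

For the five-element vector {1, 2, 3, 6, 7} used later in this chapter, the terms are 1 + 4 + 9 + 36 + 49, giving 99.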
– Labels must start in the first column, may be up to 200 characters, and must begin with a letter; the colon is optional
• When accessing from C the register calling convention is observed, that is, when we enter the function norm_asm(arg1, arg2),
– arg1 is a pointer or address to the first value of the array A, and is stored in register A4
– arg2 is an int value, i.e., a full 32-bit signed integer, and is stored in register B4
• Since arg2 is the array dimension, we will use it as the loop counter starting value
• B4 is not a suitable register for loop control, so we move (mv) the value stored in B4, in this case to B1
• We initialize the accumulator register, A2, using the zero instruction; alternatively mvk .s1 0,A2 works as well
• Starting at the top of the loop section, we begin by loading (ldh since we only have 16 bits) the values pointed to by A4 into working register A1
– The pointer A4 is post-incremented by just 2 bytes (16 bits) following the load operation
– The default increment size is controlled by the data type; here it is halfwords (16 bits)
– Various pre- and post-increment options are available, including the offset amount, and whether it modifies the original pointer or not (see the table below)
• To satisfy the pipeline delays, we follow the ldh with 4 NOPs
• Next, we perform a 16-bit multiply (MPY), actually a squaring; the result is stored in A3
• To satisfy the pipeline we follow the MPY with one NOP
• We accumulate the result into register A2 using ADD
• Next, we branch to loop subject to the state of B1
• The branch is followed by five NOPs to satisfy the pipeline delay
Syntax (a)     Pointer changed   Description
*A1            no                Basic pointer
*+A1[disp]     no                +Pre-offset
*-A1[disp]     no                -Pre-offset
*++A1[disp]    yes               Pre-increment
*--A1[disp]    yes               Pre-decrement
*A1++[disp]    yes               Post-increment
*A1--[disp]    yes               Post-decrement

a. If [disp] is omitted the displacement is one unit of the data type; otherwise the displacement is by integer multiples of Word, Halfword, or Byte. If (disp) is used instead of [disp] the displacement is (disp) bytes.
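The difference between the offset and increment forms in the table can be mimicked with C pointers to halfwords. The sketch below is an analogy rather than a model of the hardware: the function names are hypothetical, disp counts halfwords as in the [disp] form, and a pointer to pointer stands in for the address register A1.

```c
/* *+A1[disp]: read at an offset; the pointer itself is not changed. */
short load_pre_offset(const short **a1, int disp)
{
    return (*a1)[disp];
}

/* *++A1[disp]: advance the pointer first, then read where it now points. */
short load_pre_increment(const short **a1, int disp)
{
    *a1 += disp;
    return **a1;
}

/* *A1++[disp]: read where the pointer points, then advance it. */
short load_post_increment(const short **a1, int disp)
{
    short value = **a1;
    *a1 += disp;
    return value;
}
```

The decrement and negative-offset forms follow by negating disp. The post-increment form with a displacement of one unit is the one used by ldh in the loop described above.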
• Finally, the squared and accumulated value held in A2 is saved to the return register A4
• To return back to the C module, we must branch to the address saved in B3
• If we had needed to use registers A10–A15 or B10–B15, we would have had to save and restore them accordingly
• The final numerical result is again 99

Running in CCS 2: The C code is put into a project for running on the 6711 DSK as norm_sq_asm.pjt, and debugged and profiled

• The profiling results of the new norm_sq function are:
• With the assembly routine the cycle count is reduced to 91, which as a ratio makes the C routine 152/91 = 1.67 times slower, assuming no optimization
• With optimization the tables are turned and the C is faster by the factor ?
The Linear Assembly Version
• The parent C calling routine is again of the form:
    /******************************************************
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;
    norm_sq = norm_sa(A, N);
    printf("Vector norm squared = %d",norm_sq);
    return 0;
    }
• The assembly routine is the following:
    ; Vector norm in linear assembly

              .global _norm_sa        ;reference name from C

    _norm_sa: .cproc  A, N            ;input variables
              .reg    m, sum          ;working variables
              zero    sum             ;zero the accumulator

    loop:     ldh     *A++, m         ;load values pointed to by A
              mpy     m, m, m         ;square each value
              add     m, sum, sum     ;accumulate the squared values
              sub     N, 1, N         ;decrement the loop counter
       [N]    b       loop            ;branch until N == 0

              .return sum             ;return value
              .endproc                ;end linear assembly routine
• The function/subroutine is declared .global just as in the assembly case
• Following the assembly label _norm_sa, we begin the linear assembly routine with .cproc followed by the input variables (may be dummy names)
• Working variables are declared using .reg
• The accumulator is cleared using the assembler instruction zero
• A loop is then set up in a similar fashion to the pure assembly version, except now the precise management of the registers is left to the assembly optimizer
• There is also no need to include NOPs
• As before the final answer is 99
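To make the correspondence with C explicit, the linear assembly routine can be transliterated one instruction per statement. The function below is a reading aid only, under the assumption that N is at least 1 on entry; the name norm_sa_model is hypothetical and is not part of the project.

```c
/* Statement-by-statement C rendering of the linear assembly loop. */
int norm_sa_model(const short *A, int N)
{
    int m, sum;

    sum = 0;                /* zero  sum          */
    do {
        m = *A++;           /* ldh   *A++, m      */
        m = m * m;          /* mpy   m, m, m      */
        sum = m + sum;      /* add   m, sum, sum  */
        N = N - 1;          /* sub   N, 1, N      */
    } while (N != 0);       /* [N] b loop         */

    return sum;             /* .return sum        */
}
```

The do-while form mirrors the assembly: the loop body always runs once before the counter is tested, which is why N must be at least 1. Symbolic names such as m and sum play the role of the .reg variables, with register assignment left to the tools.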

Running in CCS 2: The C code is put into a project for running on the 6711 DSK as norm_sq_sa.pjt, and debugged and profiled
• The profiling results of the new norm_sq function are:
• This result is very similar to the assembly result (on the 6713, 90 cycles for the .sa and 91 for the .asm)
• With, say, -o3 optimization the linear assembly is faster by the ratio ?
• When debugging a linear assembly routine it is best to use the mixed mode to display assembly interlisted with C and/or linear assembly
• The registers window can then be used to watch what is happening when the code is stepped