TI DSP Optimization over Jacinto, Hank, 2015/06/09
Generation
● 2002 OMAP
● 2006 Jacinto 1
● 2008 Jacinto 3
● 2010 Jacinto 4/5
● 2016 Jacinto 6
OMAP
– Application:
● CD-DA and CD-ROM/DVD-ROM/USB/SD with MP3, WMA, and AAC audio decoder support
– Software platform:
● Cooperation with QNX Software Systems
Jacinto 1-DDR
– Application:
● Compressed video playback, Bluetooth A2DP audio streaming
– Improvement:
● C64x+ fixed-point for graphics acceleration, compressed audio decoding, voice recognition
Jacinto 3
– Application:
● Compressed video playback, Bluetooth A2DP audio streaming
– Hardware improvement:
● ARM Cortex-A8
● GPU PowerVR SGX
Jacinto 4/5
– Application:
● Full-HD 1080p video decode/encode
● QNX CAR 2 platform
– Hardware improvement:
● Dual ARM Cortex-M3, used for decoding video streams
● C674x DSP
Jacinto 6
– Application:
● Advanced Driver Assistance System (ADAS)
– Hardware improvement:
● ARM Cortex-A15
● DSP C66x
● GPU SGX544
DSP Generation
C6000 DSP Optimization
● Code generation tool support languages:
– ANSI C (C89)
– ISO C++ (C++98)
– C6000 DSP assembly
– C6000 linear assembly
Optimization: five key concepts
● Core (architecture)
– Parallel processing
● Pipeline
– High throughput
● Software pipelining
– Instruction scheduling
● Compiler optimization
● Optimized software library
– Intrinsic operations in C6000, inlined functions
C6000 Core
● 8 parallel functional units
– D: data load/store (.D1, .D2)
– S: shift, branch (.S1, .S2)
– M: multiply (.M1, .M2)
– L: logic, arithmetic operations (.L1, .L2)
C6000 Core (cont.)
● 32 32-bit registers for each side of the functional units
– A0-A31 (.D1, .S1, .M1, .L1)
– B0-B31 (.D2, .S2, .M2, .L2)
● Separate program and data memory (L1P, L1D)
● 256-bit internal program bus: fetches 8 32-bit instructions from L1P every cycle
● 2 64-bit internal data buses that allow both .D1 and .D2 to fetch data from L1D every cycle
Core optimization
(Figure: C++ source, pseudo assembly, and compiled parallel assembly)
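The kind of loop that benefits from the parallel functional units can be sketched in plain C. This is a minimal illustration, not the code from the original figure: the function name dot_product is hypothetical, and the mapping of operations to units in the comments is an assumption about what the compiler may choose.

```c
#include <stdint.h>

/* Dot product of two 16-bit vectors.
 * Each iteration needs two loads, one multiply and one add.
 * Assumption: on C6000 the loads could go to .D1/.D2, the multiply
 * to an .M unit and the add to an .L or .S unit, so that work from
 * several iterations can be issued in the same cycle. */
int32_t dot_product(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum = 0;
    int i;

    for (i = 0; i < n; i++) {
        sum += (int32_t)a[i] * b[i];
    }
    return sum;
}
```

Whether the units are actually filled depends on the compiler options and on what the compiler can prove about the pointers, which the later slides cover.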
Pipeline
F: fetch, D: decode, E: execute
C6000 pipeline
● Fetch, decode, and execute are divided into more substages: 4-stage fetch, 2-stage decode, 10-stage execute
Delay slots
● The pipeline cannot be kept full when
– The current instruction depends on the result of a previous instruction that takes more than 1 cycle
– A branch is performed
● Solutions
– Software scheduling (software pipelining)
– Hardware enhancement (SPLOOP buffer)
Software pipelining
● Enable
– For code written in C, just add the compiler option -o3 to enable software pipelining
● Drawback
– Assembly code size increases
● Solution
– Software pipeline loop buffer
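The idea behind software pipelining can be mimicked by hand in C. The sketch below is only an analogy: the compiler does this at the instruction level, and the function name scale_pipelined and the two-stage split are assumptions made for illustration. The load for iteration i+1 is started while iteration i is still being computed, which produces a prolog, a kernel and an epilog.

```c
#include <stdint.h>

/* Hand-written two-stage "pipeline": load ahead, then multiply/store.
 * Illustrative only; with -o3 the compiler performs the equivalent
 * scheduling on the generated instructions. */
void scale_pipelined(const int16_t *in, int32_t *out, int n, int16_t gain)
{
    int i;
    int16_t cur;

    if (n <= 0) {
        return;
    }

    cur = in[0];                          /* prolog: load for iteration 0 */

    for (i = 0; i < n - 1; i++) {         /* kernel: two iterations overlap */
        int16_t next = in[i + 1];         /* load for iteration i + 1 */
        out[i] = (int32_t)cur * gain;     /* multiply/store for iteration i */
        cur = next;
    }

    out[n - 1] = (int32_t)cur * gain;     /* epilog: drain the last iteration */
}
```

The extra prolog and epilog code is one reason the generated assembly grows, which is the drawback listed above.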
SPLOOP buffer
● Supported platforms
– C64x+, C674x, and C66x
● The SPLOOP buffer stores a single scheduled iteration of the loop in a specialized buffer
● The C compiler automatically utilizes SPLOOP
● Cannot handle loops that exceed 14 execute packets (at most 8 instructions per execute packet)
– Also not handled: nested loops, conditional branches inside loops, function calls inside loops
Compiler Optimization
● Use the C compiler to generate assembly code that utilizes the C6000 functional units and pipeline as fully as possible
– Add information and instructions that help the compiler optimize your code as far as possible
● Compiler options, e.g. -o3
● Keywords (C or C6000), e.g. restrict
● Pragma directives, e.g. MUST_ITERATE
– Understand compiler feedback
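As a small sketch of the keyword item above, restrict tells the compiler that the three arrays never overlap, so loads and stores from different iterations may be reordered. The function name vec_add is hypothetical and the example assumes a compiler that accepts the restrict keyword.

```c
#include <stdint.h>

/* Element-wise add. The restrict qualifiers are a promise by the
 * caller that a, b and out do not alias each other; passing
 * overlapping buffers would make the behavior undefined. */
void vec_add(const int16_t *restrict a,
             const int16_t *restrict b,
             int16_t *restrict out,
             int n)
{
    int i;

    for (i = 0; i < n; i++) {
        out[i] = (int16_t)(a[i] + b[i]);
    }
}
```

Without the qualifiers, the compiler has to assume that a store to out[i] might change a later a[i] or b[i], which limits how far iterations can overlap.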
Loop qualification (option -k -mw)
Compiler feedback (option -k -mw)
Dependency & resource information
● Minimize the iteration interval
– The loop-carried dependency bound
● Distance of the largest loop-carried path
– Partitioned resource bound
● Maximum number of cycles any functional unit is used in a single iteration
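The difference between a loop with and without a loop-carried dependency can be shown with two short functions. Both names (running_sum, add_offset) are hypothetical, and the actual bounds reported by the compiler depend on the device and options, so none are quoted here.

```c
#include <stdint.h>

/* Loop-carried dependency: y[i] needs y[i - 1] from the previous
 * iteration, so a new iteration cannot start until that value is
 * ready. This path sets a lower limit on the iteration interval. */
void running_sum(const int32_t *x, int32_t *y, int n)
{
    int i;

    if (n <= 0) {
        return;
    }
    y[0] = x[0];
    for (i = 1; i < n; i++) {
        y[i] = y[i - 1] + x[i];
    }
}

/* No loop-carried dependency: every iteration is independent, so
 * only the available functional units limit the overlap. */
void add_offset(const int32_t *x, int32_t *y, int n, int32_t offset)
{
    int i;

    for (i = 0; i < n; i++) {
        y[i] = x[i] + offset;
    }
}
```

As far as we understand the feedback, the iteration interval cannot be smaller than the larger of the two bounds listed above.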
Loop carried path
Explicit code optimization
● The previous solutions are suitable, except for
– Function calls in a loop
– Complex, hard-to-implement operations
● Solutions: explicit code optimization
– Intrinsic operations
– Optimized C6000 DSP libraries
– C inline functions
Intrinsic operations
● Sample
– The shuffle operation separates the even and odd bits of a 32-bit value into two variables
● Intrinsic operations
– Function-like statements
– Leading underscore, e.g. _shfl
– Not a function call, no branch needed
● Listed in "TMS320C6000 Optimizing Compiler v7.6 User's Guide"
● Device dependent
● _abs can be used directly
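The operation described in the sample can be written in portable C as below. This is a sketch of the plain-C version that an intrinsic is meant to replace; the function name split_even_odd_bits is hypothetical, and which C6000 intrinsic matches it exactly should be checked in the compiler user's guide rather than taken from this example.

```c
#include <stdint.h>

/* Collects the even-numbered bits (0, 2, 4, ...) of value into *even
 * and the odd-numbered bits (1, 3, 5, ...) into *odd.
 * The loop runs 16 times; a single intrinsic on a device that
 * supports it would avoid the loop entirely. */
void split_even_odd_bits(uint32_t value, uint16_t *even, uint16_t *odd)
{
    uint16_t e = 0;
    uint16_t o = 0;
    int i;

    for (i = 0; i < 16; i++) {
        e |= (uint16_t)(((value >> (2 * i)) & 1u) << i);
        o |= (uint16_t)(((value >> (2 * i + 1)) & 1u) << i);
    }
    *even = e;
    *odd = o;
}
```

For example, the input 0xAAAAAAAA has only odd-numbered bits set, so the odd output becomes 0xFFFF and the even output becomes 0.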
Optimized DSP software libraries
● Foundational math & signal processing
– MathLIB
– IQMath
– FastRTS
– DSPLIB: adaptive filtering, matrix computations
● Image & video processing
– IMGLIB
– Video Analytics & Vision Library (VLIB)
– VICP Signal Processing Library
Inline functions
● Pros
– Reduces the overhead of a function call
– Lets the optimizer perform loop optimization
● Cons
– Code size increases
● To use
– Use -O2 or -O3 to inline functions automatically
– Use the explicit inline keyword
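A minimal sketch of the explicit inline keyword: the helper below is small enough that inlining removes the call from the loop body. The names saturate16 and add_saturated are hypothetical.

```c
#include <stdint.h>

/* Small helper marked inline so that the loop below contains no
 * function call after inlining. */
static inline int16_t saturate16(int32_t x)
{
    if (x > INT16_MAX) {
        return INT16_MAX;
    }
    if (x < INT16_MIN) {
        return INT16_MIN;
    }
    return (int16_t)x;
}

/* Adds two vectors with saturation to the 16-bit range. */
void add_saturated(const int16_t *a, const int16_t *b, int16_t *out, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        out[i] = saturate16((int32_t)a[i] + b[i]);
    }
}
```

Once the helper is inlined, the loop no longer contains a call, which is one of the conditions listed earlier for loop optimization.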
Optimization flow
Profiling
Optimization practice
● Use -o3 and consider -mt for optimization; use -k and consider -mw for compiler feedback (-mt: assume all pointers in a loop are independent)
● Apply the restrict keyword to minimize the loop-carried dependency bound (alternative to -mt)
● Use the MUST_ITERATE and UNROLL pragmas to optimize pipeline usage
● Choose the smallest applicable data type and ensure proper data alignment to help the compiler invoke Single Instruction Multiple Data (SIMD) operations
● Use intrinsic operations and TI libraries in case major code modification is needed (avoid standard I/O functions)
Using pragma
● Without a minimum iteration count, the compiler must assume the loop may iterate only once
– Providing a factor gives the compiler the freedom to unroll the loop
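A sketch of how the pragma is placed directly before a loop. The function name sum_array and the chosen limits (at least 8 iterations, at most 1024, always a multiple of 8) are assumptions for illustration; the caller has to guarantee them. The pragma is guarded so that other compilers, which do not know it, simply skip it.

```c
#include <stdint.h>

/* Sums n 16-bit samples.
 * Assumed contract: n is a multiple of 8 with 8 <= n <= 1024. */
int32_t sum_array(const int16_t *data, int n)
{
    int32_t sum = 0;
    int i;

#ifdef __TI_COMPILER_VERSION__
    /* MUST_ITERATE(min, max, multiple): trip count information for
     * the TI compiler. It does not change what the loop computes. */
#pragma MUST_ITERATE(8, 1024, 8)
#endif
    for (i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}
```

If the promise is broken, for example by calling the function with n = 3, the optimized loop may produce wrong results, so the limits should match the real call sites.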
Unbalanced resource partition
Manual unroll
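Manual unrolling by a factor of two can be sketched as follows. This is not the code from the original slide: the function name dot_product_unrolled is hypothetical, and the idea that the two accumulators let the compiler spread work over both sides of the core is an assumption about the generated schedule.

```c
#include <stdint.h>

/* Dot product unrolled by two with separate accumulators.
 * The two partial sums are independent of each other, so the two
 * multiply/add chains can proceed side by side. */
int32_t dot_product_unrolled(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum0 = 0;
    int32_t sum1 = 0;
    int i;

    for (i = 0; i + 1 < n; i += 2) {
        sum0 += (int32_t)a[i] * b[i];
        sum1 += (int32_t)a[i + 1] * b[i + 1];
    }
    if (i < n) {
        /* Leftover element when n is odd. */
        sum0 += (int32_t)a[i] * b[i];
    }
    return sum0 + sum1;
}
```

The result is the same as that of the straightforward loop; only the order in which the partial sums are accumulated changes.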
Compiler unroll
Reference
● Texas Instruments, "Introduction to TMS320C6000 DSP Optimization"
– Recommended to read first
● Texas Instruments, "In-Vehicle Connectivity is So Retro"
● Texas Instruments, "TMS320C6000 Programmer's Guide"
● "TMS320C6000 Optimizing Compiler v7.6 User's Guide" – intrinsic operations
● http://processors.wiki.ti.com/index.php/Software_libraries – optimized library list