TI DSP optimization on Jacinto

Hank

2015/06/09

Generation

● 2002 OMAP

● 2006 Jacinto 1

● 2008 Jacinto 3

● 2010 Jacinto 4/5

● 2016 Jacinto 6

OMAP

– Application:

● CD-DA and CD-ROM/DVD-ROM/USB/SD with MP3, WMA, and AAC audio decoder support

– Software platform:

● Cooperation with QNX Software Systems

Jacinto 1-DDR

– Application:

● Compressed video playback, Bluetooth A2DP audio streaming

– Improvement:

● C64x+ fixed-point for graphics acceleration, compressed audio decoding, voice recognition

Jacinto 3

– Application:

● Compressed video playback, Bluetooth A2DP audio streaming

– Hardware improvement:

● ARM Cortex-A8

● GPU PowerVR SGX

Jacinto 4/5

– Application:

● Full-HD 1080p video decode/encode

● QNX CAR 2 platform


● Dual ARM Cortex-M3, used for decoding video streams

● C674x DSP

Jacinto 6

– Application:

● Advanced Driver Assistance System (ADAS)


● ARM Cortex-A15

● DSP C66x

● GPU SGX544

DSP Generation

C6000 DSP Optimization

● Code generation tool support languages:

– ANSI C (C89)

– ISO C++ (C++98)

– C6000 DSP assembly

– C6000 linear assembly

Optimization: five key concepts

● Core (architecture)

– Parallel processing

● Pipeline

– High throughput

● Software pipelining

– Instruction scheduling

● Compiler optimization

● Optimized software library

– Intrinsic operations in C6000, inlined functions

C6000 Core

● 8 parallel functional units

– D: data load/store (.D1, .D2)

– S: shift, branch (.S1, .S2)

– M: multiply (.M1, .M2)

– L: logic, arithmetic operations (.L1, .L2)

C6000 Core (cont.)

● 32 32-bit registers for each side of the functional units

– A0-A31 (.D1, .S1, .M1, .L1)

– B0-B31 (.D2, .S2, .M2, .L2)

● Separate program and data memory (L1P, L1D)

● 256-bit internal program bus: fetches 8 32-bit instructions from L1P every cycle

● 2 64-bit internal data buses that allow both .D1 and .D2 to fetch data from L1D every cycle

Core optimization

[Figure: C++ source, compiled parallel assembly, and pseudo assembly]

Pipeline

F: fetch D: decode E: execute

C6000 pipeline

● Fetch, decode, and execute are each divided into substages: 4-stage fetch, 2-stage decode, and up to 10-stage execute

Delay slots

● The pipeline cannot be kept full when

– The current instruction depends on the result of a previous instruction that takes more than 1 cycle

– A branch is performed

● Solution

– Software scheduling (software pipelining)

– Hardware enhancement (SPLOOP buffer)

Software pipelining

● Enable

– For code in C, just add the compiler option -o3 to enable software pipelining

● Drawback

– Assembly code size increases

● Solution

– Software pipeline loop buffer
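As a rough illustration of the kind of loop the compiler can software-pipeline, the sketch below is a plain C dot product: one innermost loop with no function calls and no branches in the body. The function name `dot_product` is made up for this example, and the build line in the comment assumes TI's `cl6x` compiler driver with the options named on these slides.

```c
/* Hypothetical example: 16-bit dot product.
 * Assumed build line on a TI toolchain: cl6x -o3 -k -mw dot.c
 * The loop body has no calls and no branches, which makes it a
 * natural candidate for software pipelining at -o3. */
int dot_product(const short *a, const short *b, int n)
{
    int sum = 0;
    int i;
    for (i = 0; i < n; i++) {
        /* Per iteration: 2 loads, 1 multiply, 1 add. */
        sum += a[i] * b[i];
    }
    return sum;
}
```

Calling `dot_product` on the vectors {1, ..., 8} and {8, ..., 1} returns 120. Whether the loop is actually pipelined, and at what iteration interval, is reported in the assembly file kept by -k.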

SPLOOP buffer

● Supported platforms

– C64x+, C674x, and C66x

● The SPLOOP buffer stores a single scheduled iteration of the loop in a specialized buffer

● The C compiler automatically uses SPLOOP

● Cannot handle

– Loops that exceed 14 execute packets (at most 8 instructions per execute packet)

– Nested loops, conditional branches inside loops, function calls inside loops

Compiler Optimization

● Use the C compiler to generate assembly code that utilizes the C6000 functional units and pipeline as fully as possible

– Add extra information and instructions to help the compiler optimize your code as far as possible

● Compiler options, e.g. -o3

● Keywords (C or C6000), e.g. restrict

● Pragma directives, e.g. MUST_ITERATE

– Understand compiler feedback
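A minimal sketch of the keyword route, assuming a simple element-wise vector sum (the function `vec_add` is hypothetical, not taken from the slides). Qualifying the pointers with `restrict` tells the compiler that the three buffers never overlap, so a store to `out` cannot feed a later load from `a` or `b`.

```c
/* Hypothetical example: element-wise sum of two 16-bit vectors.
 * `restrict` is a promise from the programmer that out, a and b do
 * not alias; the compiler does not check it. */
void vec_add(short *restrict out, const short *restrict a,
             const short *restrict b, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        out[i] = (short)(a[i] + b[i]);
    }
}
```

The promise must actually hold: calling `vec_add` with overlapping buffers is undefined behavior, which is why `restrict` is applied per pointer rather than globally.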

Loop qualification (option -k -mw)

Compiler feedback (option -k -mw)

Dependency & resource information

● Minimize the iteration interval

– Loop-carried dependency bound

● Length of the longest loop-carried path

– Partitioned resource bound

● Maximum number of cycles any functional unit is used in a single iteration
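The two bounds above combine into a lower limit on the iteration interval the scheduler can reach. Written as a formula, with symbol names chosen here for illustration ($ii$ for the iteration interval, $B_{\text{dep}}$ for the loop-carried dependency bound, $B_{\text{res}}$ for the partitioned resource bound):

```latex
\[
  ii \;\ge\; \max\bigl(B_{\text{dep}},\; B_{\text{res}}\bigr)
\]
% Worked example (assumed loop: two loads and one store per iteration,
% no loop-carried path):
\[
  B_{\text{dep}} = 0, \qquad
  B_{\text{res}} = \left\lceil \tfrac{3}{2} \right\rceil = 2
  \quad\Longrightarrow\quad ii \ge 2
\]
```

In the worked case the three memory operations must share the two .D units, so one unit is busy for two cycles per iteration, and that unit sets the bound.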

Loop carried path

Explicit code optimization

● The previous approach works, except for

– Function calls in a loop

– Complex, hard-to-implement operations

● Solution: explicit code optimization

– Intrinsic operations

– Optimized C6000 DSP libraries

– C inline functions

Intrinsic operations

● Sample

– Shuffle operation separates even and odd bits of a 32-bit value into two variables

● Intrinsic operations

– Function-like statements

– Leading underscore, e.g. _shfl

– Not a function call, no branch needed

● Listed in “TMS320C6000 Optimizing Compiler v7.6 User's Guide”

● Device dependent

● _abs can be used directly
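To show what "function-like, but not a function call" looks like in source, here is a small sketch using the `_sadd` (saturating add) intrinsic. It uses `_sadd` rather than the slide's `_shfl` sample only because saturation is easy to mirror in portable C. The wrapper name `sat_add` and the fallback branch are assumptions added for this example so the same file also builds on a host compiler; `_TMS320C6X` is the macro the TI C6000 compiler predefines.

```c
#include <limits.h>

/* Saturating 32-bit add.
 * With the TI C6000 compiler the intrinsic is used directly and
 * needs no call or branch. The #else branch is a portable fallback
 * written for this example only. */
int sat_add(int a, int b)
{
#ifdef _TMS320C6X
    return _sadd(a, b);
#else
    long long sum = (long long)a + (long long)b;
    if (sum > INT_MAX) return INT_MAX;   /* clamp on positive overflow */
    if (sum < INT_MIN) return INT_MIN;   /* clamp on negative overflow */
    return (int)sum;
#endif
}
```

For example, `sat_add(INT_MAX, 1)` yields `INT_MAX` instead of wrapping around.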

Optimized DSP software libraries

● Foundational math & signal processing

– MathLIB

– IQMath

– FastRTS

– DSPLIB

● Adaptive filtering, matrix computations

● Image & video processing

– IMGLIB

– Video Analytics & Vision Library (VLIB)

– VICP Signal Processing Library

Inline functions

● Pros

– Reduces the overhead of a function call

– Lets the optimizer perform loop optimization

● Cons

– Code size increases

● To use

– Use -O2 or -O3 to inline functions automatically

– Use the explicit inline keyword
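A short sketch of the explicit route, assuming a Q15 gain stage (both `clamp16` and `scale_q15` are hypothetical names for this example). Once the helper is inlined, the loop no longer contains a function call, which is one of the conditions that otherwise blocks loop optimization.

```c
/* Hypothetical helper: clamp a 32-bit value to the 16-bit range.
 * `static inline` allows the compiler to paste the body into the
 * caller instead of emitting a call. */
static inline short clamp16(int x)
{
    if (x > 32767)  return 32767;
    if (x < -32768) return -32768;
    return (short)x;
}

/* Multiply each sample by a Q15 gain and saturate to 16 bits.
 * Assumes >> on a negative int is an arithmetic shift, which is
 * the usual behavior but implementation-defined in C. */
void scale_q15(short *out, const short *in, short gain, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        out[i] = clamp16((in[i] * gain) >> 15);
    }
}
```

With a gain of 16384 (0.5 in Q15), an input of 16384 becomes 8192; the clamp only matters for the corner case -32768 times -32768, which would otherwise overflow to 32768.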

Optimization flow

Profiling

Optimization practice

● Use -o3 and consider -mt for optimization; use -k and consider -mw for compiler feedback (-mt: assume all pointers in a loop are independent)

● Apply the restrict keyword to minimize loop carried dependency bound (alternative to mt)

● Use the MUST_ITERATE and UNROLL pragmas to optimize pipeline usage

● Choose the smallest applicable data type and ensure proper data alignment to help the compiler invoke Single Instruction Multiple Data (SIMD) operations

● Use intrinsic operations and TI libraries in case major code modification is needed (avoid standard I/O functions)
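The data type and alignment advice can be sketched as follows, assuming 16-bit samples in an 8-byte aligned buffer. The macro `ASSUME_ALIGNED8` and the function `sum_samples` are hypothetical names for this example; on the TI compiler the macro expands to the `_nassert` intrinsic, and on any other compiler it expands to nothing so the file still builds.

```c
#ifdef _TMS320C6X
/* TI build: tell the optimizer the pointer is 8-byte aligned. */
#define ASSUME_ALIGNED8(p) _nassert(((int)(p) & 0x7) == 0)
#else
/* Host build: the hint is dropped. */
#define ASSUME_ALIGNED8(p) ((void)0)
#endif

/* Sum 16-bit samples. Using short instead of int means four
 * samples fit in one 64-bit load, if the compiler chooses to use
 * wide loads for this loop. */
int sum_samples(const short *restrict x, int n)
{
    int sum = 0;
    int i;
    ASSUME_ALIGNED8(x);
    for (i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum;
}
```

The hint is only a promise, so the caller has to supply a buffer that really is aligned, for instance one declared with TI's `DATA_ALIGN` pragma. Whether wider loads are actually emitted depends on the compiler version and target, and is best confirmed in the -k output.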

Using pragma

● Without a minimum iteration count, the compiler has to assume the loop may iterate only once

– Providing a factor (a multiple of the trip count) gives the compiler freedom to unroll the loop
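A possible use of the two pragmas, sketched on a hypothetical `scale_by_two` loop. `MUST_ITERATE(8, , 4)` promises at least 8 iterations and a trip count that is a multiple of 4; `UNROLL(4)` asks for four copies of the body. The `#ifdef` guard is an addition for this example so that a host compiler does not see the TI-specific pragmas.

```c
/* Hypothetical example: double every 16-bit sample.
 * On a TI build the caller must honor the promise made below:
 * n >= 8 and n is a multiple of 4. */
void scale_by_two(short *restrict out, const short *restrict in, int n)
{
    int i;
#ifdef _TMS320C6X
    /* Promise: at least 8 iterations, always a multiple of 4. */
    #pragma MUST_ITERATE(8, , 4)
    /* Request: unroll the loop body 4 times. */
    #pragma UNROLL(4)
#endif
    for (i = 0; i < n; i++) {
        out[i] = (short)(in[i] * 2);
    }
}
```

With the minimum count known, the compiler no longer needs a fallback path for very short loops, and the multiple of 4 means the unrolled loop needs no clean-up iterations.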

Unbalanced resource partition

Manual unroll

Compiler unroll

Reference

● Texas Instruments, 『Introduction to TMS320C6000 DSP Optimization』

– Recommended to read first

● Texas Instruments, 『In-Vehicle Connectivity is So Retro』

● Texas Instruments, 『TMS320C6000 Programmer's Guide』

● 『TMS320C6000 Optimizing Compiler v7.6 User's Guide』 - Intrinsic operations

● http://processors.wiki.ti.com/index.php/Software_libraries – optimized library list
