Introducing the ConnX D2 DSP Engine


Introduced: August 24, 2009

Copyright © 2009, Tensilica, Inc.

Fastest Growing Processor / DSP IP Company

Customizable Dataplane Processor/DSP IP Licensing
– Leading provider of customizable Dataplane Processor Units (DPUs)
– Unique combination of processor & DSP IP cores + software design tools
– Customization enables improved power, cost, performance
– Standard DPU solutions for audio, video/imaging & baseband comms
– Dominant patent portfolio for configurable processor technology

Broad-Based Success
– 150+ Licensees, including 5 of the top 10 semiconductor companies
– Shipping in high volume today (>200M/yr rate)
– Fastest growing Semiconductor Processor IP company (per Gartner, Jan-09)
  • 21% revenue growth in 2007, 25% in 2008



Focus: Dataplane Processing Units (DPUs)

[Figure: SOC partitioning. Labels: Main Applications CPU; Embedded Controller for Dataplane Processing. Tensilica focus: Dataplane Processors]

DPUs: Customizable CPU+DSP delivering 10 to 100x higher performance than CPU or DSP and providing better flexibility & verification than RTL



Communications DSP Trends / Challenges


Markets Changing Faster

• Market requirements in flux as economy wobbles

• Emerging standards evolve faster in the Internet age

Development Teams Shrink

• SOC development schedules tightening

• Tightening resource constraints (do more with less)

Code Size Increases

• Communications standards growing in number & complexity

• DSP algorithm code heavily integrated with more (and more complex) control code

Maintenance and flexibility push DSP algorithms towards C code


Trends Within Licensable DSP Architectures

1st Generation Licensable DSP Cores
• Modest/Medium performance (single/dual MAC)
• Simple architecture (single issue, compound instructions)
• Limited or no compiler support (mostly hand coded)

2nd Generation Licensable DSP Cores
• Added RISC-like architecture features (register arrays)
• Improved compiler targets, but still assembly
• Some offer wide VLIW for performance
  • Large area; code bloat
• Some offer wide SIMD for performance
  • Good area/performance tradeoff
  • No performance when vectorization fails



Vectorization Benefits (SIMD)

• Loop counts can be reduced

• Data computation can be done in parallel

• Cheapest (hardware cost) method to get higher performance

Example: 2-way SIMD performance benefit

[Figure: Before vectorization, single execution processes Data0 through Data7 one element at a time. After vectorization, 2-way SIMD execution processes two lanes in parallel: Data0, Data2, Data4, Data6 in one lane and Data1, Data3, Data5, Data7 in the other.]
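
The 2-way SIMD idea on this slide can be sketched in portable C. This is only an illustration, not ConnX D2 code: the function names `sum_scalar` and `sum_2way` are hypothetical, and the two-lane loop simply shows how the iteration count is halved when two independent elements are handled per pass. A vectorizing compiler targeting 2-way SIMD hardware could map each pair of lane operations onto a single instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar form: one element per iteration, n iterations. */
int32_t sum_scalar(const int16_t *data, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* 2-way form: two elements per iteration, n/2 iterations.
 * The two lanes are independent of each other, so 2-way SIMD
 * hardware can perform both additions at the same time. */
int32_t sum_2way(const int16_t *data, size_t n)
{
    assert(n % 2 == 0);            /* sketch assumes an even element count */
    int32_t lane0 = 0, lane1 = 0;
    for (size_t i = 0; i < n; i += 2) {
        lane0 += data[i];          /* Data0, Data2, Data4, Data6 */
        lane1 += data[i + 1];      /* Data1, Data3, Data5, Data7 */
    }
    return lane0 + lane1;          /* combine the lanes once, after the loop */
}
```

Both functions return the same result; only the loop count differs (8 iterations versus 4 for the eight data items in the figure).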



VLIW Technology


• Parallel execution of instructions
• Effective use of multiple ALUs/MACs
• Compiler allocates instructions to VLIW slots
• Orthogonal allocation yields more flexibility

[Figure: Without VLIW, a single execution ALU runs Instruction #1 through Instruction #4 in sequence. With VLIW, execution ALU1 runs Instruction #1 and Instruction #3 while execution ALU2 runs Instruction #2 and Instruction #4 in parallel.]


Ideal 3rd Generation Licensable DSP

Ideal Characteristics
• VLIW capability for good performance on general code
  • Parallelization of independent operations
• SIMD capability for good performance on loop code
  • Data parallel execution
• Good C compiler target
  • Reduce or eliminate the need for assembly programming
  • Productivity benefit
• Small, compact size
  • Keep costs down in brutally competitive markets



Tensilica - the Stealth DSP Company

[Figure: Tensilica DSP product map. Market columns: Comms, Audio, Video, Xtensa: Other Markets. Performance levels: Single MAC, Dual MAC, Quad MAC, 8 MAC, 8 MAC and more, 16 MAC. Products and options shown: ConnX D2, ConnX Vectra LX, ConnX 545CK DSP, ConnX BBE, HiFi 2, 388VDO, Xtensa TIE (Custom DSPs), DSP Building Blocks (MAC16, MUL32, DIV32), Floating Point HW (Single Precision Floating Point Unit, Double Precision Acceleration).]



ConnX D2 DSP Engine - Overview

• Dual 16b MAC Architecture with Hybrid SIMD / VLIW
  • Optimum performance on a wide range of algorithms
  • SIMD offers high data computation rate for DSP algorithms
  • 2-way VLIW allows parallel instruction execution on SIMD and scalar code

• “Out of the Box” industry standard software compatibility
  • TI C6x fixed-point C intrinsics supported
  • Fully bit for bit equivalent with TI C6x
  • ITU reference code fixed point C intrinsics directly supported

• Goals: Ease of Use, Low Area/Cost
  • Click and go “Out of the Box” performance from standard C code
  • Standard C and fixed point data types - 16-bit, 32-bit and 40-bit
  • Advanced optimizing, vectorizing compiler
  • Less than 70K gates (under 0.2 mm² in 65nm)



Target Applications: ConnX D2

• Embedded control

• VoIP gateways, voice-over-networks (including VoIP codecs)

• Femto-cell and pico-cell base stations

• Next generation disk drives, data storage

• Mobile terminals and handsets

• Home entertainment devices

• Computer peripherals, printers

General purpose 16-bit DSP for a wide range of applications



ConnX D2 DSP: An ingredient of an Xtensa DPU

Hardware Use Model

• Click-button configuration option within Xtensa LX core

• Part of the Tensilica configurable core deliverable package

• Two reference configurations

• Typical DSP solution for high performance

• Small size for cost and power sensitive applications

• Full tool support from Tensilica

• High level simulators (SystemC), ISS and RTL

• Debugger and Trace

• Compiler, IDE and Operating Systems



ConnX D2 Processor Block Diagram (Typical)



ConnX D2 Engine Architecture

[Figure: ConnX D2 engine datapath. Blocks shown: AR Register Bank (32 bits); Local Memory and/or Cache; two Load/Store Unit blocks with 32-bit paths; XDU Alignment Registers (4 x 32 bits); XDD Register File (8 x 40-bits); Overflow State; Carry State; Hi / Lo 16-bit select. Two identical MAC datapaths, each with 16b x 16b multiplier (X) producing 32b, adder (+), Shift / Saturation, rounding, and an Accumulator (up to 40-bits) in a DR Register.]

Supported data types
• 16-bit vector (2 x 16-bits)
• 8-bit (4 x 8-bit)
• 40-bit, 32-bit & 16-bit fixed
• 40-bit, 32-bit & 16-bit integer
• Complex: 16-bit imaginary, 16-bit real

Addressing Modes
• Immediate
• Immediate updating
• Indexed
• Indexed updating
• Aligning updating
• Circular (instruction)
• Bit-reversed (instruction)


DSP specific instructions

Add-Bit-Reverse-Base and Add-Subtract : Useful for FFT implementation

Add-Compare-Exchange : Useful for Viterbi implementation

Add-Modulo : Circular buffer implementation. Useful for FIR implementation
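
What these instruction classes compute can be sketched in portable C. The slide does not give the exact operand semantics of the ConnX D2 instructions, so the functions below are textbook versions under assumed names (`add_modulo`, `add_compare_exchange`, `bit_reverse`): a wrapping circular-buffer index update, a Viterbi add-compare-select step that keeps the smaller path metric (assuming lower is better), and the bit-reversed index used to reorder radix-2 FFT data.

```c
#include <assert.h>
#include <stdint.h>

/* Add-Modulo: advance a circular-buffer index with wraparound. */
unsigned add_modulo(unsigned index, unsigned step, unsigned length)
{
    assert(length > 0 && index < length && step <= length);
    unsigned next = index + step;
    return next >= length ? next - length : next;
}

/* Add-Compare-Exchange (textbook Viterbi add-compare-select):
 * extend two candidate path metrics by their branch metrics and
 * keep the smaller one. *chose_b records which path survived. */
int32_t add_compare_exchange(int32_t metric_a, int32_t branch_a,
                             int32_t metric_b, int32_t branch_b,
                             int *chose_b)
{
    int32_t a = metric_a + branch_a;
    int32_t b = metric_b + branch_b;
    *chose_b = b < a;
    return *chose_b ? b : a;
}

/* Bit-reversed index for a radix-2 FFT of 2^bits points. */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    assert(bits > 0 && bits <= 16);
    unsigned reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}
```

In software each of these costs several operations per step (add, compare, conditional select, or a bit loop), which is why DSPs commonly provide them as single instructions inside FIR, Viterbi and FFT inner loops.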


ConnX D2 : Instruction Allocation Options

Instruction format             Instructions available
16-bit Instructions            Base ISA
24-bit Instructions            Base ISA or ConnX D2
VLIW Instructions (64-bits)    Slot 0: ConnX D2 or Base ISA
                               Slot 1: ConnX D2 or Base ISA (register moves & C ops on register data)

• Flexible allocation of instructions available to compiler
  • Optimum use of VLIW slots (ConnX D2 or base ISA instructions)
  • Improved performance and no code bloat (reduced NOPs)
  • Reduce code size when algorithm is less performance intensive
• Modeless switching between instruction formats


ConnX D2 : SIMD with VLIW – Extra Performance

Example : Energy Calculation

$A = \sum_{n=0}^{127} X_n \cdot X_n$

Combining SIMD and VLIW can give more than 6 times the performance (416 cycles vs. 64 cycles)

• Vectorization and SIMD give double the data computation performance
• VLIW gives 2 pipeline executions (one is SIMD) with auto-increment loads
• ConnX D2 architecture gives this combination and performance

128 iteration C algorithm

Base Xtensa Configuration: 416 cycles

    loopgtz a3,.LBB52_energy    # [3]
    l16si   a3,a2,2             # [0*II+0] id:16 a+0x0
    l16si   a5,a2,4             # [0*II+1] id:16 a+0x0
    l16si   a6,a2,6             # [0*II+2] id:16 a+0x0
    l16si   a7,a2,8             # [0*II+3] id:16 a+0x0
    mul16s  a3,a3,a3            # [0*II+4]
    mul16s  a5,a5,a5            # [0*II+5]
    mul16s  a6,a6,a6            # [0*II+6]
    mul16s  a7,a7,a7            # [0*II+7]
    addi.n  a2,a2,8             # [0*II+8]
    add.n   a3,a4,a3            # [0*II+9]
    add.n   a3,a3,a5            # [0*II+10]
    add.n   a3,a3,a6            # [0*II+11]
    add.n   a4,a3,a7            # [0*II+12]

ConnX D2: 64 cycles

One instruction (64-bit VLIW instruction) per loop iteration:

    loop { # format XD2_FLIX_FORMAT
        xd2_la.d16x2s.iu        xdd0,xdu0,a4,4;
        xd2_mulaa40.d16s.ll.hh  xdd1,xdd0,xdd0
    }

Slot0: Instruction Execution (Control)    Slot1: SIMD Computation
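
The cycle counts on this slide follow from the loop structure: the base Xtensa loop is unrolled by four and issues 13 instructions per pass, so 128 samples take 32 x 13 = 416 cycles, while the ConnX D2 loop consumes two samples per single-cycle bundle, giving 64 cycles. The C sketch below mirrors that pairing. It is an illustration only: the function names are hypothetical, and a 64-bit integer stands in for the 40-bit accumulator.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: one multiply-accumulate per iteration. */
int64_t energy_scalar(const int16_t *x, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * x[i];
    return acc;
}

/* Dual-MAC form: each iteration loads a pair of 16-bit samples and
 * accumulates lo*lo + hi*hi, so n samples need n/2 iterations
 * (64 iterations for the 128-sample example). */
int64_t energy_dual_mac(const int16_t *x, int n)
{
    assert(n % 2 == 0);          /* sketch assumes an even sample count */
    int64_t acc = 0;             /* stands in for the 40-bit accumulator */
    for (int i = 0; i < n; i += 2) {
        int32_t lo = x[i];
        int32_t hi = x[i + 1];
        acc += (int64_t)lo * lo + (int64_t)hi * hi;
    }
    return acc;
}
```

The two functions always agree on the result; the dual-MAC form only changes how many samples are retired per loop pass.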


When Vectorization is Not Possible
Performance for scalar code bases

    int energy(short *a, int col, int cols, int rows)
    {
        int i;
        int sum = 0;
        for (i = 0; i < rows; i++) {
            sum += a[cols*i+col] * a[cols*i+col];
        }
        return sum;
    }

• Energy computation of column ‘col’ in 2-D array
• Above code loop cannot be vectorized
  • Non-contiguous memory accesses thwart vectorizers
  • Regular compilers cannot map this code onto traditional SIMD DSPs



When Vectorization is Not Possible
Performance for scalar code bases

• Confirmed that ConnX D2 and TI C6x compilers cannot vectorize this code

• ConnX D2 compiler can however use VLIW to increase performance

C-Code

    int energy(short *a, int col, int cols, int rows)
    {
        int i;
        int sum = 0;
        for (i = 0; i < rows; i++) {
            sum += a[cols*i+col] * a[cols*i+col];
        }
        return sum;
    }

Generated Assembly Code

    entry   a1,32
    blti    a5,1,.Lt_0_2306
    addx2   a2,a3,a2
    slli    a3,a4,1
    addi.n  a4,a5,-1
    sub     a2,a2,a3
    { # format XD2_FLIX_FORMAT
        xd2_l.d16s.xu           xdd0,a2,a3
        xd2_movi.d40            xdd1,0
    }
    loopgtz a4,.LBB43_energy
    { # format XD2_FLIX_FORMAT
        xd2_l.d16s.xu           xdd0,a2,a3 ;
        xd2_mula32.d16s.ll_s1   xdd1,xdd0,xdd0
    }
    …………

ConnX D2 : One cycle within loop

• Load scalar 16-bits: xdd0 is loaded with the memory contents addressed by the a2 register. The a2 register value is updated by the value in a3.

• MAC operation on lower 16-bits: multiplies xdd0 with xdd0. The accumulated result is stored in xdd1.
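
One way to read the generated code is as a software-pipelined loop: the first element is loaded before the loop, and each bundle then loads the next element while multiplying and accumulating the one already loaded. The C sketch below models that reading. It is an interpretation, not compiler output: the name `energy_pipelined` is hypothetical, and the final multiply-accumulate after the loop is an assumption, since the slide elides the code that follows the loop.

```c
#include <assert.h>

/* Column energy with the load for row i+1 issued alongside the
 * multiply-accumulate for row i, as a two-slot bundle might do. */
int energy_pipelined(const short *a, int col, int cols, int rows)
{
    assert(cols > 0 && col >= 0 && col < cols);
    if (rows < 1)
        return 0;                      /* early exit for an empty column */

    const short *p = a + col;          /* address of a[cols*0 + col] */
    short current = *p;                /* prologue: first load */
    int sum = 0;

    for (int i = 0; i < rows - 1; i++) {
        p += cols;                     /* strided address update */
        short next = *p;               /* slot 0: load the next element */
        sum += current * current;      /* slot 1: MAC on the previous load */
        current = next;
    }
    sum += current * current;          /* epilogue: last MAC (assumed) */
    return sum;
}
```

The loop body contains one load and one multiply-accumulate that do not depend on each other, which is what lets a 2-slot VLIW machine issue them together even though the strided access prevents SIMD vectorization.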



Optimization with ITU / TI Intrinsics
Performance for generic code bases

Energy calculation loop: 1000 iterations, using the L_mac ITU intrinsic

    #define ASIZE 1000
    extern int a[ASIZE];
    extern int red;
    void energy()
    {
        int i;
        int red_0 = red;
        for (i = 0; i < ASIZE; i++) {
            red_0 = L_mac(red_0, a[i], a[i]);
        }
        red = red_0;
    }

Generated Assembly Code

    entry   a1,32
    l32r    a2,.LC1_40_18
    l32r    a5,.LC0_40_17
    xd2_l.d16x2s.iu xdd0,a2,4           test_arr_1+0x0
    l32i.n  a3,a5,0                     test_global_red_0+0x0
    { # format XD2_ARUSEDEF_FORMAT
        xd2_mov.d32.a32s            xdd1,a3
        movi                        a3,499
    }
    loopgtz a3,
    { # format XD2_FLIX_FORMAT
        xd2_l.d16x2s.iu             xdd0,a2,4;
        xd2_mulaa.fs32.d16s.ll.hh   xdd1,xdd0,xdd0
    }
    ..........

• L_mac maps to one ConnX D2 instruction
• Compiler further optimizes by using SIMD to accelerate the loop
• VLIW further accelerates the loop with parallel loads
• 1000-iteration C loop optimized to a 500-cycle loop

Sustained 3 operations / cycle
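
For readers unfamiliar with the ITU basic operators, the sketch below shows what `L_mac` computes in plain C: a fractional multiply (the 16 x 16 product shifted left by one, saturated in the single overflow case) followed by a saturating 32-bit add. It follows the published definition of the ITU-T basic operators but is not the reference source; the `_sketch` suffixes and the `energy_sketch` helper are hypothetical names used to avoid clashing with a real intrinsic header.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t Word16;
typedef int32_t Word32;

/* Saturating 32-bit add. */
static Word32 L_add_sketch(Word32 a, Word32 b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (Word32)s;
}

/* Fractional multiply: (a * b) << 1. The only overflow case is
 * -32768 * -32768, which saturates to INT32_MAX. */
static Word32 L_mult_sketch(Word16 a, Word16 b)
{
    int64_t p = ((int64_t)a * b) * 2;
    return p > INT32_MAX ? INT32_MAX : (Word32)p;
}

/* L_mac: acc + L_mult(a, b), with saturation at both steps. */
Word32 L_mac_sketch(Word32 acc, Word16 a, Word16 b)
{
    return L_add_sketch(acc, L_mult_sketch(a, b));
}

/* The energy loop from the slide, written against the sketch. */
Word32 energy_sketch(const Word16 *a, int n, Word32 seed)
{
    assert(n >= 0);
    Word32 acc = seed;
    for (int i = 0; i < n; i++)
        acc = L_mac_sketch(acc, a[i], a[i]);
    return acc;
}
```

Emulated this way, each `L_mac` costs a multiply, a shift, an add and two saturation checks, which is the work the slide says ConnX D2 folds into one instruction.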


“Out of the Box” Performance - Results

Comparison to TI C55x
(TI C55x is an industry benchmark Dual-MAC, 2-way VLIW)

• 20% more performance (256 point complex FFT)

                                  ConnX D2                   TI C55x
                                  "Out of the Box" C code    Optimized assembly
    Cycle count (lower is better) 3740                       4786 #

Why better?
• FFT specific instructions
• Dual write to Register Files
• Advanced Compiler
• SIMD and VLIW performance

Comparison to other DSP IP vendors

• Almost twice the performance

                                    ConnX D2                              CEVA - X1620
                                    (Out of the Box ITU reference code)   (Out of the Box ITU reference code)
    Required MHz for AMR-NB (VAD2)  27.7 MHz                              48 MHz *
    Encode + Decode

Why better?
• 1 to 1 mapping of ITU intrinsics
• SIMD and VLIW performance
• Flexibility in VLIW allocation
• VLIW performance for scalar code

* - 2008, From CEVA published Whitepaper
# - Dec 2008, www.ti.com


Small, Low Power, & High Performance

Optimized for low area / low cost applications

• Less than 70,000 gates

• 0.18 mm² in 65nm GP *

Low power

• 52 µW/MHz power consumption
  • 65nm GP, measured running AMR-NB algorithm

Very high performance

• 600 MHz in 65nm GP **

* - After full Place and Route, when optimized for area/power. Size is for the full Xtensa core including the D2 DSP option
** - After full Place and Route, when optimized for speed



Flexible and Customizable

Configure memory subsystems to exact requirements
• Up to 4 local memories
  • Instruction memory, data memory
  • RAM and ROM options
  • DMA path into these memories
• Instruction and data cache configurations
• MMU and memory region protection
• Memory port interface
• Option of dual load/store architecture

Full customization
• Instruction set extensions
• Custom I/O Interfaces
  • TIE Ports, Queues and Lookup Memory interfaces



ConnX D2 DSP Engine: Summary

• Small size

• Low power

• Excellent performance on wide range of code

• Easy to use – C programming centric

• “Out of the Box” performance

• Reduce development time – reduced cost

• ITU and T.I. C intrinsic support – large existing code base

• Bit equivalent to TI C6x

• Take current TI code, port and get same functionality on ConnX D2

• Flexible & customizable
