8/17/2019 Tutorial on TI C6678
Copyright © 2010 Texas Instruments. All rights reserved.
Texas Instruments
TMS320C6678 (Shannon)
DSP Training
Brighton Feng, November 2010
Outline
C6678 DSP Overview
Multi-core DSP programming
Interconnection and resource sharing
Peripherals overview
Shannon Functional Diagram
• Multi-Core SoC
• Fixed/Floating C66x™ Core
– Eight cores @ 1.0 GHz, 0.5 MB Local L2
– 4.0 MB shared memory
– 256 GMAC, 128 GFLOP
• Navigator
– Multicore ecosystem
• Packet Infrastructure
• Network Coprocessor – IP network solution for IPv4/IPv6
– 1.5M packets per second (1 Gb Ethernet wire rate)
– IPsec, SRTP and air interface encryption fully offloaded
• 3-port GigE Switch (Layer 2)
• Low Power Consumption – Adaptive Voltage Scaling (SmartReflex™)
• HyperLink 50
– 50G expansion port
– Transparent to software
• Multicore Debugging
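The "1.5M packets per sec (1Gb Ethernet wire-rate)" figure can be sanity-checked with a little arithmetic. The sketch below assumes minimum-size 64-byte Ethernet frames plus 8 bytes of preamble and a 12-byte inter-frame gap; those framing constants are standard Ethernet values, not numbers taken from the slides.

```python
# Assumed framing overheads (standard Ethernet values, not from the slides).
LINK_BPS = 1_000_000_000   # 1 Gb Ethernet
MIN_FRAME_BYTES = 64       # minimum Ethernet frame
PREAMBLE_BYTES = 8         # preamble + start-of-frame delimiter
IFG_BYTES = 12             # inter-frame gap


def wire_rate_pps(link_bps, frame_bytes):
    """Packets per second when frames of frame_bytes are sent back to back."""
    bits_on_wire = (frame_bytes + PREAMBLE_BYTES + IFG_BYTES) * 8
    return link_bps / bits_on_wire


pps = wire_rate_pps(LINK_BPS, MIN_FRAME_BYTES)
print(f"{pps / 1e6:.3f} Mpps")
```

Under these assumptions the result lands just under 1.5 Mpps, which is consistent with the rounded figure quoted for the Network Coprocessor.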
[Figure: C6678 (Shannon) functional diagram. Labeled blocks: eight C66x cores, each with L1D, L1P and L2 memory; Memory System (64-bit DDR3, Shared Memory, Multicore Memory Controller); System Elements (Power Mgt, SysMon, Debug, EDMA); Peripherals and I/O (sRIO, PCIe, TSIP, Flash, UART, SPI, I2C); Multicore Navigator; Packet CoProcessor and Crypto/IPSec CoProcessor; Ethernet switch with two SGMII ports; HyperLink 50; TeraNet 2.]
[Figure: C66x core evolution, drawn as performance improvement along a floating-point value axis and a fixed-point value axis for the C64x, C64x+, C67x, C67x+ and C66x cores. Callouts:]
• 100% backward object code compatible
• Increased fixed- and floating-point capability
• Improved support for complex arithmetic and matrix computation
• Enhanced DSP core
• Native instructions for IEEE 754, SP & DP
• Advanced VLIW architecture
• 2x registers
• Enhanced floating-point add capabilities
• Advanced fixed-point instructions
• Four 16-bit or eight 8-bit MACs
• Two-level cache
• SPLOOP and 16-bit instructions for smaller code size
• Flexible level one memory architecture
• iDMA for rapid data transfers between local memories
C66x core block diagram
[Figure: C66x core block diagram. Labeled blocks: Instruction Fetch, Instruction Dispatch, Instruction Decode, SPLOOP Buffer, Control Registers, Interrupt Control, In-Circuit Emulation; Data Path 1 with A Register File (A0 – A31) and functional units L1, S1, M1, D1; Data Path 2 with B Register File (B0 – B31) and functional units L2, S2, M2, D2; bus widths labeled 256 bits and 2x64 bits.]
Key Improvements of C66x
4x multiply-accumulate improvement
Enhanced complex arithmetic and matrix operations
2x arithmetic and logical operations improvement
Floating-point arithmetic support
Single-precision floating-point operation capability is the same as the 32-bit fixed-point operation capability
Division and square root are supported by floating-point instructions
C64x+ C66x Comparison
Operation           Precision                      Operations per cycle on C64x+   Operations per cycle on C66x   Function units
MAC                 Real 8 x 8                     2 x 4 = 8                       2 x 8 = 16                     M1, M2
MAC                 Real 16 x 16                   2 x 2 = 4                       2 x 8 = 16                     M1, M2
MAC                 Real 32 x 32                   2 x 1 = 2                       2 x 4 = 8                      M1, M2
MAC                 Complex (16,16) x (16,16)      2 x 1 = 2                       2 x 4 = 8                      M1, M2
MAC                 Complex (32,32) x (32,32)      N/A                             2 x 1 = 2                      M1, M2
Arithmetic/Logical  8 bit                          4 x 4 = 16                      4 x 8 = 32                     L1, L2, S1, S2
Arithmetic/Logical  16 bit                         4 x 2 = 8                       4 x 4 = 16                     L1, L2, S1, S2
Arithmetic/Logical  32 bit                         4 x 1 = 4                       4 x 2 = 8                      L1, L2, S1, S2
Memory Access       8 bit, 16 bit, 32 bit, 64 bit  2 x 1 = 2                       2 x 1 = 2                      D1, D2
Outline
C6678 DSP Overview
Multi-core DSP programming
Memory Architecture Overview
Shannon Memory Architecture Improvement
Programming model
Interconnection and resource sharing
Peripherals overview
TCI6486 Memory Architecture
[Figure: TCI6486 memory architecture. Labeled blocks: Core 0 through Core N, each with internal L2 RAM; DMA SCR at (core speed)/3, 128 bit; Shared L2 Control at (core speed)/2, 256 bit; Shared L2 RAM; Shared L2 ROM; External Memory; EDMA; master (M) and slave (S) ports.]
Shannon Memory Architecture
[Figure: Shannon memory architecture. Labeled blocks: Core 0 through Core N, each with internal L2 RAM; DMA SCR at (core speed)/3, 128 bit; Shared Memory Control at (core speed)/2, 256 bit; Shared L2 RAM; External Memory; EDMA; master (M) and slave (S) ports.]
Outline
C6678 DSP Overview
Multi-core DSP programming
Memory Architecture Overview
Shannon Memory Architecture Improvement
Programming model
Interconnection and resource sharing
Peripherals overview
Addition of XMC
Bring over existing EMC MDMA path
Fat pipe to external (and internal) shared memory
Bus width: 256 instead of 128 bits
Clock rate: CPUCLK/2 instead of CPUCLK/3
Optimize requests for MSMC / DDR3 memory
L2 line allocations and evictions are split into sub-lines of 64 bytes
Memory Protection and Address Extension (MPAX) support
16 segments of programmable size (powers of 2: 4KB to 4GB)
Each segment maps a 32-bit address to a 36-bit address.
Each segment controls access: supervisor/user, R/W/X, (non-)secure
Memory protection for shared internal MSMC memory and external DDR3 memory
Multi-stream Prefetch support
Program prefetch buffer up to 128 bytes
Data prefetch buffer up to 8 x 128 bytes
Prefetch enabled/disabled on 16MB ranges defined in MAR
Manual flush for coherence purposes
Note: no IDMA path
MAR Register Extension
• The L2 memory controller extends the MAR registers by adding the “PFX” field. The L2 memory controller uses this bit to tell the XMC whether a given address range is prefetchable.
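Since prefetch is enabled or disabled per 16MB range, finding the MAR that governs an address is a simple division. The sketch below is a minimal model of that lookup; the `pfx_bits` list is a hypothetical stand-in for the PFX flags, and the actual register layout is not described in the slides.

```python
MAR_RANGE_BYTES = 16 * 1024 * 1024   # each MAR covers a 16MB range (from the slide)
ADDRESS_SPACE = 1 << 32              # 32-bit logical address space
NUM_MARS = ADDRESS_SPACE // MAR_RANGE_BYTES


def mar_index(addr):
    """Index of the 16MB range (and hence the MAR) that contains addr."""
    if not 0 <= addr < ADDRESS_SPACE:
        raise ValueError("address outside the 32-bit logical space")
    return addr // MAR_RANGE_BYTES


def is_prefetchable(pfx_bits, addr):
    """pfx_bits is a hypothetical list holding one PFX flag per 16MB range."""
    return pfx_bits[mar_index(addr)]


# Example: mark only the 16MB range starting at 0x8000_0000 as prefetchable.
pfx_bits = [False] * NUM_MARS
pfx_bits[mar_index(0x80000000)] = True
print(NUM_MARS, mar_index(0x80000000), is_prefetchable(pfx_bits, 0x80FFFFFF))
```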
Block Diagram
[Figure: MSMC block diagram. Labeled blocks: CGEM slave ports (x N CGEM cores); system slave port for shared SRAM (SMS); system slave port for external memory (SES); two Memory Protection and Extension Units (MPAX); MSMC datapath with arbitration for banks; RAM banks, 256 bits per bank; EDC for SRAM; MSMC system master port; MSMC EMIF master port to the 64-bit DDR3 EMIF; MSMC core; events; 256-bit VBUSM links to the SCRs.]
One slave interface per C66x Megamodule (256 bits @ CPUCLK/2)
Uses a 36-bit address extended inside a C66x Megamodule core
Two slave interfaces (256 bits @ CPUCLK/2) for access from system masters
SMS interface for accesses to MSMC SRAM space
SES interface for accesses to DDR3 space
Both interfaces support memory protection and address extension
One master interface (256 bits @ CPUCLK/2) for access to the DDR3 EMIF
One master interface (256 bits @ CPUCLK/2) for access to system slaves
MSMC Shared Memory
4 banks x 2 sub-banks; each sub-bank is 256 bits wide.
Reduces conflicts between C66x Megamodule cores and system masters
Features dynamic fair-share bank arbitration for each transfer
Supports bandwidth management. Avoids indefinite starvation of lower priority requests due to higher priority requests
Features Not Supported
Cache coherency between L1/L2 caches in C66x Megamodule cores and MSMC memory
Cache coherency between XMC prefetch buffers and MSMC memory
C66x Megamodule to C66x Megamodule cache coherency via MSMC
MPAX Units
MPAX stands for “Memory Protection and Address Extension”
There are N+2 MPAX units in a system with N C66x Megamodules
– N MPAX units for all requests from the N C66x Megamodules to internal shared memory, external shared memory or any system slave
– 1 MPAX unit for all requests from any system master to internal shared memory
– 1 MPAX unit for all requests from any system master to external shared memory
Each MPAX unit operates on a number of segments of programmable size
– Each segment maps a 32-bit address to a 36-bit address.
– Each segment controls access.
Number of Segments
Each C66x Megamodule has 16 segments which control direct (load/store) requests to internal shared memory, external shared memory and any other system slave.
Any master identified by a privilege ID has
– 8 segments for requests to internal shared memory
– 8 segments for requests to external shared memory.
Some masters work on behalf of other masters. They inherit the privilege ID of their commanding master. As such, each C66x Megamodule also has
– 8 segments for indirect (DMA) requests to internal shared memory
– 8 segments for indirect (DMA) requests to external shared memory
Segment Definition
Each segment is defined by a base address and a size
The segment size can be set to any power of 2 from 4KB to 4GB
The segment base address is constrained to a power-of-2 boundary equal to the segment size.
One would expect each request to find exactly one matching segment, however ...
... a request may find two or more matching segments, in which case segments with a higher ID take priority over segments with a lower ID. This allows
– creating non-power-of-2 segments
– creating 3 segments with just 2 segment definitions
... a request may find no matching segment, in which case an error is reported in the memory protection fault reporting registers (XMPFAR, XMPFSR)
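The size and alignment rules for a segment definition can be checked mechanically. The helper below is a minimal sketch of such a check; the function names are illustrative and it works on plain byte addresses instead of actual register fields.

```python
KB = 1024
GB = 1024 ** 3
MIN_SEGMENT_SIZE = 4 * KB
MAX_SEGMENT_SIZE = 4 * GB


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def check_segment(base, size):
    """Return a list of rule violations; an empty list means the definition is valid."""
    problems = []
    if not is_power_of_two(size) or not MIN_SEGMENT_SIZE <= size <= MAX_SEGMENT_SIZE:
        problems.append("size must be a power of 2 from 4KB to 4GB")
    elif base % size != 0:
        problems.append("base must sit on a boundary equal to the segment size")
    return problems


print(check_segment(0x80000000, 2 * GB))   # valid: 2GB segment on a 2GB boundary
print(check_segment(0x80001000, 2 * GB))   # invalid: base not aligned to the size
```

A region whose size is not a power of 2 cannot be described by one definition, which is why the overlapping-segment priority rule is useful: a larger segment with a smaller, higher-ID segment on top of it yields the remaining odd-sized ranges.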
XMC Segment Registers
XMPAXH/XMPAXL[15-0]
MPAX Default Memory Map
Segment 0: BADDR = 00000h; RADDR = 000000h; Size = 2GB
Segment 1: BADDR = 80000h; RADDR = 800000h; Size = 2GB
Segments 2 to 15: Disabled
[Figure: MPAX default memory map. The CGEM logical 32-bit memory map (0000_0000 to FFFF_FFFF, with 0000_0000 to 0BFF_FFFF not remappable) is mapped onto the system physical 36-bit memory map (lower 4GB from 0:0000_0000 to 0:FFFF_FFFF, upper 60GB from 1:0000_0000 to F:FFFF_FFFF). Logical 8000_0000 to FFFF_FFFF lands at 8:0000_0000 to 8:7FFF_FFFF.]
XMC configures MPAX segments 0 and 1 so that the C66x Megamodule can access system memory.
The power-up configuration is that segment 1 remaps 8000_0000 – FFFF_FFFF in the C66x Megamodule’s address space to 8:0000_0000 – 8:7FFF_FFFF in the system address map.
This corresponds to the first 2GB of address space dedicated to EMIF by the MSMC controller.
MPAX MSMC Aliasing Example
BADDR = 0C000h; RADDR = 00C000h; Size = 2MB (“Fast” MSMC RAM, 0Cxx_xxxx)
BADDR = 20000h; RADDR = 00C000h; Size = 2MB (MSMC RAM Alias 1, 20xx_xxxx)
BADDR = 21000h; RADDR = 00C000h; Size = 2MB (MSMC RAM Alias 2, 21xx_xxxx)
[Figure: CGEM 32-bit memory map (0000_0000 to FFFF_FFFF, with 0000_0000 to 0BFF_FFFF not remappable). All three logical ranges map to the 2MB MSMC RAM at 0:0C00_0000 to 0:0C1F_FFFF.]
The example shows 3 segments that map the MSMC RAM address space into the C66x Megamodule’s address space as three distinct 2MB ranges. By programming the MARs accordingly, the three segments could have different semantics.
Accesses to MSMC RAM via an alias do not use the “fast RAM” path and incur additional cycles of latency.
MPAX Overlayed Segments Example
Segment 0: BADDR = 00000h; RADDR = 000000h; Size = 2GB
Segment 1: BADDR = 80000h; RADDR = 080000h; Size = 2GB
Segment 2: BADDR = C0007h; RADDR = 050042h; Size = 4K
Segments 3 to 15: Disabled
[Figure: CGEM 32-bit memory map (0000_0000 to FFFF_FFFF, with 0000_0000 to 0BFF_FFFF not remappable) mapped onto the system 36-bit memory map (lower 4GB from 0:0000_0000 to 0:FFFF_FFFF, upper 60GB from 1:0000_0000 to F:FFFF_FFFF). Logical C000_7xxx is redirected from 0:C000_7xxx to 0:5004_2xxx.]
Segment 1 matches 8000_0000 through FFFF_FFFF, and segment 2 matches C000_7000 through C000_7FFF.
Because segment 2 is higher priority than segment 1, its settings take priority, effectively carving a 4K hole in segment 1’s 2GB address space.
Furthermore, it maps this 4K space to 0:5004_2000 - 0:5004_2FFF, which overlaps the mapping established by segment 0. This physical address range is now accessible by two logical address ranges.
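The overlay behaviour can be reproduced with a small software model of the lookup. The sketch below is an illustration only: it treats BADDR as the logical address shifted right by 12 bits and RADDR as the 36-bit address shifted right by 12 bits, which is inferred from the values on the slide, and it ignores permissions and the non-remappable range below 0C00_0000. The `Segment` class and `mpax_translate` function are hypothetical names.

```python
from dataclasses import dataclass

GB = 1 << 30


@dataclass
class Segment:
    baddr: int   # logical base address >> 12 (assumed encoding)
    raddr: int   # replacement 36-bit address >> 12 (assumed encoding)
    size: int    # segment size in bytes, a power of 2

    def matches(self, addr):
        return (addr & ~(self.size - 1)) == (self.baddr << 12)

    def translate(self, addr):
        return (self.raddr << 12) | (addr & (self.size - 1))


# Segment definitions from the overlay example, keyed by segment ID.
segments = {
    0: Segment(0x00000, 0x000000, 2 * GB),
    1: Segment(0x80000, 0x080000, 2 * GB),
    2: Segment(0xC0007, 0x050042, 4096),
}


def mpax_translate(segments, addr):
    """Return (segment ID, 36-bit address); the highest matching ID wins."""
    for seg_id in sorted(segments, reverse=True):
        if segments[seg_id].matches(addr):
            return seg_id, segments[seg_id].translate(addr)
    raise LookupError("no matching segment (reported as a protection fault)")


for logical in (0xC0007123, 0xC0008000, 0x50042123):
    seg_id, physical = mpax_translate(segments, logical)
    print(f"{logical:08X} -> segment {seg_id}, {physical:09X}")
```

In this model, logical addresses C000_7123 and 5004_2123 resolve to the same physical location, which is the aliasing effect the slide describes.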
Outline
C6678 DSP Overview
Multi-core DSP programming
Memory Architecture Overview
Shannon Memory Architecture Improvement
Programming model
Interconnection and resource sharing
Peripherals overview
single program image
[Figure: single program image. C6000 Core 0, Core 1 and Core 2, each with L1 Prog, L1 Data and L2 memory, all run the same App.out. Each local L2 holds code and read/write data (Data 0, Data 1, Data 2). Shared L2 or DDR memory holds the App.out shared code and read-only data, plus a separate Data 0, Data 1 and Data 2 area per core.]
Same image on each DSP core
Aliased addressing is used for a DSP core to access its local L2
The DNUM DSP core register is used for:
– Global addressing when programming EDMA3, SRIO, …
– A separate buffer per DSP core in DDR: dp = bufBase + BUF_SIZE * DNUM
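The two uses of DNUM can be sketched as simple address arithmetic. The buffer formula comes from the slide; the local and global L2 base addresses and the per-core stride below are assumed values for illustration and should be checked against the device data manual.

```python
# Assumed address constants (illustrative, verify against the device data manual).
LOCAL_L2_BASE = 0x00800000       # aliased (local) view of a core's own L2
GLOBAL_L2_BASE = 0x10800000      # global view of core 0's L2
GLOBAL_CORE_STRIDE = 0x01000000  # distance between consecutive cores' global L2
L2_SIZE = 512 * 1024             # 0.5 MB local L2 per core (from the overview slide)


def per_core_buffer(buf_base, buf_size, dnum):
    """dp = bufBase + BUF_SIZE * DNUM, as on the slide."""
    return buf_base + buf_size * dnum


def local_to_global(addr, dnum):
    """Convert an aliased local L2 address to the global address a DMA master needs."""
    if not LOCAL_L2_BASE <= addr < LOCAL_L2_BASE + L2_SIZE:
        raise ValueError("not a local L2 address")
    return addr - LOCAL_L2_BASE + GLOBAL_L2_BASE + GLOBAL_CORE_STRIDE * dnum


for dnum in range(3):
    print(dnum,
          hex(per_core_buffer(0x80000000, 0x10000, dnum)),
          hex(local_to_global(0x00800100, dnum)))
```

The same image can therefore run on every core: the code reads DNUM once and derives both its private DDR buffer and the global address it hands to EDMA3 or SRIO.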
Shannon MPAX enables easy single program image
[Figure: MPAX and single program image. Two virtual (CGEM) address spaces, (1) and (n), each contain internal memory, MSMC RAM and external memory with the same layout of code and data (code1, code2, data2, data3). An MPAX unit per core maps each virtual address space onto the SoC address space, where code is shared and each core's data2 and data3 occupy separate physical locations.]
multiple program image
[Figure: multiple program image. C6000 Core 0, Core 1 and Core 2, each with L1 Prog, L1 Data and L2 memory, run App0.out, App1.out and App2.out respectively. Shared L2 or DDR memory holds the three images and a separate Data 0, Data 1 and Data 2 area per core.]
Each DSP core has its own image
Static split of DDR memory per DSP core
Global or local addressing used for L2 addressing
Shannon Software
• Flexible development environment for the customer.
• Customer can choose to develop their application using all or any one of the software layers.
• Will contain the following software layers
– BIOS and Linux operating system support
– Chip Support Library
– Platform Development Kit
– Inter Core Communication
– Optimized DSP functions library
– Optimized audio, video and speech codecs
– Voice Gateway Demonstration Kit
– Video Transcoding Demonstration Kit
– Demonstration applications
[Figure: C6678 software stack. Labeled layers: Operating System w/ Boot Loader (BIOS, Linux); Chip Support Library; Platform Development Kit; Inter Core Communication; DSPLIB, Speech Codec, Audio Codec, Video Codec; Voice Gateway Demonstration Kit, Video Transcoding Demonstration Kit, Demo App; Customer Application. Side labels: Full Silicon Entitlement, Multi-core Entitlement.]
Shannon Debug
Best multicore debug and visualization
Debug-enabled multicore SoC
Debug visibility at core, across multicore and for SoC
[Figure: C6678 (Shannon) functional diagram annotated with debug features: TRACE paths, ETB and Data Visualization.]
Outline
C6678 DSP Overview
Multi-core DSP programming
Interconnection and resource sharing
Interconnection Architecture
Shannon Hardware queue
Inter-core communication
Shared Resource Management
Peripherals overview
Shannon Switch Fabric
[Figure: Shannon switch fabric. Labeled blocks: CPU/2 256b VBUSM SCR; CPU/3 128b VBUSM SCR; CPU/3 32b VBUSP SCR (CONFIG); MSMC_SS with Shared L2 RAM; XMC; DDR3; eight GEMs, each with master (M) and slave (S) ports; SRIO; PCIe; QM_SS; PA_SS; VUSR; TSIP 0,1; EMIF16; EDMA_0 (16-channel DMA, MTC0 and MTC1); EDMA_1,2 (two 64-channel DMAs, MTC2 to MTC9).]
Outline
C6678 DSP Overview
Multi-core DSP programming
Interconnection and resource sharing
Interconnection Architecture
Shannon Hardware queue
Inter-core communication
Shared Resource Management
Peripherals overview
Hardware Queue Architecture
Packetized data transfer architecture designed to minimize DSP core interaction while maximizing memory and bus efficiency
The key communication platform for TI’s future infrastructure DSPs
Used by the following peripherals in Shannon:
– Serial RapidIO, Packet Accelerator
Each module contains its own DMA to transfer the data associated with the ‘jobs’; no CPU resources are involved
Hardware Queue
Producer writes ‘jobs’ into a Queue.
Consumer reads ‘jobs’ from the Queue
Supports Multiple In – Multiple Out
– Multiple Producers can write to the same Queue: used to share common hardware
– Multiple Consumers can read from the same Queue: used for load balancing
Abstracts the Consumer
– Consumer can be a hardware IP (accelerator, peripheral) or software (i.e. a CPU core)
– Transparent for the Producer
‘Easy’ to upgrade to new hardware. The ‘job gets done’.
– Minimizes changes to host software, easy maintenance
[Figure: Producers (CPU1, CPU2, CPU3, Packet Acc., RapidIO, ...) send a ‘job’ to the Queue Manager (Queue 1..x, Queue Controller, DMA); Consumers (CPUx, Acc 1, Acc 2, RapidIO, Peri x, …) retrieve a ‘job’.]
Packet Queuing Data Structure Diagram
Hardware Queue Operation
Push to a queue
– Host writes the pointer of the new descriptor to a queue register.
– Queue manager links the new descriptor (by modifying the link RAM) to the tail (or head) of the queue.
– Tail (or head) pointer points to the new descriptor.
Pop from a queue
– Host reads a descriptor pointer from a queue register.
– Queue manager returns the descriptor pointed to by the head pointer.
– Head pointer points to the next descriptor.
Monitor queue
– Queue manager generates events when a queue changes: not empty, entry count, exceeds threshold, starvation…
Queue Diversion
– Entire queue contents can be cleared or moved to another queue destination using a single register write
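The push, pop and diversion operations can be summarized from the host's point of view with a small software model. This is a sketch of the behaviour only, not of the register interface: the class and method names are hypothetical, and returning 0 from a pop on an empty queue is an assumed convention.

```python
from collections import deque


class HardwareQueueModel:
    """Host-side view of a set of descriptor queues (behavioural sketch)."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]

    def push(self, queue, desc_ptr, to_head=False):
        # One register write: link the descriptor at the tail (or head).
        if to_head:
            self.queues[queue].appendleft(desc_ptr)
        else:
            self.queues[queue].append(desc_ptr)

    def pop(self, queue):
        # One register read: return the head descriptor pointer,
        # or 0 when the queue is empty (assumed convention).
        return self.queues[queue].popleft() if self.queues[queue] else 0

    def divert(self, src, dst):
        # Move the entire contents of src to the tail of dst in one operation.
        self.queues[dst].extend(self.queues[src])
        self.queues[src].clear()

    def entry_count(self, queue):
        return len(self.queues[queue])


qm = HardwareQueueModel(num_queues=4)
qm.push(0, 0x1000)
qm.push(0, 0x2000)
qm.divert(0, 1)
print(qm.entry_count(0), qm.entry_count(1), hex(qm.pop(1)))
```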
Shannon Hardware queue architecture
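[Figure: Shannon hardware queue architecture. DSP cores, Packet DMA (SRIO) and Packet DMA (PA) connect over VBUS to the Queue Manager Subsystem, which contains the Queue Manager with queue interfaces Q0 IF, Q1 IF ... Qx IF, Link RAM, Descriptor RAM, an internal Packet DMA and two APDSPs. Queue events go to the Packet DMAs and APDSPs; the APDSPs write accumulation buffers in memory and raise queue interrupts to the DSP cores.]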
Queue Manager Subsystem
Supports 8192 queues
HW queues are multi-core safe without mutual exclusion; multiple senders can use a destination queue without restrictions
Can notify the Packet DMA when a transfer is pending
Can notify a DSP core when a packet is pending, and can copy descriptor pointers of transferred data to the destination core’s local memory to reduce access latency
Internal Packet DMA
– Transfers packets from one queue to another queue. Good for core-to-core data transfer.
Descriptor RAM
Data elements (buffers) to be passed on queues are first described to a descriptor region manager built into the QM.
Although technically called descriptors, these memory elements can hold any arbitrary data.
The size of the data elements must be a power of 2, from 32 bytes to 8192 bytes in length.
20 configurable memory regions (for descriptor storage)
The number of elements in the region must be a power of 2, from 32 buffers to 4096 buffers in the region.
[Figure: memory descriptor region registers for Region 0, Region 1 ... Region 19, pointing at a region of 32-byte buffers and a region of 256-byte buffers at example base addresses 0x1000 and 0x2000.]
Linking RAM
Linking RAM contains 1 entry for each descriptor. A Linking RAM entry is effectively an extension of the descriptor
Linking RAM stores the forward data pointer that is critical for the PUSH / POP operations performed by the Queue Manager
Linkage between the physical address of a descriptor and the physical address of its Linking RAM entry is performed inside the QM using information provided in the Queue Management configuration registers
Linking RAM is typically placed in local memory for speed. This allows data elements to be linked and unlinked in a queue very quickly, even though the buffers themselves may be in external memory
There is no limit to the length of a single queue, only a limit on the total number of data elements in the system.
2 configurable Linking RAM regions
[Figure: queue contents and Linking RAM. A forward pointer table indexed by descriptor number holds, for each queued descriptor, the index of the next descriptor in its queue (x marks the end of a queue); the example shows Queue 0 and Queue 1 built from descriptors 5, 17, 18 and 19.]
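The role of the forward pointer table can be illustrated with a small model in which every queue is just a head and a tail index into one shared table. This is a sketch under simplifying assumptions: the names are hypothetical, `None` stands in for the end-of-queue marker, and the descriptor indices used below are illustrative.

```python
END = None  # end-of-queue marker in the forward pointer table


class LinkedQueue:
    """One queue expressed as head/tail indices into a shared link table."""

    def __init__(self, link_ram):
        self.link_ram = link_ram
        self.head = END
        self.tail = END

    def push_tail(self, index):
        self.link_ram[index] = END
        if self.tail is END:
            self.head = index
        else:
            self.link_ram[self.tail] = index   # old tail now points forward
        self.tail = index

    def pop_head(self):
        index = self.head
        if index is END:
            return END
        self.head = self.link_ram[index]       # follow the forward pointer
        if self.head is END:
            self.tail = END
        return index


def descriptor_index(desc_addr, region_base, desc_size):
    """Map a descriptor address to its link table index within one region."""
    return (desc_addr - region_base) // desc_size


link_ram = [END] * 32          # one entry per descriptor
queue0 = LinkedQueue(link_ram)
queue1 = LinkedQueue(link_ram)
for idx in (17, 5):
    queue0.push_tail(idx)
for idx in (18, 19):
    queue1.push_tail(idx)
print(link_ram[17], link_ram[18], queue0.pop_head(), queue1.pop_head())
```

Because only small table entries are touched on a push or pop, linking stays fast even when the descriptors and buffers themselves live in slower external memory.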
Queue Data Flow Example, Transmit
INIT: Host allocates Rx free descriptors and initializes queues
TX 1: Processor fetches a descriptor to fill with the data to transmit
TX 2: Processor queues a packet to a Tx queue
TX 3: Port transmits the buffer being pointed to by the descriptor
TX 4: Port posts the packet descriptor to the return queue
[Figure: blocks shown are the Host Processor, the Queue Manager (Free Descriptor Queue, Tx Queue, Rx Queue), the Tx Port, the Rx Port and the Interrupt Generator.]
Accumulator (A Programmable DSP)
The Accumulator helps a DSP core efficiently pop descriptor pointers from a queue.
The Accumulator pops descriptor pointers from the queue and writes them to accumulation memory (normally in DSP local memory).
The Accumulator generates interrupts to the DSP core according to the interrupt pacing configuration.
Two Accumulators (PDSPs)
– One generates 32 interrupts, each for one queue.
– The other generates 16 interrupts, each a combined event for 32 queues, monitoring 16 x 32 queues in total.
[Figure: the Queue Manager sends queue events to the APDSP, which monitors queue changes, uses a timer for interrupt pacing, reads Descriptor RAM, writes the accumulation memory (descriptor pointer array) and raises queue interrupts to the DSP core.]
Hardware queue Performance Consideration
Push Operation
– 1 to 4 word write. Because the write is posted, it normally does not stall the DSP core.
Pop Operation
– 1 to 4 word read. Stalls the DSP core for about 80 to 100 cycles.
– The Accumulator (PDSP) can pop the descriptors into DSP local memory, which saves DSP cycles dramatically.
Descriptor Access
– Writing or reading a full descriptor may consume many cycles.
– For most applications, the DSP core can initialize all descriptors during initialization and write or read only a few fields of the descriptor at run time.
Outline
C6678 DSP Overview
Multi-core DSP programming
Interconnection and resource sharing
Interconnection Architecture
Shannon Hardware queue
Inter-core communication
Shared Resource Management
Peripherals overview
Shared Data in the L2 SRAM of transmitter
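[Figure: DSP Core X and DSP Core Y, each with L1 cache, L2 cache and L2 RAM, connected through the Data Switch Fabric Center, which also connects to DDR SDRAM. The shared data sits in Core X's L2 RAM.]
If cache is enabled, Core Y needs to invalidate its cache before reading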
Shared Data in the L2 SRAM of receiver
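[Figure: DSP Core X and DSP Core Y, each with L1 cache, L2 cache and L2 RAM, connected through the Data Switch Fabric Center, which also connects to DDR SDRAM. The shared data sits in Core Y's L2 RAM.]
If cache is enabled, Core X needs to write back its cache after writing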
Shared Data in the shared memory
[Figure: DSP Core X and DSP Core Y, each with L1 cache, L2 cache and L2 RAM, connected through the Data Switch Fabric Center to shared L2 or DDR, where the shared data sits.]
If cache is enabled, Core X needs to write back its cache after writing; Core Y needs to invalidate its cache before reading
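Why both steps are needed can be seen in a toy model of two cores with private write-back caches and no hardware coherency. This is purely illustrative: the `Core` class and its methods are hypothetical and do not correspond to any cache API on the device.

```python
class Core:
    """A core with a private write-back cache in front of shared memory."""

    def __init__(self, shared):
        self.shared = shared   # dict standing in for shared L2 or DDR
        self.cache = {}        # addr -> cached value
        self.dirty = set()     # addresses modified but not yet written back

    def write(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)

    def read(self, addr):
        if addr not in self.cache:              # miss: fetch from shared memory
            self.cache[addr] = self.shared.get(addr, 0)
        return self.cache[addr]

    def writeback(self, addr):
        if addr in self.dirty:
            self.shared[addr] = self.cache[addr]
            self.dirty.discard(addr)

    def invalidate(self, addr):
        self.cache.pop(addr, None)
        self.dirty.discard(addr)


shared = {0x100: 1}
core_x, core_y = Core(shared), Core(shared)

core_y.read(0x100)            # Core Y now holds the old value in its cache
core_x.write(0x100, 2)        # Core X updates only its own cache
stale_before = core_y.read(0x100)
core_x.writeback(0x100)       # the producer pushes the data to shared memory
stale_after = core_y.read(0x100)
core_y.invalidate(0x100)      # the consumer drops its old copy
fresh = core_y.read(0x100)
print(stale_before, stale_after, fresh)
```

Skipping either step leaves Core Y with the old value: without the write-back the new data never reaches shared memory, and without the invalidate Core Y keeps hitting its own stale copy.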
Use IPC register for inter-core communication
[Figure: DSP Core X and DSP Core Y, each with L1 cache, L2 cache and L2 RAM, connected through the Configuration Switch Fabric to the IPC registers.]
Interrupt is generated for Core Y
No cache coherency issue
Inter-core Data Block exchange with EDMA
[Figure: DSP Core X and DSP Core Y, each with L1 cache, L2 cache and L2 RAM, connected through the Data Switch Fabric Center. The EDMA copies the data block from one core's L2 RAM to the other's.]
Interrupt is generated for Core Y
No cache coherency issue
Inter-core data exchange through hardware queue (Packet DMA copy)
[Figure: DSP Core X and DSP Core Y, each with L1 cache, L2 cache and L2 RAM, connected through the Data Switch Fabric Center to the Packet DMA, which moves data from the Src Que to the Dst Que.]
Core X simply pushes data to the Source Queue
Packet DMA transfers the data to the Dest Queue
Core Y simply pops data from the Dest Queue
If the queue buffers are in L2 RAM, software on both cores does not need to maintain cache coherency.
Inter-core data exchange through hardware queue (Zero Copy)
Core X pushes data to the Shared Queue, Core Y pops data from the Shared Queue
Multiple cores can access the Shared Queue simultaneously without mutual exclusion
Software needs to maintain cache coherency.
[Figure: DSP Core X and DSP Core Y, each with L1 cache, L2 cache and L2 RAM, connected through the Data Switch Fabric Center to the Queue Manager, which holds the Shared Queue.]
Outline
C6678 DSP Overview
Multi-core DSP programming
Interconnection and resource sharing
Interconnection Architecture
Shannon Hardware queue
Inter-core communication
Shared Resource Management
Peripherals overview
Shared resources
Internal shared L2 and external shared memory (DDR)
– Each core accesses shared memory independently. Arbitration is handled by the switch fabric and end-point arbiters.
Shared on-chip peripherals
– Configuration: typically done at startup to set the operating mode of a particular logic block (e.g. DDR settings). Should be done by a single core as part of the boot process.
– Use, peripherals with a hardware queue: each core accesses the hardware queue independently. Arbitration is handled by the queue manager. Examples: Ethernet, SRIO on Shannon…
– Use, multi-channel peripherals: can be split amongst the cores for concurrent, orthogonal control. Examples: EDMA, TSIP, Timer…
– Use, single-channel peripherals: can be controlled by a single master that services the other cores if needed, or used mutually exclusively by multiple masters through a semaphore. Examples: I2C, SPI…
System-level prioritization for arbitration
A user-specified priority may be assigned to:
– Any DSP core accesses
– Any EDMA, sRIO, Ethernet, … on-chip transfers
Each master port is assigned a configurable priority (8 levels)
Hardware Semaphores on Shannon for atomic accesses
What function does the Semaphore module provide?
– A method to control who accesses a shared resource
– Provides access to shared resources in an atomic manner
– Read-modify-write sequence is not broken
Features of the Semaphore module
– Binary semaphores
– Contains 64 semaphores to be used within the system
– Two methods of accessing a semaphore resource
Direct access
– A core directly accesses a semaphore resource. If it is free, the semaphore is granted. If not, the semaphore is not granted
– Useful if the system can afford to poll for the semaphore
Indirect access
– A core indirectly accesses a semaphore resource by writing to it. Once it is free, an interrupt notifies the DSP core that it is available.
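The difference between direct and indirect access can be shown with a behavioural model. This is a sketch only: the class and method names are hypothetical, the callback stands in for the interrupt, and granting pending requests in arrival order is an assumption rather than something the slides state.

```python
class SemaphoreModel:
    """Behavioural sketch of a bank of binary hardware semaphores."""

    NUM_SEMAPHORES = 64

    def __init__(self, notify):
        self.owner = [None] * self.NUM_SEMAPHORES
        self.pending = [[] for _ in range(self.NUM_SEMAPHORES)]
        self.notify = notify   # notify(core, sem) stands in for the interrupt

    def direct_request(self, sem, core):
        """Poll-style access: granted immediately if free, otherwise refused."""
        if self.owner[sem] is None:
            self.owner[sem] = core
            return True
        return False

    def indirect_request(self, sem, core):
        """Post a request; the core is notified once the semaphore is its own."""
        if self.owner[sem] is None:
            self.owner[sem] = core
            self.notify(core, sem)
        else:
            self.pending[sem].append(core)

    def release(self, sem, core):
        if self.owner[sem] != core:
            raise ValueError("only the owner can release a semaphore")
        if self.pending[sem]:
            next_core = self.pending[sem].pop(0)   # assumed: arrival order
            self.owner[sem] = next_core
            self.notify(next_core, sem)
        else:
            self.owner[sem] = None


granted = []
sems = SemaphoreModel(notify=lambda core, sem: granted.append((core, sem)))
print(sems.direct_request(3, core=0))   # core 0 takes semaphore 3
print(sems.direct_request(3, core=1))   # core 1 polls and is refused
sems.indirect_request(3, core=1)        # core 1 queues a request instead
sems.release(3, core=0)                 # core 1 is notified
print(granted)
```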
Outline
C6678 DSP Overview
Multi-core DSP programming
Interconnection and resource sharing
Peripherals overview
Shannon RapidIO Gen 2 Features and Enhancements
4 lanes – options include 2x
Baud rates: 5 Gbaud per lane in addition to 1.25, 2.5, 3.125 Gbaud per lane
DeviceID Support
– 16 Local DeviceIDs (up from 1)
– 8 Multicast IDs (up from 3)
24 Interrupt outputs (up from 8)
Messaging
– Type 9 Packets Support (Data Streaming)
– Type 11 Message – classification improvements
DirectIO
– 8 Load/Store (DirectIO) Units (up from 4)
– Shadow register sets for LSUs to simplify management and minimize overhead
– Provide up to 1MB block transfers (up from 4KB)
Packet Forwarding with Reset Isolation
RapidIO – Topology Examples
[Figure: RapidIO topology examples built from C6678 DSPs: a mesh, a chain, a ring, and DSPs connected through an SRIO switch.]
Packet Accelerator Subsystem On Shannon
3 Port Ethernet Switch
– Port 0: Internal hardware queue port
– Port 1: SGMII 0 Port, 1Gbps
– Port 2: SGMII 1 Port, 1Gbps
Packet Accelerator (PA)
– L2, L3, and L4 packet processing
– 1.5M packets per sec
Security Accelerator (SA)
– Encryption/Decryption
– IPSEC ESP
– IPSEC AH
– SRTP
– 3GPP
IEEE 1588 support
The EMAC hardware classifies ingress and egress frames at the physical level as timing synchronization frames and records their timestamps.
A software algorithm running on a DSP core then calculates the delay and adjusts local time accordingly.
Device A is the master device
Device B is the slave device
Message B is used to send the actual transmit time (tA) of Message A
Message D is used to send the actual receive time (rC) of Message C
Wire time in one direction = ((rC - tA) - (tC - rA)) / 2
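The wire-time formula can be tried out on a worked example. In the sketch below, tA and rC are taken on the master clock and rA and tC on the slave clock; the offset formula and the assumption of a symmetric path are the usual companions of the delay formula and are not stated on the slide. The timestamp values are made up for illustration.

```python
def wire_delay(tA, rA, tC, rC):
    """One-way wire time: ((rC - tA) - (tC - rA)) / 2, as on the slide."""
    return ((rC - tA) - (tC - rA)) / 2


def slave_offset(tA, rA, tC, rC):
    """Slave clock minus master clock, assuming the path delay is symmetric."""
    return ((rA - tA) - (rC - tC)) / 2


# Illustrative scenario: true wire delay of 5 units, slave clock 100 units ahead.
tA = 1000   # master sends Message A (master clock)
rA = 1105   # slave receives Message A (slave clock)
tC = 1200   # slave sends Message C (slave clock)
rC = 1105   # master receives Message C (master clock)

print(wire_delay(tA, rA, tC, rC), slave_offset(tA, rA, tC, rC))
```

The slave would then subtract the computed offset from its local time to align with the master.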
TSIP Overview
1024 8-bit timeslots, receive and transmit.
– 8 links of 128 timeslots at 8.192 Mbps.
– 4 links of 256 timeslots at 16.384 Mbps.
– 2 links of 512 timeslots at 32.768 Mbps.
Two clock and frame sync inputs.
– Independent clocking: 1 receive clock and 1 transmit clock.
– Redundant/common clocking: 1 receive/transmit clock with second clock as backup.
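The three link configurations are related by simple arithmetic: each carries the same 1024 timeslots, and the link rate follows from the timeslot count. The sketch below assumes the conventional 8 kHz TDM frame rate, which is not stated on the slide.

```python
FRAME_RATE_HZ = 8000       # assumed: standard 8 kHz TDM frame rate
BITS_PER_TIMESLOT = 8      # 8-bit timeslots (from the slide)


def link_rate_bps(timeslots_per_link):
    """Serial rate of one link carrying the given number of timeslots per frame."""
    return timeslots_per_link * BITS_PER_TIMESLOT * FRAME_RATE_HZ


configs = [(8, 128), (4, 256), (2, 512)]   # (links, timeslots per link)
for links, slots in configs:
    print(f"{links} links x {slots} timeslots = {links * slots} timeslots, "
          f"{link_rate_bps(slots) / 1e6:.3f} Mbps per link")
```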
Shannon PCIe Interface
Nyquist/Shannon incorporates a PCIe interface with the following characteristics:
Two SERDES lanes running at 5 GBaud / 2.5 GBaud
Gen2 compliant
Three different operational modes (default defined by pin inputs at power up; can be overwritten by software):
– Root Complex (RC)
– End Point (EP)
– Legacy End Point
Single Virtual Channel (VC)
Single Traffic Class (TC)
Maximum Payloads
– Egress – 128 bytes
– Ingress – 256 bytes
Configurable BAR filtering, IO filtering and configuration filtering
Remaining Peripherals & System Elements (1/2)
EMIF16
– Supports NAND flash memory, up to 256MB
– Supports NOR flash up to 16MB
– Supports asynchronous SRAM mode, up to 1MB
– Used for booting, logging, announcement, etc.
64-Bit Timers
– Total of 16 64-bit timers
– One 64-bit timer per core is dedicated to serve as a watchdog (or may be used as a general purpose timer)
– Eight 64-bit timers are shared for general purpose timers
– Each 64-bit timer can be configured as two individual 32-bit timers
Timer Input/Output pins
– Two timer Input pins
– Two timer Output pins
– Timer input pins can be used as GPI
– Timer output pins can be used as GPO
Remaining Peripherals & System Elements (2/2)
UART Interface
– Operates at up to 128,000 baud
I2C Interface
– Supports 400Kbps throughput
– Supports full 7-bit address field
– Supports EEPROM size of 4 Mbit
SPI Interface
– Operates at up to 66MHz
– Supports two chip selects
– Supports master mode
GPIO Interface
– 16 GPIO pins
– Can be configured as interrupt pins
– Interrupt can select either rising edge or falling edge
Q&A