
Tutorial on TI C6678

Jul 06, 2018


Shashwat Dubey
  • 8/17/2019 Tutorial on TI C6678

    1/65

    Copyright © 2010 Texas Instruments. All rights reserved. 

    Texas Instruments

    TMS320C6678 (Shannon)

    DSP Training

    Brighton Feng, November 2010


    Outline

    C6678 DSP Overview

    Multi-core DSP programming

    Interconnection and resource sharing

    Peripherals overview


    Shannon Functional Diagram

    • Multi-Core SoC

    • Fixed/Floating C66x™ Core 

     – Eight cores @ 1.0 GHz, 0.5 MB Local L2

     – 4.0 MB shared memory

     – 256 GMAC, 128 GFLOP

    • Navigator

     – Multicore eco system

    • Packet Infrastructure

    • Network Coprocessor – IP Network solution for IPv4/IPv6

     – 1.5M packets per second (1 Gb Ethernet wire rate)

     – IPsec, SRTP, Air Interface Encryption fully offloaded

    • 3-port GigE Switch (Layer 2)

    • Low Power Consumption – Adaptive Voltage Scaling (SmartReflex™)

    • Hyperlink 50

     – 50G Expansion port

     – Transparent to Software

    • Multicore Debugging

    [Figure: C6678 (Shannon) functional block diagram. Eight C66x cores, each with L1D, L1P and L2 memory; memory system (64-bit DDR3, shared memory, Multicore Memory Controller); Multicore Navigator; Ethernet switch with two SGMII ports, packet coprocessor and crypto/IPSec coprocessor; peripherals and I/O (sRIO, PCIe, TSIP, UART, SPI, I2C, Flash); system elements (power management, debug, EDMA, SysMon); Hyperlink 50; TeraNet 2.]

    [Figure: evolution of the C6000 cores toward the C66x along a performance-improvement axis, combining floating-point value (C67x, C67x+) and fixed-point value (C64x, C64x+). Feature labels in the figure:]

    • 100% backward object code compatible
    • Increased fixed- and floating-point capability
    • Improved support for complex arithmetic and matrix computation
    • Enhanced DSP core
    • Native instructions for IEEE 754, SP and DP
    • Advanced VLIW architecture
    • 2x registers
    • Enhanced floating-point add capabilities
    • Advanced fixed-point instructions
    • Four 16-bit or eight 8-bit MACs
    • Two-level cache
    • SPLOOP and 16-bit instructions for smaller code size
    • Flexible level-one memory architecture
    • iDMA for rapid data transfers between local memories


    C66x core block diagram

    [Figure: C66x core block diagram. Instruction fetch, instruction dispatch and instruction decode; control registers, interrupt control and in-circuit emulation; SPLOOP buffer; Data Path 1 with the A register file (A0 – A31) and units L1, S1, M1, D1; Data Path 2 with the B register file (B0 – B31) and units L2, S2, M2, D2. Width labels in the figure: 256 bits and 2 x 64 bits.]

    Key Improvements of C66x

    4x multiply-accumulate improvement

    Enhanced complex arithmetic and matrix operations

    2x arithmetic and logical operations improvement

    Supports floating-point arithmetic. Single-precision floating-point operation capability is the same as 32-bit fixed-point operation capability.

    Division and square root are supported by floating-point instructions.


    C64x+ C66x Comparison

    Operation            Precision                       Operations per     Operations per     Function unit
                                                         cycle on C64x+     cycle on C66x
    MAC                  Real 8 x 8                      2 x 4 = 8          2 x 8 = 16         M1, M2
                         Real 16 x 16                    2 x 2 = 4          2 x 8 = 16         M1, M2
                         Real 32 x 32                    2 x 1 = 2          2 x 4 = 8          M1, M2
                         Complex (16,16) x (16,16)       2 x 1 = 2          2 x 4 = 8          M1, M2
                         Complex (32,32) x (32,32)       N/A                2 x 1 = 2          M1, M2
    Arithmetic/Logical   8 bit                           4 x 4 = 16         4 x 8 = 32         L1, L2, S1, S2
                         16 bit                          4 x 2 = 8          4 x 4 = 16         L1, L2, S1, S2
                         32 bit                          4 x 1 = 4          4 x 2 = 8          L1, L2, S1, S2
    Memory Access        8 bit, 16 bit, 32 bit, 64 bit   2 x 1 = 2          2 x 1 = 2          D1, D2


    Outline

    C6678 DSP Overview

    Multi-core DSP programming

     – Memory Architecture Overview

     – Shannon Memory Architecture Improvement

     – Programming model

    Interconnection and resource sharing

    Peripherals overview


    TCI6486 Memory Architecture

    [Figure: TCI6486 memory architecture. Labeled blocks: Core 0 through Core N, each with internal L2 RAM; shared L2 control (256 bit, core speed/2) with shared L2 RAM and shared L2 ROM; DMA SCR (128 bit, core speed/3); EDMA; external memory. Ports are marked M (master) and S (slave).]

    Shannon Memory Architecture

    [Figure: Shannon memory architecture. Labeled blocks: Core 0 through Core N, each with internal L2 RAM; shared memory control (256 bit, core speed/2) with shared L2 RAM and external memory; DMA SCR (128 bit, core speed/3); EDMA. Ports are marked M (master) and S (slave).]


    Addition of XMC

    • Bring over existing EMC MDMA path
    • Fat pipe to external (and internal) shared memory
     – Bus width: 256 instead of 128 bits
     – Clock rate: CPUCLK/2 instead of CPUCLK/3
    • Optimize requests for MSMC / DDR3 memory
     – L2 line allocations and evictions are split into sub-lines of 64 bytes
    • Memory Protection and Address Extension (MPAX) support
     – 16 segments of programmable size (powers of 2: 4KB to 4GB)
     – Each segment maps a 32-bit address to a 36-bit address.
     – Each segment controls access: supervisor/user, R/W/X, (non-)secure
     – Memory protection for shared internal MSMC memory and external DDR3 memory
    • Multi-stream prefetch support
     – Program prefetch buffer up to 128 bytes
     – Data prefetch buffer up to 8 x 128 bytes
     – Prefetch enabled/disabled on 16MB ranges defined in MAR
     – Manual flush for coherence purposes
    • Note: no IDMA path


    MAR Register Extension

    • The L2 memory controller extends the MAR registers by adding the “PFX” field. The L2 memory controller uses this bit to tell the XMC whether a given address range is prefetchable.
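
    As a rough illustration of the 16MB granularity, the sketch below computes which MAR covers a given address and what a MAR value with prefetch enabled might look like. The register address (MAR0 at 0x01848000) and the bit positions (cacheable in bit 0, PFX in bit 3) are assumptions that do not come from these slides; check them against the device documentation before relying on them.

```python
# Sketch only: MAR_BASE, PC_BIT and PFX_BIT are assumptions, not taken from the slides.
MAR_BASE = 0x01848000   # assumed address of MAR0
PC_BIT = 0              # assumed position of the "cacheable" bit
PFX_BIT = 3             # assumed position of the "prefetchable" (PFX) bit
RANGE_BITS = 24         # each MAR covers 2**24 bytes = 16 MB (from the slides)


def mar_index(addr: int) -> int:
    """Return the index of the MAR that covers a 32-bit logical address."""
    return (addr & 0xFFFFFFFF) >> RANGE_BITS


def mar_address(index: int) -> int:
    """Return the assumed register address of MAR[index] (4 bytes per register)."""
    return MAR_BASE + 4 * index


def mar_value(cacheable: bool, prefetchable: bool) -> int:
    """Build a MAR value from the two attribute flags."""
    return (int(cacheable) << PC_BIT) | (int(prefetchable) << PFX_BIT)


if __name__ == "__main__":
    idx = mar_index(0x80000000)
    print(idx, hex(mar_address(idx)), bin(mar_value(True, True)))
```

    Because the attribute applies to a whole 16MB range, two buffers that need different prefetch behaviour have to live in different 16MB ranges.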

    MSMC Block Diagram

    [Figure: MSMC block diagram. N CGEM slave ports (one per CGEM core), a system slave port for shared SRAM (SMS) and a system slave port for external memory (SES), all 256-bit VBUSM; Memory Protection and Extension Units (MPAX) on the system slave ports; MSMC datapath with arbitration for the RAM banks (256 bits per bank) and EDC for SRAM; MSMC system master port and MSMC EMIF master port (64-bit DDR3 EMIF); SCR connections; events output.]

    • One slave interface per C66x Megamodule (256 bits @ CPUCLK/2)
     – Uses a 36-bit address extended inside a C66x Megamodule core
    • Two slave interfaces (256 bits @ CPUCLK/2) for access from system masters
     – SMS interface for accesses to MSMC SRAM space
     – SES interface for accesses to DDR3 space
     – Both interfaces support memory protection and address extension
    • One master interface (256 bits @ CPUCLK/2) for access to the DDR3 EMIF
    • One master interface (256 bits @ CPUCLK/2) for access to system slaves


    MSMC Shared Memory

    • 4 banks x 2 sub-banks; each sub-bank is 256 bits wide.
     – Reduces conflicts between C66x Megamodule cores and system masters
    • Features dynamic fair-share bank arbitration for each transfer
    • Supports bandwidth management, which avoids indefinite starvation of lower-priority requests by higher-priority requests
    • Features not supported
     – Cache coherency between L1/L2 caches in C66x Megamodule cores and MSMC memory
     – Cache coherency between XMC prefetch buffers and MSMC memory
     – C66x Megamodule to C66x Megamodule cache coherency via MSMC


    MPAX Units

    • MPAX stands for “Memory Protection and Address Extension”
    • There are N+2 MPAX units in a system with N C66x Megamodules
     – N MPAX units for all requests from the N C66x Megamodules to internal shared memory, external shared memory or any system slave
     – 1 MPAX unit for all requests from any system master to internal shared memory
     – 1 MPAX unit for all requests from any system master to external shared memory
    • Each MPAX unit operates on a number of segments of programmable size
     – Each segment maps a 32-bit address to a 36-bit address.
     – Each segment controls access.


    Number of Segments

    • Each C66x Megamodule has 16 segments which control direct (load/store) requests to internal shared memory, external shared memory and any other system slave.
    • Any master identified by a privilege ID has
     – 8 segments for requests to internal shared memory
     – 8 segments for requests to external shared memory
    • Some masters work on behalf of other masters. They inherit the privilege ID of their commanding master. As such, each C66x Megamodule also has
     – 8 segments for indirect (DMA) requests to internal shared memory
     – 8 segments for indirect (DMA) requests to external shared memory


    Segment Definition

    • Each segment is defined by a base address and a size
     – The segment size can be set to any power of 2 from 4KB to 4GB
     – The segment base address is constrained to a power-of-2 boundary equal to the size
    • One would expect each request to find exactly one matching segment, however ...
     – ... a request may find two or more matching segments, in which case segments with a higher ID take priority over segments with a lower ID. This allows creating non-power-of-2 segments, and creating 3 segments with just 2 segment definitions
     – ... a request may find no matching segment, in which case an error is reported in the memory protection fault reporting registers (XMPFAR, XMPFSR)
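
    The size and alignment rules above can be checked mechanically. The following is a minimal sketch in Python; the function names are hypothetical and only model the two constraints stated on this slide (power-of-2 size from 4KB to 4GB, base aligned to the size).

```python
KB = 1 << 10
GB = 1 << 30
MIN_SIZE = 4 * KB   # smallest segment size named in the slides
MAX_SIZE = 4 * GB   # largest segment size named in the slides


def is_power_of_two(n: int) -> bool:
    """Return True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def is_valid_segment(base: int, size: int) -> bool:
    """Return True if (base, size) obeys the size and alignment rules above."""
    if not is_power_of_two(size) or not MIN_SIZE <= size <= MAX_SIZE:
        return False
    return base % size == 0   # base must sit on a boundary equal to the size


if __name__ == "__main__":
    print(is_valid_segment(0x80000000, 2 * GB))   # aligned 2 GB segment
    print(is_valid_segment(0x40000000, 2 * GB))   # 1 GB boundary, not 2 GB
```

    A region of 12KB, for example, cannot be described by one segment, but a 16KB segment with a 4KB higher-ID segment laid over part of it leaves 12KB governed by the lower-ID settings.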

    XMC Segment Registers XMPAXH/XMPAXL[15-0]

    MPAX Default Memory Map

    [Figure: MPAX default memory map. The CGEM logical 32-bit memory map (0000_0000 – FFFF_FFFF; the range 0000_0000 – 0BFF_FFFF is not remappable) is mapped onto the system physical 36-bit memory map (lower 4GB: 0:0000_0000 – 0:FFFF_FFFF; upper 60GB: 1:0000_0000 – F:FFFF_FFFF).]

    Segment 0: BADDR = 00000h; RADDR = 000000h; Size = 2GB
    Segment 1: BADDR = 80000h; RADDR = 800000h; Size = 2GB (8000_0000 – FFFF_FFFF mapped to 8:0000_0000 – 8:7FFF_FFFF)
    Segments 2 to 15: disabled

    XMC configures MPAX segments 0 and 1 so that the C66x Megamodule can access system memory.

    The power-up configuration is that segment 1 remaps 8000_0000 – FFFF_FFFF in the C66x Megamodule's address space to 8:0000_0000 – 8:7FFF_FFFF in the system address map.

    This corresponds to the first 2GB of address space dedicated to EMIF by the MSMC controller.
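
    The BADDR and RADDR values quoted for the default map follow from dropping the low 12 bits of the base addresses (4KB granularity). A small sketch of that arithmetic, with hypothetical helper names:

```python
PAGE_SHIFT = 12   # BADDR/RADDR omit the low 12 address bits (4 KB granularity)


def baddr_field(logical_base: int) -> int:
    """Upper 20 bits of a 32-bit logical base address."""
    return (logical_base & 0xFFFFFFFF) >> PAGE_SHIFT


def raddr_field(physical_base: int) -> int:
    """Upper 24 bits of a 36-bit physical base address."""
    return (physical_base & 0xFFFFFFFFF) >> PAGE_SHIFT


if __name__ == "__main__":
    # Segment 1 of the default map: 8000_0000 is remapped to 8:0000_0000.
    print(hex(baddr_field(0x80000000)), hex(raddr_field(0x800000000)))
```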


    MPAX MSMC Aliasing Example

    [Figure: MPAX MSMC aliasing. Three segments map the 2MB MSMC RAM at physical 0:0C00_0000 – 0:0C1F_FFFF into the CGEM 32-bit memory map (0000_0000 – 0BFF_FFFF is not remappable):]

    BADDR = 0C000h; RADDR = 00C000h; Size = 2MB: “Fast” MSMC RAM at 0Cxx_xxxx
    BADDR = 20000h; RADDR = 00C000h; Size = 2MB: MSMC RAM Alias 1 at 20xx_xxxx
    BADDR = 21000h; RADDR = 00C000h; Size = 2MB: MSMC RAM Alias 2 at 21xx_xxxx

    The example shows 3 segments that map the MSMC RAM address space into the C66x Megamodule's address space as three distinct 2MB ranges. By programming the MARs accordingly, the three segments could have different semantics.

    Accesses to MSMC RAM via an alias (Alias 1 or Alias 2) do not use the “fast RAM” path and incur additional cycles of latency.


    MPAX Overlayed Segments Example

    [Figure: MPAX overlaid segments. The CGEM 32-bit memory map (0000_0000 – FFFF_FFFF; 0000_0000 – 0BFF_FFFF is not remappable) is mapped onto the system 36-bit memory map (lower 4GB: 0:0000_0000 – 0:FFFF_FFFF; upper 60GB: 1:0000_0000 – F:FFFF_FFFF). Logical C000_7xxx is mapped to physical 0:5004_2xxx instead of 0:C000_7xxx.]

    Segment 0: BADDR = 00000h; RADDR = 000000h; Size = 2GB
    Segment 1: BADDR = 80000h; RADDR = 080000h; Size = 2GB
    Segment 2: BADDR = C0007h; RADDR = 050042h; Size = 4K
    Segments 3 to 15: disabled

    Segment 1 matches 8000_0000 through FFFF_FFFF, and segment 2 matches C000_7000 through C000_7FFF.

    Because segment 2 is higher priority than segment 1, its settings take priority, effectively carving a 4K hole in segment 1's 2GB address space.

    Furthermore, it maps this 4K space to 0:5004_2000 - 0:5004_2FFF, which overlaps the mapping established by segment 0. This physical address range is now accessible by two logical address ranges.
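
    The lookup described in this example can be modelled in a few lines. This is a simplified sketch, not the hardware algorithm: it ignores permissions and the non-remappable range, and the `Segment` type and `translate` function are hypothetical names. The segment values are the ones given on this slide.

```python
from typing import List, NamedTuple, Optional

GB = 1 << 30


class Segment(NamedTuple):
    baddr: int   # logical base address >> 12
    raddr: int   # physical (36-bit) base address >> 12
    size: int    # segment size in bytes


# Segments 0, 1 and 2 from the overlaid-segments example; list index = segment ID.
SEGMENTS: List[Segment] = [
    Segment(0x00000, 0x000000, 2 * GB),
    Segment(0x80000, 0x080000, 2 * GB),
    Segment(0xC0007, 0x050042, 4 * 1024),
]


def translate(addr: int, segments: List[Segment] = SEGMENTS) -> Optional[int]:
    """Map a 32-bit logical address to a 36-bit physical address.

    Segments are searched from the highest ID down, so a higher ID wins
    when several match. None stands in for the fault that the hardware
    would report in XMPFAR/XMPFSR when nothing matches.
    """
    for seg in reversed(segments):
        base = seg.baddr << 12
        if base <= addr < base + seg.size:
            return (seg.raddr << 12) + (addr - base)
    return None


if __name__ == "__main__":
    for logical in (0xC0007123, 0x50042123, 0xC0008000):
        print(hex(logical), "->", hex(translate(logical)))
```

    Running it shows that C000_7123 and 5004_2123 both land on 0:5004_2123, while C000_8000, just outside the 4K hole, still follows segment 1.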


    single program image

    [Figure: single program image on C6000 cores 0, 1 and 2, each with L1 program, L1 data and L2 memory. Two placements are shown: the same App.out (code and read/write data) loaded into each core's L2 memory, or shared code and read-only data of App.out placed in shared L2 or DDR memory, with separate per-core data areas Data 0, Data 1 and Data 2.]

    • Same image on each DSP core
    • Aliased addressing used for DSP core to access local L2
    • DNUM DSP core register for:
     – Global addressing when programming EDMA3, SRIO, …
     – Separate buffer per DSP core in DDR: dp = bufBase + BUF_SIZE*DNUM
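
    The two uses of DNUM can be written out as address arithmetic. The sketch below passes the core number in as a plain argument; the address constants (local L2 alias at 0x00800000, global view of core n at 0x10000000 + n * 0x01000000) are assumptions based on the usual KeyStone memory map and are not stated in the slides.

```python
# Assumed address layout, not taken from the slides.
GLOBAL_BASE = 0x10000000   # assumed global base of core 0's local memories
CORE_STRIDE = 0x01000000   # assumed address distance between consecutive cores


def local_to_global(local_addr: int, dnum: int) -> int:
    """Convert an aliased local address into the global address of core `dnum`."""
    if not 0 <= local_addr < CORE_STRIDE:
        raise ValueError("not a local (aliased) address")
    return GLOBAL_BASE + dnum * CORE_STRIDE + local_addr


def per_core_buffer(buf_base: int, buf_size: int, dnum: int) -> int:
    """dp = bufBase + BUF_SIZE * DNUM, as on the slide."""
    return buf_base + buf_size * dnum


if __name__ == "__main__":
    # Core 3 hands a buffer in its local L2 to a DMA engine.
    print(hex(local_to_global(0x00800000, 3)))
    print(hex(per_core_buffer(0x80000000, 0x10000, 2)))
```

    A DMA engine sits outside the core, so it cannot use the core's aliased local address; that is why the global form is needed when programming EDMA3 or SRIO.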


    Shannon MPAX enables easy single program image

    [Figure: MPAX maps the virtual (CGEM) address spaces 1 to n onto the SoC address space. Each CGEM address space contains code1, data2, code2 and data3 in internal MSMC RAM and external memory; in the SoC address space the code appears once while data2 and data3 each appear once per core.]

    multiple program image

    [Figure: multiple program images on C6000 cores 0, 1 and 2, each with L1 program, L1 data and L2 memory. Each core runs its own image (App0.out, App1.out, App2.out), placed either in the core's L2 memory or in shared L2 or DDR memory, with separate data areas Data 0, Data 1 and Data 2.]

    Each DSP core has its own image

    Static split of DDR2 per DSP core

    Global or local addressing used for L2 addressing


    Shannon Software

    • Flexible development environment for the customer.
    • Customer can choose to develop their application using all or any one of the software layers.
    • Will contain the following software layers
     – BIOS and Linux Operating System support
     – Chip Support Library
     – Platform Development Kit
     – Inter Core Communication
     – Optimized DSP functions library
     – Optimized Audio, Video and Speech codecs
     – Voice Gateway Demonstration Kit
     – Video Transcoding Demonstration Kit
     – Demonstration applications

    [Figure: C6678 software stack. Operating system with boot loader (BIOS, Linux); Chip Support Library; Platform Development Kit; Inter Core Communication; DSPLIB and speech, audio and video codecs; Voice Gateway Demonstration Kit, Video Transcoding Demonstration Kit and demo application; customer application. Side labels: Full Silicon Entitlement, Multi-core Entitlement.]

    Shannon Debug

    • Best multicore debug and visualization
    • Debug-enabled multicore SoC
    • Debug visibility at core, across multicore and for SoC

    [Figure: the C6678 (Shannon) functional block diagram with debug additions: trace (TRACE), embedded trace buffer (ETB) and data visualization.]

    Outline

    C6678 DSP Overview

    Multi-core DSP programming

    Interconnection and resource sharing

     – Interconnection Architecture

     – Shannon Hardware queue

     – Inter-core communication

     – Shared Resource Management

    Peripherals overview


    Shannon Switch Fabric

    [Figure: Shannon switch fabric. Labels recoverable from the figure: MSMC_SS with shared L2 RAM and DDR3, reached by the eight GEM cores through XMC; a 256-bit VBUSM SCR at CPU/2; a 128-bit VBUSM SCR at CPU/3; a 32-bit VBUSP CONFIG SCR at CPU/3; masters and slaves including SRIO, PCIe, QM_SS, PA_SS, VUSR, TSIP 0/1 and EMIF16; EDMA_0 (16-channel DMA with transfer controllers 0 and 1) and EDMA_1, 2 (64-channel DMA each, with transfer controllers 2 to 5 and 6 to 9).]


    Hardware Queue Architecture

    • Packetized data transfer architecture designed to minimize DSP core interaction while maximizing memory and bus efficiency
    • The key communication platform for TI's future infrastructure DSPs
    • Used by the following peripherals in Shannon:
     – Serial RapidIO, Packet Accelerator
    • Each module contains its own DMA to transfer the data associated with the ‘jobs’; no CPU resources are involved


    Hardware Queue

    • Producer writes ‘jobs’ into a queue; consumer reads ‘jobs’ from the queue.
    • Supports multiple in, multiple out
     – Multiple producers can write to the same queue: used to share common hardware
     – Multiple consumers can read from the same queue: used for load balancing
    • Abstracts the consumer
     – Consumer can be a hardware IP (accelerator, peripheral) or software (i.e. a CPU core)
     – Transparent for the producer
     – ‘Easy’ to upgrade to new hardware. The ‘job gets done’.
     – Minimizes changes to host software; easy maintenance

    [Figure: producers (CPU1, CPU2, CPU3, Packet Acc., RapidIO, ...) send a ‘job’ to queues 1..x in the Queue Manager (queue controller, DMA); consumers (CPUx, Acc 1, Acc 2, RapidIO, Peri x, …) retrieve a ‘job’.]


    Packet Queuing Data Structure Diagram


    Hardware Queue Operation

    Push to a queue
     – Host writes the pointer of a new descriptor to a queue register.
     – Queue manager links (modifies the link RAM) the new descriptor to the tail (or head) of the queue.
     – Tail (or head) pointer points to the new descriptor.

    Pop from a queue
     – Host reads a descriptor pointer from a queue register.
     – Queue manager returns the descriptor pointed to by the head pointer.
     – Head pointer points to the next descriptor.

    Monitor queue
     – Queue manager generates events when a queue changes: not empty, entry count, threshold exceeded, starvation…

    Queue diversion
     – Entire queue contents can be cleared or moved to another queue destination using a single register write.
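
    The push, pop and diversion operations above can be mimicked with a toy model in which descriptors are plain indices and a shared link RAM holds one forward pointer per descriptor. This is only an illustrative sketch: the class and method names are hypothetical, and it pushes at the tail only.

```python
from typing import List, Optional


class QueueManagerModel:
    """Toy queue manager: per-queue head/tail plus one shared link RAM."""

    def __init__(self, num_queues: int, num_descriptors: int) -> None:
        self.link_ram: List[Optional[int]] = [None] * num_descriptors
        self.head: List[Optional[int]] = [None] * num_queues
        self.tail: List[Optional[int]] = [None] * num_queues

    def push(self, queue: int, desc: int) -> None:
        """Link descriptor `desc` to the tail of `queue`."""
        self.link_ram[desc] = None
        if self.tail[queue] is None:
            self.head[queue] = desc                    # queue was empty
        else:
            self.link_ram[self.tail[queue]] = desc     # old tail points to new one
        self.tail[queue] = desc

    def pop(self, queue: int) -> Optional[int]:
        """Unlink and return the head descriptor, or None if the queue is empty."""
        desc = self.head[queue]
        if desc is None:
            return None
        self.head[queue] = self.link_ram[desc]
        if self.head[queue] is None:
            self.tail[queue] = None
        self.link_ram[desc] = None
        return desc

    def divert(self, src: int, dst: int) -> None:
        """Move the whole content of `src` to the tail of `dst` by relinking once."""
        if self.head[src] is None:
            return
        if self.tail[dst] is None:
            self.head[dst] = self.head[src]
        else:
            self.link_ram[self.tail[dst]] = self.head[src]
        self.tail[dst] = self.tail[src]
        self.head[src] = None
        self.tail[src] = None


if __name__ == "__main__":
    qm = QueueManagerModel(num_queues=2, num_descriptors=32)
    for d in (17, 5, 19):
        qm.push(0, d)
    qm.push(1, 18)
    print(qm.pop(0))      # first descriptor pushed to queue 0
    qm.divert(0, 1)       # the rest of queue 0 is appended to queue 1
    print([qm.pop(1) for _ in range(4)])
```

    Only pointers move in this model; no descriptor or buffer is copied, which is why linking and diversion stay cheap even for large queues.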


    Shannon Hardware queue architecture

    [Figure: Shannon hardware queue architecture. DSP cores, Packet DMA (SRIO) and Packet DMA (PA) connect over VBUS to the Queue Manager Subsystem, which contains the queue manager with queue interfaces Q0, Q1, ... Qx, the link RAM, the descriptor RAM, an internal Packet DMA and two APDSPs. Buffer memory and accumulation buffers sit outside the subsystem. Queue events go to the Packet DMAs and APDSPs; queue interrupts go to the DSP cores.]

    Queue Manager Subsystem

    • Supports 8192 queues
    • HW queues are multi-core safe without mutual exclusion; multiple senders can use a destination queue without restrictions
    • Can notify Packet DMA when a transfer is pending
    • Can notify a DSP core when a packet is pending, and can copy descriptor pointers of transferred data to the destination core's local memory to reduce access latency
    • Internal Packet DMA
     – Transfers packets from one queue to another queue. Good for core-to-core data transfer.


    Descriptor RAM

    • Data elements (buffers) to be passed on queues are first described to a descriptor region manager built into the QM.
    • Although technically called descriptors, these memory elements can hold any arbitrary data.
    • The size of the data elements must be a power of 2, from 32 bytes to 8192 bytes in length.
    • 20 configurable memory regions (for descriptor storage)
    • The number of elements in the region must be a power of 2, from 32 buffers to 4096 buffers in the region.

    [Figure: memory descriptor region registers for Region 0 through Region 19, with example regions of 32-byte buffers and 256-byte buffers; base addresses 0x1000 and 0x2000 appear in the figure.]
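
    One way to picture the region rules is a lookup from a descriptor index to a memory address. The sketch below is an assumption-laden model, not the QM register interface: the `Region` fields and both example regions are hypothetical, and only the power-of-2 limits come from the slide.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    base: int          # memory address of the first element
    start_index: int   # descriptor index of the first element
    elem_size: int     # element size in bytes
    count: int         # number of elements in the region


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def region_is_valid(r: Region) -> bool:
    """Check the limits stated above: 32..8192 bytes, 32..4096 elements."""
    return (is_power_of_two(r.elem_size) and 32 <= r.elem_size <= 8192
            and is_power_of_two(r.count) and 32 <= r.count <= 4096)


def descriptor_address(regions: List[Region], index: int) -> int:
    """Return the memory address of the descriptor with the given index."""
    for r in regions:
        if r.start_index <= index < r.start_index + r.count:
            return r.base + (index - r.start_index) * r.elem_size
    raise IndexError("descriptor index not covered by any region")


# Hypothetical configuration: 32 small elements followed by 64 larger ones.
REGIONS = [
    Region(base=0x1000, start_index=0, elem_size=32, count=32),
    Region(base=0x2000, start_index=32, elem_size=256, count=64),
]

if __name__ == "__main__":
    print(all(region_is_valid(r) for r in REGIONS))
    print(hex(descriptor_address(REGIONS, 5)), hex(descriptor_address(REGIONS, 33)))
```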


    Linking RAM

    • Linking RAM contains 1 entry for each descriptor. A Linking RAM entry is effectively an extension of the descriptor.
    • Linking RAM stores the forward data pointer that is critical for the PUSH / POP operations performed by the Queue Manager.
    • Linkage between the physical address of a descriptor and the physical address of its Linking RAM entry is performed inside the QM using information provided in the Queue Management configuration registers.
    • Linking RAM is typically placed in local memory for speed. This allows data elements to be linked and unlinked in a queue very quickly, even though the buffers themselves may be in external memory.
    • There is no limit to the length of a single queue, only a limit on the total number of data elements in the system.
    • 2 configurable Linking RAM regions

    [Figure: Linking RAM forward pointer table and the resulting queue contents for Queue 0 and Queue 1, using example descriptor indices 17, 5, 19 and 18.]


    Queue Data Flow Example, Transmit

    [Figure: transmit data flow between the host processor, the Queue Manager (Rx queue, free descriptor queue, Tx queue), the interrupt generator, the Rx port and the Tx port. Steps shown:]

    INIT: Host allocates Rx free descriptors and initializes queues
    TX 1: Processor fetches a descriptor to fill with the data to transmit
    TX 2: Processor queues a packet to a Tx queue
    TX 3: Port transmits the buffer pointed to by the descriptor
    TX 4: Port posts the packet descriptor to the return queue

    Accumulator (A Programmable DSP)

    • Accumulator is used to help the DSP core efficiently pop descriptor pointers from a queue.
    • Accumulator pops descriptor pointers from a queue and writes them to accumulation memory (normally in DSP local memory).
    • Accumulator generates interrupts to the DSP core according to the interrupt pacing configuration.
    • Two accumulators (PDSP)
     – One generates 32 interrupts, each for one queue.
     – The other generates 16 interrupts, each a combined event for 32 queues, monitoring 16 x 32 queues in total.

    [Figure: the APDSP monitors queue changes in the Queue Manager through queue events, reads the descriptor RAM, writes descriptor pointers into the accumulation memory (descriptor pointer array) of the DSP core, and raises queue interrupts to the DSP core, using a timer for interrupt pacing.]


    Hardware queue Performance Consideration

    Push operation
     – 1 to 4 word write. Because the write is posted, it normally does not stall the DSP core.

    Pop operation
     – 1 to 4 word read. Stalls the DSP core for about 80 to 100 cycles.
     – The accumulator (PDSP) can pop the descriptors to DSP local memory, which saves DSP cycles dramatically.

    Descriptor access
     – Writing or reading a full descriptor may consume many cycles.
     – For most applications, the DSP core can initialize all descriptors during initialization and write or read only a few fields of the descriptor at run time.
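
    To get a feel for the pop cost, the stall figure can be turned into a share of core time. The packet rate below is a made-up example load, and the 1.0 GHz clock is the core speed quoted earlier in the deck; only the 80 to 100 cycle range comes from this slide.

```python
def pop_stall_fraction(pops_per_second: float, stall_cycles: float,
                       core_hz: float = 1.0e9) -> float:
    """Fraction of one core's cycles spent stalled on direct queue pops."""
    return pops_per_second * stall_cycles / core_hz


if __name__ == "__main__":
    rate = 1_000_000   # hypothetical load: one million pops per second
    low = pop_stall_fraction(rate, 80)
    high = pop_stall_fraction(rate, 100)
    print(f"{low:.0%} to {high:.0%} of the core is spent waiting on pops")
```

    Under these assumptions direct pops cost roughly a tenth of a core, which is the overhead the accumulator is meant to remove.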


    Shared Data in the L2 SRAM of transmitter

    If cache is enabled, Core Y needs invalidate cache beforeread

    [Diagram: DSP Core X and DSP Core Y, each with L1 cache, L2 cache, and L2 RAM, connected through the Data Switch Fabric Center to DDR2 SDRAM]

    Shared Data in the L2 SRAM of receiver

    [Diagram: DSP Core X and DSP Core Y, each with L1 cache, L2 cache, and L2 RAM, connected through the Data Switch Fabric Center to DDR2 SDRAM]

    If cache is enabled, Core X needs to write back its cache after writing.

    Shared Data in the shared memory

    [Diagram: DSP Core X and DSP Core Y, each with L1 cache, L2 cache, and L2 RAM, connected through the Data Switch Fabric Center to Shared L2 or DDR]

    If cache is enabled, Core X needs to write back its cache after writing; Core Y needs to invalidate its cache before reading.
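    The write-back and invalidate rules on the shared data slides can be illustrated with a toy write-back cache. This is a minimal sketch and not the C66x cache controller or any TI API; the class, its methods, and the address are hypothetical.

```python
# Toy write-back cache model; not the C66x cache controller or any TI API.
class ToyCache:
    def __init__(self, memory):
        self.memory = memory      # shared backing store (dict: address -> value)
        self.lines = {}           # locally cached copies
        self.dirty = set()        # addresses modified locally, not yet written back

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from shared memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]               # hit: may return a stale copy

    def write(self, addr, value):
        self.lines[addr] = value              # the write stays in the cache
        self.dirty.add(addr)

    def write_back(self, addr):
        if addr in self.dirty:                # push the local copy to shared memory
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)

    def invalidate(self, addr):
        self.lines.pop(addr, None)            # drop the local copy
        self.dirty.discard(addr)


shared = {0x100: 0}
core_x, core_y = ToyCache(shared), ToyCache(shared)

core_y.read(0x100)             # Core Y caches the old value
core_x.write(0x100, 42)        # Core X updates only its own cache
stale = core_y.read(0x100)     # Core Y still sees the old value

core_x.write_back(0x100)       # Core X: write back after write
core_y.invalidate(0x100)       # Core Y: invalidate before read
fresh = core_y.read(0x100)
print(stale, fresh)
```

    Skipping either step leaves Core Y with the old value in this model: without the write-back the new data never reaches shared memory, and without the invalidate Core Y keeps hitting its cached copy.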

    Use IPC register for inter-core communication

    [Diagram: DSP Core X and DSP Core Y, each with L1 cache, L2 cache, and L2 RAM, connected through the Configuration Switch Fabric to the IPC registers]

    Interrupt is generated for Core Y

    No cache coherency issue

    Inter-core Data Block exchange with EDMA

    [Diagram: DSP Core X and DSP Core Y, each with L1 cache, L2 cache, and L2 RAM holding a data block, connected through the Data Switch Fabric Center to the EDMA]

    Interrupt is generated for Core Y

    No cache coherency issue

    Inter-core data exchange through hardware queue (Packet DMA copy)

    [Diagram: DSP Core X and DSP Core Y, each with L1 cache, L2 cache, and L2 RAM, connected through the Data Switch Fabric Center to the Packet DMA with its source queue (Src Que) and destination queue (Dst Que)]

    Core X simply pushes data to the Source Queue

    Packet DMA transfers the data to the Dest Queue

    Core Y simply pops data from the Dest Queue

    If queue buffers are in L2 RAM, software on both cores does not need to maintain cache coherency.

    Inter-core data exchange through hardware queue (Zero Copy)

    Core X pushes data to the Shared Queue; Core Y pops data from the Shared Queue

    Multiple cores can access the Shared Queue simultaneously without mutual exclusion

    Software needs to maintain cache coherency.

    [Diagram: DSP Core X and DSP Core Y, each with L1 cache, L2 cache, and L2 RAM, connected through the Data Switch Fabric Center to the Queue Manager and its Shared Queue]
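    The difference between the Packet DMA copy and zero-copy schemes above can be sketched with ordinary Python queues. This is only a toy stand-in for the hardware queues, not a TI API; `packet_dma_copy` and the buffer names are hypothetical.

```python
from collections import deque


# Toy queues holding references to buffers; a stand-in for hardware queues.
def packet_dma_copy(src_queue, dst_queue, dst_buffer):
    """Model of the Packet DMA path: data is copied into the receiver's buffer."""
    src_buffer = src_queue.popleft()
    dst_buffer[:len(src_buffer)] = src_buffer   # the copy done by the DMA engine
    dst_queue.append(dst_buffer)


# Packet DMA copy: Core X pushes, the DMA copies, Core Y pops its own buffer.
src_queue, dst_queue = deque(), deque()
tx_buffer = bytearray(b"hello")
rx_buffer = bytearray(len(tx_buffer))
src_queue.append(tx_buffer)                     # Core X push
packet_dma_copy(src_queue, dst_queue, rx_buffer)
received_copy = dst_queue.popleft()             # Core Y pop

# Zero copy: both cores use one shared queue and the very same buffer.
shared_queue = deque()
shared_queue.append(tx_buffer)                  # Core X push
received_zero_copy = shared_queue.popleft()     # Core Y pop

print(received_copy is tx_buffer, received_zero_copy is tx_buffer)
```

    In the zero-copy case Core Y reads the memory that Core X wrote, possibly through Core X's cache, which is why the zero-copy slide leaves cache maintenance to software.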

    Outline

    C6678 DSP Overview

    Multi-core DSP programming

    Interconnection and resource sharing

    Interconnection Architecture

    Shannon Hardware queue

    Inter-core communication

    Shared Resource Management 

    Peripherals overview

    Shared resources

    Internal shared L2 and External Shared memory (DDR)

    Each core accesses shared memory independently. Arbitration is handled by the switch fabric and end-point arbiters.

    Shared on-chip Peripherals

    Configuration: Typically done at startup to set the operating mode of a particular logic block (e.g. DDR settings). Should be done by a single core as part of the boot process.

    Use: For peripherals with a hardware queue, each core accesses the hardware queue independently. Arbitration is handled by the queue manager.

    Ethernet, SRIO on Shannon… 

    Multi-channel peripherals can be split amongst the cores for concurrent, orthogonal control

    EDMA, TSIP, Timer… 

    Single-channel peripherals can be controlled by a single master, servicing the other cores if needed, or used mutually exclusively by multiple masters through a semaphore.

    I2C, SPI… 

    System-level prioritization for arbitration

    A user-specified priority may be assigned to:

    Any DSP core accesses

    Any EDMA, sRIO, Ethernet, … on-chip transfers

    Each master port is assigned a configurable priority (8 levels)

    Hardware Semaphores on Shannon for atomic accesses

    What function does the Semaphore module provide?

    A method to control who accesses a shared resource

    Provides access to shared resources in an atomic manner

    Read-modify-write sequence is not broken

    Features of the Semaphore module

    Binary Semaphore

    Contains 64 semaphores to be used within the system

    Two methods of accessing a semaphore resource

    Direct Access

    A core directly accesses a semaphore resource. If it is free, the semaphore is granted. If not, the semaphore is not granted.

    Useful if the system can afford to poll for the semaphore

    Indirect access

    A core indirectly accesses a semaphore resource by writing to it. Once it is free, an interrupt notifies the DSP core that it is available.
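    The two access methods above can be contrasted with a toy model of a single binary semaphore. This is a sketch under stated assumptions, not the register interface of the hardware module: the class and method names are hypothetical, and the first-come-first-served waiter order and the notification on an immediately granted indirect request are assumptions.

```python
from collections import deque


# Toy model of one binary semaphore; not the hardware register interface.
class ToySemaphore:
    def __init__(self):
        self.owner = None          # core currently holding the semaphore
        self.waiting = deque()     # cores queued by indirect requests (assumed FIFO)
        self.interrupts = []       # cores that were notified of a grant

    def direct_request(self, core):
        """Direct access: grant immediately if free, otherwise refuse."""
        if self.owner is None:
            self.owner = core
            return True
        return False

    def indirect_request(self, core):
        """Indirect access: queue the request; an interrupt signals the grant."""
        if self.owner is None:
            self.owner = core
            self.interrupts.append(core)
        else:
            self.waiting.append(core)

    def release(self, core):
        if self.owner != core:
            raise ValueError("only the owner can release")
        self.owner = None
        if self.waiting:                       # hand over to the next waiter
            self.owner = self.waiting.popleft()
            self.interrupts.append(self.owner)


sem = ToySemaphore()
got_0 = sem.direct_request(0)    # core 0 gets the semaphore
got_1 = sem.direct_request(1)    # core 1 is refused and would have to poll
sem.indirect_request(2)          # core 2 queues up instead of polling
sem.release(0)                   # core 2 is granted and notified
print(got_0, got_1, sem.owner, sem.interrupts)
```

    Direct access suits a core that can afford to retry, while indirect access lets a core continue with other work until the grant notification arrives.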

    Outline

    C6678 DSP Overview

    Multi-core DSP programming

    Interconnection and resource sharing

    Peripherals overview

    Shannon RapidIO Gen 2 Features and Enhancements

    4 lanes – options include 2x

    Baud rates: 5 Gbaud per lane in addition to 1.25, 2.5, 3.125 Gbaud per lane

    DeviceID Support

    16 Local DeviceIDs (up from 1)

    8 Multicast IDs (up from 3)

    24 Interrupt outputs (up from 8)

    Messaging

    Type 9 Packets Support (Data Streaming)

    Type 11 Message – classification improvements

    DirectIO

    8 Load/Store (DirectIO) Units (up from 4)

    Shadow register sets for LSUs to simplify management and minimize overhead

    Provide up to 1MB block transfers (up from 4KB)

    Packet Forwarding with Reset Isolation

    RapidIO – Topology Examples

    [Diagram: example topologies of multiple C6678 DSPs connected as a mesh, a chain, a ring, and through SRIO switches]

    Packet Accelerator Subsystem On Shannon

    3 Port Ethernet Switch

    Port 0: Internal hardware queue port

    Port 1: SGMII 0 Port, 1Gbps

    Port 2: SGMII 1 Port, 1Gbps

    Packet Accelerator (PA)

    L2, L3, and L4 packet processing

    1.5M packets per sec

    Security Accelerator (SA)

    Encryption/Decryption

    IPSEC ESP

    IPSEC AH

    SRTP

    3GPP

    IEEE 1588 support

    EMAC hardware supports classifying ingress and egress frames at the physical level as timing synchronization frames and recording their timestamps.

    A software algorithm running on a DSP core then calculates the delay and adjusts the local time accordingly.

    Device A is the master device

    Device B is the slave device

    Message B is used to send the actual transmit time (tA) of Message A

    Message D is used to send the actual receive time (rC) of Message C

    Wire time in one direction = ((rC - tA) - (tC - rA)) / 2
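    A worked example of the wire time formula may help. The sketch below assumes a symmetric path, with tA and rC read on the master clock and rA and tC read on the slave clock. The timestamps are hypothetical, and `slave_offset` is the standard two-way offset estimate, which is not stated on the slide.

```python
# Worked example of the one-way wire time formula from the slide.
# tA, rC are read on the master clock (Device A); rA, tC on the slave clock (Device B).
def wire_time(tA, rA, tC, rC):
    """One-way delay: ((rC - tA) - (tC - rA)) / 2."""
    return ((rC - tA) - (tC - rA)) / 2


def slave_offset(tA, rA, tC, rC):
    """Slave clock minus master clock (standard estimate, not on the slide)."""
    return ((rA - tA) - (rC - tC)) / 2


# Hypothetical timestamps: true delay 5, slave clock 100 ahead of the master.
tA = 1000            # master sends Message A
rA = 1105            # slave receives it (1000 + 5 delay + 100 offset)
tC = 1200            # slave sends Message C
rC = 1105            # master receives it (1200 - 100 offset + 5 delay)

print(wire_time(tA, rA, tC, rC), slave_offset(tA, rA, tC, rC))
```

    The clock offset cancels out of the wire time because it enters the two directions with opposite signs, so the formula works before the clocks are aligned.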

    TSIP Overview

    1024 8-bit timeslots, receive and transmit.

    8 links of 128 timeslots at 8.192 Mbps.

    4 links of 256 timeslots at 16.384 Mbps.

    2 links of 512 timeslots at 32.768 Mbps.

    Two clock and frame sync inputs.

    Independent clocking – 1 receive clock and 1 transmit clock.

    Redundant/common clocking – 1 receive/transmit clock with second clock as backup.
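    The three link configurations above are consistent with one another, as the short check below shows. It assumes the usual 8 kHz TDM frame rate, which is not stated on the slide; the names are hypothetical.

```python
# Check the TSIP link configurations listed on the slide.
# FRAME_RATE_HZ is an assumption (the usual 8 kHz TDM frame rate).
FRAME_RATE_HZ = 8000
BITS_PER_TIMESLOT = 8

CONFIGS = [(8, 128), (4, 256), (2, 512)]   # (links, timeslots per link)


def link_rate_bps(timeslots_per_link):
    """Bit rate of one link: timeslots per frame x bits per timeslot x frames per second."""
    return timeslots_per_link * BITS_PER_TIMESLOT * FRAME_RATE_HZ


for links, timeslots in CONFIGS:
    total = links * timeslots
    print(f"{links} links x {timeslots} timeslots = {total} timeslots, "
          f"{link_rate_bps(timeslots)} bit/s per link")
```

    Every configuration yields the same 1024 timeslots in total, so the choice only trades the number of links against the per-link rate.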

    Shannon PCIe Interface

    Nyquist/Shannon incorporates a PCIe interface with the following characteristics:

    Two SERDES lanes running at 5 GBaud / 2.5 GBaud

    Gen2 compliant

    Three different operational modes (default defined by pin inputs at power up; can be overwritten by software):

    Root Complex (RC)

    End Point (EP)

    Legacy End Point

    Single Virtual Channel (VC)

    Single Traffic Class (TC)

    Maximum Payloads

    Egress – 128 bytes

    Ingress – 256 bytes

    Configurable BAR filtering, IO filtering and configuration filtering

    Remaining Peripherals & System Elements (1/2)

    EMIF16

    Supports NAND flash memory, up to 256MB

    Supports NOR flash up to 16MB

    Supports asynchronous SRAM mode, up to 1MB

    Used for booting, logging, announcement, etc.

    64-Bit Timers

    Total of 16 64-bit timers

    One 64-bit timer per core is dedicated to serve as a watchdog (or may be used as a general purpose timer)

    Eight 64-bit timers are shared for general purpose timers

    Each 64-bit timer can be configured as two individual 32-bit timers

    Timer Input/Output pins

    Two timer Input pins

    Two timer Output pins

    Timer input pins can be used as GPI

    Timer output pins can be used as GPO

    Remaining Peripherals & System Elements (2/2)

    UART Interface

    Operates at up to 128,000 baud

    I2C Interface

    Supports 400Kbps throughput

    Supports full 7-bit address field

    Supports EEPROM size of 4 Mbit

    SPI Interface

    Operates at up to 66MHz

    Supports two chip selects

    Supports master mode

    GPIO Interface

    16 GPIO pins

    Can be configured as interrupt pins

    Interrupt can select either rising edge or falling edge

    Q&A
