The Scalable Communications Core: A Multi-Core Wireless
Baseband Prototype
Dr. Anthony (Tony) Chun, DSP Architect
Wireless Communications Lab, Corporate Technology Group
• Split VLIW microcode
– Long Configuration Words
– Long Address Words
• Address Generators
• Stream programming model
IEEE Signal Processing Society
Reed-Solomon Decoding
• Maximum throughput
– 84.2 Mbps ATSC
– 105.8 Mbps DVB-H (PHY)
– 22.9 Mbps DVB-H (MPE)
• Up to 4 resident configurations
– $GF(2^m)$; $m \le 8$
– $T \le 32$
– $g(x) = (x+1)(x+\alpha)\cdots(x+\alpha^{2T-1})$
– $p(x) = c_0 x^m + c_1 x^{m-1} + \cdots + c_{m-1} x + 1$
• Up to 4 simultaneous streams
• Example supported standards:
– ATSC
– DVB-H
– 802.16d/e
– ITU-T J.83
• Integrated clock gating
• Fine grained power management
[Figure: Reed-Solomon decoder block diagram. Data arrives from the mesh and passes through Input DMA & Codeword Reassembly, the Syndrome Calculator (Horner's Rule), the Key Equation Solver (Berlekamp-Massey Algorithm), the Error Locator & Evaluator (Chien Search & Forney Algorithm), and Error Correction & Output DMA before returning to the mesh. Supporting blocks: Header Table RAM, Code Profile Registers, Context RAM, two 1/x LUTs, multiple Codeword RAMs, a Switch Matrix, and an OCP Slave Socket for control.]
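The Syndrome Calculator stage can be illustrated with a short software sketch. It evaluates the received word at the roots of g(x) using Horner's rule, one multiply and one XOR per symbol. The field polynomial 0x11D, the choice alpha = 2, and all function names are assumptions made for illustration; the slide states only that p(x) is configurable and that the roots start at alpha^0.

```python
PRIM_POLY = 0x11D  # assumed field polynomial: x^8 + x^4 + x^3 + x^2 + 1


def gf_mul(a, b, prim=PRIM_POLY, m=8):
    """Multiply two GF(2^m) elements by shift-and-add, reducing modulo prim."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << m):
            a ^= prim
    return result


def gf_pow(a, n):
    """Raise a to the n-th power by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def generator_poly(t, alpha=2):
    """Coefficients (highest degree first) of g(x) = (x+1)(x+alpha)...(x+alpha^(2t-1))."""
    g = [1]
    for j in range(2 * t):
        root = gf_pow(alpha, j)
        nxt = g + [0]                      # g(x) * x
        for i, c in enumerate(g):
            nxt[i + 1] ^= gf_mul(c, root)  # plus root * g(x)
        g = nxt
    return g


def syndromes(received, t, alpha=2):
    """Return S_0..S_(2t-1), where S_j = r(alpha^j), evaluated with Horner's rule."""
    out = []
    for j in range(2 * t):
        root = gf_pow(alpha, j)
        acc = 0
        for sym in received:               # symbols ordered highest degree first
            acc = gf_mul(acc, root) ^ sym
        out.append(acc)
    return out
```

A valid codeword is a multiple of g(x), so all of its syndromes are zero; any nonzero syndrome signals that the later stages (key equation solver, Chien search, Forney) have work to do.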
Opportunity: Radio Composition using Shared Resources
• Smaller: reduce redundancy by sharing resources
• More Energy Efficient: reduced redundancy equates to lower leakage
• Scalable: can easily add new processing elements to cover emerging standards
• Wider Roaming: can compose radios on-the-fly based on signals detected in the air
• Improved Coexistence: wider array of future interference mitigation and coordination options
• Potential Time to Market Reduction: future drag and drop methodology for building a multi-radio baseband processor using well characterized processing elements on a flexible and scalable interconnect
Dataflow & Resource Sharing: WiFi vs. Mobile WiMAX TX Case
[Figure: transmit dataflows for Mobile WiMAX and WiFi, with the shared resources highlighted.]
Dataflow & Resource Sharing: Fixed WiMAX vs. DVB RX Case
[Figure: receive dataflows for Fixed WiMAX and DVB, with the shared resources highlighted.]
Distributed Memory
[Figure: two "Memory Bandwidth" charts, in panels labeled DSP + FEC and DSP alone. Each plots Accesses per Second for 802.11n, 802.16e, and DVB, with Cumulative and Single Stream series. One chart's axis runs from 0 to 1.4E+11, the other's from 0 to 7E+09.]
[Figure: two "Number of Ports vs. Clock Frequency" charts, in the same two panels. Each plots the Number of Ports Required at clock frequencies of 125, 250, and 500 MHz for DVB, 802.16e, and 802.11n. One chart's axis runs from 0 to 1000, the other's from 0 to 60.]
Shared memory is not practical; distributed memory is required for bandwidth.
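The relationship between the bandwidth charts and the port-count charts can be sketched as a simple division, assuming each memory port serves one access per clock cycle. That assumption, the function name, and the demand figure below are illustrative only; the demand value is not read from the charts.

```python
def ports_required(accesses_per_second, clock_hz):
    """Ports needed if each port serves one access per cycle (ceiling division)."""
    return -(-accesses_per_second // clock_hz)


# Hypothetical aggregate demand, chosen only to show the scaling with clock rate.
demand = 120_000_000_000  # accesses per second

for mhz in (125, 250, 500):
    print(mhz, "MHz:", ports_required(demand, mhz * 1_000_000), "ports")
```

Doubling the clock halves the port count, but at demand of this order the count stays far beyond what a single shared memory could offer, which is the argument for distributing memory among the processing elements.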
Interconnect Considerations
[Figure: "Power vs. Flexibility" chart. Power Metric (0 to 2.5E+11) plotted against Flexibility Metric (0 to 300) for four interconnects: NoC, Sparse OCP Matrix, Split OCP Matrix, and Full OCP Matrix.]
[Figure: four interconnect topologies joining the same processing elements (DPE0, DPE1, DFE0, DFE1, DFE2, HSV, ILV, RSE, CC, RSD, LPV, TD, MAC0, MAC1, MAC2): Full Matrix (shared bus), Split Matrix (segmented bus), Sparse Matrix, and 3-ary 2-cube NoC.]
NoC provides lowest power with maximum flexibility.
NoC Issues
• Latency: caused by multiple streams contending for a shared interconnect
• Jitter: caused by time division multiplexing with variations in workload
Using Fragmentation to Constrain Latency
[Figure: a single long packet compared with the same data sent as many small fragments.]
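The effect of fragmentation on latency can be sketched as follows: a stream that finds a link busy waits for at most one transfer unit, so capping the unit size caps the wait. The function names, the fragment tuple layout, and the link rate used in the example are assumptions for illustration, not details taken from the slides.

```python
def fragment(payload: bytes, max_fragment: int):
    """Split payload into (index, is_last, chunk) tuples of at most max_fragment bytes."""
    if max_fragment <= 0:
        raise ValueError("max_fragment must be positive")
    chunks = [payload[i:i + max_fragment]
              for i in range(0, len(payload), max_fragment)] or [b""]
    return [(i, i == len(chunks) - 1, c) for i, c in enumerate(chunks)]


def reassemble(fragments):
    """Rebuild the payload from fragments, ordering them by index."""
    return b"".join(chunk for _, _, chunk in sorted(fragments, key=lambda f: f[0]))


def worst_case_wait_us(blocking_bytes, link_mbps):
    """Time to drain a unit already on the link: bits / (Mbit/s) gives microseconds."""
    return blocking_bytes * 8 / link_mbps
```

On a hypothetical 1000 Mbps link, a 4096-byte packet can hold the link for about 32.8 µs, while a 64-byte fragment holds it for about 0.5 µs.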
Using Time Division Multiplexing to Share Interconnect Segments
[Figure: DSP blocks for transfer are split into fragments, multiplexed onto a shared interconnect segment, and demultiplexed back into DSP blocks at the destination.]
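A minimal sketch of the multiplex and demultiplex steps, assuming a plain round-robin slot assignment and fragments tagged with a stream ID. The slides do not state the scheduling policy, so the round-robin order and the function names are assumptions.

```python
from collections import defaultdict, deque


def multiplex(streams):
    """Interleave fragments onto one link, one fragment per stream per round."""
    queues = {sid: deque(frags) for sid, frags in streams.items()}
    link = []
    while any(queues.values()):
        for sid, queue in queues.items():
            if queue:
                link.append((sid, queue.popleft()))
    return link


def demultiplex(link):
    """Regroup fragments by stream ID, preserving their order within each stream."""
    out = defaultdict(list)
    for sid, frag in link:
        out[sid].append(frag)
    return dict(out)
```

Because each stream's fragments keep their relative order on the link, the receiver needs only the stream ID to rebuild every block.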
Using Timestamps to Constrain Jitter
[Figure: packets 0 to 5 arrive with jitter at a router with north, south, east, and west ports. The functions f(x, y ... z) applied to them complete with jitter. Each output is held until the time reference equals its timestamp (t0 to t5); the comparison selects between holding (false) and output (true), so output transmission is precisely timed.]
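The timestamp comparison can be sketched as a small tick-based simulation: results become ready at irregular times, are held, and are released only on the tick where the time reference equals their timestamp. The function name, the integer tick model, and the assumption that timestamps are unique are illustrative choices, not details from the slides.

```python
def run_output_stage(results, end_time):
    """Release each result when the time reference equals its timestamp.

    results: list of (ready_time, timestamp, data) tuples with unique timestamps.
    A result that becomes ready after its timestamp is never released.
    """
    pending = sorted(results)   # ordered by ready time
    held = {}                   # timestamp -> data waiting for its slot
    sent = []
    i = 0
    for now in range(end_time + 1):
        # Accept everything that has finished processing by this tick.
        while i < len(pending) and pending[i][0] <= now:
            _, stamp, data = pending[i]
            held[stamp] = data
            i += 1
        # Comparator: time reference == timestamp, so transmit.
        if now in held:
            sent.append((now, held.pop(now)))
    return sent
```

With ready times of 0, 3, 4, and 7 and timestamps of 2, 4, 6, and 8, the outputs leave at evenly spaced ticks even though the inputs were irregular.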
Data Driven Processing: Using a System of Tags to form Linked Lists
• Stream ID references a context for multi-stream processing
• Function ID references function parameters
• Output header contains route to next PE, FID, & SID
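The tag scheme can be sketched as a chain of table lookups. The Function ID (FID) selects what a processing element does, the Stream ID (SID) selects per-stream state, and the output header names the next hop, so the headers together form a linked list through the PEs. The class layout, storing the next hop alongside the function entry, and using a PE name in place of a real NoC route are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Header:
    route: str  # name of the next PE, standing in for a NoC route
    fid: int    # Function ID
    sid: int    # Stream ID


@dataclass
class PE:
    name: str
    functions: dict                               # fid -> (callable, (route, next_fid) or None)
    contexts: dict = field(default_factory=dict)  # sid -> per-stream state

    def process(self, header, payload):
        func, next_hop = self.functions[header.fid]
        ctx = self.contexts.setdefault(header.sid, {"count": 0})
        ctx["count"] += 1                         # per-stream context update
        out = func(payload)
        if next_hop is None:                      # end of the linked list
            return None, out
        route, next_fid = next_hop
        return Header(route, next_fid, header.sid), out


def run_chain(pes, header, payload):
    """Follow output headers from PE to PE until a PE emits no header."""
    visited = []
    while header is not None:
        pe = pes[header.route]
        visited.append(pe.name)
        header, payload = pe.process(header, payload)
    return visited, payload
```

No central controller sequences the work in this sketch: each packet carries enough information to find its next processing step.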
NoC Performance Requirements

Worst Case NoC Throughput (RX coded soft-bits @ 8 bits/soft-bit):

Protocol                  Throughput (Mbps)
802.11n                   1248
802.16e                   336
DVB                       314
per channel (aggregate)   1898

Worst Case NoC Latency (802.11n SIFS timing budget):

Budget                    Latency (µs)
802.11n SIFS              16.0
PHY Budget                10.0
MAC Budget                6.0
PE Budget                 5.8
NoC Budget                4.2
per channel (7 hops)      0.6
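The figures on this slide are internally consistent, which the following sketch checks. It reads the aggregate throughput as the sum of the three protocols, the PHY budget as SIFS minus the MAC budget, and the NoC budget as the PHY budget minus the PE budget, split evenly across 7 hops. That reading is an inference from the numbers, not something the slide states. Latencies are held in nanoseconds so the arithmetic stays exact.

```python
throughput_mbps = {"802.11n": 1248, "802.16e": 336, "DVB": 314}
aggregate_mbps = sum(throughput_mbps.values())

sifs_ns = 16_000   # 802.11n SIFS
mac_ns = 6_000     # MAC budget
pe_ns = 5_800      # PE budget
hops = 7

phy_ns = sifs_ns - mac_ns      # what remains for the PHY
noc_ns = phy_ns - pe_ns        # what remains for the interconnect
per_hop_ns = noc_ns // hops    # budget for each of the 7 hops

print(aggregate_mbps, "Mbps aggregate")
print(phy_ns / 1000, noc_ns / 1000, per_hop_ns / 1000, "µs (PHY, NoC, per hop)")
```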
Dimension Order Minimal Routing Satisfies Throughput Requirement
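For reference, dimension order minimal routing on a 3-ary 2-cube (a 3 x 3 torus) can be sketched as below: a packet first corrects its X coordinate, then its Y coordinate, taking the shorter way around each ring. The coordinate scheme, tie-breaking rule, and function names are assumptions; the slides do not give the router's implementation.

```python
def ring_step(src, dst, k):
    """Direction (+1, -1, or 0) of the shorter way from src to dst on a k-node ring."""
    if src == dst:
        return 0
    forward = (dst - src) % k
    return 1 if forward <= k - forward else -1


def dimension_order_route(src, dst, k=3):
    """List of (x, y) nodes visited, resolving X fully before Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x = (x + ring_step(x, dst[0], k)) % k
        path.append((x, y))
    while y != dst[1]:
        y = (y + ring_step(y, dst[1], k)) % k
        path.append((x, y))
    return path
```

With k = 3 every ring has a wraparound link, so any node is at most one hop away in each dimension and no route in this sketch exceeds two router-to-router hops.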
Latency is Constrained by Packet Size Not by Choice of Routing Algorithm
Agenda
• Introduction
• Motivation
• Architecture
• Programming
• Test Chip
• Implementation Examples
• Learnings
• Summary
Programming Technology Challenges
• Vision: program the architecture as if it were a single DSP
– We are not there yet
• Programming of heterogeneous accelerators
– Degree of programmability varies, e.g. the DPE is more programmable than the Viterbi decoder