RICE UNIVERSITY A real-time baseband communications processor for high data rate wireless systems Sridhar Rajagopal ECE Department [email protected] Ph.D. Thesis Proposal
Dec 13, 2015
A proposed cellular base-station
[Figure: block diagram of the proposed base-station. Blocks: RF + ADC; multiuser channel estimation (training input); multiuser detection (data input); Viterbi decoding for user 1, user 2, ..., user K.]
High data rates in emerging wireless systems (Mbps)
Sophisticated algorithms for high spectral efficiency
  Multiuser estimation, multiuser detection, Viterbi
What does it take to build it?
Designing wireless systems
Traditional design - Area-Time-Power - ASICs
  high data rates, low power, minimum area
Flexibility became more important - FPGAs, DSPs
  faster algorithm evaluation and prototyping
Heterogeneous solutions - DSPs + FPGAs + ASICs
  Task-partitioning & hardware-software co-design
New challenges
  Flexibility in algorithms -- multi-standard support
  rapidly evaluate and implement new algorithms
  adapting architectures to ever increasing data rates
Emerging wireless systems
[Figure: ALUs required for real-time at 500 MHz (log scale, 10^0 to 10^3) versus number of W-CDMA cellular users (0 to 300). Curves for add and multiply under slow fading (estimation every 1000 bits), medium fading (estimation every 100 bits) and fast fading (estimation every 10 bits).]
GOPS of computations in emerging base-stations
Example real-time target: 32 users at 128 Kbps each
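As a rough worked example of what this target implies, the sketch below computes the aggregate bit rate and the number of clock cycles available per bit. It assumes the 500 MHz clock quoted on the plot for this slide also applies to the target, and that 1 Kbps means 1000 bits per second; the function name is illustrative only.

```python
def cycle_budget_per_bit(users, rate_bps, clock_hz):
    """Return (aggregate bit rate in bps, clock cycles available per bit)."""
    aggregate_bps = users * rate_bps
    return aggregate_bps, clock_hz / aggregate_bps


# 32 users at 128 Kbps each, on an assumed 500 MHz processor clock.
aggregate, budget = cycle_budget_per_bit(
    users=32, rate_bps=128_000, clock_hz=500_000_000
)
print(f"aggregate rate: {aggregate / 1e6:.3f} Mbps")
print(f"cycle budget:   {budget:.1f} cycles per bit")
```

Under these assumptions, estimation, detection and decoding for all users together have to fit in roughly 122 cycles per received bit.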
Part I - Build real-time base-stations
Current single processor DSPs not powerful enough
Algorithms well understood at data-flow level
I can design real-time architectures in VLSI
What do I lose when I make the architecture programmable? Can I quantify the loss?
Can I solve the existing bottlenecks, if any?
Build a real-time “efficient” communications processor
Part II - Robust to future updates
Data rates will increase
Decoding constraint lengths may increase
Number of users, spreading length may increase
Algorithms will change - with MORE computations
System complexity WILL increase.
I want to test new algorithms and changes quickly
I want my design to adapt quickly to those changes
Part III - Extensions
W-LAN base-station
  Different algorithms
  FFT, Viterbi, FIR - main computational blocks
  100+ Mbps data rates
Handset
  Different algorithms - simpler
  Power and area - more critical
How will my design extend to these wireless systems?
Expected thesis contributions
A programmable processor design for communications
  adapts to real-time requirements
  “efficient” in terms of #functional units, their utilization and memory stalls
Hardware-software co-design framework to rapidly evaluate new algorithms
  and rapidly implement them too.
Design limits and extensions to other wireless systems
Outline
Motivation
Thesis proposal
Background
Initial results
Work proposed for thesis completion
Comparisons with existing work
I - Efficient communications processor
Propose using an architecture simulator to design the communications processor
  high performance streaming processor simulator based on the “Imagine” architecture
Streaming processor because
  GPP architectures not good for media, wireless
  streaming processor shown to be good for media applications such as FFT and FIR.
  Media and communication algorithms similar
Media architectures popular --> wireless architectures?
I - Simulator functionality
Simulator
  cycle accurate
  allows us to investigate bottlenecks
  number and type of functional units flexible
  gives functional unit utilization
Can propose and evaluate solutions to solve bottlenecks in the implementation
II - Robust to future updates
[Figure: design methodology flowchart with steps numbered 1 to 12. Boxes: new algorithm (complexity, parallelism, fixed-point); library of existing algorithms; algorithm selection for end-to-end system; existing architecture parameters; compile; assembly code; run on existing architecture; decision "real-time/area/power satisfied?" (YES: done; NO: continue); operations count; real-time/(area/power) requirements; architecture design scaling (# functional units, # clusters); new architecture parameters; compile; assembly code; decision "fabrication feasible?" (YES: new architecture; NO: design failed, re-design algorithms/architecture).]
II - Propose this methodology
New algorithms in high level language
Easy to evaluate functionality and can use the same algorithms for actual design
If new algorithms are similar in algorithm and complexity, still real-time with no changes in architecture
If not
  An automated tool scales the architecture appropriately
  The proposed scaling algorithm for the tool targets real-time and FU utilization simultaneously while scaling.
Outline
Motivation
Thesis proposal
Background -- Architecture and Algorithms
Initial results
Work proposed for thesis completion
Comparisons with existing work
The Imagine architecture
[Figure: Imagine stream processor block diagram. Blocks: host processor; network; network interface; stream controller; stream register file; streaming memory system with four SDRAMs; microcontroller; ALU cluster 0 to ALU cluster 7.]
Figure borrowed from S. Rixner
Arithmetic clusters
VLIW control
3 adders, 2 multipliers, 1 divider
Scratch-pad and communication unit
Distributed register files
[Figure: one arithmetic cluster. Labels: local register file; functional units + + + * * /; communication unit (CU); cross point; intercluster network; from SRF; to SRF.]
Figure borrowed from S. Rixner
Stream programming
StreamC
  Executes on host
  C++
  Controls stream transfers between main memory and SRF

    void main() {
      Stream<int> a(256);
      Stream<int> b(256);
      Stream<int> c(256);
      Stream<int> d(1024);
      ...
      example1(a, b, c);
      example2(c, d);
      ...
    }

KernelC
  Executes on clusters
  C-like syntax
  Kernel computation
  Compiled by iscd

    KERNEL example1(istream<int> a, istream<int> b, ostream<int> c) {
      loop_stream(a) {
        int ai, bi, ci;
        a >> ai;
        b >> bi;
        ci = ai * 2 + bi * 3;
        c << ci;
      }
    }

Figure borrowed from S. Rixner
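To make the data flow of the kernel concrete, here is a plain Python reference of what example1 computes for each stream element. This is only a sketch of the arithmetic, not StreamC or KernelC, and the name example1_reference is hypothetical.

```python
def example1_reference(a, b):
    """Element-wise reference for the kernel body: c[i] = a[i] * 2 + b[i] * 3."""
    if len(a) != len(b):
        raise ValueError("input streams must have the same length")
    return [ai * 2 + bi * 3 for ai, bi in zip(a, b)]


# Two 256-element input streams, matching the stream sizes in main().
a = list(range(256))
b = [1] * 256
c = example1_reference(a, b)
print(c[:4])  # [3, 5, 7, 9]
```

On the stream processor the same loop body is applied by the clusters to successive stream elements, with the host code only setting up the streams and invoking the kernels.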
Parallel W-CDMA Estimation/Detection/Decoding
Multiuser estimation
  replaced matrix inversion by gradient descent
Multiuser detection
  Parallel Interference Cancellation (PIC)
  Pipelined algorithm that avoids block-based detection
Viterbi decoding
  Trellis structures suited for decoding
  Register exchange for survivor memory
  No traceback latency
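Estimation/Detection (64,32 sizes)
Multiuser Estimation
  R_{bb} = R_{bb} + b_0 b_0^T - b_L b_L^T
  R_{br} = R_{br} + b_0 r_0^H - b_L r_L^H
  A = A - \mu (A R_{bb} - R_{br})
Prepare Matrices for Detection
  L = \mathrm{Re}[A_1^H A_0]
  C = \mathrm{Re}[A_0^H A_0 + A_1^H A_1 - \mathrm{diag}(A_0^H A_0 + A_1^H A_1)]
  R = L^H
Multiuser Detection
  y_i = \mathrm{Re}[A_1^H x_{i-1} + A_0^H x_i]
  d_i = \mathrm{sign}(y_i)
  y_i = y_i - L x_{i-1} - C x_i - R x_{i+1}
  d_i = \mathrm{sign}(y_i)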
[Figure: (a) trellis and (b) shuffled trellis, each drawn between two columns of the 16 states X(0) to X(15). In the shuffled trellis one column lists the even-numbered states X(0), X(2), ..., X(14) followed by the odd-numbered states X(1), X(3), ..., X(15).]
Viterbi trellis for rate ½ code with K = 5
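A rate ½ code with K = 5 has 2^(K-1) = 16 trellis states, X(0) to X(15). The sketch below shows how those states pair up into "butterflies" (two source states feeding the same two target states), which is the regular structure that trellis re-ordering exploits. It assumes a shift-register encoder whose state holds the last K-1 input bits with the new bit shifted in at the least-significant end; this convention and the function names are assumptions, not necessarily the labelling used in the slide's figure.

```python
def next_state(state, bit, constraint_length):
    """Shift a new input bit into the low end of the (K-1)-bit state."""
    n_states = 1 << (constraint_length - 1)
    return ((state << 1) | bit) % n_states


def butterflies(constraint_length):
    """Return ((src_a, src_b), (dst_0, dst_1)) for every trellis butterfly."""
    n_states = 1 << (constraint_length - 1)
    half = n_states // 2
    result = []
    for j in range(half):
        sources = (j, j + half)        # differ only in the oldest stored bit
        targets = (2 * j, 2 * j + 1)   # differ only in the newest bit
        result.append((sources, targets))
    return result


for sources, targets in butterflies(5):
    print(f"X({sources[0]}), X({sources[1]}) -> X({targets[0]}), X({targets[1]})")
```

Each of the 8 butterflies can be processed independently, but the source and target indices differ, so some data re-arrangement between trellis stages is unavoidable.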
Outline
Motivation
Thesis proposal
Background
Initial results
Work proposed for thesis completion
Comparisons with existing work
Stream data flow
[Figure: stream data flow graph. Stages: multiuser channel estimation, multiuser detection, decoding. Inputs: estimation bits, detection bits. Kernels and operations: correlation update kernel, matrix mult kernel, iteration update kernel, matrix transpose, matrix mul L kernel, matrix mul C kernel, matched filter kernel, PIC kernel, data rearrangement, buffer, Viterbi kernel. Legend: computation, communication.]
Matrix multiplication kernel (Imagine)
32 cycle loop
Executed on all 8 clusters
Complexity
  O(N^3) multiplies
  O(N^3) adds
100% multiplier utilization in the loop
Divider is unnecessary!
[Figure: inner-loop instruction schedule across functional units ADD0, ADD1, ADD2, MUL0, MUL1, DIV0. Legend: communication (waiting for input); FU unavailable (input ready but FU busy).]
Replace divider with multiplier
22 cycle loop
Executed on all 8 clusters
97% multiplier utilization in the loop
85% adder utilization in the loop
Changing functional units
  Supported by simulator/compiler
  Architecturally realistic
[Figure: instruction schedule across functional units ADD0, ADD1, ADD2, MUL0, MUL1, MUL2.]
Definition of “efficiency”
Idle time includes time spent by functional units not doing any computations and the time spent between kernel operations.
Alternate metric: Idle time ...any USEFUL computations ....
Some computations unavoidable in a programmable architecture ( i = i +1)
Good “Efficiency” means min. memory stalls and high FU utilization
\text{Efficiency} = 1 - \frac{\text{Idle Time}}{\text{Total Time}}
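The definition reduces to a one-line computation. The sketch below assumes idle and total time are both measured in cycles; the function name and the sample numbers are made up for illustration and are not measurements from this work.

```python
def efficiency(idle_cycles, total_cycles):
    """Efficiency = 1 - idle time / total time, with both times in cycles."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    if not 0 <= idle_cycles <= total_cycles:
        raise ValueError("idle_cycles must lie between 0 and total_cycles")
    return 1.0 - idle_cycles / total_cycles


# Hypothetical run: 250 idle cycles (stalls plus unused FU slots) out of 1000.
print(efficiency(idle_cycles=250, total_cycles=1000))  # 0.75
```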
Kernel computational time

Kernel        | FU utilization (3 +, 2 *) | Execution time (cycles) | FU utilization (3 +, 3 *) | Execution time (cycles)
Corr update   | 70%, 100%                 | 1224                    | 78.6%, 78%                | 1064
Matrix mul    | 53%, 91%                  | 22720                   | 85%, 99%                  | 14360
Iteration     | 55%, 42%                  | 1058                    | 55%, 28%                  | 1058
Total         |                           |                         |                           | 14464
Mat mul L     | 59%, 91%                  | 7468                    | 78%, 84%                  | 5573
Mat mul C     | 63%, 96%                  | 12192                   | 68%, 71%                  | 11084
Total         |                           |                         |                           | 16657
Mf            | 67%, 100%                 | 366                     | 90%, 90%                  | 275
PIC           | 67%, 96%                  | 996                     | 89%, 84%                  | 760
Total         |                           |                         |                           | 1035
Viterbi K = 5 | 13%, 2%                   | 8044                    | 13%, 1.4%                 | 8044
Viterbi K = 9 | 35%, 9%                   | 80006                   | 35%, 6%                   | 80006
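Estimation and detection execution
[Figure: execution trace showing kernel execution and memory transfers against cycle count for estimation and detection (10 bits), with an annotation where execution is stalled waiting for data from memory.]
Viterbi execution
[Figure: execution trace showing kernel execution and memory transfers against cycle count for initialization and decode (32 bits).]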
Viterbi discussion
Viterbi: not enough computations for Imagine.
Significantly large communication between trellis states
If re-ordering is done outside the kernel, it shows up as memory stalls
If re-ordering is done within the kernel, it shows up as poor FU utilization
Poor “efficiency” in any respect
Unsolved bottleneck at this point
C6711 DSP comparisons

Stage      | Slow Fading     | Medium Fading   | Fast Fading
           | Imagine | DSP   | Imagine | DSP   | Imagine | DSP
Estimation | 46      | 2228  | 459     | 22288 | 4594    | 222879
Detection  | 1035    | 44077 | 1035    | 44077 | 1035    | 44077
Decoding   | 80006   | 72736 | 80006   | 72736 | 80006   | 72736
Stalls     | 104     | -     | 606     | -     | 5620    | -
48X improvement for estimation, 42X for detection over DSP simulator
Approx. additional 15X improvement over actual DSP
Viterbi is actually slower than a DSP implementation
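The quoted improvement factors can be checked directly from the slow-fading column of the table. The sketch below only divides the DSP figures by the Imagine figures; the dictionary layout and names are illustrative.

```python
# (Imagine, DSP) execution figures from the slow-fading column of the table.
slow_fading = {
    "estimation": (46, 2228),
    "detection": (1035, 44077),
    "decoding": (80006, 72736),
}

# Speedup of Imagine over the DSP simulator; values below 1 mean Imagine is slower.
speedups = {stage: dsp / imagine for stage, (imagine, dsp) in slow_fading.items()}

for stage, factor in speedups.items():
    print(f"{stage}: {factor:.2f}x")
```

Estimation and detection come out at roughly 48x and 42x, while the decoding ratio is below 1, which is the Viterbi slowdown noted above.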
Summary for proposal - Part I
Estimation, detection kernels well behaved
Stalls in estimation kernel - but normalized over detection bits.
A 3 adder, 3 multiplier per cluster configuration attained reasonably good FU utilization on most kernels.
Decoding kernel behaves extremely poorly on this architecture.
Issues in scaling architectures
Scaling parameters to meet real-time
  #FU, their type and #clusters
Increasing clusters is more area and energy efficient than increasing #FU, as the cost of intra-cluster communication and distributing instructions is high [Khailany2002]
[Khailany2002] provides exhaustive searches for the best scaling and shows effects on area, energy, delay.
  Does not target FU utilization and which FUs to change
  Does not show how and what to scale to meet real-time
“Ad hoc” algorithm for scaling architectures
Changes #FUs as well as their type
Algorithm optimizes for functional unit utilization simultaneously (not “efficiency”)
Looks at functional units that have a very high utilization
  they could be bottlenecking other functional units
  add more of them to see when their utilization decreases
  utilization of other units will increase
Look at the functional unit with the next highest utilization.
\#\,\text{New Units} = \frac{\text{max utilization}}{\text{2nd max utilization}} \times \#\,\text{Old Units}
Algorithm (contd.)
Once functional units have high utilization (or, increasing them by one does not give any more benefits), look at real-time requirements
Increase cluster sizes to meet real-time requirements
The remaining factor is to be achieved by scaling functional units
Will not work if all units have poor efficiency
Good and bad
\#\,\text{clusters increase (in powers of 2)} = \frac{\text{real time}}{\text{current execution time}}
Example - Matrix multiplication
Base: 3 adders (53%), 2 multipliers (91%)
  bottlenecked on multipliers
  multipliers = (91/53) * 2 = 3.4, i.e. 3 multipliers
New: 3 adders (85%), 3 multipliers (99%)
Could have been done automatically before, instead of looking at the flops computations
Better as it scales FUs such as inter-cluster communication units that are based on computations.
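The matrix multiplication step can be reproduced with the scaling rule from the “ad hoc” algorithm (# new units = max utilization / 2nd max utilization * # old units). The sketch below is a minimal illustration; rounding to the nearest integer is an assumption chosen because it matches the 3 multipliers in this example, and the function name is hypothetical.

```python
def scale_units(old_units, max_utilization, second_max_utilization):
    """Apply: new units = (max utilization / 2nd max utilization) * old units.

    Rounding to the nearest whole unit is an assumption, not stated in the slides.
    """
    if second_max_utilization <= 0:
        raise ValueError("second_max_utilization must be positive")
    return round(old_units * max_utilization / second_max_utilization)


# Base configuration: 2 multipliers at 91% (highest), adders at 53% (next highest).
new_multipliers = scale_units(old_units=2, max_utilization=91, second_max_utilization=53)
print(new_multipliers)  # 3
```

The unrounded value is about 3.43, so a different rounding policy (for example, always rounding up) would give 4 multipliers instead.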
Outline
Motivation
Thesis proposal
Background
Initial results
Work proposed for thesis completion
Comparisons with existing work
Work proposed for completion
Detailed comparisons to other architectures
Innovations needed to solve bottlenecks
Formalize scaling architectures and methodology
Extensions to W-LAN and handsets
Time-scales

                  | Work                                         | Time Frame | Status
Comparison points | Fixed point estimation-detection in Imagine  | August     |
                  | Integration of Imagine code                  | August     |
                  | C64x DSP comparisons                         | August     |
                  | Single cluster implementation                | August     |
                  | VLSI comparisons                             | August     |
Innovations       | Memory stalls                                | September  |
                  | Viterbi FU                                   | October    |
Scaling           | Scaling algorithm                            | November   |
Extensions        | W-LAN base-station                           | December   |
                  | Handset issues                               | January    |
Approach to complete work
Comparisons:
  Time-consuming but straightforward
Innovations to solve bottlenecks:
  At the algorithm level
    • re-designing to minimize data re-arrangements
  At the software level
    • BLAS sub-routines for blocking computations
  At the hardware level
    • Shuffle networks for Viterbi trellis
    • Instruction set extensions for Viterbi ACS
    • Permutation-based interleaved memory
Approach (Contd.)
Scaling algorithm for automated tool
  From “ad hoc” to “optimal”
Extensions to W-LAN base-station
  Imagine already shown to be good for FFT, FIR.
  Viterbi already understood well by this time
Extensions to handsets
  Power, area more critical
  Dynamic adaptation of wordlengths in FUs [Leung98]
Goals and deliverables
Goals
  real-time processor design for communications
  methodology for fast evaluation and implementation
Deliverables
  Innovations in architecture
  Comparison points with existing work
  Algorithm for automated tool to scale architectures
  Analysis of extensions to W-LAN and handset
Outline
Motivation
Thesis proposal
Background
Initial results
Work proposed for thesis completion
Comparisons with existing work
Existing work - I
Texas Instruments approach
  DSP+ASIC
  C64x: Viterbi & Turbo co-processor (2000) - 4.7 Mbps
Rack of DSPs needed for our W-CDMA base-station
  makes sense as a system with single user algorithms
  sliding correlator, matched filter, Viterbi - linear with users
  DSP per user -- processing “independent”
Does not work with multiuser algorithms
  exchange information between multiple DSPs
  communication bottlenecks
Existing work - I
Reconfigurable architectures
  Chameleon (2001), PACT (2000)
  RaPiD (2001), PipeRench (1999), Stallion (1998)
Adapt architectures dynamically with computations
Promising but (personal view)
  harder to program than DSPs
  needs detailed knowledge of hardware design
  difficult to design big systems
  not as robust to future systems as my proposed solution
Existing work - II
Embedded architecture design and automation (CAD Tools)
  PICO - VLIW (HP Labs) (April 2001)
  Aachen - Blume (July 2002)
  University of Austin - Lapinskii (August 2002)
Design for heterogeneous systems on a chip
  ASIC/DSP/FPGA system on a chip
My target: A homogeneous, adaptable and flexible design
Conclusions
Flexibility, fast evaluation and adapting architectures to ever increasing real-time requirements are new challenges in designing wireless systems, in addition to the earlier goal of area-time-power efficiency.
My proposal tries to address these goals by
  designing a programmable and “efficient” processor for communications
  providing a hardware-software co-design methodology with a fast design cycle time from evaluation to a real-time implementation.
Other research contributions
Multiuser estimation and detection algorithms
  Implementation issues
  Parallel, iterative, fixed point, pipelining
Task-partitioning estimation-detection on 2 DSPs
On-line arithmetic and its application to detection
Programmable processor design for W-LAN base-station [Nokia Research Center]
Survivor management in Viterbi decoding
Two techniques
  Traceback – commonly used
  Register exchange
Traceback is simpler
  Less area in VLSI architectures
  Drawback: Sequential and additional latency
Register exchange is faster
  Parallel updates
  Packing decoded bits in the register needs to access the entire register
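The register exchange idea can be sketched in a few lines: every state carries the decoded bits of its own survivor path, and at each step the registers are copied along the winning branches, so no traceback pass is needed at the end. The example below is a minimal hard-decision sketch under stated assumptions: it uses a small rate ½, K = 3 code with generators (7, 5) in octal purely for brevity (the kernels in this work use K = 5 and K = 9), and all names are hypothetical.

```python
K = 3                          # constraint length (illustrative only)
GENERATORS = (0b111, 0b101)    # assumed rate-1/2 generators, (7, 5) in octal
N_STATES = 1 << (K - 1)


def parity(x):
    return bin(x).count("1") & 1


def step(state, bit):
    """Shift one input bit into the encoder; return (next state, output symbol)."""
    reg = (bit << (K - 1)) | state
    symbol = tuple(parity(reg & g) for g in GENERATORS)
    return reg >> 1, symbol


def encode(bits):
    state, symbols = 0, []
    for bit in bits:
        state, symbol = step(state, bit)
        symbols.append(symbol)
    return symbols


def decode_register_exchange(symbols):
    """Hard-decision Viterbi decoding with one survivor register per state."""
    inf = float("inf")
    metrics = [0] + [inf] * (N_STATES - 1)      # encoder starts in state 0
    registers = [[] for _ in range(N_STATES)]   # decoded bits per survivor
    for received in symbols:
        new_metrics = [inf] * N_STATES
        new_registers = [[] for _ in range(N_STATES)]
        for state in range(N_STATES):
            if metrics[state] == inf:
                continue                         # state not reachable yet
            for bit in (0, 1):
                nxt, expected = step(state, bit)
                branch = sum(e != r for e, r in zip(expected, received))
                candidate = metrics[state] + branch
                if candidate < new_metrics[nxt]:
                    new_metrics[nxt] = candidate
                    # Register exchange: copy the winner's register, append the bit.
                    new_registers[nxt] = registers[state] + [bit]
        metrics, registers = new_metrics, new_registers
    best = min(range(N_STATES), key=lambda s: metrics[s])
    return registers[best]


message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
received = encode(message)
first, second = received[2]
received[2] = (first ^ 1, second)   # inject a single channel bit error
decoded = decode_register_exchange(received)
print(decoded == message)
```

The copy of a whole register on every winning branch is the cost referred to above: each update touches the entire register, in exchange for removing the sequential traceback step.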
Variable computations / Non-2^
Variable computations with fading
  Worst case design for different fading models?
If fewer computations,
  Dynamic Voltage/Frequency scaling?
Users not a power of 2 then,
  not all kernels will scale down
  dummy data?
  spare codes available that some users can use to increase data rates
    • Multiple antennas, multi-rate systems