
RICE UNIVERSITY

A real-time baseband communications processor for high data rate wireless systems

Sridhar Rajagopal, ECE Department, [email protected]

Ph.D. Thesis Proposal


A proposed cellular base-station

[Figure: receiver block diagram. An RF + ADC front end supplies data and training streams to multiuser channel estimation, multiuser detection, and Viterbi decoding, with outputs for user 1, user 2, ..., user K.]

High data rates in emerging wireless systems (Mbps)

Sophisticated algorithms for high spectral efficiency
• Multiuser estimation, multiuser detection, Viterbi decoding

What does it take to build it?


Designing wireless systems

Traditional design - Area-Time-Power - ASICs
• high data rates, low power, minimum area

Flexibility became more important - FPGAs, DSPs
• faster algorithm evaluation and prototyping

Heterogeneous solutions - DSPs + FPGAs + ASICs
• task partitioning and hardware-software co-design

New challenges
• flexibility in algorithms, multi-standard support
• rapidly evaluate and implement new algorithms
• adapting architectures to ever-increasing data rates


Emerging wireless systems

[Figure: ALUs required for real-time operation at 500 MHz (log scale, 10^0 to 10^3) versus number of W-CDMA cellular users (0 to 300). Curves: add and multiply for slow fading (estimation every 1000 bits), medium fading (estimation every 100 bits), and fast fading (estimation every 10 bits).]

GOPS of computations in emerging base-stations

Example real-time target: 32 users at 128 Kbps each


Part I - Build real-time base-stations

Current single processor DSPs not powerful enough

Algorithms well understood at data-flow level

I can design real-time architectures in VLSI

What do I lose when I make the architecture programmable? Can I quantify the loss?

Can I solve the existing bottlenecks, if any?

Build a real-time “efficient” communications processor


Part II - Robust to future updates

Data rates will increase

Decoding constraint lengths may increase

Number of users, spreading length may increase

Algorithms will change - with MORE computations

System complexity WILL increase.

I want to test new algorithms and changes quickly

I want my design to adapt quickly to those changes


Part III - Extensions

W-LAN base-station
• Different algorithms
• FFT, Viterbi, FIR - main computational blocks
• 100+ Mbps data rates

Handset
• Different algorithms - simpler
• Power and area - more critical

How will my design extend to these wireless systems?


Expected thesis contributions

A programmable processor design for communications
• adapts to real-time requirements
• “efficient” in terms of number of functional units, their utilization, and memory stalls

Hardware-software co-design framework to rapidly evaluate new algorithms, and to rapidly implement them too.

Design limits and extensions to other wireless systems


Outline

Motivation

Thesis proposal

Background

Initial results

Work proposed for thesis completion

Comparisons with existing work


I - Efficient communications processor

Propose using an architecture simulator to design the communications processor
• high performance streaming processor simulator based on the “Imagine” architecture

Streaming processor because
• GPP architectures not good for media, wireless
• streaming processor shown to be good for media applications such as FFT and FIR
• media and communication algorithms are similar

Media architectures popular --> wireless architectures?


I - Simulator functionality

Simulator
• cycle accurate
• allows us to investigate bottlenecks
• number and type of functional units flexible
• gives functional unit utilization

Can propose and evaluate solutions to solve bottlenecks in the implementation

II - Robust to future updates

[Figure: design-flow chart with numbered steps 1 to 12. A new algorithm (complexity, parallelism, fixed-point) and a library of existing algorithms feed algorithm selection for the end-to-end system. The selection is compiled with the existing architecture parameters into assembly code and run on the existing architecture. Decision: real-time/area/power satisfied? If YES, done. If NO, the operations count and the real-time/(area/power) requirements drive architecture design scaling (# functional units, # clusters), which yields new architecture parameters, and the code is compiled to assembly again. Decision: fabrication feasible? If YES, new architecture. If NO, design failed: re-design algorithms/architecture.]


II - Propose this methodology

New algorithms in high level language

Easy to evaluate functionality and can use the same algorithms for actual design

If new algorithms are similar in structure and complexity, the design is still real-time with no changes in architecture

If not
• An automated tool scales the architecture appropriately
• The proposed scaling algorithm for the tool targets real-time and FU utilization simultaneously while scaling


Outline

Motivation

Thesis proposal

Background -- Architecture and Algorithms

Initial results




The Imagine architecture

[Figure: the Imagine stream processor. Blocks shown: host processor, stream controller, network and network interface, stream register file, microcontroller, eight ALU clusters (ALU cluster 0 to ALU cluster 7), and the streaming memory system with four SDRAM banks.]

Figure borrowed from S. Rixner


Arithmetic clusters

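VLIW control

3 adders, 2 multipliers, 1 divider

Scratch-pad and communication unit

Distributed register files

[Figure: one arithmetic cluster. Adders (+), multipliers (*), a divider (/), and the communication unit (CU) each have a local register file and are joined by a cross point. Streams arrive from the SRF and return to the SRF; the CU connects to the intercluster network.]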



Stream programming

StreamC
• Executes on host
• C++
• Controls stream transfers between main memory and SRF

    void main() {
        Stream<int> a(256);
        Stream<int> b(256);
        Stream<int> c(256);
        Stream<int> d(1024);
        ...
        example1(a, b, c);
        example2(c, d);
        ...
    }

KernelC
• Executes on clusters
• C-like syntax
• Kernel computation
• Compiled by iscd

    KERNEL example1(istream<int> a, istream<int> b, ostream<int> c) {
        loop_stream(a) {
            int ai, bi, ci;
            a >> ai;
            b >> bi;
            ci = ai * 2 + bi * 3;
            c << ci;
        }
    }
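To make the kernel's behaviour concrete, here is a minimal Python sketch that mimics what `example1` computes on its streams. It is only an emulation of the data flow: Python generators stand in for the input and output streams, and `run_example1` is a hypothetical host-side driver, not part of StreamC or KernelC.

```python
from typing import Iterable, Iterator, List


def example1(a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    """Emulate the KernelC loop: read one element from each input stream,
    compute ai * 2 + bi * 3, and write the result to the output stream."""
    for ai, bi in zip(a, b):
        yield ai * 2 + bi * 3


def run_example1(a: List[int], b: List[int]) -> List[int]:
    """Hypothetical host-side driver standing in for the StreamC call
    example1(a, b, c); it collects the output stream into a list."""
    return list(example1(a, b))


if __name__ == "__main__":
    print(run_example1([1, 2, 3], [10, 20, 30]))
```

On the real architecture the loop body would be scheduled across the clusters; the sketch only shows the element-wise arithmetic.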



Parallel W-CDMA Estimation/Detection/Decoding

Multiuser estimation
• replaced matrix inversion by gradient descent

Multiuser detection
• Parallel Interference Cancellation (PIC)
• Pipelined algorithm that avoids block-based detection

Viterbi decoding
• Trellis structures suited for decoding
• Register exchange for survivor memory
• No traceback latency
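The idea of replacing a matrix inversion by gradient descent can be sketched in a few lines of plain Python. The sketch assumes the estimate A should satisfy A R_bb = R_br and uses the update A <- A - mu (A R_bb - R_br); the step size `mu`, the helper names, and the small test matrices are illustrative assumptions, not values from the proposal.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain triple-loop matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def gradient_step(a: Matrix, r_bb: Matrix, r_br: Matrix, mu: float) -> Matrix:
    """One update A <- A - mu * (A * R_bb - R_br); no inversion is needed."""
    product = matmul(a, r_bb)
    return [[a[i][j] - mu * (product[i][j] - r_br[i][j])
             for j in range(len(a[0]))] for i in range(len(a))]


def estimate(r_bb: Matrix, r_br: Matrix, mu: float, iterations: int) -> Matrix:
    """Iterate from an all-zero estimate with the same shape as R_br."""
    a = [[0.0] * len(r_bb) for _ in r_br]
    for _ in range(iterations):
        a = gradient_step(a, r_bb, r_br, mu)
    return a


if __name__ == "__main__":
    r_bb = [[2.0, 1.0], [1.0, 2.0]]       # symmetric positive definite
    a_true = [[1.0, -1.0], [2.0, 0.5]]    # hypothetical channel matrix
    r_br = matmul(a_true, r_bb)
    print(estimate(r_bb, r_br, mu=0.5, iterations=60))
```

The iteration converges only when `mu` is small enough relative to the largest eigenvalue of R_bb; each step costs matrix multiplies and adds, which is the kind of work the cluster functional units are built for.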


Estimation/Detection (64,32 sizes)

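Multiuser Estimation

$$R_{bb} = R_{bb} + b_L * b_L^T - b_0 * b_0^T$$

$$R_{br} = R_{br} + b_L * r_L^H - b_0 * r_0^H$$

$$A = A - \mu\,(A * R_{bb} - R_{br})$$

Prepare Matrices for Detection

$$L = \mathrm{Re}[A_1^H A_0]$$

$$C = \mathrm{Re}[A_0^H A_0 + A_1^H A_1 - \mathrm{diag}(A_0^H A_0 + A_1^H A_1)]$$

$$R = L^H$$

Multiuser Detection

$$y_i = \mathrm{Re}[A_0^H x_i + A_1^H x_{i-1}], \qquad d_i = \mathrm{sign}(y_i)$$

$$y_i = y_i - L x_{i-1} - C x_i - R x_{i+1}, \qquad d_i = \mathrm{sign}(y_i)$$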

[Figure: (a) trellis and (b) shuffled trellis over the 16 states X(0) to X(15). In the shuffled trellis one column of states is ordered X(0), X(2), ..., X(14), X(1), X(3), ..., X(15).]

Viterbi trellis for rate ½ code with K = 5
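A short Python sketch can show why the shuffled ordering helps. It assumes a shift-register trellis in which state p moves to states 2p mod 16 and (2p + 1) mod 16, a common convention that the slide does not spell out, and it reorders the states as even indices followed by odd indices. The function names are hypothetical.

```python
NUM_STATES = 16  # 2**(K - 1) states for constraint length K = 5


def successors(state: int) -> tuple:
    """Next states under the assumed shift-register convention."""
    return ((2 * state) % NUM_STATES, (2 * state + 1) % NUM_STATES)


def shuffled_order() -> list:
    """State order of the shuffled column: evens first, then odds."""
    return list(range(0, NUM_STATES, 2)) + list(range(1, NUM_STATES, 2))


def shuffled_position(state: int) -> int:
    """Row at which a state sits in the shuffled column."""
    return shuffled_order().index(state)


if __name__ == "__main__":
    half = NUM_STATES // 2
    for p in range(half):
        targets = successors(p)
        rows = [shuffled_position(s) for s in targets]
        print(f"states {p} and {p + half} -> states {targets} at rows {rows}")
```

Under these assumptions, the pair of states p and p + 8 feeds a pair of states that lands on rows p and p + 8 of the shuffled column, so each butterfly reads and writes the same two rows.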


Outline

Motivation

Thesis proposal

Background

Initial results




Stream data flow

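[Figure: stream data flow through the kernels, with computation and communication marked separately. Stage labels: multiuser channel estimation, multiuser detection, decoding. Kernels shown: matrix transpose, correlation update kernel, matrix mult kernel, iteration update kernel, matrix mul L kernel, matrix mul C kernel, matched filter kernel, PIC kernel, Viterbi kernel. Other blocks: data rearrangement, buffer. Inputs: estimation bits, detection bits.]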


Matrix multiplication kernel (Imagine)

32 cycle loop

Executed on all 8 clusters

Complexity
• O(N^3) multiplies
• O(N^3) adds

100% multiplier utilization in the loop

Divider is unnecessary!

[Figure: inner-loop schedule with one column per functional unit (ADD0, ADD1, ADD2, MUL0, MUL1, DIV0). Legend: instruction; communication (waiting for input); FU unavailable (input ready but FU busy).]


Replace divider with multiplier

22 cycle loop

Executed on all 8 clusters

97% multiplier utilization in the loop

85% adder utilization in the loop

Changing functional units
• Supported by simulator/compiler
• Architecturally realistic

[Figure: inner-loop schedule with one column per functional unit (ADD0, ADD1, ADD2, MUL0, MUL1, MUL2).]


Definition of “efficiency”

Idle time includes time spent by functional units not doing any computations and the time spent between kernel operations.

Alternate metric: Idle time ...any USEFUL computations ....

Some computations unavoidable in a programmable architecture ( i = i +1)

Good “Efficiency” means min. memory stalls and high FU utilization

$$\text{Efficiency} = 1 - \frac{\text{Idle Time}}{\text{Total Time}}$$
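As a small illustration of the metric, efficiency is one minus the ratio of idle time to total time. The Python sketch below encodes that definition; the function name and the input checks are assumptions added for illustration, and the cycle counts in the usage line are made up.

```python
def efficiency(idle_time: float, total_time: float) -> float:
    """Efficiency = 1 - idle_time / total_time.

    idle_time covers functional units doing no computation plus the time
    spent between kernel operations, in the same unit as total_time.
    """
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    if not 0 <= idle_time <= total_time:
        raise ValueError("idle_time must lie between 0 and total_time")
    return 1.0 - idle_time / total_time


if __name__ == "__main__":
    # Hypothetical numbers: 25 idle cycles out of 100 total cycles.
    print(efficiency(25, 100))
```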


Kernel computational time

Functional unit utilization is listed as adders, multipliers.

Kernel          FU utilization   Execution time   FU utilization   Execution time
                (3 +, 2 *)       (cycles)         (3 +, 3 *)       (cycles)
Corr update     70%, 100%        1224             78.6%, 78%       1064
Matrix mul      53%, 91%         22720            85%, 99%         14360
Iteration       55%, 42%         1058             55%, 28%         1058
Total                                                              14464
Mat mul L       59%, 91%         7468             78%, 84%         5573
Mat mul C       63%, 96%         12192            68%, 71%         11084
Total                                                              16657
Mf              67%, 100%        366              90%, 90%         275
PIC             67%, 96%         996              89%, 84%         760
Total                                                              1035
Viterbi K = 5   13%, 2%          8044             13%, 1.4%        8044
Viterbi K = 9   35%, 9%          80006            35%, 6%          80006


Estimation and detection execution

[Figure: execution timeline by cycle, with kernel execution and memory transfer columns, covering estimation and detection (10 bits). Annotation: stalled waiting for data from memory.]


Viterbi execution

[Figure: execution timeline by cycle, with kernel execution and memory transfer columns, covering initialization and decode (32 bits).]


Viterbi discussion

Viterbi: not enough computations for Imagine.

Large amount of communication between trellis states

If re-ordering is done outside the kernel, it shows up as memory stalls

If re-ordering is done within the kernel, it shows up as poor FU utilization

Poor “efficiency” either way

Unsolved bottleneck at this point


C6711 DSP comparisons

              Slow Fading         Medium Fading       Fast Fading
Stage         Imagine   DSP       Imagine   DSP       Imagine   DSP
Estimation    46        2228      459       22288     4594      222879
Detection     1035      44077     1035      44077     1035      44077
Decoding      80006     72736     80006     72736     80006     72736
Stalls        104       -         606       -         5620      -

48X improvement for estimation, 42X for detection over DSP simulator

Approx. additional 15X improvement over actual DSP

Viterbi is actually slower than a DSP implementation
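The quoted improvement factors can be checked directly from the slow-fading column of the comparison table. The Python sketch below divides the DSP count by the Imagine count for each stage; the dictionary layout and function name are illustrative choices, and only the numbers come from the table.

```python
# Slow-fading counts copied from the comparison table: (Imagine, DSP).
SLOW_FADING = {
    "estimation": (46, 2228),
    "detection": (1035, 44077),
    "decoding": (80006, 72736),
}


def speedup(imagine_count: int, dsp_count: int) -> float:
    """DSP count divided by Imagine count; below 1 means Imagine is slower."""
    return dsp_count / imagine_count


if __name__ == "__main__":
    for stage, (imagine, dsp) in SLOW_FADING.items():
        print(f"{stage}: {speedup(imagine, dsp):.1f}x")
```

The ratios come out at roughly 48 for estimation and 42 for detection, and below 1 for decoding, which matches the three statements above.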


Summary for proposal - Part I

Estimation, detection kernels well behaved

Stalls in estimation kernel - but normalized over detection bits.

A configuration with 3 adders and 3 multipliers per cluster attained reasonably good FU utilization on most kernels.

Decoding kernel behaves extremely poorly on this architecture.


Issues in scaling architectures

Scaling parameters to meet real-time
• #FUs, their type, and #clusters

Increasing clusters is more area and energy efficient than increasing #FUs, because intra-cluster communication and instruction distribution costs are high [Khailany2002]

[Khailany2002] provide exhaustive searches for the best scaling and show the effects on area, energy, and delay.
• Do not target FU utilization or which FUs to change
• Do not show how and what to scale to meet real-time


“Ad hoc” algorithm for scaling architectures

Changes #FUs as well as their type

Algorithm optimizes for functional unit utilization simultaneously (not “efficiency”)

• looks at functional units that have a very high utilization
• they could be bottlenecking other functional units
• add more of them to see when their utilization decreases
• utilization of other units will increase
• then look at the functional unit with the next highest utilization

$$\#\,\text{New Units} = \frac{\text{max utilization}}{2^{\text{nd}}\ \text{max utilization}} \times \#\,\text{Old Units}$$


Algorithm (contd.)

Once functional units have high utilization (or, increasing them by one does not give any more benefits), look at real-time requirements

Increase cluster sizes to meet real-time requirements

The remaining factor is to be achieved by scaling functional units

Will not work if all units have poor efficiency

Good and bad

$$\#\,\text{clusters increase (in powers of 2)} = \frac{\text{real time}}{\text{current execution time}}$$


Example - Matrix multiplication

Base: 3 adders (53%), 2 multipliers (91%)
• bottlenecked on multipliers
• multipliers = (91/53) * 2 = 3.4, rounded to 3

New: 3 adders (85%), 3 multipliers (99%)

Could have been done automatically, instead of by looking at the flop counts as before

Better as it scales FUs such as inter-cluster communication units that are based on computations.
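The functional-unit scaling rule used in this example can be written as a short Python sketch: scale the busiest unit type by the ratio of the highest to the second-highest utilization. Rounding to the nearest integer and never shrinking the unit count are assumptions, chosen so that the matrix multiplication numbers reproduce the 3 multipliers reported here; the function name is hypothetical.

```python
def scaled_units(old_units: int, max_util: float, second_max_util: float) -> int:
    """New count for the most utilized functional unit type.

    Implements new = (max_util / second_max_util) * old_units, rounded to
    the nearest integer (an assumption) and never below old_units.
    """
    if second_max_util <= 0:
        raise ValueError("second_max_util must be positive")
    scaled = old_units * max_util / second_max_util
    return max(old_units, round(scaled))


if __name__ == "__main__":
    # Matrix multiplication kernel: 2 multipliers at 91%, adders at 53%.
    print(scaled_units(2, 91, 53))
```

A full tool would re-run the simulator after each change and stop once adding a unit no longer helps; that loop is left out of this sketch.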


Outline

Motivation

Thesis proposal

Background

Initial results




Work proposed for completion

Detailed comparisons to other architectures

Innovations needed to solve bottlenecks

Formalize scaling architectures and methodology

Extensions to W-LAN and handsets


Time-scales

Work                                                               Time Frame   Status
Comparison points   Fixed point estimation-detection in Imagine    August
                    Integration of Imagine code                    August
                    C64x DSP comparisons                           August
                    Single cluster implementation                  August
                    VLSI comparisons                               August
Innovations         Memory stalls                                  September
                    Viterbi FU                                     October
Scaling             Scaling algorithm                              November
Extensions          W-LAN base-station                             December
                    Handset issues                                 January


Approach to complete work

Comparisons:
• Time-consuming but straightforward

Innovations to solve bottlenecks:
At the algorithm level
• re-designing to minimize data re-arrangements
At the software level
• BLAS sub-routines for blocking computations
At the hardware level
• Shuffle networks for Viterbi trellis
• Instruction set extensions for Viterbi ACS
• Permutation-based interleaved memory


Approach (Contd.)

Scaling algorithm for automated tool
• From “ad hoc” to “optimal”

Extensions to W-LAN base-station
• Imagine already shown to be good for FFT, FIR
• Viterbi already understood well by this time

Extensions to handsets
• Power, area more critical
• Dynamic adaptation of wordlengths in FUs [Leung98]


Goals and deliverables

Goals
• real-time processor design for communications
• methodology for fast evaluation and implementation

Deliverables
• Innovations in architecture
• Comparison points with existing work
• Algorithm for automated tool to scale architectures
• Analysis of extensions to W-LAN and handset


Outline

Motivation

Thesis proposal

Background

Initial results




Existing work - I

Texas Instruments approach
• DSP + ASIC
• C64x: Viterbi & Turbo co-processor (2000) - 4.7 Mbps

Rack of DSPs needed for our W-CDMA base-station
• makes sense as a system with single user algorithms
• sliding correlator, matched filter, Viterbi - linear with users
• DSP per user -- processing “independent”

Does not work with multiuser algorithms
• exchange information between multiple DSPs
• communication bottlenecks


Existing work - I

Reconfigurable architectures
• Chameleon (2001), PACT (2000)
• RaPiD (2001), PipeRench (1999), Stallion (1998)

Adapt architectures dynamically with computations

Promising but (personal view)
• harder to program than DSPs
• needs detailed knowledge of hardware design
• difficult to design big systems
• not as robust to future systems as my proposed solution


Existing work - II

Embedded architecture design and automation (CAD tools)
• PICO - VLIW (HP Labs) (April 2001)
• Aachen - Blume (July 2002)
• University of Austin - Lapinskii (August 2002)

Design for heterogeneous systems on a chip
• ASIC/DSP/FPGA system on a chip

My target: A homogeneous, adaptable and flexible design


Conclusions

Flexibility, fast evaluation, and adapting architectures to ever-increasing real-time requirements are new challenges in designing wireless systems, in addition to the earlier goal of area-time-power efficiency.

My proposal tries to address these goals by
• designing a programmable and “efficient” processor for communications
• providing a hardware-software co-design methodology with a fast design cycle from evaluation to a real-time implementation


Other research contributions

Multiuser estimation and detection algorithms
• Implementation issues
• Parallel, iterative, fixed point, pipelining

Task-partitioning estimation-detection on 2 DSPs

On-line arithmetic and its application to detection

Programmable processor design for W-LAN base-station [Nokia Research Center]


Survivor management in Viterbi decoding

Two techniques
• Traceback – commonly used
• Register exchange

Traceback is simpler
• Less area in VLSI architectures
• Drawback: sequential and additional latency

Register exchange is faster
• Parallel updates
• Packing decoded bits in the register needs to access the entire register
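Register exchange can be sketched in Python for a small code. The example below assumes a rate 1/2 code with constraint length K = 3 and generators (7, 5) in octal, chosen only to keep the trellis small, together with hard-decision Hamming metrics; none of these choices come from the proposal. Each state carries its own survivor register, which is copied from the winning predecessor and extended by one bit, so no traceback pass is needed.

```python
from typing import List, Tuple

NUM_STATES = 4               # 2**(K - 1) states for K = 3 (assumed code)
GENERATORS = (0b111, 0b101)  # assumed (7, 5) octal generator pair


def parity(x: int) -> int:
    """Parity (XOR) of the bits of x."""
    return bin(x).count("1") & 1


def branch(state: int, bit: int) -> Tuple[int, Tuple[int, ...]]:
    """Return (next_state, coded_output_bits) for one trellis branch."""
    reg = (bit << 2) | state
    out = tuple(parity(reg & g) for g in GENERATORS)
    return reg >> 1, out


def encode(bits: List[int]) -> List[int]:
    """Convolutionally encode the bits, starting from state 0."""
    state, coded = 0, []
    for bit in bits:
        state, out = branch(state, bit)
        coded.extend(out)
    return coded


def decode(received: List[int]) -> List[int]:
    """Viterbi decoding with register exchange for survivor management."""
    inf = float("inf")
    metrics = [0.0] + [inf] * (NUM_STATES - 1)
    registers: List[List[int]] = [[] for _ in range(NUM_STATES)]
    for t in range(0, len(received), 2):
        pair = received[t:t + 2]
        new_metrics = [inf] * NUM_STATES
        new_registers: List[List[int]] = [[] for _ in range(NUM_STATES)]
        for state in range(NUM_STATES):
            if metrics[state] == inf:
                continue  # state not reachable yet
            for bit in (0, 1):
                nxt, expected = branch(state, bit)
                cost = metrics[state] + sum(e != r for e, r in zip(expected, pair))
                if cost < new_metrics[nxt]:
                    new_metrics[nxt] = cost
                    # Register exchange: copy the survivor, append the new bit.
                    new_registers[nxt] = registers[state] + [bit]
        metrics, registers = new_metrics, new_registers
    best = min(range(NUM_STATES), key=metrics.__getitem__)
    return registers[best]


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
    print(decode(encode(message)) == message)
```

The copy of a whole register per state per step is the cost mentioned above: every update touches the entire register, which is cheap in parallel hardware but needs wide data movement on a programmable machine.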


Variable computations / Non-2^

Variable computations with fading
• Worst case design for different fading models?
• If fewer computations, dynamic voltage/frequency scaling?

If the number of users is not a power of 2, then
• not all kernels will scale down
• dummy data?
• spare codes available that some users can use to increase data rates
• multiple antennas, multi-rate systems
