
Efficient VLSI architectures for baseband signal processing in wireless base-station receivers

Sridhar Rajagopal, Srikrishna Bhashyam, Joseph R. Cavallaro, and Behnaam Aazhang

This work is supported by Nokia, TI, TATP and NSF

Motivation

Computationally complex algorithms for base-stations

– multiple users, high data rates

– matrix inversions, floating point accuracy needed

– DSP solutions infeasible for real-time [S. Das’99]

Real-time implementations for baseband receiver?

– multiuser channel estimation

*S. Das et al., “Arithmetic Acceleration Techniques for Wireless Base-station Receivers”, Asilomar 1999

Contributions

New estimation scheme

– designed from an implementation perspective

– bit-streaming, fixed-point architecture

– reduced complexity, same error rate performance

Real-time architecture design

– exploit bit-level parallelism

– area-constrained, time-constrained

– real-time with minimum area

Baseband signal processing

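[Figure: baseband processing in the base-station receiver. Signals from multiple users reach the antenna; multiuser channel estimation (training and tracking) feeds multiuser detection, followed by decoding to recover the information bits.]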

Channel estimation

[Figure: User 1 and User 2 reach the base station over direct and reflected paths, with noise and multiple-access interference (MAI).]

Estimates unknown fading amplitudes and asynchronous delays.

Need for multiuser channel estimation

Detector performance depends on estimation accuracy

Best estimator : Maximum Likelihood

=> jointly estimate parameters for all users

=> Multiuser channel estimation

Single-user sliding correlator used for implementation

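Multiuser channel estimation algorithm

– $b_i \in \{-1, 1\}^{2K}$: training/tracking bits

– $r_i \in \mathbb{C}^{N}$: received signal

– $N$: spreading gain (typically fixed, e.g., 32)

– $K$: number of users (variable, $\le N$)

– $A \in \mathbb{C}^{2K \times N}$: maximum likelihood channel estimate

$$R_{bb} = \sum_{i=1}^{L} b_i b_i^T \in \mathbb{R}^{2K \times 2K}, \qquad R_{br} = \sum_{i=1}^{L} b_i r_i^H \in \mathbb{C}^{2K \times N}$$

$$R_{bb} A = R_{br}$$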
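The batch ML estimate can be sketched by accumulating $R_{bb}$ and $R_{br}$ over the training bits and then solving $R_{bb} A = R_{br}$. This is a minimal, hypothetical illustration. The function names, the tiny sizes, and the noise-free synthetic model $r_i = A^H b_i$ are assumptions, chosen only so that $R_{bb} A = R_{br}$ holds exactly. Delays, multipath, noise and MAI are not modeled. The elimination on the $2K \times (2K+N)$ augmented system is where the original estimator's O(K³ + K²N) cost comes from.

```python
"""Batch ML channel estimate: solve R_bb A = R_br (illustrative sketch)."""
import random


def correlations(bits, samples):
    """Accumulate R_bb = sum b_i b_i^T and R_br = sum b_i r_i^H.

    bits:    L vectors of length 2K with entries in {-1, +1}
    samples: L received vectors of length N (complex)
    """
    two_k, n = len(bits[0]), len(samples[0])
    r_bb = [[0] * two_k for _ in range(two_k)]
    r_br = [[0j] * n for _ in range(two_k)]
    for b, r in zip(bits, samples):
        for p in range(two_k):
            for q in range(two_k):
                r_bb[p][q] += b[p] * b[q]
            for q in range(n):
                r_br[p][q] += b[p] * r[q].conjugate()
    return r_bb, r_br


def solve(m, rhs):
    """Gauss-Jordan elimination with partial pivoting: return X with m X = rhs."""
    size, cols = len(m), len(rhs[0])
    aug = [[complex(x) for x in m[i]] + list(rhs[i]) for i in range(size)]
    for c in range(size):
        piv = max(range(c, size), key=lambda i: abs(aug[i][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for i in range(size):
            if i != c:
                f = aug[i][c] / aug[c][c]
                aug[i] = [x - f * y for x, y in zip(aug[i], aug[c])]
    return [[aug[i][size + j] / aug[i][i] for j in range(cols)] for i in range(size)]


# Hypothetical tiny example (the slides use K = N = 32).
random.seed(1)
K, N, L = 2, 3, 40
a_true = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
          for _ in range(2 * K)]
bits = [[random.choice((-1, 1)) for _ in range(2 * K)] for _ in range(L)]
# Noise-free synthetic samples r_i = A^H b_i, so that R_bb A = R_br exactly.
samples = [[sum(a_true[p][q].conjugate() * b[p] for p in range(2 * K))
            for q in range(N)] for b in bits]

r_bb, r_br = correlations(bits, samples)
a_hat = solve(r_bb, r_br)
max_err = max(abs(a_hat[p][q] - a_true[p][q])
              for p in range(2 * K) for q in range(N))
print(f"max |A_hat - A| = {max_err:.1e}")
```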

Outline

Background

Channel Estimation - An implementation perspective

VLSI architectures

– Area-constrained, Time-constrained, Area-Time efficient

DSP Comparisons and Conclusions

Iterative scheme for channel estimation

Bit-streaming, method of gradient descent

Stable convergence behavior with µ

Simple fixed-point architecture

$$R_{bb}^{(i)} = R_{bb}^{(i-1)} + b_L b_L^T - b_0 b_0^T$$

$$R_{br}^{(i)} = R_{br}^{(i-1)} + b_L r_L^H - b_0 r_0^H$$

$$A^{(i)} = A^{(i-1)} - \mu \left( R_{bb}^{(i)} A^{(i-1)} - R_{br}^{(i)} \right)$$
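Each new bit drives one step of the iterative scheme. The newest bit and sample are added into $R_{bb}$ and $R_{br}$, the oldest ones in the window are subtracted, and one gradient step is taken on $A$. For a fixed $R_{bb}$ the step converges when $0 < \mu < 2/\lambda_{\max}(R_{bb})$, which is the standard steepest-descent condition.

The sketch below is hypothetical throughout:

- the names and the zero-filled starting window are illustrative;
- the static, noise-free channel $r = A^H b$ is a simplification;
- the conservative choice $\mu = 1/(2KL)$ is an assumption, not a value from the slides.

```python
"""Sliding-window gradient-descent channel tracking (illustrative sketch)."""
import random
from collections import deque


def track_step(r_bb, r_br, a, b_new, r_new, b_old, r_old, mu):
    """Apply the three per-bit updates; R_bb, R_br change in place, new A returned."""
    two_k, n = len(b_new), len(r_new)
    for p in range(two_k):
        for q in range(two_k):
            # R_bb += b_L b_L^T - b_0 b_0^T (integer-valued: only +/-1 products)
            r_bb[p][q] += b_new[p] * b_new[q] - b_old[p] * b_old[q]
        for q in range(n):
            # R_br += b_L r_L^H - b_0 r_0^H (each term is an add or a subtract)
            r_br[p][q] += (b_new[p] * r_new[q].conjugate()
                           - b_old[p] * r_old[q].conjugate())
    # A <- A - mu (R_bb A - R_br): (2K)^2 N = 4K^2 N multiplications per bit
    return [[a[p][q] - mu * (sum(r_bb[p][m] * a[m][q] for m in range(two_k))
                             - r_br[p][q])
             for q in range(n)] for p in range(two_k)]


random.seed(2)
K, N, L = 2, 3, 16
two_k = 2 * K
a_true = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
          for _ in range(two_k)]
# trace(R_bb) <= 2K * L, so mu * lambda_max(R_bb) <= mu * trace(R_bb) <= 1.
mu = 1.0 / (two_k * L)

# Assumption: the window starts filled with all-zero entries.
window = deque([([0] * two_k, [0j] * N)] * L)
r_bb = [[0] * two_k for _ in range(two_k)]
r_br = [[0j] * N for _ in range(two_k)]
a = [[0j] * N for _ in range(two_k)]
for _ in range(2000):
    b = [random.choice((-1, 1)) for _ in range(two_k)]
    r = [sum(a_true[p][q].conjugate() * b[p] for p in range(two_k)) for q in range(N)]
    b_old, r_old = window.popleft()   # oldest bit leaves the window
    window.append((b, r))             # newest bit enters it
    a = track_step(r_bb, r_br, a, b, r, b_old, r_old, mu)

err = max(abs(a[p][q] - a_true[p][q]) for p in range(two_k) for q in range(N))
print(f"max |A - A_true| after 2000 bits: {err:.1e}")
```

In this sketch $R_{bb}$ stays integer-valued because it only ever accumulates ±1 products. That property is what lets the hardware keep it in counters.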

[Figure: Comparison of Bit Error Rates (BER) versus Signal to Noise Ratio (SNR). Curves: iterative channel estimation, O(K²N), and original channel estimation, O(K³+K²N).]

Simulations - Static multipath channel

SINR = 0 dB

Paths = 3

Training = 150 bits

Spreading N = 31

Users K = 15


Design specifications

32 Users (K)

32 spreading code length (N)

Target = 128 Kbps

– 4000 cycles available at 500 MHz

Single cycle addition/multiplication
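The cycle budget follows directly from the clock rate and the target data rate:

$$\frac{500 \times 10^{6}\ \text{cycles/s}}{128 \times 10^{3}\ \text{bits/s}} = 3906.25\ \text{cycles per bit} \approx 4000$$

The per-bit updates of $R_{bb}$, $R_{br}$ and $A$ therefore have to finish within roughly 3,900 clock cycles.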

Task decomposition

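[Figure: task decomposition along the time axis over a tracking window of L bits.]

– Iterate correlation matrices (per bit): R_bb, O(2K²), 8-bit; R_br, O(2KN), 8-bit; A, O(4K²N), 8-bit

– Leaving the window: b_0 (2K × 1 bit), r_0 (N × 8 bits)

– Entering the window: b_L (2K × 1 bit), r_L (N × 8 bits)

– Channel estimate passed to the detector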

Architecture design

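– R_bb update: XNOR gates, UP/DOWN counters

$$R_{bb}^{(i)} = R_{bb}^{(i-1)} + b_L b_L^T - b_0 b_0^T$$

– R_br update: 8-bit adders

$$R_{br}^{(i)} = R_{br}^{(i-1)} + b_L r_L^H - b_0 r_0^H$$

– A update: 8-bit multipliers [Schulte’93]

$$A^{(i)} = A^{(i-1)} - \mu \left( R_{bb}^{(i)} A^{(i-1)} - R_{br}^{(i)} \right)$$

* Schulte, Swartzlander, “Truncated Multiplication with Correction Constant”, Workshop on VLSI Signal Processing, 1993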
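Every bit is ±1, so neither correlation update needs a multiplier. Each product in $R_{bb}$ is one XNOR that steps an up/down counter. Each term in $R_{br}$ is an add or a subtract selected by a bit.

The sketch below checks both shortcuts exhaustively against the multiply-based updates. Several details are assumptions:

- +1 is encoded as bit 1 and −1 as bit 0.
- Samples are treated as one real integer component.
- The function names are hypothetical.

```python
"""Multiplier-free R_bb and R_br updates for +/-1 bits (illustrative sketch)."""
import itertools


def xnor(x, y):
    """1 if the bits agree, else 0 (bit 1 encodes +1, bit 0 encodes -1)."""
    return 1 - (x ^ y)


def rbb_counter_step(count, new_p, new_q, old_p, old_q):
    """Up/down counter for one R_bb entry: + b_L[p] b_L[q] - b_0[p] b_0[q].

    Each product is +1 when the bits agree and -1 otherwise, so the counter
    moves by 2 * (xnor(new) - xnor(old)): -2, 0 or +2.
    """
    return count + 2 * (xnor(new_p, new_q) - xnor(old_p, old_q))


def rbr_add_sub(acc, new_bit, new_r, old_bit, old_r):
    """One R_br entry: +/- r_L and -/+ r_0, selected by the bits (no multiply).

    new_r / old_r stand for one real component of the conjugated sample; a
    second, identical datapath would handle the other component (assumption).
    """
    acc = acc + new_r if new_bit else acc - new_r
    return acc - old_r if old_bit else acc + old_r


def to_pm1(bit):
    return 1 if bit else -1


# Exhaustive check against the multiply-based updates.
for bits in itertools.product((0, 1), repeat=4):
    npb, nqb, opb, oqb = bits
    expected = 5 + to_pm1(npb) * to_pm1(nqb) - to_pm1(opb) * to_pm1(oqb)
    assert rbb_counter_step(5, npb, nqb, opb, oqb) == expected
for nb, ob in itertools.product((0, 1), repeat=2):
    assert rbr_add_sub(10, nb, -7, ob, 3) == 10 + to_pm1(nb) * -7 - to_pm1(ob) * 3
print("bit-level updates match the multiply-based updates")
```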

Area-constrained : Min. area, not real-time

[Figure: area-constrained datapath for the three update equations. Bits b_0 and b_L drive a single up/down (U/D) counter through a MUX to update R_bb. Add/subtract units combine r_0 and r_L into R_br. One MAC followed by a >>8 shift and subtractors forms A(i) from A(i-1). R_bb, R_br and A are held in storage accessed through MUX/DEMUX and load/store logic, with 1-, 8- and 16-bit datapaths. Output: channel estimate.]

Area-constrained : Hardware used

| Blocks | Quantity | Full Adder Cells | Complex | Total |
|---|---|---|---|---|
| Counter | 1×8 | 8 | - | 8 |
| Multiplier | 1×8 | 64 | ×2 | 128 |
| Adders | 3×8 + 2×16 | 56 | ×2 | 112 |

Total Area: 248 FA cells

Total Time (N=K=32): 4K²N = 131,072 (128K) cycles
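The 4K²N time entry counts the multiply-accumulates in $R_{bb}^{(i)} A^{(i-1)}$ when one multiplier performs all of them in sequence. At 500 MHz that gives 500×10⁶ / 131,072 ≈ 3.81 Kbps.

The sketch below is a hypothetical fixed-point version of that schedule and counts one cycle per MAC. Several details are assumptions, not taken from the datapath:

- integer entries;
- a single real component;
- $\mu = 2^{-s}$ applied as an arithmetic right shift.

```python
"""Single-MAC schedule for the A update (illustrative fixed-point sketch)."""


def a_update_single_mac(r_bb, r_br, a, mu_shift):
    """Return (new A, cycles) computing A - mu (R_bb A - R_br) on one MAC.

    Assumptions: integer (fixed-point) entries, one real component only, and
    mu = 2**-mu_shift so that the step size is an arithmetic right shift.
    """
    two_k, n = len(a), len(a[0])
    new_a = [[0] * n for _ in range(two_k)]
    cycles = 0
    for p in range(two_k):
        for q in range(n):
            acc = 0                          # wide accumulator
            for m in range(two_k):
                acc += r_bb[p][m] * a[m][q]  # one multiply-accumulate per cycle
                cycles += 1
            new_a[p][q] = a[p][q] - ((acc - r_br[p][q]) >> mu_shift)
    return new_a, cycles


# Cycle count for one bit with the slide parameters K = N = 32.
K = N = 32
two_k = 2 * K
r_bb = [[1 if p == m else 0 for m in range(two_k)] for p in range(two_k)]
a = [[0] * N for _ in range(two_k)]
r_br = [[0] * N for _ in range(two_k)]
_, cycles = a_update_single_mac(r_bb, r_br, a, mu_shift=4)
print(cycles, 500e6 / cycles)  # cycles per bit, and bits/s at 500 MHz
```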

Time-constrained : Real-time, large area

[Figure: fully parallel datapath. The outer products b_L b_L^T and b_0 b_0^T (K(2K−1) × 1 bit each) update R_bb (2K² × 8 bits). Through MUXes, r_L and r_0 (N × 8 bits each) are combined into R_br (2KN × 16 bits). A bank of multipliers, subtractors and a shift (>>) update A (2KN × 8 bits) to produce the channel estimate.]

Time-constrained : Hardware used

| Blocks | Quantity | Full Adder Cells | Complex | Total |
|---|---|---|---|---|
| Counter | 2K²×8 | 16K² | - | 16K² |
| Multiplier | 4K²N×8 | 256K²N | ×2 | 512K²N |
| Adders | 2KN×16 + 2KN×8 + 4K²N×16 | 48KN + 64K²N | ×2 | 96KN + 128K²N |

Total Area (N=K=32): ~20,000,000 FA cells

Total Time: log2(2K) = 6 cycles

Area-Time efficient architecture design

Area-constrained

– single 8-bit multiplier

– 4K²N cycles (128K) [3.81 Kbps, 248 FA cells]

Time-constrained

– 4K²N 8-bit multipliers

– log2(2K) cycles (6) [83.33 Mbps, 20,000,000 FA cells]

Goal : real-time with minimum area

– different parallelism levels for multipliers
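One way to explore the parallelism levels between these two extremes is a deliberately simple model. It assumes single-cycle multipliers and counts only the 4K²N products in $R_{bb}A$. Under it, the smallest power-of-two multiplier count that meets the ≈3,906-cycle budget is 64 = 2K. That is the parallelism of the area-time efficient design, and it gives 2KN = 2,048 cycles. An unconstrained count allows 34. The model ignores adder-tree latency, control, memory and the $R_{bb}$/$R_{br}$ updates, so it only checks that the numbers are consistent. The function name is hypothetical.

```python
"""Cycles per bit versus number of parallel multipliers (simple model)."""

CLOCK_HZ = 500e6
TARGET_BPS = 128e3
BUDGET = CLOCK_HZ / TARGET_BPS  # 3906.25 cycles available per bit


def cycles_per_bit(k, n, multipliers):
    """Cycles for the 4K^2 N products in R_bb A using P single-cycle multipliers.

    Deliberately simple: ignores adder-tree latency, memory access and the
    R_bb / R_br updates.
    """
    work = 4 * k * k * n
    return -(-work // multipliers)  # ceiling division


K = N = 32
smallest_pow2 = next(2 ** e for e in range(20)
                     if cycles_per_bit(K, N, 2 ** e) <= BUDGET)
fewest = next(p for p in range(1, 4 * K * K * N + 1)
              if cycles_per_bit(K, N, p) <= BUDGET)
print(f"power of two: {smallest_pow2} multipliers, "
      f"{cycles_per_bit(K, N, smallest_pow2)} cycles")
print(f"unconstrained: {fewest} multipliers, {cycles_per_bit(K, N, fewest)} cycles")
```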

Area-Time efficient : Real-time, min. area

[Figure: area-time efficient datapath. Bits b_0 and b_L (2K × 1 bit) and the products b_L b_L^T and b_0 b_0^T feed counters for R_bb. Samples r_0 and r_L (N × 8 bits) are selected through MUXes to update R_br. 2K multipliers (2K × 8 bits), subtractors, a shift (>>) and an adder update A(i) from A(i-1) through MUX/DEMUX and load/store logic, producing the channel estimate.]

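Area-Time efficient : Hardware used

| Blocks | Quantity | Full Adder Cells | Complex | Total |
|---|---|---|---|---|
| Counter | 2K×8 | 16K | - | 16K |
| Multiplier | 2K×8 | 128K | ×2 | 256K |
| Adders | 2K×16 + 2×8 + 1×16 | 32K + 32 | ×2 | 64K + 64 |

Total Area (N=K=32): ~10,000 FA cells

Total Time: 2KN = 2,048 (≈2,000) cycles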
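The totals in the three hardware tables follow a simple cost model. A w-bit counter or adder costs w FA cells and a w-bit multiplier costs w² cells. Multipliers and adders are doubled for complex data; counters are not.

The sketch below reproduces all three tables from that model; the helper names are hypothetical. For N = K = 32 it gives exactly 248, 21,086,208 and 10,816 FA cells, which the slides round to 248, 2×10⁷ and 10⁴.

```python
"""Full-adder cell and cycle counts for the three architectures (sketch)."""
import math


def fa_cells(counters, multipliers, adders):
    """Total FA cells from (quantity, bit_width) lists.

    Cost model read off the slide tables: a w-bit counter or adder costs w cells,
    a w-bit multiplier w*w cells; multipliers and adders are doubled for complex
    data, counters are not.
    """
    c = sum(q * w for q, w in counters)
    m = sum(q * w * w for q, w in multipliers)
    a = sum(q * w for q, w in adders)
    return c + 2 * (m + a)


def designs(k, n):
    """(FA cells, cycles per bit) for each design, following the tables."""
    return {
        "area-constrained": (
            fa_cells([(1, 8)], [(1, 8)], [(3, 8), (2, 16)]),
            4 * k * k * n),
        "time-constrained": (
            fa_cells([(2 * k * k, 8)], [(4 * k * k * n, 8)],
                     [(2 * k * n, 16), (2 * k * n, 8), (4 * k * k * n, 16)]),
            math.ceil(math.log2(2 * k))),
        "area-time efficient": (
            fa_cells([(2 * k, 8)], [(2 * k, 8)], [(2 * k, 16), (2, 8), (1, 16)]),
            2 * k * n),
    }


for name, (cells, cycles) in designs(32, 32).items():
    print(f"{name:20s} {cells:>12,d} FA cells {cycles:>8,d} cycles/bit")
```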

DSP comparisons

| Implementation | Clock Rate | Full Adder Cells | Data Rate |
|---|---|---|---|
| C67 DSP | 166 MHz | - | 1.02 Kbps |
| Area | 500 MHz | 248 | 3.81 Kbps |
| Area-Time | 500 MHz | 10⁴ | 256 Kbps |
| Time | 500 MHz | 2×10⁷ | 83.33 Mbps |

– DSPs are unable to exploit bit-level parallelism

– Inefficient storage of bits

– Unable to replace bit-multiplications by add/sub
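Bit-level parallelism can be made concrete with one $R_{bb}$ entry. That entry is the sum over the window of products of two users' ±1 bits. If L bits of each stream are packed into one word, the whole sum is one XNOR plus one population count, instead of L separate multiply-accumulates. The architectures do this with XNOR gates and counters.

The sketch below emulates it with Python integers and checks the result against the direct sum. The names and the 150-bit window, borrowed from the simulation's training length, are illustrative assumptions.

```python
"""Word-level XNOR/popcount correlation of +/-1 bit streams (sketch)."""
import random


def pack(bits):
    """Pack a +/-1 sequence into an integer: bit j is set when bits[j] == +1."""
    word = 0
    for j, b in enumerate(bits):
        if b == 1:
            word |= 1 << j
    return word


def pm1_dot(x, y, length):
    """Sum of products of two packed +/-1 sequences of the given length.

    XNOR marks positions where the signs agree (+1 products); the rest are -1.
    """
    mask = (1 << length) - 1
    agree = bin(~(x ^ y) & mask).count("1")  # XNOR, then population count
    return 2 * agree - length


# One R_bb entry over a window of L bits: correlate the bit streams of two users.
random.seed(3)
L = 150
user_p = [random.choice((-1, 1)) for _ in range(L)]
user_q = [random.choice((-1, 1)) for _ in range(L)]
packed = pm1_dot(pack(user_p), pack(user_q), L)
direct = sum(u * v for u, v in zip(user_p, user_q))
print(packed, direct)
```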

Scalability of architectures

Design for maximum number of users in the system

Fewer users

– turn off functional units to reduce power

– reconfigure hardware for higher data rates (FPGA)

Investigating K-user design using K/2-user designs.

Investigating DSP extensions

Conclusions

New estimation scheme

– designed from an implementation perspective

– bit-streaming, fixed-point architecture

– reduced complexity, same error rate performance

Real-time architecture designs

– exploit bit-level parallelism

– area-constrained, time-constrained

– real-time with minimum area

=> Real-time architectures for base-band signal processing
