Page 1: ppt

www.eecs.umich.edu/~sdrg

SODA: A Low-power Architecture For Software Radio

Yuan Lin (1), Hyunseok Lee (1), Mark Woh (1), Yoav Harel (1), Scott Mahlke (1), Trevor Mudge (1), Chaitali Chakrabarti (2), Krisztian Flautner (3)

(1) Advanced Computer Architecture Lab, University of Michigan
(2) Department of Electrical Engineering, Arizona State University
(3) ARM, Ltd.

Page 2: ppt


Anatomy of 3G Cellular Phone

[Figure: block diagram of a 3G cellular phone, built up over several animation steps]

Conventional handset:
• Application processor: GPP+DSP, with power manager, camera, keyboard, display, speaker, microphone
• W-CDMA baseband processor: GPP+DSP+ASICs
• Analog frontend: ASICs
• Bluetooth: DSP+ASICs
• GPS: DSP+ASICs

Baseband protocol stack: the transport, network, link and MAC layers run on the GPP; the PHY layer runs on DSP+ASICs.

Software Defined Radio (SDR): use of software routines instead of ASICs for physical layer operations of wireless protocol systems.

SDR handset:
• Application processor: GPP+DSP, with camera, keyboard, display, speaker, microphone
• SDR baseband processor: GPP+DSPs (PHY layer on DSPs)
• Analog ASICs for GPS, Bluetooth, W-CDMA and 802.11

Page 3: ppt


Advantages of Software Defined Radio

• Multi-mode operations
• Lower costs
  – Faster time to market
  – Prototyping and bug fixes
  – Chip volumes
  – Longevity of platforms
• Protocol complexity favors software dominated solutions
• Enables future wireless communication innovations
  – Cognitive radio

[Figure: a single SDR platform covering UWB, EDGE, 802.16a, Bluetooth, 802.11b, WCDMA and 802.11n]

Page 4: ppt


Why is SDR Challenging?

[Figure: peak performance (Gops, log axis from 1 to 1000) versus power (Watts, log axis from 0.1 to 100). Diagonal lines mark constant power efficiency at 1, 10 and 100 Mops/mW, with an arrow toward better power efficiency. Labels in the plot: general purpose processors, embedded DSPs, high-end DSPs, Pentium M, TI C6x, IBM Cell, and the mobile SDR requirements region.]

• SDR Design Objectives for 3G and WiFi
  – Throughput requirements
    • 40Gops peak throughput
  – Power budget
    • 100mW~500mW peak power
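The design objectives on this slide imply a required power efficiency, which can be read against the Mops/mW lines in the plot. The arithmetic below is a minimal sketch; the function name is illustrative and not from the slides.

```python
# Required power efficiency implied by the slide's design objectives.
PEAK_THROUGHPUT_GOPS = 40.0        # from the slide: 40 Gops peak throughput
POWER_BUDGETS_MW = (100.0, 500.0)  # from the slide: 100 mW to 500 mW peak power


def required_efficiency_mops_per_mw(gops: float, milliwatts: float) -> float:
    """Convert a throughput target and a power budget into Mops/mW."""
    return gops * 1000.0 / milliwatts


for budget in POWER_BUDGETS_MW:
    eff = required_efficiency_mops_per_mw(PEAK_THROUGHPUT_GOPS, budget)
    print(f"{budget:.0f} mW budget -> {eff:.0f} Mops/mW")
```

With these inputs the requirement falls between 80 and 400 Mops/mW, which is around or above the 100 Mops/mW line drawn in the figure.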

Page 5: ppt


The Anatomy of Wireless Protocols

1. Filtering: suppress signals outside frequency band

2. Modulation: map source information onto signal waveforms

3. Channel Estimation: estimate channel conditions for transceivers

4. Error Correction: correct errors induced by noisy channel

[Figure: W-CDMA physical layer processing]

Transmitter: upper layers -> channel encoder -> interleaver -> spreader -> scrambler -> LPF-Tx -> D/A -> frontend

Receiver: frontend -> A/D -> LPF-Rx -> searcher, plus parallel descrambler -> despreader paths feeding a combiner -> deinterleaver -> channel decoder (turbo/Viterbi) -> upper layers

Blocks grouped by the four categories above:
• Filtering: LPF-Tx, LPF-Rx
• Modulation: scrambler, spreader, descrambler, despreader, combiner
• Channel estimation: searcher
• Error correction: interleaver, channel encoder, deinterleaver, channel decoder (turbo/Viterbi)
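To make the spreader and despreader blocks concrete, here is a toy sketch of direct-sequence spreading. It is a simplified illustration and not the W-CDMA procedure: the 8-chip code below is hypothetical, and real W-CDMA uses standardized channelization and scrambling codes on complex samples.

```python
# Toy direct-sequence spreading; CODE is a made-up +1/-1 sequence.
CODE = [1, -1, 1, -1, 1, 1, -1, -1]


def spread(bits, code):
    """Map each bit to +1/-1 and multiply it by every chip of the code."""
    chips = []
    for b in bits:
        symbol = 1 if b else -1
        chips.extend(symbol * c for c in code)
    return chips


def despread(chips, code):
    """Correlate each code-length block with the code and take the sign."""
    n = len(code)
    bits = []
    for i in range(0, len(chips), n):
        corr = sum(x * c for x, c in zip(chips[i:i + n], code))
        bits.append(1 if corr > 0 else 0)
    return bits


data = [1, 0, 1, 1]
tx = spread(data, CODE)
tx[3] = -tx[3]  # flip one chip to mimic a channel error
print(despread(tx, CODE) == data)
```

Because each bit is spread over eight chips, one flipped chip only lowers the correlation magnitude from 8 to 6, so the sign decision still recovers the bit.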

Page 6: ppt


SDR – Application Specific Design

• Wireless protocols are systems of DSP algorithms
  – System-level
    • Example: Specification of W-CDMA DCH channel
  – Algorithm-level
    • Example: Implementation of a 64 point FFT
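As an illustration of the algorithm-level example, a 64-point FFT can be sketched with a textbook recursive radix-2 decomposition. This is a floating-point reference in plain Python, not the fixed-point SIMD implementation the slides refer to; the direct DFT is included only to check the result.

```python
import cmath


def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t                            # butterfly, upper half
        out[k + n // 2] = even[k] - t                   # butterfly, lower half
    return out


def dft(x):
    """Direct O(n^2) DFT used as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


samples = [complex(i % 7, -(i % 3)) for i in range(64)]
max_err = max(abs(a - b) for a, b in zip(fft(samples), dft(samples)))
print(f"max difference from direct DFT: {max_err:.2e}")
```

A 64-point transform has log2(64) = 6 butterfly stages, and the data reordering between stages is the kind of static movement pattern that a shuffle network can serve.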

Page 7: ppt


System Level Design Decisions

System Characteristics | SODA Architectural Decisions
1. Algorithm macro-pipelining with streaming computation | 1. Multi-core system; 2. Communication through DMA
2. Multiple periodic real-time deadlines | 3. Deterministic hardware behavior; 4. Compile-time algorithm mapping and scheduling
3. Low streaming throughput between algorithms | 5. Low throughput interconnect
4. Heterogeneous inter-algorithm communication | 6. Multi-level scratchpad memories

Page 8: ppt


SODA System Architecture

• 4 PEs
  – static kernel mapping and scheduling
  – SIMD+Scalar units
• 1 ARM GPP controller
  – scalar algorithms and protocol controls

[Figure: SODA system architecture, with an ARM controller, four PEs and a global memory. Each PE contains a local memory and an execution unit. PE detail: SIMD RF, SIMD MEM, SIMD ALU, scalar RF, scalar MEM, scalar ALU, WtoS & StoW, and DMA.]

Page 9: ppt


SODA Memory Organization

[Figure: SODA system architecture, repeated from the previous slide]

• 2-Level scratchpad memories
  – 12KB local scratchpad memory for stream queues
  – 64KB global scratchpad memory for large buffers
• Low-throughput shared bus
  – 200MHz 32-bit bus
  – inter-PE communication using DMA
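The bus figures above give a peak transfer rate by simple arithmetic. The sketch below assumes one 32-bit word moves per bus cycle with no arbitration or protocol overhead, which the slides do not state; the function names are illustrative.

```python
BUS_CLOCK_HZ = 200_000_000  # from the slide: 200 MHz
BUS_WIDTH_BITS = 32         # from the slide: 32-bit bus


def peak_bus_bytes_per_second(clock_hz: int, width_bits: int) -> int:
    """Peak rate assuming one full-width transfer every cycle (an assumption)."""
    return clock_hz * width_bits // 8


def transfer_time_us(num_bytes: int, bytes_per_second: int) -> float:
    """Time to move num_bytes at the given rate, in microseconds."""
    return num_bytes / bytes_per_second * 1e6


rate = peak_bus_bytes_per_second(BUS_CLOCK_HZ, BUS_WIDTH_BITS)
print(f"peak bus rate: {rate / 1e6:.0f} MB/s")
print(f"12 KB local memory refill: {transfer_time_us(12 * 1024, rate):.2f} us")
```

Under that assumption the bus peaks at 800 MB/s, so refilling a whole 12KB local scratchpad takes on the order of 15 microseconds.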

Page 10: ppt


DSP Algorithm Characteristics

• 8 to 16-bit precision
• Vector operations
  – long vectors
  – constant vector size
• Static data movement patterns
• Scalar operations

Protocol | Algorithm | Type of Computation | Vector Width
W-CDMA | Filter | Vector | 64
W-CDMA | Modulation | Vector | 2560
W-CDMA | Channel Est. | Vector | 320
W-CDMA | Error Correction | Mixed | 8 or 256
802.11a | Filter | Vector | 33
802.11a | Modulation (FFT) | Vector | 64
802.11a | Channel Est. | Mixed | 16
802.11a | Error Correction | Mixed | 64
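One way to read the vector widths in this table is to ask how many passes a fixed-width SIMD unit needs per vector and how full the last pass is. The sketch below uses the 32-wide SIMD described on the later PE slides and assumes one pass handles up to 32 elements; that mapping is an illustration, not a statement about how the kernels were actually coded.

```python
SIMD_WIDTH = 32  # 32-wide SIMD, taken from the PE slides

# Vector widths copied from the table on this slide.
VECTOR_WIDTHS = {
    "W-CDMA filter": 64,
    "W-CDMA modulation": 2560,
    "W-CDMA channel est.": 320,
    "802.11a filter": 33,
    "802.11a FFT": 64,
    "802.11a channel est.": 16,
}


def simd_passes(width: int, lanes: int = SIMD_WIDTH) -> int:
    """Number of full-width passes needed to cover `width` elements."""
    return -(-width // lanes)  # ceiling division


def lane_utilization(width: int, lanes: int = SIMD_WIDTH) -> float:
    """Fraction of lane slots doing useful work across all passes."""
    return width / (simd_passes(width, lanes) * lanes)


for name, width in VECTOR_WIDTHS.items():
    print(f"{name}: {simd_passes(width)} passes, "
          f"{lane_utilization(width):.0%} lane utilization")
```

Widths that are multiples of 32 fill every lane, while the 33-element filter and the 16-element channel estimator leave roughly half of the lane slots idle under this simple mapping.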

Page 11: ppt


SODA PE Architecture

[Figure: PE block diagram. SIMD pipeline: 32-way SIMD with 16bit lanes (RF, ALU, multiplier), SSN (32x16bit), SIMD memory (8KB), vector-to-scalar (stages 1 and 2) and scalar-to-vector units. Scalar pipeline: scalar RF, 16bit ALU, scalar memory (4KB). AGU pipeline: AGU RF, 12bit AGU ALU. Also shown: 4KB instruction memory, DMA, and the labels 16bit BUS and 512bit.]

• 2 issue LIW (400MHz)
  – SIMD + (Scalar or AGU)
• DMA
  – mem-to-mem transfer
  – access global memory
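The PE parameters on this slide allow a rough upper bound on SIMD throughput. The sketch below counts one 16-bit lane operation per lane per cycle and ignores the scalar/AGU issue slot, memory stalls and control overhead. These are simplifying assumptions, so the result is an idealized peak, not a measured figure. The count of four PEs comes from the system architecture slide.

```python
SIMD_LANES = 32            # from the slide: 32-way SIMD
LANE_BITS = 16             # from the slide: 16bit lanes
PE_CLOCK_HZ = 400_000_000  # from the slide: 400 MHz
NUM_PES = 4                # from the system architecture slide


def peak_lane_ops_per_second(lanes: int, clock_hz: int) -> int:
    """Idealized peak: every lane completes one operation every cycle."""
    return lanes * clock_hz


per_pe = peak_lane_ops_per_second(SIMD_LANES, PE_CLOCK_HZ)
system = per_pe * NUM_PES
print(f"SIMD datapath width: {SIMD_LANES * LANE_BITS} bits")
print(f"per-PE peak: {per_pe / 1e9:.1f} G lane-ops/s")
print(f"4-PE peak:   {system / 1e9:.1f} G lane-ops/s")
```

The 512-bit width matches the SIMD memory port label in the figure, and the idealized four-PE peak is of the same order as the 40 Gops objective stated earlier in the deck.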

Page 12: ppt


SODA PE SIMD Pipeline

[Figure: PE block diagram, repeated from the previous slide]

SIMD pipeline detail:
• RF: 16-bit, 16 entries, 2 read / 1 write port
• EX: 16-bit multiplier with 40-bit ACC, 16-bit ALU
• WB: 16bit
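The pairing of a 16-bit multiplier with a 40-bit accumulator can be illustrated with a small multiply-accumulate model. The sketch assumes signed two's-complement operands and wraparound on overflow; the slides do not say whether the hardware wraps or saturates, so treat that choice as hypothetical.

```python
ACC_BITS = 40   # from the slide: 40-bit ACC
DATA_BITS = 16  # from the slide: 16-bit multiplier inputs


def to_signed(value: int, bits: int) -> int:
    """Wrap an integer into the signed two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc: int, a: int, b: int) -> int:
    """acc + a*b with 16-bit operands and a 40-bit wrapping accumulator."""
    a = to_signed(a, DATA_BITS)
    b = to_signed(b, DATA_BITS)
    return to_signed(acc + a * b, ACC_BITS)


def dot(xs, ys) -> int:
    """Dot product accumulated through the 40-bit MAC model."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc


# 256 worst-case positive products still fit in 40 bits.
print(dot([32767] * 256, [32767] * 256) == 256 * 32767 * 32767)
```

A single 16x16 product needs up to 31 bits plus sign, so the extra accumulator bits act as guard bits: in this model 511 worst-case products accumulate without overflow, and the 512th wraps.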

Page 13: ppt


SODA PE SIMD Pipeline

[Figure: PE block diagram, repeated from the previous slides]

• SIMD
  – 32 wide
  – predicated execution
  – predicated negation
• Memory
  – 512bit port
  – 1 read port
  – 1 write port
  – 8 KBytes
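Predicated SIMD operations apply an operation only in lanes whose predicate bit is set. The sketch below models that idea with Python lists; the function names and exact semantics are assumptions for illustration and are not taken from the SODA instruction set.

```python
def predicated_add(dst, a, b, pred):
    """Per lane: write a+b where the predicate is set, keep dst elsewhere."""
    return [x + y if p else d for d, x, y, p in zip(dst, a, b, pred)]


def predicated_negate(a, pred):
    """Per lane: negate the element where the predicate is set."""
    return [-x if p else x for x, p in zip(a, pred)]


# Multiplying samples by a +1/-1 code without a multiplier:
# negate exactly those lanes where the code chip is -1.
samples = [3, -2, 5, 7]
code = [1, -1, -1, 1]
pred = [c < 0 for c in code]
print(predicated_negate(samples, pred))
```

In this model, multiplying by a +1/-1 sequence reduces to a predicated negation, which suggests why such an operation suits scrambling and despreading style kernels.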

Page 14: ppt


SODA PE SIMD Shuffle Network

[Figure: PE block diagram, repeated from the previous slides]

SIMD Shuffle Network
• Shuffle Exchange (SE)
• Inverse Shuffle Exchange (ISE)
• Exchange Only (EX)
• Iterative Feedback
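The permutations named on this slide can be sketched on a 32-element vector. The code below uses the textbook definitions of a perfect shuffle, its inverse, and a pairwise exchange; how the SODA network is actually wired and controlled is not given in the slides, so this is an illustration only.

```python
def shuffle(v):
    """Perfect shuffle: interleave the lower and upper halves of the vector."""
    half = len(v) // 2
    out = []
    for lo, hi in zip(v[:half], v[half:]):
        out += [lo, hi]
    return out


def inverse_shuffle(v):
    """Inverse perfect shuffle: even positions first, then odd positions."""
    return v[0::2] + v[1::2]


def exchange(v, swap):
    """Swap adjacent pairs (2i, 2i+1) wherever swap[i] is set."""
    out = list(v)
    for i, s in enumerate(swap):
        if s:
            out[2 * i], out[2 * i + 1] = out[2 * i + 1], out[2 * i]
    return out


lanes = list(range(32))
v = lanes
for _ in range(5):  # log2(32) = 5 shuffle steps
    v = shuffle(v)
print(v == lanes)
```

Five shuffle steps return a 32-element vector to its original order, and repeated passes through a single shuffle-exchange stage are one way to build larger permutations, consistent with the iterative feedback item above.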

Page 15: ppt


SODA PE Scalar Pipeline

[Figure: PE block diagram, repeated from the previous slides]

• Scalar
  – One 16-bit datapath
  – No mult unit
• Scalar memory
  – 16bit port
  – 1 read/write port
  – 4 KBytes
• Scalar-to-Vector
• Vector-to-Scalar

Page 16: ppt


W-CDMA Mapping On SODA

[Figure: the W-CDMA physical layer processing chains (transmitter and receiver) mapped onto the ARM, the four PEs and the global memory]

Kernel groups shown in the mapping:
• Misc. control, de-interleaver, power control, PN code TX/RX
• WCDMA receiver: 2 LPF-Rx, de-scrambler, despreader, combiner
• Searcher
• Turbo decoder
• WCDMA transmitter: 4 LPF-Rx, scrambler, spreader, turbo encoder, interleaver

Buffers shown in the mapping:
• Buffer (1K Bytes), four instances
• Buffer (2K Bytes)
• Buffer (10 Bytes)
• Buffer (20 KBytes), two instances
• FIFO queue (12.5 KBytes)

Page 17: ppt


SDR Performance Distribution

[Figure: bar chart of computations (Mcycles, axis from 0 to 700) for filter, modulation, channel estimation and error correction. Series: W-CDMA (2Mbps) and 802.11a (24Mbps).]

• 802.11a has a higher total number of computational cycles
• W-CDMA requires more computational cycles per bit

Page 18: ppt


Power Consumption at 180nm

[Figure: bar chart of power (mW) in 180nm, axis from 0 to 1400, by component: PE data memory, PE SIMD RF, PE SIMD ALUs, PE SIMD pipeline, PE others, global memory, system others. Series: W-CDMA (2Mbps) and 802.11a (24Mbps).]

• Wide SIMD requires a higher number of pipeline registers
• 802.11a consumes more power than W-CDMA
• 8-bit W-CDMA computation versus 16-bit 802.11a computation

Page 19: ppt


Summary

• Key features of SODA
  – Multi-PE with scratchpad memories
  – Low throughput shared bus
  – 2-issue LIW: SIMD+(Scalar or AGU)
  – 32-wide SIMD processing
  – SIMD shuffle network

SDR Hardware Requirements | SODA Results
Comp. requirements: 10 ~ 100 GOPS | W-CDMA & 802.11a: 1.3 ~ 2 GOPS (with SODA LIW ops)
Sub-watt power budget: ~ 0.2 Watt for cellular phones | 180nm: ~ 3 Watts (area: 26.6 mm²); 90nm (est.): ~ 0.5 Watt (6.7 mm²)
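The 90nm area estimate on this slide is close to what ideal geometric scaling of the 180nm area would give. The sketch below applies the usual first-order rule that area scales with the square of the feature size; whether the authors used this rule is an assumption, and the power ratio is computed from the slide's figures without modeling its cause.

```python
AREA_180NM_MM2 = 26.6  # from the slide
POWER_180NM_W = 3.0    # from the slide (approximate)
POWER_90NM_W = 0.5     # from the slide (estimated)


def scaled_area(area_mm2: float, from_nm: float, to_nm: float) -> float:
    """First-order area scaling: proportional to feature size squared."""
    return area_mm2 * (to_nm / from_nm) ** 2


area_90 = scaled_area(AREA_180NM_MM2, 180, 90)
print(f"ideal 90nm area: {area_90:.2f} mm^2 (slide reports 6.7 mm^2)")
print(f"power reduction factor from the slide: {POWER_180NM_W / POWER_90NM_W:.0f}x")
```

Ideal scaling gives 6.65 mm², which rounds to the 6.7 mm² shown. The estimated power drops by about a factor of six.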

Page 20: ppt


Conclusion & Future Work

• Conclusion
  – 2G and 3G SDR solutions are achievable in 90nm
  – Optimization opportunities at the algorithm, software and hardware levels
• Future Work
  – SDR for Idle mode operation (ISLPED ’06)
  – SODA for 4G protocols
  – Application-specific language for SDR
  – Compiler for SODA

Page 21: ppt


Questions?

• www.eecs.umich.edu/~sdrg

Page 22: ppt


Backup Slides

Page 23: ppt


Different Levels of Software Radio

Tier | Name | Description
Tier 0 | Hardware Radio (HR) | Implemented using hardware components. Cannot be modified.
Tier 1 | Software Controlled Radio (SCR) | Only control functions are implemented in software: inter-connects, power levels, etc.
Tier 2 | Software Defined Radio (SDR) | Software control of a variety of modulation techniques, wide-band or narrow-band operation, security functions, etc.
Tier 3 | Ideal Software Radio (ISR) | Programmability extends to the entire system with analog conversion only at the antenna.
Tier 4 | Ultimate Software Radio (USR) | Defined for comparison purposes only.

Source: http://www.sdrforum.org

Page 24: ppt


Power Methodology

• Our flow sequence was
  – Design Compiler and Silicon Ensemble
    • For initial floorplan estimation
  – Physical Compiler
    • For placement and optimization
  – Silicon Ensemble
    • Routing
• We optimized for power and delay
• Blocks like memory were generated with Artisan Memory Generators
• We used the Synopsys IP blocks as much as possible to get better compiled blocks

Page 25: ppt


[Figure: W-CDMA execution timeline on SODA over 1 W-CDMA frame (15 slots), 10 msec; 1 slot = 0.67 msec. Input arrives from the A/D. Per-slot FIR, MOD, PN and PC blocks repeat along the time axis, with synchronization points labeled SYNC P1, ARM and SYNC P1, P2, ARM.]

Resource assignment:
• ARM: PN / Power Ctrl.
• PE1: FIR / Mod.
• PE2: Searcher
• PE3: Turbo
• PE4: TX

Real-time critical paths:
• Searcher: 5 msec
• Power Control: 0.67 msec

Kernel | Time
PN Code | <0.1 msec
Power Control | <0.1 msec
Deinterleaver | 0.2 msec
Interleaver | 0.2 msec
Turbo Encoder | 0.2 msec
FIR (Tx) | 8 msec
Modulation | 1 msec
FIR (Rx) | 0.3 msec
Demodulation | 0.1 msec
Turbo Decoder | 10 msec
Searcher | 5 msec
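The slot timing on this slide follows from the frame structure: a 10 msec W-CDMA frame divided into 15 slots. The sketch below reproduces that arithmetic and expresses the searcher's 5 msec critical path in slots; the helper name is illustrative.

```python
FRAME_MS = 10.0       # from the slide: 1 W-CDMA frame = 10 msec
SLOTS_PER_FRAME = 15  # from the slide: 15 slots per frame
SEARCHER_PATH_MS = 5.0  # from the slide: searcher real-time critical path

slot_ms = FRAME_MS / SLOTS_PER_FRAME


def slots_spanned(duration_ms: float) -> float:
    """Express a duration as a number of W-CDMA slots."""
    return duration_ms / slot_ms


print(f"slot duration: {slot_ms:.2f} msec")
print(f"searcher critical path: {slots_spanned(SEARCHER_PATH_MS):.1f} slots")
```

The slot duration rounds to the 0.67 msec shown for the power control critical path, so power control must complete once per slot while the searcher's deadline spans half a frame.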