1 Kurt Keutzer Lecture 10b: Implementing DSP Functionality: Alternatives Prepared by: Professor Kurt Keutzer Computer Science 252, Spring 2000 With contributions from: Prof. Heinrich Meyr, University of Aachen Philip Chong, David Chinnery, Rhett Davis, Paul Husted, Niraj Shah, Chris Taylor, Scott Weber, Ning Zhang
Transcript
1Kurt Keutzer
Lecture 10b: Implementing DSP Functionality:
Alternatives
Prepared by: Professor Kurt Keutzer
Computer Science 252, Spring 2000
With contributions from:
Prof. Heinrich Meyr, University of Aachen
Philip Chong, David Chinnery, Rhett Davis, Paul Husted,
Niraj Shah, Chris Taylor, Scott Weber, Ning Zhang
2Kurt Keutzer
System Implementation Choices
[Figure: system functionality can be implemented on an off-the-shelf µP/DSP, an embedded-core µP/DSP, an application-specific µP (ASIP), or an ASIC. The DSP core and the ASIP core are each drawn with a program ROM, a coefficient ROM, and control.]
3Kurt Keutzer
Making a Successful Comparison - 1
Find an interesting application kernel
• Viterbi decoding for speech processing (not a full modem!)
Find realistic constraints native to the application
• n=2, K=7, QPSK, 100 kb/s, BER = 10^-4
Find architectures/implementations that are promising for the application
• TI TMS320C54, Tensilica Xtensa
• What are the relevant features of this architecture that support this application?
Fix application constraints across all implementations (above)
Fix key parameters for implementation comparison
• performance (constraint)
• area
• power
4Kurt Keutzer
Making a Successful Comparison - 2
Identify how key parameters will be measured
• performance - instruction set simulator, eval board
• area - data sheets, gate estimates
• power - eval board, TI application note
Implement your application kernel
• Examine different algorithms
• Start with code downloaded from the web - multimedia benchmarks etc.
• Build your software development/evaluation environment:
Implement your application kernel (cont)
Phase 0: Research
• Find application notes, research reports for your own or comparable architectures
Phase 1: Estimation
• Develop a quick estimate based on initial code
• Integrate research findings
• Do a quick back-of-envelope reality check
Phase 2: Real implementation/Tuning
• Tailor algorithm, implementation to architecture
• Do your very best!
• Have a contest with your partner
Phase 3: Evaluation
• Apply evaluation tools to key parameters
• Evaluate and compare results - return to 2
If your life depended on choosing the right part - what would you do?
6Kurt Keutzer
Making a Successful Comparison - 4
Final evaluation and comparison - compare all implementations
To evaluate for a product - everything is fair game
To evaluate principally the architectures - need to consider:
• fab differences - TSMC vs. IBM (10-20% faster)
• process differences - .35 micron vs. .25 (50% faster)
• power supply differences - 3.0V vs. 1.5V
• ASIC vs. custom implementations - (2x faster)
Now evaluate - if I were the architect of this processor/implementor of this system on a chip, what would I do differently?
• cache sizes
• register availability
• additional instructions
• on-chip memory
7Kurt Keutzer
Making a Successful Comparison - 5
Just for fun …
In addition to primary constraints (speed, cost, power), final real-world considerations:
• business relationships (joint partnership with Lucent)
• time-to-market issues
  • time to configure?
  • software development environment
  • library/application software support
  • application engineering support
8Kurt Keutzer
Viterbi Algorithm
Prof. Heinrich Meyr
University of Aachen
9Kurt Keutzer
Viterbi Decoders in digital communication systems
[Figure: transmit chain showing signal source, source coder, and convolutional or trellis coder & mapper]
Butterfly trellis structure and resource sharing for the K = 3, rate 1/2 code
[Figure: ACS units connected to the path metric memory. Old state metrics for states 0,k to 3,k pass through MUXes into shared ACS units, which produce the new state metrics for states 0,k+1 to 3,k+1.]
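The butterfly structure can be sketched in software as one add-compare-select (ACS) step over the four-state trellis. This is a minimal sketch under stated assumptions: the slide does not name the generator polynomials for the K = 3 code, so the common (7, 5) octal pair is assumed; branch metrics are hard-decision Hamming distances; and the newest input bit is assumed to enter at the most significant end of the shift register, so that old states 2j and 2j+1 feed new states j and j+2. The names `branch_output` and `acs_step` are illustrative only.

```python
K = 3
N_STATES = 1 << (K - 1)        # 4 states for K = 3
G = (0o7, 0o5)                 # assumed generators; not given on the slide


def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def branch_output(state, bit):
    """Coded bit pair emitted when `bit` enters the encoder in `state`."""
    reg = (bit << (K - 1)) | state          # newest bit at the MSB end
    return tuple(parity(reg & g) for g in G)


def acs_step(old_pm, received):
    """One trellis step: returns (new path metrics, decision bits)."""
    half = N_STATES // 2
    new_pm = [0] * N_STATES
    decisions = [0] * N_STATES
    for j in range(half):                   # one butterfly per j
        for bit in (0, 1):
            new_state = j + bit * half
            candidates = []
            for lsb in (0, 1):              # the two predecessors 2j, 2j+1
                old_state = 2 * j + lsb
                out = branch_output(old_state, bit)
                bm = sum(o != r for o, r in zip(out, received))
                candidates.append(old_pm[old_state] + bm)   # add
            decisions[new_state] = int(candidates[1] < candidates[0])  # compare
            new_pm[new_state] = min(candidates)                        # select
    return new_pm, decisions


new_pm, decisions = acs_step([0, 100, 100, 100], (0, 0))
```

Each butterfly reads two old metrics and writes two new ones, which is why a small number of ACS units can be shared across all states through the MUXes.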
18Kurt Keutzer
Survivor Memory Unit
19Kurt Keutzer
REA hardware architecture
[Figure: REA hardware architecture for the four-state trellis. Processing elements (PE) for states s_{0,k} to s_{3,k} hold the survivor estimates û[0] to û[3], indexed from k-D to k.]
20Kurt Keutzer
[Figure: acquisition of the final survivor, followed by decoding. Decoded sequence: 0 0 ... 0 1 0. The estimate û[0] is marked at times k, k-D, and k-(D+M-1).]
21Kurt Keutzer
Viterbi Project Constraints
• uncoded word length = 1
• coded word length (n) = 2; this means that the code is rate 1/2
• constraint length (K, aka L) = 7; this means that the number of states in the trellis is 2^(K-1), or 64 states
• branch metric calculation is QPSK
• soft decision wordlength (q) = 6
• chain-backing depth (D) = 96
• generator polynomials: p0 = 171, p1 = 133 (octal); this means that p0 = 1111001, p1 = 1011011
• data rate 100 kb/s
• goal: bit error rate (BER) = 10^-4
• signal to noise ratio (SNR)
• degradation 0.05 dB
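To make the constraints concrete, here is a minimal sketch of the rate 1/2, K = 7 encoder that the decoder has to invert. The polynomials come from the slide. The bit ordering (newest input bit at the most significant end of the register) and the function names are assumptions made for illustration, not the project's reference code.

```python
K = 7
POLYS = (0o171, 0o133)          # p0, p1 from the project constraints (octal)
N_STATES = 1 << (K - 1)         # 2^(K-1) = 64 trellis states


def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def encode(bits):
    """Rate-1/2 convolutional encoder; returns a flat list of coded bits."""
    state = 0
    coded = []
    for b in bits:
        reg = (b << (K - 1)) | state        # assumed: newest bit at the MSB end
        coded.extend(parity(reg & g) for g in POLYS)
        state = reg >> 1                    # drop the oldest bit
    return coded


coded = encode([1, 0])
```

Each input bit produces n = 2 coded bits, and the K-1 = 6 bits of encoder memory give the 64 states that the decoder must track.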
22Kurt Keutzer
Viterbi Decoder Implementation on an ARM
EE 290S Final Project
May 4, 1999
Phillip Chong
23Kurt Keutzer
ARM Overview
32-bit RISC microprocessor
Five stage pipeline
Features fast ALU operations (barrel shifter)
Scalar integer unit, no FPU
24Kurt Keutzer
Algorithm Tweaking
Performing the metric computation through table lookup (load = 1 delay slot) is faster than using ALU (multiplication = up to 3 delay slots)
Parity computation (Viterbi code) can also be done through table lookup
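The parity lookup can be sketched as follows. This is an illustration only, written in Python rather than the project's C, and it assumes the K = 7 polynomials from the project constraints; the table names `PARITY` and `CODED` are hypothetical.

```python
K = 7
POLYS = (0o171, 0o133)      # generator polynomials from the project constraints

# Parity of every 7-bit value, computed once up front.
PARITY = [bin(v).count("1") & 1 for v in range(1 << K)]


def coded_pair(reg):
    """Coded bits for shift-register contents `reg`, using two table loads."""
    return PARITY[reg & POLYS[0]], PARITY[reg & POLYS[1]]


# One step further: a table indexed directly by the register contents,
# so the inner loop needs a single load per branch.
CODED = [coded_pair(reg) for reg in range(1 << K)]
```

The tables are small (128 entries each), which matters given the concern about cache misses on the next slide.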
25Kurt Keutzer
Reducing Memory Footprint
Cache misses can be very costly due to pipeline stalls
We are willing to give up some algorithmic efficiency to eliminate cache misses
To minimize the memory footprint, we pack 32 bits of traceback into single word; we can easily unpack this data due to the barrel shifter (1 cycle operation)
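A minimal sketch of the packing idea, in Python for illustration. It assumes one decision bit per state per trellis step; the helper names are hypothetical. The shift-and-mask in `get_decision` corresponds to the single-cycle barrel-shifter operation mentioned above.

```python
WORD_BITS = 32


def pack_decisions(bits):
    """Pack 0/1 decision bits into 32-bit words; bit i lands in word i // 32."""
    words = [0] * ((len(bits) + WORD_BITS - 1) // WORD_BITS)
    for i, b in enumerate(bits):
        words[i // WORD_BITS] |= (b & 1) << (i % WORD_BITS)
    return words


def get_decision(words, i):
    """Unpack decision bit i with one shift and one mask."""
    return (words[i // WORD_BITS] >> (i % WORD_BITS)) & 1


bits = [1, 0, 1, 1] + [0] * 60 + [1]
words = pack_decisions(bits)
```

Under that assumption, the 64 states need two words per trellis step, so a chain-back depth of 96 would occupy 192 words.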
Simulated decoding of 4096 bits on a 125 MHz 3.3V model
Execution requires 11.72M ARM instruction cycles, giving 44 kb/s data rate
Power consumption was estimated at 52.47 mW
Scaling simulation results up to 275 MHz 2.0V ARM (fastest commercially available) gives 96 kb/s at 42.40 mW
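The figures above can be checked with a few lines of arithmetic. The slides do not state the scaling rule that was used; this sketch assumes throughput scales linearly with clock frequency and dynamic power scales as f·V², which happens to reproduce the quoted numbers.

```python
bits = 4096
cycles = 11.72e6
f_sim, v_sim, p_sim = 125e6, 3.3, 52.47     # Hz, V, mW (simulated model)

cycles_per_bit = cycles / bits               # about 2861 cycles per decoded bit
rate_sim = f_sim / cycles_per_bit            # bits per second at 125 MHz

f_new, v_new = 275e6, 2.0                    # Hz, V (scaled target)
rate_new = rate_sim * f_new / f_sim          # assumed: linear in clock frequency
p_new = p_sim * (f_new / f_sim) * (v_new / v_sim) ** 2   # assumed: P ~ f * V^2

print(round(rate_sim / 1e3), "kb/s at 125 MHz")
print(round(rate_new / 1e3), "kb/s at 275 MHz")
print(round(p_new, 2), "mW at 275 MHz, 2.0 V")
```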
27Kurt Keutzer
Summary
Clock speed: 275 MHz
Execution Performance: 96kb/s
Power Dissipation: 42.40 mW (5.68 mW/mm²)
Area: 7.47 mm² in 0.25 µm
Design Effort: 4 days
Portability very high: code is ANSI C; architecture-dependent tweaks may need reworking
28Kurt Keutzer
Conclusion/Thanks
One-bit quantization gives opportunities for performance improvements, at a huge cost in QOR
Viterbi algorithm would benefit greatly from having hardware parallelism (vector ops) available
Many thanks to Marlene Wan for providing power estimation
29Kurt Keutzer
Viterbi Decoder Implementation on a TI C54x
EE 290S Final Project
May 4, 1999
Paul Husted
30Kurt Keutzer
Introduction
Implemented Viterbi Decoder on a TI TMS320VC5402 DSP
Examine:
• Performance (bits/sec)
• Power (mW/bit)
• Cost ($/unit, area)
• Design effort (engineer-months)
31Kurt Keutzer
Viterbi Decoder Specifications
Implementation Specifications:
• Constraint Length (K, aka L) = 7
• Branch Metric Calculation is QPSK
• Soft Decision Wordlength (q) = 6
• Chain-backing Depth (D) = 96
• Gen. Polynomials: p0 = 171, p1 = 133 (octal)
• Data Rate 100 kb/s
• Goal: Bit Error Rate (BER) = 10^-4
32Kurt Keutzer
C54x Capabilities
Capabilities of all C54x DSP Cores:
• Three 16-bit Data, One 16-bit program bus
• 40 bit ACC with 40 bit barrel shifter
• Two independent accumulators
• A single cycle non-pipelined MAC
• Single-instruction repeat and block-repeat
• Six channel DMA controller
• Arithmetic instructions with parallel store and parallel load
33Kurt Keutzer
Helpful Instructions for the Viterbi Decoder
The C54x Has Specialized Instruction Set
• Dual Add/Subtract in 1 Cycle
• Compare, Select, and Store Unit (CSSU)
  • Compare Branch Metrics
  • Store Larger Value, Store Decision Bit
• Increment Address Registers in Circular Buffer 1 Cycle
Allows Butterfly (2 States) in 5 cycles
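The 5-cycle butterfly gives a quick lower bound on the processing load. This is a back-of-envelope sketch that counts only the ACS work and assumes one instruction per cycle; branch metric setup and traceback are excluded, so the real requirement is higher.

```python
K = 7
states = 1 << (K - 1)                 # 64 states
butterflies = states // 2             # each butterfly updates 2 states
cycles_per_butterfly = 5              # from the slide
data_rate = 100e3                     # 100 kb/s target

acs_cycles_per_bit = butterflies * cycles_per_butterfly
acs_mips = acs_cycles_per_bit * data_rate / 1e6   # assumed: 1 instruction per cycle

print(acs_cycles_per_bit, "ACS cycles per decoded bit")
print(acs_mips, "MIPS for ACS alone at 100 kb/s")
```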
34Kurt Keutzer
Butterfly Implementation
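[Figure: butterfly on the C54x. Inputs Old(2*j) and Old(2*j+1); outputs New(j) and New(j+2^(K-2)). One output is produced by DADST followed by CMPS, the other by DSADT followed by CMPS.]
T Register = Local Distance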
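The butterfly indexing can be modeled in a few lines. This sketch is not the TI assembly. It assumes the symmetric form in which the two branches of a butterfly carry metrics +T and -T, and it keeps the larger sum, as the CSSU description on the previous slide says. Which output receives which sign depends on the code bits of the butterfly, so the signs here are an assumption; the function name is hypothetical.

```python
K = 7
HALF = 1 << (K - 2)          # 32: offset between the two outputs of a butterfly


def butterfly(old, new, decisions, j, t):
    """Update New(j) and New(j + 2^(K-2)) from Old(2j) and Old(2j+1)."""
    a, b = old[2 * j], old[2 * j + 1]

    # DADST-like dual operation: add T to one metric, subtract from the other.
    p0, p1 = a + t, b - t
    new[j] = max(p0, p1)                 # CMPS-like: keep the larger value
    decisions[j] = int(p1 > p0)          # and record which path won

    # DSADT-like dual operation: the signs are swapped for the other output.
    q0, q1 = a - t, b + t
    new[j + HALF] = max(q0, q1)
    decisions[j + HALF] = int(q1 > q0)


old = list(range(64))
new = [0] * 64
decisions = [0] * 64
butterfly(old, new, decisions, 3, 5)
```

Because only one local distance T is needed per butterfly, it can stay in a single register while the dual add/subtract produces both candidate sums.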
35Kurt Keutzer
TI TMS320VC5402 DSP
Specific Chip Characteristics:
• Operates at 100 MIPS
• Core Voltage of 1.8V
• I/O Pins Operate at 3.3V
• 16K Word x 16 Bits of Dual-Access RAM
• 4K Word x 16 Bits of ROM
• Internal DMA
• Created in 0.18 Micron Technology
36Kurt Keutzer
Dataflow
Data I/O
• Input Values Assumed to be Placed at Specified Memory Location by Internal DMA
• Output Values Assumed to be Removed from Another Memory Location by Internal DMA
• Alternatively, Data Could be Placed in this Memory Location After Other On-Chip Receiver Processing
37Kurt Keutzer
Implementation Analysis
Viterbi Decoder Code Created in Assembly
Linked to Processor Specific Memory Map
Simulated on Cycle-Accurate Simulator
• Used Correct Memory Model for VC5402
38Kurt Keutzer
Implementation Results
                         Estimated              Actual
Code Size                500 Instructions       1032 (16 bit) Words
Data Size                1280 (16 bit) Words    1280 (16 bit) Words
MIPS (100 Kbps)          18.425                 21.53125
Max. Speed (100 MIPS)    582 Kbps               464.7 Kbps
39Kurt Keutzer
Power Calculation
Compared with TI Figures:
• TI uses 1/2 MACs, 1/2 NOPs For Power Figure
• .25 Micron Estimate is .45 mA/MIPS
Fully Static Design can be Clocked at Any Rate
Viterbi Code Uses 1.08 Times More Current than TI Estimate
At 22 MIPS, 19.25 mW are Consumed in the Core
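The core power figure can be reproduced with a short calculation. The slide does not say which supply voltage was used; this sketch assumes the 1.8 V core voltage from the chip characteristics slide, which yields the quoted 19.25 mW after rounding.

```python
ma_per_mips = 0.45        # TI estimate from the slide
current_factor = 1.08     # Viterbi code vs. TI's half-MAC, half-NOP mix
mips = 22                 # operating point from the slide
core_voltage = 1.8        # assumed: VC5402 core voltage

current_ma = ma_per_mips * current_factor * mips
power_mw = current_ma * core_voltage

print(round(current_ma, 3), "mA")
print(round(power_mw, 2), "mW")
```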
40Kurt Keutzer
Area Estimate
• TI Will Not Release Die Sizes
• .25 Micron Chips Fit Inside 3.2 mm x 3.2 mm Area on a 144 pin BGA
• Maximum Die Size is thus 10.24 mm²
41Kurt Keutzer
Development Cost
Engineering Time Estimate - 3 days
• Assumes Engineer Has Experience with Assembly Language and TI Tools
Tool Cost - $13262.45
• Includes Emulator, Simulator, Compiler, Assembler, Linker, Debugger
Cost of Chip - $8.52
42Kurt Keutzer
Conclusion
Optimized Instructions Make Algorithm Efficient
Static Design Allows Clock Rate to be Set As Needed to Reduce Power
Flexibility Exists to Perform Other Processing of Data
Very Little Development Time/Cost
43Kurt Keutzer
ACS TIE Extension with State (ACS)
[Figure: ACS TIE extension with state. A 32-bit register packs four 8-bit branch metrics, bm3 to bm0. Rs and Rt hold packed path metrics (pm). Adders, subtractors, and MSB tests produce the new path metrics and decision bits, which are written to Rr under the control instruction.]
44Kurt Keutzer
Tensilica Viterbi Implementation
Niraj Shah
Scott Weber
290A Final Presentation
45Kurt Keutzer
Tensilica Flow
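[Figure: Tensilica flow. The Tensilica Processor Generator takes the uArch Designer configuration and the TIE description and generates the tools; .c sources are compiled with xt-gcc into .o files and run with xt-run.]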
46Kurt Keutzer
Xtensa Architecture
[Figure: Xtensa core connected to the TIE unit through Rs, Rt, Rr, and I.]
TIE Extensions:
• single cycle
• state free
• no new exceptions
• no stalls
• typeless data
Rs, Rt, Rr are 32-bit registers
I is the instruction controlling the TIE unit
Xtensa Core is a 32-bit configurable RISC processor
47Kurt Keutzer
Viterbi Architecture
[Figure: Viterbi architecture with Init, ACS, TraceBack, RAM, ADC, and I/O device blocks. Annotation: "Measured performance here".]
48Kurt Keutzer
TIE SetupBMreg (ACS)
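[Figure: SetupBMreg TIE instruction. The low 8 bits of Rs (I) and Rt (Q) pass through add and subtract units, together with the constant 0x7F, to form the four 8-bit branch metrics bm3 to bm0, which are packed into the 32-bit result Rr under the control instruction.]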
49Kurt Keutzer
ACS TIE Extension (ACS)
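[Figure: ACS TIE extension. A 32-bit register packs four 8-bit branch metrics, bm3 to bm0. Path metrics (pm) from Rs and Rt are added to the branch metrics and compared through an MSB test; the selected path metric and the decision bits are written to Rr. Instruction variants: ACS03 || ACS12 || ACS30 || ACS21.]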
50Kurt Keutzer
ACS TIE Extension with State (ACS)
[Figure: ACS TIE extension with state. A 32-bit register packs four 8-bit branch metrics, bm3 to bm0. Rs and Rt hold packed path metrics (pm). Adders, subtractors, and MSB tests produce the new path metrics and decision bits, which are written to Rr under the control instruction.]
51Kurt Keutzer
TIE Zmask (TraceBack)
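[Figure: Zmask TIE instruction for traceback. Rs and Rt are combined using AND, OR, and shift-left-by-1 operations with the masks 0x7F and 0x3F; the result is written to Rr under the control instruction.]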
52Kurt Keutzer
Designs
All designs had a BER of 0.000095 after 10 million iterations
Design Compiler, Power Compiler (Static timing, power analysis with back-annotated interconnect parasitics)
Synthesis & Module Generation
Pre-Layout Verification & Analysis
Post-Layout Verification & Analysis
Floor Planning Place & Route
64Kurt Keutzer
Synthesis and SRAM Generation
Synthesis with Synopsys Design Compiler
• Constraint: 66 kHz clock (effectively infinite)
• Bottom-up synthesis of 62 VHDL entities
Low-Power SRAM generator (from Pleiades)
• Very large sense-amps, control logic
• Optimized for power, speed at low supply-voltages
• Word-length limited to a power of 2
65Kurt Keutzer
Simulation Models
Behavioral C
• Parameterized, bit-true, and fast
• Used for system level design and BER simulations
Behavioral VHDL
• Parameterized, bit-true, and cycle-true
• Used for structural simulations and test bench reference
RTL VHDL
• Synthesizable, crafted for specific parameters and implementation structure
• Used for synthesis quality
66Kurt Keutzer
BER Simulation Results
67Kurt Keutzer
SRAM
Simulation Tools: TimeMill & PowerMill
Parameters
• 66 MHz clock
• Voltage 2.5V
• Random Generated Test Vectors