Embedded DSP Processor Design
Application Specific Instruction Set Processors
Dake Liu
Amsterdam • Boston • Heidelberg • London • New York • Oxford • Paris • San Diego • San Francisco • Singapore • Sydney • Tokyo
Morgan Kaufmann Publishers is an imprint of Elsevier
Contents
Preface
List of Trademarks and Product Names
CHAPTER 1 Introduction
  1.1 How to Read the Book
  1.2 DSP Theory for Hardware Designers
    1.2.1 Review of DSP Theory and Fundamentals
    1.2.2 ADC and Finite-Length Modeling
    1.2.3 Digital Filters
    1.2.4 Transform
    1.2.5 Adaptive Filter and Signal Enhancement
    1.2.6 Random Process and Autocorrelation
  1.3 Theory, Applications, and Implementations
  1.4 DSP Applications
    1.4.1 Real-Time Concept
    1.4.2 Communication Systems
    1.4.3 Multimedia Signal Processing Systems
    1.4.4 Review on Applications
  1.5 DSP Implementations
    1.5.1 DSP Implementation on GPP
    1.5.2 DSP Implementation on GP DSP Processors
    1.5.3 DSP Implementation on ASIP
    1.5.4 DSP Implementation on ASIC
    1.5.5 Trade-off and Decision of Implementations
  1.6 Review of Processors and Systems
    1.6.1 DSP Processor Architecture
    1.6.2 DSP Firmware
    1.6.3 Embedded System Overview
    1.6.4 DSP in an Embedded System
    1.6.5 Fundamentals of Embedded Computing
    2.1.1 An Intuitive Example
    2.1.2 Fixed-Point Numerical Representation
    2.1.3 Fixed-Point Binary Representation
    2.1.4 Integer Binary Representation
    2.1.5 Fractional Binary Representation
    2.1.6 Fixed-Point Operands
    2.1.7 Integer or Fractional
    2.1.8 Other Binary Data Formats
  2.2 Data Quality Measure
    2.2.1 Noise, Distortion, Dynamic Range, and Precision
    2.2.2 Quantitative Concept of Dynamic Range and Precision
  2.3 Floating-Point Numerical Representation
  2.4 Block Floating-Point
  2.5 DSP Based on Finite Precision
    2.5.1 The Way of Quantization - Rounding and Truncation
    2.5.2 Overflow Saturation and Guards
    2.5.3 Requirements on Guards
    2.5.4 Execution Order
  2.6 Examples of Corner Cases
  2.7 Conclusions
    3.2.1 Inside a DSP Subsystem
    3.2.2 DSP (Memory Bus) Architecture
    3.2.3 Functional Description at Top Architecture Level
    3.2.4 DSP Architecture Design
  3.3 Inside a DSP Core
    3.3.1 The Datapath and Register Bus
    3.3.2 MAC
    3.3.3 ALU
    3.3.4 Register File
    3.3.5 Control Path
    3.3.6 Address Generator (AGU)
  3.4 The Difference between GPP and ASIP DSP
    3.4.1 The Difference between Designing a GPP and ASIP DSP
    3.4.2 Comparing DSP Processors to Other Processors
    3.4.3 CISC or RISC
  3.5 Advanced DSP Architecture
    3.5.1 DSP with Extreme Specification
    3.5.2 ILP DSP Processors
    3.5.3 Dual MAC and SIMD
    3.5.4 VLIW and Superscalar
    3.5.5 On-Chip Multicore DSP
  3.6 Conclusions
  Exercises
  References
CHAPTER 4 DSP ASIP Design Flow
  4.1 Design and Use of ASIP
    4.1.1 What Is ASIP?
    4.1.2 DSP ASIP Design Flow
  4.2 Understanding Applications Through Profiling
  4.3 Architecture Selection
    4.7.1 Real-time Firmware
    4.7.2 Firmware with Finite Precision
    4.7.3 Firmware Design Flow for One Application
    4.7.4 Firmware Design Flow for Multiapplications
  4.8 Conclusions
  Exercises
  References
CHAPTER 5 A Simple DSP Core - The Junior Processor
  5.1 Junior - A Simple DSP Processor
  5.2 Instruction Set and Operations
    5.2.1 Load/Store Instructions
    5.2.2 Addressing for Data Memory Access
    5.2.3 Instructions for Basic Arithmetic Operations
    5.2.4 Logic and Shift Operations
    5.2.5 Program Flow Control Instructions
    5.4.1 Benchmarking of Block Transfer
    5.4.2 Benchmarking of Single-Sample FIR
    5.4.3 Benchmarking of Frame FIR
    5.4.4 Benchmarking of Single-Sample Biquad IIR
    5.4.5 Benchmarking of 16-bit Division
    5.4.6 Benchmarking of Vector Maximum Tracking
    5.4.7 Benchmarking of 8 × 8 DCT
    5.4.8 Benchmarking of 256-point FFT
    5.4.9 Benchmarking of Windowing
  5.5 Discussion of Junior DSP
  5.6 Conclusions
  6.3 Dynamic Profiling
    6.3.1 Instrumentation for Coarse-grained Profiling
    6.3.2 Instrumentation for Fine-grained Profiling
    6.3.3 Implement Instrumentation
  6.4 Use of Reference Assembly Codes
    6.4.1 Expose Hidden Costs
    6.4.2 Understanding Assembly Codes
  6.5 Quality Evaluation of Results
    6.5.1 Evaluating Results of Source Code Profiling
    6.5.2 Using Profiling Results
  6.6 Conclusions
  Exercises
  References
CHAPTER 7 Assembly Instruction Set Design
  7.1 Methodology
    7.1.1 Opportunities and Constraints
    7.1.2 Classification of General Instructions
    7.1.3 Design of General RISC Subset Instructions
    7.1.4 Specify CISC Instructions
    7.1.5 For Undergraduates: From Junior to Senior
  7.2 Designing RISC Subset Instructions
    7.2.1 Data Access Instructions
    7.2.2 Basic Arithmetic Instructions
    7.2.3 Unsigned ALU Instructions
    7.2.4 Program Flow Control Instructions
  7.3 CISC Subset Instructions
    7.3.1 MAC and Multiplication Instructions
    7.3.2 Double-Precision Arithmetic Instructions
    7.3.3 Other CISC Instructions
    10.5.2 Allocation and Partitioning of Microoperations
    10.5.3 Pipeline Scheduling Microoperations
    10.5.4 HW Multiplexing of Microoperations
    10.5.5 Microoperations Integration
  10.6 Conclusions
  Exercises
  References
CHAPTER 11 Design of Register File and Register Bus
  11.1 Datapath
  11.2 Design of Register Files
    11.2.1 General Register File
    11.2.2 Design of a Simple Register File
    11.2.3 Pipeline around Register File
    11.2.4 Special Registers in a General Register File
  11.3 Design of Advanced Register Files
    11.3.1 Register File for Cluster Datapath
    11.3.2 Ultra Large Register File
  11.4 Conclusions
  Exercises
  References
CHAPTER 12 ALU HW Implementation
  12.1 Arithmetic and Logic Unit (ALU)
  12.2 Design of Arithmetic Unit (AU)
    12.2.1 Implementation Methodology
    12.2.2 Select Kernel Components
    12.2.3 Implementing Simple AU Instructions
    12.2.4 Implementing Special AU Instructions
  12.3 Shift and Rotation
    12.3.1 Design a Shifter Using a Shifter Primitive
    12.3.2 Design a Shifter Using Truth Tables
    12.3.3 Logic Operation and Data Manipulation
  12.4 ALU Integration
    12.4.1 Preprocessing and Postprocessing
    12.4.2 ALU Integration
  12.5 Conclusions
  Exercises
  References
CHAPTER 13 MAC Hardware Implementation
  13.1 Introduction
    13.1.1 Review of Convolution
    13.1.2 MAC Fundamentals
  13.2 MAC Implementation
    13.2.1 MAC Instructions
    13.2.2 Implementing Multiplications
    13.2.3 Implementing MAC Instructions
    13.2.4 Implementing Double-Precision Instructions
    13.2.5 Accessing ACR Context
    13.2.6 Flag Operations and Other Postoperations
  13.3 A MAC Design Case
  13.4 MAC Integrations
    13.4.1 Physical Critical-Path
    13.4.2 Pipeline in a MAC
  17.2 Accelerator Specification
    17.2.1 Principle
    17.2.2 An Accelerator with One Single Instruction
    17.2.3 An Accelerator with Multiple Instructions
    17.2.4 An Accelerator as a Slave Processor
  20.2 Parallel Architecture, Divide and Conquer
    20.2.1 Review of Parallel Architectures
    20.2.2 Divide and Conquer
  20.3 Expose Control Complexities
    20.3.1 General Control Handling
    20.3.2 Exposing Challenges
    20.3.3 SIMT Architecture for Low-level Parallel Applications
    20.3.4 Design of Multicore DSP Subsystems
  20.4 Streaming Data Manipulations
    20.4.1 Data Complexity of Streaming DSP
    20.4.2 Data Complexity: Case 1 - Video
    20.4.3 Data Complexity: Case 2 - Radio Baseband
  20.5 NoC for Parallel Memory Access
    20.5.1 Design Methods
    20.5.2 Analyses of Parallel Memory Access for NoC Design
  20.6 Parallel Memory Architecture