E&CE 327: Digital Systems Engineering
Lecture Slides
Mark Aagaard
2011t1–Winter
University of Waterloo
Dept of Electrical and Computer Engineering
Contents

I Lecture Notes

1 VHDL
  1.1 Introduction to VHDL
    1.1.1 Levels of Abstraction
    1.1.2 VHDL Origins and History
    1.1.3 Semantics
    1.1.4 Synthesis of a Simulation-Based Language
    1.1.5 Solution to Synthesis Sanity
    1.1.6 Standard Logic 1164
  1.2 Comparison of VHDL to Other Hardware Description Languages
  1.3 Overview of Syntax
    1.3.1 Syntactic Categories
    1.3.2 Library Units
    1.3.3 Entities and Architecture
    1.3.4 Concurrent Statements
    1.3.5 Component Declaration and Instantiations
    1.3.6 Processes
    1.3.7 Sequential Statements
    1.3.8 A Few More Miscellaneous VHDL Features
  1.4 Concurrent vs Sequential Statements
    1.4.1 Concurrent Assignment vs Process
    1.4.2 Conditional Assignment vs If Statements
    1.4.3 Selected Assignment vs Case Statement
    1.4.4 Coding Style
  1.5 Overview of Processes
    1.5.1 Combinational Process vs Clocked Process
    1.5.2 Latch Inference
  1.6 Details of Process Execution
    1.6.1 Simple Simulation
    1.6.2 Temporal Granularities of Simulation
    1.6.3 Intuition Behind Delta-Cycle Simulation
    1.6.4 Definitions and Algorithm
      1.6.4.1 Process Modes
      1.6.4.2 Simulation Algorithm
      1.6.4.3 Delta-Cycle Definitions
    1.6.5 Example 1: Process Execution (Bamboozle)
    1.6.6 Example 2: Process Execution (Flummox)
    1.6.7 Ex: Need for Provisional Asn
    1.6.8 Delta-Cycle Simulations of Flip-Flops
  1.7 Register-Transfer-Level Simulation
    1.7.1 Overview
    1.7.2 Technique for Register-Transfer Level Simulation
    1.7.3 Examples of RTL Simulation
      1.7.3.1 RTL Simulation Example 1
  1.8 VHDL and Hardware Building Blocks
    1.8.1 Basic Building Blocks
    1.8.2 Deprecated Building Blocks for RTL
    1.8.3 Hardware and Code for Flops
      1.8.3.1 Flops with Waits and Ifs
      1.8.3.2 Flops with Synchronous Reset
      1.8.3.3 Flop with Chip-Enable and Mux on Input
      1.8.3.4 Flops with Chip-Enable, Muxes, and Reset
    1.8.4 An Example Sequential Circuit
  1.9 Arrays and Vectors
  1.10 Arithmetic
    1.10.1 Arithmetic Packages
    1.10.2 Shift and Rotate Operations
    1.10.3 Overloading of Arithmetic
    1.10.4 Different Widths and Arithmetic
    1.10.5 Overloading of Comparisons
    1.10.6 Different Widths and Comparisons
    1.10.7 Type Conversion
  1.11 Synthesizable vs Non-Synthesizable Code
    1.11.1 Unsynthesizable Code
      1.11.1.1 Initial Values
      1.11.1.2 Wait For
      1.11.1.3 Different Wait Conditions
      1.11.1.4 Multiple “if rising edge” in Process
      1.11.1.5 “if rising edge” and “wait” in Same Process
      1.11.1.6 “if rising edge” with “else” Clause
      1.11.1.7 “if rising edge” Inside a “for” Loop
      1.11.1.8 “wait” Inside of a “for loop”
  1.12 Synthesizable VHDL Coding Guidelines
2 RTL Design with VHDL
  2.1 Prelude to Chapter
  2.2 FPGA Background and Coding Guidelines
    2.2.1 Generic FPGA Hardware
      2.2.1.1 Generic FPGA Cell
    2.2.2 Area Estimation
      2.2.2.1 Interconnect for Generic FPGA
      2.2.2.2 Clocks for Generic FPGAs
      2.2.2.3 Special Circuitry in FPGAs
    2.2.3 Generic-FPGA Coding Guidelines
  2.3 Design Flow
  2.4 Algorithms and High-Level Models
  2.5 Finite State Machines in VHDL
    2.5.1 Introduction to State-Machine Design
      2.5.1.1 Mealy vs Moore State Machines
      2.5.1.2 Introduction to State Machines and VHDL
      2.5.1.3 Explicit vs Implicit State Machines
    2.5.2 Implementing a Simple Moore Machine
      2.5.2.1 Implicit Moore State Machine
      2.5.2.2 Explicit Moore with Flopped Output
      2.5.2.3 Explicit Moore with Combinational Outputs
      2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment
      2.5.2.5 E-C+N Moore with Comb Proc
    2.5.3 Implementing a Simple Mealy Machine
    2.5.4 Reset
    2.5.5 State Encoding
  2.6 Dataflow Diagrams
    2.6.1 Dataflow Diagrams Overview
    2.6.2 Dataflow Diagrams, Hardware, and Behaviour
    2.6.3 Dataflow Diagram Execution
    2.6.4 Performance Estimation
    2.6.5 Area Estimation
    2.6.6 Design Analysis
    2.6.7 Area / Performance Tradeoffs
  2.7 Design Example: Massey
  2.8 Design Example: Vanier
    2.8.1 Requirements
    2.8.2 Algorithm
    2.8.3 Initial Dataflow Diagram
    2.8.4 Reschedule to Meet Requirements
    2.8.5 Optimize Resources
    2.8.6 Assign Names to Registered Values
    2.8.7 Input/Output Allocation
    2.8.8 Tangent: Combinational Outputs
    2.8.9 Register Allocation
    2.8.10 Datapath Allocation
    2.8.11 Hardware Block Diagram and State Machine
      2.8.11.1 Control for Registers
      2.8.11.2 Control for Datapath Components
      2.8.11.3 Control for State
      2.8.11.4 Complete State Machine Table
    2.8.12 VHDL Code with Explicit State Machine
    2.8.13 Peephole Optimizations
    2.8.14 Notes and Observations
  2.9 Pipelining
    2.9.1 Introduction to Pipelining
    2.9.2 Partially Pipelined
    2.9.3 Terminology
  2.10 Design Example: Pipelined Massey
  2.11 Memory Arrays and RTL Design
    2.11.1 Memory Operations
    2.11.2 Memory Arrays in VHDL
    2.11.3 Data Dependencies
    2.11.4 Memory and Dataflow Diagrams
    2.11.5 Ex: Mem Array and Dataflow Diagram
  2.12 Input / Output Protocols
  2.13 Example: Moving Average
    2.13.1 Requirements and Environmental Assumptions
    2.13.2 Algorithm
    2.13.3 Pseudocode and Dataflow Diagrams
    2.13.4 Control Tables and State Machine
    2.13.5 VHDL Code
3 Performance Analysis and Optimization
  3.1 Introduction
  3.2 Defining Performance
  3.3 Comparing Performance
    3.3.1 General Equations
    3.3.2 Example: Performance of Printers
  3.4 Clock Speed, CPI, Program Length, and Performance
    3.4.1 Mathematics
    3.4.2 Example: CISC vs RISC and CPI
    3.4.3 Effect of Instruction Set on Performance
    3.4.4 Effect of Time to Market on Relative Performance
    3.4.5 Summary of Equations
  3.5 Performance Analysis and Dataflow Diagrams
    3.5.1 Dataflow Diagrams, CPI, and Clock Speed
    3.5.2 Examples of Dataflow Diagrams for Two Instructions
      3.5.2.1 Scheduling of Operations for Different Clock Periods
      3.5.2.2 Performance Computation for Different Clock Periods
      3.5.2.3 Example: Two Instructions Taking Similar Time
      3.5.2.4 Example: Same Total Time, Different Order for A
    3.5.3 Example: From Algorithm to Optimized Dataflow
  3.6 General Optimizations
    3.6.1 Strength Reduction
      3.6.1.1 Arithmetic Strength Reduction
      3.6.1.2 Boolean Strength Reduction
    3.6.2 Replication and Sharing
      3.6.2.1 Mux-Pushing
      3.6.2.2 Common Subexpression Elimination
      3.6.2.3 Computation Replication
    3.6.3 Arithmetic
  3.7 Retiming
4 Functional Verification
  4.1 Overview
    4.1.1 Terminology: Validation / Verification / Testing
    4.1.2 The Difficulty of Designing Correct Chips
      4.1.2.1 Notes from Kenn Heinrich (UW E&CE grad)
      4.1.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)
  4.2 Test Cases and Coverage
    4.2.1 Coverage
    4.2.2 Floating Point Divider Example
  4.3 Testbenches
    4.3.1 Overview of Test Benches
    4.3.2 Reference Model Style Testbench
    4.3.3 Relational Style Testbench
    4.3.4 Coding Structure of a Testbench
    4.3.5 Datapath vs Control
    4.3.6 Verification Tips
  4.4 Functional Verification for Datapath Circuits
    4.4.1 A Spec-Less Testbench
    4.4.2 Use an Array for Test Vectors
    4.4.3 Build Spec into Stimulus
    4.4.4 Have Separate Specification Entity
    4.4.5 Generate Test Vectors Automatically
    4.4.6 Relational Specification
  4.5 Functional Verification of Control Circuits
    4.5.1 Overview of Queues in Hardware
    4.5.2 VHDL Coding
      4.5.2.1 Package
      4.5.2.2 Other VHDL Coding
    4.5.3 Code Structure for Verification
    4.5.4 Instrumentation Code
    4.5.5 Assertions
    4.5.6 VHDL Coding Tips
    4.5.7 Queue Specification
    4.5.8 Queue Testbench
  4.6 Example: Microwave Oven
5 Timing Analysis
  5.1 Delays and Definitions
    5.1.1 Background Definitions
    5.1.2 Clock-Related Timing Definitions
      5.1.2.1 Clock Skew
      5.1.2.2 Clock Latency
      5.1.2.3 Clock Jitter
    5.1.3 Storage-Related Timing Definitions
      5.1.3.1 Flops and Latches
    5.1.4 Propagation Delays
    5.1.5 Timing Constraints
      5.1.5.1 Minimum Clock Period
      5.1.5.2 Hold Constraint
      5.1.5.3 Example Timing Violations
  5.2 Timing Analysis of Latches and Flip Flops
    5.2.1 Simple Multiplexer Latch
      5.2.1.1 Structure and Behaviour of Multiplexer Latch
      5.2.1.2 Strategy for Timing Analysis of Storage Devices
      5.2.1.3 Clock-to-Q Time of a Multiplexer Latch
      5.2.1.4 Setup Timing of a Multiplexer Latch
      5.2.1.5 Hold Time of a Multiplexer Latch
      5.2.1.6 Example of a Bad Latch
  5.3 Critical Paths and False Paths
    5.3.1 Introduction to Critical and False Paths
      5.3.1.1 Example of Critical Path in Full Adder
      5.3.1.2 Preliminaries for Critical Paths
      5.3.1.3 Longest Path and Critical Path
    5.3.2 Longest Path
    5.3.3 Detecting a False Path
      5.3.3.1 Preliminaries
      5.3.3.2 Almost-Correct Algorithm to Detect a False Path
      5.3.3.3 Examples of Detecting False Paths
    5.3.4 Finding the Next Candidate Path
      5.3.4.1 Algorithm to Find Next Candidate Path
      5.3.4.2 Examples of Finding Next Candidate Path
    5.3.5 Correct Algorithm to Find Critical Path
      5.3.5.1 Rules for Late Side Inputs
      5.3.5.2 Monotone Speedup
      5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation
      5.3.5.4 Complete Algorithm
      5.3.5.5 Complete Examples
    5.3.6 Further Extensions to Critical Path Analysis
    5.3.7 Increasing the Accuracy of Critical Path Analysis
  5.4 Elmore Timing Model
    5.4.1 RC-Networks for Timing Analysis
    5.4.2 Derivation of Analog Timing Model
      5.4.2.1 Example Derivation: Equation for Voltage at Node 3
      5.4.2.2 General Derivation
    5.4.3 Elmore Timing Model
    5.4.4 Examples of Using Elmore Delay
      5.4.4.1 Interconnect with Single Fanout
      5.4.4.2 Interconnect with Multiple Gates in Fanout
  5.5 Practical Usage of Timing Analysis
    5.5.1 Speed Binning
      5.5.1.1 FPGAs, Interconnect, and Synthesis
    5.5.2 Worst Case Timing
      5.5.2.1 Fanout delay
      5.5.2.2 Derating Factors
6 Power Analysis and Power-Aware Design
  6.1 Overview
    6.1.1 Importance of Power and Energy
    6.1.2 Industrial Names and Products
    6.1.3 Power vs Energy
    6.1.4 Batteries, Power and Energy
      6.1.4.1 Do Batteries Store Energy or Power?
      6.1.4.2 Battery Life and Efficiency
      6.1.4.3 Battery Life and Power
  6.2 Power Equations
    6.2.1 Switching Power
    6.2.2 Short-Circuited Power
    6.2.3 Leakage Power
    6.2.4 Glossary
    6.2.5 Note on Power Equations
  6.3 Overview of Power Reduction Techniques
  6.4 Voltage Reduction for Power Reduction
  6.5 Data Encoding for Power Reduction
    6.5.1 How Data Encoding Can Reduce Power
    6.5.2 Example Problem: Sixteen Pulser
      6.5.2.1 Problem Statement
      6.5.2.2 Additional Information
      6.5.2.3 Answer
  6.6 Clock Gating
    6.6.1 Introduction to Clock Gating
    6.6.2 Implementing Clock Gating
    6.6.3 Design Process
    6.6.4 Effectiveness of Clock Gating
    6.6.5 Example: Reduced Activity Factor with Clock Gating
    6.6.6 Clock Gating with Valid-Bit Protocol
      6.6.6.1 Valid-Bit Protocol
      6.6.6.2 How Many Clock Cycles for Module?
      6.6.6.3 Adding Clock-Gating Circuitry
    6.6.7 Example: Pipelined Circuit with Clock-Gating
7 Fault Testing and Testability
  7.1 Faults and Testing
    7.1.1 Overview of Faults and Testing
      7.1.1.1 Faults
      7.1.1.2 Causes of Faults
      7.1.1.3 Testing
      7.1.1.4 Burn In
      7.1.1.5 Bin Sorting
      7.1.1.6 Testing Techniques
      7.1.1.7 Design for Testability (DFT)
    7.1.2 Example Problem: Economics of Testing
    7.1.3 Physical Faults
      7.1.3.1 Types of Physical Faults
      7.1.3.2 Locations of Faults
      7.1.3.3 Layout Affects Locations
      7.1.3.4 Naming Fault Locations
    7.1.4 Detecting a Fault
      7.1.4.1 Which Test Vectors will Detect a Fault?
    7.1.5 Mathematical Models of Faults
      7.1.5.1 Single Stuck-At Fault Model
    7.1.6 Generate Test Vector to Find a Mathematical Fault
      7.1.6.1 Algorithm
      7.1.6.2 Example of Finding a Test Vector
    7.1.7 Undetectable Faults
      7.1.7.1 Redundant Circuitry
      7.1.7.2 Curious Circuitry and Fault Detection
  7.2 Test Generation
    7.2.1 A Small Example
    7.2.2 Choosing Test Vectors
      7.2.2.1 Fault Domination
      7.2.2.2 Fault Equivalence
      7.2.2.3 Gate Collapsing
      7.2.2.4 Node Collapsing
      7.2.2.5 Fault Collapsing Summary
    7.2.3 Fault Coverage
    7.2.4 Test Vector Generation and Fault Detection
    7.2.5 Generate Test Vectors for 100% Coverage
      7.2.5.1 Collapse the Faults
      7.2.5.2 Check for Fault Domination
      7.2.5.3 Required Test Vectors
      7.2.5.4 Faults Not Covered by Required Test Vectors
      7.2.5.5 Order to Run Test Vectors
      7.2.5.6 Summary of Technique to Find and Order Test Vectors
    7.2.6 One Fault Hiding Another
  7.3 Scan Testing in General
    7.3.1 Structure and Behaviour of Scan Testing
    7.3.2 Scan Chains
      7.3.2.1 Circuitry in Normal and Scan Mode
      7.3.2.2 Scan in Operation
      7.3.2.3 Scan in Operation with Example Circuit
    7.3.3 Summary of Scan Testing
    7.3.4 Time to Test a Chip
      7.3.4.1 Example: Time to Test a Chip
  7.4 Boundary Scan and JTAG
    7.4.1 Scan Instructions
  7.5 Built In Self Test
    7.5.1 Block Diagram
      7.5.1.1 Components
      7.5.1.2 Linear Feedback Shift Register (LFSR)
      7.5.1.3 Maximal-Length LFSR
    7.5.2 Test Generator
    7.5.3 Signature Analyzer
    7.5.4 Result Checker
    7.5.5 Arithmetic over Binary Fields
    7.5.6 Shift Registers and Characteristic Polynomials
      7.5.6.1 Circuit Multiplication
    7.5.7 Bit Streams and Characteristic Polynomials
    7.5.8 Division
    7.5.9 Signature Analysis: Math and Circuits
  7.6 Scan vs Self Test
8 Review
  8.1 Overview of the Term
  8.2 VHDL
    8.2.1 VHDL Topics
    8.2.2 VHDL Example Problems
  8.3 RTL Design Techniques
    8.3.1 Design Topics
    8.3.2 Design Example Problems
  8.4 Functional Verification
    8.4.1 Verification Topics
    8.4.2 Verification Example Problems
  8.5 Performance Analysis and Optimization
    8.5.1 Performance Topics
    8.5.2 Performance Example Problems
  8.6 Timing Analysis
    8.6.1 Timing Topics
    8.6.2 Timing Example Problems
  8.7 Power
    8.7.1 Power Topics
    8.7.2 Power Example Problems
  8.8 Testing
    8.8.1 Testing Topics
    8.8.2 Testing Example Problems
  8.9 Formulas to be Given on Final Exam
Part I
Lecture Notes
Chapter 1
VHDL: The Language
1.1 Introduction to VHDL
1.1.1 Levels of Abstraction
Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network.
Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equations are used.
Gate: Transistors are grouped together into gates. Voltages are discrete values such as 0 and 1.
Register transfer level: Hardware is modeled as assignments to registers and combinational signals. The basic unit of time is one clock cycle.
Transaction level: A transaction is an operation such as transferring data across a bus. Building blocks are processors, controllers, etc. Described in languages such as VHDL, SystemC, or SystemVerilog.
Electronic-system level: Looks at an entire electronic system, with both hardware and software.
1.1.2 VHDL Origins and History
VHDL = VHSIC Hardware Description Language
VHSIC = Very High Speed Integrated Circuit
    The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems. Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware.
    Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)
VHDL is a lot more than synthesis of digital hardware.
1.1.3 Semantics
The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.
[Figure: simulation of the statement c <= a AND b; showing signals a, b, and c.]
Today, VHDL is used for both simulation and synthesis. Synthesis is concerned with the structure of the circuit.
Synthesis converts one type of description (behavioural) into another, lower-level description (usually a netlist).
[Figure: synthesis of the statement c <= a AND b; into a circuit with inputs a, b and output c.]
Synthesis
Synthesis is a computer-aided design (CAD) technique that transforms a designer’s concise, high-level description of a circuit into a structural description of a circuit.
[Figure: synthesis of the statement c <= a AND b; into a circuit with inputs a, b and output c.]
CAD Tools
CAD Tools allow designers to automate lower-level design processes in implementing the desired functionality of a system.
NOTE: EDA = Electronic Design Automation. In digital hardware design, EDA = CAD.
Synthesis vs Simulation
For synthesis, we want the code we write to define the structure of the hardware that is generated.
[Figure: synthesis of the statement c <= a AND b; into a circuit with inputs a, b and output c.]
The VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware.
[Figure: the statement c <= a AND b; can be synthesized into circuits with different structure; simulation of each gives the same behaviour on signals a, b, and c.]
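The point about different structure with the same behaviour can be sketched in VHDL itself. The following is a minimal illustration, not taken from the lecture: the names and2, direct, demorgan, and and2_tb are hypothetical. It gives one entity two architectures that are written differently and uses a small testbench to check that they agree for every combination of '0' and '1'. Whether a synthesis tool actually produces two different netlists is an assumption; a tool is free to optimize both down to the same gate.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity used only for this illustration.
entity and2 is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end and2;

-- Direct form: written as a single AND.
architecture direct of and2 is
begin
  c <= a AND b;
end direct;

-- De Morgan form: written as inverted inputs into an OR, then inverted.
architecture demorgan of and2 is
begin
  c <= NOT ((NOT a) OR (NOT b));
end demorgan;

library ieee;
use ieee.std_logic_1164.all;

entity and2_tb is
end and2_tb;

architecture sim of and2_tb is
  signal a, b, c1, c2 : std_logic;
begin
  -- Instantiate both architectures on the same inputs.
  u1 : entity work.and2(direct)   port map (a => a, b => b, c => c1);
  u2 : entity work.and2(demorgan) port map (a => a, b => b, c => c2);

  stimulus : process
    type sl_array is array (0 to 1) of std_logic;
    constant vals : sl_array := ('0', '1');
  begin
    -- Apply all four combinations of '0' and '1'.
    for i in vals'range loop
      for j in vals'range loop
        a <= vals(i);
        b <= vals(j);
        wait for 10 ns;
        assert c1 = c2
          report "architectures disagree" severity error;
      end loop;
    end loop;
    wait;  -- suspend forever once all vectors are applied
  end process;
end sim;
```

If the simulation finishes without an assertion message, the two descriptions have the same behaviour for these inputs, even though the code (and possibly the resulting hardware) is structured differently.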
1.1.4 Synthesis of a Simulation-Based Language
This section reserved for your reading pleasure
1.1.5 Solution to Synthesis Sanity• Pick a high-quality synthesis tool and study its documentation thoroughly
• Learn the idioms of the tool
• Different VHDL code with same behaviour can result in very different circuits
• Be careful if you have to port VHDL code from one tool to another
• KISS: Keep It Simple Stupid
– VHDL examples will illustrate reliable coding techniques for the synthesis tools from Synopsys, Mentor Graphics, Altera, Xilinx, and most other companies as well.
– Follow the coding guidelines and examples from lecture
– As you write VHDL, think about the hardware you expect to get.
Note: If you can’t predict the hardware, then the hardware probably won’t be very good (small, fast, correct, etc)
1.1.6 Standard Logic 1164
std_logic_1164: IEEE standard for signal values in VHDL.
'U' uninitialized
'X' strong unknown
'0' strong 0
'1' strong 1
'Z' high impedance
'W' weak unknown
'L' weak 0
'H' weak 1
'-' don’t care
The most common values are: 'U', 'X', '0', '1'.
If you see ’X’ in a simulation, it usually means that there is a mistake in your code.
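One common way to end up with 'X' is to drive a single std_logic signal from two places. The sketch below is illustrative only; the entity name x_demo and the signal name s are made up for this example and do not come from the lecture.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: x_demo and s are illustrative names.
entity x_demo is
end x_demo;

architecture main of x_demo is
  signal s : std_logic;
begin
  -- Two concurrent assignments create two drivers for s.
  s <= '0';
  s <= '1';
  -- The std_logic resolution function combines a strong '0' and a
  -- strong '1' into 'X', which is what the simulator displays for s.
end main;
```

Removing one of the two assignments leaves a single driver, and the 'X' goes away.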
1.2 Comparison of VHDL to Other Hardware Description Languages
This section reserved for your reading pleasure
1.3 Overview of Syntax
1.3.1 Syntactic Categories
This section reserved for your reading pleasure
1.3.2 Library Units
This section reserved for your reading pleasure
1.3.3 Entities and Architecture
Each hardware module is described with an Entity/Architecture pair
[Figure: Entity and Architecture]
Entity
library ieee;
use ieee.std_logic_1164.all;
entity and_or is
port (
a, b, c : in std_logic ;
z : out std_logic
);
end and_or;
Example of an entity
Architecture
architecture main of and_or is
signal x : std_logic;
begin
x <= a AND b;
z <= x OR (a AND c);
end main;
Example of architecture
1.3.4 Concurrent Statements
• Architectures contain concurrent statements
• Concurrent statements execute in parallel (Figure 1.4)
– Concurrent statements make VHDL fundamentally different from most software languages.
– Hardware (gates) naturally execute in parallel; VHDL mimics the behaviour of real hardware.
– At each infinitesimally small moment of time, each gate:
1. samples its inputs
2. computes the value of its output
3. drives the output
Concurrent Statements
architecture main of bowser is
begin
  x1 <= a AND b;
  x2 <= NOT x1;
  z <= NOT x2;
end main;
architecture main of bowser is
begin
  z <= NOT x2;
  x2 <= NOT x1;
  x1 <= a AND b;
end main;
[Figure: circuit with inputs a and b, internal signals x1 and x2, and output z]
The order of concurrent statements doesn’t matter
Types of Concurrent Statements
conditional assignment: similar to conventional if-then-else
  c <= a+b when sel='1' else a+c when sel='0' else "0000";
selected assignment: similar to conventional case/switch
  with color select d <= "00" when red , "01" when . . .;
component instantiation: use a hardware module/component
  add1 : adder port map( a => f , b => g, s => h, co => i );
for-generate: create multiple pieces of hardware
  bgen: for i in 1 to 7 generate b(i)<=a(7-i); end generate;
if-generate: conditionally create some hardware
  okgen : if optgoal /= fast generate
    result <= ((a and b) or (d and not e)) or g;
  end generate;
  fastgen : if optgoal = fast generate
    result <= '1';
  end generate;
process: description of complex behaviour (Section 1.3.6)
1.3.5 Component Declaration and Instantiations
This section reserved for your reading pleasure
1.3.6 Processes
• Processes are used to describe complex and potentially unsynthesizable behaviour
• A process is a concurrent statement (Section 1.3.4).
• The body of a process contains sequential statements (Section 1.3.7)
• Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)
Example Process with Sensitivity List
process (a, b, c)
begin
y <= a AND b;
if (a = ’1’) then
z1 <= b AND c;
z2 <= NOT c;
else
z1 <= b OR c;
z2 <= c;
end if;
end process;
Example Process with Wait Statements
process
begin
y <= a AND b;
z <= ’0’;
wait until rising_edge(clk);
if (a = ’1’) then
z <= ’1’;
y <= ’0’;
wait until rising_edge(clk);
else
y <= a OR b;
end if;
end process;
Sensitivity Lists and Wait Statements
• Processes must have either a sensitivity list or at least one wait statement on each execution path through the process.
• Processes cannot have both a sensitivity list and a wait statement.
Sensitivity List
The sensitivity list contains the signals that are read in the process.
A process is executed when a signal in its sensitivity list changes value.
An important coding guideline to ensure consistent synthesis and simulation results is to include in the sensitivity list all signals that are read in the process.
There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list. Other signals may be included, but are not needed.
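The sensitivity-list guideline can be illustrated with a small sketch. The entity name sens_demo and its port names are hypothetical, chosen only for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: sens_demo and its ports are illustrative names.
entity sens_demo is
  port (
    a, b        : in  std_logic;
    y_bad, y_ok : out std_logic
  );
end sens_demo;

architecture main of sens_demo is
begin
  -- Incomplete list: b is read but not listed, so in simulation y_bad is
  -- not recomputed when only b changes. Synthesis tools typically still
  -- build an AND gate, so simulation and hardware can disagree.
  bad : process (a)
  begin
    y_bad <= a AND b;
  end process;

  -- Complete list: every signal that is read appears in the list.
  ok : process (a, b)
  begin
    y_ok <= a AND b;
  end process;
end main;
```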
1.3.7 Sequential Statements
Used inside processes and functions.
wait               wait until . . . ;
signal assignment  . . . <= . . . ;
if-then-else       if . . . then . . . elsif . . . end if;
case               case . . . is
                     when . . . | . . . => . . . ;
                     when . . . => . . . ;
                   end case;
loop               loop . . . end loop;
while loop         while . . . loop . . . end loop;
for loop           for . . . in . . . loop . . . end loop;
next               next . . . ;
The most commonly used sequential statements
1.3.8 A Few More Miscellaneous VHDL Features
This section reserved for your reading pleasure
1.4 Concurrent vs Sequential Statements
All concurrent assignments can be translated into sequential statements. But, not all sequential statements can be translated into concurrent statements.
1.4.1 Concurrent Assignment vs Process
The two code fragments below have identical behaviour:
architecture main of tiny is
begin
b <= a;
end main;
architecture main of tiny is
begin
process (a) begin
b <= a;
end process;
end main;
1.4.2 Conditional Assignment vs If Statements
The two code fragments below have identical behaviour:
Concurrent Statements
t <= <val1> when <cond>
  else <val2>;
Sequential Statements
if <cond> then
  t <= <val1>;
else
  t <= <val2>;
end if;
1.4.3 Selected Assignment vs Case Statement
The two code fragments below have identical behaviour:
Concurrent Statements
with <expr> select
  t <= <val1> when <choices1>,
       <val2> when <choices2>,
       <val3> when <choices3>;
Sequential Statements
case <expr> is
  when <choices1> =>
    t <= <val1>;
  when <choices2> =>
    t <= <val2>;
  when <choices3> =>
    t <= <val3>;
end case;
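As a concrete instance of the two templates, here is a sketch of a 4-to-1 multiplexer written both ways. The entity name mux4 and all port names are assumptions made for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: mux4 and its ports are illustrative names.
entity mux4 is
  port (
    sel            : in  std_logic_vector(1 downto 0);
    d0, d1, d2, d3 : in  std_logic;
    t_conc, t_seq  : out std_logic
  );
end mux4;

architecture main of mux4 is
begin
  -- Concurrent form: selected assignment
  with sel select
    t_conc <= d0 when "00",
              d1 when "01",
              d2 when "10",
              d3 when others;

  -- Sequential form: case statement inside a combinational process
  process (sel, d0, d1, d2, d3)
  begin
    case sel is
      when "00"   => t_seq <= d0;
      when "01"   => t_seq <= d1;
      when "10"   => t_seq <= d2;
      when others => t_seq <= d3;
    end case;
  end process;
end main;
```

Both outputs should show the same waveform in simulation for the same inputs.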
1.4.4 Coding Style
Code that’s easy to write with sequential statements, but difficult with concurrent :
case <expr> is
  when <choice1> =>
    if <cond> then
      o <= <expr1>;
    else
      o <= <expr2>;
    end if;
  when <choice2> =>
    . . .
end case;
1.5 Overview of Processes
Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.
• Within a process, statements are executed almost sequentially
• Among processes, execution is done in parallel
• Remember: a process is a concurrent statement!
Process Semantics
• VHDL mimics hardware
• Hardware (gates) execute in parallel
• Processes execute in parallel with each other
• All possible orders of executing processes must produce the same simulation results (waveforms)
• If a signal is not assigned a value, then it holds its previous value
All orders of executing concurrent statements must produce the same waveforms
Process Semantics
architecture
  procA: process
    stmtA1;
    stmtA2;
    stmtA3;
  end process;
  procB: process
    stmtB1;
    stmtB2;
  end process;
[Figure: three execution sequences of statements A1, A2, A3, B1, B2. Single threaded: procA before procB. Single threaded: procB before procA. Multithreaded: procA and procB in parallel.]
Process Semantics
All execution orders must have same behaviour
1.5.1 Combinational Process vs Clocked Process
Each well-written synthesizable process is either combinational or clocked.
Combinational process:
• Executing the process takes part of one clock cycle
• Target signals are outputs of combinational circuitry
• A combinational process must have a sensitivity list
• A combinational process must not have any wait statements
• A combinational process must not have any rising_edge or falling_edge calls
• The hardware for a combinational process is just combinational circuitry
Clocked process:
• Executing the process takes one (or more) clock cycles
• Target signals are outputs of flops
• Process contains one or more wait or if rising_edge statements
• Hardware contains combinational circuitry and flip flops
Note: Clocked processes are sometimes called “sequential processes”, but this can be easily confused with “sequential statements”, so in E&CE 327 we’ll refer to synthesizable processes as either “combinational” or “clocked”.
Combinational or Clocked Process? (1)
process (a,b,c)
begin
  p1 <= a;
  if (b = c) then
    p2 <= b;
  else
    p2 <= a;
  end if;
end process;
Combinational or Clocked Process? (2)
process
begin
wait until rising_edge(clk);
b <= a;
end process;
Combinational or Clocked Process? (3)
process (clk)
begin
if rising_edge(clk) then
b <= a;
end if;
end process;
Combinational or Clocked Process? (4)
process (clk)
begin
a <= clk;
end process;
Combinational or Clocked Process? (5)
process
begin
wait until rising_edge(a);
c <= b;
end process;
1.5.2 Latch Inference
The semantics of VHDL require that if a signal is assigned a value on some passes through a process and not on other passes, then on a pass through the process when the signal is not assigned a value, it must maintain its value from the previous pass.
process (a, b, c)
begin
if (a = ’1’) then
z1 <= b;
z2 <= b;
else
z1 <= c;
end if;
end process;
[Figure: circuit with inputs a, b, c and outputs z1, z2]
Example of latch inference
Latch Inference
When a signal’s value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value.
If you want a latch or a flip-flop for the signal, then latch inference is good.
If you want combinational circuitry, then latch inference is bad.
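When combinational circuitry is wanted, one common way to avoid an unintended latch is to give every target signal a default value at the top of the process. The sketch below reworks the latch-inference example; the entity name no_latch and the default value '0' are assumptions for illustration, not part of the lecture.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper: the entity name no_latch is illustrative.
entity no_latch is
  port (
    a, b, c : in  std_logic;
    z1, z2  : out std_logic
  );
end no_latch;

architecture main of no_latch is
begin
  process (a, b, c)
  begin
    -- Default assignment: z2 now gets a value on every pass through the
    -- process, so no storage is needed. The choice of '0' is an
    -- assumption; use whatever value the design requires.
    z2 <= '0';
    if (a = '1') then
      z1 <= b;
      z2 <= b;
    else
      z1 <= c;
    end if;
  end process;
end main;
```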
Loop, Latch, Flop
[Figure: three circuits, each with inputs a and b and output z: a combinational loop, a latch (with EN input), and a flip-flop (with D and Q pins)]
Question: Write VHDL code for each of the above circuits
1.6 Details of Process Execution
1.6.1 Simple Simulation
[Figure: circuit with signals a, b, c, d, e, and waveforms for those signals with time markers at 0 ns, 10 ns, 12 ns, 15 ns]
Different Programs, Same Behaviour
All three programs below synthesize to the circuit on the previous slide.
The goal of VHDL semantics is that all three programs have the same behaviour.
-- Program 1
process (a,b)
begin
  c <= a and b;
end process;
process (b,c,d)
begin
  d <= not c;
  e <= b and d;
end process;
-- Program 2
process (a,b,c,d)
begin
  c <= a and b;
  d <= not c;
  e <= b and d;
end process;
-- Program 3
process (a,b)
begin
  c <= a and b;
end process;
process (c)
begin
  d <= not c;
end process;
process (b,d)
begin
  e <= b and d;
end process;
1.6.2 Temporal Granularities of Simulation
This section reserved for your reading pleasure
1.6.3 Intuition Behind Delta-Cycle Simulation
In zero-delay simulation, a sequence of dependent events must appear to happen instantaneously (in zero time). In particular, the effect of an event must propagate instantaneously through combinational circuitry.
Two fundamental rules for zero-delay simulation:
1. events appear to propagate through combinational circuitry instantaneously.
2. all of the gates appear to operate in parallel
Intuition for Delta Cycles
To make it appear that events propagate instantaneously, VHDL introduces an artificial unit of time, the delta cycle, to represent an infinitesimally small amount of time. In each delta cycle, every gate in the circuit will sample its inputs, compute its result, and drive its output signal with the result.
Simulators simulate one gate at a time, but the waveforms make it appear that all of the gates were run in parallel. In each delta cycle, the simulator executes all gates whose inputs changed.
To preserve the illusion that the gates ran in parallel, the effect of simulating a gate remains invisible until the end of the delta cycle.
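A small sketch can make the "invisible until the end of the delta cycle" rule concrete. In the hypothetical entity swap_demo below (all names are illustrative), two signals exchange values on each clock edge because both assignments read the values from before the delta cycle. The initial values are there only so the swap is visible in simulation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: swap_demo and its ports are illustrative names.
entity swap_demo is
  port (
    clk  : in  std_logic;
    x, y : out std_logic
  );
end swap_demo;

architecture main of swap_demo is
  -- Initial values are for simulation only, so the swap can be observed.
  signal a : std_logic := '0';
  signal b : std_logic := '1';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      a <= b;  -- reads the value b had before this delta cycle
      b <= a;  -- reads the old value of a, not the value assigned above
    end if;
  end process;
  x <= a;
  y <= b;
end main;
```

If assignments took effect immediately, as in most software languages, both signals would end up with the same value after the first clock edge.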
1.6.4 Definitions and Algorithm
1.6.4.1 Process Modes
[Figure: process modes. A postponed process is activated and becomes active; an active process suspends and becomes suspended; a suspended process resumes and becomes postponed.]
Suspended
• Nothing to currently execute
• A process stays suspended until the event that it is waiting for occurs: either a change in a signal on its sensitivity list or the condition in a wait statement
Postponed
• Wants to execute, but not currently active
• A process stays postponed until the simulator chooses it from the pool of postponed processes
Active
• Currently executing
• A process stays active until it hits a wait statement or sensitivity list, at which point it suspends
1.6.4.2 Simulation Algorithm
The algorithm presented here is a simplification of the actual algorithm in the VHDL Standard.
This algorithm does not support delayed assignments; for example: a <= b after 2 ns;
A somewhat ironic note: only six of the two hundred pages in the VHDL Standard are devoted to the semantics of executing processes.
The Algorithm
Simulations start at step 1 with all processes postponed and all signals with a default value (e.g., 'U' for std_logic).
1. While there are postponed processes:
(a) Pick one or more postponed processes to execute (become active).
(b) Provisionally execute assignments (new values become visible at step 3)
(c) A process executes until it hits its sensitivity list or a wait statement, at which point it suspends.
(d) Processes that become suspended, stay suspended until there are no more postponed or active processes.
2. Each process checks its sensitivity list or wait condition to see if it should resume
3. Update signals with their provisional values
4. If no postponed processes, then increment simulation time to next event.
Notes on Simulation Algorithm
• At a wait statement, the process will suspend even if the condition is true in the current simulation cycle. The process will resume when the condition changes to true.
• In n-threaded execution, at most n processes are active at a time
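The first note above can be observed with a small simulation-only sketch. The entity name wait_demo, the signal names, and the 10 ns delays are all assumptions chosen for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: wait_demo and its signals are illustrative names.
entity wait_demo is
end wait_demo;

architecture main of wait_demo is
  signal a    : std_logic := '1';  -- already '1' at time 0
  signal done : std_logic := '0';
begin
  stim : process
  begin
    wait for 10 ns;
    a <= '0';
    wait for 10 ns;
    a <= '1';   -- this change at 20 ns makes the condition below become true
    wait;       -- suspend this process forever
  end process;

  waiter : process
  begin
    -- Although a = '1' at 0 ns, the process suspends here. It resumes
    -- only at 20 ns, when an event on a makes the condition true.
    wait until a = '1';
    done <= '1';
    wait;
  end process;
end main;
```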
1.6.4.3 Delta-Cycle Definitions
Definition simulation step: Executing one sequential assignment or process mode change.
Definition simulation cycle: The operations that occur in one iteration of the simulation algorithm.
Definition delta cycle: A simulation cycle that does not advance simulation time.
Definition simulation round: A sequence of simulation cycles that all have the same simulation time.
1.6.5 Example 1: Process Execution (Bamboozle)
This section reserved for your reading pleasure
1.6.6 Example 2: Process Execution (Flummox)
This example is a variation of the Bamboozle example from section 1.6.5.
[Figure: circuit with signals a, b, c, d, e (all initially 'U') and a blank delta-cycle simulation table starting at 0 ns, with rows a, b, c, d, e, proc1, proc2, proc3, delta cycle, sim cycle, sim round. Legend: simulation step; process mode (S = suspended, P = postponed, A = active); simulation-step pointer (one per process); visible-assignment value; provisional-assignment value; initial values.]
proc1: process (a, b, c) begin
  c <= a AND b;
end process;
proc2: process (b, d) begin
  d <= NOT c;
end process;
e <= b AND d;
proc3: process begin
  a <= '1';
  b <= '0';
  b <= '1';
  wait for 3 ns;
  wait for 99 ns;
end process;
[Figure: the same circuit and code repeated next to a summary of the simulation algorithm, with a blank simulation table (rows a, b, c, d, e, proc1, proc2, proc3, delta cycle, sim cycle, sim round)]
1. While there are postponed processes:
(a) Pick process(es) to activate
(b) Execute active processes, record provisional assignments
(c) Suspend at sensitivity list or wait statement
(d) Once suspended, stay suspended
2. Check sensitivity lists, wait conditions for changes
3. Update signals with provisional values
4. If no postponed processes, increment time
From Delta-Time to Real Time
[Figure: two blank waveform grids for a, b, c, d, e (all initially 'U'). Delta time: columns 0 ns, +1δ, +2δ, +3δ, 3 ns, +1δ, +2δ, +3δ, 102 ns. Real time: columns 0 ns, 1 ns, 2 ns, 3 ns, 4 ns, 100 ns, 101 ns, 102 ns.]
Note and Questions
Note: If a signal is updated with the same value it had in the previous simulation cycle, then it does not change, and therefore does not trigger processes to resume.
Question: What are the different granularities of time that occur when doing delta-cycle simulation?
Question: What is the order of granularity, from finest to coarsest, amongst the different granularities related to delta-cycle simulation?
1.6.7 Ex: Need for Provisional Asn
architecture main of swindle is
begin
p_c: process (a, b) begin
c <= a AND b;
end process;
p_d: process (a, c) begin
d <= a XOR c;
end process;
end main;
Question: draw the circuit
Circuit to illustrate need for provisional assignments
1. Start with all signals at ’0’ .
2. Simultaneously change to a = ’1’ and b = ’1’ .
With Provisional Assignments, c Before d
If assignments are not visible within same simulation cycle (correct: i.e. provisional assignments are used)
p_c: process (a, b) begin
  c <= a AND b;
end process;
p_d: process (a, c) begin
  d <= a XOR c;
end process;
[Figure: blank delta-cycle simulation table with signals a, b, c, d (all initially 0) and the modes of p_c and p_d]
If p_c is scheduled before p_d, then d will have a '1' pulse.
With Provisional Assignments, d Before c
If assignments are not visible within same simulation cycle (correct: i.e. provisional assignments are used)
p_c: process (a, b) begin
  c <= a AND b;
end process;
p_d: process (a, c) begin
  d <= a XOR c;
end process;
[Figure: blank delta-cycle simulation table with signals a, b, c, d (all initially 0) and the modes of p_c and p_d]
If p_d is scheduled before p_c, then d will have a '1' pulse.
Without Prov. Assignments, c Before d
If assignments are visible within same simulation cycle (incorrect)
p_c: process (a, b) begin
  c <= a AND b;
end process;
p_d: process (a, c) begin
  d <= a XOR c;
end process;
[Figure: blank delta-cycle simulation table with signals a, b, c, d (all initially 0) and the modes of p_c and p_d]
If p_c is scheduled before p_d, then d will stay constant '0'.
Without Prov. Assignments, d Before c
If assignments are visible within same simulation cycle (incorrect)
p_c: process (a, b) begin
  c <= a AND b;
end process;
p_d: process (a, c) begin
  d <= a XOR c;
end process;
[Figure: blank delta-cycle simulation table with signals a, b, c, d (all initially 0) and the modes of p_c and p_d]
If p_d is scheduled before p_c, then d will have a '1' pulse.
Need for Provisional Assignment
With provisional assignments, both orders of scheduling processes result in the same behaviour on all signals. Without provisional assignments, different scheduling orders result in different behaviour.
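To try the swindle example in a simulator, a testbench along the following lines could be used. This is only a sketch: the entity name swindle_tb, the stim process, and the 10 ns delay are assumptions, and the two processes are copied into the testbench rather than instantiated.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: swindle_tb and the 10 ns delay are illustrative.
entity swindle_tb is
end swindle_tb;

architecture main of swindle_tb is
  signal a, b, c, d : std_logic := '0';  -- start with all signals at '0'
begin
  p_c : process (a, b) begin
    c <= a AND b;
  end process;

  p_d : process (a, c) begin
    d <= a XOR c;
  end process;

  stim : process
  begin
    wait for 10 ns;
    a <= '1';   -- a and b change in the same simulation cycle
    b <= '1';
    wait;       -- suspend forever
  end process;
end main;
```

Because VHDL signal assignments are provisional, d should be '1' for one delta cycle at 10 ns regardless of which process the simulator runs first. A waveform viewer usually needs a delta-cycle expansion mode to show a pulse that lasts zero simulation time.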
1.6.8 Delta-Cycle Simulations of Flip-Flops
p_a : process begin
  a <= '0';
  wait for 15 ns;
  a <= '1';
  wait for 20 ns;
end process;
p_clk : process begin
  clk <= '0';
  wait for 10 ns;
  clk <= '1';
  wait for 10 ns;
end process;
flop : process ( clk ) begin
  if rising_edge( clk ) then
    q <= a;
  end if;
end process;
[Figure: partially filled delta-cycle simulation table starting at 0 ns, with rows a, clk, q, flop, p_a, p_clk, sim round, sim cycle, delta cycle]
Redraw with Normal Time Scale
[Figure: blank waveforms for a, clk, and q from 0 ns to 35 ns in 5 ns steps]
Back-to-Back Flops
p_a : process begin
  a <= '0';
  wait for 15 ns;
  a <= '1';
  wait for 20 ns;
end process;
p_clk : process begin
  clk <= '0';
  wait for 10 ns;
  clk <= '1';
  wait for 10 ns;
end process;
flops : process ( clk ) begin
  if rising_edge( clk ) then
    q1 <= a;
    q2 <= q1;
  end if;
end process;
[Figure: partially filled delta-cycle simulation table covering 10 ns, 15 ns, 20 ns, 30 ns, and 35 ns, with rows a, clk, q1, q2, flops, p_a, p_clk, sim round, sim cycle, delta cycle]
Redraw with Normal Time Scale
[Figure: blank waveforms for a, clk, and q from 0 ns to 35 ns in 5 ns steps]
External Inputs and Flops
Question: Do the signals b1 and b2 have the same behaviour from 20–30 ns?
architecture mathilde of sauve is
  signal clk, a1, a2, b1, b2 : std_logic;
begin
  process begin
    clk <= '1';
    wait for 10 ns;
    clk <= '0';
    wait for 10 ns;
  end process;
  -- a1 is driven by a timed process
  process begin
    wait for 20 ns;
    a1 <= '1';
  end process;
  -- a2 is driven by a clocked process
  process begin
    wait until rising_edge(clk);
    a2 <= '1';
  end process;
  process begin
    wait until rising_edge( clk );
    b1 <= a1;
    b2 <= a2;
  end process;
end mathilde;
Testbenches and Clock Phases
env : process begin
  a <= '1';
  clk <= '0';
  wait for 10 ns;
  a <= '0';
  clk <= '1';
  wait for 10 ns;
end process;
flop : process ( clk ) begin
  if rising_edge( clk ) then
    q1 <= a;
  end if;
end process;
[Figure: blank delta-cycle simulation table starting at 0 ns, with rows a, clk, q1, flop2, flop1, env, sim round, sim cycle, delta cycle]
Redraw with Normal Time Scale
[Figure: blank waveforms for a, clk, and q1 from 0 ns to 20 ns]
Warning
Note: Testbench signals. For consistent results across different simulators, simulation scripts vs test benches, and timing simulation vs zero-delay simulation, do not change signals in your testbench or script at the same time as the clock changes.
[Figure: three blank waveform panels for a, clk, and q1 (all initially 'U') from 0 ns to 60 ns. Panel 1: a is output of clocked or combinational process. Panel 2: a is output of timed process (testbench or environment), POOR DESIGN. Panel 3: a is output of timed process (testbench or environment), GOOD DESIGN.]
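One way to follow this warning is to offset the stimulus from the clock. The sketch below is an assumption-laden illustration, not the lecture's testbench: the entity name flop_tb, the 20 ns clock period, and the 5 ns offset are all made up for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: flop_tb and all timing values are illustrative.
entity flop_tb is
end flop_tb;

architecture main of flop_tb is
  signal clk, a, q1 : std_logic;
begin
  -- Clock changes at 0 ns, 10 ns, 20 ns, ... (rising edges at 10, 30, 50, ...)
  clk_gen : process
  begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;

  -- Stimulus changes at 5 ns, 25 ns, 45 ns, ..., never at a clock change.
  stim : process
  begin
    wait for 5 ns;        -- one-time offset from the clock
    loop
      a <= '1';
      wait for 20 ns;
      a <= '0';
      wait for 20 ns;
    end loop;
  end process;

  flop : process (clk)
  begin
    if rising_edge(clk) then
      q1 <= a;
    end if;
  end process;
end main;
```

With this offset, the value sampled at each rising edge does not depend on how the simulator orders the clock and stimulus processes.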
1.7 Register-Transfer-Level Simulation
[Figure: side-by-side comparison of a delta-cycle simulation (rows a, b, c, d, e, proc1, proc2, proc3, delta cycle, sim cycle, sim round; columns from 0 ns through the delta cycles at 0 ns and 3 ns to 102 ns) and an RTL simulation of the same signals (columns 0 ns, 1 ns, 2 ns, 3 ns, 102 ns)]
1.7.1 Overview
• Much simpler than delta cycle
• Columns are real time: clock cycles, nanoseconds, etc.
• Can simulate both synthesizable and unsynthesizable code
• Cannot simulate combinational loops
• Same values as delta-cycle at end of simulation round
process begin
a <= ’0’;
wait for 10 ns;
a <= ’1’;
...
end process;
process begin
b <= ’0’;
wait for 10 ns;
b <= a;
...
end process;
Question: In this code, what value should b have at 10 ns?
1.7.2 Technique for Register-Transfer Level Simulation
1. Pre-processing
(a) Separate processes into combinational and non-combinational (clocked and timed)
(b) Decompose each combinational process into separate processes with one target signal per process
(c) Sort processes into topological order based on dependencies
2. For each clock cycle or unit of time:
(a) Run non-combinational processes in any order. Non-combinational assignments read from earlier clock cycle / time step, except that clocked processes read the current value of the clock signal.
(b) Run combinational processes in topological order. Combinational assignments read from current clock cycle / time step.
1.7.3 Examples of RTL Simulation
1.7.3.1 RTL Simulation Example 1
We revisit an earlier example from delta-cycle simulation, but change the code slightly and do register-transfer-level simulation.
proc1: process (a, b, c) begin
d <= NOT c;
c <= a AND b;
end process;
proc2: process (b, d) begin
e <= b AND d;
end process;
proc3: process begin
a <= ’1’;
b <= ’0’;
wait for 3 ns;
b <= ’1’;
wait for 99 ns;
end process;
Decompose and sort comb procs
Decomposed
proc1d: process (c) begin
  d <= NOT c;
end process;
proc1c: process (a, b) begin
  c <= a AND b;
end process;
proc2: process (b, d) begin
  e <= b AND d;
end process;
Sorted
proc1c: process (a, b) begin
  c <= a AND b;
end process;
proc1d: process (c) begin
  d <= NOT c;
end process;
proc2: process (b, d) begin
  e <= b AND d;
end process;
Waveforms
[Figure: blank waveforms for a, b, c, d, e (all initially 'U') with columns 0 ns, 1 ns, 2 ns, 3 ns, 102 ns]
Example: Communicating State Machines
huey: process
begin
clk <= ’1’;
wait for 10 ns;
clk <= ’0’;
wait for 10 ns;
end process;
dewey: process
begin
  a <= to_unsigned(0,4);
  wait until rising_edge(clk);
  while (a < 4) loop
    a <= a + 1;
    wait until rising_edge(clk);
  end loop;
end process;
louie: process
begin
  wait until rising_edge(clk);
  d <= '1';
  if (a >= 2) then
    d <= '0';
    wait until rising_edge(clk);
  end if;
end process;
[Figure: blank waveforms for clk, a, and d]
1.8 VHDL and Hardware Building Blocks
1.8.1 Basic Building Blocks
Different classes of building blocks:
• Conditional
• Arithmetic
• Storage
Basic Building Blocks: Boolean
VHDL   Description
and    AND gate
or     OR gate
not    inverter
nand   NAND gate
nor    NOR gate
xor    exclusive-or gate
Basic Building Blocks: Conditional
VHDL                                         Description
if-then-else, when-else, with-select, case   Multiplexer
Basic Building Blocks: Arithmetic
VHDL       Description
+          adder
-          subtracter
sla, sll   left shifter
sra, srl   right shifter
Basic Building Blocks: Storage
VHDL               Description
clocked process    flip flop (pins D, Q, CE, S, R)
memory component   single-port memory (pins WE, A, DI, DO)
memory component   dual-port memory (pins WE, A0, DI0, DO0, A1, DO1)
1.8.2 Deprecated Building Blocks for RTL
Some of the common gates you have encountered in previous courses should be avoided when synthesizing register-transfer-level hardware, particularly if FPGAs are the implementation technology.
Latches : Use flops, not latches
T, JK, SR, etc flip-flops : Limit yourself to D-type flip-flops
Tri-State Buffers : Use multiplexers, not tri-state buffers
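As an illustration of the last guideline, the sketch below contrasts a tri-state bus (shown only in comments) with a multiplexer that has a single driver. The entity name bus_mux, the port names, and the 8-bit width are assumptions for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: bus_mux and its ports are illustrative names.
entity bus_mux is
  port (
    sel        : in  std_logic;
    src0, src1 : in  std_logic_vector(7 downto 0);
    data       : out std_logic_vector(7 downto 0)
  );
end bus_mux;

architecture main of bus_mux is
begin
  -- Tri-state style (avoid): each source drives the shared bus or 'Z'
  --   data <= src0 when sel = '0' else (others => 'Z');
  --   data <= src1 when sel = '1' else (others => 'Z');

  -- Multiplexer style (preferred): exactly one driver for data
  data <= src0 when sel = '0' else src1;
end main;
```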
Note: Unfortunately and surprisingly, PalmChip has been awarded a US patent for using uni-directional busses (i.e. multiplexers) for system-on-chip designs. The patent was filed in 2000, so all fourth-year design projects completed after that date will need to pay royalties to PalmChip.
What is This?
process (a)
begin
if rising_edge(a) then
c <= b;
end if;
end process;
1.8.3 Hardware and Code for Flops
1.8.3.1 Flops with Waits and Ifs
process (clk)
begin
if rising_edge(clk) then
q <= d;
end if;
end process;
VHDL Code for Flip-Flop: Wait-Style
process
begin
wait until rising_edge(clk);
q <= d;
end process;
1.8.3.2 Flops with Synchronous Reset
process (clk)
begin
if rising_edge(clk) then
if (reset = ’1’) then
q <= ’0’;
else
q <= d;
end if;
end if;
end process;
Flop with Synchronous Reset: Wait-Style
process
begin
wait until rising_edge(clk);
if (reset = ’1’) then
q <= ’0’;
else
q <= d0;
end if;
end process;
Variation on a Floppy Theme
Question: Synchronous or asynchronous reset?
process (clk, reset)
begin
if (reset = ’1’) then
q <= ’0’;
else
if rising_edge(clk) then
q <= d;
end if;
end if;
end process;
Variated Flop of a Theme
Question: Synchronous or asynchronous reset?
process
begin
if (reset = ’1’) then
q <= ’0’;
else
q <= d0;
end if;
wait until rising_edge(clk);
end process;
Flop with Chip-Enable
process (clk)
begin
if rising_edge(clk) then
if (ce = ’1’) then
q <= d;
end if;
end if;
end process;
Wait-style flop with chip-enable included in course notes
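For readers without the course notes at hand, one plausible wait-style version is sketched below, following the pattern of the wait-style flop with synchronous reset. It is not copied from the course notes, and the entity name flop_ce is an assumption.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper: the entity name flop_ce is illustrative.
entity flop_ce is
  port (
    clk, ce, d : in  std_logic;
    q          : out std_logic
  );
end flop_ce;

architecture main of flop_ce is
begin
  process
  begin
    wait until rising_edge(clk);
    -- q is updated only when chip-enable is asserted;
    -- otherwise it holds its previous value.
    if (ce = '1') then
      q <= d;
    end if;
  end process;
end main;
```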
Q: Flop with a Mux on the Input?
[Figure: a D flip-flop clocked by clk with output q; a multiplexer controlled by sel selects between d0 and d1 at the D input]
Q: Flops with a Mux on the Output?
[Figure: two D flip-flops clocked by clk, with inputs d0 and d1 and outputs q0 and q1; a multiplexer controlled by sel selects between q0 and q1 to drive q]
Question: For the circuits with mux-on-input and mux-on-output, does q have the same behaviour in both circuits?
1.8.3.3 Flop with Chip-Enable and Mux on Input
Hint: Chip Enable
process (clk)
begin
if rising_edge(clk) then
if (ce = ’1’) then
q <= d;
end if;
end if;
end process;
1.8.3.4 Flops with Chip-Enable, Muxes, and Reset
This section reserved for your reading pleasure
1.8.4 An Example Sequential Circuit
This section reserved for your reading pleasure
1.9 Arrays and Vectors
This section reserved for your reading pleasure
1.10 Arithmetic
VHDL includes all of the common arithmetic and logical operators.
Use the VHDL arithmetic operators and let the synthesis tool choose the best implementation for you.
1.10.1 Arithmetic Packages
To do arithmetic with signals, use the numeric_std package. This package defines types signed and unsigned, which are std_logic vectors on which you can do signed or unsigned arithmetic.
numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.
Use only one arithmetic package, otherwise the different definitions will clash and you can get strange error messages.
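A minimal sketch of arithmetic with numeric_std is shown below. The entity name add8, the port names, and the 8-bit width are assumptions chosen for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;   -- the only arithmetic package used

-- Hypothetical example: add8 and its ports are illustrative names.
entity add8 is
  port (
    a, b : in  unsigned(7 downto 0);
    sum  : out unsigned(7 downto 0)
  );
end add8;

architecture main of add8 is
begin
  -- The "+" operator comes from numeric_std. The result is 8 bits wide,
  -- so any carry out of the top bit is discarded.
  sum <= a + b;
end main;
```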
1.10.2 Shift and Rotate Operations
This section reserved for your reading pleasure
1.10.3 Overloading of Arithmetic
This section reserved for your reading pleasure
1.10.4 Different Widths and Arithmetic
This section reserved for your reading pleasure
1.10.5 Overloading of Comparisons
This section reserved for your reading pleasure
Overloading of Comparison Operations (=, /=, >=, >, <)
src1/2     src2/1     result
unsigned   integer    OK
signed     integer    OK
unsigned   signed     fails in analysis
1.10.6 Different Widths and Comparisons
This section reserved for your reading pleasure
1.10.7 Type Conversion
The functions unsigned, signed, to_integer, to_unsigned and to_signed are used to convert between integers, std_logic vectors, signed vectors and unsigned vectors.
If you convert between two types of the same width, then no additional hardware will be generated.
The listing below summarizes the types of these functions.
Type Conversion
unsigned( val : std_logic_vector ) return unsigned;
signed( val : std_logic_vector ) return signed;
to_integer( val : signed ) return integer;
to_integer( val : unsigned ) return integer;
to_unsigned( val : integer; width : natural) return unsigned;
to_signed( val : integer; width : natural) return signed;
Note: More details in course notes
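The conversions listed above can be chained, as in the sketch below. The entity name conv_demo and the signal names are assumptions for illustration. The example also uses the std_logic_vector( ) type conversion, which is not in the listing above but is the standard way to convert back from unsigned.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: conv_demo and its signals are illustrative names.
entity conv_demo is
  port (
    slv_in  : in  std_logic_vector(7 downto 0);
    slv_out : out std_logic_vector(7 downto 0)
  );
end conv_demo;

architecture main of conv_demo is
  signal u : unsigned(7 downto 0);
  signal i : integer range 0 to 255;
begin
  u <= unsigned(slv_in);       -- std_logic_vector to unsigned
  i <= to_integer(u);          -- unsigned to integer
  -- integer to unsigned (width 8), then unsigned to std_logic_vector
  slv_out <= std_logic_vector(to_unsigned(i, 8));
end main;
```

All three signals have the same width, so this round trip should not generate any extra hardware.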
1.11 Synthesizable vs Non-Synthesizable Code
Synthesis is done by matching VHDL code against templates or patterns.
It’s important to use idioms that your synthesis tools recognize.
Think like hardware: when you write VHDL, you should know what hardware you expect to be produced by the synthesizer.
1.11.1 Unsynthesizable Code
1.11.1.1 Initial Values
Initial values on signals (UNSYNTHESIZABLE)
signal bad_signal : std_logic := ’0’;
Reason : At powerup, the values on signals are random (except for some FPGAs).
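A common alternative to an initial value is to establish a known value with a reset, as in the flop-with-synchronous-reset pattern. The sketch below is illustrative; the entity name init_demo and the signal name good_signal are assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: init_demo and good_signal are illustrative names.
entity init_demo is
  port (
    clk, reset, d : in  std_logic;
    q             : out std_logic
  );
end init_demo;

architecture main of init_demo is
  signal good_signal : std_logic;  -- no initial value
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        good_signal <= '0';  -- known value established by reset
      else
        good_signal <= d;
      end if;
    end if;
  end process;
  q <= good_signal;
end main;
```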
1.11.1.2 Wait For
Wait for length of time (UNSYNTHESIZABLE)
wait for 10 ns;
Reason : Delays through circuits are dependent upon both the circuit and its operating environment, particularly supply voltage and temperature. For example, imagine trying to build an AND gate that will have exactly a 2 ns delay in all environments.
1.11.1.3 Different Wait Conditions
wait statements with different conditions in a process (UNSYNTHESIZABLE)
-- different clock signals
process
begin
wait until rising_edge(clk1);
x <= a;
wait until rising_edge(clk2);
x <= a;
end process;
Reason : Would require the flip flops to use different clock signals at different times.
Different Wait Conditions
-- different clock edges
process
begin
wait until rising_edge(clk);
x <= a;
wait until falling_edge(clk);
x <= a;
end process;
Reason : Would require the flip-flop to be sensitive to different clock edges at different times.
1.11.1.4 Multiple “if rising_edge” in Process
Multiple if rising_edge statements in a process (UNSYNTHESIZABLE)
process (clk)
begin
if rising_edge(clk) then
q0 <= d0;
end if;
if rising_edge(clk) then
q1 <= d1;
end if;
end process;
Reason : The idioms for synthesis tools generally expect just a single if rising_edge statement in each process.
The simpler the VHDL code is, the easier it is to synthesize hardware. Programmers of synthesis tools make idiomatic (idiotic?) restrictions to make their jobs simpler.
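A sketch of one way to stay within the single-if idiom is to place both assignments under one if rising_edge statement. The entity name two_flops is hypothetical; the signal names follow the unsynthesizable example above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity wrapping the two flip-flops from the example.
entity two_flops is
  port (
    clk, d0, d1 : in  std_logic;
    q0, q1      : out std_logic
  );
end two_flops;

architecture main of two_flops is
begin
  process (clk)
  begin
    -- a single rising_edge test covers both registered assignments
    if rising_edge(clk) then
      q0 <= d0;
      q1 <= d1;
    end if;
  end process;
end main;
```

Splitting the two assignments into two separate processes, each with its own if rising_edge, would be another option.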
1.11.1.5 “if rising_edge” and “wait” in Same Process
An if rising_edge statement and a wait statement in the same process (UNSYNTHESIZABLE)
process (clk)
begin
if rising_edge(clk) then
q0 <= d0;
end if;
wait until rising_edge(clk);
q0 <= d1;
end process;
Reason : The idioms for synthesis tools generally expect just a single type of flop-generating statement in each process.
1.11.1.6 “if rising_edge” with “else” Clause
The if statement has a rising_edge condition and an else clause (UNSYNTHESIZABLE).
process (clk)
begin
if rising_edge(clk) then
q0 <= d0;
else
q0 <= d1;
end if;
end process;
Reason : Generally, an if-then-else statement synthesizes to a multiplexer.
1.11.1.7 “if rising_edge” Inside a “for” Loop
An if rising_edge statement in a for-loop (UNSYNTHESIZABLE-Synopsys)
process (clk) begin
for i in 0 to 7 loop
if rising_edge(clk) then
q(i) <= d;
end if;
end loop;
end process;
Reason : just an idiom of the synthesis tool.
Some loop statements are synthesizable (Rushton Section 8.7). For-loops in general are described in Ashenden.
Synthesizable Alternative
A synthesizable alternative to an if rising_edge statement in a for-loop is to put the if rising_edge outside of the for loop.
process (clk) begin
if rising_edge(clk) then
for i in 0 to 7 loop
q(i) <= d;
end loop;
end if;
end process;
1.11.1.8 “wait” Inside of a “for loop”
wait statements in a for loop (UNSYNTHESIZABLE)
process
begin
for i in 0 to 7 loop
wait until rising_edge(clk);
x <= to_unsigned(i,4);
end loop;
end process;
Reason : Unknown. while-loops with the same behaviour are synthesizable.
Note: Combinational for-loops. Combinational for-loops are usually synthesizable. They are often used to build a combinational circuit for each element of an array.
Note: Clocked for-loops. Clocked for-loops are not synthesizable, but are very useful in simulation, particularly to generate test vectors for test benches.
Synthesizable Alternative to Wait-Inside-For
while loop (synthesizable)
This is the synthesizable alternative to the wait statement in a for loop above.
process
begin
-- output values from 0 to 4 on i
-- sending one value out each clock cycle
i <= to_unsigned(0,4);
wait until rising_edge(clk);
while (4 > i) loop
i <= i + 1;
wait until rising_edge(clk);
end loop;
end process;
1.12 Synthesizable VHDL Coding Guidelines
This section reserved for your reading pleasure
Chapter 2
RTL Design with VHDL: From Requirements to Optimized Code
2.1 Prelude to Chapter
This section reserved for your reading pleasure
2.2 FPGA Background and Coding Guidelines
2.2.1 Generic FPGA Hardware
2.2.1.1 Generic FPGA Cell
“Cell” = “Logic Element” (LE) in Altera
= “Configurable Logic Block” (CLB) in Xilinx
[Figure: generic FPGA cell containing a combinational block (comb) and a flip-flop (D, Q, CE, S, R); signals data_in, ctrl_in, carry_in, carry_out, data_out]
Configurable Comb/Flop Connection
[Figure: generic FPGA cell with a configurable connection between the combinational block (comb) and the flip-flop (D, Q, CE, S, R); signals comb_data_in, ctrl_in, carry_in, carry_out, comb_data_out, flop_data_in, flop_data_out]
Separate Comb and Flop
[Figure: the same cell, with the combinational block and the flip-flop used separately]
Connect Comb and Flop
[Figure: the same cell, with the combinational block connected to the flip-flop]
Flopped and Unflopped Outputs
[Figure: the same cell, providing both flopped and unflopped outputs]
2.2.2 Area Estimation
To estimate the number of FPGA cells that will be required to implement a circuit, recall that an FPGA lookup-table can implement any function with up to four inputs and one output.
We will describe two methods to estimate the area (number of FPGA cells) required to implement a gate-level circuit:
1. Rough estimate based simply upon the number of flip-flops and primary inputs that are in the fanin of each flip-flop.
2. A more accurate estimate, based upon greedily including as many gates as possible into each FPGA cell.
Lower Bound on Area for Circuit with one Target
Source flops/inputs   Minimum cells
 1                    1
 2                    1
 3                    1
 4                    1
 5                    2
 6                    2
 7                    2
 8                    3
 9                    3
10                    3
11                    4
For a single target signal, this technique gives a lower bound on the number of cells needed.
For multiple target signals, this technique might be an overestimate, because a single cell can drive several other cells.
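The table for a single target is consistent with a simple counting argument, offered here as a reading of the table rather than something stated in the slides. With k four-input cells arranged as a tree, k - 1 of the 4k cell inputs are used to connect cells to each other, which leaves at most 3k + 1 inputs for source signals. Writing n for the number of source flops/inputs (a symbol introduced only for this sketch):

```latex
\[
  3k + 1 \ge n
  \quad\Longrightarrow\quad
  k_{\min}(n) = \max\!\left(1,\ \left\lceil \frac{n-1}{3} \right\rceil\right)
\]
\[
  k_{\min}(4) = 1, \qquad
  k_{\min}(5) = \left\lceil 4/3 \right\rceil = 2, \qquad
  k_{\min}(10) = \left\lceil 9/3 \right\rceil = 3, \qquad
  k_{\min}(11) = \left\lceil 10/3 \right\rceil = 4
\]
```

These values agree with the rows for 4, 5, 10 and 11 sources in the table.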
Question: How many cells are needed to implement a 4:1 mux?
3 Cells for 10:1 Function
Estimate Area for Circuit
For each flip-flop and output: traverse backward through the fanin gathering as much combinational circuitry as possible into the FPGA cell.
Stopping conditions:
• flip-flop
• more than four inputs. However, even if there are more than four signals as input at some point, the circuit might collapse back to four or fewer signals further back in the fanin.
Question: Map the combinational circuits below onto generic FPGA cells.
[Figure: combinational circuit with inputs a, b, c, d and output z, followed by six blank generic FPGA cells (comb block plus flip-flop with D, Q, CE, S, R) onto which the circuit is to be mapped]
2.2.2.1 Interconnect for Generic FPGA
This section reserved for your reading pleasure
2.2.2.2 Clocks for Generic FPGAs
Characteristics of clock signals:
• High fanout (drive many gates)
• Long wires (destination gates scattered all over chip)
Characteristics of FPGAs:
• Very few gates that are large (strong) enough to support a high fanout.
• Very few wires that traverse entire chip and can be connected to every flip-flop.
2.2.2.3 Special Circuitry in FPGAs
Memory
For more than five years, FPGAs have had special circuits for RAM and ROM. In Altera FPGAs, these circuits are called ESBs (Embedded System Blocks). These special circuits are possible because many FPGAs are fabricated on the same processes as SRAM chips. So, the FPGAs simply contain small chunks of SRAM.
Microprocessors
A new feature to appear in FPGAs in 2001 and 2002 is hardwired microprocessors on the same chip as programmable hardware.
                        Hard                           Soft
Altera                  Arm 922T with 200 MIPs         Nios with ?? MIPs
Xilinx: Virtex-II Pro   Power PC 405 with 420 D-MIPs   Microblaze with 100 D-MIPs
The Xilinx Virtex-II Pro has 4 Power PCs and enough programmable hardware to implement the first-generation Intel Pentium microprocessor.
Arithmetic Circuitry
A new feature to appear in FPGAs in 2001 and 2002 is hardwired circuits for multipliers and adders.
Altera: Mercury          16×16 at 130MHz
Xilinx: Virtex-II Pro    18×18 at ???MHz
Using these resources can improve significantly both the area and performance of a design.
Input / Output
Recently, high-end FPGAs have started to include special circuits to increase the bandwidth of communication with the outside world.
          Product
Altera    True-LVDS (1 Gbps)
Xilinx    Rocket I/O (3 Gbps)
2.2.3 Generic-FPGA Coding Guidelines
Flip Flops Are Free
• Flip-flops are almost free in FPGAs
reason In FPGAs, the area consumed by a design is usually determined by the amount of combinational circuitry, not by the number of flip-flops.
Use It or Lose
• Aim for using 80–90% of the cells on a chip.
reason If you use more than 90% of the cells on a chip, then the place-and-route program might not be able to route the wires to connect the cells.
reason If you use less than 80% of the cells, then probably:
– there are optimizations that will increase performance and still allow the design to fit on the chip;
– or you spent too much human effort on optimizing for low area;
– or you could use a smaller (cheaper!) chip.
exception In E&CE 327 (unlike in real life), the mark is based on the actual number of cells used.
Just One Clock
• Use just one clock signal
reason If all flip-flops use the same clock, then the clock does not impose any constraints on where the place-and-route tool puts flip-flops and gates. If different flip-flops used different clocks, then flip-flops that are near each other would probably be required to use the same clock.
Just One Clock Edge
• Use only one edge of the clock signal
reason There are two ways to use both rising and falling edges of a clock signal: have rising-edge and falling-edge flip flops, or have two different clock signals that are inverses of each other. Most FPGAs have only rising-edge flip flops. Thus, using both edges of a clock signal is equivalent to having two different clock signals, which is deprecated by the preceding guideline.
2.3 Design Flow
This section reserved for your reading pleasure
2.4 Algorithms and High-Level Models
This section reserved for your reading pleasure
2.5 Finite State Machines in VHDL
2.5.1 Introduction to State-Machine Design
2.5.1.1 Mealy vs Moore State Machines
Moore Machines
• Outputs are dependent upon only the state
• No combinational paths from inputs to outputs
[Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0 and transitions labelled a and !a]
Mealy Machines
• Outputs are dependent upon both the state and the inputs
• Combinational paths from inputs to outputs
[Figure: Mealy state diagram with states s0, s1, s2, s3 and transitions labelled a/1, !a/0, /0, /0]
2.5.1.2 Introduction to State Machines and VHDL
A state machine is generally written as a single clocked process, or as a pair of processes, where one is clocked and one is combinational.
Design Decisions
• Moore vs Mealy (Sections 2.5.2 and 2.5.3)
• Implicit vs Explicit (Section 2.5.1.3)
• State values in explicit state machines: Enumerated type vs constants (Section 2.5.5)
• State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.5.5)
VHDL Constructs for State Machines
The following VHDL control constructs are useful to steer the transition from state to state:
• if ... then ... else
• case
• for ... loop
• while ... loop
• loop
• next
• exit
2.5.1.3 Explicit vs Implicit State Machines
There are two styles of writing state machines in VHDL: explicit and implicit.
Explicit
• State signal appears explicitly in VHDL code
• At most one wait statement per process
• Two sub-categories of explicit state machines
Explicit-Current
– State signal represents current state
– Next-state computation done in a clocked process
Explicit-Current+Next
– Two state signals: current state and next state
– Next-state computation done in a combinational process
– Current-state <= next-state is registered assignment
Implicit
• Use multiple wait statements in a process to describe the state machine implicitly
Implicit State Machines
For the implicit style of writing state machines, the synthesis program adds an implicit register to hold the state signal and combinational circuitry to update the state signal. In Synopsys synthesis tools, the state signal defined by the synthesizer is named multiple_wait_state_reg.
In Mentor Graphics, the state signal is named STATE_VAR.
We can think of the VHDL code for implicit state machines as having zero state signals, explicit-current state machines as having one state signal (state), and explicit-current+next state machines as having two state signals (state and state_next).
State Machine Tradeoffs
Explicit-Current+Next
• Most detailed, closest to hardware
• Greatest opportunity for manual optimization
• Most labour-intensive
• Susceptible to small, subtle, hard-to-find bugs
Explicit-Current
• Almost as much opportunity for manual optimization as Explicit-Current+Next
• Easier to write than Explicit-Current+Next
• Less susceptible to subtle bugs
Implicit
• Taught infrequently
• Least detailed, furthest from actual hardware
• Rely on synthesis for optimization
• Usually least labour to write, shortest code
• Easiest to write correctly (But must understand VHDL synthesis! )
Limitation of Implicit State Machines
Because implicit state machines are written with loops, if-then-elses, cases, etc., it is difficult to write some state machines with complicated control flows in an implicit style. The following example illustrates the point.
[Figure: state diagram with states s0/0, s1/1, s2/0, s3/0 and four transitions labelled a, !a, !a, a]
Terminology
Note: The terminology of “explicit” and “implicit” is somewhat standard, in that some descriptions of processes with multiple wait statements describe the processes as having “implicit state machines”.
There is no standard terminology to distinguish between the two explicit styles: explicit-current+next and explicit-current.
2.5.2 Implementing a Simple Moore Machine
[Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0 and transitions labelled a and !a]
entity simple is
port (
a, clk : in std_logic;
z : out std_logic
);
end simple;
2.5.2.1 Implicit Moore State Machine
architecture moore_implicit_v1a of simple is
begin
  process
  begin
    z <= '0';
    wait until rising_edge(clk);
    if (a = '1') then
      z <= '1';
    else
      z <= '0';
    end if;
    wait until rising_edge(clk);
    z <= '0';
    wait until rising_edge(clk);
  end process;
end moore_implicit_v1a;
Flops   Gates   Delay
Implicit Moore State Machine
[Figure: state-diagram fragment showing state s2/0 and the !a transition]
2.5.2.2 Explicit Moore with Flopped Output
architecture moore_explicit_v1 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
            z <= '1';
          else
            state <= s2;
            z <= '0';
          end if;
        when s1 | s2 =>
          state <= s3;
          z <= '0';
        when s3 =>
          state <= s0;
          z <= '0';  -- z is the flopped output of the next state; s0 outputs '0'
      end case;
    end if;
  end process;
end moore_explicit_v1;
Flops   Gates   Delay
Explicit Moore with Flopped Outputs
2.5.2.3 Explicit Moore with Combinational Outputs
architecture moore_explicit_v2 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when s3 =>
          state <= s0;
      end case;
    end if;
  end process;
  z <= '1' when (state = s1)
       else '0';
end moore_explicit_v2;
Flops   Gates   Delay
Explicit Moore with Combinational Outputs
2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment
architecture moore_explicit_v3 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;
  state_nxt <= s1 when (state = s0) and (a = '1')
          else s2 when (state = s0) and (a = '0')
          else s3 when (state = s1) or (state = s2)
          else s0;
  z <= '1' when (state = s1)
       else '0';
end moore_explicit_v3;
Flops   Gates   Delay
Explicit-Current+Next Moore with Concurrent Assignment
The hardware synthesized from this architecture is the same as that synthesized from moore_explicit_v2, which is written in the explicit-current style.
2.5.2.5 E-C+N Moore with Comb Proc
architecture moore_explicit_v4 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;
  process (state, a)
  begin
    case state is
      when s0 =>
        if (a = '1') then
          state_nxt <= s1;
        else
          state_nxt <= s2;
        end if;
      when s1 | s2 =>
        state_nxt <= s3;
      when s3 =>
        state_nxt <= s0;
    end case;
  end process;
  z <= '1' when (state = s1)
       else '0';
end moore_explicit_v4;
Change the conditional assignment to state_nxt into a combinational process using a case statement.
Flops   Gates   Delay
Same hardware as moore_explicit_v2 and v3.
Explicit-Current+Next Moore with Combinational Process
2.5.3 Implementing a Simple Mealy Machine
Mealy machines have a combinational path from inputs to outputs, which often violates good coding guidelines for hardware. Thus, Moore machines are much more common. You should know how to write a Mealy machine if needed, but most of the state machines that you design will be Moore machines.
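As a rough sketch only, the Mealy diagram from Section 2.5.1.1 could be written in the explicit-current style as shown below. This assumes the diagram is read as: s0 goes to s1 on a with output 1, s0 goes to s2 on !a with output 0, and all other transitions output 0. The entity name simple_mealy and the architecture name are hypothetical and do not come from the slides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity with the same ports as entity simple.
entity simple_mealy is
  port (
    a, clk : in  std_logic;
    z      : out std_logic
  );
end simple_mealy;

architecture mealy_explicit of simple_mealy is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  -- clocked process: next-state computation (same transitions as the Moore machine)
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when s3 =>
          state <= s0;
      end case;
    end if;
  end process;
  -- Mealy output: depends on both the state and the input a,
  -- which creates the combinational path from input to output
  z <= '1' when (state = s0) and (a = '1')
       else '0';
end mealy_explicit;
```

Compared with moore_explicit_v2, the only change is the output expression: z is asserted during s0 while a is high, one clock cycle earlier than in the Moore version.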
This section reserved for your reading pleasure
2.5.4 Reset
All circuits should have a reset signal that puts the circuit back into a good initial state. However, not all flip flops within the circuit need to be reset. In a circuit that has a datapath and a state machine, the state machine will probably need to be reset, but the datapath may not need to be reset.
There are standard ways to add a reset signal to both explicit and implicit state machines.
It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.
Reset with Implicit State Machine
• Insert a loop
• Test for reset after each wait
Example from section 2.5.2.1:
architecture moore_implicit of simple is
begin
  -- assumes a reset input has been added to the port list of entity simple
  process
  begin
    init : loop                        -- outermost loop
      z <= '0';
      wait until rising_edge(clk);
      next init when (reset = '1');    -- test for reset
      if (a = '1') then
        z <= '1';
      else
        z <= '0';
      end if;
      wait until rising_edge(clk);
      next init when (reset = '1');    -- test for reset
      z <= '0';
      wait until rising_edge(clk);
      next init when (reset = '1');    -- test for reset
    end loop init;
  end process;
end moore_implicit;
Reset with Explicit State Machine
Reset is often easier to include in an explicit state machine, because we need only put a test for reset = '1' in the clocked process for the state.
The pattern for an explicit-current style of machine is:
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= S0;
    else
      if ... then
        state <= ...;
      elsif ... then
        ... -- more tests and assignments to state
      end if;
    end if;
  end if;
end process;
Reset with Explicit State Machine
Applying this pattern to the explicit Moore machine from section 2.5.2.3 produces:
architecture moore_explicit_v2 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        state <= s0;
      else
        case state is
          ...
        end case;
      end if;
    end if;
  end process;
  z <= '1' when (state = s1)
       else '0';
end moore_explicit_v2;
Reset with Explicit-Current+Next
The pattern for an explicit-current+next style is:
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state_cur <= S0;  -- the reset state
    else
      state_cur <= state_nxt;
    end if;
  end if;
end process;
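A sketch of this pattern applied to the explicit-current+next Moore machine of Section 2.5.2.4 might look as follows. The entity simple_rst (entity simple with an added reset input) and the architecture name are hypothetical, and a synchronous, active-high reset is assumed.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: entity simple plus a reset input.
entity simple_rst is
  port (
    a, clk, reset : in  std_logic;
    z             : out std_logic
  );
end simple_rst;

architecture moore_explicit_v3_rst of simple_rst is
  type state_ty is (s0, s1, s2, s3);
  signal state_cur, state_nxt : state_ty;
begin
  -- clocked process: reset is tested on every rising edge
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state_cur <= s0;
      else
        state_cur <= state_nxt;
      end if;
    end if;
  end process;
  -- combinational next-state logic, unchanged by the reset
  state_nxt <= s1 when (state_cur = s0) and (a = '1')
          else s2 when (state_cur = s0) and (a = '0')
          else s3 when (state_cur = s1) or (state_cur = s2)
          else s0;
  z <= '1' when (state_cur = s1)
       else '0';
end moore_explicit_v3_rst;
```

Only the clocked process changes; the next-state and output logic are the same as in the version without reset.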
2.5.5 State Encoding
This section reserved for your reading pleasure
2.6 Dataflow Diagrams
2.6.1 Dataflow Diagrams Overview
• Dataflow diagrams are data-dependency graphs where the computation is divided into clock cycles.
• Purpose:
– Provide a disciplined approach for designing datapath-centric circuits
– Guide the design from algorithm, through high-level models, and finally to register transfer level code for the datapath and control circuitry.
– Estimate area and performance
– Make tradeoffs between different design options
• Background
– Based on techniques from high-level synthesis tools
– Some similarity between high-level synthesis and software compilation
– Each dataflow diagram corresponds to a basic block in software compiler terminology.
Data-Dependency Graph
[Figure: data-dependency graph for z = a + b + c + d + e + f, with inputs a to f, five adders, intermediate signals x1 to x4 and output z]
Dataflow Diagrams
[Figure: dataflow diagram for z = a + b + c + d + e + f]
Clock Cycle Boundaries
[Figure: the same dataflow diagram]
Horizontal lines mark clock cycle boundaries
Latency
[Figure: dataflow diagram for z = a + b + c + d + e + f with clock cycles numbered 1 to 6]
Horizontal lines mark clock cycle boundaries
Latency = 6 clock cycles
Latency
[Figure: dataflow diagram for z = a + b + c + d + e + f with clock cycles numbered 1 to 4]
Horizontal lines mark clock cycle boundaries
Latency = 4 clock cycles
Question: Why would a good hardware engineer find this design dissatisfying?
Flip Flops
[Figure: dataflow diagram for z = a + b + c + d + e + f]
Horizontal lines mark clock cycle boundaries
Signals crossing clock boundaries are flip-flops
Registered Inputs and Outputs
[Figure: the same dataflow diagram]
Flops on both inputs and outputs
Registered Inputs, Combinational Outputs
[Figure: the same dataflow diagram]
Flops on inputs, but not outputs (Latency = 5)
Datapath Components
[Figure: dataflow diagram for z = a + b + c + d + e + f]
Blocks in clock cycles are datapath components
Inputs
Unconnected signal tails are inputs
Outputs
Unconnected signal heads are outputs
Summary
[Figure: dataflow diagram for z = a + b + c + d + e + f with all annotations]
• Horizontal lines mark clock cycle boundaries
• Unconnected signal tails are inputs
• Signals crossing clock boundaries are flip-flops
• Blocks in clock cycles are datapath components
• Unconnected signal heads are outputs
2.6.2 Dataflow Diagrams, Hardware, and Behaviour
Primary Input
[Figure: dataflow diagram, hardware, and behaviour (waveform of clk, i, x) for a primary input]
Register Input
[Figure: dataflow diagram, hardware, and behaviour (waveform of clk, i, x) for a register input]
Register Signal
[Figure: dataflow diagram, hardware, and behaviour (waveform of clk, i1, i2, x) for an adder with inputs i1 and i2 and a registered signal x]
Combinational-Component Output
[Figure: dataflow diagram, hardware, and behaviour (waveform of clk, i1, i2, x) for an adder with inputs i1 and i2 and a combinational output x]
2.6.3 Dataflow Diagram Execution
Execution with Registers on Both Inputs and Outputs
[Figure: dataflow diagram for z = a + b + c + d + e + f (signals x1 to x5, clock cycles 0 to 6) beside a waveform of clk, a, x1, x2, x3, x4, x5 and z over clock cycles 0 to 6; successive slides step through the execution one clock cycle at a time]
Execution Without Output Registers
[Figure: dataflow diagram for z = a + b + c + d + e + f without output registers (signals x1 to x5, clock cycles 0 to 5) beside a waveform of clk, a, x1, x2, x3, x4, x5 and z over clock cycles 0 to 6]
2.6.4 Performance Estimation
Performance Equations
$$\text{Performance} \propto \frac{1}{\text{TimeExec}}$$
$$\text{TimeExec} = \text{Latency} \times \text{ClockPeriod}$$
Definition Latency: Number of clock cycles from inputs to outputs. A combinational circuit has latency of zero. A single register has a latency of one. A chain of n registers has a latency of n.
Performance of Dataflow Diagrams
• Latency: count horizontal lines in diagram
• Min clock period (Max clock speed) limited by longest path in a clock cycle
2.6.5 Area Estimation
• Maximum number of blocks in a clock cycle is total number of that component that are needed
• Maximum number of signals that cross a cycle boundary is total number of registers that are needed
• Maximum number of unconnected signal tails in a clock cycle is total number of inputs that are needed
• Maximum number of unconnected signal heads in a clock cycle is total number of outputs that are needed
These estimates give lower bounds.
Other constraints might force you to use more components.
Area Estimation
Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath components, might force you to make tradeoffs that increase the number of datapath components to decrease the overall area of the circuit.
• With some FPGA chips, a 2:1 multiplexer has the same area as an adder.
• With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell per bit.
• In FPGAs, registers are usually “free”, in that the area consumed by a circuit is limited by the amount of combinational logic, not the number of flip-flops.
2.6.6 Design Analysis
[Figure: dataflow diagram for z = a + b + c + d + e + f with signals x1 to x4]
num inputs
num outputs
num registers
num adders
min clock period
latency
Design Analysis (Cont’d)
[Figure: dataflow diagram for z = a + b + c + d + e + f with signals x1 to x5]
num inputs
num outputs
num registers
num adders
min clock period
latency
2.6.7 Area / Performance Tradeoffs
[Figure: two dataflow diagrams for z = a + b + c + d + e + f. Left: one add per clock cycle, clock cycles 0 to 6. Right: two adds per clock cycle, clock cycles 0 to 4.]
Note: In the “Two-add” design, half of the last clock cycle is wasted.
Two Adds per Clock Cycle
[Figure: dataflow diagram with two adds per clock cycle (signals x1 to x5, clock cycles 0 to 4) beside a waveform of clk, a, x1, x2, x3, x4, x5 and z over clock cycles 0 to 6]
Design Comparison
[Figure: the one-add-per-clock-cycle dataflow diagram (clock cycles 0 to 6) and the two-adds-per-clock-cycle dataflow diagram (clock cycles 0 to 4) side by side]
               One add per clock cycle   Two adds per clock cycle
inputs         6                         6
outputs        1                         1
registers      6                         6
adders         1                         2
clock period   flop + 1 add              flop + 2 add
latency        6                         4
Question: Under what circumstances would each design option be fastest?
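One way to approach the question is to plug the table into TimeExec = Latency × ClockPeriod. The symbols T_flop and T_add are introduced here for the flop and adder delays, and the sketch assumes each design runs at its own minimum clock period with no other timing constraints:

```latex
\[
  \text{TimeExec}_{\text{one-add}} = 6\,(T_{\text{flop}} + T_{\text{add}}),
  \qquad
  \text{TimeExec}_{\text{two-add}} = 4\,(T_{\text{flop}} + 2\,T_{\text{add}})
\]
\[
  4\,T_{\text{flop}} + 8\,T_{\text{add}} < 6\,T_{\text{flop}} + 6\,T_{\text{add}}
  \iff
  T_{\text{add}} < T_{\text{flop}}
\]
```

Under these assumptions the two-add design finishes sooner only when an adder is faster than the flop overhead; if the clock period is fixed by some other part of the system, the comparison reduces to latency alone.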
2.7 Design Example: Massey
This section reserved for your reading pleasure
2.8 Design Example: Vanier
We’ll go through the following artifacts:
1. requirements
2. algorithm
3. dataflow diagram
4. high-level models
5. hardware block diagram
6. RTL code for datapath
7. state machine
8. RTL code for control
Design Process
1. Scheduling (allocate operations to clock cycles)
2. I/O allocation
3. First high-level model
4. Register allocation
5. Datapath allocation
6. Connect datapath components, insert muxes where needed
7. Design implicit state machine
8. Optimize
9. Design explicit-current state machine
10. Optimize
2.8.1 Requirements
• Functional requirements: compute the following formula:
output = (a × d) + c + (d × b) + b
• Performance requirement:
– Max clock period: flop plus (2 adds or 1 multiply)
– Max latency: 4
• Cost requirements
– Maximum of two adders
– Maximum of two multipliers
– Unlimited registers
– Maximum of three inputs and one output
– Maximum of 5000 student-minutes of design effort
• Registered inputs and outputs
2.8.2 Algorithm
output = (a × d) + c + (d × b) + b
Create a data-dependency graph for the algorithm.
[Figure: data-dependency graph for the formula, with inputs a, b, c, d and output z]
2.8.3 Initial Dataflow Diagram
Schedule operations into clock cycles.
[Figure: initial dataflow diagram with inputs a, b, c, d and output z]
2.8.4 Reschedule to Meet Requirements
[Figure: rescheduled dataflow diagrams with inputs a, b, c, d and output z]
Fix Clock Period Violation
[Figure: two dataflow diagrams with inputs a, b, c, d and output z, illustrating the fix for the clock period violation]
2.8.5 Optimize Resources
[Figure: dataflow diagrams with inputs a, b, c, d and output z, illustrating resource optimization]
Analysis
[Figure: dataflow diagram with inputs a, b, c, d and output z]
Question: Should we move the second addition from the third clock cycle to the second?
Define Entity
Having finalized our input/output scheduling, we can write our entity. Note: we will add a reset signal later, when we design the state machine to control the datapath.
entity vanier is
port (
clk : in std_logic;
i_1, i_2 : in std_logic_vector(15 downto 0);
o_1 : out std_logic_vector(15 downto 0)
);
end vanier;
2.8.6 Assign Names to Registered Values
[Figure: dataflow diagram with inputs a, b, c, d and output z]
Question: Why do we not need to assign names to combinational signals?
Question: Why do we not need to assign a new name to x1, x2, and x4 the second time they cross a clock cycle boundary?
2.8.7 Input/Output Allocation
[Figure: dataflow diagram with inputs a, b, c, d, output z, and registered values named x1 to x8]
VHDL Code!
architecture hlm_v1 of vanier is
  signal x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8 : unsigned(15 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    x_1 <= unsigned(i_1);
    x_2 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_3 <= unsigned(i_1);
    x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
    x_5 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
    x_7 <= x_2 + x_5;
    wait until rising_edge(clk);
    x_8 <= x_6 + (x_4 + x_7);
  end process;
  o_1 <= std_logic_vector(x_8);
end hlm_v1;
[Figure: dataflow diagram with inputs i1 and i2, output o1 and registered values x1 to x8 (clock cycles 0 to 4), beside a waveform of i1, i2 and x1 to x8 over clock cycles 0 to 5 and a chart of registers r1 to r5 over clock cycles 0 to 5]
2.8.8 Tangent: Combinational Outputs
architecture hlm_v1c of vanier is
  signal x_1, x_2, x_3, x_4, x_5, x_6, x_7 : unsigned(15 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    x_1 <= unsigned(i_1);
    x_2 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_3 <= unsigned(i_1);
    x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
    x_5 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
    x_7 <= x_2 + x_5;
  end process;
  o_1 <= std_logic_vector(x_6 + (x_4 + x_7));
end hlm_v1c;
[Figure: dataflow diagram with inputs i1 and i2, combinational output o1 and registered values x1 to x7]
2.8.9 Register Allocation
[Figure: dataflow diagram with inputs i1, i2, output o1, and registered values x1 to x7]
New VHDL Code!
[Figure: dataflow diagram with the values x1 to x8 allocated to the registers r1 to r5]
architecture hlm_v2 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : unsigned(15 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    r_1 <= unsigned(i_1);
    r_2 <= unsigned(i_2);
    wait until rising_edge(clk);
    r_3 <= unsigned(i_1);
    r_4 <= r_1(7 downto 0) * r_2(7 downto 0);
    r_5 <= unsigned(i_2);
    wait until rising_edge(clk);
    r_2 <= r_3(7 downto 0) * r_1(7 downto 0);
    r_5 <= r_2 + r_5;
    wait until rising_edge(clk);
    r_5 <= r_2 + (r_4 + r_5);
  end process;
  o_1 <= std_logic_vector(r_5);
end hlm_v2;
2.8.10 Datapath Allocation
[Figure: dataflow diagram with values x1 to x8 and their register allocation r1 to r5]
2.8.11 Hardware Block Diagram and State Machine
1. Calculate number of states that are needed
2. Control signals for registers
• Chip enable
• Mux select on input
3. Control signals for datapath components
• Instruction (e.g. add/sub for ALU)
• Mux select on inputs
For our example: Use four states: S0..S3, one for each clock cycle.
2.8.11.1 Control for Registers
Build a table with one row per state, one column per register.
[Figure: dataflow diagram with registers r1 to r5, datapath components m1, a1, a2, and states S0 to S3]
      r1       r2       r3       r4       r5
      ce  d    ce  d    ce  d    ce  d    ce  d
S0
S1
S2
S3
Optimize chip enables and muxes
      r1       r2       r3       r4       r5
      ce  d    ce  d    ce  d    ce  d    ce  d
S0    1   i1   1   i2   –   –    –   –    –   –
S1    0   –    0   –    1   i1   1   m1   1   i2
S2    –   –    1   m1   –   –    0   –    1   a1
S3    –   –    –   –    –   –    –   –    1   a1
• Chip enable: a register holds a value for multiple clock cycles.
• Mux: a register loads values from multiple sources.
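The two bullets above can be sketched as a single register process. This is a minimal illustration, not part of the vanier design: the entity name reg_ce_mux and the ports ce, sel, d0, d1, and q are hypothetical names chosen for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: one 16-bit register with a chip enable and a
-- two-input multiplexer on its data input.
entity reg_ce_mux is
  port (
    clk : in  std_logic;
    ce  : in  std_logic;                      -- '0' holds the stored value
    sel : in  std_logic;                      -- '0' loads d0, '1' loads d1
    d0  : in  std_logic_vector(15 downto 0);
    d1  : in  std_logic_vector(15 downto 0);
    q   : out std_logic_vector(15 downto 0)
  );
end reg_ce_mux;

architecture rtl of reg_ce_mux is
begin
  process (clk) begin
    if rising_edge(clk) then
      if ce = '1' then          -- chip enable: update only when asserted
        if sel = '0' then       -- mux: choose the source to load
          q <= d0;
        else
          q <= d1;
        end if;
      end if;
    end if;
  end process;
end rtl;
```

A register that is written in every state needs no chip enable, and a register that always loads from the same source needs no mux, which is what the optimization of the control table exploits.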
Optimized Chip Enables and Muxes
      r1=i1   r2         r3=i1   r4=m1   r5
      ce      ce  d              ce      d
S0    1       1   i2             –       –
S1    0       0   –              1       i2
S2    –       1   m1             0       a1
S3    –       –   –              –       a1
2.8.11.2 Control for Datapath Components
• Table for datapath components.
• One row per state.
• One column per datapath component.
• Sub-columns for sources and instructions (e.g. add/sub for ALU).
[Figure: dataflow diagram with registers r1 to r5, datapath components m1, a1, a2, and states S0 to S3]
      a1            a2            m1
      src1  src2    src1  src2    src1  src2
S0    –     –       –     –       –     –
S1    –     –       –     –       r1    r2
S2    r2    r5      –     –       r3    r1
S3    r2    a2      r4    r5      –     –
Optimize Datapath Control Table
      a1            a2            m1
      src1  src2    src1  src2    src1  src2
S0    –     –       –     –       –     –
S1    –     –       –     –       r1    r2
S2    r2    r5      –     –       r1    r3
S3    r2    a2      r4    r5      –     –
2.8.11.3 Control for State
We need to control the transition from one state to the next. For this example, the transition is very simple: each state transitions to its successor: S0 → S1 → S2 → S3 → S0 ...
2.8.11.4 Complete State Machine Table
      r1 ce   r2 ce   r2 sel   r4 ce   r5 sel   a1 src2 sel   m1 src2 sel   next state
S0    1       1       i2       –       –        –             –             S1
S1    0       0       –        1       i2       –             r2            S2
S2    –       1       m1       0       a1       r5            r3            S3
S3    –       –       –        –       a1       a2            –             S0
Question: What values should we use for don’t cares?
“Don’t Cares” Instantiations
      r1 ce   r2 ce   r2 sel   r4 ce   r5 sel   a1 src2 sel   m1 src2 sel   next state
S0    1       1       i2       0       a1       a2            r3            S1
S1    0       0       m1       1       i2       a2            r2            S2
S2    1       1       m1       0       a1       r5            r3            S3
S3    1       1       m1       0       a1       a2            r3            S0
2.8.12 VHDL Code with Explicit State Machine
We chose a one-hot encoding of the state, which usually results in small and fast hardware for state machines with sixteen or fewer states.
architecture explicit_v1 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
  -- combinational datapath signals
  signal a_1, a_2, m_1, a1_src2, m1_src2 : std_logic_vector(15 downto 0);
  subtype state_ty is std_logic_vector(3 downto 0);
  constant s0 : state_ty := "0001";
  constant s1 : state_ty := "0010";
  constant s2 : state_ty := "0100";
  constant s3 : state_ty := "1000";
  signal state : state_ty;
begin
  ------------------------ r_1
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        r_1 <= i_1;
      end if;
    end if;
  end process;
  ------------------------ r_2
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        if state = S0 then
          r_2 <= i_2;
        else
          r_2 <= m_1;
        end if;
      end if;
    end if;
  end process;
  ------------------------ r_3
  process (clk) begin
    if rising_edge(clk) then
      r_3 <= i_1;
    end if;
  end process;
  ------------------------ r_4
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_4 <= m_1;
      end if;
    end if;
  end process;
  ------------------------ r_5
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_5 <= i_2;
      else
        r_5 <= a_1;
      end if;
    end if;
  end process;
  ------------------------ combinational datapath
  -- arithmetic uses ieee.numeric_std; a1 src1 is r2, as in the datapath control table
  with state select
    a1_src2 <= r_5 when S2,
               a_2 when others;
  with state select
    m1_src2 <= r_2 when S1,
               r_3 when others;
  a_1 <= std_logic_vector(unsigned(r_2) + unsigned(a1_src2));
  a_2 <= std_logic_vector(unsigned(r_4) + unsigned(r_5));
  m_1 <= std_logic_vector(unsigned(r_1(7 downto 0)) * unsigned(m1_src2(7 downto 0)));
  o_1 <= r_5;
  ------------------------ state machine
  -- reset is the input that is added to the entity for the state machine
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        case state is
          when S0 => state <= S1;
          when S1 => state <= S2;
          when S2 => state <= S3;
          when S3 => state <= S0;
          when others => state <= S0;  -- required: state_ty has non-one-hot values
        end case;
      end if;
    end if;
  end process;
  ----------------------
end explicit_v1;
Hardware Block Diagram
[Figure: dataflow diagram with states S0 to S3 next to the resulting hardware block diagram: inputs i1 and i2, registers r1 to r5, multiplier m1, and adders a1 and a2]
2.8.13 Peephole Optimizations
-- r_1
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      r_1 <= i_1;
    end if;
  end if;
end process;
-- r_1 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      r_1 <= i_1;
    end if;
  end if;
end process;
Peephole Optimizations
-- r_2
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      if state = S0 then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;
-- r_2 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      if state(0) = '1' then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;
Peephole Optimizations
-- state machine
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= S0;
    else
      case state is
        when S0 => state <= S1;
        when S1 => state <= S2;
        when S2 => state <= S3;
        when S3 => state <= S0;
        when others => state <= S0;
      end case;
    end if;
  end if;
end process;
-- state machine (optimized)
-- NOTE: "st" = "state"
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      st <= S0;
    else
      for i in 0 to 3 loop
        st( (i+1) mod 4 ) <= st( i );
      end loop;
    end if;
  end if;
end process;
2.8.14 Notes and Observations
Our functional requirements were written as:
output = (a × d) + (d × b) + b + c
Alternatively, we could have achieved exactly the same functionality with the functional requirements written as (the two statements are mathematically equivalent):
output = (a × d) + b + (d × b) + c
Data Dependency Graphs: Clean vs Ugly
The naive data dependency graph for the alternative formulation is much messier than the data dependency graph for the original formulation:
[Figure: data dependency graph for the original formulation, (a × d) + (d × b) + b + c]
[Figure: data dependency graph for the alternative formulation, (a × d) + c + (d × b) + b]
2.9 Pipelining
Pipelining is an optimization that increases performance by overlapping the execution of multiple parcels (instructions). The cost is an increase in area, because we cannot reuse datapath components, registers, inputs, or outputs.
2.9.1 Introduction to Pipelining
Review of unpipelined dataflow diagram
[Figure: unpipelined dataflow diagram with inputs a to f and output z, where a single adder add1 and the registers r1 and r2 are reused in each of clock cycles 0 to 5, and a waveform over clock cycles 0 to 13 showing parcel α on a, r1, and z]
Question: How soon can we start to execute β?
Pipelined dataflow diagram
• Each stage is treated as a separate dataflow diagram.
• Double line denotes boundary between stages.
[Figure: pipelined dataflow diagram with inputs a to f and output z, split into stages 1 to 5 with adders add1 to add5 and registers r1 to r10, and a waveform over clock cycles 0 to 13 showing parcel α moving through r1 (stage 1), r3 (stage 2), r5 (stage 3), r7 (stage 4), and r9 (stage 5) to z]
Question: How soon can we start to execute β?
Sequential (Unpipelined) Hardware
[Figure: sequential hardware with a single adder add1, registers r1 and r2, inputs i1 and i2, output o1, and state bits State(0) to State(4) with reset]
Pipelined Hardware
[Figure: pipelined hardware with five stages, adders add1 to add5, registers r1 to r10, inputs i1 to i6, and output o1]
Pipelined VHDL Code
-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1; r2 <= i2;
end process;
-- stage 2
process begin
  wait until rising_edge(clk);
  r3 <= r1 + r2; r4 <= i3;
end process;
-- stage 3
process begin
  wait until rising_edge(clk);
  r5 <= r3 + r4; r6 <= i4;
end process;
-- stage 4
process begin
  wait until rising_edge(clk);
  r7 <= r5 + r6; r8 <= i5;
end process;
-- stage 5
process begin
  wait until rising_edge(clk);
  r9 <= r7 + r8; r10 <= i6;
end process;
-- output
o1 <= r9 + r10;
2.9.2 Partially Pipelined
• Fully pipelined: throughput is one parcel per clock cycle.
• Partially pipelined: throughput is less than one parcel per clock cycle.
• Superscalar: throughput is more than one parcel per clock cycle.
[Figure: partially pipelined dataflow diagram with inputs a to f and output z, split into three stages with adders add1 to add3 and registers r1 to r6, and a waveform over clock cycles 0 to 13 for a, r1 (stage 1), r3 (stage 2), r5 (stage 3), and z]
Question: How do we execute α followed by β?
Hardware for Partially Pipelined
[Figure: partially pipelined hardware with three stages, adders add1 to add3, registers r1 to r6, inputs i1 and i2, output o1, and state bits State(0) and State(1) with reset]
2.9.3 Terminology
Definition Depth: The depth of a pipeline is the number of stages on the longest path through the pipeline.
Definition Latency: The latency of a pipeline is measured the same as for an unpipelined circuit: the number of clock cycles from inputs to outputs.
Definition Throughput: The number of parcels consumed or produced per clock cycle.
Definition Upstream/downstream: Because parcels flow through the pipeline analogously to water in a stream, the terms upstream and downstream are used respectively to refer to earlier and later stages in the pipeline. For example, stage1 is upstream from stage2.
Definition Bubble: When a pipe stage is empty (contains invalid data), it is said to contain a "bubble".
Question: How do we know whether the output of the pipeline is a bubble or is valid data?
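One common way to approach this question is to carry a valid bit alongside the data, with one bit per pipe stage. The sketch below assumes a three-stage pipeline; the entity name valid_chain and the internal signal v are hypothetical, while i_valid and o_valid follow the naming used in the design examples of this chapter.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: valid bits for a three-stage pipeline.
-- v(k) = '1' means that stage k currently holds valid data (not a bubble).
entity valid_chain is
  port (
    clk     : in  std_logic;
    reset   : in  std_logic;
    i_valid : in  std_logic;   -- asserted when the pipeline inputs are valid
    o_valid : out std_logic    -- asserted when the pipeline output is valid
  );
end valid_chain;

architecture rtl of valid_chain is
  signal v : std_logic_vector(1 to 3);
begin
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        v <= (others => '0');        -- after reset every stage is a bubble
      else
        v <= i_valid & v(1 to 2);    -- each valid bit moves one stage downstream
      end if;
    end if;
  end process;
  o_valid <= v(3);
end rtl;
```

The valid bits move in lock step with the data registers, so the environment only needs to look at o_valid to tell a bubble from real data.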
2.10 Design Example: Pipelined Massey
Requirements
Functional requirements:
• Compute the sum of six 8-bit numbers:
  output = a + b + c + d + e + f
• Registered inputs, combinational outputs
Performance requirements:
• Maximum clock period: unlimited
• Maximum latency: four
Cost requirements:
• Maximum of five adders
• Small miscellaneous hardware (e.g. muxes) is unlimited
• Maximum of six inputs and one output
• Design effort is unlimited
Initial Dataflow Diagrams
[Figure: original dataflow diagram with inputs a to f, five adders, and output z]
[Figure: final unpipelined dataflow diagram with inputs a to f, five adders, and output z]
Dataflow Diagram Exploration
[Figure: variation on the original dataflow, with inputs a to f, five adders, and output z]
[Figure: pipelined dataflow diagram with inputs a to f, five adders, output z, and the valid signals i_valid and o_valid]
VHDL Code
-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1; r2 <= i2; r3 <= i3; r4 <= i4; v1 <= i_valid;
end process;
a1 <= r1 + r2; a2 <= r3 + r4;
-- stage 2
process begin
  wait until rising_edge(clk);
  r5 <= a1; r6 <= a2; r7 <= i5; r8 <= i6; v2 <= v1;
end process;
a3 <= r5 + r6; a4 <= r7 + r8;
-- stage 3
process begin
  wait until rising_edge(clk);
  r9 <= a3; r10 <= a4; v3 <= v2;
end process;
a5 <= r9 + r10;
-- outputs
z <= a5;
o_valid <= v3;
2.11 Memory Arrays and RTL Design
2.11.1 Memory Operations
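Read of Memory with Registered Inputs
[Figure: hardware: memory M with ports WE, A, DI, and DO, connected to the signals we, a, and do and clocked by clk]
[Figure: behaviour: waveform for clk, a, we, and do, with the labels αa, M(αa), and αd]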
Write to Memory with Registered Inputs
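[Figure: hardware: memory M with ports WE, A, DI, and DO, connected to the signals we, a, di, and do and clocked by clk]
[Figure: behaviour: waveform for clk, a, we, di, and do, with the labels αa, αd, and M(αa)]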
Dual-Port Memory with Registered Inputs
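[Figure: hardware: dual-port memory M with ports WE, A0, DI0, DO0, A1, and DO1, connected to the signals we, a0, di0, do0, a1, and do1 and clocked by clk]
[Figure: behaviour: waveform for clk, a0, we, di0, do0, a1, and do1, with the labels αa, αd, M(αa), βa, M(βa), and βd]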
Sequence of Memory Operations
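[Figure: hardware: dual-port memory M with ports WE, A0, DI0, DO0, A1, and DO1, connected to the signals we, a0, di0, do0, a1, and do1 and clocked by clk]
[Figure: behaviour: waveform for a sequence of operations on clk, a0, we, di0, do0, a1, and do1, with the labels αa, αd, βa, βd, γa, γd1, γd2, θa, θd, M(αa), M(βa), M(γa), and M(θa)]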
2.11.2 Memory Arrays in VHDL
This section reserved for your reading pleasure
2.11.3 Data Dependencies
Definition of Three Types of Dependencies
Read after Write (True dependency):    M[i] := ...   followed by   ... := M[i]
Write after Write (Load dependency):   M[i] := ...   followed by   M[i] := ...
Write after Read (Anti dependency):    ... := M[i]   followed by   M[i] := ...
Instructions in a program can be reordered, so long as the data dependencies are preserved.
Purpose of Dependencies
W0:             R3 := ...
W1 (producer):  R3 := ...
R1 (consumer):  ... := ... R3 ...
W2:             R3 := ...
• WAW ordering prevents W0 from happening after W1
• RAW ordering prevents R1 from happening before W1
• WAR ordering prevents W2 from happening before R1
Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose in ensuring that producer-consumer relationships are preserved.
Ordering of Memory Operations
Data Dependencies
Initial memory contents: M[3] = 30, M[2] = 20, M[1] = 10, M[0] = 0
Initial Program
  M[2] := 21
  M[3] := 31
  A := M[2]
  B := M[0]
  M[3] := 32
  M[0] := 01
  C := M[3]
Data Dependencies (Cont’d)
[Figure: the initial program next to a reordering of its statements labelled "Valid Modification"]
Data Dependencies (Cont’d)
[Figure: the initial program next to a reordering of its statements labelled "Valid (or Bad?) Modification"]
2.11.4 Memory and Dataflow Diagrams
Legend for Dataflow Diagrams
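[Figure: legend of dataflow diagram symbols for an input port, an output port, a state signal, an array read (name(rd)), and an array write (name(wr))]
Basic Memory Operations
Memory Read:  data := mem[addr];
Memory Write: mem[addr] := data;
[Figure: dataflow diagrams for a memory read (mem(rd) with inputs mem and addr, output data, and a mem output labelled "anti-dependency") and for a memory write (mem(wr) with inputs mem, data, and addr, and output mem)]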
Dataflow Diagrams and Data Dependencies
Read after Write Dependencies
Algo: mem[wr_addr] := data_in;
      data_out := mem[rd_addr];
[Figure: dataflow diagram in which mem(wr), with inputs data_in and wr_addr, feeds mem(rd), with input rd_addr and output data_out]
Read after Write
Read after Write Optimization
Algo: mem[wr_addr] := data_in;
      data_out := mem[rd_addr];
[Figure: dataflow diagram with mem(wr) (inputs data_in and wr_addr) and mem(rd) (input rd_addr, output data_out)]
Optimization when rd_addr ≠ wr_addr
Write after Write Dependencies
Algo: mem[wr1_addr] := data1;
      mem[wr2_addr] := data2;
[Figure: dataflow diagram in which the first mem(wr), with inputs data1 and wr1_addr, feeds the second mem(wr), with inputs data2 and wr2_addr]
Write after Write
Write after Write Scheduling Option
Algo: mem[wr1_addr] := data1;
      mem[wr2_addr] := data2;
[Figure: dataflow diagram for Write after Write, with the two mem(wr) operations in sequence]
[Figure: dataflow diagram for the scheduling option when wr1_addr ≠ wr2_addr]
Write after Read Dependencies
Algo: rd_data := mem[rd_addr];
      mem[wr_addr] := wr_data;
[Figure: dataflow diagram in which mem(rd), with input rd_addr and output rd_data, precedes mem(wr), with inputs wr_addr and wr_data]
Write after Read
Write after Read Optimization
Algo: rd_data := mem[rd_addr];
      mem[wr_addr] := wr_data;
[Figure: dataflow diagram with mem(rd) (input rd_addr, output rd_data) and mem(wr) (inputs wr_addr and wr_data)]
Optimization when rd_addr ≠ wr_addr
2.11.5 Ex: Mem Array and Dataflow Diagram
[Figure: dataflow diagram of the memory operations M(wr) and M(rd) for the program below, with each operation labelled by its statement number]
  1. M[2] := 21
  2. M[3] := 31
  3. A := M[2]
  4. B := M[0]
  5. M[3] := 32
  6. M[0] := 01
  7. C := M[3]
Dependencies for Known Addresses
[Figure: dataflow diagram of the example program showing the dependencies for known addresses]
Anti-Dependencies for Known Addresses
[Figure: dataflow diagram of the example program showing the anti-dependencies for known addresses]
Minimal Dependencies
[Figure: dataflow diagram of the example program]
Memory array with minimal dependencies
Memory Array with Orderings
[Figure: dataflow diagram of the example program with numbered orderings]
Memory array with orderings
Place Operations in Clock Cycles
[Figure: dataflow diagram of the example program with the memory operations placed in clock cycles]
Final Dataflow Diagram
[Figure: final dataflow diagram of the example program]
Final version of DFD
2.12 Input / Output Protocols
This section reserved for your reading pleasure
2.13 Example: Moving Average
In this section we will design a circuit that performs a moving average as it receives a stream of data. When each new data item is received, the output is the average of the four most recently received data.
Time     0  1  2  3  4  5  6  7  8  9  10
i_data   2  3  5  6  6  0  2  2  5  3  1
o_avg             4  5  4  3
2.13.1 Requirements and Environmental Assumptions
1. Input data is sent sporadically, with at least 2 clock cycles of bubbles (invalid data) between valid data.
2. When the input data is valid, the signal i_valid is asserted for exactly one clock cycle.
3. Input data will be 8-bit signed numbers.
4. When output data is ready, o_valid shall be asserted.
5. The output data (o_avg) shall be the average of the four most recently received input data. Output numbers shall be truncated to integer values.
2.13.2 Algorithm
Generic equation with input data x_i:
  avg_i = (x_{i-3} + x_{i-2} + x_{i-1} + x_i) / 4
Decompose into sum and avg:
  sum_i = x_{i-3} + x_{i-2} + x_{i-1} + x_i
  avg_i = sum_i / 4
Look for patterns and potential optimizations:
  sum_5 = x_2 + (x_3 + x_4 + x_5)
  sum_6 = (x_3 + x_4 + x_5) + x_6
        = sum_5 - x_2 + x_6
Generalized recurrence equation:
  sum_i = sum_{i-1} - x_{i-4} + x_i
  avg_i = sum_i / 4
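As a quick check of the recurrence, the worked example below applies it to the sample stream from the start of this section (2, 3, 5, 6, 6, 0, 2, ...). It assumes that the samples are indexed from x_0 = 2 and that the division truncates, as the requirements state; the results match the o_avg values 4, 5, 4, 3 of the example.

```latex
\[
\begin{array}{lcl}
sum_3 = x_0 + x_1 + x_2 + x_3 = 2 + 3 + 5 + 6 = 16 & \quad & avg_3 = \lfloor 16/4 \rfloor = 4 \\
sum_4 = sum_3 - x_0 + x_4 = 16 - 2 + 6 = 20 & & avg_4 = \lfloor 20/4 \rfloor = 5 \\
sum_5 = sum_4 - x_1 + x_5 = 20 - 3 + 0 = 17 & & avg_5 = \lfloor 17/4 \rfloor = 4 \\
sum_6 = sum_5 - x_2 + x_6 = 17 - 5 + 2 = 14 & & avg_6 = \lfloor 14/4 \rfloor = 3
\end{array}
\]
```

Each update needs one subtraction and one addition, instead of the three additions of the direct sum.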
Summary of Behaviour
1. Define a signal new for the value of i_data each time that i_valid is '1'.
2. Define a memory array M to store a sliding window of the four most recent values of i_data.
3. Define a signal old for the oldest data value from the sliding window.
4. Update sum_i with sum_{i-1} - old_i + new_i
Sliding Window
Two design patterns to choose from: shift register vs circular buffer
[Figure: contents of M[3..0], old, and new as the values α, β, γ, δ, ε, ζ, η, ι, κ, λ arrive, shown for a shift register and for a circular buffer]
For FIFO behaviour, a circular buffer is usually preferred: it is smaller and uses less power.
Sliding Window with Registers
[Figure: four 8-bit registers M[0] to M[3] with chip enables ce[0] to ce[3] derived from we and addr, a shared 8-bit data input d, and an output multiplexer selected by idx[0] to idx[3] that drives q]
Register array with chip-enables and decoded multiplexer
2.13.3 Pseudocode and Dataflow Diagrams
First Pseudocode
Real 3-address pseudocode
new = i_data
old = M[idx]
tmp = sum - old
sum = tmp + new
M[idx] = new
idx = idx rol 1
o_avg = sum/4
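[Figure: dataflow diagram for the first pseudocode, with inputs sum, i_data, M, and idx, a read (Rd) and a write (Wr) of M, the signals old, tmp, and new, a rotate of idx by 1, and a wired shift that produces o_avg from sum]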
Remove intermediate signal old
new = i_data
tmp = sum - M[idx]
sum = tmp + new
M[idx] = new
idx = idx rol 1
o_avg = sum/4
Reading new from memory
tmp = sum - M[idx]
M[idx] = i_data
new = M[idx]
sum = tmp + new
idx = idx rol 1
o_avg = sum/4
Remove intermediate signal new
tmp = sum - M[idx]
M[idx] = i_data
sum = tmp + M[idx]
idx = idx rol 1
o_avg = sum/4
Data-dependency graph after removing new
[Figure: data-dependency graph with inputs i_data, sum, M, and idx, two reads (Rd) and one write (Wr) of M, the signals old, tmp, and new, a rotate of idx by 1, and a wired shift that produces o_avg]
Dataflow Diagram
[Figure: dataflow diagram scheduled with a latency of three clock cycles (states S0, S1, S2)]
[Figure: dataflow diagram scheduled with a latency of two clock cycles (states S0, S1)]
Two clock cycles is potentially preferable for performance, but requires an additional multiplexer.
Latency of two clock cycles with registered address
[Figure: dataflow diagram with a latency of two clock cycles (states S0, S1) in which idx is registered before it addresses M]
Removes need for multiplexer on address input to circular buffer
Register and Datapath Allocation
[Figure: dataflow diagram with the registers sum and idx, the memory M, the adder/subtracter as1 used in both S0 and S1, and the rotate component rol]
2.13.4 Control Tables and State Machine
[Figure: dataflow diagram with the registers sum and idx, the memory M, the adder/subtracter as1, and the rotate component rol]
Register control table
      M                idx        sum
      we  addr  d      ce  d      ce  d
S0    1   idx   x      0   –      1   as1
S1    0   idx   –      1   rol    1   as1
Datapath control table
      as1                 rol
      sub  src1  src2     src1  src2
S0    0    M     sum      –     –
S1    1    sum   M        idx   1
Optimized control table
      M     idx   as1
      we    ce    sub
S0    1     1     0
S1    0     0     1
Static assignments in control table
M.addr = idx
M.d = x
idx.d = rol
sum.d = as1
as1.src1 = sum
as1.src2 = M
Control Table and Bubbles
Almost final control table
        M     idx   sum   as1
        we    ce    ce    sub
S0      1     0     1     0
S1      0     1     1     1
idle    0     0     0     –
Final control table
        M     idx   sum   as1
        we    ce    ce    sub
S0      1     0     1     0
S1      0     1     1     1
idle    0     0     0     0
Static assignments
M.addr = idx
M.d = x
idx.d = rol
sum.d = as1
as1.src1 = sum
as1.src2 = M
State Machine
        i_valid   valid1
S0      1         0
S1      0         1
idle    0         0
Final control table with state encoding
        state                M     idx   sum   as1
        i_valid   valid1     we    ce    ce    sub
S0      1         0          1     0     1     0
S1      0         1          0     1     1     1
idle    0         0          0     0     0     0
M.we = i_valid
idx.ce = valid1
sum.ce = i_valid OR valid1
as1.sub = valid1
2.13.5 VHDL Code
-- valid bits
process begin
  wait until rising_edge(clk);
  valid1  <= i_valid;
  o_valid <= valid1;
end process;
-- idx
process begin
  wait until rising_edge(clk);
  if reset = '1' then
    idx <= "0001";
  else
    if valid1 = '1' then
      idx <= idx rol 1;
    end if;
  end if;
end process;
-- sliding window
process begin
  wait until rising_edge(clk);
  for i in 3 downto 0 loop
    if (i_valid = '1') and (idx(i) = '1') then
      M(i) <= i_data;
    end if;
  end loop;
end process;
mem_out <= M(0) when idx(0) = '1'
      else M(1) when idx(1) = '1'
      else M(2) when idx(2) = '1'
      else M(3);
-- add sub
add_sub <= sum - mem_out when valid1 = '1'
      else sum + mem_out;
-- sum
process begin
  wait until rising_edge(clk);
  if i_valid = '1' or valid1 = '1' then
    sum <= add_sub;
  end if;
end process;
Hardware
[Figure: hardware block diagram with inputs i_data and i_valid, the valid1 register, the memory M addressed by idx, the add/sub unit, the sum register, chip enables on M, idx, and sum, and the outputs o_avg (through a wired shift) and o_valid]
Chapter 3
Performance Analysis and Optimization
3.1 Introduction
Hennessy and Patterson's Computer Architecture: A Quantitative Approach (textbook for E&CE 429) has good information on performance. We will use some of the same definitions and formulas as Hennessy and Patterson, but we will move away from generic definitions of performance for computer systems and focus on performance for digital circuits.
3.2 Defining Performance
Performance = Work / Time
You can double your performance by:
doing twice the work in the same amount of time
OR doing the same amount of work in half the time
Benchmarking
Performance = Work / Time
Measuring time is easy, but how do we accurately measure work?
The game of benchmarketing is finding a definition of work that makes your systemappear to get the most work done in the least amount of time.
Measure of Work      Measure of Performance
clock cycle          MHz
instruction          MIPs
synthetic program    Whetstone, Dhrystone, D-MIPs (Dhrystone MIPs)
real program         SPEC
travel 1/4 mile      drag race
SPEC Benchmarks
The SPEC benchmarks are among the most respected and accurate predictors of real-world performance.
Definition SPEC: Standard Performance Evaluation Corporation. MISSION: "To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems" (http://www.spec.org).
The SPEC organization has different benchmarks for integer software, floating-point software, web-serving software, etc.
3.3 Comparing Performance
3.3.1 General Equations
Equation for “Big is n% greater than Small”:
  n% = (Big - Small) / Small
Using the "n% greater" formula, the phrase "The performance of A is n% greater than the performance of B" is:
  n% = (Performance_A - Performance_B) / Performance_B
Performance is inversely proportional to time:
  Performance = 1 / Time
Substituting the above equation into the equation for "the performance of A is n% greater than the performance of B" gives:
  n% = (Time_B - Time_A) / Time_A
In general, the equation for a fast system to be "n%" faster than a slow system is:
  n% = (T_Slow - T_Fast) / T_Fast
Another useful formula is the average time to do one of k different tasks, each of which happens %_i of the time and takes an amount of time T_i to do each time it is done.
  T_Avg = sum over i = 1 to k of (%_i)(T_i)
We can measure the performance of practically anything (cars, computers, vacuum cleaners, printers, ...).
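To make the two formulas above concrete, here is a small worked example. All of the numbers are hypothetical and chosen only for illustration: a slow system that takes 10 s and a fast system that takes 8 s for the same work, and a workload of two tasks that occur 75% and 25% of the time and take 2 s and 6 s respectively.

```latex
\[
\begin{array}{rcl}
n\% & = & \dfrac{T_{Slow} - T_{Fast}}{T_{Fast}}
      = \dfrac{10\,\mathrm{s} - 8\,\mathrm{s}}{8\,\mathrm{s}} = 25\% \\[2ex]
T_{Avg} & = & \displaystyle\sum_{i=1}^{2} (\%_i)(T_i)
      = (0.75)(2\,\mathrm{s}) + (0.25)(6\,\mathrm{s}) = 3\,\mathrm{s}
\end{array}
\]
```

Note that the denominator is the time of the fast system: the fast system is 25% faster, even though it takes only 20% less time.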
3.3.2 Example: Performance of Printers
This section reserved for your reading pleasure
3.4 Clock Speed, CPI, Program Length, and Performance
3.4.1 Mathematics
CPI           Cycles per instruction
NumInsts      Number of instructions
ClockSpeed    Clock speed
ClockPeriod   Clock period
  Time = NumInsts × CPI × ClockPeriod
  Time = (NumInsts × CPI) / ClockSpeed
3.4.2 Example: CISC vs RISC and CPI
                   Clock Speed   SPECint
AMD Athlon         1.1GHz        409
Fujitsu SPARC64    675MHz        443
The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu SPARC64 is a RISC microprocessor (it uses Sun's Sparc instruction set). Assume that it requires 20% more instructions to write a program in the Sparc instruction set than the same program requires in IA-32.
SPECint and Performance
                   Clock Speed   SPECint
AMD Athlon         1.1GHz        409
Fujitsu SPARC64    675MHz        443
Question: Which of the two processors has higher performance?
Relative CPI
Question: What is the ratio between the CPIs of the two microprocessors?
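One way to set up this calculation is sketched below. It assumes that the SPECint score is proportional to performance (and therefore inversely proportional to time), and it uses the stated assumption that the Sparc program needs 1.2 times as many instructions as the IA-32 program. The subscripts A (Athlon) and S (SPARC64) are shorthand introduced here.

```latex
\[
\begin{array}{rcl}
\dfrac{Perf_S}{Perf_A} & = & \dfrac{Time_A}{Time_S}
  = \dfrac{NumInsts_A \times CPI_A \,/\, ClockSpeed_A}{NumInsts_S \times CPI_S \,/\, ClockSpeed_S} \\[2ex]
\dfrac{CPI_A}{CPI_S} & = & \dfrac{Perf_S}{Perf_A} \times \dfrac{NumInsts_S}{NumInsts_A} \times \dfrac{ClockSpeed_A}{ClockSpeed_S}
  = \dfrac{443}{409} \times 1.2 \times \dfrac{1100}{675} \approx 2.12
\end{array}
\]
```

Under these assumptions the Athlon needs roughly twice as many clock cycles per instruction as the SPARC64, which is consistent with CISC instructions doing more work each.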
Absolute CPI
Question: Can you determine the absolute (actual) CPI of either microprocessor?
3.4.3 Effect of Instruction Set on Performance
Your group designs a microprocessor and you are considering adding a fused multiply-accumulate to the instruction set. (A fused multiply-accumulate is a single instruction that does both a multiply and an addition. It is often used in digital signal processing.)
Your studies have shown that, on average, half of the multiply operations are followed by an add instruction that could be done with a fused multiply-add.
Additionally, you know:
         cpi           %
ADD      0.8 CPIavg    15%
MUL      1.2 CPIavg    5%
Other    1.0 CPIavg    80%
Options
You have three options:
option 1 : no change
option 2 : add the MAC instruction, increase the clock period by 20%, and MAC has the same CPI as MUL.
option 3 : add the MAC instruction, keep the clock period the same, and the CPI of a MAC is 50% greater than that of a multiply.
Question: Which option will result in the highest overall performance?
3.4.4 Effect of Time to Market on Relative Performance
Assume that performance of the average product in your market segment doubles every 18 months.
You are considering an optimization that will improve the performance of your product by 7%.
Question: If you add the optimization, how much can you allow your schedule to slip before the delay hurts your relative performance compared to not doing the optimization and launching the product according to your current schedule?
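A sketch of the break-even calculation, assuming that the market's performance grows smoothly and exponentially (doubling every 18 months) and writing t for the schedule slip in months: the optimization pays off as long as its 7% gain exceeds the growth of the market during the slip.

```latex
\[
\begin{array}{rcl}
2^{\,t/18} & = & 1.07 \\[1ex]
t & = & 18 \times \log_2(1.07) = 18 \times \dfrac{\ln 1.07}{\ln 2} \approx 18 \times 0.0976 \approx 1.76 \ \mathrm{months}
\end{array}
\]
```

Under this model a slip of more than about 1.76 months would leave the optimized product further behind the market than the unoptimized product launched on time.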
3.4.5 Summary of Equations
3.5 Performance Analysis and Dataflow Diagrams
3.5.1 Dataflow Diagrams, CPI, and Clock Speed
• One of the challenges in designing a circuit is to choose the clock speed.
• Choosing a clock period affects many aspects of the design, not just the overall performance.
• Some goals will push you toward a short clock period
• Some goals will push you toward a long clock period
Goal                                                      Action    Effect
Minimize area
Increase scheduling flexibility
Decrease percentage of clock cycle spent in flops
  (overhead: time in flops is not doing useful work)
Decrease time to execute an instruction
Outline to Choose Clock Period
Outline of plan to find optimal latency and clock period for maximum performance:
1. Start with smallest possible clock period.
2. Allocate operations to clock cycles
3. Calculate average time to execute an instruction.
4. If latency > 1, then: increase the clock period until the latency is reduced; return to Step 2. Else (latency = 1): choose the clock period and dataflow diagram that resulted in the highest performance.
5. Optimize dataflow diagram to reduce area.
3.5.2 Examples of Dataflow Diagrams for Two Instructions
• Circuit supports two instructions, A and B.
• Each instruction occurs 50% of the time.
• The delay through a register is 5ns.
• Find clock period and dataflow diagram to maximize overall performance.
Instruction A (operations in dataflow order): f (30 ns), g (50 ns), h (20 ns), g (50 ns)
Instruction B (operations in dataflow order): i (40 ns), g (50 ns)
3.5.2.1 Scheduling of Operations for Different Clock Periods
Scheduling (1)
[Figure: instructions A (f, g, h, g) and B (i, g) scheduled into 55ns clock cycles, with annotations of 25 ns for Instr A and 15 ns for Instr B]
Scheduling (2)
[Figure: scheduling diagrams for two further clock periods, each annotated with 25 ns and 15 ns]
Scheduling (3)
[Figure: scheduling diagram for a further clock period, annotated with 25 ns and 15 ns]
3.5.2.2 Performance Computation for Different Clock Periods
Question: Which clock speed will result in the highest overall performance?
Clock Period   CPI_A   CPI_B   T_avg
55ns
75ns
85ns
95ns
155ns
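One way to fill in this table is sketched below. It assumes that the operations of each instruction form a serial chain in the order listed (f, g, h, g for A and i, g for B), that operations can be chained within a clock cycle, and that the 5ns register delay is subtracted from every clock period, so that a 55ns clock leaves 50ns for combinational logic. With equal frequencies, T_avg = 0.5 × CPI_A × ClockPeriod + 0.5 × CPI_B × ClockPeriod.

```latex
\[
\begin{array}{r|c|c|l}
\mathrm{Clock\ period} & CPI_A & CPI_B & T_{avg} \\ \hline
55\,\mathrm{ns}  & 4 & 2 & 0.5(4)(55) + 0.5(2)(55) = 165\,\mathrm{ns} \\
75\,\mathrm{ns}  & 3 & 2 & 0.5(3)(75) + 0.5(2)(75) = 187.5\,\mathrm{ns} \\
85\,\mathrm{ns}  & 2 & 2 & 0.5(2)(85) + 0.5(2)(85) = 170\,\mathrm{ns} \\
95\,\mathrm{ns}  & 2 & 1 & 0.5(2)(95) + 0.5(1)(95) = 142.5\,\mathrm{ns} \\
155\,\mathrm{ns} & 1 & 1 & 0.5(1)(155) + 0.5(1)(155) = 155\,\mathrm{ns}
\end{array}
\]
```

Under these assumptions the 95ns clock gives the lowest average time, because it is the shortest period at which instruction B completes in a single cycle while A still needs only two.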
3.5.2.3 Example: Two Instructions Taking Similar Time
Question: For the flow below, which clock speed will result in the highest overall performance?
A       B
30ns    40ns
50ns    50ns
20ns    40ns
50ns
Clock Period   CPI_A   CPI_B   T_avg
ns
ns
ns
ns
ns
ns
3.5.2.4 Example: Same Total Time, Different Order for A
Question: For the flow below, which clock speed will result in the highest overall performance?
A       B
30ns    40ns
20ns    50ns
50ns    40ns
50ns
Clock Period   CPI_A   CPI_B   T_avg
ns
ns
ns
ns
3.5.3 Example: From Algorithm to Opti-mized Dataflow
This question involves doing some of the design work for a circuit that implementsInstP and InstQ using the components described below.
Instruction Algorithm Frequence of OccurrenceInstP a×b× ((a×b)+(b×d)+ e) 75%InstQ (i+ j + k + l)×m 25%
Component Delays2-input Mult 40ns2-input Add 25nsRegister 5ns
NOTES
• There is a resource limitation of a maximum of 3 input ports. (There are no other resource limitations.)
• You must put registers on your inputs; you do not need to register your outputs.
• The environment will directly connect your outputs (its inputs) to registers.
• Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once. If you need to use a value in multiple clock cycles, you must store it in a register.
Questions
Question: What clock period will result in the best overall performance?
Question: Find a minimal set of resources that will achieve the performance you calculated.
3.6 General Optimizations
3.6.1 Strength Reduction
Strength reduction replaces one operation with another that is simpler.
3.6.1.1 Arithmetic Strength Reduction
Multiply by a constant power of two    wired shift logical left
Multiply by a power of two             shift logical left
Divide by a constant power of two      wired shift logical right
Divide by a power of two               shift logical right
Multiply by 3                          wired shift and addition
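The "multiply by 3" row can be sketched in VHDL as follows. This is a minimal illustration, not code from the course: the entity name mul3, the 8-bit input width, and the 10-bit output width are assumptions chosen so that 3 × 255 fits without overflow.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: z = 3 * a using a wired shift and one adder.
entity mul3 is
  port (
    a : in  unsigned(7 downto 0);
    z : out unsigned(9 downto 0)   -- 3 * 255 = 765 fits in 10 bits
  );
end mul3;

architecture main of mul3 is
  signal a_x2 : unsigned(9 downto 0);
begin
  -- Wired shift left by one: append a constant '0' on the right.
  -- This is only wiring; no gates are needed.
  a_x2 <= '0' & a & '0';
  -- 3*a = 2*a + a: a single adder replaces a general multiplier.
  z <= a_x2 + resize(a, 10);
end main;
```

The design choice is to widen both operands to the result width before adding, so the carry out of the addition is kept.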
3.6.1.2 Boolean Strength Reduction
Boolean tests that can be implemented as wires
• is odd, is even
• is neg, is pos
By choosing your encodings carefully, you can sometimes reduce a vector comparison to a wire.
For example, if your state uses a one-hot encoding, then the comparison state = S3 reduces to state(3) = '1'. You might expect a reasonable logic-synthesis tool to do this reduction automatically, but most tools do not.
When using encodings other than one-hot, Karnaugh maps can be useful tools for optimizing vector comparisons. By carefully choosing our state assignments, when we use a full binary encoding for 8 states, the comparison:
(state = S0 or state = S3 or state = S4) = '1'
can be reduced from looking at 3 bits, to looking at just 2 bits. If we have a condition that is true for four states, then we can find an encoding that looks at just 1 bit.
3.6.2 Replication and Sharing
3.6.2.1 Mux-Pushing
Pushing multiplexors into the fanin of a signal can reduce area.
Before
z <= a + b when (w = '1')
     else a + c;
After
tmp <= b when (w = '1')
       else c;
z <= a + tmp;
The first circuit will have two adders, while the second will have one adder. Some synthesis tools will perform this optimization automatically, particularly if all of the signals are combinational.
3.6.2.2 Common Subexpression Elimination
Introduce new signals to capture subexpressions that occur in multiple places in the code.
Before
y <= a + b + c when (w = '1')
     else d;
z <= a + c + d when (w = '1')
     else e;
After
tmp <= a + c;
y <= b + tmp when (w = '1')
     else d;
z <= d + tmp when (w = '1')
     else e;
Note: Clocked subexpressions. Care must be taken when doing common subexpression elimination in a clocked process. Putting the "temporary" signal in the clocked process will add a clock cycle to the latency of the computation, because the tmp signal will be a flip-flop. The tmp signal must be combinational to preserve the behaviour of the circuit.
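A minimal sketch of the note above, assuming a hypothetical entity cse_clocked with 8-bit unsigned operands (neither is from the original slides). The shared subexpression is assigned outside the clocked process, so it stays combinational and y and z keep their original one-cycle latency.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity for illustration only.
entity cse_clocked is
  port (
    clk           : in  std_logic;
    w             : in  std_logic;
    a, b, c, d, e : in  unsigned(7 downto 0);
    y, z          : out unsigned(7 downto 0)
  );
end cse_clocked;

architecture main of cse_clocked is
  signal tmp : unsigned(7 downto 0);
begin
  -- tmp is a concurrent assignment, so it is combinational (no flip-flop).
  tmp <= a + c;

  -- Only y and z are registered; both use the shared tmp in the same cycle.
  process begin
    wait until rising_edge(clk);
    if w = '1' then
      y <= b + tmp;
      z <= d + tmp;
    else
      y <= d;
      z <= e;
    end if;
  end process;
end main;
```

If the assignment to tmp were moved inside the clocked process, tmp would become a flip-flop and y and z would see the sum one cycle late.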
3.6.2.3 Computation Replication
• To improve performance
  – If the same result is needed at two very distant locations and wire delays are significant, it might improve performance (increase clock speed) to replicate the hardware.
• To reduce area
  – If the same result is needed at two different times that are widely separated, it might be cheaper to reuse the hardware component to repeat the computation than to store the result in a register.
Note: Muxes are not free. Each time a component is reused, multiplexors are added to inputs and/or outputs. Too much sharing of a component can cost more area in additional multiplexors than would be spent in replicating the component.
3.6.3 Arithmetic
VHDL is left-associative. The expression a + b + c + d is interpreted as (((a + b) + c) + d). You can use parentheses to suggest parallelism.
Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller and faster design than computing all 16 bits of the result and trimming the result to 12 bits.
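Both tips can be sketched together. The entity name add4 and the port names are assumptions for illustration; the trimming is safe here because the low 12 bits of a sum depend only on the low 12 bits of the operands.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: sum of four 16-bit inputs, low 12 bits only.
entity add4 is
  port (
    a, b, c, d : in  unsigned(15 downto 0);
    z          : out unsigned(11 downto 0)
  );
end add4;

architecture main of add4 is
  signal a12, b12, c12, d12 : unsigned(11 downto 0);
begin
  -- Trim the inputs first, so that every adder is 12 bits wide.
  a12 <= a(11 downto 0);
  b12 <= b(11 downto 0);
  c12 <= c(11 downto 0);
  d12 <= d(11 downto 0);

  -- Parentheses suggest a balanced tree: two adders in parallel feeding a
  -- third, instead of the three-adder chain implied by a + b + c + d.
  z <= (a12 + b12) + (c12 + d12);
end main;
```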
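3.7 Retiming
[Figure: circuit with signals state, a, b, c, sel, x, y, z; the critical path is marked.]
[Figure: waveform for state (S0, S1, S2, S3, repeating), a, b, c, sel, x, y, z, with data values α, β, γ and result α+γ.]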
process begin
wait until rising_edge(clk);
if state = S1 then
z <= a + c;
else
z <= b + c;
end if;
end process;
Retimed Circuit and Waveform
[Figure: retimed circuit with signals state, a, b, c, sel, x, y, z.]
[Figure: waveform for state (S0, S1, S2, S3, repeating), a, b, c, sel, x, y, z, with data values α, β, γ.]
process (state) begin
  if state = S1 then
    sel <= '1';
  else
    sel <= '0';
  end if;
end process;
process begin
  wait until rising_edge(clk);
  if sel = '1' then
    ... -- code for z
  end if;
end process;

process begin
  wait until rising_edge(clk);
  if state = ____ then
    sel <= '1';
  else
    sel <= '0';
  end if;
end process;
process begin
  wait until rising_edge(clk);
  if sel = '1' then
    ... -- code for z
  end if;
end process;
Chapter 4
Functional Verification
4.1 Overview
4.1.1 Terminology: Validation / Verification / Testing
4.1.2 The Difficulty of Designing Correct Chips
4.1.2.1 Notes from Kenn Heinrich (UW E&CE grad)
"Everyone should get a lecture on why their first industrial design won't work in the field."
Note: There are six reasons in your notes.
4.1.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)
More than 60% of the ASIC designs that are fabricated have at least one error, issue, or problem whose severity forced the design to be reworked.
Note: There is a pretty picture in your notes.
4.2 Test Cases and Coverage
4.2.1 Coverage
To be absolutely certain that an implementation is correct, we must check every combination of values. This includes both input values and internal state (flip-flops).
If we have n_i bits of inputs and n_s bits in flip-flops, we have to test 2^(n_i + n_s) different cases when doing functional verification.
Question: If we have n_c combinational signals, why don't we have to test 2^(n_i + n_s + n_c) different cases?
4.2.2 Floating Point Divider Example
This example illustrates the difficulty of achieving significant coverage on realistic circuits.
Consider doing the functional simulation for a double precision (64-bit) floating-point divider.
Given Information
Data width                                                              64 bits
Number of gates in circuit                                              10 000
Number of assembly-language instructions to simulate one gate
  for one test case                                                     100
Number of clock cycles required to execute one assembly-language
  instruction on the computer that is running the simulation            0.5
Clock speed of computer that is running the simulation                  1 Gigahertz
Number of Cases
Question: How many cases must be considered?
width=64b, gates=10 000, instrs/gate=100, cycles/instr=0.5, cycles/sec=10^9
Simulation Run Time
Question: How long will it take to simulate all of the different possible cases using a single computer?
width=64b, gates=10 000, instrs/gate=100, cycles/instr=0.5, cycles/sec=10^9
Coverage
Question: If you can run simulations non-stop for one year on ten computers, what coverage will you achieve?
width=64b, gates=10 000, instrs/gate=100, cycles/instr=0.5, cycles/sec=10^9
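A hedged back-of-the-envelope sketch for the three questions above. It assumes (this is not part of the given information) that the divider takes two 64-bit operands and has no internal state, so that n_i = 128 and n_s = 0, and that a year is about 3.15 × 10^7 seconds:

```latex
\begin{aligned}
\text{cases} &= 2^{128} \approx 3.4\times10^{38} \\
\text{time per case} &= \frac{10^{4}\ \text{gates}\times 100\ \tfrac{\text{instrs}}{\text{gate}}\times 0.5\ \tfrac{\text{cycles}}{\text{instr}}}{10^{9}\ \tfrac{\text{cycles}}{\text{s}}} = 5\times10^{-4}\ \text{s} \\
\text{total time} &\approx 3.4\times10^{38}\times 5\times10^{-4}\ \text{s} \approx 1.7\times10^{35}\ \text{s} \approx 5.4\times10^{27}\ \text{years} \\
\text{coverage} &\approx \frac{10\times 3.15\times10^{7}\ \text{s}}{1.7\times10^{35}\ \text{s}} \approx 1.9\times10^{-27}
\end{aligned}
```

Under these assumptions, a year on ten computers covers a vanishingly small fraction of the input space.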
Simulation vs the Real World
From Validating the Intel(R) Pentium(R) Microprocessor by Bob Bentley, Design Automation Conference 2001. (Link on E&CE 327 web page.)
• Simulating the Pentium 4 Processor on a Pentium 3 Processor ran at about 15 MHz.
• By tapeout, over 200 billion simulation cycles had been run on a network of computers.
• All of these simulations represent less than two minutes of running a real processor.
4.3 Testbenches
4.3.1 Overview of Test Benches
[Figure: testbench block diagram containing stimulus, implementation, specification, and check.]
Implementation: Circuit that you're checking for bugs. Also known as "design under test" or "unit under test".
Stimulus: Generates test vectors.
Specification: Describes desired behaviour of implementation.
Check: Checks whether implementation obeys specification.
4.3.2 Reference Model Style Testbench
[Figure: reference model testbench containing stimulus, implementation, and specification.]
4.3.3 Relational Style Testbench
[Figure: relational testbench containing stimulus, implementation, and check.]
4.3.4 Coding Structure of a Testbench
[Figure: testbench containing stimulus, implementation, specification, and check.]
architecture main of athabasca_tb is
  component declaration for implementation;
  other declarations
begin
  implementation instantiation;
  stimulus process;
  specification process (or component instantiation);
  check process;
end main;
4.3.5 Datapath vs Control
Datapath and control circuits tend to use different styles of testbenches.
[Figure: reference model testbench (stimulus, implementation, specification) and relational testbench (stimulus, implementation, check).]
4.3.6 Verification Tips
Suggested order of simulation for functional verification.
1. Write high-level model.
2. Simulate high-level model until it has correct functionality and latency.
3. Write synthesizable model.
4. Use zero-delay simulation (uw-sim) to check behaviour of synthesizable model against high-level model.
5. Optimize the synthesizable model.
6. Use zero-delay simulation (uw-sim) to check behaviour of optimized model against high-level model.
7. Use timing simulation (uw-timsim) to check behaviour of optimized model against high-level model.
Section 4.4 describes a series of testbenches that are particularly useful for debugging datapath circuits in the early phases of the design cycle.
4.4 Functional Verification for Datapath Circuits
In this section we will incrementally develop a testbench for a very simple circuit: an AND gate.
Implementation
library ieee;
use ieee.std_logic_1164.all;

entity and2 is
  port (
    a, b : in std_logic;
    c : out std_logic
  );
end and2;

architecture main of and2 is
begin
  c <= '1' when (a = '1' AND b = '1')
       else '0';
end main;
4.4.1 A Spec-Less Testbench
First, use a waveform viewer to check that the implementation generates reasonable outputs for a small set of inputs.
entity and2_tb is
end and2_tb;

architecture main_tb of and2_tb is
  component and2 ... end component;
  signal ta, tb, tc_impl : std_logic;
  signal ok : boolean;
begin
  ---------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ---------------------------------------------
  stimulus : process
  begin
    ta <= '0'; tb <= '0';
    wait for 10 ns;
    ta <= '1'; tb <= '1';
    wait for 10 ns;
  end process;
  ---------------------------------------------
end main_tb;
4.4.2 Use an Array for Test Vectors
architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    type test_datum_ty is record
      ra, rb : std_logic;
    end record;
    type test_vectors_ty is
      array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      -- a b
      ( ( '0', '0'),
        ( '1', '1')
      );
  begin
    for i in test_vectors'low to test_vectors'high loop
      ta <= test_vectors(i).ra;
      tb <= test_vectors(i).rb;
      wait for 10 ns;
    end loop;
  end process;
end main_tb;
4.4.3 Build Spec into Stimulus
stimulus : process
  type test_datum_ty is record
    ra, rb, rc : std_logic;
  end record;
  type test_vectors_ty is
    array(natural range <>) of test_datum_ty;
  constant test_vectors : test_vectors_ty :=
    -- a, b: inputs
    -- c : expected output
    -- a b c
    ( ( '0', '0', '0'),
      ( '0', '1', '0'),
      ( '1', '1', '1')
    );
begin
  for i in test_vectors'low to test_vectors'high loop
    ta <= test_vectors(i).ra;
    tb <= test_vectors(i).rb;
    tc_spec <= test_vectors(i).rc;
    wait for 10 ns;
  end loop;
end process;
Build Spec into Stimulus (Cont'd)
stimulus : process
  ...
begin
  for i in test_vectors'low to test_vectors'high loop
    ta <= test_vectors(i).ra;
    tb <= test_vectors(i).rb;
    tc_spec <= test_vectors(i).rc;
    wait for 10 ns;
  end loop;
end process;
------------------------------------------
check : process (tc_impl, tc_spec)
begin
  ok <= (tc_impl = tc_spec);
end process;
------------------------------------------
end main_tb;
4.4.4 Have Separate Specification Entity
entity and2_spec is
  ...(same as and2 entity)...
end and2_spec;

architecture spec of and2_spec is
begin
  c <= a AND b;
end spec;
Testbench for Separate Specification
architecture main_tb of and2_tb is
  component and2 ...;
  component and2_spec ...;
  signal ta, tb, tc_impl, tc_spec : std_logic;
  signal ok : boolean;
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  spec : and2_spec port map (a => ta, b => tb, c => tc_spec);
  ------------------------------------------
  stimulus process...
  check process...
end
Testbench for Separate Spec (Cont’d)
stimulus : process
  ...
  constant test_vectors : test_vectors_ty :=
    -- a b
    ( ( '0', '0'),
      ( '1', '1')
    );
begin
  for i in test_vectors'low to test_vectors'high loop
    ta <= test_vectors(i).ra;
    tb <= test_vectors(i).rb;
    wait for 10 ns;
  end loop;
end process;
------------------------------------------
check : process (tc_impl, tc_spec)
begin
  ok <= (tc_impl = tc_spec);
end process;
------------------------------------------
end main_tb;
4.4.5 Generate Test Vectors Automatically
architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    subtype std_test_ty is std_logic range '0' to '1';
  begin
    for va in std_test_ty'low to std_test_ty'high loop
      for vb in std_test_ty'low to std_test_ty'high loop
        ta <= va;
        tb <= vb;
        wait for 10 ns;
      end loop;
    end loop;
  end process;
  ...
end main_tb;
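The fragments above elide most of the testbench. A complete, self-contained sketch of the exhaustive approach might look like the following. The testbench name and2_auto_tb, the use of direct entity instantiation, and the assert-based check are assumptions made for illustration, not the course's reference code.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Design under test: the AND gate from section 4.4.
entity and2 is
  port (a, b : in std_logic; c : out std_logic);
end and2;

architecture main of and2 is
begin
  c <= '1' when (a = '1' and b = '1') else '0';
end main;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench name; applies all four '0'/'1' input combinations.
entity and2_auto_tb is
end and2_auto_tb;

architecture main_tb of and2_auto_tb is
  subtype std_test_ty is std_logic range '0' to '1';
  signal ta, tb, tc_impl : std_logic;
begin
  impl : entity work.and2 port map (a => ta, b => tb, c => tc_impl);

  stimulus : process
  begin
    for va in std_test_ty loop
      for vb in std_test_ty loop
        ta <= va;
        tb <= vb;
        wait for 10 ns;
        -- Check once the new inputs have propagated to the output.
        assert tc_impl = (va and vb)
          report "error: and2 output does not match spec" severity error;
      end loop;
    end loop;
    report "and2: all four input combinations applied";
    wait;  -- suspend forever so that the simulation can end
  end process;
end main_tb;
```

Here the expected value is computed inline with the built-in and operator, which plays the role of the specification.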
4.4.6 Relational Specification
Sometimes we want to check a relationship between the output and the input, rather than check that the output has a specific value.
To do this, we drop the spec process, and put the brains into the check process.
architecture main_tb of and2_tb is
  ...
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ------------------------------------------
  stimulus : process
    ...
  end process;
  ------------------------------------------
  check : process (ta, tb, tc_impl)
  begin
    ok <= NOT (tc_impl = '1' AND (ta = '0' OR tb = '0'));
  end process;
  ------------------------------------------
end main_tb;
4.5 Functional Verification of Control Circuits
Control circuits are often more challenging to verify than datapath circuits.
In this section, we will explore the functional verification of state machines via a First-In First-Out queue.
4.5.1 Overview of Queues in Hardware
[Figure: Structure of queue, with write and read ports.]
[Figure: Write Sequence. Starting from an empty queue, Write 1 and Write 2 show A being written.]
[Figure: A Second Example Write. Write 1 and Write 2 show B being written after A.]
[Figure: Example Read Sequence. Read 1 and Read 2 on a queue holding A and B.]
[Figure: Write Illustrating Index Wrap. J is written to a queue holding B through I.]
[Figure: Write Illustrating Full Queue. K is written to a queue holding B through J.]
[Figure: Queue Signals: empty, mem, wr_idx, rd_idx, data_wr, data_rd, do_wr, do_rd.]
[Figure: Incomplete Queue Blocks. Memory with ports WE, A0, DI0, DO0, A1, DO1, connected to empty, mem, wr_idx, rd_idx, data_wr, data_rd, do_wr, do_rd. Control circuitry not shown.]
4.5.2 VHDL Coding
4.5.2.1 Package
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package queue_pkg is
subtype data is std_logic_vector(3 downto 0);
function to_data(i : integer) return data;
end queue_pkg;
package body queue_pkg is
function to_data(i : integer) return data is
begin
return std_logic_vector(to_unsigned(i, 4));
end to_data;
end queue_pkg;
4.5.2.2 Other VHDL Coding
This section reserved for your reading pleasure
4.5.3 Code Structure for Verification
Verification things to notice in queue implementation:
1. instrumentation code
2. coverage monitors
3. assertions
architecture ... is
  ...
begin
  ... normal implementation ...
  process (clk)
  begin
    if rising_edge(clk) then
      ... instrumentation code ...
      prev_signame <= signame;
    end if;
  end process;
  ... assertions ...
  ... coverage monitors ...
end;
4.5.4 Instrumentation Code
• Added to implementation to support verification
• Usually keeps track of previous values of signals
• Does not create hardware (optimized away during synthesis)
• Does not feed any output signals
• Must use synthesizable subset of VHDL
process (clk) begin
if rising_edge(clk) then
prev_rd_idx <= rd_idx;
prev_wr_idx <= wr_idx;
prev_do_rd <= do_rd;
prev_do_wr <= do_wr;
end if;
end process;
Coverage Events for Queue
Question: What events should we monitor to estimate the coverage of our functional tests?
Coverage Monitor Template
process (signals read)
begin
  if (condition) then
    report "coverage: message";
  elsif (condition) then
    report "coverage: message";
  else
    report "error: case fall through on message"
      severity warning;
  end if;
end process;
Coverage Monitor Code
Events related to rd_idx equals wr_idx.
process (prev_rd_idx, prev_wr_idx, rd_idx, wr_idx)
begin
if (rd_idx = wr_idx) then
if ( prev_rd_idx = prev_wr_idx ) then
report "coverage: read = write both moved";
elsif ( rd_idx /= prev_rd_idx ) then
report "coverage: Read caught write";
elsif ( wr_idx /= prev_wr_idx ) then
report "coverage: Write caught read";
else
report "error: case fall through on rd/wr catching"
severity warning;
end if;
end if;
end process;
Coverage Monitor Code
Events related to rd_idx wrapping.
process (rd_idx)
begin
if (rd_idx = low_idx) then
report "coverage: rd mv to low";
elsif (rd_idx = high_idx) then
report "coverage: rd mv to high";
else
report "coverage: rd mv normal";
end if;
end process;
4.5.5 Assertions
Assertions for Queue
1. If rd_idx changes, then it increments or wraps.
2. If rd_idx changes, then do_rd was '1', or reset is '1'.
3. If wr_idx changes, then it increments or wraps.
4. If wr_idx changes, then do_wr was '1', or reset is '1'.
5. And many others...
Assertion Template
process (signals read) begin
  assert (required condition)
    report "error: message" severity warning;
end process;
Assertions: Read Index
process (rd_idx) begin
  assert ((rd_idx > prev_rd_idx) or (rd_idx = low_idx))
    report "error: rd inc" severity warning;
  assert ((prev_do_rd = '1') or (reset = '1'))
    report "error: rd imp do_rd" severity warning;
end process;
Assertions: Write Index
process (wr_idx) begin
  assert ((wr_idx > prev_wr_idx) or (wr_idx = low_idx))
    report "error: wr inc" severity warning;
  assert ((prev_do_wr = '1') or (reset = '1'))
    report "error: wr imp do_wr" severity warning;
end process;
4.5.6 VHDL Coding Tips
Vector Type Declaration
type data_array_ty is array(natural range <>) of data;
signal data_array : data_array_ty(7 downto 0);
Functions
function to_idx
(i : natural range data_array’low to data_array’high)
return idx_ty
is
begin
return to_unsigned(i, idx_ty’length);
end to_idx;
Conversion to Index
Without Function:  rd_idx <= to_unsigned(5, 3);
With Function:     rd_idx <= to_idx(5);
The function code is verbose, but is very maintainable, because neither the function itself nor uses of the function need to know the width of the index vector.
Attributes
function inc_idx (idx : idx_ty) return idx_ty is
begin
if idx < data_array’high then
return (idx + 1);
else
return (to_idx(data_array’low));
end if;
end inc_idx;
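The two functions above rely on declarations that the slides do not show. A self-contained sketch that compiles on its own is given below. The package name idx_pkg, the constants low_idx_c and high_idx_c (standing in for data_array'low and data_array'high), and the 3-bit definition of idx_ty are all assumptions for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical package; the constants replace data_array'low / data_array'high.
package idx_pkg is
  constant low_idx_c  : natural := 0;
  constant high_idx_c : natural := 7;
  subtype idx_ty is unsigned(2 downto 0);  -- assumed: 3 bits index 8 entries

  function to_idx (i : natural range low_idx_c to high_idx_c) return idx_ty;
  function inc_idx (idx : idx_ty) return idx_ty;
end idx_pkg;

package body idx_pkg is
  -- Convert a natural to an index; the width comes from idx_ty itself.
  function to_idx (i : natural range low_idx_c to high_idx_c) return idx_ty is
  begin
    return to_unsigned(i, idx_ty'length);
  end to_idx;

  -- Increment an index, wrapping from the highest entry back to the lowest.
  function inc_idx (idx : idx_ty) return idx_ty is
  begin
    if idx < high_idx_c then
      return idx + 1;
    else
      return to_idx(low_idx_c);
    end if;
  end inc_idx;
end idx_pkg;
```

Changing the queue depth then only requires editing the constants and idx_ty; callers such as wr_idx <= inc_idx(wr_idx) stay unchanged.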
Feedback Loops, and Functions
Coding guideline: use functions. Don't use procedures.
inc as fun:   wr_idx <= inc_idx(wr_idx);
inc as proc:  inc_idx(wr_idx);
Functions clearly distinguish between reading from a signal and writing to a signal. By examining the use of a procedure, you cannot tell which signals are read from and which are written to. You must examine the declaration or implementation of the procedure to determine modes of signals.
Modifying a signal within a procedure results in a tri-state signal. This is bad.
File I/O (textio package)
TEXTIO defines the read, write, readline, and writeline procedures.
Described in:
• http://www.eng.auburn.edu/department/ee/mgc/vhdl.html#textio
These procedures can be used to read test vectors from a file and write results to a file.
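A minimal sketch of reading test vectors with textio, under stated assumptions: the file name vectors.txt is hypothetical, each line of the file is assumed to hold two bits separated by a space (for example "0 1"), and the entity name textio_tb is made up for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use std.textio.all;

-- Hypothetical testbench that drives ta and tb from a text file.
entity textio_tb is
end textio_tb;

architecture main_tb of textio_tb is
  signal ta, tb : std_logic := '0';
begin
  stimulus : process
    -- Assumed file name and format: one "a b" pair of bits per line.
    file     vec_file : text open read_mode is "vectors.txt";
    variable vec_line : line;
    variable va, vb   : bit;
  begin
    while not endfile(vec_file) loop
      readline(vec_file, vec_line);  -- fetch one line of the file
      read(vec_line, va);            -- parse the first bit
      read(vec_line, vb);            -- parse the second bit
      ta <= to_stdulogic(va);
      tb <= to_stdulogic(vb);
      wait for 10 ns;
    end loop;
    wait;  -- all vectors applied
  end process;
end main_tb;
```

The values are read as type bit because std.textio provides read for bit directly; to_stdulogic from std_logic_1164 converts them for the std_logic signals.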
4.5.7 Queue Specification
Most bugs in queues are related to the queue becoming full, becoming empty, and/or wrap of indices.
Specification should be "obviously correct". Avoid bugs in the specification by making the specification queue larger than the maximum number of writes that we will do in the test suite. Thus, the specification queue will never become full or wrap. However, the implementation queue will become full and wrap.
Write Index Update in Specification
We increment the write index on every write; we never wrap.
process (clk) begin
  if rising_edge(clk) then
    if (reset = '1') then
      wr_idx <= 0;
    elsif (do_wr = '1') then
      wr_idx <= wr_idx + 1;
    end if;
  end if;
end process;
Things to Notice
Things to notice in queue specification:
1. don't care conditions ('-')
2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)
Don't Care
rd_data <= data_array(rd_idx) when (do_rd = '1')
           else (others => '-');
4.5.8 Queue Testbench
Things to notice in queue testbench:
1. running multiple test sequences
2. uninitialized data 'U'
3. std_match to compare spec and impl data
0 ∼ 0
0 ∼ L
1 ∼ 1
1 ∼ H
- ∼ everything
everything else ≁ everything
With equality, '-' ≠ '1', but we want to use '-' to mean "don't care" in the specification. The solution is to use std_match, rather than =, to check implementation signals against the specification.
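The difference between = and std_match can be seen in a small self-checking sketch. The entity name match_demo and the two constant values are assumptions for illustration; std_match itself comes from ieee.numeric_std.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;  -- provides std_match

-- Hypothetical demo: compare an implementation value against a spec value
-- that contains don't-care bits.
entity match_demo is
end match_demo;

architecture main of match_demo is
  -- Assumed example values: the spec leaves bits 2 and 0 as don't care.
  constant data_spec : std_logic_vector(3 downto 0) := "1-0-";
  constant data_impl : std_logic_vector(3 downto 0) := "1100";
begin
  check : process
  begin
    -- Plain equality treats '-' as an ordinary value, so "1100" = "1-0-" is false.
    assert not (data_impl = data_spec)
      report "unexpected: '=' treated '-' as a wildcard" severity error;
    -- std_match treats '-' as matching any value, so the comparison succeeds.
    assert std_match(data_impl, data_spec)
      report "error: impl does not match spec" severity error;
    report "std_match check done";
    wait;
  end process;
end main;
```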
Stimulus Process Structure
The stimulus process runs multiple test vectors in a single simulation run.
stimulus : process
  type test_datum_ty is
    record
      r_reset, ... normal fields ...
    end record;
  type test_vectors_ty is
    array(natural range <>) of test_datum_ty;
  constant test_vectors : test_vectors_ty :=
    ( -- reset ... other signal ...
      ( '1', normal fields), -- test case 1
      ( '0', normal fields),
      ...
      ( '1', normal fields), -- test case 2
      ( '0', normal fields),
      ...
    );
begin
  for i in test_vectors'range loop
    if (test_vectors(i).r_reset = '1') then
      ... reset code ...
    end if;
    reset <= '0';
    ... normal sequence ...
    wait until rising_edge(clk);
  end loop;
end process;
4.6 Example: Microwave Oven
This question concerns the VHDL code microwave, which controls a simple microwave oven; the properties prop1...prop3; and two proposed changes to the VHDL code.
INSTRUCTIONS:
1. Assume that the code as currently written is correct: any change to the code that causes a change to the behaviour of the signals heat or count is a bug.
2. For each of the two proposed code changes, answer whether the code change will cause a bug.
3. If the code change will cause a bug, provide a test case that will exercise the bug and identify all of the given properties (prop1, prop2, and prop3) that will detect the bug with the test case you provide.
4. If none of the three properties can detect the bug, provide a property of your own that will detect the bug with the test case you provide.
Question: For each of the three properties prop1...prop3, answer whether the property is best checked as part of a testbench or assertion. For each property, justify why a testbench or an assertion is the best method to validate that property.
prop1 If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.
prop2 If the door is open, then heat is off.
prop3 If start is not pushed, reset is false, and count is greater than zero, then count is decremented.
Implementation
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity microwave is
  port (
    timer            -- time input from user
      : in unsigned(7 downto 0);
    reset,           -- resets microwave
    clk,             -- clock signal input
    is_open,         -- detects when door is open
    start            -- start button input from user
      : in std_logic;
    heat : out std_logic  -- 1=on, 0=off
  );
end microwave;

architecture main of microwave is
  signal count : unsigned(7 downto 0);  -- internal time count
  signal x_heat : std_logic;
begin
  -- heat process ------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        x_heat <= '0';
      elsif (is_open = '0') and (start = '1') and   -- region of
            (timer > 0)                             -- change #2
      then                                          --
        x_heat <= '1';                              --
      elsif (is_open = '0') and (count > 0) then    --
        x_heat <= x_heat;                           --
      else
        x_heat <= '0';
      end if;
    end if;
  end process;
  -- count process ------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        count <= to_unsigned(0, 8);
      elsif (start = '1') then      -- region of
        count <= timer;             -- change #1
      elsif (count > 0) then        --
        count <= count - 1;         --
      end if;
    end if;
  end process;
  heat <= x_heat;
end main;
Properties
prop1 If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.
prop2 If the door is open, then heat is off.
prop3 If start is not pushed, reset is false, and count is greater than zero, then count is decremented.
Change #1
From:
elsif (start = '1') then
  count <= timer;
elsif (count > 0) then
  count <= count - 1;
To:
elsif (count > 0) then
  count <= count - 1;
elsif (start = '1') then
  count <= timer;
Change #2
From:
elsif (is_open = '0') and (start = '1') and (timer > 0)
then x_heat <= '1';
elsif (is_open = '0') and (count > 0)
then x_heat <= x_heat;
To:
elsif (is_open = '0')
      and ((start = '1') or (count > 0))
then x_heat <= '1';
else x_heat <= '0';
Coverage
Question: If msb of src1 is '1' and lsb of src2 is '0' or sum(3) is '1', then result is wrong. What is the minimum coverage needed to detect the bug? What is the minimum coverage needed to guarantee that the bug will be detected?
Chapter 5
Timing Analysis
5.1 Delays and Definitions
In this section we will look at the different timing parameters of circuits. Our focus will be on those parameters that limit the maximum clock speed at which a circuit will work correctly.
5.1.1 Background Definitions
This section reserved for your reading pleasure
5.1.2 Clock-Related Timing Definitions
5.1.2.1 Clock Skew
[Figure: clock distribution and waveforms for clk1, clk2, clk3, and clk4, showing skew.]
Definition Clock Skew: The difference in arrival times for the same clock edge at different flip-flops.
Clock skew is caused by the difference in interconnect delays to different points on the chip.
Clock Tree Design
Clock tree design is critical in high-performance designs to minimize clock skew. Sophisticated synthesis tools put lots of effort into clock tree design, and the techniques for clock tree design still generate PhD theses.
5.1.2.2 Clock Latency
[Figure: master clock, intermediate clock, and final clock along the clock tree, with waveforms showing latency.]
Definition Clock Latency: The difference in arrival times for the same clock edge at different levels of interconnect along the clock tree. (Intuitively, "different points in the clock generation circuitry.")
Note: Clock latency. Clock latency does not affect the limit on the minimum clock period.
5.1.2.3 Clock Jitter
[Figure: waveforms of an ideal clock and a clock with jitter.]
Definition Clock Jitter: Difference between actual clock period and ideal clock period.
Causes of Clock Jitter
Clock jitter is caused by:
• temperature and voltage variations over time
• temperature and voltage variations across different locations on a chip
• manufacturing variations between different parts
5.1.3 Storage-Related Timing Definitions
5.1.3.1 Flops and Latches
[Figure: Flop Behaviour. Waveforms of d, clk, and q.]
[Figure: Latch Behaviour. Waveforms of d, clk, and q.]
Storage devices have two modes: load mode and store mode.
Flops are edge sensitive; they are in load mode just before the clock edge.
Latches are level sensitive; they are in load mode while their enable signal is asserted high (low for active-low latches).
Timing Parameters
[Figure: waveforms of d, clk, and q for a flip-flop, an active-high latch, and an active-low latch, marking Setup, Hold, and Clock-to-Q with data values α and β.]
Setup and hold define the window in which input data are required to be constant in order to guarantee that the storage device will store data correctly.
Clock-to-Q defines the delay from the clock edge to when the output is guaranteed to be stable.
5.1.4 Propagation Delays
Propagation delay: time it takes a signal to travel from the source (driving) flop to the destination flop
propagation delay = load delay + interconnect delay
Load delay: combinational gates between the flops
Interconnect delay: wires between gates and flops
5.1.5 Timing Constraints
5.1.5.1 Minimum Clock Period
[Figure: two flops clocked by clk1 and clk2 with signals a and b; waveforms mark where each signal may change, is stable, may rise, or may fall, and the clock period.]
ClockPeriod >
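For reference, the constraint usually given in textbooks for this situation is sketched below. It ignores clock skew and jitter, and the symbol names are assumptions rather than the slide's own notation: the data launched at one clock edge must leave the source flop, traverse the logic and wires, and be stable a setup time before the next edge.

```latex
T_{clk} \;\ge\; t_{clock\text{-}to\text{-}Q} + t_{prop} + t_{setup}
```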
5.1.5.2 Hold Constraint
5.1.5.3 Example Timing Violations
[Figure: Good Timing. Circuit with signals a, b, c, d and clk; waveforms with data values α, β, γ mark Clock-to-Q, Prop, Setup, and Hold.]
[Figure: Setup Violation. Same circuit and waveforms; the value captured on c is unknown (?α?β?).]
[Figure: Hold Violation. Same circuit and waveforms, marking Clock-to-Q, Prop, and Hold; the captured value is unknown (?β?γ?).]
5.2 Timing Analysis of Latches and FlipFlops
In this section, we show how to find the clock-to-Q, setup, and hold times for latches,flip-flops, and other storage elements.
5.2.1 Simple Multiplexer Latch
416 CHAPTER 5. TIMING ANALYSIS
5.2.1.1 Structure and Behaviour of Multi-plexer Latch
i o
clk
Loading / pass-through mode
i o
’1’
Storage mode
5.2.1 Simple Multiplexer Latch 417
Unfold Multiplexer to Simple Gates
i o
’0’
ab
s
o
Multiplexer: symbol and implementation
i o
clka
sel
b
o
Latch implementation
418 CHAPTER 5. TIMING ANALYSIS
Latch Glitching
d clk
o
Note: inverters on clk Both of the inverters on the clk signalare needed. Together, they prevent a glitch on the OR gate whenclk is deasserted. If there was only one inverter, a glitch wouldoccur. For more on this, see section 5.2.1.6
Loading and Storing Values
[Figure: gate-level latch with inputs d and clk and output o, annotated with signal values in four cases: loading ’0’ (d=’0’, clk=’1’), loading ’1’ (d=’1’, clk=’1’), storing ’0’ (clk=’0’, o=’0’), and storing ’1’.]
5.2.1.2 Strategy for Timing Analysis of Storage Devices
The key to calculating setup and hold times of a latch, flop, etc. is to identify:
1. how the data is stored when not connected to the input (often a pair of inverters in a loop)
2. the gate(s) that the clock uses to cause the stored data to drive the output (often a transmission gate or multiplexer)
3. the gate(s) that the clock uses to cause the input to drive the output (often a transmission gate or multiplexer)
5.2.1.3 Clock-to-Q Time of a Multiplexer Latch
[Figure: gate-level latch with signals clk, d, cn, c2, l1, l2, s1, s2, qn, and q, shown in six panels.]
5.2.1.4 Setup Timing of a Multiplexer Latch
[Figure: the latch circuit annotated with signal values at successive time steps while the value α on d is stored. Panel captions:]
• Circuit is stable in load mode
• t=0: Clk transitions from load to store
• t=1: Clk transitions from load to store
• t=2: s1 propagates to s2, because cn turns on AND gate
• t=3: l2 is set to 0, because c2 turns off AND gate
• t=4: α from store path propagates to q
• t=5: α from store path completes cycle
Setup Violation
[Figure: the latch circuit annotated with signal values at successive time steps when d changes from ω to α one time unit before the clock edge. Panel captions:]
• Circuit is stable in load mode with ω
• t=-1: D transitions from ω to α
• t=0: α propagates through inverter; Clk transitions from load to store
• t=1: α propagates through AND; Clk propagates through inverter
• t=2: old ω propagates through AND
• t=3: l2 is set to 0, because c2 turns off AND gate
• t=4: ω/α from store path propagates to q
• t=5: ω/α from store path completes cycle
• t=5: Illustrate instability with ω=0, α=1
Trouble: inconsistent values on load path and store path. Old value (ω) still in store path when store path is enabled.
[Figure: waveforms of d, l1, l2, qn, q, s1, s2, clk, cn, and c2 for t = -3 to 6, labelled "setup with negative margin".]
We now repeat the analysis of setup violation, but illustrate the minimum violation (input transitions from ω to α 3 time-units before the clock edge).
[Figure: the latch circuit annotated with signal values at successive time steps. Panel captions:]
• Circuit is stable in load mode with ω
• t=-3: D transitions from ω to α
• t=-2: α propagates through inverter
• t=-1: α propagates through AND
• t=0: Clk transitions from load to store
• t=1: Clk propagates through inverter
• t=2: old ω propagates through AND
• t=3: l2 is set to 0, because c2 turns off AND gate
• t=4: ω/α from store path propagates to q
• t=5: ω/α from store path completes cycle
• t=5: Illustrate instability with ω=0, α=1
Trouble: inconsistent values on load path and store path. Old value (ω) still in store path when store path is enabled.
[Figure: waveforms of d, l1, l2, qn, q, s1, s2, clk, cn, and c2 for t = -3 to 6, labelled "setup with negative margin".]
Minimum Setup Time
[Figure: latch circuit with signals clk, d, cn, l1, l2, s1, s2, qn, and q, and waveforms of d, l1, l2, qn, q, s1, s2, clk, cn, and c2 as d changes from ω to α, with the setup interval marked.]
5.2.1.5 Hold Time of a Multiplexer Latch
[Figure: gate-level latch with signals clk, d, cn, c2, l1, l2, s1, s2, qn, and q.]
Hold Time Behaviour
[Figure: gate-level latch with signals clk, d, cn, c2, l1, l2, s1, s2, qn, and q, shown in six panels.]
5.2.1.6 Example of a Bad Latch
[Figure: latch circuit with signals clk, d, cn, c2, l1, l2, s1, s2, qn, and q, and waveforms of d, l1, l2, qn, q, s1, s2, clk, c2, and cn as d changes from α to β.]
5.3 Critical Paths and False Paths
5.3.1 Introduction to Critical and False Paths
Definition critical path: The slowest path on the chip between flops or flops and pins. The critical path limits the maximum clock speed.
Definition false path: a path along which an edge cannot travel from beginning to end.
Outline
The algorithm that we present comes from McGeer and Brayton in a DAC 198? paper. The algorithm to find the critical path through a circuit is presented in several parts.
1. Section 5.3.2: Find the longest path ignoring the possibility of false paths.
2. Section 5.3.3: Almost-correct algorithm to test whether a candidate critical path is a false path.
3. Section 5.3.4: If a candidate path is a false path, then find the next candidate path, and repeat the false-path detection algorithm.
4. Section 5.3.5: Correct, complete, and complex algorithm to find the critical path in a circuit.
Notes
Note: The analysis of critical paths and false paths assumes that all inputs change values at exactly the same time. Timing differences between inputs are modelled by the skew parameter in timing analysis.
Throughout our discussion of critical paths, we will use the delay values for gates shown in the table below.
gate   delay
NOT    2
AND    4
OR     4
XOR    6
5.3.1.1 Example of Critical Path in Full Adder
Question: Find the critical path through the full-adder circuit shown below.
[Figure: full-adder circuit with signals ci, a, b, i, j, k, s, and co.]
Alternative Excitation
Question: Do the input values of ci=0, a=↓, b=1 exercise the critical path?
[Figure: full-adder circuit with signals ci, a, b, i, j, k, s, and co.]
5.3.1.2 Preliminaries for Critical Paths
5.3.1.3 Longest Path and Critical Path
The longest path through the circuit might not be the critical path, because the behaviour of the gates might prevent an edge (0→1 or 1→0) from travelling along the path.
Example False Path
Question: Determine whether the longest path in the circuit below is a false path.
[Figure: circuit with inputs a and b and output y, redrawn for four cases: a = 0, b = 0→1; a = 0, b = 1→0; a = 1, b = 0→1; a = 1, b = 1→0.]
Question: How can we determine analytically that this is a false path?
[Figure: circuit with inputs a and b and output y.]
Preview of Complete Example
Question: Find the critical path through the circuit below.
[Figure: circuit with signals a, b, c, d, e, f, and g, shown twice.]
5.3.2 Longest Path
Outline of Algorithm to Find Longest Path
The basic idea is to annotate each signal with the maximum delay from it to an output.
• Start at destination signals and traverse through fanin to source signals.
– Destination signals have a delay of 0
– At each gate, annotate the inputs by the delay through the gate plus the delay of the output.
– When a signal fans out to multiple gates, annotate the output of the source (driving) gate with maximum delay of the destination signals.
• The primary input signal with the maximum delay is the start of the longest path. The delay annotation of this signal is the delay of the longest path.
• The longest path is found by working from the source signal to the destination signals, picking the fanout signal with the maximum delay at each step.
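The annotation procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course's reference implementation: the gate delays come from the table in the notes above, but the netlist is a hypothetical gate-level full adder (the original figure is not reproduced here), and the names `annotate_delays` and `longest_path` are made up for the example. When following the path forward, the sketch picks the fanout whose gate delay plus annotation is largest, which is the choice that reproduces the annotation of the current signal.

```python
GATE_DELAY = {"NOT": 2, "AND": 4, "OR": 4, "XOR": 6}

# Hypothetical full adder: gate output -> (gate type, input signals).
CIRCUIT = {
    "i": ("XOR", ["a", "b"]),
    "j": ("AND", ["a", "b"]),
    "k": ("AND", ["i", "ci"]),
    "s": ("XOR", ["i", "ci"]),
    "co": ("OR", ["j", "k"]),
}
OUTPUTS = ["s", "co"]


def annotate_delays(circuit, outputs):
    """Annotate each signal with its maximum delay to any output."""
    fanout = {}
    for gate_out, (_, inputs) in circuit.items():
        for sig in inputs:
            fanout.setdefault(sig, []).append(gate_out)
    delay = {}

    def visit(sig):
        if sig not in delay:
            through = [GATE_DELAY[circuit[dst][0]] + visit(dst)
                       for dst in fanout.get(sig, [])]
            if sig in outputs:
                through.append(0)  # destination signals have a delay of 0
            delay[sig] = max(through, default=0)
        return delay[sig]

    for sig in list(fanout) + list(outputs):
        visit(sig)
    return delay, fanout


def longest_path(circuit, outputs):
    """Return (signals on the longest path, its delay), ignoring false paths."""
    delay, fanout = annotate_delays(circuit, outputs)
    primary_inputs = [s for s in fanout if s not in circuit]
    sig = max(primary_inputs, key=lambda s: delay[s])
    path = [sig]
    while fanout.get(sig):
        sig = max(fanout[sig],
                  key=lambda d: GATE_DELAY[circuit[d][0]] + delay[d])
        path.append(sig)
    return path, delay[path[0]]


print(longest_path(CIRCUIT, OUTPUTS))
```

For this hypothetical netlist the annotations are 14 for a and b, 8 for ci and i, and 4 for j and k.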
5.3.3 Detecting a False Path
5.3.3.1 Preliminaries
The controlling value of a gate is the value such that if one of the inputs has thisvalue, the output can be determined independently of the other inputs.
The controlled output value is the value produced by the controlling input value.
Gate Controlling Value Controlled Output
AND
OR
NAND
NOR
XOR
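Controlling values follow directly from each gate's truth table, so they can be checked by brute force. The sketch below is only an illustration; the `GATES` dictionary and the function name are made up for the example, and it treats two-input gates, all of which are symmetric in their inputs.

```python
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
    "XOR": lambda a, b: a ^ b,
}


def controlling_value(gate):
    """Return (controlling value, controlled output), or None if there is none.

    A value is controlling if fixing one input to it makes the output the
    same for every value of the other input.
    """
    func = GATES[gate]
    for value in (0, 1):
        outputs = {func(value, other) for other in (0, 1)}
        if len(outputs) == 1:
            return value, outputs.pop()
    return None


for name in GATES:
    print(name, controlling_value(name))
```

A gate for which the function returns None has no controlling value: its output always depends on both inputs.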
Path Input, Side Input
Definition path input: For a gate on a path (either a candidate critical path, or a real critical path), the path input is the input signal that is on the path.
Definition side input: For a gate on a path (either a candidate critical path, or a real critical path), the side inputs are the input signals that are not on the path.
Reconvergent Fanout
Definition reconvergent fanout: There are paths from signals in the fanout of a gate that reconverge at another gate.
[Figure: circuit with signals a through h, y, and z, illustrating reconvergent fanout.]
If a candidate path has reconvergent fanout, then the rising or falling edge on the input to the path might cause a side input along the path to have a rising or falling edge, rather than a stable ’0’ or ’1’.
Rules for Propagating an Edge Along a Path
[Figure: rules for propagating an edge through NOT, AND, OR, and XOR gates, with the side inputs annotated ’1’ or ’0’.]
Missing Rules?
Question: Why do the rules not have falling edges for AND gates or rising edges for OR gates on the side input?
[Figure: gate with inputs a and b and output c, with waveforms of a, b, and c.]
Viability Condition of a Path
Definition Viability condition: For a path (p) through a circuit, the viability condition is a Boolean expression in terms of the input signals that defines the cases where an edge will propagate along the path.
Based upon the rules for propagating an edge that we have seen so far, the viability condition for a path is: every side input has a non-controlling value.
As always, section 5.3.5 has the complete viability condition.
5.3.3.2 Almost-Correct Algorithm to Detect a False Path
1. Annotate each side input along the path with its non-controlling value. These annotations are the constraints that must be satisfied for the candidate path to be exercised.
2. Propagate the constraints backward from the side inputs of the path to the inputs of the circuit under consideration.
3. If there is a contradiction amongst the constraints, then the candidate path is a false path.
4. If there is no contradiction, then the constraints on the inputs give the conditions under which an edge will traverse along the candidate path from input to output.
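The four steps can be mimicked on a small example. The sketch below swaps in a different technique for step 2: instead of propagating constraints backward symbolically, it enumerates every assignment to the other primary inputs and keeps those that satisfy all side-input constraints. It also assumes, as the almost-correct algorithm does, that side inputs are stable while the path input changes. The three-gate circuit and all names are hypothetical.

```python
from itertools import product

NON_CONTROLLING = {"AND": 1, "OR": 0}

# Hypothetical circuit: gate output -> (gate type, input signals).
CIRCUIT = {
    "n": ("NOT", ["a"]),
    "g": ("AND", ["b", "a"]),
    "y": ("AND", ["g", "n"]),
}
PRIMARY_INPUTS = ["a", "b"]


def evaluate(circuit, inputs):
    """Compute every signal value from an assignment to the primary inputs."""
    values = dict(inputs)

    def value(sig):
        if sig not in values:
            kind, ins = circuit[sig]
            args = [value(s) for s in ins]
            if kind == "NOT":
                values[sig] = 1 - args[0]
            elif kind == "AND":
                values[sig] = int(all(args))
            else:  # "OR"
                values[sig] = int(any(args))
        return values[sig]

    for sig in circuit:
        value(sig)
    return values


def side_constraints(circuit, path):
    """Step 1: every side input along the path must be non-controlling."""
    constraints = []
    for path_input, gate_out in zip(path, path[1:]):
        kind, ins = circuit[gate_out]
        for sig in ins:
            if sig != path_input and kind in NON_CONTROLLING:
                constraints.append((sig, NON_CONTROLLING[kind]))
    return constraints


def viable_assignments(circuit, primary_inputs, path):
    """Steps 2 to 4 by enumeration; an empty result means a false path."""
    constraints = side_constraints(circuit, path)
    others = [p for p in primary_inputs if p != path[0]]
    viable = []
    for bits in product((0, 1), repeat=len(others)):
        assignment = dict(zip(others, bits))
        # Side inputs must hold their value for both levels of the path input.
        if all(evaluate(circuit, {**assignment, path[0]: level})[sig] == want
               for level in (0, 1) for sig, want in constraints):
            viable.append(assignment)
    return viable


print(viable_assignments(CIRCUIT, PRIMARY_INPUTS, ["b", "g", "y"]))
print(viable_assignments(CIRCUIT, PRIMARY_INPUTS, ["b", "g"]))
```

In this circuit the path b, g, y needs a = 1 at gate g and NOT a = 1 at gate y, which is a contradiction, so the path is false. The prefix b, g alone is viable when a = 1. Enumeration is exponential in the number of inputs, so it only suits small illustrations like this one.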
5.3.3.3 Examples of Detecting False Paths
False-Path Example 1
Question: Determine if the longest path in the circuit below is a false path.
[Figure: circuit with signals a through k, each annotated with its maximum delay to the output.]
side input non-controlling value constraint
5.3.4 Finding the Next Candidate Path
If the longest path is a false path, we need to find the next longest path in the circuit, which will be our next candidate critical path. If this candidate fails, we continue to find the next longest of the remaining paths, ad infinitum.
5.3.4.1 Algorithm to Find Next Candidate Path
1. Initialize path table with primary inputs, their potential delay, and fanout.
2. Sort path table by potential delay.
3. If the partial path with the max delay has just one unused fanout signal, then extend the partial path with this signal. Otherwise:
(a) Extend path through unused fanout with max delay.
(b) Delete this fanout signal from the list of unused fanout signals.
4. Compute constraint that side input has non-controlling value.
5. If the new constraint does not cause a contradiction, then return to step 3. Otherwise:
(a) Mark this partial path as false.
(b) For each partial path that is a prefix of the false path:
• recalculate potential delay of path
(c) Return to step 2.
5.3.4.2 Examples of Finding Next Candidate Path
Next-Path Example 1
Question: Starting from the initial delay calculation and longest path, find the next candidate path and test if it is a false path.
[Figure: circuit with signals a through k, each annotated with its maximum delay to the output.]
potential delay   unused fanout   path
10                e               c
12                h, g            b
16                d               a
side input non-controlling value constraint
5.3.5 Correct Algorithm to Find Critical Path
We now remove the assumption that side inputs always arrive earlier than path inputs.
5.3.5.1 Rules for Late Side Inputs
[Figure: gate behaviour for early and late side inputs, covering the combinations of path input (CTRL or non-ctrl) and side input (CTRL or non-ctrl). Panel labels: monotone speedup, side input causes glitch, path input causes glitch, path input propagates, side input propagates, neither input propagates.]
The complete and correct rule: a path input excites the gate if the side input is non-controlling, or the side input arrives late and the path input is controlling.
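The rule is a small Boolean expression, which the following sketch writes out and tabulates. The function and parameter names are made up for the illustration; the expression itself is a direct transcription of the sentence above.

```python
from itertools import product


def path_input_excites_gate(side_is_controlling, side_arrives_late,
                            path_is_controlling):
    """Transcription of the rule for one gate on a candidate path."""
    return (not side_is_controlling) or (side_arrives_late and path_is_controlling)


# Tabulate all eight cases.
for side_ctrl, late, path_ctrl in product((False, True), repeat=3):
    print(side_ctrl, late, path_ctrl,
          path_input_excites_gate(side_ctrl, late, path_ctrl))
```

The only cases that differ from the early-side rule ("every side input has a non-controlling value") are those where the side input is controlling, arrives late, and the path input is itself controlling.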
5.3.5.2 Monotone Speedup
Definition monotonic: A function (f) is monotonic if increasing its input causes the output to increase or remain the same. Mathematically: x < y ⟹ f(x) ≤ f(y).
Definition monotonous: A lecture is monotonous if increasing the length of the lecture increases the number of people who are asleep.
Definition monotone speedup: The maximum clock speed of a circuit should be monotonic with respect to the speed of any gate or sub-circuit. That is, if we increase the speed of part of the circuit, we should either increase the clock speed of the circuit, or leave it unchanged.
5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation
5.3.5.4 Complete Algorithm
• If we find a contradiction on the path, check for side inputs that are on previously discovered false paths.
• If a gate and its side input are on a previously discovered false path, then the side input defines a prefix of a false path that is a late-arriving side input.
• For each late-arriving prefix, compute its viability (the conditions under which an edge will propagate along the prefix to the late side input).
• To the row of the late-arriving side input in the constraint table, add as a disjunction the constraint that: the path input has a controlling value and at least one of the prefixes is viable.
5.3.5.5 Complete Examples
Complete Example 1
Question: Find the critical path in the circuit below.
[Figure: circuit with signals a, b, c, d, e, f, and g.]
potential delay   unused fanout   path
false                             a,b,d,e,f,g
10                g, c            a
10                                a,c,f,g

side input   non-controlling value   constraint
f[e]         1                       a
g[a]         1                       a
Complete Example 2
Question: Find the critical path in the circuit below.
[Figure: circuit with signals a through j, each annotated with its maximum delay to the output.]
potential delay   unused fanout   path
false                             b,d,e,g,h,i,j
8                 f               a
12                h               c
14                f, g            b,d,e
14                                b,d,e,g,i,j

side input   non-ctrl value   constraint
h[c]         0                c
i[h]         0                cb
j[f]         0                ab
Complete Example 3
Monotone speedup
• Critical path 〈a,c,e,f〉
• Late side input e[d]
• Total delay 10
• Excitation: a = rising edge
[Figure: circuit with signals a, b, c, d, e, and f annotated with timing values, shown in three panels: rising edge excitation, falling edge excitation, and fast timing.]
Complete Example 4
Late side inputs sometimes must have an edge.
Find the second-longest path with contradiction using early sides:
[Figure: circuit with signals a through k annotated with delays and signal values, shown in three panels.]
Complete Example 5
Late side paths must be viable.
Question: Find the critical path in the circuit below.
[Figure: circuit with signals a through k, shown twice.]
5.3.6 Further Extensions to Critical Path Analysis
McGeer and Brayton’s paper includes extensions to the critical path algorithm presented here that we will not cover:
• gates with more than two inputs
• finding all input values that will exercise the critical path
• multiple paths with the same delay to the same gate
5.3.7 Increasing the Accuracy of Critical Path Analysis
When doing critical path calculations, it is often useful to strike a balance between accuracy and effort. In the examples so far, we assumed that all signals had the same wire and load delays. This assumption simplifies calculations, but reduces accuracy. Section 5.4 discusses how the analog world affects timing analysis.
5.4 Elmore Timing Model
5.4.1 RC-Networks for Timing Analysis
[Figure: four views of a P-transistor: transistor level (gate, source, drain), mask level (poly gate, p-diff, contacts), cross-section of the fabricated transistor (poly, p-diff, contact, substrate), and switch level (gate, source, drain).]
[Figure: the corresponding four views of an N-transistor: transistor level, mask level (poly gate, n-diff, contacts), cross-section of the fabricated transistor, and switch level.]
Different Levels of Abstraction for Inverter
[Figure: inverter with input a and output b at gate level, transistor level (between VDD and GND), and mask level (poly, n-diff, p-diff, metal, contact).]
RC-Network models of P- and N-transistors
[Figure: RC models of the two transistors, with pull-up resistance Rpu, pull-down resistance Rpd, and capacitance Cp, each with gate, source, and drain terminals.]
RC-Network for Timing Analysis
[Figure: RC network for the inverter with input a and output b: Rpu to VDD, Rpd to GND, capacitance Cp, and load capacitance CL.]
A Pair of Inverters
[Figure: two inverters in series (signals a, b, c) at gate level, transistor level, and mask level.]
RC-Network for Timing Analysis
[Figure: RC network for the pair of inverters, with Rpu, Rpd, and Cp for each inverter, load capacitances CL, wire resistance RW, wire capacitance CW, and resistance RV.]
RC-Network for Timing Analysis (trimmed)
[Figure: trimmed RC network for signal b: Rpu, Rpd, Cp, RW, CW, RV, and CL between VDD and GND.]
A Circuit with Fanout
[Figure: circuit with fanout (signals a, b, c, d) at gate level, gate level (physical layout), transistor level, and mask level.]
RC-Network for Timing Analysis
[Figure: RC network for the circuit with fanout, with Rpu, Rpd, and Cp for each gate, load capacitances CL, wire segments RW1/CW1, RW2/CW2, and RW3/CW3, and resistances RV.]
RC-Network for Timing Analysis (trimmed)
[Figure: trimmed RC network for signal b: Rpu, Rpd, Cp, RW1/CW1, RW2/CW2, two resistances RV, and two load capacitances CL.]
RC-Network for Timing Analysis (cleaned up)
[Figure: the trimmed RC network redrawn.]
5.4.2 Derivation of Analog Timing Model
Real Waveforms
[Figure: input voltage and output voltage versus time, for a slow input and for a fast input.]
Steps Toward Approximation
We begin with two simplifications as steps toward calculating a single delay value for a circuit.
1. Look at the circuit’s response to a step-function input.
2. Measure the delay to go from GND to 65% of VDD and from VDD to 35% of VDD.
Definition Trip Points: A high or ’1’ trip point is the voltage level where an upwards transition means the signal represents a ’1’. A low or ’0’ trip point is the voltage level where a downwards transition means the signal represents a ’0’.
[Figure: waveforms of signals a and b.]
Node Numbering, Initial Conditions
• The source (VDD in our case) and each capacitor is a node. We number the nodes, capacitors, and resistors. Resistors are numbered according to the capacitor to their right. Multiple resistors in series without an intervening capacitor are lumped into a single resistor.
• All nodes except the source start at GND.
• We calculate the voltage at a node when we turn on the P-transistor (connect to VDD).
The process for analyzing a transition from VDD to GND on a node is the dual of the process just described. The source node is GND, all other nodes start at VDD, and we calculate the voltage when we turn on the N-transistor (connect it to GND).
[Figure: the cleaned-up RC network (Rpu, Rpd, Cp, RW1, CW1, RW2, CW2, RV, CL) with nodes numbered 0 to 5 and lumped resistors R1 to R5.]
Define: Path and Downstream
Definition path: The path from the source node to a node i is the set of all resistors between the source and i. Example: path(3) = {R1, R2, R3}
Definition down: The set of capacitors downstream from a node is the set of all capacitors where current would flow through the node to charge the capacitor. You can think of this as the set of capacitors that are between the node and ground. Example: down(2) = {C2, C3, C4, C5}. Example: down(3) = {C3, C4}
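Both sets can be computed from a table that records each node's upstream neighbour. The sketch below is an illustration only: the `PARENT` table is an assumption, chosen so that it reproduces the three examples in the definitions above (it is not read off the original figure), and resistor Rn is taken to sit between node n and its parent.

```python
# Assumed topology: node -> upstream node (node 0 is the source).
PARENT = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}


def path(node):
    """Resistors between the source (node 0) and the given node."""
    resistors = set()
    while node != 0:
        resistors.add(f"R{node}")
        node = PARENT[node]
    return resistors


def down(node):
    """Capacitors charged by current that flows through the given node."""
    return {f"C{k}" for k in PARENT if f"R{node}" in path(k)}


print(sorted(path(3)))
print(sorted(down(2)))
print(sorted(down(3)))
```

A capacitor Ck is downstream of node n exactly when resistor Rn lies on path(k), which is why `down` can be written in terms of `path`.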
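5.4.2.1 Example Derivation: Equation for Voltage at Node 3
\[ V_3(t) = V_0(t) - (\text{voltage drop from Node 0 to Node 3}) \]
The voltage drop is the sum of the voltage drops across the resistors on the path from Node 0 to Node 3.
\[ V_3(t) = V_0(t) - \sum_{r \in \mathit{path}(3)} R_r I_r(t) \]
\[ V_3(t) = V_0(t) - \bigl( R_1 I_1(t) + R_2 I_2(t) + R_3 I_3(t) \bigr) \]
The current through a resistor is the sum of the currents through all of the downstream capacitors.
\[ I_r(t) = \sum_{c \in \mathit{down}(r)} I_c \]
\[ I_1(t) = I_{c1} + I_{c2} + I_{c3} + I_{c4} + I_{c5} \]
\[ I_2(t) = I_{c2} + I_{c3} + I_{c4} + I_{c5} \]
\[ I_3(t) = I_{c3} + I_{c4} \]
Substitute I_r into the equation for V_3.
\[ V_3(t) = V_0(t) - \bigl( R_1 (I_{c1} + I_{c2} + I_{c3} + I_{c4} + I_{c5}) + R_2 (I_{c2} + I_{c3} + I_{c4} + I_{c5}) + R_3 (I_{c3} + I_{c4}) \bigr) \]
Use associativity to group terms by currents.
\[ V_3(t) = V_0(t) - \bigl( I_{c1} R_1 + I_{c2} (R_1 + R_2) + I_{c3} (R_1 + R_2 + R_3) + I_{c4} (R_1 + R_2 + R_3) + I_{c5} (R_1 + R_2) \bigr) \]
Current through a capacitor
\[ I_c(t) = C_c \frac{\partial V_c(t)}{\partial t} \]
Substitute I_c into the equation for V_3.
\[ V_3(t) = V_0(t) - \Bigl( R_1 C_{c1} \frac{\partial V_{c1}(t)}{\partial t} + (R_1 + R_2) C_{c2} \frac{\partial V_{c2}(t)}{\partial t} + (R_1 + R_2 + R_3) C_{c3} \frac{\partial V_{c3}(t)}{\partial t} + (R_1 + R_2 + R_3) C_{c4} \frac{\partial V_{c4}(t)}{\partial t} + (R_1 + R_2) C_{c5} \frac{\partial V_{c5}(t)}{\partial t} \Bigr) \]
\[ R_{i,k} = \sum_{r \in \mathit{path}(i) \cap \mathit{path}(k)} R_r \]
\[ R_{3,1} = R_1 \qquad R_{3,2} = R_1 + R_2 \qquad R_{3,3} = R_1 + R_2 + R_3 \qquad R_{3,4} = R_1 + R_2 + R_3 \qquad R_{3,5} = R_1 + R_2 \]
Substitute R_{i,k} into V_3.
\[ V_3(t) = V_0(t) - \Bigl( R_{3,1} C_{c1} \frac{\partial V_{c1}(t)}{\partial t} + R_{3,2} C_{c2} \frac{\partial V_{c2}(t)}{\partial t} + R_{3,3} C_{c3} \frac{\partial V_{c3}(t)}{\partial t} + R_{3,4} C_{c4} \frac{\partial V_{c4}(t)}{\partial t} + R_{3,5} C_{c5} \frac{\partial V_{c5}(t)}{\partial t} \Bigr) \]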
5.4.2.2 General Derivation
\[ V_i(t) = V_0(t) - (\text{voltage drop from Node 0 to Node } i) \]
The voltage drop is the sum of the voltage drops across the resistors on the path from Node 0 to Node i.
\[ V_i(t) = V_0(t) - \sum_{r \in \mathit{path}(i)} R_r I_r(t) \]
The current through a resistor is the sum of the currents through all of the downstream capacitors.
\[ I_r(t) = \sum_{c \in \mathit{down}(r)} I_c \]
Substitute I_r into the equation for V_i.
\[ V_i(t) = V_0(t) - \sum_{r \in \mathit{path}(i)} R_r \sum_{c \in \mathit{down}(r)} I_c \]
Use associativity to push R_r into the summation over c.
\[ V_i(t) = V_0(t) - \sum_{r \in \mathit{path}(i)} \sum_{c \in \mathit{down}(r)} R_r I_c \]
Current through a capacitor
\[ I_c(t) = C_c \frac{\partial V_c(t)}{\partial t} \]
Substitute I_c into the equation for V_i.
\[ V_i(t) = V_0(t) - \sum_{r \in \mathit{path}(i)} \sum_{c \in \mathit{down}(r)} R_r C_c \frac{\partial V_c(t)}{\partial t} \]
A little bit of handwaving to prepare for Elmore resistance
\[ V_i(t) = V_0(t) - \sum_{k \in \mathit{Nodes}} \Bigl( \sum_{r \in \mathit{path}(i) \cap \mathit{path}(k)} R_r \Bigr) C_k \frac{\partial V_k(t)}{\partial t} \]
Define Elmore resistance R_{i,k}
\[ R_{i,k} = \sum_{r \in \mathit{path}(i) \cap \mathit{path}(k)} R_r \]
Substitute R_{i,k} into V_i.
\[ V_i(t) = V_0(t) - \sum_{k \in \mathit{Nodes}} R_{i,k} C_k \frac{\partial V_k(t)}{\partial t} \]
5.4.3 Elmore Timing Model
• Assume that V0(t) is a step function from 0 to 1 at time 0.
• Derive upper and lower bounds for Vi(t).
• Find RC time constants for upper and lower bounds.
• Elmore delay is guaranteed to be between upper and lower bounds.
[Figure: plot of the upper and lower bounds, the Elmore model, and the RC-network model, with TD−TRi, TP−TRi, TRi, TD, and TP marked.]
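Equations for Curves
Time breakpoints: 0, T_{Di} − T_{Ri}, T_P − T_{Ri}, ∞
Upper:
\[ 1 + \frac{t - T_{Di}}{T_P} \quad \text{for } 0 \le t \le T_{Di} - T_{Ri} \]
\[ 1 - \frac{T_{Ri}}{T_P} \, e^{(T_{Di} - T_P - t)/T_{Ri}} \quad \text{for } t \ge T_{Di} - T_{Ri} \]
Elmore:
\[ 1 - e^{-t/T_{Di}} \]
Lower:
\[ 0 \quad \text{for } 0 \le t \le T_{Di} - T_{Ri} \]
\[ 1 - \frac{T_{Di}}{t + T_{Ri}} \quad \text{for } T_{Di} - T_{Ri} \le t \le T_P - T_{Ri} \]
\[ 1 - \frac{T_{Di}}{T_P} \, e^{(T_P - T_{Ri} - t)/T_P} \quad \text{for } t \ge T_P - T_{Ri} \]
Fact: 0 ≤ T_{Ri} ≤ T_{Di} ≤ T_P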
Definitions of Time Constants
\[ T_{Ri} = \sum_{k \in \mathit{Nodes}} \frac{R_{k,i}^2 C_k}{R_{i,i}} \]
Mathematical artifact, no intuitive meaning
\[ T_{Di} = \sum_{k \in \mathit{Nodes}} R_{k,i} C_k \]
Elmore delay
\[ T_P = \sum_{k \in \mathit{Nodes}} R_{k,k} C_k \]
RC-time constant for lumped network
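The three time constants are straightforward to compute once the shared-path resistances are available. The sketch below is a hedged illustration: the tree topology in `PARENT` is an assumption chosen to agree with the path(3) and down(2) examples earlier in this section, and the resistor and capacitor values are invented (ohms and picofarads, so the results are in picoseconds).

```python
# Assumed topology: node -> upstream node (node 0 is the source).
PARENT = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}
R = {1: 100, 2: 50, 3: 50, 4: 25, 5: 25}   # ohms, hypothetical
C = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1}         # picofarads, hypothetical


def path(node):
    """Indices of the resistors between the source and the given node."""
    resistors = set()
    while node != 0:
        resistors.add(node)
        node = PARENT[node]
    return resistors


def elmore_resistance(i, k):
    """R_{i,k}: total resistance shared by the paths to nodes i and k."""
    return sum(R[r] for r in path(i) & path(k))


def time_constants(i):
    """Return (T_Ri, T_Di, T_P) for node i, in ohm * pF = picoseconds."""
    t_d = sum(elmore_resistance(k, i) * C[k] for k in PARENT)
    t_p = sum(elmore_resistance(k, k) * C[k] for k in PARENT)
    t_r = sum(elmore_resistance(k, i) ** 2 * C[k] for k in PARENT) \
        / elmore_resistance(i, i)
    return t_r, t_d, t_p


t_r, t_d, t_p = time_constants(3)
print(t_r, t_d, t_p)
assert 0 <= t_r <= t_d <= t_p
```

With these made-up values the Elmore delay to node 3 is 1150 ps, bracketed by 987.5 ps and 1225 ps, consistent with the ordering 0 ≤ TRi ≤ TDi ≤ TP.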
Picking the Trip Point
\[ V_i(t) = V_{DD} (1 - e^{-t/T_{Di}}) \]
Pick trip point of V_i(t) = 0.65 V_{DD}, then solve for t.
\[ 0.65 V_{DD} = V_{DD} (1 - e^{-t/T_{Di}}) \]
\[ 0.35 = e^{-t/T_{Di}} \]
Take ln of both sides.
\[ \ln 0.35 = \ln(e^{-t/T_{Di}}) \]
\[ \ln 0.35 = -1.05 \approx -1.0 \]
\[ -1.0 = -t/T_{Di} \]
\[ t = T_{Di} \]
By picking a trip point of 0.65 VDD, the time for Vi to reach the trip point is the Elmore delay.
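The approximation in this derivation is easy to check numerically. The helper below is a small sketch (the function name is made up) that solves the single-exponential model for the time to reach any trip fraction.

```python
import math


def time_to_trip(trip_fraction, t_d):
    """Time for VDD * (1 - exp(-t / t_d)) to reach trip_fraction * VDD."""
    return -t_d * math.log(1.0 - trip_fraction)


# With the Elmore delay normalised to 1, the 65% trip point is reached
# about 5% later than the Elmore delay itself.
print(time_to_trip(0.65, 1.0))
```

Choosing a trip fraction of 1 − 1/e (about 63.2%) would make the trip time equal to the Elmore delay exactly; 65% is a round number that lands close to it.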
5.4.4 Examples of Using Elmore Delay
5.4.4.1 Interconnect with Single Fanout
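[Figure: gates G1 and G2 connected through antifuses Ra1 to Ra4 and wire segments Rw1 to Rw3 with capacitances C1 to C3, and the equivalent RC network containing G1 (Vi, Rpu, Rpd, Cp), Ra1 to Ra4, Rw1 to Rw3, C1 to C3, and the input capacitance CG2 of G2.]
G*: gate
C*: capacitance on wire
Ra*: resistance through antifuse
Rw*: resistance through wire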
Question: Calculate delay from gate 1 to gate 2
[Figure: the RC network from G1 to G2, containing Vi, Rpu, Rpd, Cp, Ra1 to Ra4, Rw1 to Rw3, C1 to C3, and CG2.]
Doubling Antifuses
Question: If you double the number of antifuses and wires needed to connect two gates, what will be the approximate effect on the wire delay between the gates?
5.4.4.2 Interconnect with Multiple Gates in Fanout
[Figure: gate G1 driving gates G2 and G3.]
Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G2.
Delay to G2 vs G3
Question: Assuming all wire segments at the same level have roughly the same capacitance, which is greater, the delay to G2 or the delay to G3?
[Figure: interconnect from G1 to G2 and G3 with resistances R1 to R6 and capacitances C1 to C7, and the equivalent RC network with G1 (Vi, Rpu, Rpd, Cp) and nodes n1 to n7.]
5.5 Practical Usage of Timing Analysis
Speed Grading
• Fabs sort chips according to their speed (sorting is known as speed grading or speed binning)
• Faster chips are more expensive
• In FPGAs, sorting is usually based on propagation delay through an FPGA cell. As wires become a larger portion of delay, some analysis of wire delays is also being done.
• Propagation delay is the average of the rising and falling propagation delays.
• Typical speed grades for FPGAs:
Std   standard speed grade
1     15% faster than Std
2     25% faster than Std
3     35% faster than Std
Worst-Case Timing
• Maximum Delay in CMOS. When?
– Minimum voltage
– Maximum temperature
– Slow-slow conditions (process variation/corner which results in slow p-channel and slow n-channel). We could also have fast-fast, slow-fast, and fast-slow process corners
• Increasing temperature increases delay
– ⇑ Temp =⇒ ⇑ resistivity
– ⇑ resistivity =⇒ ⇑ electron vibration
– ⇑ electron vibration =⇒ ⇑ colliding with current electrons
– ⇑ colliding with current electrons =⇒ ⇑ delay
• Increasing supply voltage decreases delay
– ⇑ supply voltage =⇒ ⇑ current
– ⇑ current =⇒ ⇓ load capacitor charge time
– ⇓ load capacitor charge time =⇒ ⇓ total delay
• Derating factor is a number used to adjust timing numbers to account for voltage and temp conditions
• ASIC manufacturers’ classes, based on variety of environments:
Class        VDD         Temperature (TA ambient temp or TC case temp)
Commercial   5V ± 5%     0 to +70C
Industrial   5V ± 10%    –40 to +85C
Military     5V ± 10%    –55 to +125C
• What is important is the transistor temperature inside the chip, TJ (junction temperature)
5.5.1 Speed Binning
Speed binning is the process of testing each manufactured part to determine the maximum clock speed at which it will run reliably.
Manufacturers sell chips off of the same manufacturing line at different prices based on how fast they will run.
A “speed bin” is the clock speed that chips will be labeled with when sold.
Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your software crashes more frequently than your over-stressed hardware will).
5.5.1.1 FPGAs, Interconnect, and Synthesis
On FPGAs 40-60% of clock cycle is consumed by interconnect.
When synthesizing, increasing effort (number of iterations) of place and route can significantly reduce the clock period on large designs.
5.5.2 Worst Case Timing
5.5.2.1 Fanout delay
In Smith’s book, Table 5.2 (Fanout delay) combines two separate parameters:
• capacitive load delay
• interconnect delay
into a single parameter (fanout). This is common, and fine.
But, when reading a table such as this, you need to know whether fanout delay is combining both capacitive load delay and interconnect delay, or is just capacitive load.
5.5.2.2 Derating Factors
Delays are dependent upon supply voltage and temperature.
⇑ Temp =⇒ ⇑ Delay
⇑ Supply voltage =⇒ ⇓ Delay
Temperature
• ⇑ Temp =⇒ ⇑ Delay
– ⇑ Temp =⇒ ⇑ Resistivity of wires
– As temp goes up, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current.
Supply Voltage
• ⇑ Supply voltage =⇒ ⇓ Delay
– ⇑ Supply voltage =⇒ ⇑ current (V = IR)
– ⇑ current =⇒ ⇓ time to charge load capacitors to threshold voltage
Derating Factor Definition
A “derating factor” is a number to adjust timing numbers to account for different temperature and voltage conditions.
Excerpt from table 5.3 in Smith’s book (Actel Act 3 derating factors):
Derating factor   Temp    Vdd
1.17              125C    4.5V
1.00              70C     5.0V
0.63              -55C    5.5V
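As a sketch of how such a table is used, the snippet below scales a nominal delay by the factor for a given operating condition. It assumes the usual convention that the factor multiplies a delay quoted at the 1.00 condition (70C, 5.0V); the function name and the 10 ns nominal delay are hypothetical.

```python
# (temperature in C, supply voltage in V) -> derating factor, from the excerpt.
DERATING = {
    (125, 4.5): 1.17,
    (70, 5.0): 1.00,
    (-55, 5.5): 0.63,
}


def derated_delay(nominal_delay, temp_c, vdd):
    """Scale a delay quoted at the nominal condition to another condition."""
    return nominal_delay * DERATING[(temp_c, vdd)]


nominal_ns = 10.0  # hypothetical delay at 70C, 5.0V
print(derated_delay(nominal_ns, 125, 4.5))   # hot and low voltage: slowest
print(derated_delay(nominal_ns, -55, 5.5))   # cold and high voltage: fastest
```

The two extreme rows match the trends above: high temperature with low supply voltage gives the largest delay, and low temperature with high supply voltage gives the smallest.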
Chapter 6
Power Analysis and Power-Aware Design
6.1 Overview
6.1.1 Importance of Power and Energy
• Laptops, PDAs, cell phones, etc.: obvious!
• For microprocessors in personal computers, every watt above 40W adds $1 to manufacturing cost
• Approx 25% of operating expense of server farm goes to energy bills
• (Dis)Comfort of Unix labs in E2
• Sandia Labs had to build a special sub-station when they took delivery of Teraflops massively parallel supercomputer (over 9000 Pentium Pros)
• High-speed microprocessors today can run so hot that they will damage themselves: Athlon reliability problems, Pentium 4 processor thermal throttling
• In 2000, information technology consumed 8% of total power in US.
• Future power viruses: cell phone viruses cause cell phone to run in full power mode and consume battery very quickly; PC viruses that cause CPU to melt down batteries
6.1.2 Industrial Names and Products
Note: Lots of links from E&CE 327 web pages under “Documentation”
6.1.3 Power vs Energy
Most people talk about “power” reduction, but sometimes they mean “power” and sometimes “energy.”
• Power minimization is usually about heat removal
• Energy minimization is usually about battery life or energy costs
Type     Units    Equivalent Types   Equations
Energy   Joules   Work               = Volts × Coulombs = ½ × C × Volts²
Power    Watts    Energy / Time      = Volts × I = Joules/sec
6.1.4 Batteries, Power and Energy
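6.1.4.1 Do Batteries Store Energy or Power?
\[ \text{Energy} = \text{Volts} \times \text{Coulombs} \]
\[ \text{Power} = \frac{\text{Energy}}{\text{Time}} \]
Batteries rated in Amp-hours at a voltage.
\[ \text{battery} = \text{Amps} \times \text{Seconds} \times \text{Volts} = \frac{\text{Coulombs}}{\text{Seconds}} \times \text{Seconds} \times \text{Volts} = \text{Coulombs} \times \text{Volts} = \text{Energy} \]
Batteries store energy.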
6.1.4.2 Battery Life and Efficiency
To extend battery life, we want to increase the amount of work done and/or decrease energy consumed.
Work and energy are same units, therefore to extend battery life, we truly want to improve efficiency.
“Power efficiency” of microprocessors normally measured in MIPS/Watt. Is this a real measure of efficiency?
\[ \frac{\text{MIPS}}{\text{Watts}} = \frac{\text{millions of instructions}}{\text{Seconds}} \times \frac{\text{Seconds}}{\text{Energy}} = \frac{\text{millions of instructions}}{\text{Energy}} \]
Both instructions executed and energy are measures of work, so MIPS/Watt is a measure of efficiency.
Question: What is the weakness of this analysis?
6.1.4.3 Battery Life and Power
Question: Running a VHDL simulation requires executing an average of 1 million instructions per simulation step. My computer runs at 700MHz, has a CPI of 1.0, and burns 70W of power. My battery is rated at 10V and 2.5AH. Assuming all of my computer’s clock cycles go towards running VHDL simulations, how many simulation steps can I run on one battery charge?
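One way to set up this calculation is sketched below. It assumes the full rated battery capacity is usable and that the computer draws constant power; the function names are made up for the illustration.

```python
def battery_energy_joules(volts, amp_hours):
    """Amp-hours at a voltage converted to joules (1 hour = 3600 s)."""
    return volts * amp_hours * 3600


def simulation_steps(volts, amp_hours, watts, clock_hz, cpi, instr_per_step):
    """Simulation steps completed on one battery charge."""
    run_time_s = battery_energy_joules(volts, amp_hours) / watts
    instructions = run_time_s * clock_hz / cpi
    return instructions / instr_per_step


steps = simulation_steps(volts=10, amp_hours=2.5, watts=70,
                         clock_hz=700e6, cpi=1.0, instr_per_step=1e6)
print(round(steps))
```

Under these assumptions the battery holds 90 000 J, which lasts about 1286 s at 70W, giving roughly 900 000 simulation steps.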
6.1.4 Batteries, Power and Energy 513
Battery Life and Power
Question: If I use the SpeedStep feature of my computer, my computer runs at 600MHz with 60W of power. With SpeedStep activated, how much longer can I keep the computer running on one battery?
Battery Life and Power
Question: With SpeedStep activated, how many more simulation steps can I run on one battery?
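One way to work through the three battery questions is sketched below. It uses only the numbers given in the questions. The function names are illustrative, and the sketch assumes that the CPI stays at 1.0 when SpeedStep is active and that the battery delivers its full rated energy.

```python
def run_time_seconds(battery_volts, battery_amp_hours, power_watts):
    """Time the battery lasts at a constant power draw."""
    energy_joules = battery_volts * battery_amp_hours * 3600.0
    return energy_joules / power_watts


def simulation_steps(run_time_s, clock_hz, cpi, instr_per_step):
    """Simulation steps completed in the given run time."""
    instructions = run_time_s * clock_hz / cpi
    return instructions / instr_per_step


# Normal mode: 700 MHz at 70 W.  SpeedStep: 600 MHz at 60 W.
normal_time = run_time_seconds(10.0, 2.5, 70.0)
speedstep_time = run_time_seconds(10.0, 2.5, 60.0)
normal_steps = simulation_steps(normal_time, 700e6, 1.0, 1e6)
speedstep_steps = simulation_steps(speedstep_time, 600e6, 1.0, 1e6)

print(f"normal:    {normal_time:.1f} s, {normal_steps:.0f} steps")
print(f"SpeedStep: {speedstep_time:.1f} s, {speedstep_steps:.0f} steps")
print(f"extra run time: {speedstep_time - normal_time:.1f} s")
```

Under these assumptions the battery lasts longer with SpeedStep, but the number of simulation steps per charge is unchanged, because the clock speed and the power drop by the same ratio and so the energy per instruction is the same.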
6.2 Power Equations
\[
\text{Power} = \underbrace{\text{SwitchPower} + \text{ShortPower}}_{\text{DynamicPower}} + \underbrace{\text{LeakagePower}}_{\text{StaticPower}}
\]
Dynamic Power dependent upon clock speed
Switching Power useful — charges up transistors
Short Circuit Power not useful — both N and P transistors are on
Static Power independent of clock speed
Leakage Power not useful — leaks around transistor
Dynamic Power
Dynamic power is proportional to how often signals change their value (switch).
• Roughly 20% of signals switch during a clock cycle.
• Need to take glitches into account when calculating activity factor. Glitches increase the activity factor.
• Equations for dynamic power contain clock speed and activity factor.
6.2.1 Switching Power
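[Figure: Charging a capacitor. The input switches 1→0 and the output switches 0→1, charging CapLoad.]
[Figure: Discharging a capacitor. The input switches 0→1 and the output switches 1→0, discharging CapLoad.]
\[ \text{energy to (dis)charge capacitor} = \frac{1}{2} \times \text{CapLoad} \times \text{VoltSup}^2 \]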
Switching Power
When a capacitor C is charged to a voltage V, the energy stored in the capacitor is $\frac{1}{2}CV^2$.
The energy required to charge the capacitor from 0 to V is $CV^2$. Half of the energy ($\frac{1}{2}CV^2$) is dissipated as heat through the pullup resistance. The other half is transferred to the capacitor.
When the capacitor discharges from V to 0, the energy stored in the capacitor ($\frac{1}{2}CV^2$) is dissipated as heat through the pulldown resistance.
Switching Power
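f′: frequency at which the inverter goes through a complete charge-discharge cycle (eqn 15.4 in Smith).
\[ \text{average switching power} = f' \times \text{CapLoad} \times \text{VoltSup}^2 \]
ClockSpeed: clock speed
ActFact: average number of times that a signal switches from 0→1 or from 1→0 during a clock cycle
A complete charge-discharge cycle contains two switches, so $f' = \frac{1}{2} \times \text{ActFact} \times \text{ClockSpeed}$, which gives:
\[ \text{average switching power} = \frac{1}{2} \times \text{ActFact} \times \text{ClockSpeed} \times \text{CapLoad} \times \text{VoltSup}^2 \]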
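The switching power equation can be written as a small function to explore how each parameter affects power. This is a sketch only; the function name and the example values (activity factor 0.2, 100 MHz clock, 1 pF load) are illustrative assumptions, not figures from the course.

```python
def switching_power(act_fact, clock_speed_hz, cap_load_farads, volt_sup):
    """Average switching power: 1/2 * ActFact * ClockSpeed * CapLoad * VoltSup^2."""
    return 0.5 * act_fact * clock_speed_hz * cap_load_farads * volt_sup ** 2


# Illustrative values: 20% activity, 100 MHz clock, 1 pF load
power_at_2v = switching_power(0.2, 100e6, 1e-12, 2.0)
power_at_1v = switching_power(0.2, 100e6, 1e-12, 1.0)

# Halving the supply voltage cuts switching power by a factor of four
print(power_at_2v / power_at_1v)
```

Power is linear in the activity factor, clock speed, and capacitance, but quadratic in the supply voltage.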
6.2.2 Short-Circuited Power
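[Figure: Inverter with input Vi and output Vo; the short-circuit current IShort flows from VoltSup to GND. A plot of the gate voltage shows that the P-transistor and the N-transistor are both on for TimeShort, while the gate voltage is between VoltThresh and VoltSup − VoltThresh.]
\[ \text{PwrShort} = \text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup} \]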
6.2.3 Leakage Power
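[Figure: Cross section of inverter showing parasitic diode]
[Figure: I-V curve showing leakage current ILeak through parasitic diode]
\[ \text{PwrLk} = \text{ILeak} \times \text{VoltSup} \]
\[ \text{ILeak} \propto e^{\left(\frac{-q \times \text{VoltThresh}}{k \times T}\right)} \]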
6.2.4 Glossary
This section reserved for your reading pleasure
6.2.5 Note on Power Equations
This section reserved for your reading pleasure
6.3 Overview of Power Reduction Techniques
We can divide power reduction techniques into two classes: analog and digital.
Analog Parameters
Power reduction parameters at the analog level.
capacitance for example, Silicon on Insulator (SOI)
resistance for example, copper wires
voltage low-voltage circuits
Analog Techniques
Power reduction techniques at the analog level.
dual-VDD Two different supply voltages: high voltage for performance-critical portions of design, low voltage for remainder of circuit. Alternatively, can vary voltage over time: high voltage when running performance-critical software and low voltage when running software that is less sensitive to performance.
dual-Vt Two different threshold voltages: transistors with low threshold voltage for performance-critical portions of design (can switch more quickly, but more leakage power), transistors with high threshold voltage for remainder of circuit (switches more slowly, but reduces leakage power).
exotic circuits Special flops, latches, and combinational circuitry that run at a high frequency while minimizing power
adiabatic circuits Special circuitry that consumes power on 0→1 transitions, but not 1→0 transitions. These sacrifice performance for reduced power.
clock trees Up to 30% of total power can be consumed in clock generation and clock tree
Digital Parameters
Power-reduction parameters at the digital level.
capacitance (number of gates)
activity factor
clock frequency
Digital Techniques
Power-reduction techniques at the digital level.
multiple clocks Put a high speed clock in performance-critical parts of design and a low speed clock for remainder of circuit
clock gating Turn off clock to portions of a chip when it’s not being used
data encoding Gray coding vs one-hot vs fully encoded vs ...
glitch reduction Adjust circuit delays or add redundant circuitry to reduce or eliminate glitches.
asynchronous circuits Get rid of clocks altogether....
Additional low-power design techniques for RTL from a Qualis engineer: http://home.europa.com/~celiac/lowpower.html
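6.4 Voltage Reduction for Power Reduction
If our goal is to reduce power, the most promising approach is to reduce the supply voltage, because, from:
\[
\begin{aligned}
\text{Power} = {} & (\text{ActFact} \times \text{ClockSpeed} \times \tfrac{1}{2}\,\text{CapLoad} \times \text{VoltSup}^2) \\
& + (\text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}) \\
& + (\text{ILeak} \times \text{VoltSup})
\end{aligned}
\]
we observe:
\[ \text{Power} \propto \text{VoltSup}^2 \]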
Reducing Difference Between Supply and Threshold Voltage
As the supply voltage decreases, it takes longer to charge up the capacitive load, which increases the load delay of a circuit.
In the chapter on timing analysis, we saw that increasing the supply voltage will decrease the delay through a circuit. (From V = IR, increasing V causes an increase in I, which causes the capacitive load to charge more quickly.) However, it is more accurate to take into account both the value of the supply voltage and the difference between the supply voltage and the threshold voltage.
\[ \text{MaxClockSpeed} \propto \frac{(\text{VoltSup} - \text{VoltThresh})^2}{\text{VoltSup}} \]
Effect of Decreasing Supply Voltage on Delay
Question: If the delay along the critical path of a circuit is 20 ns, the supply voltage is 2.8 V, and the threshold voltage is 0.7 V, calculate the critical path delay if the supply voltage is dropped to 2.2 V.
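One way to approach this question is sketched below, using the proportionality for MaxClockSpeed given earlier. The sketch assumes that the critical path delay is inversely proportional to the maximum clock speed and that the threshold voltage does not change; the function names are illustrative.

```python
def relative_max_clock_speed(volt_sup, volt_thresh):
    """Quantity that MaxClockSpeed is proportional to."""
    return (volt_sup - volt_thresh) ** 2 / volt_sup


def scaled_delay(old_delay, old_volt_sup, new_volt_sup, volt_thresh):
    """Delay after a supply-voltage change, assuming delay is 1 / MaxClockSpeed."""
    old_speed = relative_max_clock_speed(old_volt_sup, volt_thresh)
    new_speed = relative_max_clock_speed(new_volt_sup, volt_thresh)
    return old_delay * old_speed / new_speed


# 20 ns at 2.8 V, threshold 0.7 V, supply dropped to 2.2 V
new_delay_ns = scaled_delay(20.0, 2.8, 2.2, 0.7)
print(f"{new_delay_ns:.1f} ns")
```

Under these assumptions the delay grows to about 30.8 ns, an increase of 54% for a 21% reduction in supply voltage.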
Reducing Threshold Voltage Increases Leakage Current
If we reduce the supply voltage, we want to also reduce the threshold voltage, so that we do not increase the delay through the circuit. However, as threshold voltage drops, leakage current increases:
\[ \text{ILeak} \propto e^{\left(\frac{-q \times \text{VoltThresh}}{k \times T}\right)} \]
And increasing the leakage current increases the power:
\[ \text{Power} \propto \text{ILeak} \]
So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on reducing power) and increasing ILeak (which has a linear effect on increasing power).
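To get a feel for how sensitive leakage is to the threshold voltage, the proportionality above can be evaluated as a ratio. This is a rough sketch: it takes the proportionality exactly as written above, whereas real transistors usually include further device-dependent factors. The function name and the 300 K default temperature are assumptions.

```python
import math

Q = 1.602176634e-19  # electron charge in coulombs
K = 1.380649e-23     # Boltzmann constant in joules per kelvin


def leakage_ratio(old_volt_thresh, new_volt_thresh, temp_kelvin=300.0):
    """ILeak(new) / ILeak(old), assuming ILeak is proportional to exp(-q*Vt/(k*T))."""
    delta = new_volt_thresh - old_volt_thresh
    return math.exp(-Q * delta / (K * temp_kelvin))


# Lowering the threshold voltage from 0.7 V to 0.6 V
print(f"leakage grows by a factor of {leakage_ratio(0.7, 0.6):.1f}")
```

Because the dependence is exponential, a small reduction in threshold voltage can outweigh the quadratic savings from a lower supply voltage.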
6.5 Data Encoding for Power Reduction
6.5.1 How Data Encoding Can Reduce Power
Data encoding is a technique that chooses data values so that normal execution will have a low activity factor.
The most common example is “Gray coding” where exactly one bit changes value each clock cycle when counting.
Decimal  Gray  Binary
0        0000  0000
1        0001  0001
2        0011  0010
3        0010  0011
4        0110  0100
5        0111  0101
6        0101  0110
7        0100  0111
8        1100  1000
9        1101  1001
10       1111  1010
11       1110  1011
12       1010  1100
13       1011  1101
14       1001  1110
15       1000  1111
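The Gray column of the table can be generated from the binary value with the usual reflected-Gray-code construction (XOR of the value with itself shifted right by one bit). The sketch below uses illustrative function names.

```python
def binary_to_gray(n: int) -> int:
    """Reflected Gray code of n."""
    return n ^ (n >> 1)


def gray_table(bits: int = 4):
    """Gray codes for 0 .. 2**bits - 1 as bit strings."""
    return [format(binary_to_gray(i), f"0{bits}b") for i in range(2 ** bits)]


for decimal, gray in enumerate(gray_table(4)):
    print(f"{decimal:2d}  {gray}  {decimal:04b}")
```

Each row differs from the next in exactly one bit, which is the property that lowers the activity factor when counting.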
8-bit Counter
Question: For an eight-bit counter, how much more power will a binary counter consume than a Gray-code counter?
Random Data
Question: For completely random eight-bit data, how much more power will a binary circuit consume than a Gray-code circuit?
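One way to explore the counter and random-data questions is to count bit flips directly. The sketch below counts only flips of the eight state bits and assumes every bit has the same capacitive load; it ignores the next-state logic, so it is an approximation of the power comparison rather than a complete answer. All names are illustrative.

```python
def bit_flips(a: int, b: int) -> int:
    """Number of bits that differ between a and b."""
    return bin(a ^ b).count("1")


def flips_per_full_count(encode, bits):
    """Total bit flips while counting through all values once, including wrap-around."""
    n = 2 ** bits
    return sum(bit_flips(encode(i), encode((i + 1) % n)) for i in range(n))


def average_random_flips(encode, bits):
    """Average bit flips between two independent, uniformly random values."""
    n = 2 ** bits
    total = sum(bit_flips(encode(a), encode(b)) for a in range(n) for b in range(n))
    return total / (n * n)


def to_binary(i):
    return i


def to_gray(i):
    return i ^ (i >> 1)


binary_flips = flips_per_full_count(to_binary, 8)
gray_flips = flips_per_full_count(to_gray, 8)
print(binary_flips, gray_flips, binary_flips / gray_flips)
print(average_random_flips(to_binary, 8), average_random_flips(to_gray, 8))
```

When counting, the binary counter flips almost twice as many bits as the Gray-code counter. For random data the two encodings flip the same number of bits on average, because Gray coding only relabels the same set of 256 values.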
6.5.2 Example Problem: Sixteen Pulser
6.5.2.1 Problem Statement
Your task is to do the power analysis for a circuit that should send out a one-clock-cycle pulse on the done signal once every 16 clock cycles. (That is, done is ’0’ for 15 clock cycles, then ’1’ for one cycle, then repeat with 15 cycles of ’0’ followed by a ’1’, etc.)
[Figure: Required behaviour. Waveforms of clk and done over clock cycles 1 to 33.]
You have been asked to consider three different types of counters: a binary counter, a Gray-code counter, and a one-hot counter. (The table below shows the values from 0 to 15 for the different encodings.)
Question: What is the relative amount of power consumption for the different options?
6.5.2.2 Additional Information
Your implementation technology is an FPGA where each cell has a programmable combinational circuit and a flip-flop. The combinational circuit has 4 inputs and 1 output. The capacitive load of the combinational circuit is twice that of the flip-flop.
[Figure: FPGA cell containing a PLA (the programmable combinational circuit) and a flip-flop]
1. You may neglect power associated with clocks.
2. You may assume that all counters:
(a) are implemented on the same fabrication process
(b) run at the same clock speed
(c) have negligible leakage and short-circuit currents
Data Encoding
Decimal  Gray  One-Hot           Binary
0        0000  0000000000000001  0000
1        0001  0000000000000010  0001
2        0011  0000000000000100  0010
3        0010  0000000000001000  0011
4        0110  0000000000010000  0100
5        0111  0000000000100000  0101
6        0101  0000000001000000  0110
7        0100  0000000010000000  0111
8        1100  0000000100000000  1000
9        1101  0000001000000000  1001
10       1111  0000010000000000  1010
11       1110  0000100000000000  1011
12       1010  0001000000000000  1100
13       1011  0010000000000000  1101
14       1001  0100000000000000  1110
15       1000  1000000000000000  1111
6.5.2.3 Answer
Sketch the Circuitry
Name the output “done” and the count digits “d()”.
Capacitance
                      cap   number   subtotal cap
Gray    d()   PLAs
              Flops
        done  PLAs
              Flops
1-Hot   d()   PLAs
              Flops
        done  PLAs
              Flops
Binary  d()   PLAs
              Flops
        done  PLAs
              Flops
Activity Factors
Gray Coding Activity Factor
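[Figure: Gray coding. Waveforms of clk, d(0), d(1), d(2), d(3), and done over 16 clock cycles, annotated with activity factors: d(0) 8/16, d(1) 4/16, d(2) 2/16, d(3) 2/16, done 2/16.]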
One-Hot Activity Factor
[Figure: One-hot coding. Waveforms of clk, d(0), d(1), d(2), ..., and done over 16 clock cycles. Every signal shown has an activity factor of 2/16.]
Binary Coding Activity Factor
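[Figure: Binary coding. Waveforms of clk, d(0), d(1), d(2), d(3), and done over 16 clock cycles, annotated with activity factors: d(0) 16/16, d(1) 8/16, d(2) 4/16, d(3) 2/16, done 2/16.]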
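The activity factors for the three encodings can be checked by counting transitions over one 16-cycle period, including the wrap from state 15 back to state 0. The sketch below is illustrative: the function name is an assumption, and states are represented as bit strings with d(0) as the rightmost character.

```python
def activity_factors(states):
    """Transitions per clock cycle for each bit, over one full period.

    states: list of equal-length bit strings; bit 0 is the rightmost character.
    Returns a list indexed by bit number.
    """
    n = len(states)
    width = len(states[0])
    factors = []
    for bit in range(width):
        pos = width - 1 - bit
        flips = sum(states[i][pos] != states[(i + 1) % n][pos] for i in range(n))
        factors.append(flips / n)
    return factors


binary = [format(i, "04b") for i in range(16)]
gray = [format(i ^ (i >> 1), "04b") for i in range(16)]
one_hot = [format(1 << i, "016b") for i in range(16)]
done = ["0"] * 15 + ["1"]  # one-cycle pulse every 16 cycles

print("binary ", activity_factors(binary))
print("gray   ", activity_factors(gray))
print("one-hot", activity_factors(one_hot))
print("done   ", activity_factors(done))
```

Multiplying each activity factor by the capacitance of the cells that drive the signal, then summing, gives the relative power for each option.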
Putting it all Together
                      subtotal cap   act fact   power
Gray    d()   PLAs
              Flops
        done  PLAs
              Flops
        Total
1-Hot   d()   PLAs
              Flops
        done  PLAs
              Flops
        Total
Binary  d()   PLAs
              Flops
        done  PLAs
              Flops
        Total
6.6 Clock Gating
The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn’t needed. This reduces the activity factor.
6.6.1 Introduction to Clock Gating
Examples of Clock Gating
Condition                                           Circuitry turned off
O/S in standby mode                                 Everything except “core” state (PC, registers, caches, etc)
No floating point instructions for k clock cycles   Floating point circuitry
Instruction cache miss                              Instruction decode circuitry
No instruction in pipe stage i                      Pipe stage i
6.6.2 Implementing Clock Gating
Clock gating is implemented by adding a component that disables the clock when the circuit isn’t needed.
[Figure: Without clock gating. Block diagram of a circuit with signals clk, i_data, i_valid, o_data, and o_valid.]
[Figure: With clock gating. Block diagram that adds a clock enable state machine, with signals clk, i_wakeup, clk_en, cool_clk, i_data, i_valid, o_data, and o_valid.]
6.6.3 Design Process
6.6.4 Effectiveness of Clock Gating
Parameters to characterize effectiveness of clock gating:
Eff = effectiveness of clock gating
PctValid = percentage of clock cycles with valid data in the circuit (the clock must be toggling)
PctClk = percentage of clock cycles that clock toggles
Effectiveness measures the percentage of clock cycles with invalid data in which the clock is turned off. Equation for effectiveness of clock gating:
\[ \text{Eff} = \frac{\text{PctClkOff}}{\text{PctInvalid}} = \frac{1 - \text{PctClk}}{1 - \text{PctValid}} \]
Clock Gating Effectiveness Questions
Question: What is the effectiveness if the clock toggles only when there is valid data?
Question: What is the effectiveness of a clock that always toggles?
Clock Gating Effectiveness Questions
Question: What does it mean for a clock gating scheme to be 75% effective?
Question: What happens if PctClk < PctValid?
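The effectiveness equation can be evaluated for the cases raised in the questions above. This is a minimal sketch with an illustrative function name; percentages are expressed as fractions between 0 and 1.

```python
def gating_effectiveness(pct_clk: float, pct_valid: float) -> float:
    """Eff = (1 - PctClk) / (1 - PctValid), with percentages as fractions."""
    if pct_valid >= 1.0:
        raise ValueError("Eff is undefined when the data is always valid")
    return (1.0 - pct_clk) / (1.0 - pct_valid)


# Example: data is valid 70% of the time
print(gating_effectiveness(0.7, 0.7))    # clock toggles only when data is valid
print(gating_effectiveness(1.0, 0.7))    # clock always toggles
print(gating_effectiveness(0.775, 0.7))  # clock is off for 75% of the invalid cycles
print(gating_effectiveness(0.6, 0.7))    # PctClk < PctValid
```

In the last case the result is greater than 1, which means the clock is off during some cycles that hold valid data, so the circuit would lose data.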
Effect of Effectiveness
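We can see the effect of the effectiveness of a clock-gating scheme on the activity factor:
[Figure: Plot of A′ versus Eff. A′ falls linearly from A at Eff = 0 to PctValid × A at Eff = 1.]
The new activity factor with a clock gating scheme is:
\[ A' = A - (1 - \text{PctValid}) \times \text{Eff} \times A \]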
6.6.5 Example: Reduced Activity Factor with Clock Gating
Question: How much power will be saved in the following clock-gating scheme?
• 70% of the time the main circuit has valid data
• clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock is off)
• clock gating circuit has 10% of the area of the main circuit
• clock gating circuit has same activity factor as main circuit
• neglect short-circuiting and leakage power
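One possible way to set up this calculation is sketched below. It rests on several assumptions that the question leaves open: capacitance is taken to be proportional to area, the clock gating circuit is assumed to be always clocked with the original (ungated) activity factor of the main circuit, and clock speed and supply voltage are assumed unchanged. The function names are illustrative.

```python
def gated_activity_factor(a, pct_valid, eff):
    """A' = A - (1 - PctValid) * Eff * A."""
    return a - (1.0 - pct_valid) * eff * a


def relative_power_with_gating(pct_valid, eff, gating_area_fraction):
    """Power with gating, relative to the ungated main circuit (which is 1.0)."""
    main = gated_activity_factor(1.0, pct_valid, eff)  # main circuit, reduced activity
    gating = gating_area_fraction * 1.0                # gating circuit, original activity
    return main + gating


relative_power = relative_power_with_gating(0.7, 0.9, 0.10)
print(f"relative power {relative_power:.2f}, saving {1.0 - relative_power:.0%}")
```

Under these assumptions the main circuit drops to 73% of its original power and the gating circuit adds 10%, for a net saving of about 17%.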
6.6.6 Clock Gating with Valid-Bit Protocol
6.6.6.1 Valid-Bit Protocol
Need a mechanism to tell circuit when to pay attention to data inputs
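[Figure: Block diagram of a circuit with signals clk, i_valid, i_data, o_valid, and o_data, and waveforms of clk, i_valid, and i_data carrying parcels α, β, γ.]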
Valid-Bit Protocol
[Figure: Block diagram of a circuit with signals clk, i_valid, i_data, o_valid, and o_data, and waveforms of clk, i_valid, i_data, o_data, and o_valid. Parcels α, β, γ appear on i_data and later on o_data.]
i_valid: high when i_data has valid data; signifies whether the circuit should pay attention to or ignore the data.
o_valid: high when o_data has valid data; signifies whether the environment should pay attention to the output of the circuit.
For more on circuit protocols, see section 2.12.
Microscopic Analysis
Which clock edges are needed?
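[Figure: Block diagram with signals i_valid, clk, and o_valid, and waveforms of clk, i_valid, and o_valid used to identify which clock edges are needed.]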
6.6.6.2 How Many Clock Cycles for Module?
Given a module with latency Lat, if the module receives a stream of NumPcls consecutive valid parcels, how many clock cycles must the clock-enable signal be asserted?
[Table to fill in: columns Latency, NumPcls, and NumClkEn; each row shows waveforms of i_valid, o_valid, and clk_en.]
6.6.6.3 Adding Clock-Gating Circuitry
Before Clock Gating
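[Figure: Block diagram with signals data_in, valid_in, clk, data_out, and valid_out, and waveforms of clk, data_in (α, β, γ, δ), valid_in, data_out (α, β, γ), and valid_out. Waveform labels: don’t care, uninitialized.]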
After Clock Gating: Circuitry
[Figure: Block diagram with a clock enable state machine and signals data_in, valid_in, hot_clk, wakeup_in, clk_en, cool_clk, data_out, valid_out, and wakeup_out.]
• hot_clk: clock that always toggles
• cool_clk: gated clock; sometimes toggles, sometimes stays low
• wakeup: alerts circuit that valid data will be arriving soon
• clk_en: turns on cool_clk
After Clock Gating: New Signals
[Figure: Waveforms of hot_clk, wakeup_in, clk_en, cool_clk, data_in (α, β, γ, δ), valid_in, data_out (α, β, γ), valid_out, and wakeup_out.]
6.6.7 Example: Pipelined Circuit with Clock-Gating
Design a “clock enable state machine” for the pipelined component described below.
• capacitance of pipelined component = 200
• latency varies from 5 to 10 clock cycles, even distribution of latencies
• contains a maximum of 6 instructions (parcels of data).
• 60% of incoming parcels are valid
• average length of continuous sequence of valid parcels is 80
• use input and output valid bits for wakeup
• leakage current is negligible
• short-circuit current is negligible
• LUTs have a capacitance of 1, flops have a capacitance of 2
Waveforms for Parcel Count
[Figure: Waveforms of i_valid, o_valid, parcel_count, and parcel_clk_en over clock cycles 1 to 24.]
Waveforms for Cycle Count
[Figure: Waveforms of i_valid, o_valid, cycle_count, and cycle_clk_en over clock cycles 1 to 24.]
Summary of Design Process
Outline:
1. sketch out circuitry for parcel count and cycle count state machine
2. estimate capacitance of each state machine
3. estimate activity factor of main circuit, based on behaviour
Parcel Count Design
Need to count (0..6) parcels, therefore need 3 bits for counter.
Counter must be able to increment and decrement.
Equations for counter action (increment/decrement/no-change):
i_valid  o_valid  action
0        0        no change
0        1        decrement
1        0        increment
1        1        no change
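The action table can be expressed as a next-state function for the parcel counter. The sketch below models behaviour only, not the circuit. The function name and the range checks are illustrative assumptions; the limit of 6 parcels comes from the problem statement.

```python
def next_parcel_count(count: int, i_valid: bool, o_valid: bool, max_parcels: int = 6) -> int:
    """Next value of the parcel counter, following the action table."""
    if i_valid and not o_valid:
        if count >= max_parcels:
            raise ValueError("more parcels than the component can hold")
        return count + 1  # a parcel entered and none left
    if o_valid and not i_valid:
        if count <= 0:
            raise ValueError("a parcel left an empty component")
        return count - 1  # a parcel left and none entered
    return count          # nothing happened, or one entered while one left


# Two parcels enter, then one leaves
count = 0
for i_valid, o_valid in [(True, False), (True, False), (False, True)]:
    count = next_parcel_count(count, i_valid, o_valid)
print(count)
```

Three bits are enough to hold the count because the value never exceeds 6.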
Chapter 7
Fault Testing and Testability
7.1 Faults and Testing
7.1.1 Overview of Faults and Testing
7.1.1.1 Faults
During manufacturing, faults can occur that make the physical product behave incorrectly.
Definition: A fault is a manufacturing defect that causes a wire, poly, diffusion, or via to either break or connect to something it shouldn’t.
[Figure: Good wires, shorted wires, and an open wire]
7.1.1.2 Causes of Faults
• Fabrication process (initial construction is bad)
– chemical mix, impurities, dust
• Manufacturing process (damage during construction)
– handling: probing, cutting, mounting
– materials: corrosion, adhesion failure, cracking, peeling
7.1.1.3 Testing
Definition Testing is the process of checking that the manufactured wafer/chip/board/system has the same functionality as the simulations.
7.1.1.4 Burn In
Definition Burn-in: The process of subjecting chips to extreme conditions (high and low temps, high and low voltages, high and low clock speeds) before and during testing.
[Figure: Soon to break wire]
7.1.1.5 Bin Sorting
Each chip (or wafer) is run at a variety of clock speeds. The chips are grouped and labeled (binned) by the maximum clock frequency at which they will work reliably.
For example, chips coming off of the same production line might be labelled as800MHz, 900MHz, and 1000MHz.
7.1.1.6 Testing Techniques
7.1.1.7 Design for Testability (DFT)
7.1.2 Example Problem: Economics of Testing
Note: There is a tradeoff between the amount of money spent on testing chips vs dealing with (e.g. replacing) faulty chips. Usually the best tradeoff is to ship chips with a small, but non-zero probability that the chip has a fault.
7.1.3 Physical Faults
7.1.3.1 Types of Physical Faults
[Figure: A good circuit with signals a, b, c, and d, alongside bad circuits showing each type of physical fault: open; wired-AND bridging short; wired-OR bridging short; stronger-wins bridging short (b is stronger); short to VDD; short to GND.]
7.1.3.2 Locations of Faults
Each segment of wire, poly, diffusion, via, etc is a potential fault location.
Different segments affect different gates in the fanout.
A potential fault location is a segment or segments where a fault at any position affects the same set of gates in the same way.
[Figure: Two examples of fault locations on signal b]
7.1.3.3 Layout Affects Locations
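[Figure: A circuit with signals a through i, and two alternative layouts of signal b fanning out to e, g, and h: one layout with fault locations L1 to L4 and one with fault locations L1 to L5.]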
7.1.3.4 Naming Fault Locations
Two ways to name a fault location:
pin-fault model Faults are modelled as occurring on input and output pins of gates.
net-fault model Faults are modelled as occurring on segments of wires.
In E&CE 327, we’ll use the net-fault model, because it is simpler to work with and is closer to what actually happens in hardware.
7.1.4 Detecting a Fault
To detect a fault, we compare the actual output of the circuit against the expected value.
7.1.4.1 Which Test Vectors will Detect a Fault?
Question: For the good circuit and faulty circuit shown below, which test vectors will detect the fault?
[Figure: Good circuit and faulty circuit, each with signals a, b, c, d, and e]
Answer:
a  b  c  good  faulty
0  0  0  0     0
0  0  1  1     1
0  1  0  0     0
0  1  1  1     1
1  0  0  0     0
1  0  1  1     1
1  1  0  1     0
1  1  1  1     1
Sometimes multiple test vectors will catch the same fault.
Sometimes a single test vector can catch multiple faults.
[Figure: The good circuit and the circuit with another fault, each with signals a, b, c, d, and e]
a  b  c  good  faulty
1  1  0  1     0      ←
The test vector 110 can catch both this fault and the previous one.
Note: Detect vs. diagnose. Testing detects faults. Testing does not diagnose which fault occurred.
7.1.5 Mathematical Models of Faults
Goal: develop reliable and predictable technique for detecting faults in circuits.
Observations:
• The possible faults in a circuit are dependent upon the physical layout of the circuit.
• A very wide variety of possible faults
• A single test vector can catch many different faults
Need: a mathematical model for faults that is abstracted from complexities of circuit layout and plethora of possible faults, yet still detects most or all possible faults.
7.1.5.1 Single Stuck-At Fault Model
Two simplifying assumptions:
1. A maximum of one fault per tested circuit (hence “single”)
2. All faults are either:
(a) stuck-at 1: short to VDD
(b) stuck-at 0: short to GND
hence, “stuck at”
Example of Stuck-At Faults
[Figure: Example circuit with signals a, b, c, d, and i]
Question: If we consider all possible stuck-at faults, how many faulty circuits would we need to test for?
Question: If we consider only single-stuck-at faults, how many faulty circuits would we need to test for?
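The two counts can be written as functions of the number of fault locations. The sketch below is illustrative: the function names are assumptions, and the number of locations in the circuit above is left as a parameter. It assumes each location is independently either fault-free, stuck-at 0, or stuck-at 1.

```python
def single_stuck_at_faulty_circuits(locations: int) -> int:
    """Exactly one location is faulty, stuck at either 0 or 1."""
    return 2 * locations


def multiple_stuck_at_faulty_circuits(locations: int) -> int:
    """Every location is good, stuck-at 0, or stuck-at 1; exclude the all-good circuit."""
    return 3 ** locations - 1


for n in (4, 8, 16):
    print(n, single_stuck_at_faulty_circuits(n), multiple_stuck_at_faulty_circuits(n))
```

The single-fault count grows linearly with the number of locations, while the count for arbitrary combinations of faults grows exponentially, which is the motivation for the single stuck-at assumption.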
7.1.6 Generate Test Vector to Find a Mathematical Fault
Faults are detected by stimulating circuits (real, manufactured circuit, not a simulation!) with test-vectors and checking that the real circuit gives the correct output.
7.1.6.1 Algorithm
1. compute Karnaugh map for correct circuit
2. compute Karnaugh map for faulty circuit
3. find region of disagreement
4. any assignment in region of disagreement is a test vector that will detect fault
5. any assignment outside of region of disagreement will result in same output on both correct and faulty circuit
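The algorithm above can be sketched in software by enumerating all input assignments instead of drawing Karnaugh maps. The `good` and `faulty` functions below are assumptions inferred from the truth table in section 7.1.4.1 (good = ab + c, faulty = c); the function names are illustrative.

```python
from itertools import product


def detecting_vectors(good, faulty, num_inputs):
    """Region of disagreement: input assignments where the two circuits differ."""
    return [
        bits
        for bits in product((0, 1), repeat=num_inputs)
        if good(*bits) != faulty(*bits)
    ]


def good(a, b, c):
    return (a & b) | c  # inferred from the "good" column of the truth table


def faulty(a, b, c):
    return c            # inferred from the "faulty" column of the truth table


print(detecting_vectors(good, faulty, 3))
```

For this pair of circuits the region of disagreement is the single assignment a=1, b=1, c=0, which matches the truth table.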
7.1.6.2 Example of Finding a Test Vector
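[Figure: Good circuit and faulty circuit, each with signals a, b, c, d, and e, and blank Karnaugh maps over a, b, and c for each circuit]
Question: Find a test vector that will detect the faulty circuit
[Figure: Blank Karnaugh map over a, b, and c for the region of disagreement]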
7.1.7 Undetectable Faults
Not all faults are detectable.
1. If a circuit is irredundant then all single stuck-at faults can be detected.
A redundant circuit is one where one or more gates can be removed without affecting the functional behaviour.
2. If not trying to find all of the faults in a circuit, then a fault that you aren’t looking for can mask a fault that you are looking for.
7.1.7.1 Redundant Circuitry
Some faults are undetectable. Undetectable stuck-at faults are located in redundant parts of a circuit.
Timing Hazards
[Figure: Waveforms of a static hazard and a dynamic hazard]
Timing hazards are often removed by adding redundant circuitry.
Redundant Circuitry
[Figure: Irredundant circuit with signals a, b, c, d, e, f, and g, annotated with signal transitions]
[Figure: Illustration of timing hazard. Waveforms of a, b, c, d, e, f, and g.]
Glitch on g is caused because the AND gate for e turns off before f turns on.
Redundant Circuitry
Question: Add one or more gates to the circuit so that the static hazard is guaranteed to be prevented, independent of the delay values through the gates
[Figure: Blank Karnaugh maps over a, b, and c, and the circuit with signals a, b, c, d, e, f, and g, annotated with signal transitions]
Redundant Circuitry
Question: Has the redundant circuitry introduced any undetectable faults? If so, identify an undetectable fault.
7.1.7.2 Curious Circuitry and Fault Detection
Curiously, the stuck-at fault at L1 is undetectable, but faults at either L2 or L3 are detectable.
[Figure: Circuit with inputs a, b, c, output z, and fault locations L1, L2, and L3; a reduced circuit with inputs a, c and output z; and a Karnaugh map over a, b, and c]
fault  eqn        K-map  diff w/ ckt
L2@0   a⊕(b⊕c)
L2@1   a⊕(b⊕c)
7.2 Test Generation
7.2.1 A Small Example
[Figure: Circuit with inputs a, b, c, output z = ab + bc, and fault locations L2, L4, and L5, with its Karnaugh map over a, b, and c]
fault     eqn  K-map  diff w/ ckt  test vectors
1) L2@1
2) L4@1
3) L5@1
7.2.2 Choosing Test Vectors
The goal of test vector generation is to find the smallest set of test vectors that will detect the faults of interest.
Test vector generation requires analyzing the faults.
We can simplify the task of fault analysis by reducing the number of faults that we have to analyze.
Smith has examples of this in Figures 14.13 and 14.14.
7.2.2.1 Fault Domination
fault     eqn     K-map  Diff w/ ckt  test vectors
1) L5@1   ab + c                      101, 001
2) L6@1   1                           101, 001, 100, 010, 000
Definition dominates: f1 dominates f2: any test vector that detects f1 will also detect f2.
When choosing test vectors, we can ignore the dominated fault, but must keep the dominant fault.
Question: To detect both L5@1 and L6@1, can we ignore one of the faults?
Question: What would happen if we ignored the “wrong” fault?
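Using the definition above, domination is a subset test on the sets of detecting test vectors, and equivalence is set equality. The sketch below uses the test vectors listed in the table for L5@1 and L6@1; the function names are illustrative.

```python
def dominates(tests_f1, tests_f2):
    """f1 dominates f2: every test vector that detects f1 also detects f2."""
    return set(tests_f1) <= set(tests_f2)


def equivalent(tests_f1, tests_f2):
    """f1 and f2 are detected by exactly the same test vectors."""
    return set(tests_f1) == set(tests_f2)


l5_at_1 = {"101", "001"}
l6_at_1 = {"101", "001", "100", "010", "000"}

print(dominates(l5_at_1, l6_at_1))  # L5@1 dominates L6@1
print(dominates(l6_at_1, l5_at_1))  # L6@1 does not dominate L5@1

# Vectors that detect L6@1 but would miss L5@1
print(sorted(l6_at_1 - l5_at_1))
```

Keeping L5@1 and ignoring L6@1 is safe, because any vector chosen for L5@1 also detects L6@1. Ignoring L5@1 instead could lead to choosing 100, 010, or 000, which detect L6@1 but not L5@1.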
7.2.2.2 Fault Equivalence
fault     eqn  K-map  Diff w/ ckt
1) L1@1   b
2) L3@1   b
Definition fault equivalence: f1 is equivalent to f2: f1 and f2 are detected by exactly the same set of test vectors. That is, all of the test vectors that detect f1 will also detect f2, and vice versa.
When choosing test vectors we can ignore one of the faults and just include the other.
7.2.2.3 Gate Collapsing
A stuck-at-1 fault on the input to an OR gate is equivalent to a stuck-at-1 fault on the output of the OR gate.
Definition Gate collapsing: The technique of looking at the functionality of a gate and finding equivalent faults between inputs and outputs.
Sets of collapsible faults for common gates
AND: stuck-at 0 on either input and stuck-at 0 on the output
OR: stuck-at 1 on either input and stuck-at 1 on the output
Question: What is the set of collapsible faults for a NAND gate?
[Figure: NAND gate]
7.2.2.4 Node Collapsing
Note: Node collapsing is relevant only for the pin-fault model
7.2.2.5 Fault Collapsing Summary
When calculating the test-vectors to detect a set of faults, apply the fault collapsing techniques of:
• gate collapsing
• node collapsing (if using pin-fault model)
• general fault equivalence (intelligent collapsing)
• fault domination
to reduce the number of faults that you must examine.
7.2.3 Fault Coverage
Definition Fault coverage: percentage of detectable faults that are detected by a set of test vectors.
\[ \text{FaultCoverage} = \frac{\text{DetectedFaults}}{\text{DetectableFaults}} \]
Some people’s definition of fault coverage has a denominator of AllPossibleFaults, not just those that are detectable.
7.2.4 Test Vector Generation and Fault Detection
There are two ways to generate vectors and check results: built-in tests and scan testing.
Both require:
• generate test vectors
• override normal datapath to send test-vectors, rather than normal inputs, as inputs to flops
7.2.5 Generate Test Vectors for 100% Coverage
In this section we will find the test vectors to achieve 100% coverage of single stuck-at faults for the circuit of the day.
We will use a simple algorithm; there are much more sophisticated algorithms that are more efficient.
The problem of test vector generation is often called Automatic Test Pattern Generation (ATPG) and continues to be an active area of research.
[Figure: Example Circuit with Fault Locations and Karnaugh Map. Inputs a, b, c; output z = ab + bc; fault locations L1 to L8.]
7.2.5.1 Collapse the Faults
Initial circuit with potential faults:
[Figure: Circuit with inputs a, b, c and output z, with potential faults L1@0,1 through L8@0,1 marked at each fault location]
Gate Collapsing
gate   faults   kept fault

For each set of equivalent faults, we will keep the fault shown in bold and eliminate the other faults. A good heuristic for choosing which fault to keep: keep the fault closest to the output. The closer a fault is to the output, the easier it is to analyze its behaviour, because the equation for the output will be simpler.
Intelligent Collapsing
1. delete faults that previously decided could be ignored
2. by intelligent analysis of circuit, find equivalent faults
[Figure: Circuit with inputs a, b, c and output z, with faults L1@0,1 through L8@0,1 marked at each fault location]
7.2.5.2 Check for Fault Domination
fault     eqn      K-map  Diff w/ ckt
1) L2@1   a + c
2) L3@1   b
3) L4@1   a + bc
4) L5@1   ab + c
5) L6@0   bc
6) L7@0   ab
7) L8@0   0
8) L8@1   1
Remove dominated faults
Current faults:
[Figure: Circuit with inputs a, b, c and output z, with faults L1@0,1 through L8@0,1 marked at each fault location]
Dominated faults:
7.2.5.3 Required Test Vectors
Definition required test vector: A test vector tv is required if there is a fault for which tv is the only test vector that will detect the fault.
fault     eqn      K-map  Diff w/ ckt
1) L3@1   b
2) L4@1   a + bc
3) L5@1   ab + c
4) L6@0   bc
5) L7@0   ab
7.2.5.4 Faults Not Covered by Required Test Vectors
fault     eqn      K-map  Diff w/ ckt
1) L4@1   a + bc
2) L5@1   ab + c
Test vector(s) required to catch these faults:
7.2.5.5 Order to Run Test Vectors
The order in which the test vectors are run is important because it can affect how long a faulty chip stays in the tester before the chip’s fault is detected.
The first vector to run should be the one that detects the most faults.
Build a table for which faults each test vector will detect.
                 Test Vector (abc)
fault            110  010  011  101
 1) L1@0          1
 2) L1@1               1
 3) L2@0          1         1
 4) L2@1                         1
 5) L3@0                    1
 6) L3@1               1
 7) L4@0          1
 8) L4@1                         1
 9) L5@0                    1
10) L5@1                         1
11) L6@0          1
12) L6@1               1         1
13) L7@0                    1
14) L7@1               1         1
15) L8@0          1         1
16) L8@1               1         1
Faults detected   5    5    5    6
7.2.5.6 Summary of Technique to Find and Order Test Vectors
1. identify all possible faults
2. gate collapsing
3. node collapsing
4. intelligent collapsing
5. fault domination
6. determine required test vectors
7. choose minimal set of test vectors to detect remaining faults
8. order test vectors based on number of faults detected (NOTE: when iterating through this step, need to take into account faults detected by earlier test vectors)
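Parts of the technique above (identifying faults, finding the test vectors for each fault, and ordering the vectors) can be sketched as a brute-force fault simulation. The wiring in `circuit` is an assumption inferred from the fault equations in this section: L1 = a, L2 = stem of b, L3 = c, L4 and L5 = the branches of b into the two AND gates, L6 and L7 = the AND gate outputs, L8 = z. The greedy ordering is one simple choice, not necessarily the method intended in the course.

```python
from itertools import product

LOCATIONS = ["L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8"]
VECTORS = ["".join(map(str, bits)) for bits in product((0, 1), repeat=3)]


def circuit(a, b, c, faults=None):
    """z = ab + bc with optional stuck-at faults, e.g. {"L4": 1} (assumed wiring)."""
    faults = faults or {}

    def net(name, value):
        return faults.get(name, value)  # a stuck-at fault overrides the net value

    a = net("L1", a)
    b = net("L2", b)
    c = net("L3", c)
    and1 = net("L6", a & net("L4", b))
    and2 = net("L7", net("L5", b) & c)
    return net("L8", and1 | and2)


def detecting_vectors(faults):
    """Test vectors (as 'abc' strings) whose output differs from the good circuit."""
    found = set()
    for vector in VECTORS:
        a, b, c = (int(ch) for ch in vector)
        if circuit(a, b, c) != circuit(a, b, c, faults):
            found.add(vector)
    return found


ALL_FAULTS = {
    f"{loc}@{val}": detecting_vectors({loc: val})
    for loc in LOCATIONS
    for val in (0, 1)
}


def order_vectors(vectors, faults):
    """Greedy order: repeatedly run the vector that detects the most remaining faults."""
    remaining = dict(faults)
    candidates = list(vectors)
    ordered = []
    while candidates and remaining:
        best = max(candidates, key=lambda v: sum(v in t for t in remaining.values()))
        if not any(best in tests for tests in remaining.values()):
            break  # no candidate detects anything that is left
        ordered.append(best)
        candidates.remove(best)
        remaining = {f: t for f, t in remaining.items() if best not in t}
    return ordered, remaining


ordered, undetected = order_vectors(["110", "010", "011", "101"], ALL_FAULTS)
print(ordered, undetected)
```

With this assumed wiring, the four vectors 110, 010, 011, and 101 detect all sixteen single stuck-at faults, and the greedy order starts with 101 because it detects six faults on its own.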
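7.2.6 One Fault Hiding Another
[Figure: Circuit with inputs a, b, c, output z, and fault locations L1 to L8]
Assume that we are not trying to detect all faults: L1 is viewed as not being at risk for faults, but L3 is at risk for faults.
[Figure: Two copies of the circuit with inputs a, b, c and output z, showing only fault locations L1 and L3]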
Fault Hiding
[Figure: Two copies of the circuit with inputs a, b, c and output z, showing fault locations L1 and L3]
Problem: If L1 is stuck-at 1, the test vectors that normally detect L3@0 will not detect L3@0.
In the presence of other faults, the set of test vectors to detect a fault will change.
fault(s)     eqn  K-map  Diff w/ ckt
L3@0         ab
L1@1, L3@0   b
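The hiding effect can be reproduced with a small fault simulation. The sketch below assumes the circuit is z = ab + bc with L1 on input a and L3 on input c, which is consistent with the equations in the table above; the function names are illustrative.

```python
from itertools import product


def circuit(a, b, c, faults=None):
    """z = ab + bc, with optional stuck-at faults on L1 (input a) and L3 (input c)."""
    faults = faults or {}
    a = faults.get("L1", a)
    c = faults.get("L3", c)
    return (a & b) | (b & c)


def detecting_vectors(faults):
    """Test vectors (as 'abc' strings) whose output differs from the good circuit."""
    found = set()
    for a, b, c in product((0, 1), repeat=3):
        if circuit(a, b, c) != circuit(a, b, c, faults):
            found.add(f"{a}{b}{c}")
    return found


alone = detecting_vectors({"L3": 0})
hidden = detecting_vectors({"L1": 1, "L3": 0})
print(sorted(alone), sorted(hidden))
```

Under this assumption, the only vector that detects L3@0 on its own is 011, and that vector produces the correct output when L1 is also stuck at 1. A test set built only for L3@0 would therefore pass a chip that has both faults.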
7.3 Scan Testing in General
7.3.1 Structure and Behaviour of Scan Testing
[Figure: Normal Circuit. The circuit under test together with another circuit #0 and another circuit #1, with signals data_in(0) to data_in(3) and zeta_in(0) to zeta_in(3).]
circuitundertest
anot
her
circ
uit
yet a
noth
er c
ircui
t
mode0 scan_in0
scan_out0
mode1 scan_in1
scan_out1
scan
cha
in 0
scan
cha
in 1
Circuit with Scan Chains Added
606 CHAPTER 7. FAULT TESTING AND TESTABILITY
7.3.2 Scan Chains
[Figure: Normal Circuit, with signals data_in(3..0) and zeta_in(3..0)]
[Figure: Circuit with Scan Chains Added, with mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1]
7.3.2.1 Circuitry in Normal and Scan Mode
[Figure: Normal Mode]
[Figure: Scan Mode]
7.3.2.2 Scan in Operation
[Figure: Circuit under test with scan chains]
[Figure: Waveform for the sequence of load; test; unload (clk, mode0, scan_in0, scan_out0, scan_in1, scan_out1)]
[Figure: Load Test Vector (1 cycle per bit)]
[Figure: Run Test Vector Through Circuit]
[Figure: Unload Result (1 cycle per bit)]
Unload and Load at the Same Time
[Figure: Unload Prev Result, Load Cur Test Vector (1 cycle per bit)]
[Figure: Run Cur Test Vector Through Circuit]
[Figure: Unload Cur Result, Load New Test Vector (1 cycle per bit)]
[Figure: Waveform for the sequence of load; run; unload]
7.3.2.3 Scan in Operation with Example Circuit
[Figure: Circuit under test, with signals a, b, c, d, z, y]
[Figure: Circuit under test with scan test circuitry (mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1)]
[Figure sequence: each panel shows the circuit with the clk and mode0 waveforms as the test vector (α, β, γ, δ) is loaded and run, and as its results are unloaded while the next test vector (α’, β’, γ’, δ’) is loaded]
1. Start Loading Test Vector (Load δ)
2. Load γ
3. Load β
4. Load α
5. Run Test Vector
6. Test Values Propagate
7. Flop-In Result, Start (Un)loading Test Vector
8. Continue (Un)loading Test Vector
9. Continue (Un)loading Test Vector
10. Finish (Un)loading Test Vector
11. Run Next Test Vector
7.3.3 Summary of Scan Testing
• Adding scan circuitry
1. Registers around circuit to be tested are grouped into scan chains
2. Replace each flop with mux + flop
3. Flops and muxes wired together into scan chains
4. Each scan chain is connected to dedicated I/O pins for loading and unloading test vectors
• Running test vectors
1. Put scan chain in “scan” mode
2. Load in test vector (one element of vector per clock cycle)
3. Put scan chain in “normal” mode
4. Run circuit for one clock cycle — load result of test into flops
5. Unload results of current test vector while simultaneously loading in next test vector (one element of vector per clock cycle)
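The five steps for running test vectors can be modelled in a few lines. The sketch below is a simplified behavioural model under stated assumptions: a single scan chain whose flops both drive the circuit and capture its outputs, and a made-up three-bit `toy_circuit`. None of these names come from the slides.

```python
def scan_test(circuit, vectors):
    """Run test vectors through `circuit` using one scan chain.

    Flop 0 is next to scan_in; the last flop drives scan_out.
    Returns (results, cycles).
    """
    n = len(vectors[0])
    chain = [0] * n
    cycles = 0
    results = []

    def shift(bits_in):
        """Scan mode: one clock cycle per bit; returns bits seen on scan_out."""
        nonlocal chain, cycles
        bits_out = []
        for bit in bits_in:
            bits_out.append(chain[-1])
            chain = [bit] + chain[:-1]
            cycles += 1
        return bits_out

    # Load the first vector; the last bit shifted in ends up in flop 0.
    shift(vectors[0][::-1])
    for k in range(len(vectors)):
        # Normal mode for one clock cycle: flops capture the circuit outputs.
        chain = list(circuit(chain))
        cycles += 1
        # Scan mode: unload this result while loading the next vector.
        nxt = vectors[k + 1] if k + 1 < len(vectors) else [0] * n
        results.append(shift(nxt[::-1])[::-1])
    return results, cycles


def toy_circuit(s):
    """Hypothetical 3-input, 3-output combinational circuit."""
    return [s[0] & s[1], s[1] | s[2], s[0] ^ s[2]]


results, cycles = scan_test(toy_circuit, [[1, 1, 0], [0, 1, 1]])
print(results, cycles)  # [[1, 1, 1], [0, 1, 1]] 11
```

The model spends 3 cycles on the initial load and 1 + 3 cycles per vector, 11 cycles in total for two vectors on a chain of three flops.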
7.3.4 Time to Test a Chip
If the length (number of flops) of a scan chain is n, then it takes 2n+1 clock cycles to run a single test: n clock cycles to scan in the test vector, 1 clock cycle to execute the test vector, and n cycles to scan out the results. Once the results are scanned out, they can be compared to the expected results for a correctly working circuit.
If we run 2 or more tests (and chips generally are subjected to hundreds of thousands of tests), then we speed things up by scanning in the next test vector while we scan out the previous result.
ScanLength = number of flip flops in a scan chain
NumVectors = number of test vectors in test suite
TimeScan = number of clock cycles to run test suite
         = NumVectors × (ScanLength + 1) + ScanLength
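One way to read the formula is as a single initial load followed, for every vector, by one run cycle and one combined unload/load pass. A short derivation under that reading, assuming a single scan chain:

```latex
\begin{align*}
\text{TimeScan}
  &= \underbrace{\text{ScanLength}}_{\text{load first vector}}
   + \text{NumVectors} \times
     \bigl(\underbrace{1}_{\text{run}}
         + \underbrace{\text{ScanLength}}_{\text{unload result, load next vector}}\bigr) \\
  &= \text{NumVectors} \times (\text{ScanLength} + 1) + \text{ScanLength}
\end{align*}
```

With NumVectors = 1 and ScanLength = n this gives 2n + 1, which agrees with the single-test count.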
7.3.4.1 Example: Time to Test a Chip
An 800 MHz chip has scan chains of length 20,000 bits, 18,000 bits, 21,000 bits, 22,000 bits, and two of 15,000 bits.
500,000 test vectors are used for each scan chain.
The tests are run at 80% of full speed.
Question: Calculate the total test time.
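One possible way to set up the calculation is sketched below. It assumes that all scan chains are loaded and unloaded in parallel, so the longest chain determines the cycle count; if the chains were tested one after another, the per-chain times would be summed instead. The variable names are illustrative.

```python
def time_scan(num_vectors, scan_length):
    """Clock cycles to run a test suite through one scan chain."""
    return num_vectors * (scan_length + 1) + scan_length


chain_lengths = [20_000, 18_000, 21_000, 22_000, 15_000, 15_000]
num_vectors = 500_000
test_clock_hz = 0.80 * 800e6          # tests run at 80% of 800 MHz

# Assumption: chains operate in parallel, so the longest one dominates.
cycles = time_scan(num_vectors, max(chain_lengths))
seconds = cycles / test_clock_hz

print(cycles)             # 11000522000
print(round(seconds, 2))  # 17.19
```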
7.4 Boundary Scan and JTAG
Boundary scan originated as a technique to test wires on printed circuit boards (PCBs).
The goal was to replace “bed-of-nails” style testing with a technique that would work for high-density PCBs (lots of small wires close together).
Now used to test both boards and chip internals.
Used both on boundaries (I/O pins) and internal flops.
Boundary Scan with JTAG
Standardized by IEEE (1149) and previously by JTAG:
• 4 required signals (Scan Pins: TDI, TDO, TCK, TMS)
• 1 optional signal (Scan Pin: TRST)
• protocol to connect circuit under test to tester and other circuits
• state machine to drive test circuitry on chip
• Boundary Scan Description Language (BSDL): structural language used to describe which features of JTAG a circuit supports
JTAG circuitry is now commonly built into FPGAs and ASICs, or is part of a cell library. Rarely is a JTAG circuit custom-built as part of a larger part. So, you’ll probably be choosing and using JTAG circuits, not constructing new ones.
Using JTAG circuitry is usually done by giving a description of your printed circuit board (PCB) and the JTAG components on each chip (in BSDL) to test generation software. The software then generates a sequence of JTAG commands and data that can be used to test the wires on the circuit board for opens and shorts.
JTAG Structure
[Figure: High-level view. A chip containing the circuit under test, scan registers on the normal input and output pins, and control logic driven by TDI, TDO, TCK, TMS.]
[Figure: Detailed view. Boundary scan cells (BSC) forming the boundary scan register (BSR), a bypass register (BR), an instruction register (IR) with instruction register cells (IRC), IDCODE, the instruction decoder, and the TAP controller, connected to TDI, TDO, TCK, TMS.]
7.4.1 Scan Instructions
This is the set of required instructions; other instructions are optional.
EXTEST    Test board-level interconnect. Drive output pins of chip with hard-coded test vector. Sample results on inputs.
SAMPLE    Sample result data
PRELOAD   Load test vector
BYPASS    Directly connect TDI to TDO. This is used when several chips are daisy chained together to skip loading data into some chips.
IDCODE    Output manufacturer and part number
7.5 Built In Self Test
7.5.1 Block Diagram
[Figure: Circuit in Normal Mode. Inputs i_data(3..0), mode, test generator, circuit under test with inputs d(3..0) and outputs o_data(2..0), result checker, all_ok.]
[Figure: Circuit in Test Mode]
[Figure: Circuit w/ BIST in Normal Mode. Test generator (test gen LFSR), circuit under test, signature analyzers 0 to 2 producing ok(0) to ok(2), result checker producing all_ok.]
[Figure: Circuit w/ BIST in Test Mode]
7.5.1.1 Components
Test Generator
[Figure: BIST block diagram with the test generator highlighted]
• generates a pseudo-random set of test vectors
• for n output bits, generates all vectors from 1 to 2^n − 1 in a pseudo-random order
• built with a linear-feedback shift register (shift-register portion is the input flops)
[Figure: test generator with outputs q2, q1, q0]
Question: Why not just use a counter to generate 1..2^n − 1?
Signature Analyzer
[Figure: BIST block diagram with the signature analyzers highlighted]
• checks that the output it is examining has the correct results for the complete set of tests that are run
• only has a meaningful result at the end of the entire test sequence
• built with a linear-feedback shift register
• similar to a hash function or a lossy compression function
• if there are no faults, the signature analyzer will definitely say “ok” (no false negatives)
• if there is a fault, the signature analyzer might say “ok” or might say “bad” (false positives are possible)
• design tradeoff: more accurate signature analyzers require more hardware
Result Checker
[Figure: BIST block diagram with the result checker highlighted]
• signature analyzers output “ok”/“bad” on every clock cycle, but the result is only meaningful at the end of running the complete set of test vectors
• the result checker looks at test vector inputs to detect the end of the test suite and outputs “all ok” if all signature analyzers report “ok” at that moment
• implemented as an AND gate
7.5.1.2 Linear Feedback Shift Register (LFSR)
Basically, a shift register (sequence of flip-flops) with the output of the last flip-flop fed back into some of the earlier flip-flops with XOR gates.
Design parameters:
• number of flip-flops
• external or internal XOR
• feedback taps (coefficients)
• external-input or self-contained
• reset or set
[Figure: LFSR Example. Three flops (d0/q0, d1/q1, d2/q2) with input i and reset.]
Example LFSRs
[Figure: External-XOR, input, reset]
[Figure: External-XOR, no input, set]
[Figure: Internal-XOR, input, set]
[Figure: Internal-XOR, input, reset]
In E&CE 327, we use internal-XOR LFSRs, because the circuitry matches the mathematics of Galois fields.
External-XOR LFSRs work just fine, but they are more difficult to analyze, because their behaviour can’t be treated as Galois fields.
7.5.1.3 Maximal-Length LFSR
Definition maximal-length linear feedback shift register: An LFSR that outputs a pseudo-random sequence of all representable bit-vectors except 0...00.
Definition pseudo random: The same elements in the same order every time, but the relationship between consecutive elements is apparently random.
Maximal-length linear feedback shift registers are used to generate test vectors for built-in self test.
Maximal-Length LFSR Circuits
The figures below illustrate the two maximal-length internal-XOR linear feedback shift registers that can be constructed with 3 flops.
[Figure: Maximal-length internal-XOR LFSR (3 flops, set input)]
[Figure: Maximal-length internal-XOR LFSR (3 flops, set input)]
Question: Why do maximal-length LFSRs not generate the test vector 0...00?
Maximal Length LFSR Characteristics
Maximal-length LFSRs:
• set to all 1s initially
• self contained (no external i input)
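A small simulation can show what "maximal length" means in practice. The sketch below models an internal-XOR LFSR under an assumed wiring convention (the feedback from the last flop is XORed into the input of flop k when the polynomial has an x^k term); `lfsr_step`, `period`, and the tap lists are illustrative names, not taken from the slides.

```python
def lfsr_step(state, taps, i=0):
    """One clock of an internal-XOR LFSR.

    state: [q0, q1, ..., q(l-1)]
    taps:  [c0, c1, ..., c(l-1)], the coefficients of x^0 .. x^(l-1)
           in the characteristic polynomial (the x^l term is implied)
    i:     external input bit (0 for a self-contained LFSR)
    """
    fb = state[-1]                       # feedback from the last flop
    nxt = [i ^ (fb & taps[0])]
    for k in range(1, len(state)):
        nxt.append(state[k - 1] ^ (fb & taps[k]))
    return nxt


def period(taps):
    """Number of clocks before the all-1s initial state repeats."""
    start = [1] * len(taps)
    state, n = lfsr_step(start, taps), 1
    while state != start:
        state, n = lfsr_step(state, taps), n + 1
    return n


print(period([1, 1, 0]))  # x^3 + x + 1       -> 7 (maximal for 3 flops)
print(period([1, 0, 1]))  # x^3 + x^2 + 1     -> 7 (maximal for 3 flops)
print(period([1, 1, 1]))  # x^3 + x^2 + x + 1 -> 4 (not maximal)
```

With three flops there are 2^3 − 1 = 7 non-zero states, so a period of 7 means every non-zero vector is visited once per pass.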
[Figure: Timing diagram for a 3-flop maximal-length LFSR over clock cycles 1 to 8 (signals clk, reset, d0, q0, d1, q1, q2); the val row reads 7, 6, 4, 1, 2, 5, 3, 7.]
7.5.2 Test Generator
[Figure: BIST block diagram with the test generator highlighted]
The test generator component is a maximal-length LFSR ...
[Figure: 3-flop maximal-length LFSR with set input]
Test Generator
The test generator component is a maximal-length LFSR with multiplexors on the inputs to each flip-flop. In test mode, the data input on each flip flop is connected to the output of the previous flip flop. In normal mode, the input of each flip flop is connected to the environment.
[Figure: maximal-length LFSR with a multiplexor on each flop input, controlled by mode, with environment inputs i_d(0), i_d(1), i_d(2) and outputs q0, q1, q2]
[Figure: A test generator, reset not shown]
7.5.3 Signature Analyzer
The following things change between different signature analyzers:
• number of flops (⇑ flops ⇒ ⇑ area, ⇑ accuracy)
• choice of feedback taps: a good choice can improve accuracy (more isn’t necessarily better)
• bubbles on input to AND gate for “ok”: determined by expected result from simulating test sequence through circuit under test and LFSR of analyzer.
[Figure: BIST block diagram with the signature analyzers highlighted]
Signature Analyzer
This circuit:
• Two flops; most analyzers use more. The HP boards in the 1970s used 37 flops!
• Feedback taps on both flops. Different signature analyzers have different configurations of feedback taps.
• Also contains “ok” tester (AND gate). Expected output of LFSR at end of test sequence is: q0=0 and q1=1, or 01. (We know this because of the bubble on the AND gate. To see why this is the expected output of the signature analyzer, we would need to know the correct sequence of outputs of the circuit under test.)
[Figure: two-flop signature analyzer with input i, flops d0/q0 and d1/q1, reset, and output ok]
Signature Analyzer
[Figure: timing diagram template with clk, reset, input sequence i6 ... i0, d0, d1, and q0 = q1 = 0 after reset]
Signature Analyzer Timing
[Figure: timing diagram of the two-flop signature analyzer. After reset, q0 = q1 = 0; on each clock edge q0 and q1 take the values that d0 and d1 had in the previous cycle.]
Notation: 356 = i3⊕i5⊕i6, 2356 = i2⊕i3⊕i5⊕i6, etc.
i     d0         d1
i6    i6         0
i5    i5         i6
i4    i4⊕i6      i5⊕i6
i3    356        i4⊕i5
i2    245        346
i1    1346       2356
i0    02356      1245
7.5.4 Result Checker
[Figure: BIST block diagram with the result checker highlighted]
The purpose of the result checker is to check the “ok” circuit at the end of the test sequence.
[Figure: result checker with inputs q0, q1, q2, reset, and ok, and output all_ok]
7.5.5 Arithmetic over Binary Fields
• Galois Fields!
• Two operations: “+” and “×”
• Two values: 0 and 1
• Bit vectors and shift-registers are written as polynomials in terms of x.
Addition
+ represents XOR
expression   result
0 + 0        0
0 + 1        1
1 + 0        1
1 + 1        0
x + x        0
Multiplication
× represents concatenating shift registers
expression   result
x^4 × 1      x^4
x^2 × x^3    x^5
Example
Calculate (x^3 + x^2 + 1) × (x^2 + x)
x^2 × (x^3 + x^2 + 1) = x^5 + x^4       + x^2
x   × (x^3 + x^2 + 1) =       x^4 + x^3       + x
                sum   = x^5       + x^3 + x^2 + x
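The same multiplication can be checked mechanically. In the sketch below a polynomial is stored as an integer whose bit k is the coefficient of x^k; this encoding and the name `gf2_mul` are choices made for illustration.

```python
def gf2_mul(a, b):
    """Multiply two polynomials over GF(2).

    Bit k of each integer is the coefficient of x^k.
    Addition of partial products is XOR, so there are no carries.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a   # add (XOR) the current shifted copy of a
        a <<= 1           # multiply a by x
        b >>= 1
    return result


a = 0b1101   # x^3 + x^2 + 1
b = 0b110    # x^2 + x
print(bin(gf2_mul(a, b)))  # 0b101110, i.e. x^5 + x^3 + x^2 + x
```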
7.5.6 Shift Registers and Characteristic Polynomials
Each linear feedback shift register has a corresponding characteristic polynomial.
From polynomials to hardware:
• The maximum exponent denotes the number of flops
• The other exponents denote the flops that tap off of feedback line from last flop
• From the characteristic polynomial, we cannot determine whether the shift register has an external input. Stated another way, two shift registers that are identical except that one has an external input and the other does not will have the same characteristic polynomial.
Shift Regs and Polynomials
[Figures: example shift registers, each labelled with its characteristic polynomial]
p(x) = x^3
p(x) = x^3 + x
p(x) = x^3 + 1
p(x) = x^3 + x + 1
p(x) = x^3 + x^2 + x + 1
p(x) = x^4 + x^3 + x + 1
7.5.6.1 Circuit Multiplication
Redoing the multiplication example (x^2 + x) × (x^3 + x^2 + 1) as circuits:
[Figure: shift-register circuits for x^2 + x and x^3 + x^2 + 1]
(x^2 + x) × (x^3 + x^2 + 1)
= x × (x^3 + x^2 + 1)
+ x^2 × (x^3 + x^2 + 1)
= x^5 + x^3 + x^2 + x
7.5.7 Bit Streams and Characteristic Polynomials
A bit stream, or bit sequence, can be represented as a polynomial.
The oldest (first) bit in a sequence of n bits is represented by x^(n−1) and the youngest (last) bit is x^0.
The bit sequence 1010011 can be represented as x^6 + x^4 + x + 1:
1 0 1 0 0 1 1
= 1x^6 + 0x^5 + 1x^4 + 0x^3 + 0x^2 + 1x^1 + 1x^0
= x^6 + x^4 + x + 1
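The mapping from a bit sequence to a polynomial is mechanical, as the short sketch below shows. The function name `bits_to_poly` and the textual output format are illustrative choices.

```python
def bits_to_poly(bits):
    """Write a bit string (oldest bit first) as a polynomial in x."""
    n = len(bits)
    terms = []
    for k, bit in enumerate(bits):
        if bit == "1":
            exp = n - 1 - k          # the oldest bit has exponent n-1
            if exp == 0:
                terms.append("1")
            elif exp == 1:
                terms.append("x")
            else:
                terms.append("x^%d" % exp)
    return " + ".join(terms) if terms else "0"


print(bits_to_poly("1010011"))  # x^6 + x^4 + x + 1
```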
7.5.8 Division
With rules for multiplication and addition, we can define division.
A fundamental theorem of division defines q and r to be the quotient and remainder, respectively, of m ÷ p iff:
m(x) = q(x) × p(x) + r(x)
where the degree of r(x) is less than the degree of p(x).
Long Division
In Galois fields, we do division just as with long division in elementary school.
Given:
m(x) = x^6 + x^4 + x^3
p(x) = x^4 + x
Calculate the quotient, q(x), and remainder, r(x), for m(x) ÷ p(x):
                x^2                 + 1
x^4 + x )  x^6 + 0x^5 + 1x^4 + 1x^3 + 0x^2 + 0x^1 + 0x^0
           x^6               + 1x^3
                        1x^4
                        1x^4                + x
                                              x
Quotient q(x) = x^2 + 1
Remainder r(x) = x
Long Division (Check)
Check result:
m(x) = q(x) × p(x) + r(x)
     = (x^2 + 1) × (x^4 + x) + x
     = x^6 + x^3 + x^4 + x + x
     = x^6 + x^4 + x^3
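The long division and its check can be reproduced with a few lines of code. As a sketch, polynomials are stored as integers with bit k holding the coefficient of x^k; the names `gf2_divmod` and `gf2_mul` are illustrative.

```python
def gf2_divmod(m, p):
    """Long division of polynomials over GF(2); returns (quotient, remainder).

    Bit k of each integer is the coefficient of x^k; p must be non-zero.
    """
    q = 0
    while m.bit_length() >= p.bit_length():
        shift = m.bit_length() - p.bit_length()
        q ^= 1 << shift      # next quotient term: x^shift
        m ^= p << shift      # subtract (XOR) p * x^shift
    return q, m


def gf2_mul(a, b):
    """Polynomial multiplication over GF(2), used here to check the result."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


m = 0b1011000   # x^6 + x^4 + x^3
p = 0b10010     # x^4 + x
q, r = gf2_divmod(m, p)
print(bin(q), bin(r))            # 0b101 0b10, i.e. q = x^2 + 1, r = x
print(gf2_mul(q, p) ^ r == m)    # True
```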
7.5.9 Signature Analysis: Math and Circuits
The input to the signature analyzer is a “message”, m(x), which is a sequence of n bits represented as a polynomial.
After n shifts through an LFSR with l flops:
• The sequence of output bits forms a quotient, q(x), of length n− l
• The flops in the analyzer form a remainder, r(x), of length l
m(x) = q(x)× p(x)+ r(x)
The remainder is the signature.
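The claim that the flops end up holding the remainder can be checked by simulating an internal-XOR LFSR next to polynomial division. The sketch below assumes a wiring convention in which the feedback from the last flop is XORed into flop k when p(x) has an x^k term, and the input is XORed into flop 0; the function names are illustrative.

```python
def signature(bits, taps):
    """Shift a message (oldest bit first) through an internal-XOR LFSR.

    taps = [c0, ..., c(l-1)] are the coefficients of x^0 .. x^(l-1) in p(x).
    Returns the final flop values [q0, ..., q(l-1)].
    """
    state = [0] * len(taps)
    for i in bits:
        fb = state[-1]
        state = [i ^ (fb & taps[0])] + [
            state[k - 1] ^ (fb & taps[k]) for k in range(1, len(taps))
        ]
    return state


def gf2_mod(m, p):
    """Remainder of m(x) divided by p(x) over GF(2); bit k = coefficient of x^k."""
    while m.bit_length() >= p.bit_length():
        m ^= p << (m.bit_length() - p.bit_length())
    return m


# m(x) = x^6 + x^4 + x^3 and p(x) = x^4 + x
sig = signature([1, 0, 1, 1, 0, 0, 0], taps=[0, 1, 0, 0])
print(sig)                           # [0, 1, 0, 0], i.e. remainder x
print(gf2_mod(0b1011000, 0b10010))   # 2, i.e. remainder x
```

Here q1 = 1 and all other flops are 0, which is the remainder x found by long division of the same m(x) and p(x).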
Test Generation: Math and Circuits
The mathematics for an LFSR without an input i :
• same polynomial as if the circuit had an input
• input sequence is all 0s
Input Streams and Error Polynomials
An input stream with an error can be represented as m(x)+ e(x)
• e(x) is the error polynomial
• bits in the message that are flipped have a coefficient of 1 in e(x)
m(x)+ e(x) = q′(x)× p(x)+ r′(x)
Input Streams and Error Polynomials
The error e(x) will be detected if it results in a different signature (remainder).
m(x) and m(x)+ e(x) will have the same remainder iff
e(x) mod p(x) = 0
That is e(x) must be a multiple of p(x).
The larger p(x) is, the smaller the chances that e(x) will be a multiple of p(x).
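The effect of the size of p(x) can be illustrated by counting the error polynomials that go undetected. The sketch below enumerates every non-zero 7-bit error pattern for two example polynomials; the choice of polynomials and the names `gf2_mod` and `undetected_errors` are assumptions made for illustration.

```python
def gf2_mod(m, p):
    """Remainder of m(x) divided by p(x) over GF(2); bit k = coefficient of x^k."""
    while m.bit_length() >= p.bit_length():
        m ^= p << (m.bit_length() - p.bit_length())
    return m


def undetected_errors(p, n):
    """Non-zero n-bit error polynomials e(x) with e(x) mod p(x) = 0."""
    return [e for e in range(1, 1 << n) if gf2_mod(e, p) == 0]


small = undetected_errors(0b111, 7)    # p(x) = x^2 + x + 1
large = undetected_errors(0b1011, 7)   # p(x) = x^3 + x + 1
print(len(small), len(large))          # 31 15 (out of 127 non-zero errors)

# An undetected error leaves the signature unchanged:
m = 0b1011000
print(all(gf2_mod(m ^ e, 0b111) == gf2_mod(m, 0b111) for e in small))  # True
```

Adding one flop to the analyzer roughly halves the number of error patterns that slip through in this example.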
BIST for a Simple Circuit
Outline of steps to see if a fault will be detected by BIST:
1. Output sequence from test generator
2. Output sequence from correct circuit
3. Remainder for signature analyzer with correct output sequence
4. Output sequence from faulty circuit
5. Remainder for signature analyzer with faulty output sequence
6. Compare correct and faulty remainder, if different then fault detected
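The six steps can be sketched end to end in code. Everything about the hardware here is an assumption made for illustration: both the test generator and the signature analyzer are taken to be internal-XOR LFSRs with p(x) = x^3 + x + 1, the generator outputs t0 t1 t2 drive a b c, and the correct and faulty functions are ab + bc and a + c. The vector order, and therefore the signatures, depend on these choices.

```python
def lfsr_step(state, taps, i=0):
    """One clock of an internal-XOR LFSR (taps = coefficients of x^0..x^(l-1))."""
    fb = state[-1]
    return [i ^ (fb & taps[0])] + [
        state[k - 1] ^ (fb & taps[k]) for k in range(1, len(state))
    ]


def bist_signature(circuit, gen_taps, sig_taps):
    """Steps 1 to 3 (or 4 to 5): generate vectors, run circuit, compress output."""
    t = [1] * len(gen_taps)      # test generator is set to all 1s
    r = [0] * len(sig_taps)      # signature analyzer is reset to all 0s
    for _ in range(2 ** len(gen_taps) - 1):
        z = circuit(*t)                  # circuit output for this vector
        r = lfsr_step(r, sig_taps, z)    # shift z into the analyzer
        t = lfsr_step(t, gen_taps)       # advance to the next test vector
    return r


def correct(a, b, c):
    return (a & b) | (b & c)     # ab + bc


def faulty(a, b, c):
    return a | c                 # a + c


TAPS = [1, 1, 0]                 # assumed p(x) = x^3 + x + 1 for both LFSRs
good = bist_signature(correct, TAPS, TAPS)
bad = bist_signature(faulty, TAPS, TAPS)
print(good, bad, good != bad)    # [0, 1, 1] [1, 1, 0] True
```

Step 6 is the final comparison: the two remainders differ, so under these assumptions the fault would be detected.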
Components
[Figure: circuit under test with inputs a, b, c, output z, and lines L1 to L8]
[Figure: test generator with flops t0, t1, t2]
[Figure: signature analyzer with flops r0, r1, r2 and input z]
[Worksheet: columns t0 t1 t2 (driving a b c), z for the correct and faulty circuits, and z r0 r1 r2 for each signature analyzer run]
Question: Determine if L2@1 will be detected
Test Generation Sequence
[Table: values of t0 t1 t2 on successive clock cycles, starting from initial values 1 1 1; the final values are a repeat of the initial values.]
Technique is to shift, then compute the result of the XORs.
Equation for correct circuit: ab + bc
Equation for faulty circuit: a + c
Output sequences for correct and faulty circuits
[Table: the vectors from the test generation sequence (t0 t1 t2 driving a b c) and the output z of each circuit. Output sequence from the correct circuit: 1110000. Output sequence from the faulty circuit: 1111110.]
Signature analyzer sequence for correct circuit
[Table: z and r0 r1 r2 on successive clock cycles. The input is the output sequence from the correct circuit (1110000), r0 r1 r2 have initial values 0 0 0, and the final r0 r1 r2 is the remainder.]
Signature analyzer sequence for faulty circuit
[Table: z and r0 r1 r2 on successive clock cycles. The input is the output sequence from the faulty circuit, r0 r1 r2 have initial values 0 0 0, and the final r0 r1 r2 is the remainder.]
7.6 Scan vs Self Test
Scan
⇑ less hardware
⇓ slower
⇑ well defined coverage
⇑ test vectors are easy to modify
Self Test
⇓ more hardware
⇑ faster
⇓ ill defined coverage
⇓ test vectors are hard to modify
Chapter 8
Review
This chapter lists the major topics of the term. The “Topics List” section for each major area is meant to be relatively complete.
8.1 Overview of the Term
• The purely digital world
– VHDL
– design and optimization methods
– functional verification
– performance analysis
• Analog effects in the digital world
– timing analysis
– power
– faults and testing
8.2 VHDL
8.2.1 VHDL Topics
• simple syntax and semantics: things that you should know simply by having done the labs and project
• behavioural semantics of VHDL
• synthesis semantics of VHDL
• synthesizable and unsynthesizable code
8.2.2 VHDL Example Problems
• identify whether a particular signal will be the output of combinational circuitry or a flop
• identify whether a particular process is combinational or clocked
• legal, synthesizable, and good code
• perform delta-cycle simulation of VHDL
• perform RTL simulation of VHDL
• identify whether two VHDL fragments have same behaviour
• match VHDL code with waveforms
• match VHDL code with hardware
• choose the VHDL fragment that generates smaller or faster hardware
8.3 RTL Design Techniques
8.3.1 Design Topics
• coding guidelines
• generic FPGA hardware
• area estimation
• finite state machines
– implicit
– explicit-current
– explicit-current+next
• from algorithm to hardware
– dependency graph
– dataflow diagram
– scheduling
– input/output allocation
– register allocation
– datapath allocation
– hardware block diagram
– state machine
• memory dependencies
• memory arrays and dataflow diagrams
8.3.2 Design Example Problems
• choose design guidelines to follow in different situations
• estimate area to implement a circuit in an FPGA
• calculate resource usage for a dataflow diagram
• calculate performance data for a dataflow diagram
• given an algorithm, design a dataflow diagram
• given a dataflow diagram, design the datapath and finite state machine
• optimize a dataflow diagram to improve performance or reduce resource usage
• given a dataflow diagram, calculate the clock period that will result in the maximum performance
8.4 Functional Verification
8.4.1 Verification Topics
• test cases
• measuring coverage
• time for verification
• test benches
• assertions
• coverage monitors
• relational specification
• functional specification
• boundary conditions / corner cases
8.4.2 Verification Example Problems
• choose first cases to test
• identify corner cases
• choose technique to detect bug (test case, assertion/test bench)
• determine whether a code change will cause a bug
• identify a test case and either assertion or test bench to catch a bug
8.5 Performance Analysis and Optimization
8.5.1 Performance Topics
• time to execute a program
• definition of performance
• speedup
• n% faster
• calculating performance of different tasks and of average task
• choosing which task to optimize to best improve overall performance
• CPI calculations
• performance increase over time
• design tradeoffs (CPI vs NumInsts vs ClockSpeed vs time-to-market)
• MIPS calculations
• Clock speed vs. performance
• Optimality — performance / area tradeoffs
8.5.2 Performance Example Problems
• calculate performance / area tradeoffs
• calculate performance / time tradeoffs
• compare performance data between products
• evaluate performance criteria
8.6 Timing Analysis
8.6.1 Timing Topics
• circuit parameters that affect delay
– clock period
– clock skew
– clock jitter
– propagation delay
– load delay
– setup time
– hold time
– clock-to-Q time
• timing analysis of latch
• timing analysis of master-slave flip-flop
• timing analysis of hierarchical storage device
• critical path and false path
– algorithm to find critical path
– algorithm to determine if path is false or critical
– signal assignment to exercise critical path
• Elmore timing model
• derating factors
8.6.2 Timing Example Problems
• timing parameters for minimum clock period
• timing parameters for hold constraint
• find the critical path and assignment to exercise it
• compute Elmore delay constant
• compare accuracy of different timing models
• determine if a storage device will work correctly
• compute timing parameters of storage device
• identify timing violation, suggest remedy
• suggest design change to increase clock speed
8.7 Power
8.7.1 Power Topics
• power vs energy
• equations for power
– dynamic power
– static power
– switching power
– short circuit power
– leakage power
– activity factor
– leakage current
– threshold voltage
– supply voltage
• analog power reduction techniques
• RTL power reduction techniques
– data encoding
– clock gating
8.7.2 Power Example Problems
• predict effect of new fabrication process on power
• predict effect of environment change (temp, supply voltage, etc) on power consumption
• predict effect of design change on power consumption (capacitance, activity factor)
• design data-encoding scheme for a circuit, predict effect on power consumption
• design clock gating scheme for a circuit, predict effect on power consumption
• assess validity of various power- or energy-consumption metrics
8.8 Testing
8.8.1 Testing Topics
• causes of faults
• locations of faults
• physical faults
• single stuck-at fault model
• testable / untestable fault
• economics of testing
• fault coverage
• test vector generation
• order test vectors to reduce test time
• behaviour of a scan chain
• time to run a scan test
• JTAG
• built-in self-test
• linear feedback shift register
• signature analyzer
• Galois fields
• process and time to run a BIST test
8.8.2 Testing Example Problems
• compute optimal amount of testing to maximize profits
• compute coverage for a given set of test vectors
• find test vectors to catch a set of faults, choose order to run test vectors
• determine if a fault is detectable
• choose an LFSR to use for BIST test generation
• choose an LFSR to use for BIST signature analysis
• determine if a given BIST will catch a given fault
• determine probability that a given BIST technique will report that a faulty circuit is correct
• determine if a given fault-testing scheme will detect a physical fault
• match LFSR to characteristic polynomial
• match BIST hardware to Galois mathematics
• perform Galois field mathematics, compare to waveforms
8.9 Formulas to be Given on Final Exam
T = \frac{Ins \times C}{F}
Pf = \frac{W}{T}
S = \frac{T_1}{T_2}
M = \frac{F / 10^6}{\sum_{i=0}^{n} PI_i \times C_i}
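Formulas II
P = \frac{1}{2}(A \times C_L \times V^2 \times F) + (\tau \times A \times V \times I_{Sh} \times F) + (V \times I_L)
q = 1.60218 \times 10^{-19} C
k = 1.38066 \times 10^{-23} J/K
F \propto \frac{(V - V_{Th})^2}{V}
I_L \propto e^{\frac{-q \times V_{Th}}{k \times T}}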