
E&CE 327: Digital Systems Engineering

Lecture Slides

Mark Aagaard
2011t1–Winter

University of Waterloo
Dept of Electrical and Computer Engineering

Contents

I Lecture Notes

1 VHDL
  1.1 Introduction to VHDL
    1.1.1 Levels of Abstraction
    1.1.2 VHDL Origins and History
    1.1.3 Semantics
    1.1.4 Synthesis of a Simulation-Based Language
    1.1.5 Solution to Synthesis Sanity
    1.1.6 Standard Logic 1164
  1.2 Comparison of VHDL to Other Hardware Description Languages
  1.3 Overview of Syntax
    1.3.1 Syntactic Categories
    1.3.2 Library Units
    1.3.3 Entities and Architecture
    1.3.4 Concurrent Statements
    1.3.5 Component Declaration and Instantiations
    1.3.6 Processes
    1.3.7 Sequential Statements
    1.3.8 A Few More Miscellaneous VHDL Features
  1.4 Concurrent vs Sequential Statements
    1.4.1 Concurrent Assignment vs Process
    1.4.2 Conditional Assignment vs If Statements
    1.4.3 Selected Assignment vs Case Statement
    1.4.4 Coding Style
  1.5 Overview of Processes
    1.5.1 Combinational Process vs Clocked Process
    1.5.2 Latch Inference
  1.6 Details of Process Execution
    1.6.1 Simple Simulation
    1.6.2 Temporal Granularities of Simulation
    1.6.3 Intuition Behind Delta-Cycle Simulation
    1.6.4 Definitions and Algorithm
      1.6.4.1 Process Modes
      1.6.4.2 Simulation Algorithm
      1.6.4.3 Delta-Cycle Definitions
    1.6.5 Example 1: Process Execution (Bamboozle)
    1.6.6 Example 2: Process Execution (Flummox)
    1.6.7 Ex: Need for Provisional Asn
    1.6.8 Delta-Cycle Simulations of Flip-Flops
  1.7 Register-Transfer-Level Simulation
    1.7.1 Overview
    1.7.2 Technique for Register-Transfer Level Simulation
    1.7.3 Examples of RTL Simulation
      1.7.3.1 RTL Simulation Example 1
  1.8 VHDL and Hardware Building Blocks
    1.8.1 Basic Building Blocks
    1.8.2 Deprecated Building Blocks for RTL
    1.8.3 Hardware and Code for Flops
      1.8.3.1 Flops with Waits and Ifs
      1.8.3.2 Flops with Synchronous Reset
      1.8.3.3 Flop with Chip-Enable and Mux on Input
      1.8.3.4 Flops with Chip-Enable, Muxes, and Reset
    1.8.4 An Example Sequential Circuit
  1.9 Arrays and Vectors
  1.10 Arithmetic
    1.10.1 Arithmetic Packages
    1.10.2 Shift and Rotate Operations
    1.10.3 Overloading of Arithmetic
    1.10.4 Different Widths and Arithmetic
    1.10.5 Overloading of Comparisons
    1.10.6 Different Widths and Comparisons
    1.10.7 Type Conversion
  1.11 Synthesizable vs Non-Synthesizable Code
    1.11.1 Unsynthesizable Code
      1.11.1.1 Initial Values
      1.11.1.2 Wait For
      1.11.1.3 Different Wait Conditions
      1.11.1.4 Multiple “if rising edge” in Process
      1.11.1.5 “if rising edge” and “wait” in Same Process
      1.11.1.6 “if rising edge” with “else” Clause
      1.11.1.7 “if rising edge” Inside a “for” Loop
      1.11.1.8 “wait” Inside of a “for loop”
  1.12 Synthesizable VHDL Coding Guidelines

2 RTL Design with VHDL
  2.1 Prelude to Chapter
  2.2 FPGA Background and Coding Guidelines
    2.2.1 Generic FPGA Hardware
      2.2.1.1 Generic FPGA Cell
    2.2.2 Area Estimation
      2.2.2.1 Interconnect for Generic FPGA
      2.2.2.2 Clocks for Generic FPGAs
      2.2.2.3 Special Circuitry in FPGAs
    2.2.3 Generic-FPGA Coding Guidelines
  2.3 Design Flow
  2.4 Algorithms and High-Level Models
  2.5 Finite State Machines in VHDL
    2.5.1 Introduction to State-Machine Design
      2.5.1.1 Mealy vs Moore State Machines
      2.5.1.2 Introduction to State Machines and VHDL
      2.5.1.3 Explicit vs Implicit State Machines
    2.5.2 Implementing a Simple Moore Machine
      2.5.2.1 Implicit Moore State Machine
      2.5.2.2 Explicit Moore with Flopped Output
      2.5.2.3 Explicit Moore with Combinational Outputs
      2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment
      2.5.2.5 E-C+N Moore with Comb Proc
    2.5.3 Implementing a Simple Mealy Machine
    2.5.4 Reset
    2.5.5 State Encoding
  2.6 Dataflow Diagrams
    2.6.1 Dataflow Diagrams Overview
    2.6.2 Dataflow Diagrams, Hardware, and Behaviour
    2.6.3 Dataflow Diagram Execution
    2.6.4 Performance Estimation
    2.6.5 Area Estimation
    2.6.6 Design Analysis
    2.6.7 Area / Performance Tradeoffs
  2.7 Design Example: Massey
  2.8 Design Example: Vanier
    2.8.1 Requirements
    2.8.2 Algorithm
    2.8.3 Initial Dataflow Diagram
    2.8.4 Reschedule to Meet Requirements
    2.8.5 Optimize Resources
    2.8.6 Assign Names to Registered Values
    2.8.7 Input/Output Allocation
    2.8.8 Tangent: Combinational Outputs
    2.8.9 Register Allocation
    2.8.10 Datapath Allocation
    2.8.11 Hardware Block Diagram and State Machine
      2.8.11.1 Control for Registers
      2.8.11.2 Control for Datapath Components
      2.8.11.3 Control for State
      2.8.11.4 Complete State Machine Table
    2.8.12 VHDL Code with Explicit State Machine
    2.8.13 Peephole Optimizations
    2.8.14 Notes and Observations
  2.9 Pipelining
    2.9.1 Introduction to Pipelining
    2.9.2 Partially Pipelined
    2.9.3 Terminology
  2.10 Design Example: Pipelined Massey
  2.11 Memory Arrays and RTL Design
    2.11.1 Memory Operations
    2.11.2 Memory Arrays in VHDL
    2.11.3 Data Dependencies
    2.11.4 Memory and Dataflow Diagrams
    2.11.5 Ex: Mem Array and Dataflow Diagram
  2.12 Input / Output Protocols
  2.13 Example: Moving Average
    2.13.1 Requirements and Environmental Assumptions
    2.13.2 Algorithm
    2.13.3 Pseudocode and Dataflow Diagrams
    2.13.4 Control Tables and State Machine
    2.13.5 VHDL Code

3 Performance Analysis and Optimization
  3.1 Introduction
  3.2 Defining Performance
  3.3 Comparing Performance
    3.3.1 General Equations
    3.3.2 Example: Performance of Printers
  3.4 Clock Speed, CPI, Program Length, and Performance
    3.4.1 Mathematics
    3.4.2 Example: CISC vs RISC and CPI
    3.4.3 Effect of Instruction Set on Performance
    3.4.4 Effect of Time to Market on Relative Performance
    3.4.5 Summary of Equations
  3.5 Performance Analysis and Dataflow Diagrams
    3.5.1 Dataflow Diagrams, CPI, and Clock Speed
    3.5.2 Examples of Dataflow Diagrams for Two Instructions
      3.5.2.1 Scheduling of Operations for Different Clock Periods
      3.5.2.2 Performance Computation for Different Clock Periods
      3.5.2.3 Example: Two Instructions Taking Similar Time
      3.5.2.4 Example: Same Total Time, Different Order for A
    3.5.3 Example: From Algorithm to Optimized Dataflow
  3.6 General Optimizations
    3.6.1 Strength Reduction
      3.6.1.1 Arithmetic Strength Reduction
      3.6.1.2 Boolean Strength Reduction
    3.6.2 Replication and Sharing
      3.6.2.1 Mux-Pushing
      3.6.2.2 Common Subexpression Elimination
      3.6.2.3 Computation Replication
    3.6.3 Arithmetic
  3.7 Retiming

4 Functional Verification
  4.1 Overview
    4.1.1 Terminology: Validation / Verification / Testing
    4.1.2 The Difficulty of Designing Correct Chips
      4.1.2.1 Notes from Kenn Heinrich (UW E&CE grad)
      4.1.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)
  4.2 Test Cases and Coverage
    4.2.1 Coverage
    4.2.2 Floating Point Divider Example
  4.3 Testbenches
    4.3.1 Overview of Test Benches
    4.3.2 Reference Model Style Testbench
    4.3.3 Relational Style Testbench
    4.3.4 Coding Structure of a Testbench
    4.3.5 Datapath vs Control
    4.3.6 Verification Tips
  4.4 Functional Verification for Datapath Circuits
    4.4.1 A Spec-Less Testbench
    4.4.2 Use an Array for Test Vectors
    4.4.3 Build Spec into Stimulus
    4.4.4 Have Separate Specification Entity
    4.4.5 Generate Test Vectors Automatically
    4.4.6 Relational Specification
  4.5 Functional Verification of Control Circuits
    4.5.1 Overview of Queues in Hardware
    4.5.2 VHDL Coding
      4.5.2.1 Package
      4.5.2.2 Other VHDL Coding
    4.5.3 Code Structure for Verification
    4.5.4 Instrumentation Code
    4.5.5 Assertions
    4.5.6 VHDL Coding Tips
    4.5.7 Queue Specification
    4.5.8 Queue Testbench
  4.6 Example: Microwave Oven

5 Timing Analysis
  5.1 Delays and Definitions
    5.1.1 Background Definitions
    5.1.2 Clock-Related Timing Definitions
      5.1.2.1 Clock Skew
      5.1.2.2 Clock Latency
      5.1.2.3 Clock Jitter
    5.1.3 Storage-Related Timing Definitions
      5.1.3.1 Flops and Latches
    5.1.4 Propagation Delays
    5.1.5 Timing Constraints
      5.1.5.1 Minimum Clock Period
      5.1.5.2 Hold Constraint
      5.1.5.3 Example Timing Violations
  5.2 Timing Analysis of Latches and Flip Flops
    5.2.1 Simple Multiplexer Latch
      5.2.1.1 Structure and Behaviour of Multiplexer Latch
      5.2.1.2 Strategy for Timing Analysis of Storage Devices
      5.2.1.3 Clock-to-Q Time of a Multiplexer Latch
      5.2.1.4 Setup Timing of a Multiplexer Latch
      5.2.1.5 Hold Time of a Multiplexer Latch
      5.2.1.6 Example of a Bad Latch
  5.3 Critical Paths and False Paths
    5.3.1 Introduction to Critical and False Paths
      5.3.1.1 Example of Critical Path in Full Adder
      5.3.1.2 Preliminaries for Critical Paths
      5.3.1.3 Longest Path and Critical Path
    5.3.2 Longest Path
    5.3.3 Detecting a False Path
      5.3.3.1 Preliminaries
      5.3.3.2 Almost-Correct Algorithm to Detect a False Path
      5.3.3.3 Examples of Detecting False Paths
    5.3.4 Finding the Next Candidate Path
      5.3.4.1 Algorithm to Find Next Candidate Path
      5.3.4.2 Examples of Finding Next Candidate Path
    5.3.5 Correct Algorithm to Find Critical Path
      5.3.5.1 Rules for Late Side Inputs
      5.3.5.2 Monotone Speedup
      5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation
      5.3.5.4 Complete Algorithm
      5.3.5.5 Complete Examples
    5.3.6 Further Extensions to Critical Path Analysis
    5.3.7 Increasing the Accuracy of Critical Path Analysis
  5.4 Elmore Timing Model
    5.4.1 RC-Networks for Timing Analysis
    5.4.2 Derivation of Analog Timing Model
      5.4.2.1 Example Derivation: Equation for Voltage at Node 3
      5.4.2.2 General Derivation
    5.4.3 Elmore Timing Model
    5.4.4 Examples of Using Elmore Delay
      5.4.4.1 Interconnect with Single Fanout
      5.4.4.2 Interconnect with Multiple Gates in Fanout
  5.5 Practical Usage of Timing Analysis
    5.5.1 Speed Binning
      5.5.1.1 FPGAs, Interconnect, and Synthesis
    5.5.2 Worst Case Timing
      5.5.2.1 Fanout delay
      5.5.2.2 Derating Factors

6 Power Analysis and Power-Aware Design
  6.1 Overview
    6.1.1 Importance of Power and Energy
    6.1.2 Industrial Names and Products
    6.1.3 Power vs Energy
    6.1.4 Batteries, Power and Energy
      6.1.4.1 Do Batteries Store Energy or Power?
      6.1.4.2 Battery Life and Efficiency
      6.1.4.3 Battery Life and Power
  6.2 Power Equations
    6.2.1 Switching Power
    6.2.2 Short-Circuited Power
    6.2.3 Leakage Power
    6.2.4 Glossary
    6.2.5 Note on Power Equations
  6.3 Overview of Power Reduction Techniques
  6.4 Voltage Reduction for Power Reduction
  6.5 Data Encoding for Power Reduction
    6.5.1 How Data Encoding Can Reduce Power
    6.5.2 Example Problem: Sixteen Pulser
      6.5.2.1 Problem Statement
      6.5.2.2 Additional Information
      6.5.2.3 Answer
  6.6 Clock Gating
    6.6.1 Introduction to Clock Gating
    6.6.2 Implementing Clock Gating
    6.6.3 Design Process
    6.6.4 Effectiveness of Clock Gating
    6.6.5 Example: Reduced Activity Factor with Clock Gating
    6.6.6 Clock Gating with Valid-Bit Protocol
      6.6.6.1 Valid-Bit Protocol
      6.6.6.2 How Many Clock Cycles for Module?
      6.6.6.3 Adding Clock-Gating Circuitry
    6.6.7 Example: Pipelined Circuit with Clock-Gating

7 Fault Testing and Testability
  7.1 Faults and Testing
    7.1.1 Overview of Faults and Testing
      7.1.1.1 Faults
      7.1.1.2 Causes of Faults
      7.1.1.3 Testing
      7.1.1.4 Burn In
      7.1.1.5 Bin Sorting
      7.1.1.6 Testing Techniques
      7.1.1.7 Design for Testability (DFT)
    7.1.2 Example Problem: Economics of Testing
    7.1.3 Physical Faults
      7.1.3.1 Types of Physical Faults
      7.1.3.2 Locations of Faults
      7.1.3.3 Layout Affects Locations
      7.1.3.4 Naming Fault Locations
    7.1.4 Detecting a Fault
      7.1.4.1 Which Test Vectors will Detect a Fault?
    7.1.5 Mathematical Models of Faults
      7.1.5.1 Single Stuck-At Fault Model
    7.1.6 Generate Test Vector to Find a Mathematical Fault
      7.1.6.1 Algorithm
      7.1.6.2 Example of Finding a Test Vector
    7.1.7 Undetectable Faults
      7.1.7.1 Redundant Circuitry
      7.1.7.2 Curious Circuitry and Fault Detection
  7.2 Test Generation
    7.2.1 A Small Example
    7.2.2 Choosing Test Vectors
      7.2.2.1 Fault Domination
      7.2.2.2 Fault Equivalence
      7.2.2.3 Gate Collapsing
      7.2.2.4 Node Collapsing
      7.2.2.5 Fault Collapsing Summary
    7.2.3 Fault Coverage
    7.2.4 Test Vector Generation and Fault Detection
    7.2.5 Generate Test Vectors for 100% Coverage
      7.2.5.1 Collapse the Faults
      7.2.5.2 Check for Fault Domination
      7.2.5.3 Required Test Vectors
      7.2.5.4 Faults Not Covered by Required Test Vectors
      7.2.5.5 Order to Run Test Vectors
      7.2.5.6 Summary of Technique to Find and Order Test Vectors
    7.2.6 One Fault Hiding Another
  7.3 Scan Testing in General
    7.3.1 Structure and Behaviour of Scan Testing
    7.3.2 Scan Chains
      7.3.2.1 Circuitry in Normal and Scan Mode
      7.3.2.2 Scan in Operation
      7.3.2.3 Scan in Operation with Example Circuit
    7.3.3 Summary of Scan Testing
    7.3.4 Time to Test a Chip
      7.3.4.1 Example: Time to Test a Chip
  7.4 Boundary Scan and JTAG
    7.4.1 Scan Instructions
  7.5 Built In Self Test
    7.5.1 Block Diagram
      7.5.1.1 Components
      7.5.1.2 Linear Feedback Shift Register (LFSR)
      7.5.1.3 Maximal-Length LFSR
    7.5.2 Test Generator
    7.5.3 Signature Analyzer
    7.5.4 Result Checker
    7.5.5 Arithmetic over Binary Fields
    7.5.6 Shift Registers and Characteristic Polynomials
      7.5.6.1 Circuit Multiplication
    7.5.7 Bit Streams and Characteristic Polynomials
    7.5.8 Division
    7.5.9 Signature Analysis: Math and Circuits
  7.6 Scan vs Self Test

8 Review
  8.1 Overview of the Term
  8.2 VHDL
    8.2.1 VHDL Topics
    8.2.2 VHDL Example Problems
  8.3 RTL Design Techniques
    8.3.1 Design Topics
    8.3.2 Design Example Problems
  8.4 Functional Verification
    8.4.1 Verification Topics
    8.4.2 Verification Example Problems
  8.5 Performance Analysis and Optimization
    8.5.1 Performance Topics
    8.5.2 Performance Example Problems
  8.6 Timing Analysis
    8.6.1 Timing Topics
    8.6.2 Timing Example Problems
  8.7 Power
    8.7.1 Power Topics
    8.7.2 Power Example Problems
  8.8 Testing
    8.8.1 Testing Topics
    8.8.2 Testing Example Problems
  8.9 Formulas to be Given on Final Exam

Part I

Lecture Notes


Chapter 1

VHDL: The Language


1.1 Introduction to VHDL

1.1.1 Levels of Abstraction

Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network.

Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equations are used.

Gate: Transistors are grouped together into gates. Voltages are discrete values such as 0 and 1.

Register transfer level: Hardware is modeled as assignments to registers and combinational signals. The basic unit of time is one clock cycle.

Transaction level: A transaction is an operation such as transferring data across a bus. Building blocks are processors, controllers, etc. Described in VHDL, SystemC, or SystemVerilog.

Electronic-system level: Looks at an entire electronic system, with both hardware and software.
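To make the register-transfer level concrete, here is a minimal sketch of what "assignments to registers and combinational signals" can look like in VHDL. The entity name rtl_sketch, its ports, and the XOR function are assumptions chosen only for illustration; they do not come from the lecture material.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, used only to illustrate the register-transfer level.
entity rtl_sketch is
  port (
    clk  : in  std_logic;
    a, b : in  std_logic;
    q    : out std_logic
  );
end rtl_sketch;

architecture main of rtl_sketch is
  signal d : std_logic;  -- combinational signal
begin
  -- Combinational assignment: d follows a and b within the same clock cycle.
  d <= a XOR b;

  -- Register assignment: q is updated once per clock cycle,
  -- on the rising edge of clk.
  process (clk)
  begin
    if rising_edge(clk) then
      q <= d;
    end if;
  end process;
end main;
```

At this level the description says nothing about transistor or gate delays; the only notion of time is the clock edge on which q is updated.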


1.1.2 VHDL Origins and History

VHDL = VHSIC Hardware Description Language
VHSIC = Very High Speed Integrated Circuit

The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems. Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware.

Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)

VHDL is a lot more than synthesis of digital hardware


1.1.3 Semantics

The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.

[Figure: simulation of "c <= a AND b;" with signals a, b, and c]

But now, VHDL is used in simulation and synthesis. Synthesis is concerned with the structure of the circuit.

Synthesis: converts one type of description (behavioural) into another, lower level, description (usually a netlist).

[Figure: synthesis of "c <= a AND b;" into a circuit with inputs a, b and output c]


Synthesis

Synthesis is a computer-aided design (CAD) technique that transforms a designer’s concise, high-level description of a circuit into a structural description of a circuit.
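As a rough sketch of that transformation, the two architectures below describe the same AND function: one as the designer might write it, and one in the netlist style that a synthesis tool might emit. The names and_demo, and2_cell, and u1 are hypothetical; real tools target vendor-specific cell libraries and usually write netlists in their own formats.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical cell standing in for a gate from a vendor library.
entity and2_cell is
  port (i0, i1 : in std_logic; o : out std_logic);
end and2_cell;

architecture sim of and2_cell is
begin
  o <= i0 AND i1;
end sim;

library ieee;
use ieee.std_logic_1164.all;

entity and_demo is
  port (a, b : in std_logic; c : out std_logic);
end and_demo;

-- What the designer writes: a concise behavioural description.
architecture behavioural of and_demo is
begin
  c <= a AND b;
end behavioural;

-- What a tool might produce: a structural description built from
-- instances of library cells connected by signals.
architecture netlist of and_demo is
begin
  u1 : entity work.and2_cell(sim)
    port map (i0 => a, i1 => b, o => c);
end netlist;
```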


CAD Tools

CAD Tools allow designers to automate lower-level design processes in implementing the desired functionality of a system.

NOTE: EDA = Electronic Design Automation. In digital hardware design, EDA = CAD.


Synthesis vs Simulation

For synthesis, we want the code we write to define the structure of the hardware that is generated.


The VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware.

[Figure: synthesis of "c <= a AND b;" can produce circuits with different structure; simulation shows that they have the same behaviour]


1.1.4 Synthesis of a Simulation-Based Language

This section reserved for your reading pleasure


1.1.5 Solution to Synthesis Sanity• Pick a high-quality synthesis tool and study its documentation thoroughly

• Learn the idioms of the tool

• Different VHDL code with same behaviour can result in very different circuits

• Be careful if you have to port VHDL code from one tool to another

• KISS: Keep It Simple Stupid

– VHDL examples will illustrate reliable coding techniques for the synthesis tools from Synopsys, Mentor Graphics, Altera, Xilinx, and most other companies as well.

– Follow the coding guidelines and examples from lecture

– As you write VHDL, think about the hardware you expect to get.

Note: If you can’t predict the hardware, then the hardware probably won’t be very good (small, fast, correct, etc)

1.1.6 Standard Logic 1164

std_logic_1164 : IEEE standard for signal values in VHDL.

’U’  uninitialized
’X’  strong unknown
’0’  strong 0
’1’  strong 1
’Z’  high impedance
’W’  weak unknown
’L’  weak 0
’H’  weak 1
’-’  don’t care

The most common values are: ’U’ , ’X’ , ’0’ , ’1’ .

If you see ’X’ in a simulation, it usually means that there is a mistake in your code.
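One common way to get an ’X’ is to drive the same signal from two statements with conflicting values. The following is a minimal sketch for illustration only; the entity name x_demo and the signal s are made up for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: two drivers on one std_logic signal.
entity x_demo is
end x_demo;

architecture main of x_demo is
  signal s : std_logic;
begin
  s <= '0';   -- first driver
  s <= '1';   -- second driver, conflicts with the first
  -- std_logic is a resolved type: a strong '0' and a strong '1'
  -- on the same signal resolve to 'X' (strong unknown) in simulation.
end main;
```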

1.2 Comparison of VHDL to Other Hardware Description Languages


1.3 Overview of Syntax

1.3.1 Syntactic Categories


1.3.2 Library Units



1.3.3 Entities and Architecture

Each hardware module is described with an Entity/Architecture pair

[Figure: Entity and Architecture]

Entity

library ieee;

use ieee.std_logic_1164.all;

entity and_or is

port (

a, b, c : in std_logic ;

z : out std_logic

);

end and_or;

Example of an entity


Architecture

architecture main of and_or is

signal x : std_logic;

begin

x <= a AND b;

z <= x OR (a AND c);

end main;

Example of architecture
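To see the entity/architecture pair in use, here is a small testbench sketch. It assumes the and_or entity and its main architecture above have been compiled into library work; the names and_or_tb, uut, and stimulus are hypothetical, and the testbench itself is for simulation only.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench for and_or; not synthesizable.
entity and_or_tb is
end and_or_tb;

architecture main of and_or_tb is
  signal a, b, c, z : std_logic;
begin
  -- Direct instantiation of the entity/architecture pair (VHDL-93 syntax).
  uut : entity work.and_or(main)
    port map ( a => a, b => b, c => c, z => z );

  stimulus : process
  begin
    a <= '0'; b <= '0'; c <= '0';
    wait for 10 ns;
    assert z = '0' report "expected z = 0" severity error;
    a <= '1'; c <= '1';            -- now (a AND c) = '1'
    wait for 10 ns;
    assert z = '1' report "expected z = 1" severity error;
    wait;                          -- suspend forever: end of stimulus
  end process;
end main;
```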

1.3.4 Concurrent Statements

• Architectures contain concurrent statements

• Concurrent statements execute in parallel (Figure 1.4)

– Concurrent statements make VHDL fundamentally different from most software languages.

– Hardware (gates) naturally execute in parallel; VHDL mimics the behaviour of real hardware.

– At each infinitesimally small moment of time, each gate:

1. samples its inputs

2. computes the value of its output

3. drives the output

Concurrent Statements

architecture main of bowser is
begin
  x1 <= a AND b;
  x2 <= NOT x1;
  z <= NOT x2;
end main;

architecture main of bowser is
begin
  z <= NOT x2;
  x2 <= NOT x1;
  x1 <= a AND b;
end main;

[Figure: circuit with inputs a, b, internal signals x1, x2, and output z]

The order of concurrent statements doesn’t matter

Types of Concurrent Statements

conditional assignment: similar to conventional if-then-else
  c <= a+b when sel=’1’ else a+c when sel=’0’ else "0000";

selected assignment: similar to conventional case/switch
  with color select d <= "00" when red , "01" when . . .;

component instantiation: use a hardware module/component
  add1 : adder port map( a => f , b => g, s => h, co => i );

for-generate: create multiple pieces of hardware
  bgen: for i in 1 to 7 generate b(i)<=a(7-i); end generate;

if-generate: conditionally create some hardware
  okgen : if optgoal /= fast generate
    result <= ((a and b) or (d and not e)) or g;
  end generate;
  fastgen : if optgoal = fast generate
    result <= ’1’;
  end generate;

process: description of complex behaviour (Section 1.3.6)

1.3.5 Component Declaration and Instantiations


1.3.6 Processes

• Processes are used to describe complex and potentially unsynthesizable behaviour

• A process is a concurrent statement (Section 1.3.4).

• The body of a process contains sequential statements (Section 1.3.7)

• Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)

Example Process with Sensitivity List

process (a, b, c)

begin

y <= a AND b;

if (a = ’1’) then

z1 <= b AND c;

z2 <= NOT c;

else

z1 <= b OR c;

z2 <= c;

end if;

end process;


Example Process with Wait Statements

process

begin

y <= a AND b;

z <= ’0’;

wait until rising_edge(clk);


z <= ’1’;

y <= ’0’;


else

y <= a OR b;

end if;

end process;

Sensitivity Lists and Wait Statements

• Processes must have either a sensitivity list or at least one wait statement on each execution path through the process.

• Processes cannot have both a sensitivity list and a wait statement.

Sensitivity List

The sensitivity list contains the signals that are read in the process.

A process is executed when a signal in its sensitivity list changes value.

An important coding guideline to ensure consistent synthesis and simulation results is to include all signals that are read in the sensitivity list.

There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list; other signals may be included, but are not needed.
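The guideline above can be illustrated with a small sketch that contrasts a complete and an incomplete sensitivity list. The entity name sens_demo and the port names y_good and y_bad are hypothetical, chosen only for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity sens_demo is   -- hypothetical name
  port (
    a, b   : in  std_logic;
    y_good : out std_logic;
    y_bad  : out std_logic
  );
end sens_demo;

architecture main of sens_demo is
begin
  -- Complete sensitivity list: the process reruns whenever a or b
  -- changes, so simulation behaves like an AND gate.
  good : process (a, b)
  begin
    y_good <= a AND b;
  end process;

  -- Incomplete sensitivity list: b is read but not listed, so in
  -- simulation y_bad is not recomputed when only b changes.
  -- Synthesis tools typically still build an AND gate, which is
  -- how simulation and synthesis results can diverge.
  bad : process (a)
  begin
    y_bad <= a AND b;
  end process;
end main;
```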


1.3.7 Sequential Statements

Used inside processes and functions.

wait               wait until . . . ;
signal assignment  . . . <= . . . ;
if-then-else       if . . . then . . . elsif . . . end if;
case               case . . . is
                     when . . . | . . . => . . . ;
                     when . . . => . . . ;
                   end case;
loop               loop . . . end loop;
while loop         while . . . loop . . . end loop;
for loop           for . . . in . . . loop . . . end loop;
next               next . . . ;

The most commonly used sequential statements

1.3.8 A Few More Miscellaneous VHDL Features


1.4 Concurrent vs Sequential Statements

All concurrent assignments can be translated into sequential statements. But, not all sequential statements can be translated into concurrent statements.

1.4.1 Concurrent Assignment vs Process

The two code fragments below have identical behaviour:

architecture main of tiny is

begin

b <= a;

end main;

architecture main of tiny is

begin

process (a) begin

b <= a;

end process;

end main;

1.4.2 Conditional Assignment vs If Statements

The two code fragments below have identical behaviour:

Concurrent Statements

t <= <val1> when <cond>
     else <val2>;

Sequential Statements

if <cond> then
  t <= <val1>;
else
  t <= <val2>;
end if;

1.4.3 Selected Assignment vs Case Statement

The two code fragments below have identical behaviour

Concurrent Statements

with <expr> select
  t <= <val1> when <choices1>,
       <val2> when <choices2>,
       <val3> when <choices3>;

Sequential Statements

case <expr> is

when <choices1> =>

t <= < val1>;

when <choices2> =>

t <= < val2>;

when <choices3> =>

t <= < val3>;

end case;

1.4.4 Coding Style

Code that’s easy to write with sequential statements, but difficult with concurrent:

case < expr> is

when <choice1> =>

if < cond> then

o <= <expr1>;

else

o <= <expr2>;

end if;

when <choice2> =>

. . .

end case;

1.5 Overview of Processes

Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.

• Within a process, statements are executed almost sequentially

• Among processes, execution is done in parallel

• Remember: a process is a concurrent statement!

Process Semantics

• VHDL mimics hardware

• Hardware (gates) execute in parallel

• Processes execute in parallel with each other

• All possible orders of executing processes must produce the same simulation results (waveforms)

• If a signal is not assigned a value, then it holds its previous value

All orders of executing concurrent statements must produce the same waveforms

Process Semantics

architecture
  procA: process
    stmtA1;
    stmtA2;
    stmtA3;
  end process;
  procB: process
    stmtB1;
    stmtB2;
  end process;

[Figure: three possible execution sequences of A1, A2, A3, B1, B2: single threaded with procA before procB; single threaded with procB before procA; multithreaded with procA and procB in parallel]

All execution orders must have same behaviour

1.5.1 Combinational Process vs Clocked Process

Each well-written synthesizable process is either combinational or clocked.

Combinational process:

• Executing the process takes part of one clock cycle

• Target signals are outputs of combinational circuitry

• A combinational process must have a sensitivity list

• A combinational process must not have any wait statements

• A combinational process must not have any rising_edge or falling_edge conditions

• The hardware for a combinational process is just combinational circuitry

Clocked process:

• Executing the process takes one (or more) clock cycles

• Target signals are outputs of flops

• Process contains one or more wait or if rising_edge statements

• Hardware contains combinational circuitry and flip flops

Note: Clocked processes are sometimes called “sequential processes”, but this can be easily confused with “sequential statements”, so in E&CE 327 we’ll refer to synthesizable processes as either “combinational” or “clocked”.

Combinational or Clocked Process? (1)

process (a,b,c)

begin

p1 <= a;

if (b = c) then

p2 <= b;

else

p2 <= a;

end if;

end process;



process

begin


b <= a;

end process;



process (clk)

begin

if rising_edge(clk) then

b <= a;

end if;

end process;



process (clk)

begin

a <= clk;

end process;



process

begin

wait until rising_edge(a);

c <= b;

end process;

1.5.2 Latch Inference

The semantics of VHDL require that if a signal is assigned a value on some passes through a process and not on other passes, then on a pass through the process when the signal is not assigned a value, it must maintain its value from the previous pass.

process (a, b, c)

begin


z1 <= b;

z2 <= b;

else

z1 <= c;

end if;

end process;

[Figure: Example of latch inference, with inputs a, b, c and outputs z1, z2]

Latch Inference

When a signal’s value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value.

If you want a latch or a flip-flop for the signal, then latch inference is good.

If you want combinational circuitry, then latch inference is bad.
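When combinational circuitry is wanted, one common way to avoid an unwanted latch is to give every target signal a value on every pass, for example with a default assignment at the top of the process. The sketch below is an assumption-laden illustration: the entity name no_latch, the condition a = '1', and the default value '0' are all hypothetical choices, not taken from the example above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity no_latch is   -- hypothetical name
  port (
    a, b, c : in  std_logic;
    z1, z2  : out std_logic
  );
end no_latch;

architecture main of no_latch is
begin
  process (a, b, c)
  begin
    -- Default assignment: z2 is assigned on every pass through the
    -- process, so no storage (latch) is needed for it.
    z2 <= '0';
    if (a = '1') then   -- assumed condition, for illustration
      z1 <= b;
      z2 <= b;          -- overrides the default on this path
    else
      z1 <= c;          -- z2 keeps the default '0' on this path
    end if;
  end process;
end main;
```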

Loop, Latch, Flop

[Figure: three circuits, each with inputs a, b and output z: Combinational loop; Latch (with EN input); Flip-flop (D, Q)]

Question: Write VHDL code for each of the above circuits


1.6 Details of Process Execution

1.6.1 Simple Simulation

[Figure: circuit with inputs a, b, internal signals c, d, and output e, together with waveforms of a, b, c, d, e on a time axis of 0ns, 10ns, 12ns, 15ns]

Different Programs, Same Behaviour

All three programs below synthesize to the circuit on the previous slide.

The goal of VHDL semantics is that all three programs have the same behaviour.

process (a,b)

begin

c <= a and b;

end process;

process (b,c,d)

begin

d <= not c;

e <= b and d;

end process;

process (a,b,c,d)

begin

c <= a and b;

d <= not c;

e <= b and d;

end process;

process (a,b)

begin

c <= a and b;

end process;

process (c)

begin

d <= not c;

end process;

process (b,d)

begin

e <= b and d;

end process;


1.6.2 Temporal Granularities of Simulation


1.6.3 Intuition Behind Delta-Cycle Simulation

In zero-delay simulation, a sequence of dependent events must appear to happen instantaneously (in zero time). In particular, the effect of an event must propagate instantaneously through combinational circuitry.

Two fundamental rules for zero-delay simulation:

1. events appear to propagate through combinational circuitry instantaneously.

2. all of the gates appear to operate in parallel

Intuition for Delta Cycles

To make it appear that events propagate instantaneously, VHDL introduces an artificial unit of time, the delta cycle, to represent an infinitesimally small amount of time. In each delta cycle, every gate in the circuit will sample its inputs, compute its result, and drive its output signal with the result.

Simulators simulate one gate at a time, but the waveforms make it appear that all of the gates were run in parallel. In each delta cycle, the simulator executes all gates whose inputs changed.

To preserve the illusion that the gates ran in parallel, the effect of simulating a gate remains invisible until the end of the delta cycle.
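A small simulation-only sketch can make this invisibility concrete: two signals are swapped without a temporary, because both assignments read the values from before the delta cycle. The entity name swap_demo and signals x and y are hypothetical, and the initial values on the signals are used here only to set up the simulation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity swap_demo is   -- hypothetical name, simulation only
end swap_demo;

architecture main of swap_demo is
  signal x : std_logic := '0';
  signal y : std_logic := '1';
begin
  process
  begin
    -- Both right-hand sides are read before either signal is updated:
    -- the new values stay invisible until the end of the delta cycle.
    x <= y;
    y <= x;
    wait for 10 ns;
    -- The signals have been swapped: x = '1' and y = '0'.
    assert x = '1' and y = '0' report "swap failed" severity error;
    wait;   -- suspend forever
  end process;
end main;
```

If the assignment to x were visible immediately, the second statement would read the new x and both signals would end up as ’1’.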


1.6.4 Definitions and Algorithm

1.6.4.1 Process Modes

[Figure: process mode diagram with the modes suspended, postponed, and active, and the transitions resume, activate, and suspend]

Suspended

• Nothing to currently execute

• A process stays suspended until the event that it is waiting for occurs: either a change in a signal on its sensitivity list or the condition in a wait statement

Postponed

• Wants to execute, but not currently active

• A process stays postponed until the simulator chooses it from the pool of postponed processes

Active

• Currently executing

• A process stays active until it hits a wait statement or sensitivity list, at which point it suspends

1.6.4.2 Simulation Algorithm

The algorithm presented here is a simplification of the actual algorithm in the VHDL Standard.

This algorithm does not support delayed assignments; for example: (a <= b after 2 ns;).

A somewhat ironic note: only six of the two hundred pages in the VHDL Standard are devoted to the semantics of executing processes.

The Algorithm

Simulations start at step 1 with all processes postponed and all signals with a default value (e.g., ’U’ for std_logic).

1. While there are postponed processes:

(a) Pick one or more postponed processes to execute (become active).
(b) Provisionally execute assignments (new values become visible at step 3).
(c) A process executes until it hits its sensitivity list or a wait statement, at which point it suspends.
(d) Processes that become suspended stay suspended until there are no more postponed or active processes.

2. Each process checks its sensitivity list or wait condition to see if it should resume.

3. Update signals with their provisional values.

4. If no postponed processes, then increment simulation time to next event.

Notes on Simulation Algorithm

• At a wait statement, the process will suspend even if the condition is true in the current simulation cycle. The process will resume when the condition changes to true.

• In n-threaded execution, at most n processes are active at a time


1.6.4.3 Delta-Cycle Definitions

Definition simulation step: Executing one sequential assignment or process mode change.

Definition simulation cycle: The operations that occur in one iteration of the simulation algorithm.

Definition delta cycle: A simulation cycle that does not advance simulation time.

Definition simulation round: A sequence of simulation cycles that all have the same simulation time.

1.6.5 Example 1: Process Execution (Bamboozle)


1.6.6 Example 2: Process Execution (Flummox)

This example is a variation of the Bamboozle example from section 1.6.5.

[Figure: circuit with signals a, b, c, d, e and the start of the delta-cycle simulation chart at 0ns, with all signals at ’U’ and all processes postponed. Legend: simulation step; visible-assignment value; provisional-assignment value; simulation-step pointer (one per process); process mode (S=suspended, P=postponed, A=active); initial values]

proc1: process (a, b, c) begin

c <= a AND b;

end process;

proc2: process (b, d) begin

d <= NOT c;

end process;

e <= b AND d;

proc3: process begin

a <= ’1’;

b <= ’0’;

b <= ’1’;

wait for 3 ns;

wait for 99 ns;

end process;

[Figure: delta-cycle simulation chart for signals a, b, c, d, e and processes proc1, proc2, proc3, marked with delta cycles, sim cycles, and sim rounds]

Summary of the simulation algorithm, as used in the chart:

1. While there are postponed processes:
(a) Pick process(es) to activate
(b) Execute active processes, record prov asns
(c) Suspend at sens list or wait statement
(d) Once suspended, stay suspended
2. Check sens lists, wait conditions for changes
3. Update signals with provisional values
4. If no postponed procs, increment time

From Delta-Time to Real Time

[Figure: waveforms of a, b, c, d, e, all starting at ’U’, drawn twice: first on a delta-cycle time axis (0ns, +1δ, +2δ, +3δ, 3ns, +1δ, +2δ, +3δ, 102ns), then on a real-time axis (0ns, 1ns, 2ns, 3ns, 4ns, 100ns, 101ns, 102ns)]

Note and Questions

Note: If a signal is updated with the same value it had in the previous simulation cycle, then it does not change, and therefore does not trigger processes to resume.

Question: What are the different granularities of time that occur when doing delta-cycle simulation?

Question: What is the order of granularity, from finest to coarsest, amongst the different granularities related to delta-cycle simulation?

1.6.7 Ex: Need for Provisional Asn

architecture main of swindle is

begin

p_c: process (a, b) begin

c <= a AND b;

end process;

p_d: process (a, c) begin

d <= a XOR c;

end process;

end main;

Question: draw the circuit

Circuit to illustrate need for provisional assignments

1. Start with all signals at ’0’ .

2. Simultaneously change to a = ’1’ and b = ’1’ .

With Provisional Assignments, c Before d

If assignments are not visible within same simulation cycle (correct: i.e. provisional assignments are used)

[Figure: delta-cycle chart for signals a, b, c, d (all initially ’0’) and the modes of processes p_c and p_d]

If p_c is scheduled before p_d, then d will have a ’1’ pulse.

With Provisional Assignments, d Before c

If assignments are not visible within same simulation cycle (correct: i.e. provisional assignments are used)

[Figure: delta-cycle chart for signals a, b, c, d (all initially ’0’) and the modes of processes p_c and p_d]

If p_d is scheduled before p_c, then d will have a ’1’ pulse.

Without Prov. Assignments, c Before d

If assignments are visible within same simulation cycle (incorrect)

[Figure: delta-cycle chart for signals a, b, c, d (all initially ’0’) and the modes of processes p_c and p_d]

If p_c is scheduled before p_d, then d will stay constant ’0’.

Without Prov. Assignments, d Before c

If assignments are visible within same simulation cycle (incorrect)

[Figure: delta-cycle chart for signals a, b, c, d (all initially ’0’) and the modes of processes p_c and p_d]

If p_d is scheduled before p_c, then d will have a ’1’ pulse.

Need for Provisional Assignment

With provisional assignments, both orders of scheduling processes result in the same behaviour on all signals. Without provisional assignments, different scheduling orders result in different behaviour.

1.6.8 Delta-Cycle Simulations of Flip-Flops

p_a : process begin
  a <= ’0’;
  wait for 15 ns;
  a <= ’1’;
  wait for 20 ns;
end process;

p_clk : process begin
  clk <= ’0’;
  wait for 10 ns;
  clk <= ’1’;
  wait for 10 ns;
end process;

flop : process ( clk ) begin
  if rising_edge( clk ) then
    q <= a;
  end if;
end process;

[Figure: delta-cycle chart for signals a, clk, q and processes flop, p_a, p_clk, starting at 0ns with all signals at ’U’ and all processes postponed, marked with sim rounds, sim cycles, and delta cycles]

Redraw with Normal Time Scale

[Figure: waveforms of a, clk, q from 0ns to 35ns in 5ns steps]


Back-to-Back Flops

p_a : process begin
  a <= ’0’;
  wait for 15 ns;
  a <= ’1’;
  wait for 20 ns;
end process;

p_clk : process begin
  clk <= ’0’;
  wait for 10 ns;
  clk <= ’1’;
  wait for 10 ns;
end process;

flops : process ( clk ) begin
  if rising_edge( clk ) then
    q1 <= a;
    q2 <= q1;
  end if;
end process;

[Figure: delta-cycle chart for signals a, clk, q1, q2 and processes flops, p_a, p_clk, with sim rounds at 10ns, 15ns, 20ns, 30ns, 35ns, marked with sim cycles and delta cycles]

[Figure: waveforms of a, clk, q from 0ns to 35ns in 5ns steps]


External Inputs and Flops

Question: Do the signals b1 and b2 have the same behaviour from20–30 ns?


architecture mathilde of sauv e is

signal clk, a, b : std_logic;

begin

process begin

clk <= ’1’;

wait for 10 ns;

clk <= ’0’;

wait for 10 ns;

end process;

process begin

wait for 20 ns;

a1 <= ’1’;

end process;

process begin


a1 <= ’1’;

end process;

process begin

wait until rising_edge( clk );

b1 <= a1;

b2 <= a2;


Testbenches and Clock Phases

env : process begin
  a <= ’1’;
  clk <= ’0’;
  wait for 10 ns;
  a <= ’0’;
  clk <= ’1’;
  wait for 10 ns;
end process;

flop : process ( clk ) begin
  if rising_edge( clk ) then
    q1 <= a;
  end if;
end process;

[Figure: delta-cycle chart for signals a, clk, q1 and processes flop2, flop1, env, starting at 0ns, marked with sim rounds, sim cycles, and delta cycles]

[Figure: waveforms of a, clk, q1 at 0ns, 10ns, 20ns]


Warning

Note: Testbench signals. For consistent results across different simulators, simulation scripts vs test benches, and timing-simulation vs zero-delay simulation, do not change signals in your testbench or script at the same time as the clock changes.

[Figure: three sets of waveforms of a, clk, q1 from 0ns to 60ns in 10ns steps, all signals starting at ’U’: (1) a is output of clocked or combinational process; (2) a is output of timed process (testbench or environment), POOR DESIGN; (3) a is output of timed process (testbench or environment), GOOD DESIGN]

1.7 Register-Transfer-Level Simulation

[Figure: the same example simulated two ways for signals a, b, c, d, e and processes proc1, proc2, proc3. Delta cycle simulation: chart with sim rounds at 0ns, 3ns, and 102ns, each broken into delta cycles. RTL simulation: waveforms on a real-time axis of 0ns, 1ns, 2ns, 3ns, 102ns]

1.7.1 Overview

• Much simpler than delta cycle

• Columns are real time: clock cycles, nanoseconds, etc.

• Can simulate both synthesizable and unsynthesizable code

• Cannot simulate combinational loops

• Same values as delta-cycle at end of simulation round

process begin

a <= ’0’;

wait for 10 ns;

a <= ’1’;

...

end process;

process begin

b <= ’0’;

wait for 10 ns;

b <= a;

...

end process;

Question: In this code, what value should b have at 10 ns?

1.7.2 Technique for Register-Transfer Level Simulation

1. Pre-processing

(a) Separate processes into combinational and non-combinational (clocked and timed)

(b) Decompose each combinational process into separate processes with one target signal per process

(c) Sort processes into topological order based on dependencies

2. For each clock cycle or unit of time:

(a) Run non-combinational processes in any order. Non-combinational assignments read from earlier clock cycle / time step, except that clocked processes read the current value of the clock signal.

(b) Run combinational processes in topological order. Combinational assignments read from current clock cycle / time step.

1.7.3 Examples of RTL Simulation

1.7.3.1 RTL Simulation Example 1

We revisit an earlier example from delta-cycle simulation, but change the code slightly and do register-transfer-level simulation.

proc1: process (a, b, c) begin

d <= NOT c;

c <= a AND b;

end process;

proc2: process (b, d) begin

e <= b AND d;

end process;


a <= ’1’;

b <= ’0’;

wait for 3 ns;

b <= ’1’;

wait for 99 ns;

end process;

Decompose and sort comb procs

Decomposed:

proc1d: process (c) begin
  d <= NOT c;
end process;

proc1c: process (a, b) begin
  c <= a AND b;
end process;

proc2: process (b, d) begin
  e <= b AND d;
end process;

Sorted:

proc1c: process (a, b) begin
  c <= a AND b;
end process;

proc1d: process (c) begin
  d <= NOT c;
end process;

proc2: process (b, d) begin
  e <= b AND d;
end process;

Waveforms

[Figure: waveforms of a, b, c, d, e, all starting at ’U’, on a time axis of 0ns, 1ns, 2ns, 3ns, 102ns]

Example: Communicating State Machines

huey: process

begin

clk <= ’1’;

wait for 10 ns;

clk <= ’0’;

wait for 10 ns;

end process;

dewey: process

begin

a <= to_unsigned(0,4);

wait until re(clk);

while (a < 4) loop

a <= a + 1;

wait until re(clk);

end loop;

end process;

louie: process

begin

wait until re(clk);

d <= ’1’;

if (a >= 2) then

d <= ’0’;

wait until re(clk);

end if;

end process;

[Figure: waveforms of clk, a, d]

1.8 VHDL and Hardware Building Blocks

1.8.1 Basic Building Blocks

Different classes of building blocks:

• Conditional

• Arithmetic

• Storage

Basic Building Blocks: Boolean

VHDL   Description
and    AND gate
or     OR gate
not    inverter
nand   NAND gate
nor    NOR gate
xor    exclusive-or gate

Basic Building Blocks: Conditional

VHDL                                          Description
if-then-else, when-else, with-select, case   Multiplexer

Basic Building Blocks: Arithmetic

VHDL       Description
+          adder
-          subtracter
sla, sll   left shifter
sra, srl   right shifter

Basic Building Blocks: Storage

VHDL               Description          Pins in schematic
clocked process    flip flop            CE, S, R, D, Q
memory component   single-port memory   WE, A, DI, DO
memory component   dual-port memory     WE, A0, DI0, DO0, A1, DO1

1.8.2 Deprecated Building Blocks for RTL

Some of the common gates you have encountered in previous courses should be avoided when synthesizing register-transfer-level hardware, particularly if FPGAs are the implementation technology.

Latches : Use flops, not latches

T, JK, SR, etc flip-flops : Limit yourself to D-type flip-flops

Tri-State Buffers : Use multiplexers, not tri-state buffers

Note: Unfortunately and surprisingly, PalmChip has been awarded a US patent for using uni-directional busses (i.e. multiplexers) for system-on-chip designs. The patent was filed in 2000, so all fourth-year design projects completed after that date will need to pay royalties to PalmChip
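As an illustration of the "multiplexers, not tri-state buffers" guideline, the sketch below shows the tri-state style in comments and the multiplexer style as the actual code. The entity name bus_mux and its ports are hypothetical, chosen only for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity bus_mux is   -- hypothetical name
  port (
    sel    : in  std_logic;
    d0, d1 : in  std_logic;
    y      : out std_logic
  );
end bus_mux;

architecture main of bus_mux is
begin
  -- Tri-state style (to be avoided): two drivers share y and each
  -- releases the bus with 'Z' when it is not selected.
  --   y <= d0 when sel = '0' else 'Z';
  --   y <= d1 when sel = '1' else 'Z';

  -- Multiplexer style (preferred): exactly one driver for y.
  y <= d1 when sel = '1' else d0;
end main;
```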


What is This?

process (a)

begin

if rising_edge(a) then

c <= b;

end if;

end process;


1.8.3 Hardware and Code for Flops

1.8.3.1 Flops with Waits and Ifs

process (clk)
begin
  if rising_edge(clk) then
    q <= d;
  end if;
end process;

VHDL Code for Flip-Flop: Wait-Style

process
begin
  wait until rising_edge(clk);
  q <= d;
end process;

1.8.3.2 Flops with Synchronous Reset

process (clk)
begin
  if rising_edge(clk) then
    if (reset = ’1’) then
      q <= ’0’;
    else
      q <= d;
    end if;
  end if;
end process;

Flop with Synchronous Reset: Wait-Style

process
begin
  wait until rising_edge(clk);
  if (reset = ’1’) then
    q <= ’0’;
  else
    q <= d0;
  end if;
end process;


Variation on a Floppy Theme

Question: Synchronous or asynchronous reset?

process (clk, reset)

begin


q <= ’0’;

else


q <= d;

end if;

end if;

end process;


Variated Flop of a Theme

Question: Synchronous or asynchronous reset?

process

begin


q <= ’0’;

else

q <= d0;

end if;


end process;


Flop with Chip-Enable

process (clk)
begin
  if rising_edge(clk) then
    if (ce = ’1’) then
      q <= d;
    end if;
  end if;
end process;

Wait-style flop with chip-enable included in course notes


Q: Flop with a Mux on the Input?

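[Figure: D flip-flop clocked by clk with output q; a multiplexer controlled by sel chooses between d0 and d1 at the D input]

Q: Flops with a Mux on the Output?

[Figure: two D flip-flops clocked by clk, with inputs d0 and d1 and outputs q0 and q1; a multiplexer controlled by sel chooses between q0 and q1 to produce q]

Question: For the circuits with mux-on-input and mux-on-output, does q have the same behaviour in both circuits?


1.8.3.3 Flop with Chip-Enable and Mux on Input

Hint: Chip Enable

process (clk)
begin
  if rising_edge(clk) then
    if (ce = ’1’) then
      q <= d;
    end if;
  end if;
end process;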

1.8.3.4 Flops with Chip-Enable, Muxes, and Reset


1.8.4 An Example Sequential Circuit


1.9 Arrays and Vectors



1.10 Arithmetic

VHDL includes all of the common arithmetic and logical operators.

Use the VHDL arithmetic operators and let the synthesis tool choose the best implementation for you.

1.10.1 Arithmetic Packages

To do arithmetic with signals, use the numeric_std package. This package defines types signed and unsigned, which are std_logic vectors on which you can do signed or unsigned arithmetic.

numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.

Use only one arithmetic package, otherwise the different definitions will clash and you can get strange error messages.
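A minimal sketch of arithmetic with numeric_std follows. The entity name add8 and the 8-bit width are assumptions made for the example; the point is only that a single arithmetic package is used and the operands have type unsigned.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;   -- the only arithmetic package in use

entity add8 is   -- hypothetical name
  port (
    a, b : in  unsigned(7 downto 0);
    sum  : out unsigned(7 downto 0)
  );
end add8;

architecture main of add8 is
begin
  -- "+" on unsigned operands is defined in numeric_std. The result is
  -- as wide as the wider operand (8 bits here), so any carry out of
  -- the top bit is dropped.
  sum <= a + b;
end main;
```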


1.10.2 Shift and Rotate Operations


1.10.3 Overloading of Arithmetic


1.10.4 Different Widths and Arithmetic


1.10.5 Overloading of Comparisons


Overloading of Comparison Operations (=, /= , >=, >, <)

src1/2     src2/1
unsigned   integer   OK
signed     integer   OK
unsigned   signed    fails in analysis

1.10.6 Different Widths and Comparisons


1.10.7 Type Conversion

The functions unsigned, signed, to_integer, to_unsigned and to_signed are used to convert between integers, std-logic vectors, signed vectors and unsigned vectors.

If you convert between two types of the same width, then no additional hardware will be generated.

The listing below summarizes the types of these functions.

Type Conversion

unsigned( val : std_logic_vector ) return unsigned;

signed( val : std_logic_vector ) return signed;

to_integer( val : signed ) return integer;

to_integer( val : unsigned ) return integer;

to_unsigned( val : integer; width : natural) return unsigned;

to_signed( val : integer; width : natural) return signed;

Note: More details in course notes
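The conversions listed above can be chained, as in the sketch below. The entity name conv_demo, the 4-bit width, and the intermediate signals u and n are hypothetical. The final step also uses the std_logic_vector( ) type conversion, which is not in the listing above but is standard VHDL for converting an unsigned vector back to a std-logic vector.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity conv_demo is   -- hypothetical name
  port (
    v_in  : in  std_logic_vector(3 downto 0);
    v_out : out std_logic_vector(3 downto 0)
  );
end conv_demo;

architecture main of conv_demo is
  signal u : unsigned(3 downto 0);
  signal n : integer range 0 to 15;
begin
  u     <= unsigned(v_in);                       -- std_logic_vector -> unsigned
  n     <= to_integer(u);                        -- unsigned -> integer
  v_out <= std_logic_vector(to_unsigned(n, 4));  -- integer -> unsigned -> std_logic_vector
end main;
```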

1.11 Synthesizable vs Non-Synthesizable Code

Synthesis is done by matching VHDL code against templates or patterns.

It’s important to use idioms that your synthesis tools recognize.

Think like hardware: when you write VHDL, you should know what hardware you expect to be produced by the synthesizer.

1.11.1 Unsynthesizable Code

1.11.1.1 Initial Values

Initial values on signals (UNSYNTHESIZABLE)

signal bad_signal : std_logic := ’0’;

Reason : At powerup, the values on signals are random (except for some FPGAs).
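A common alternative to an initial value is to bring the signal to a known value with a reset. The sketch below uses a synchronous reset; the entity name reset_demo and its ports are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity reset_demo is   -- hypothetical name
  port (
    clk, reset, d : in  std_logic;
    q             : out std_logic
  );
end reset_demo;

architecture main of reset_demo is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        q <= '0';   -- known value comes from reset, not from an initial value
      else
        q <= d;
      end if;
    end if;
  end process;
end main;
```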


1.11.1.2 Wait For

Wait for length of time (UNSYNTHESIZABLE)

wait for 10 ns;

Reason : Delays through circuits are dependent upon both the circuit and its operating environment, particularly supply voltage and temperature. For example, imagine trying to build an AND gate that will have exactly a 2ns delay in all environments.
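In synthesizable code, a delay is usually expressed as a number of clock cycles rather than as an absolute time. The following is one possible sketch, assuming a clock and a synchronous reset are available; the entity name delay_demo, the 4-bit counter, and the count of 10 are all hypothetical choices.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity delay_demo is   -- hypothetical name
  port (
    clk, reset : in  std_logic;
    done       : out std_logic
  );
end delay_demo;

architecture main of delay_demo is
  signal count : unsigned(3 downto 0);
begin
  -- Clocked process: count rising edges of clk, stopping at 10.
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        count <= to_unsigned(0, 4);
      elsif (count < 10) then
        count <= count + 1;
      end if;
    end if;
  end process;

  -- done is '1' once 10 clock edges have been counted since reset.
  done <= '1' when count = 10 else '0';
end main;
```

The actual time that elapses depends on the clock period, which is a property of the system rather than of the VHDL code.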


1.11.1.3 Different Wait Conditions

wait statements with different conditions in a process (UNSYNTHESIZABLE)

-- different clock signals

process

begin

wait until rising_edge(clk1);

x <= a;

wait until rising_edge(clk2);

x <= a;

end process;

Reason : Would require the flip flops to use different clock signals at different times.

Different Wait Conditions

-- different clock edges
process
begin
  wait until rising_edge(clk);
  x <= a;
  wait until falling_edge(clk);
  x <= a;
end process;

Reason : Would require flip-flop to be sensitive to different clock edges at different times.


1.11.1.4 Multiple “if rising edge” in Process

Multiple if rising_edge statements in a process (UNSYNTHESIZABLE)

process (clk)
begin
  if rising_edge(clk) then
    q0 <= d0;
  end if;
  if rising_edge(clk) then
    q1 <= d1;
  end if;
end process;

Reason : The idioms for synthesis tools generally expect just a single if rising_edge statement in each process.

The simpler the VHDL code is, the easier it is to synthesize hardware. Programmers of synthesis tools make idiomatic (idiotic?) restrictions to make their jobs simpler.

1.11.1.5 “if rising edge” and “wait” in Same Process

An if rising_edge statement and a wait statement in the same process (UNSYNTHESIZABLE)

process (clk)
begin
  if rising_edge(clk) then
    q0 <= d0;
  end if;
  wait until rising_edge(clk);
  q0 <= d1;
end process;

Reason : The idioms for synthesis tools generally expect just a single type of flop-generating statement in each process.


1.11.1.6 “if rising edge” with “else” Clause

The if statement has a rising_edge condition and an else clause (UNSYNTHESIZABLE).

process (clk)
begin
  if rising_edge(clk) then
    q0 <= d0;
  else
    q0 <= d1;
  end if;
end process;

Reason : Generally, an if-then-else statement synthesizes to a multiplexer.


1.11.1.7 “if rising edge” Inside a “for” Loop

An if rising edge statement in a for-loop (UNSYNTHESIZABLE-Synopsys)

process (clk) begin
  for i in 0 to 7 loop
    if rising_edge(clk) then
      q(i) <= d;
    end if;
  end loop;
end process;

Reason : just an idiom of the synthesis tool.

Some loop statements are synthesizable (Rushton Section 8.7). For-loops in general are described in Ashenden.


Synthesizable Alternative

A synthesizable alternative to an if rising_edge statement in a for-loop is to put the if rising_edge outside of the for-loop.

process (clk) begin
  if rising_edge(clk) then
    for i in 0 to 7 loop
      q(i) <= d;
    end loop;
  end if;
end process;


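1.11.1.8 “wait” Inside of a “for loop”

wait statements in a for loop (UNSYNTHESIZABLE)

process
begin
  for i in 0 to 4 loop   -- loop bounds assumed, chosen to match the while-loop alternative
    wait until rising_edge(clk);
    x <= to_unsigned(i,4);
  end loop;
end process;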

Reason : Unknown. While-loops with the same behaviour are synthesizable.

Note: Combinational for-loops. Combinational for-loops are usually synthesizable. They are often used to build a combinational circuit for each element of an array.

Note: Clocked for-loops. Clocked for-loops are not synthesizable, but are very useful in simulation, particularly to generate test vectors for test benches.
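The two notes above can be illustrated with a short sketch. All names here (bitwise_and, tb, a, b, z, x) are hypothetical. The first design unit is a combinational for-loop that builds one AND gate per array element. The second is a simulation-only test bench that uses a clocked for-loop to generate test vectors.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Combinational for-loop: one AND gate per bit (usually synthesizable).
entity bitwise_and is
  port (
    a, b : in  std_logic_vector(7 downto 0);
    z    : out std_logic_vector(7 downto 0)
  );
end bitwise_and;

architecture main of bitwise_and is
begin
  process (a, b)
  begin
    for i in 0 to 7 loop
      z(i) <= a(i) and b(i);
    end loop;
  end process;
end main;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Clocked for-loop: simulation only, generates one test vector per cycle.
entity tb is
end tb;

architecture sim of tb is
  signal clk : std_logic := '0';       -- initial value is fine in simulation
  signal x   : unsigned(3 downto 0);
begin
  clk <= not clk after 5 ns;           -- free-running simulation clock

  process
  begin
    for i in 0 to 15 loop
      wait until rising_edge(clk);
      x <= to_unsigned(i, 4);
    end loop;
    wait;                              -- stop generating vectors
  end process;
end sim;
```

The test bench deliberately uses the constructs listed as unsynthesizable in this section (initial values, after delays, wait inside a for-loop), because it is never synthesized.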


Synthesizable Alternative to Wait-Inside-For

while loop (synthesizable)

This is the synthesizable alternative to the wait statement in a for loop above.

process
begin
  -- output values from 0 to 4 on i
  -- sending one value out each clock cycle
  i <= to_unsigned(0,4);
  wait until rising_edge(clk);
  while (4 > i) loop
    i <= i + 1;
    wait until rising_edge(clk);
  end loop;
end process;


1.12 Synthesizable VHDL Coding Guidelines


Chapter 2

RTL Design with VHDL: From Requirements to Optimized Code


2.1 Prelude to Chapter


2.2 FPGA Background and Coding Guidelines

2.2.1 Generic FPGA Hardware


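2.2.1.1 Generic FPGA Cell

“Cell” = “Logic Element” (LE) in Altera
       = “Configurable Logic Block” (CLB) in Xilinx

[Figure: generic FPGA cell containing a combinational block (comb) and a flip-flop (D, Q, CE, S, R). Signals: data_in, ctrl_in, carry_in, carry_out, data_out.]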


Configurable Comb/Flop Connection

[Figure: generic FPGA cell with a combinational block (comb) and a flip-flop (D, Q, CE, S, R). Signals: comb_data_in, ctrl_in, carry_in, carry_out, comb_data_out, flop_data_in, flop_data_out.]

Separate Comb and Flop

[Figure: the same cell and signals, configured so that the combinational block and the flip-flop are used separately.]

Connect Comb and Flop

[Figure: the same cell and signals, configured so that the combinational block is connected to the flip-flop.]

Flopped and Unflopped Outputs

[Figure: the same cell and signals, configured to provide both flopped and unflopped outputs.]


2.2.2 Area Estimation

To estimate the number of FPGA cells that will be required to implement a circuit, recall that an FPGA lookup-table can implement any function with up to four inputs and one output.

We will describe two methods to estimate the area (number of FPGA cells) required to implement a gate-level circuit:

1. Rough estimate based simply upon the number of flip-flops and primary inputs that are in the fanin of each flip-flop.

2. A more accurate estimate, based upon greedily including as many gates as possible into each FPGA cell.

Lower Bound on Area for Circuit with One Target

Source flops/inputs    Minimum cells
 1                     1
 2                     1
 3                     1
 4                     1
 5                     2
 6                     2
 7                     2
 8                     3
 9                     3
10                     3
11                     4

For a single target signal, this technique gives a lower bound on the number of cells needed.

For multiple target signals, this technique might be an overestimate, because a single cell can drive several other cells.
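One way to read the lower-bound table: the first cell absorbs up to four sources, and each additional cell contributes three new sources because one of its four inputs is taken by the output of another cell. That gives ceiling((n − 1) / 3) cells for n ≥ 2 sources. The sketch below expresses this as a VHDL function. The package name area_est and the function name min_cells are hypothetical, and the formula is an interpretation that reproduces the table entries for 1 to 11 sources, not a formula stated in the slides.

```vhdl
-- Hypothetical helper package: lower bound on the number of
-- four-input FPGA cells needed for one target signal.
package area_est is
  function min_cells (num_sources : positive) return positive;
end area_est;

package body area_est is
  function min_cells (num_sources : positive) return positive is
  begin
    if num_sources <= 4 then
      return 1;                        -- a single lookup table suffices
    else
      -- ceiling((num_sources - 1) / 3), written with integer division
      return (num_sources + 1) / 3;
    end if;
  end min_cells;
end area_est;
```

For example, min_cells(10) evaluates to (10 + 1) / 3 = 3 with integer division, which agrees with the table row for ten sources.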


Question: How many cells are needed to implement a 4:1 mux?


3 Cells for 10:1 Function


Estimate Area for Circuit

For each flip-flop and output: traverse backward through the fanin gathering as much combinational circuitry as possible into the FPGA cell.

Stopping conditions:

• flip-flop

• more than four inputs. However, even when there are more than four signals as input at one point, the circuit may collapse back to four or fewer signals further back in the fanin.


Question: Map the combinational circuits below onto generic FPGA cells.

[Figure: combinational circuit with inputs a, b, c, d and output z, followed by six empty generic FPGA cells (comb block plus flip-flop with D, Q, CE, S, R) for the mapping exercise.]


2.2.2.1 Interconnect for Generic FPGA


2.2.2.2 Clocks for Generic FPGAs

Characteristics of clock signals:

• High fanout (drive many gates)

• Long wires (destination gates scattered all over chip)

Characteristics of FPGAs:

• Very few gates that are large (strong) enough to support a high fanout.

• Very few wires that traverse entire chip and can be connected to every flip-flop.


2.2.2.3 Special Circuitry in FPGAs

Memory

For more than five years, FPGAs have had special circuits for RAM and ROM. In Altera FPGAs, these circuits are called ESBs (Embedded System Blocks). These special circuits are possible because many FPGAs are fabricated on the same processes as SRAM chips. So, the FPGAs simply contain small chunks of SRAM.


Microprocessors

A new feature to appear in FPGAs in 2001 and 2002 is hardwired microprocessors on the same chip as programmable hardware.

                         Hard                            Soft
Altera                   Arm 922T with 200 MIPs          Nios with ?? MIPs
Xilinx: Virtex-II Pro    Power PC 405 with 420 D-MIPs    Microblaze with 100 D-MIPs

The Xilinx Virtex-II Pro has 4 Power PCs and enough programmable hardware to implement the first-generation Intel Pentium microprocessor.


Arithmetic Circuitry

A new feature to appear in FPGAs in 2001 and 2002 is hardwired circuits for multipliers and adders.

Altera: Mercury          16×16 at 130MHz
Xilinx: Virtex-II Pro    18×18 at ???MHz

Using these resources can significantly improve both the area and performance of a design.


Input / Output

Recently, high-end FPGAs have started to include special circuits to increase the bandwidth of communication with the outside world.

          Product
Altera    True-LVDS (1 Gbps)
Xilinx    Rocket I/O (3 Gbps)


2.2.3 Generic-FPGA Coding Guidelines

Flip Flops Are Free

• Flip-flops are almost free in FPGAs

reason In FPGAs, the area consumed by a design is usually determined by the amount of combinational circuitry, not by the number of flip-flops.


Use It or Lose It

• Aim for using 80–90% of the cells on a chip.

reason If you use more than 90% of the cells on a chip, then the place-and-route program might not be able to route the wires to connect the cells.

reason If you use less than 80% of the cells, then probably:

there are optimizations that will increase performance and still allow the design to fit on the chip;

or you spent too much human effort on optimizing for low area;

or you could use a smaller (cheaper!) chip.

exception In E&CE 327 (unlike in real life), the mark is based on the actual number of cells used.


Just One Clock

• Use just one clock signal

reason If all flip-flops use the same clock, then the clock does not impose any constraints on where the place-and-route tool puts flip-flops and gates. If different flip-flops used different clocks, then flip-flops that are near each other would probably be required to use the same clock.


Just One Clock Edge

• Use only one edge of the clock signal

reason There are two ways to use both rising and falling edges of a clock signal: have rising-edge and falling-edge flip flops, or have two different clock signals that are inverses of each other. Most FPGAs have only rising-edge flip flops. Thus, using both edges of a clock signal is equivalent to having two different clock signals, which is deprecated by the preceding guideline.


2.3 Design Flow


2.4 Algorithms and High-Level Models



2.5 Finite State Machines in VHDL

2.5.1 Introduction to State-Machine Design

2.5.1.1 Mealy vs Moore State Machines


Moore Machines

• Outputs are dependent upon only the state

• No combinational paths from inputs to outputs

[Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0. From s0, input a leads to s1 and !a leads to s2.]


Mealy Machines

• Outputs are dependent upon both the state and the inputs

• Combinational paths from inputs to outputs

[Figure: Mealy state diagram with states s0, s1, s2, s3. Transition labels: a/1, !a/0, /0, /0.]


2.5.1.2 Introduction to State Machines and VHDL

A state machine is generally written as a single clocked process, or as a pair of processes, where one is clocked and one is combinational.

Design Decisions

• Moore vs Mealy (Sections 2.5.2 and 2.5.3)

• Implicit vs Explicit (Section 2.5.1.3)

• State values in explicit state machines: Enumerated type vs constants (Section 2.5.5)

• State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.5.5)


VHDL Constructs for State Machines

The following VHDL control constructs are useful to steer the transition from state to state:

• if ... then ... else

• case

• for ... loop

• while ... loop

• loop

• next

• exit


2.5.1.3 Explicit vs Implicit State Machines

There are two styles of writing state machines in VHDL: explicit and implicit.

Explicit

• State signal appears explicitly in VHDL code

• At most one wait statement per process

• Two sub-categories of explicit state machines

Explicit-Current

– State signal represents current state

– Next-state computation done in a clocked process

Explicit-Current+Next

– Two state signals: current state and next state

– Next-state computation done in a combinational process

– Current-state <= next-state is registered assignment

Implicit

• Use multiple wait statements in a process to describe the state machine implicitly


Implicit State Machines

For the implicit style of writing state machines, the synthesis program adds an implicit register to hold the state signal and combinational circuitry to update the state signal. In Synopsys synthesis tools, the state signal defined by the synthesizer is named multiple_wait_state_reg.

In Mentor Graphics, the state signal is named STATE_VAR.

We can think of the VHDL code for implicit state machines as having zero state signals, explicit-current state machines as having one state signal (state), and explicit-current+next state machines as having two state signals (state and state_next).


State Machine Tradeoffs

Explicit-Current+Next

• Most detailed, closest to hardware

• Greatest opportunity for manual optimization

• Most labour-intensive

• Susceptible to small, subtle, hard-to-find bugs

Explicit-Current

• Almost as much opportunity for manual optimization as Explicit-Current+Next

• Easier to write than Explicit-Current+Next

• Less susceptible to subtle bugs

Implicit

• Taught infrequently

• Least detailed, furthest from actual hardware

• Rely on synthesis for optimization

• Usually least labour to write, shortest code

• Easiest to write correctly (But must understand VHDL synthesis! )


Limitation of Implicit State Machines

Because implicit state machines are written with loops, if-then-elses, cases, etc., it is difficult to write some state machines with complicated control flows in an implicit style. The following example illustrates the point.

[Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0 and transitions labelled a, !a, !a, a.]


Terminology

Note: The terminology of “explicit” and “implicit” is somewhat standard, in that some descriptions of processes with multiple wait statements describe the processes as having “implicit state machines”.

There is no standard terminology to distinguish between the two explicit styles: explicit-current+next and explicit-current.


2.5.2 Implementing a Simple Moore Machine

[Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0. From s0, input a leads to s1 and !a leads to s2.]

entity simple is

port (

a, clk : in std_logic;

z : out std_logic

);

end simple;


2.5.2.1 Implicit Moore State Machine

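architecture moore_implicit_v1a of simple is
begin
  process
  begin
    z <= '0';
    wait until rising_edge(clk);
    if (a = '1') then
      z <= '1';
    else
      z <= '0';
    end if;
    wait until rising_edge(clk);
    z <= '0';
    wait until rising_edge(clk);
  end process;
end moore_implicit_v1a;

Flops
Gates
Delay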


Implicit Moore State Machine

[Figure: state diagram fragment showing state s2/0 and the transition !a.]


2.5.2.2 Explicit Moore with Flopped Output

architecture moore_explicit_v1 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
            z <= '1';
          else
            state <= s2;
            z <= '0';
          end if;
        when s1 | s2 =>
          state <= s3;
          z <= '0';
        when s3 =>
          state <= s0;
          z <= '0';   -- the output in the next state, s0, is '0'
      end case;
    end if;
  end process;
end moore_explicit_v1;

Flops
Gates
Delay


Explicit Moore with Flopped Outputs


2.5.2.3 Explicit Moore with Combinational Outputs

architecture moore_explicit_v2 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when s3 =>
          state <= s0;
      end case;
    end if;
  end process;
  z <= '1' when (state = s1)
       else '0';
end moore_explicit_v2;

Flops
Gates
Delay


Explicit Moore with Combinational Outputs


2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment

architecture moore_explicit_v3 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;
  state_nxt <= s1 when (state = s0) and (a = '1')
          else s2 when (state = s0) and (a = '0')
          else s3 when (state = s1) or (state = s2)
          else s0;
  z <= '1' when (state = s1)
       else '0';
end moore_explicit_v3;

Flops
Gates
Delay


Explicit-Current+Next Moore with Concurrent Assignment

The hardware synthesized from this architecture is the same as that synthesized from moore_explicit_v2, which is written in the explicit-current style.


2.5.2.5 E-C+N Moore with Comb Proc

architecture moore_explicit_v4 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;
  process (state, a)
  begin
    case state is
      when s0 =>
        if (a = '1') then
          state_nxt <= s1;
        else
          state_nxt <= s2;
        end if;
      when s1 | s2 =>
        state_nxt <= s3;
      when s3 =>
        state_nxt <= s0;
    end case;
  end process;
  z <= '1' when (state = s1)
       else '0';
end moore_explicit_v4;

Change the conditional assignment to state_nxt into a combinational process using a case statement.

Flops
Gates
Delay

Same hardware as moore_explicit_v2 and v3.

Explicit-Current+Next Moore with Combinational Process

2.5.3 Implementing a Simple Mealy Machine

Mealy machines have a combinational path from inputs to outputs, which often violates good coding guidelines for hardware. Thus, Moore machines are much more common. You should know how to write a Mealy machine if needed, but most of the state machines that you design will be Moore machines.
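As an illustration, here is a minimal sketch of a Mealy machine in the explicit-current style. It assumes the Mealy state diagram from Section 2.5.1.1 is read as follows: from s0, input a leads to s1 with output 1 and !a leads to s2 with output 0, and every other transition outputs 0. The architecture name mealy_explicit is hypothetical, and the entity simple is repeated so that the example is self-contained.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity simple is
  port (
    a, clk : in  std_logic;
    z      : out std_logic
  );
end simple;

architecture mealy_explicit of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  -- clocked process: state transitions only
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when s3 =>
          state <= s0;
      end case;
    end if;
  end process;

  -- Mealy output: depends on both the state and the input a,
  -- so there is a combinational path from a to z
  z <= '1' when (state = s0) and (a = '1')
       else '0';
end mealy_explicit;
```

The only difference from a Moore machine with combinational outputs is that the output expression mentions the input a, which is exactly the combinational path from input to output.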



2.5.4 Reset

All circuits should have a reset signal that puts the circuit back into a good initial state. However, not all flip flops within the circuit need to be reset. In a circuit that has a datapath and a state machine, the state machine will probably need to be reset, but the datapath may not need to be reset.

There are standard ways to add a reset signal to both explicit and implicit state machines.

It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.

Reset with Implicit State Machine

• Insert a loop

• Test for reset after each wait

Example from section 2.5.2.1:

-- assumes that reset has been added to the ports of simple
architecture moore_implicit of simple is
begin
  process
  begin
    init : loop                        -- outermost loop
      z <= '0';
      wait until rising_edge(clk);
      next init when (reset = '1');    -- test for reset
      if (a = '1') then
        z <= '1';
      else
        z <= '0';
      end if;
      wait until rising_edge(clk);
      next init when (reset = '1');    -- test for reset
      z <= '0';
      wait until rising_edge(clk);
      next init when (reset = '1');    -- test for reset
    end loop init;
  end process;
end moore_implicit;


Reset with Explicit State Machine

Reset is often easier to include in an explicit state machine, because we need only put a test for reset = '1' in the clocked process for the state.

The pattern for an explicit-current style of machine is:

process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= S0;
    else
      if ... then
        state <= ...;
      elsif ... then
        ... -- more tests and assignments to state
      end if;
    end if;
  end if;
end process;


Reset with Explicit State Machine

Applying this pattern to the explicit Moore machine from section 2.5.2.3 produces:



process (clk)
begin
  if rising_edge(clk) then
    if (reset = '1') then
      state <= s0;
    else
      case state is
        ...
      end case;
    end if;
  end if;
end process;




Reset with Explicit-Current+Next

The pattern for an explicit-current+next style is:

process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state_cur <= reset_state;   -- placeholder for the chosen reset state
    else
      state_cur <= state_nxt;
    end if;
  end if;
end process;

2.5.5 State Encoding



2.6 Dataflow Diagrams

2.6.1 Dataflow Diagrams Overview

• Dataflow diagrams are data-dependency graphs where the computation is divided into clock cycles.

• Purpose:

– Provide a disciplined approach for designing datapath-centric circuits

– Guide the design from algorithm, through high-level models, and finally to register transfer level code for the datapath and control circuitry.

– Estimate area and performance

– Make tradeoffs between different design options

• Background

– Based on techniques from high-level synthesis tools

– Some similarity between high-level synthesis and software compilation

– Each dataflow diagram corresponds to a basic block in software compiler terminology.


Data-Dependency Graph

[Figure: graph with inputs a, b, c, d, e, f, five adders, intermediate signals x1, x2, x3, x4, and output z.]

Data-dependency graph for z = a + b + c + d + e + f

Dataflow Diagrams

[Figure: dataflow diagram with inputs a to f, five adders, intermediate signals x1 to x4, and output z.]

Dataflow diagram for z = a + b + c + d + e + f

Clock Cycle Boundaries

[Figure: the same dataflow diagram.]

Horizontal lines mark clock cycle boundaries

Latency

[Figure: the same dataflow diagram, with clock cycles numbered 1 to 6.]

Latency = 6 clock cycles

Latency

[Figure: the same dataflow diagram, with clock cycles numbered 1 to 4.]

Latency = 4 clock cycles

Question: Why would a good hardware engineer find this design dissatisfying?

Flip Flops

[Figure: the same dataflow diagram.]

Signals crossing clock boundaries are flip-flops

Registered Inputs and Outputs

[Figure: the same dataflow diagram.]

Flops on both inputs and outputs

Registered Inputs, Combinational Outputs

[Figure: the same dataflow diagram.]

Flops on inputs, but not outputs (Latency = 5)

Datapath Components

[Figure: the same dataflow diagram.]

Blocks in clock cycles are datapath components

Inputs

[Figure: the same dataflow diagram.]

Unconnected signal tails are inputs

Outputs

[Figure: the same dataflow diagram.]

Unconnected signal heads are outputs

Summary

[Figure: the same dataflow diagram.]

Unconnected signal heads are outputs


2.6.2 Dataflow Diagrams, Hardware, and Behaviour

Primary Input

[Figure: three views of input i and signal x. Dataflow diagram; hardware; behaviour (waveforms for clk, i, x).]

Register Input

[Figure: three views of input i and signal x. Dataflow diagram; hardware; behaviour (waveforms for clk, i, x).]

Register Signal

[Figure: three views of an adder with inputs i1, i2 and result x. Dataflow diagram; hardware; behaviour (waveforms for clk, i1, i2, x).]

Combinational-Component Output

[Figure: three views of an adder with inputs i1, i2 and result x. Dataflow diagram; hardware; behaviour (waveforms for clk, i1, i2, x).]


2.6.3 Dataflow Diagram Execution

Execution with Registers on Both Inputs and Outputs

[Figure sequence: dataflow diagram for z = a + b + c + d + e + f (with registered values x1 to x5) beside a waveform for clk, a, x1, x2, x3, x4, x5, z over clock cycles 0 to 6. Successive slides advance the execution one clock cycle at a time, from cycle 0 to cycle 6.]

Execution Without Output Registers

[Figure: the same dataflow diagram and waveform, with execution shown for clock cycles 0 to 5.]


2.6.4 Performance Estimation

Performance Equations

Performance ∝ 1 / TimeExec

TimeExec = Latency × ClockPeriod

Definition Latency: Number of clock cycles from inputs to outputs. A combinational circuit has latency of zero. A single register has a latency of one. A chain of n registers has a latency of n.

Performance of Dataflow Diagrams

• Latency: count horizontal lines in diagram

• Min clock period (Max clock speed) limited by longest path in a clock cycle


2.6.5 Area Estimation

• Maximum number of blocks in a clock cycle is total number of that component that are needed

• Maximum number of signals that cross a cycle boundary is total number of registers that are needed

• Maximum number of unconnected signal tails in a clock cycle is total number of inputs that are needed

• Maximum number of unconnected signal heads in a clock cycle is total number of outputs that are needed

These estimates give lower bounds.

Other constraints might force you to use more components.

Area Estimation

Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath components, might force you to make tradeoffs that increase the number of datapath components to decrease the overall area of the circuit.

• With some FPGA chips, a 2:1 multiplexer has the same area as an adder.

• With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell per bit.

• In FPGAs, registers are usually “free”, in that the area consumed by a circuit is limited by the amount of combinational logic, not the number of flip-flops.

2.6.6 Design Analysis

[Figure: dataflow diagram with inputs a to f, five adders, registered values x1 to x4, and output z.]

num inputs
num outputs
num registers
num adders
min clock period
latency

Design Analysis (Cont’d)

[Figure: dataflow diagram with inputs a to f, five adders, registered values x1 to x5, and output z.]

num inputs
num outputs
num registers
num adders
min clock period
latency

2.6.7 Area / Performance Tradeoffs

[Figure: two dataflow diagrams for the sum of a to f, side by side. Left: one add per clock cycle, clock cycles 0 to 6. Right: two adds per clock cycle, clock cycles 0 to 4.]

Note: In the “Two-add” design, half of the last clock cycle is wasted.


Two Adds per Clock Cycle

[Figure: dataflow diagram with two adds per clock cycle (clock cycles 0 to 4) beside a waveform for clk, a, x1, x2, x3, x4, x5, z over clock cycles 0 to 6.]

Design Comparison

[Figure: the one-add and two-add dataflow diagrams, side by side.]

                One add per clock cycle    Two adds per clock cycle
inputs          6                          6
outputs         1                          1
registers       6                          6
adders          1                          2
clock period    flop + 1 add               flop + 2 add
latency         6                          4

Question: Under what circumstances would each design option be fastest?


2.7 Design Example: Massey


2.8 Design Example: Vanier

We’ll go through the following artifacts:

1. requirements

2. algorithm

3. dataflow diagram

4. high-level models

5. hardware block diagram

6. RTL code for datapath

7. state machine

8. RTL code for control

Design Process

1. Scheduling (allocate operations to clock cycles)

2. I/O allocation

3. First high-level model

4. Register allocation

5. Datapath allocation

6. Connect datapath components, insert muxes where needed

7. Design implicit state machine

8. Optimize

9. Design explicit-current state machine

10. Optimize


2.8.1 Requirements

• Functional requirements: compute the following formula:

output = (a × d) + c + (d × b) + b

• Performance requirement:

– Max clock period: flop plus (2 adds or 1 multiply)

– Max latency: 4

• Cost requirements

– Maximum of two adders

– Maximum of two multipliers

– Unlimited registers

– Maximum of three inputs and one output

– Maximum of 5000 student-minutes of design effort

• Registered inputs and outputs


2.8.2 Algorithm

output = (a × d) + c + (d × b) + b

Create a data-dependency graph for the algorithm.

[Figure: data-dependency graph with inputs a, d, b, c, three adders, and output z.]


2.8.3 Initial Dataflow Diagram

Schedule operations into clock cycles.

[Figure: initial dataflow diagram with inputs a, d, b, c, three adders, and output z.]

2.8.4 Reschedule to Meet Requirements

[Figure: two versions of the dataflow diagram with inputs a, d, b, c and output z, before and after rescheduling.]


Fix Clock Period Violation

[Figure: two versions of the dataflow diagram with inputs a, d, b, c, three adders, and output z.]

2.8.5 Optimize Resources

[Figure: two versions of the dataflow diagram with inputs a, d, b, c and output z.]


Analysis

[Figure: dataflow diagram with inputs a, d, b, c, three adders, and output z.]

Question: Should we move the second addition from the third clock cycle to the second?


Define Entity

Having finalized our input/output scheduling, we can write our entity. Note: we will add a reset signal later, when we design the state machine to control the datapath.

entity vanier is

port (

clk : in std_logic;

i_1, i_2 : in std_logic_vector(15 downto 0);

o_1 : out std_logic_vector(15 downto 0)

);

end vanier;


2.8.6 Assign Names to Registered Values

[Figure: dataflow diagram with inputs a, d, b, c, three adders, and output z.]

Question: Why do we not need to assign names to combinational signals?

Question: Why do we not need to assign a new name to x1, x2, and x4 the second time they cross a clock cycle boundary?

2.8.7 Input/Output Allocation

[Figure: dataflow diagram with inputs a, d, b, c, output z, and registered values x1 to x8.]


VHDL Code!

architecture hlm_v1 of vanier is
  signal x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8 : unsigned(15 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    x_1 <= unsigned(i_1);
    x_2 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_3 <= unsigned(i_1);
    x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
    x_5 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
    x_7 <= x_2 + x_5;
    wait until rising_edge(clk);
    x_8 <= x_6 + (x_4 + x_7);
  end process;
  o_1 <= std_logic_vector(x_8);
end hlm_v1;

[Figure: dataflow diagram annotated with inputs i1, i2, output o1, registered values x1 to x8, and clock cycles 0 to 4, beside timing charts for x1 to x8 and for registers r1 to r5 over clock cycles 0 to 5.]


2.8.8 Tangent: Combinational Outputs

architecture hlm_v1c of vanier is
  signal x_1, x_2, x_3, x_4, x_5, x_6, x_7 : unsigned(15 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    x_1 <= unsigned(i_1);
    x_2 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_3 <= unsigned(i_1);
    x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
    x_5 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
    x_7 <= x_2 + x_5;
  end process;
  o_1 <= std_logic_vector(x_6 + (x_4 + x_7));
end hlm_v1c;

[Figure: dataflow diagram annotated with inputs i1, i2, output o1, and registered values x1 to x7.]

2.8.9 Register Allocation

[Figure: dataflow diagram annotated with inputs i1, i2, output o1, and registered values x1 to x7.]

New VHDL Code!

[Figure: dataflow diagram annotated with inputs i1, i2, output o1, registered values x1 to x8, and the registers r1 to r5 that hold them.]

architecture hlm_v2 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : unsigned(15 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    r_1 <= unsigned(i_1);
    r_2 <= unsigned(i_2);
    wait until rising_edge(clk);
    r_3 <= unsigned(i_1);
    r_4 <= r_1(7 downto 0) * r_2(7 downto 0);
    r_5 <= unsigned(i_2);
    wait until rising_edge(clk);
    r_2 <= r_3(7 downto 0) * r_1(7 downto 0);
    r_5 <= r_2 + r_5;
    wait until rising_edge(clk);
    r_5 <= r_2 + (r_4 + r_5);
  end process;
  o_1 <= std_logic_vector(r_5);
end hlm_v2;

2.8.10 Datapath Allocation

[Figure: dataflow diagram annotated with inputs i1, i2, output o1, registered values x1 to x8, and registers r1 to r5.]


2.8.11 Hardware Block Diagram and State Machine

1. Calculate number of states that are needed

2. Control signals for registers

• Chip enable

• Mux select on input

3. Control signals for datapath components

• Instruction (e.g. add/sub for ALU)

• Mux select on inputs

For our example: Use four states: S0..S3, one for each clock cycle.

2.8.11.1 Control for Registers

Build a table with one row per state, one column per register.

[Figure: dataflow diagram annotated with inputs i1, i2, output o1, registered values x1 to x8, registers r1 to r5, datapath components m1, a1, a2, and states S0 to S3.]

      r1        r2        r3        r4        r5
      ce  d     ce  d     ce  d     ce  d     ce  d
S0
S1
S2
S3


Optimize chip enables and muxes

      r1        r2        r3        r4        r5
      ce  d     ce  d     ce  d     ce  d     ce  d
S0    1   i1    1   i2    –   –     –   –     –   –
S1    0   –     0   –     1   i1    1   m1    1   i2
S2    –   –     1   m1    –   –     0   –     1   a1
S3    –   –     –   –     –   –     –   –     1   a1

• Chip enable: a register holds a value for multiple clock cycles.

• Mux: a register loads values from multiple sources.


Optimized Chip Enables and Muxes

      r1=i1    r2        r3=i1    r4=m1    r5
      ce       ce  d              ce       d
S0    1        1   i2             –        –
S1    0        0   –              1        i2
S2    –        1   m1             0        a1
S3    –        –   –              –        a1


2.8.11.2 Control for Datapath Components

• Table for datapath components.

• One row per state.

• One column per datapath component.

• Sub-columns for sources and instructions (e.g. add/sub for ALU).

[Figure: dataflow diagram annotated with registers r1 to r5, datapath components m1, a1, a2, and states S0 to S3.]

      a1              a2              m1
      src1  src2      src1  src2      src1  src2
S0    –     –         –     –         –     –
S1    –     –         –     –         r1    r2
S2    r2    r5        –     –         r3    r1
S3    r2    a2        r4    r5        –     –


Optimize Datapath Control Table

      a1              a2              m1
      src1  src2      src1  src2      src1  src2
S0    –     –         –     –         –     –
S1    –     –         –     –         r1    r2
S2    r2    r5        –     –         r1    r3
S3    r2    a2        r4    r5        –     –


2.8.11.3 Control for State

We need to control the transition from one state to the next. For this example, the transition is very simple: each state transitions to its successor, S0 → S1 → S2 → S3 → S0 ...


2.8.11.4 Complete State Machine Table

      r1 ce   r2 ce   r2 sel   r4 ce   r5 sel   a1 src2 sel   m1 src2 sel   state
S0    1       1       i2       –       –        –             –             S1
S1    0       0       –        1       i2       –             r2            S2
S2    –       1       m1       0       a1       r5            r3            S3
S3    –       –       –        –       a1       a2            –             S0

Question: What values should we use for don’t cares?


“Don’t Cares” Instantiations

      r1 ce   r2 ce   r2 sel   r4 ce   r5 sel   a1 src2 sel   m1 src2 sel   state
S0    1       1       i2       0       a1       a2            r3            S1
S1    0       0       m1       1       i2       a2            r2            S2
S2    1       1       m1       0       a1       r5            r3            S3
S3    1       1       m1       0       a1       a2            r3            S0

2.8.12 VHDL Code with Explicit State Machine

We chose a one-hot encoding of the state, which usually results in small and fast hardware for state machines with sixteen or fewer states.

architecture explicit_v1 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
  subtype state_ty is std_logic_vector(3 downto 0);
  constant s0 : state_ty := "0001";
  constant s1 : state_ty := "0010";
  constant s2 : state_ty := "0100";
  constant s3 : state_ty := "1000";
  signal state : state_ty;
  -- Combinational datapath signals. The arithmetic below is schematic:
  -- with numeric_std the operands would need to be unsigned, as in hlm_v2.
  signal a_1, a_2, m_1, a1_src2, m1_src2 : std_logic_vector(15 downto 0);
begin
  ------------------------ r_1
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        r_1 <= i_1;
      end if;
    end if;
  end process;
  ------------------------ r_2
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        if state = S0 then
          r_2 <= i_2;
        else
          r_2 <= m_1;
        end if;
      end if;
    end if;
  end process;
  ------------------------ r_3
  process (clk) begin
    if rising_edge(clk) then
      r_3 <= i_1;
    end if;
  end process;
  ------------------------ r_4
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_4 <= m_1;
      end if;
    end if;
  end process;
  ------------------------ r_5
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_5 <= i_2;
      else
        r_5 <= a_1;
      end if;
    end if;
  end process;
  ------------------------ combinational datapath
  with state select
    a1_src2 <= r_5 when S2,
               a_2 when others;
  with state select
    m1_src2 <= r_2 when S1,
               r_3 when others;
  a_1 <= r_2 + a1_src2;   -- src1 of a1 is r2 in the datapath control table
  a_2 <= r_4 + r_5;
  m_1 <= r_1(7 downto 0) * m1_src2(7 downto 0);   -- 8x8 multiply, as in hlm_v2
  o_1 <= r_5;
  ------------------------ state machine
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        case state is
          when S0 => state <= S1;
          when S1 => state <= S2;
          when S2 => state <= S3;
          when S3 => state <= S0;
          when others => state <= S0;   -- required: state is a std_logic_vector
        end case;
      end if;
    end if;
  end process;
  ----------------------
end explicit_v1;


Hardware Block Diagram

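[Figure: dataflow diagram (inputs a, b, c, d; output z; states S0 to S3) annotated with registers r1 to r5 and components m1, a1, a2, alongside the hardware block diagram with inputs i1 and i2, registers r1 to r5, multiplier m1, adders a1 and a2, and output o1.]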


2.8.13 Peephole Optimizations

-- r_1
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      r_1 <= i_1;
    end if;
  end if;
end process;

-- r_1 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      r_1 <= i_1;
    end if;
  end if;
end process;


Peephole Optimizations

-- r_2
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      if state = S0 then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;

-- r_2 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      if state(0) = '1' then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;


Peephole Optimizations

-- state machine
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= S0;
    else
      case state is
        when S0 => state <= S1;
        when S1 => state <= S2;
        when S2 => state <= S3;
        when S3 => state <= S0;
      end case;
    end if;
  end if;
end process;

-- state machine (optimized)
-- NOTE: "st" = "state"
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      st <= S0;
    else
      -- one-hot state: rotate the single '1' bit by one position
      for i in 0 to 3 loop
        st( (i+1) mod 4 ) <= st( i );
      end loop;
    end if;
  end if;
end process;


2.8.14 Notes and Observations

Our functional requirements were written as:

output = (a × d) + (d × b) + b + c

Alternatively, we could have achieved exactly the same functionality with the functional requirements written as (the two statements are mathematically equivalent):

output = (a × d) + b + (d × b) + c


Data Dependency Graphs: Clean vs Ugly

The naive data dependency graph for the alternative formulation is much messier than the data dependency graph for the original formulation:

Original: (a × d) + (d × b) + b + c

Alternative: (a × d) + c + (d × b) + b

[Figure: data dependency graphs for the original and alternative formulations, each with inputs a, b, c, d and output z.]


2.9 Pipelining

Pipelining is an optimization that increases performance by overlapping the execution of multiple parcels (instructions). The cost is an increase in area, because we cannot reuse datapath components, registers, inputs, or outputs.

2.9.1 Introduction to Pipelining


Review of unpipelined dataflow diagram

[Figure: unpipelined dataflow diagram with inputs a to f, five additions scheduled in clock cycles 0 to 5 (all on adder add1, using registers r1 and r2), and output z; waveform of clk, a, r1, and z over cycles 0 to 13 showing parcel α.]

Question: How soon can we start to execute β?


Pipelined dataflow diagram

• Each stage is treated as a separate dataflow diagram.

• Double line denotes boundary between stages.

[Figure: pipelined dataflow diagram with inputs a to f, adders add1 to add5 (one per stage, stages 1 to 5), registers r1 to r10, and output z; waveform of clk, a, the stage registers r1 (stage 1), r3 (stage 2), r5 (stage 3), r7 (stage 4), r9 (stage 5), and z over cycles 0 to 13 showing parcel α.]

Question: How soon can we start to execute β?


Sequential (Unpipelined) Hardware

[Figure: sequential hardware with inputs i1 and i2, registers r1 and r2, a single adder add1, output o1, and a one-hot state machine State(0) to State(4) with reset.]


Pipelined Hardware

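[Figure: pipelined hardware with five stages. Stage 1 registers r1, r2 (inputs i1, i2); stage 2 adder add1 and registers r3, r4 (input i3); stage 3 adder add2 and registers r5, r6 (input i4); stage 4 adder add3 and registers r7, r8 (input i5); stage 5 adder add4 and registers r9, r10 (input i6); adder add5 drives output o1.]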


Pipelined VHDL Code

-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1; r2 <= i2;
end process;
-- stage 2
process begin
  wait until rising_edge(clk);
  r3 <= r1 + r2; r4 <= i3;
end process;
-- stages 3, 4, and 5 follow the same pattern,
-- ending with registers r9 and r10
-- output
o1 <= r9 + r10;
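
For reference, the five stages can be written out in full. The following is a minimal sketch, assuming the register-to-stage allocation shown in the pipelined hardware diagram; the entity name pipe_sum6, the unsigned port types, and the 8-bit width are assumptions made for illustration and do not come from the slides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: name, port types, and widths are assumptions.
entity pipe_sum6 is
  port (
    clk                    : in  std_logic;
    i1, i2, i3, i4, i5, i6 : in  unsigned(7 downto 0);
    o1                     : out unsigned(7 downto 0)
  );
end pipe_sum6;

architecture main of pipe_sum6 is
  signal r1, r2, r3, r4, r5, r6, r7, r8, r9, r10 : unsigned(7 downto 0);
begin
  -- stage 1: capture the first two inputs
  process begin
    wait until rising_edge(clk);
    r1 <= i1; r2 <= i2;
  end process;
  -- stage 2: first addition, capture i3
  process begin
    wait until rising_edge(clk);
    r3 <= r1 + r2; r4 <= i3;
  end process;
  -- stage 3: second addition, capture i4
  process begin
    wait until rising_edge(clk);
    r5 <= r3 + r4; r6 <= i4;
  end process;
  -- stage 4: third addition, capture i5
  process begin
    wait until rising_edge(clk);
    r7 <= r5 + r6; r8 <= i5;
  end process;
  -- stage 5: fourth addition, capture i6
  process begin
    wait until rising_edge(clk);
    r9 <= r7 + r8; r10 <= i6;
  end process;
  -- output: fifth addition is combinational (sum wraps at 8 bits)
  o1 <= r9 + r10;
end main;
```

In this sketch the inputs of one parcel arrive staggered: i3 must be presented one clock cycle after i1 and i2, i4 one cycle after that, and so on, so that each input meets its parcel in the correct stage.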


2.9.2 Partially Pipelined

• Fully pipelined: throughput is one parcel per clock cycle.

• Partially pipelined: throughput is less than one parcel per clock cycle.

• Superscalar: throughput is more than one parcel per clock cycle.

[Figure: partially pipelined dataflow diagram with inputs a to f and output z over clock cycles 0 to 5. Stage 1 uses add1 with registers r1, r2; stage 2 uses add2 with registers r3, r4; stage 3 uses add3 with registers r5, r6. Waveform of clk, a, r1 (stage 1), r3 (stage 2), r5 (stage 3), and z over cycles 0 to 13.]

Question: How do we execute α followed by β?

Hardware for Partially Pipelined

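[Figure: hardware for the partially pipelined design. Three stages, each with one adder (add1, add2, add3) and two registers (r1/r2, r3/r4, r5/r6); inputs i1 and i2; output o1; a two-state one-hot state machine State(0), State(1) with reset.]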


2.9.3 Terminology

Definition Depth: The depth of a pipeline is the number of stages on the longest path through the pipeline.

Definition Latency: The latency of a pipeline is measured the same as for an unpipelined circuit: the number of clock cycles from inputs to outputs.

Definition Throughput: The number of parcels consumed or produced per clock cycle.

Definition Upstream/downstream: Because parcels flow through the pipeline analogously to water in a stream, the terms upstream and downstream are used respectively to refer to earlier and later stages in the pipeline. For example, stage1 is upstream from stage2.

Definition Bubble: When a pipe stage is empty (contains invalid data), it is said to contain a “bubble”.

Question: How do we know whether the output of the pipeline is a bubble or is valid data?


2.10 Design Example: Pipelined Massey

Requirements

Functional requirements:

• Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f

• Registered inputs, combinational outputs

Performance requirements:

• Maximum clock period: unlimited

• Maximum latency: four

Cost requirements:

• Maximum of five adders

• Small miscellaneous hardware (e.g. muxes) is unlimited

• Maximum of six inputs and one output

• Design effort is unlimited


Initial Dataflow Diagrams

[Figure: two dataflow diagrams, "Original dataflow" and "Final unpipelined dataflow", each with inputs a to f, five additions, and output z.]


Dataflow Diagram Exploration

[Figure: two dataflow diagrams, "Variation on original dataflow" and "Pipelined dataflow diagram", each with inputs a to f, five additions, and output z; the pipelined diagram also carries i_valid through to o_valid.]

VHDL Code


-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1; r2 <= i2; r3 <= i3; r4 <= i4; v1 <= i_valid;
end process;
a1 <= r1 + r2; a2 <= r3 + r4;
-- stage 2
process begin
  wait until rising_edge(clk);
  r5 <= a1; r6 <= a2; r7 <= i5; r8 <= i6; v2 <= v1;
end process;
a3 <= r5 + r6; a4 <= r7 + r8;
-- stage 3
process begin
  wait until rising_edge(clk);
  r9 <= a3; r10 <= a4; v3 <= v2;
end process;
a5 <= r9 + r10;
-- outputs
z <= a5;
o_valid <= v3;


2.11 Memory Arrays and RTL Design

2.11.1 Memory Operations

Read of Memory with Registered Inputs

[Figure: hardware (memory M with ports WE, A, DI, DO; signals clk, we, a, do) and behaviour waveform of clk, a, we, and do: address αa is applied on a, and do then shows M(αa) = αd.]

Write to Memory with Registered Inputs

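[Figure: hardware (memory M with ports WE, A, DI, DO; signals clk, we, a, di, do) and behaviour waveform of clk, a, we, di, and do: address αa and data αd are applied with we asserted, after which M(αa) holds αd; do is a don't care.]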


Dual-Port Memory with Registered Inputs

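[Figure: dual-port memory M with write/read port 0 (WE, A0, DI0, DO0; signals we, a0, di0, do0) and read port 1 (A1, DO1; signals a1, do1), and waveform: port 0 writes αd to address αa while port 1 reads address βa and do1 returns M(βa) = βd.]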

Sequence of Memory Operations

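[Figure: the same dual-port memory with a waveform for a sequence of operations on addresses αa and γa (port 0) and βa and θa (port 1), showing the values M(αa), M(βa), M(γa), M(θa) and the data αd, βd, γd1, γd2, θd on di0, do0, and do1.]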


2.11.2 Memory Arrays in VHDL
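
As an illustration of the read and write behaviour described in Section 2.11.1, the following is a minimal sketch of a single-port memory array with registered inputs. The entity name mem_sp, the 16-word depth, the 8-bit data width, and the port types are assumptions for this example; whether a given synthesis tool maps this style onto a memory block depends on the tool.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical single-port memory: 16 words of 8 bits (assumed sizes).
entity mem_sp is
  port (
    clk : in  std_logic;
    we  : in  std_logic;                     -- write enable
    a   : in  unsigned(3 downto 0);          -- address
    di  : in  std_logic_vector(7 downto 0);  -- data in
    do  : out std_logic_vector(7 downto 0)   -- data out
  );
end mem_sp;

architecture main of mem_sp is
  type mem_ty is array (0 to 15) of std_logic_vector(7 downto 0);
  signal M     : mem_ty;
  signal a_reg : unsigned(3 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    -- write: the addressed word is updated at the clock edge
    if we = '1' then
      M(to_integer(a)) <= di;
    end if;
    -- registered address used for reading
    a_reg <= a;
  end process;
  -- read: data for the registered address appears after the clock edge
  do <= M(to_integer(a_reg));
end main;
```

With this structure, an address applied in one clock cycle produces its data on do in the following cycle, which matches the read waveform for a memory with registered inputs.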


2.11.3 Data Dependencies

Definition of Three Types of Dependencies

Read after Write      Write after Write     Write after Read
(True dependency)     (Load dependency)     (Anti dependency)

M[i] := ...           M[i] := ...           ... := M[i]
... := M[i]           M[i] := ...           M[i] := ...

Instructions in a program can be reordered, so long as the data dependencies are preserved.

Purpose of Dependencies

W0:  R3 := ......
W1:  R3 := ......              (producer)
R1:  ... := ... R3 ...         (consumer)
W2:  R3 := ......

• WAW ordering prevents W0 from happening after W1

• RAW ordering prevents R1 from happening before W1

• WAR ordering prevents W2 from happening before R1

Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose in ensuring that producer-consumer relationships are preserved.


Ordering of Memory Operations

Data Dependencies

Initial memory contents: M[3] = 30, M[2] = 20, M[1] = 10, M[0] = 0

Initial Program

M[2] := 21
M[3] := 31
A    := M[2]
B    := M[0]
M[3] := 32
M[0] := 01
C    := M[3]

Data Dependencies (Cont’d)

[Figure: the initial program (left) beside a reordered version of the same seven statements (right), labelled "Valid Modification".]


Data Dependencies (Cont’d)

[Figure: the initial program (left) beside another reordered version of the same seven statements (right), labelled "Valid (or Bad?) Modification".]

2.11.4 Memory and Dataflow Diagrams

Legend for Dataflow Diagrams

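[Figure: legend symbols for Input port, Output port, State signal, Array read (name(rd)), and Array write (name(wr)).]

Basic Memory Operations

Memory Read:   data := mem[addr];
Memory Write:  mem[addr] := data;

[Figure: dataflow symbols for mem(rd), with inputs mem and addr and outputs data and mem (anti-dependency), and for mem(wr), with inputs mem, data, and addr and output mem.]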


Dataflow Diagrams and Data Dependencies

Read after Write Dependencies

Algo: mem[wr_addr] := data_in;
      data_out := mem[rd_addr];

[Figure: read after write. mem(wr) takes mem, data_in, and wr_addr; its mem output feeds mem(rd), which takes rd_addr and produces data_out and mem.]


Read after Write Optimization

Algo: mem[wr_addr] := data_in;
      data_out := mem[rd_addr];

[Figure: optimization when rd_addr ≠ wr_addr. mem(wr) (inputs data_in, wr_addr) and mem(rd) (input rd_addr, output data_out) drawn without the read-after-write ordering between them.]


Write after Write Dependencies

Algo: mem[wr1_addr] := data1;
      mem[wr2_addr] := data2;

[Figure: write after write. The first mem(wr) (inputs data1, wr1_addr) feeds its mem output to the second mem(wr) (inputs data2, wr2_addr).]


Write after Write Scheduling Option


[Figure: left, the write-after-write diagram with the two mem(wr) operations in sequence; right, the scheduling option when wr1_addr ≠ wr2_addr, with the two mem(wr) operations (data1/wr1_addr and data2/wr2_addr) no longer ordered with respect to each other.]


Write after Read Dependencies

Algo: rd_data := mem[rd_addr];
      mem[wr_addr] := wr_data;

[Figure: write after read. mem(rd) (input rd_addr, output rd_data) feeds its mem output to mem(wr) (inputs wr_addr, wr_data).]


Write after Read Optimization

Algo: rd_data := mem[rd_addr];
      mem[wr_addr] := wr_data;

[Figure: optimization when rd_addr ≠ wr_addr. mem(rd) (input rd_addr, output rd_data) and mem(wr) (inputs wr_addr, wr_data) drawn without the write-after-read ordering between them.]


2.11.5 Ex: Mem Array and Dataflow Diagram

1. M[2] := 21
2. M[3] := 31
3. A    := M[2]
4. B    := M[0]
5. M[3] := 32
6. M[0] := 01
7. C    := M[3]

[Figure: dataflow diagram of the program as a chain of memory operations numbered 1 to 7: M(wr) 21 to address 2, M(wr) 31 to address 3, M(rd) of address 2 giving A, M(rd) of address 0 giving B, M(wr) 32 to address 3, M(wr) 01 to address 0, M(rd) of address 3 giving C.]

Dependencies for Known Addresses

[Figure: the same seven memory operations, showing the dependencies that remain once the addresses are known.]


Anti-Dependencies for Known Addresses

[Figure: the same seven memory operations, showing the anti-dependencies that remain once the addresses are known.]


Minimal Dependencies

[Figure: memory array with minimal dependencies. The seven M(wr) and M(rd) operations connected only by the dependencies that must be preserved.]


Memory Array with Orderings

[Figure: memory array with orderings. The same operations annotated with ordering numbers 1 to 4.]


Place Operations in Clock Cycles

[Figure: the same operations placed into clock cycles 1 to 4.]


Final Dataflow Diagram

[Figure: final version of the dataflow diagram (DFD), with the seven memory operations scheduled into clock cycles 1 to 4.]

2.12 Input / Output Protocols



2.13 Example: Moving Average

In this section we will design a circuit that performs a moving average as it receives a stream of data. When each new data item is received, the output is the average of the four most recently received data.

Time     0   1   2   3   4   5   6   7   8   9   10
i_data   2   3   5   6   6   0   2   2   5   3   1
o_avg                4   5   4   3

2.13.1 Requirements and Environmental Assumptions

1. Input data is sent sporadically, with at least 2 clock cycles of bubbles (invalid data) between valid data.

2. When the input data is valid, the signal i_valid is asserted for exactly one clock cycle.

3. Input data will be 8-bit signed numbers.

4. When output data is ready, o_valid shall be asserted.

5. The output data (o_avg) shall be the average of the four most recently received input data. Output numbers shall be truncated to integer values.


2.13.2 Algorithm

Generic equation with input data $x_i$:

\[ avg_i = (x_{i-3} + x_{i-2} + x_{i-1} + x_i)/4 \]

Decompose into sum and avg:

\[ sum_i = x_{i-3} + x_{i-2} + x_{i-1} + x_i \]
\[ avg_i = sum_i/4 \]

Look for patterns and potential optimizations:

\[ sum_5 = x_2 + (x_3 + x_4 + x_5) \]
\[ sum_6 = (x_3 + x_4 + x_5) + x_6 = sum_5 - x_2 + x_6 \]

Generalized recurrence equation:

\[ sum_i = sum_{i-1} - x_{i-4} + x_i \]
\[ avg_i = sum_i/4 \]

Summary of Behaviour

1. Define a signal new for the value of i_data each time that i_valid is '1'.

2. Define a memory array M to store a sliding window of the four most recent values of i_data.

3. Define a signal old for the oldest data value from the sliding window.

4. Update $sum_i$ with $sum_{i-1} - old_i + new_i$
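
As a quick check of the recurrence, it can be applied by hand to the example stream from the start of this section. This sketch assumes the samples are indexed $x_0, \ldots, x_{10}$ in the order of the Time row (2, 3, 5, 6, 6, 0, 2, ...) and that division truncates, as the requirements state.

```latex
\begin{align*}
sum_3 &= x_0 + x_1 + x_2 + x_3 = 2 + 3 + 5 + 6 = 16, & avg_3 &= 16/4 = 4 \\
sum_4 &= sum_3 - x_0 + x_4 = 16 - 2 + 6 = 20,        & avg_4 &= 20/4 = 5 \\
sum_5 &= sum_4 - x_1 + x_5 = 20 - 3 + 0 = 17,        & avg_5 &= \lfloor 17/4 \rfloor = 4 \\
sum_6 &= sum_5 - x_2 + x_6 = 17 - 5 + 2 = 14,        & avg_6 &= \lfloor 14/4 \rfloor = 3
\end{align*}
```

The results 4, 5, 4, 3 agree with the o_avg values in the example, using one subtraction and one addition per new sample instead of three additions.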


Sliding Window

Two design patterns to choose from: shift register vs circular buffer

[Figure: contents of M over time for the two patterns as data α, β, γ, δ, ε, ζ, η, ι, κ, λ arrive. Shift register: every entry of M[3] to M[0] moves on each new datum, with old leaving M[3] and new entering M[0]. Circular buffer: only one entry of M[0..3] is overwritten on each new datum.]

For FIFO behaviour, a circular buffer is usually preferred: it is smaller and uses less power.

Sliding Window with Registers

[Figure: register array with chip-enables and decoded multiplexer. Four 8-bit registers M[0] to M[3] with chip-enables ce[0] to ce[3], inputs d, we, and addr, and output q selected by idx[0] to idx[3].]


2.13.3 Pseudocode and Dataflow Diagrams

First Pseudocode

Real 3-address pseudocode

new = i_data

old = M[idx]

tmp = sum - old

sum = tmp + new

M[idx] = new

idx = idx rol 1

o_avg = sum/4

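[Figure: dataflow diagram for the first pseudocode, with inputs sum, i_data, M, and idx; memory read (Rd) and write (Wr); intermediate values new, old, and tmp; rotate by 1 on idx; and outputs sum, M, idx, and o_avg (wired shift).]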

Remove intermediate signal old

new = i_data
tmp = sum - M[idx]
sum = tmp + new
M[idx] = new
idx = idx rol 1
o_avg = sum/4

Reading new from memory

tmp = sum - M[idx]
M[idx] = i_data
new = M[idx]
sum = tmp + new
idx = idx rol 1
o_avg = sum/4

Remove intermediate signal new

tmp = sum - M[idx]
M[idx] = i_data
sum = tmp + M[idx]
idx = idx rol 1
o_avg = sum/4

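[Figure: data-dependency graph after removing new, with inputs i_data, sum, M, and idx; two memory reads (Rd) and one write (Wr); intermediate value tmp; rotate by 1 on idx; and outputs sum, M, idx, and o_avg (wired shift).]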


Dataflow Diagram

Latency of three clock cycles

[Figure: dataflow diagram scheduled into states S0, S1, and S2.]

Latency of two clock cycles

[Figure: dataflow diagram scheduled into states S0 and S1.]

Two clock cycles is potentially preferable for performance, but requires an additional multiplexer.

Latency of two clock cycles with registered address

[Figure: dataflow diagram scheduled into states S0 and S1, with idx registered before it is used as the memory address.]

Removes need for multiplexer on address input to circular buffer


Register and Datapath Allocation

[Figure: the two-cycle dataflow diagram with registered address, annotated with the allocation: registers sum and idx, memory M, adder/subtracter as1 (used in both S0 and S1), and rotator rol.]

2.13.4 Control Tables and State Machine

[Figure: the allocated dataflow diagram from the previous slide, repeated for reference.]

Register control table

        M                  idx          sum
        we   addr   d      ce   d       ce   d
S0      1    idx    x      0    –       1    as1
S1      0    idx    –      1    rol     1    as1

Datapath control table

        as1                    rol
        sub   src1   src2      src1   src2
S0      0     M      sum       –      –
S1      1     sum    M         idx    1


Optimized control table

        M     idx   as1
        we    ce    sub
S0      1     1     0
S1      0     0     1

Static assignments in control table

M.addr = idx
M.d = x
idx.d = rol
sum.d = as1
as1.src1 = sum
as1.src2 = M

Control Table and Bubbles

Almost final control table

        M     idx   sum   as1
        we    ce    ce    sub
S0      1     0     1     0
S1      0     1     1     1
idle    0     0     0     –

Final control table

        M     idx   sum   as1
        we    ce    ce    sub
S0      1     0     1     0
S1      0     1     1     1
idle    0     0     0     0

Static assignments

M.addr = idx
M.d = x
idx.d = rol
sum.d = as1
as1.src1 = sum
as1.src2 = M


State Machine

State encoding

        i_valid   valid1
S0      1         0
S1      0         1
idle    0         0

Final control table with state encoding

        state                M     idx   sum   as1
        i_valid   valid1     we    ce    ce    sub
S0      1         0          1     0     1     0
S1      0         1          0     1     1     1
idle    0         0          0     0     0     0

M.we = i_valid
idx.ce = valid1
sum.ce = i_valid OR valid1
as1.sub = valid1

2.13.5 VHDL Code

-- valid bits
process begin
  wait until rising_edge(clk);
  valid1  <= i_valid;
  o_valid <= valid1;
end process;
-- idx
process begin
  wait until rising_edge(clk);
  if reset = '1' then
    idx <= "0001";
  else
    if valid1 = '1' then
      idx <= idx rol 1;
    end if;
  end if;
end process;
-- sliding window
process begin
  wait until rising_edge(clk);
  for i in 3 downto 0 loop
    if (i_valid = '1') and (idx(i) = '1') then
      M(i) <= i_data;
    end if;
  end loop;
end process;
mem_out <= M(0) when idx(0) = '1'
      else M(1) when idx(1) = '1'
      else M(2) when idx(2) = '1'
      else M(3);
-- add sub
add_sub <= sum - mem_out when valid1 = '1'
      else sum + mem_out;
-- sum
process begin
  wait until rising_edge(clk);
  if i_valid = '1' or valid1 = '1' then
    sum <= add_sub;
  end if;
end process;


Hardware

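[Figure: hardware for the moving average. Inputs i_data and i_valid; register valid1; memory M with chip-enable and address from idx; add/sub unit; sum register with chip-enable; idx register with chip-enable and wired rotate; outputs o_avg (wired shift of sum) and o_valid.]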

Chapter 3

Performance Analysis and Optimization

3.1 Introduction

Hennessy and Patterson's Computer Architecture: A Quantitative Approach (textbook for E&CE 429) has good information on performance. We will use some of the same definitions and formulas as Hennessy and Patterson, but we will move away from generic definitions of performance for computer systems and focus on performance for digital circuits.

3.2 Defining Performance

\[ \text{Performance} = \frac{\text{Work}}{\text{Time}} \]

You can double your performance by:

doing twice the work in the same amount of time

OR doing the same amount of work in half the time


Benchmarking

\[ \text{Performance} = \frac{\text{Work}}{\text{Time}} \]

Measuring time is easy, but how do we accurately measure work?

The game of benchmarketing is finding a definition of work that makes your systemappear to get the most work done in the least amount of time.

Measure of Work      Measure of Performance
clock cycle          MHz
instruction          MIPs
synthetic program    Whetstone, Dhrystone, D-MIPs (Dhrystone MIPs)
real program         SPEC
travel 1/4 mile      drag race

SPEC Benchmarks

The SPEC benchmarks are among the most respected and accurate predictions of real-world performance.

Definition SPEC: Standard Performance Evaluation Corporation. MISSION: “To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems” (http://www.spec.org).

The SPEC organization has different benchmarks for integer software, floating-point software, web-serving software, etc.


3.3 Comparing Performance

3.3.1 General Equations

Equation for “Big is n% greater than Small”:

\[ n\% = \frac{\text{Big} - \text{Small}}{\text{Small}} \]

Using the “n% greater” formula, the phrase “The performance of A is n% greater than the performance of B” is:

\[ n\% = \frac{\text{Performance}_A - \text{Performance}_B}{\text{Performance}_B} \]

Performance is inversely proportional to time:

\[ \text{Performance} = \frac{1}{\text{Time}} \]

Substituting the above equation into the equation for “the performance of A is n% greater than the performance of B” gives:

\[ n\% = \frac{\text{Time}_B - \text{Time}_A}{\text{Time}_A} \]

In general, the equation for a fast system to be “n%” faster than a slow system is:

\[ n\% = \frac{T_{Slow} - T_{Fast}}{T_{Fast}} \]
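
The substitution step can be written out explicitly. The derivation below uses only the definitions above; the numbers in the final line (15 s and 10 s) are made up for illustration.

```latex
\begin{align*}
n\% &= \frac{\text{Performance}_A - \text{Performance}_B}{\text{Performance}_B}
     = \frac{1/\text{Time}_A - 1/\text{Time}_B}{1/\text{Time}_B}
     = \frac{\text{Time}_B}{\text{Time}_A} - 1
     = \frac{\text{Time}_B - \text{Time}_A}{\text{Time}_A} \\[4pt]
\text{Example: } T_{Slow} &= 15\,\text{s},\; T_{Fast} = 10\,\text{s}
  \;\Rightarrow\; n\% = \frac{15 - 10}{10} = 50\%
\end{align*}
```

Note that the denominator is the time of the faster system: the fast system here is 50% faster, even though its run time is only about 33% shorter.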

Another useful formula is the average time to do one of k different tasks, each of which happens $\%_i$ of the time and takes an amount of time $T_i$ each time it is done:

\[ T_{Avg} = \sum_{i=1}^{k} (\%_i)(T_i) \]

We can measure the performance of practically anything (cars, computers, vacuum cleaners, printers, ...).


3.3.2 Example: Performance of Printers


3.4 Clock Speed, CPI, Program Length, and Performance

3.4.1 Mathematics

CPI           Cycles per instruction
NumInsts      Number of instructions
ClockSpeed    Clock speed
ClockPeriod   Clock period

\[ \text{Time} = \text{NumInsts} \times \text{CPI} \times \text{ClockPeriod} \]

\[ \text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}} \]


3.4.2 Example: CISC vs RISC and CPI

                   Clock Speed   SPECint
AMD Athlon         1.1 GHz       409
Fujitsu SPARC64    675 MHz       443

The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu SPARC64 is a RISC microprocessor (it uses Sun's Sparc instruction set). Assume that it requires 20% more instructions to write a program in the Sparc instruction set than the same program requires in IA-32.

SPECint and Performance

Question: Which of the two processors has higher performance?


Relative CPI

Question: What is the ratio between the CPIs of the two microprocessors?
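
One way to approach this question is sketched below. It assumes that the SPECint score is proportional to performance on the same workload, so that run time is proportional to 1/SPECint, and it uses the 20% instruction-count assumption stated above. Subscript A denotes the Athlon and S the SPARC64.

```latex
\begin{align*}
\text{Time} &= \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}
\quad\Rightarrow\quad
\text{CPI} = \frac{\text{Time} \times \text{ClockSpeed}}{\text{NumInsts}} \\[4pt]
\frac{\text{CPI}_A}{\text{CPI}_S}
  &= \frac{\text{Time}_A}{\text{Time}_S}
     \times \frac{\text{ClockSpeed}_A}{\text{ClockSpeed}_S}
     \times \frac{\text{NumInsts}_S}{\text{NumInsts}_A}
   = \frac{443}{409} \times \frac{1100}{675} \times 1.2
   \approx 2.12
\end{align*}
```

Under these assumptions the Athlon needs roughly twice as many clock cycles per instruction as the SPARC64. Only the ratio is obtained; the absolute run time and instruction count cancel out.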

Absolute CPI

Question: Can you determine the absolute (actual) CPI of either microprocessor?


3.4.3 Effect of Instruction Set on Performance

Your group designs a microprocessor and you are considering adding a fused multiply-accumulate to the instruction set. (A fused multiply-accumulate is a single instruction that does both a multiply and an addition. It is often used in digital signal processing.)

Your studies have shown that, on average, half of the multiply operations are followed by an add instruction that could be done with a fused multiply-add.

Additionally, you know:

         CPI            %
ADD      0.8 CPIavg     15%
MUL      1.2 CPIavg     5%
Other    1.0 CPIavg     80%

Options

You have three options:

option 1 : no change

option 2 : add the MAC instruction, increase the clock period by 20%, and MAC has the same CPI as MUL.

option 3 : add the MAC instruction, keep the clock period the same, and the CPI of a MAC is 50% greater than that of a multiply.

Question: Which option will result in the highest overall performance?


3.4.4 Effect of Time to Market on Relative Performance

Assume that performance of the average product in your market segment doubles every 18 months.

You are considering an optimization that will improve the performance of your product by 7%.

Question: If you add the optimization, how much can you allow your schedule to slip before the delay hurts your relative performance compared to not doing the optimization and launching the product according to your current schedule?
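
A possible way to set up this calculation is shown below. It assumes that the market's performance grows smoothly and exponentially (doubling every 18 months), and that the break-even point is where the market's growth during the slip equals the 7% gain; $t$ is the slip in months.

```latex
\begin{align*}
2^{\,t/18} &= 1.07 \\
t &= 18 \times \log_2(1.07) = 18 \times \frac{\ln 1.07}{\ln 2}
   \approx 18 \times 0.0976 \approx 1.76 \text{ months}
\end{align*}
```

Under these assumptions, a slip of more than about 1.76 months (roughly 53 days) would cost more relative performance than the optimization gains.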

3.4.5 Summary of Equations

3.5 Performance Analysis and Dataflow Diagrams

3.5.1 Dataflow Diagrams, CPI, and Clock Speed

• One of the challenges in designing a circuit is to choose the clock speed.

• Choosing a clock period affects many aspects of the design, not just the overall performance.

• Some goals will push you toward a short clock period.

• Some goals will push you toward a long clock period.


Goal                                             Action    Effect

Minimize area

Increase scheduling flexibility

Decrease percentage of clock cycle spent in
flops (overhead: time in flops is not doing
useful work)

Decrease time to execute an instruction

Outline to Choose Clock Period

Outline of plan to find optimal latency and clock period for maximum performance:

1. Start with smallest possible clock period.

2. Allocate operations to clock cycles

3. Calculate average time to execute an instruction.

4. If latency > 1, then: increase the clock period until the latency is reduced; return to Step 2. Else (latency = 1): choose the clock period and dataflow diagram that resulted in the highest performance.

5. Optimize dataflow diagram to reduce area.


3.5.2 Examples of Dataflow Diagrams for Two Instructions

• Circuit supports two instructions, A and B.

• Each instruction occurs 50% of the time.

• The delay through a register is 5 ns.

• Find the clock period and dataflow diagram that maximize overall performance.

Instruction A (in order): f (30 ns), g (50 ns), h (20 ns), g (50 ns)

Instruction B (in order): i (40 ns), g (50 ns)

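3.5.2.1 Scheduling of Operations for Different Clock Periods

Scheduling (1)

55 ns Clock Period

[Figure: Instr A (f, g, h, g) and Instr B (i, g) scheduled with one operation per 55 ns clock cycle; A takes four cycles and B takes two. Unused time of 25 ns and 15 ns is marked in the cycles containing f and i.]

Scheduling (2), Scheduling (3)

[Figures: further scheduling worksheets for other clock periods.]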

3.5.2.2 Performance Computation for Different Clock Periods

Question: Which clock speed will result in the highest overall performance?

Clock Period    CPI_A    CPI_B    T_avg
55 ns
75 ns
85 ns
95 ns
155 ns
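
One way to fill in this table is sketched below. It assumes that the operations of each instruction form a dependency chain in the listed order, that an operation cannot be split across clock cycles, and that the 5 ns register delay is part of every clock period (so a 55 ns clock leaves 50 ns for logic). With each instruction occurring 50% of the time, $T_{avg} = 0.5 \cdot CPI_A \cdot T_{clk} + 0.5 \cdot CPI_B \cdot T_{clk}$.

```latex
\[
\begin{array}{r|c|c|l}
T_{clk} & CPI_A & CPI_B & T_{avg} \\ \hline
55\,\text{ns}  & 4\;(f \mid g \mid h \mid g) & 2\;(i \mid g) & 0.5(4)(55) + 0.5(2)(55) = 165\,\text{ns} \\
75\,\text{ns}  & 3\;(f \mid g,h \mid g)      & 2\;(i \mid g) & 0.5(3)(75) + 0.5(2)(75) = 187.5\,\text{ns} \\
85\,\text{ns}  & 2\;(f,g \mid h,g)           & 2\;(i \mid g) & 0.5(2)(85) + 0.5(2)(85) = 170\,\text{ns} \\
95\,\text{ns}  & 2\;(f,g \mid h,g)           & 1\;(i,g)      & 0.5(2)(95) + 0.5(1)(95) = 142.5\,\text{ns} \\
155\,\text{ns} & 1\;(f,g,h,g)                & 1\;(i,g)      & 0.5(1)(155) + 0.5(1)(155) = 155\,\text{ns}
\end{array}
\]
```

Under these assumptions the 95 ns clock period gives the lowest average time per instruction, even though it is neither the shortest clock period nor the lowest CPI.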


3.5.2.3 Example: Two Instructions Taking Similar Time

Question: For the flow below, which clock speed will result in the highest overall performance?

A        B
30 ns    40 ns
50 ns    50 ns
20 ns    40 ns
50 ns

Clock Period    CPI_A    CPI_B    T_avg
    ns
    ns
    ns
    ns
    ns
    ns


3.5.2.4 Example: Same Total Time, Different Order for A

Question: For the flow below, which clock speed will result in the highest overall performance?

A        B
30 ns    40 ns
20 ns    50 ns
50 ns    40 ns
50 ns

Clock Period    CPI_A    CPI_B    T_avg
    ns
    ns
    ns
    ns

3.5.3 Example: From Algorithm to Optimized Dataflow

This question involves doing some of the design work for a circuit that implements InstP and InstQ using the components described below.

Instruction    Algorithm                          Frequency of Occurrence
InstP          a × b × ((a × b) + (b × d) + e)    75%
InstQ          (i + j + k + l) × m                25%

Component       Delay
2-input Mult    40 ns
2-input Add     25 ns
Register        5 ns


NOTES

• There is a resource limitation of a maximum of 3 input ports. (There are no other resource limitations.)

• You must put registers on your inputs; you do not need to register your outputs.

• The environment will directly connect your outputs (its inputs) to registers.

• Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once. If you need to use a value in multiple clock cycles, you must store it in a register.

Questions

Question: What clock period will result in the best overall performance?

Question: Find a minimal set of resources that will achieve the performance you calculated.


3.6 General Optimizations

3.6.1 Strength Reduction

Strength reduction replaces one operation with another that is simpler.

3.6.1.1 Arithmetic Strength Reduction

Operation                                Replacement
Multiply by a constant power of two      wired shift logical left
Multiply by a power of two               shift logical left
Divide by a constant power of two        wired shift logical right
Divide by a power of two                 shift logical right
Multiply by 3                            wired shift and addition
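
The last row can be illustrated with a small sketch. The entity name times3 and the port widths are assumptions for this example; the output is 10 bits wide so that 3 × 255 = 765 fits without overflow.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: multiply an 8-bit unsigned value by 3
-- using a wired shift and one adder instead of a multiplier.
entity times3 is
  port (
    x : in  unsigned(7 downto 0);
    y : out unsigned(9 downto 0)
  );
end times3;

architecture main of times3 is
  signal x_ext : unsigned(9 downto 0);
begin
  -- widen x so that the result cannot overflow
  x_ext <= resize(x, 10);
  -- 3*x = 2*x + x; shifting by a constant amount is only wiring
  y <= shift_left(x_ext, 1) + x_ext;
end main;
```

Because the shift amount is a constant, the shift costs no gates; the only hardware is the adder.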

3.6.1.2 Boolean Strength Reduction

Boolean tests that can be implemented as wires:

• is odd, is even

• is neg, is pos

By choosing your encodings carefully, you can sometimes reduce a vector comparison to a wire.

For example, if your state uses a one-hot encoding, then the comparison state = S3 reduces to state(3) = '1'. You might expect a reasonable logic-synthesis tool to do this reduction automatically, but most tools do not do this reduction.

When using encodings other than one-hot, Karnaugh maps can be useful tools for optimizing vector comparisons. By carefully choosing our state assignments, when we use a full binary encoding for 8 states, the comparison:

(state = S0 or state = S3 or state = S4) = '1'

can be reduced from looking at 3 bits, to looking at just 2 bits. If we have a condition that is true for four states, then we can find an encoding that looks at just 1 bit.
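
The tests that reduce to wires can be sketched as follows. The entity name bool_wires, the port names, and the widths are assumptions for illustration; x is taken to be a two's-complement number and state a 4-bit one-hot vector in which bit 3 represents S3.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: Boolean tests implemented as single wires.
entity bool_wires is
  port (
    x      : in  signed(7 downto 0);            -- two's-complement value
    state  : in  std_logic_vector(3 downto 0);  -- one-hot state (assumed)
    is_odd : out std_logic;
    is_neg : out std_logic;
    in_s3  : out std_logic
  );
end bool_wires;

architecture main of bool_wires is
begin
  is_odd <= x(0);      -- least significant bit is '1' for odd numbers
  is_neg <= x(7);      -- sign bit is '1' for negative numbers
  in_s3  <= state(3);  -- one-hot: state = S3 exactly when bit 3 is '1'
end main;
```

None of these outputs needs a comparator; each is a direct connection to one bit of the input.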


3.6.2 Replication and Sharing

3.6.2.1 Mux-Pushing

Pushing multiplexors into the fanin of a signal can reduce area.

Before

z <= a + b when (w = '1')
     else a + c;

After

tmp <= b when (w = '1')
       else c;
z <= a + tmp;

The first circuit will have two adders, while the second will have one adder. Some synthesis tools will perform this optimization automatically, particularly if all of the signals are combinational.

3.6.2.2 Common Subexpression Elimination

Introduce new signals to capture subexpressions that occur in multiple places in the code.

Before

y <= a + b + c when (w = '1')
     else d;
z <= a + c + d when (w = '1')
     else e;

After

tmp <= a + c;
y <= b + tmp when (w = '1')
     else d;
z <= d + tmp when (w = '1')
     else e;

Subexpression Elimination

Note: Clocked subexpressions. Care must be taken when doing common subexpression elimination in a clocked process. Putting the “temporary” signal in the clocked process will add a clock cycle to the latency of the computation, because the tmp signal will be a flip-flop. The tmp signal must be combinational to preserve the behaviour of the circuit.

3.6.2.3 Computation Replication

• To improve performance

  – If the same result is needed at two very distant locations and wire delays are significant, it might improve performance (increase clock speed) to replicate the hardware.

• To reduce area

  – If the same result is needed at two different times that are widely separated, it might be cheaper to reuse the hardware component to repeat the computation than to store the result in a register.

Note: Muxes are not free. Each time a component is reused, multiplexors are added to inputs and/or outputs. Too much sharing of a component can cost more area in additional multiplexors than would be spent in replicating the component.


3.6.3 Arithmetic

VHDL is left-associative. The expression a + b + c + d is interpreted as (((a + b) + c) + d). You can use parentheses to suggest parallelism.

Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller and faster design than computing all 16 bits of the result and trimming the result to 12 bits.
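
Both suggestions can be combined in one sketch. The entity name sum4_low12 and the port names are assumptions for this example. For addition, the low 12 bits of the result depend only on the low 12 bits of the operands, so trimming the inputs first gives the same answer as trimming the output.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: low 12 bits of a four-input sum.
entity sum4_low12 is
  port (
    a, b, c, d : in  unsigned(15 downto 0);
    y          : out unsigned(11 downto 0)
  );
end sum4_low12;

architecture main of sum4_low12 is
  signal a12, b12, c12, d12 : unsigned(11 downto 0);
begin
  -- trim the inputs before doing any arithmetic
  a12 <= a(11 downto 0);
  b12 <= b(11 downto 0);
  c12 <= c(11 downto 0);
  d12 <= d(11 downto 0);
  -- parentheses suggest a balanced tree: two additions in parallel,
  -- followed by one, instead of a chain of three
  y <= (a12 + b12) + (c12 + d12);
end main;
```

The balanced form has two adders on its longest path rather than three, and every adder is 12 bits wide rather than 16.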


3.7 Retiming

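[Figure: circuit in which state is compared to produce sel, sel selects between a and b, and the selected value is added to c and registered as z; intermediate signals x and y; the critical path is marked. Waveform of state (S0, S1, S2, S3, repeating), a, b, c, sel, x, y, and z, with a = α, b = β, c = γ, sel = 1, and z = α+γ.]

process begin
  wait until rising_edge(clk);
  if state = S1 then
    z <= a + c;
  else
    z <= b + c;
  end if;
end process;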


Retimed Circuit and Waveform

[Figure: retimed circuit with the same signals (state, a, b, c, sel, x, y, z), and the corresponding waveform over states S0 to S3.]

process (state) begin
  if state = S1 then
    sel <= '1';
  else
    sel <= '0';
  end if;
end process;
process begin
  wait until rising_edge(clk);
  if sel = '1' then
    ... -- code for z
  end if;
end process;

process begin
  wait until rising_edge(clk);
  if state = S0 then
    sel <= '1';
  else
    sel <= '0';
  end if;
end process;
process begin
  wait until rising_edge(clk);
  if sel = '1' then
    ... -- code for z
  end if;
end process;

Chapter 4

Functional Verification


4.1 Overview

4.1.1 Terminology: Validation / Verification / Testing

4.1.2 The Difficulty of Designing Correct Chips

4.1.2.1 Notes from Kenn Heinrich (UW E&CE grad)

“Everyone should get a lecture on why their first industrial design won't work in the field.”

Note: There are six reasons in your notes.

4.1.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)

More than 60% of the ASIC designs that are fabricated have at least one error, issue, or problem whose severity forced the design to be reworked.

Note: There is a pretty picture in your notes.


4.2 Test Cases and Coverage

4.2.1 Coverage

To be absolutely certain that an implementation is correct, we must check every combination of values. This includes both input values and internal state (flip-flops).

If we have $n_i$ bits of inputs and $n_s$ bits in flip-flops, we have to test $2^{n_i+n_s}$ different cases when doing functional verification.

Question: If we have $n_c$ combinational signals, why don't we have to test $2^{n_i+n_s+n_c}$ different cases?

4.2.2 Floating Point Divider Example

This example illustrates the difficulty of achieving significant coverage on realisticcircuits.

Consider doing the functional simulation for a double precision (64-bit) floating-pointdivider.

Given InformationData width 64 bitsNumber of gates in circuit 10 000Number of assembly-language instructions tosimulate one gate for one test case

100

Number of clock cycles required to execute oneassembly language instruction on the computerthat is running the simulation

0.5

Clock speed of computer that is running the sim-ulation

1 Gigahertz


Number of Cases

Question: How many cases must be considered?

width = 64 b, gates = 10 000, instrs/gate = 100, cycles/instr = 0.5, cycles/sec = $10^9$


Simulation Run Time

Question: How long will it take to simulate all of the different possible cases using a single computer?



Coverage

Question: If you can run simulations non-stop for one year on ten computers, what coverage will you achieve?
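
One way to work through the three questions, as a sketch under stated assumptions: the divider is treated as purely combinational with two 64-bit operands (so $n_i = 128$ and $n_s = 0$), and one year is taken as roughly $3.15 \times 10^{7}$ seconds. Neither assumption appears in the given information, so the numbers below are illustrative only.

```latex
\begin{align*}
\text{cases} &= 2^{64+64} = 2^{128} \approx 3.4 \times 10^{38} \\
\text{cycles per case} &= 10\,000 \times 100 \times 0.5 = 5 \times 10^{5} \\
\text{time per case} &= \frac{5 \times 10^{5}\ \text{cycles}}{10^{9}\ \text{cycles/s}} = 5 \times 10^{-4}\ \text{s} \\
\text{total time} &\approx 3.4 \times 10^{38} \times 5 \times 10^{-4}\ \text{s}
                   \approx 1.7 \times 10^{35}\ \text{s}
                   \approx 5.4 \times 10^{27}\ \text{years} \\
\text{cases in one year on ten computers} &\approx \frac{10 \times 3.15 \times 10^{7}\ \text{s}}{5 \times 10^{-4}\ \text{s}}
                   \approx 6.3 \times 10^{11} \\
\text{coverage} &\approx \frac{6.3 \times 10^{11}}{3.4 \times 10^{38}} \approx 1.9 \times 10^{-27}
\end{align*}
```

Under these assumptions, even a year of continuous simulation on ten machines covers a vanishingly small fraction of the input space.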



Simulation vs the Real World

From Validating the Intel(R) Pentium(R) Microprocessor by Bob Bentley, Design Automation Conference 2001. (Link on E&CE 327 web page.)

• Simulating the Pentium 4 Processor on a Pentium 3 Processor ran at about 15 MHz.

• By tapeout, over 200 billion simulation cycles had been run on a network of computers.

• All of these simulations represent less than two minutes of running a real processor.


4.3 Testbenches

4.3.1 Overview of Test Benches

[Figure: testbench containing stimulus, implementation, specification, and check blocks]

Implementation: Circuit that you’re checking for bugs. Also known as “design under test” or “unit under test”.

Stimulus: Generates test vectors.

Specification: Describes desired behaviour of implementation.

Check: Checks whether implementation obeys specification.

4.3.2 Reference Model Style Testbench

[Figure: reference model testbench containing stimulus, implementation, and specification blocks]

4.3.3 Relational Style Testbench

[Figure: relational testbench containing stimulus, implementation, and check blocks]


4.3.4 Coding Structure of a Testbench

[Figure: testbench containing stimulus, implementation, specification, and check blocks]

architecture main of athabasca_tb is
  component declaration for implementation;
  other declarations
begin
  implementation instantiation;
  stimulus process;
  specification process (or component instantiation);
  check process;
end main;

4.3.5 Datapath vs Control

Datapath and control circuits tend to use different styles of testbenches.

[Figure: reference model testbench containing stimulus, implementation, and specification blocks]

[Figure: relational testbench containing stimulus, implementation, and check blocks]


4.3.6 Verification Tips

Suggested order of simulation for functional verification:

1. Write high-level model.

2. Simulate high-level model until it has the correct functionality and latency.

3. Write synthesizable model.

4. Use zero-delay simulation (uw-sim) to check behaviour of synthesizable model against high-level model.

5. Optimize the synthesizable model.

6. Use zero-delay simulation (uw-sim) to check behaviour of optimized model against high-level model.

7. Use timing simulation (uw-timsim) to check behaviour of optimized model against high-level model.

Section 4.4 describes a series of testbenches that are particularly useful for debugging datapath circuits in the early phases of the design cycle.

4.4 Functional Verification for Datapath Circuits

In this section we will incrementally develop a testbench for a very simple circuit: an AND gate.


Implementation

library ieee;
use ieee.std_logic_1164.all;

entity and2 is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end and2;

architecture main of and2 is
begin
  c <= '1' when (a = '1' AND b = '1')
       else '0';
end main;

4.4.1 A Spec-Less Testbench

First, use a waveform viewer to check that the implementation generates reasonable outputs for a small set of inputs.

entity and2_tb is
end and2_tb;

architecture main_tb of and2_tb is
  component and2 ... end component;
  signal ta, tb, tc_impl : std_logic;
  signal ok : boolean;
begin
  ---------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ---------------------------------------------
  stimulus : process
  begin
    ta <= '0'; tb <= '0';
    wait for 10 ns;
    ta <= '1'; tb <= '1';
    wait for 10 ns;
  end process;
  ---------------------------------------------
end main_tb;


4.4.2 Use an Array for Test Vectors

architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    type test_datum_ty is record
      ra, rb : std_logic;
    end record;
    type test_vectors_ty is
      array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      --  a    b
      ( ( '0', '0'),
        ( '1', '1')
      );
  begin
    for i in test_vectors'low to test_vectors'high loop
      ta <= test_vectors(i).ra;
      tb <= test_vectors(i).rb;
      wait for 10 ns;
    end loop;
  end process;
end main_tb;

4.4.3 Build Spec into Stimulus

stimulus : process
  type test_datum_ty is record
    ra, rb, rc : std_logic;
  end record;
  type test_vectors_ty is
    array(natural range <>) of test_datum_ty;
  constant test_vectors : test_vectors_ty :=
    -- a, b: inputs
    -- c   : expected output
    --  a    b    c
    ( ( '0', '0', '0'),
      ( '0', '1', '0'),
      ( '1', '1', '1')
    );
begin
  ...


Build Spec into Stimulus (Cont’d)

stimulus : process
  ...
begin
  for i in test_vectors'low to test_vectors'high loop
    ta      <= test_vectors(i).ra;
    tb      <= test_vectors(i).rb;
    tc_spec <= test_vectors(i).rc;
    wait for 10 ns;
  end loop;
end process;
------------------------------------------
check : process (tc_impl, tc_spec)
begin
  ok <= (tc_impl = tc_spec);
end process;
------------------------------------------
end main_tb;

4.4.4 Have Separate Specification Entity

entity and2_spec is
  ...(same as and2 entity)...
end and2_spec;

architecture spec of and2_spec is
begin
  c <= a AND b;
end spec;


Testbench for Separate Specification

architecture main_tb of and2_tb is
  component and2 ...;
  component and2_spec ...;
  signal ta, tb, tc_impl, tc_spec : std_logic;
  signal ok : boolean;
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  spec : and2_spec port map (a => ta, b => tb, c => tc_spec);
  ------------------------------------------
  stimulus process...
  check process...
end main_tb;

Testbench for Separate Spec (Cont’d)

stimulus : process
  ...
  constant test_vectors : test_vectors_ty :=
    --  a    b
    ( ( '0', '0'),
      ( '1', '1')
    );
begin
  for i in test_vectors'low to test_vectors'high loop
    ta <= test_vectors(i).ra;
    tb <= test_vectors(i).rb;
    wait for 10 ns;
  end loop;
end process;
------------------------------------------
check : process (tc_impl, tc_spec)
begin
  ok <= (tc_impl = tc_spec);
end process;
------------------------------------------
end main_tb;


4.4.5 Generate Test Vectors Automatically

architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    subtype std_test_ty is std_logic range '0' to '1';
  begin
    for va in std_test_ty'low to std_test_ty'high loop
      for vb in std_test_ty'low to std_test_ty'high loop
        ta <= va;
        tb <= vb;
        wait for 10 ns;
      end loop;
    end loop;
  end process;
  ...
end main_tb;
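
The pieces developed so far can be combined into one self-checking testbench. The following is a minimal sketch, not the course's reference solution: the testbench name and2_auto_tb is hypothetical, the and2 entity is repeated so that the file stands alone, and the expected value is computed inline with the `and` operator instead of coming from a separate specification entity.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Implementation under test (same behaviour as and2 in section 4.4).
entity and2 is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end and2;

architecture main of and2 is
begin
  c <= '1' when (a = '1' and b = '1') else '0';
end main;

library ieee;
use ieee.std_logic_1164.all;

entity and2_auto_tb is  -- hypothetical testbench name
end and2_auto_tb;

architecture main_tb of and2_auto_tb is
  signal ta, tb, tc_impl : std_logic;
begin
  impl : entity work.and2 port map (a => ta, b => tb, c => tc_impl);

  stimulus : process
    -- Restrict std_logic to the two values we want to enumerate.
    subtype std_test_ty is std_logic range '0' to '1';
  begin
    for va in std_test_ty loop
      for vb in std_test_ty loop
        ta <= va;
        tb <= vb;
        wait for 10 ns;
        -- Expected value is computed directly from the stimulus.
        assert tc_impl = (va and vb)
          report "and2 mismatch" severity error;
      end loop;
    end loop;
    wait;  -- stop after one pass over all four vectors
  end process;
end main_tb;
```

Using an assert in the stimulus loop trades the separate check process for brevity; for larger designs, keeping the check in its own process, as in the earlier testbenches, usually scales better.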

4.4.6 Relational Specification

Sometimes we want to check a relationship between the output and the input, rather than check that the output has a specific value.

To do this, we drop the spec process and put the brains into the check process.

architecture main_tb of and2_tb is
  ...
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ------------------------------------------
  stimulus : process
    ...
  end process;
  ------------------------------------------
  check : process (tc_impl, ta, tb)
  begin
    ok <= NOT (tc_impl = '1' AND (ta = '0' OR tb = '0'));
  end process;
  ------------------------------------------
end main_tb;


4.5 Functional Verification of Control Circuits

Control circuits are often more challenging to verify than datapath circuits.

In this section, we will explore the functional verification of state machines via a First-In First-Out queue.

4.5.1 Overview of Queues in Hardware

[Figure: structure of queue, with write and read ports]

[Figure: write sequence (empty queue; Write 1; Write 2)]

[Figure: a second example write (Write 1; Write 2)]

[Figure: example read sequence (Read 1; Read 2)]

[Figure: write illustrating index wrap (Write 1; Write 2)]

[Figure: write illustrating full queue (Write 1; Write 2)]

[Figure: queue signals: empty, mem, wr_idx, rd_idx, data_wr, data_rd, do_wr, do_rd]

[Figure: incomplete queue blocks: memory with ports WE, A0, DI0, DO0, A1, and DO1, connected to the queue signals]

Control circuitry not shown.


4.5.2 VHDL Coding

4.5.2.1 Package

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package queue_pkg is
  subtype data is std_logic_vector(3 downto 0);
  function to_data(i : integer) return data;
end queue_pkg;

package body queue_pkg is
  function to_data(i : integer) return data is
  begin
    return std_logic_vector(to_unsigned(i, 4));
  end to_data;
end queue_pkg;

4.5.2.2 Other VHDL Coding


4.5.3 Code Structure for Verification

Verification things to notice in queue implementation:

1. instrumentation code

2. coverage monitors

3. assertions


Code Structure for Verification

architecture ... is
  ...
begin
  ... normal implementation ...
  process (clk)
  begin
    if rising_edge(clk) then
      ... instrumentation code ...
      prev_signame <= signame;
    end if;
  end process;
  ... assertions ...
  ... coverage monitors ...
end;

4.5.4 Instrumentation Code

• Added to implementation to support verification

• Usually keeps track of previous values of signals

• Does not create hardware (optimized away during synthesis)

• Does not feed any output signals

• Must use synthesizable subset of VHDL

process (clk) begin
  if rising_edge(clk) then
    prev_rd_idx <= rd_idx;
    prev_wr_idx <= wr_idx;
    prev_do_rd  <= do_rd;
    prev_do_wr  <= do_wr;
  end if;
end process;


Coverage Events for Queue

Question: What events should we monitor to estimate the coverage of our functional tests?


Coverage Monitor Template

process ( signals read )
begin
  if ( condition ) then
    report "coverage: message";
  elsif ( condition ) then
    report "coverage: message";
  else
    report "error: case fall through on message"
      severity warning;
  end if;
end process;


Coverage Monitor Code

Events related to rd_idx equals wr_idx.

process (prev_rd_idx, prev_wr_idx, rd_idx, wr_idx)
begin
  if (rd_idx = wr_idx) then
    if ( prev_rd_idx = prev_wr_idx ) then
      report "coverage: read = write both moved";
    elsif ( rd_idx /= prev_rd_idx ) then
      report "coverage: Read caught write";
    elsif ( wr_idx /= prev_wr_idx ) then
      report "coverage: Write caught read";
    else
      report "error: case fall through on rd/wr catching"
        severity warning;
    end if;
  end if;
end process;


Coverage Monitor Code

Events related to rd_idx wrapping.

process (rd_idx)
begin
  if (rd_idx = low_idx) then
    report "coverage: rd mv to low";
  elsif (rd_idx = high_idx) then
    report "coverage: rd mv to high";
  else
    report "coverage: rd mv normal";
  end if;
end process;


4.5.5 Assertions

Assertions for Queue

1. If rd_idx changes, then it increments or wraps.

2. If rd_idx changes, then do_rd was '1', or reset is '1'.

3. If wr_idx changes, then it increments or wraps.

4. If wr_idx changes, then do_wr was '1', or reset is '1'.

5. And many others....

Assertion Template

process ( signals read ) begin
  assert ( required condition )
    report "error: message" severity warning;
end process;


Assertions: Read Index

process (rd_idx) begin
  assert ((rd_idx > prev_rd_idx) or (rd_idx = low_idx))
    report "error: rd inc" severity warning;
  assert ((prev_do_rd = '1') or (reset = '1'))
    report "error: rd imp do_rd" severity warning;
end process;

Assertions: Write Index

process (wr_idx) begin
  assert ((wr_idx > prev_wr_idx) or (wr_idx = low_idx))
    report "error: wr inc" severity warning;
  assert ((prev_do_wr = '1') or (reset = '1'))
    report "error: wr imp do_wr" severity warning;
end process;
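
To see how the implementation, instrumentation code, and assertions fit together in one architecture, here is a minimal sketch built around a wrapping read index. It is not the course's queue: the entity rd_idx_demo, the port rd_idx_out, the 3-bit index width, and the values chosen for low_idx and high_idx are all assumptions made for illustration. The `now > 0 ns` guard is also an addition; it skips the evaluation that every process performs at simulation start.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity rd_idx_demo is  -- hypothetical entity, not the course queue
  port (
    clk, reset, do_rd : in  std_logic;
    rd_idx_out        : out unsigned(2 downto 0)
  );
end rd_idx_demo;

architecture main of rd_idx_demo is
  constant low_idx  : unsigned(2 downto 0) := "000";  -- assumed bounds
  constant high_idx : unsigned(2 downto 0) := "111";
  signal rd_idx, prev_rd_idx : unsigned(2 downto 0) := low_idx;
  signal prev_do_rd : std_logic := '0';
begin
  -- normal implementation: increment on do_rd, wrap at high_idx
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        rd_idx <= low_idx;
      elsif do_rd = '1' then
        if rd_idx = high_idx then
          rd_idx <= low_idx;
        else
          rd_idx <= rd_idx + 1;
        end if;
      end if;
    end if;
  end process;
  rd_idx_out <= rd_idx;

  -- instrumentation code: remember the previous cycle's values
  process (clk) begin
    if rising_edge(clk) then
      prev_rd_idx <= rd_idx;
      prev_do_rd  <= do_rd;
    end if;
  end process;

  -- assertions: evaluated whenever rd_idx changes
  process (rd_idx) begin
    if now > 0 ns then  -- skip the initial evaluation at time zero
      assert (rd_idx > prev_rd_idx) or (rd_idx = low_idx)
        report "error: rd inc" severity warning;
      assert (prev_do_rd = '1') or (reset = '1')
        report "error: rd imp do_rd" severity warning;
    end if;
  end process;
end main;
```

Because rd_idx and prev_rd_idx are updated in the same delta cycle, the assertion process sees the new index alongside the value it held one cycle earlier.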


4.5.6 VHDL Coding Tips

Vector Type Declaration

type data_array_ty is array(natural range <>) of data;
signal data_array : data_array_ty(7 downto 0);

Functions

function to_idx
  (i : natural range data_array'low to data_array'high)
  return idx_ty
is
begin
  return to_unsigned(i, idx_ty'length);
end to_idx;

Conversion to Index

  Without function:  rd_idx <= to_unsigned(5, 3);
  With function:     rd_idx <= to_idx(5);

The function code is verbose, but is very maintainable, because neither the function itself nor uses of the function need to know the width of the index vector.


Attributes

function inc_idx (idx : idx_ty) return idx_ty is
begin
  if idx < data_array'high then
    return (idx + 1);
  else
    return (to_idx(data_array'low));
  end if;
end inc_idx;

Feedback Loops, and Functions

Coding guideline: use functions. Don’t use procedures.

  inc as function:   wr_idx <= inc_idx(wr_idx);
  inc as procedure:  inc_idx(wr_idx);

Functions clearly distinguish between reading from a signal and writing to a signal. By examining the use of a procedure, you cannot tell which signals are read from and which are written to. You must examine the declaration or implementation of the procedure to determine modes of signals.

Modifying a signal within a procedure results in a tri-state signal. This is bad.


File I/O (textio package)

TEXTIO defines the read, write, readline, and writeline procedures.

Described in:

• http://www.eng.auburn.edu/department/ee/mgc/vhdl.html#textio

These procedures can be used to read test vectors from a file and write results to a file.
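
As an illustration of file-driven stimulus, the sketch below reads two bits per line from a vector file and writes the observed output to a results file. The file names vectors.txt and results.txt, the line format (two bits separated by a space, for example "0 1"), the entity name file_stim_tb, and the inline AND that stands in for the circuit under test are all assumptions. Only subprograms from std.textio and std_logic_1164 are used; bits are read as type bit and converted, which avoids depending on a non-standard std_logic reader.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use std.textio.all;

entity file_stim_tb is  -- hypothetical testbench name
end file_stim_tb;

architecture main_tb of file_stim_tb is
  signal ta, tb, tc : std_logic := '0';
begin
  tc <= ta and tb;  -- stand-in for the circuit under test

  stimulus : process
    -- Hypothetical file names; each input line holds "a b".
    file vec_file : text open read_mode  is "vectors.txt";
    file res_file : text open write_mode is "results.txt";
    variable in_line, out_line : line;
    variable va, vb : bit;
  begin
    while not endfile(vec_file) loop
      readline(vec_file, in_line);   -- fetch one line of text
      read(in_line, va);             -- parse first bit
      read(in_line, vb);             -- parse second bit (leading space skipped)
      ta <= to_stdulogic(va);
      tb <= to_stdulogic(vb);
      wait for 10 ns;
      write(out_line, to_bit(tc));   -- append result to the output line
      writeline(res_file, out_line); -- flush the line to the file
    end loop;
    wait;  -- stop when the vector file is exhausted
  end process;
end main_tb;
```

Keeping vectors in a file lets the same testbench run different test suites without recompiling.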

4.5.7 Queue Specification

Most bugs in queues are related to the queue becoming full, becoming empty, and/or wrap of indices.

The specification should be “obviously correct”. Avoid bugs in the specification by making the specification queue larger than the maximum number of writes that we will do in the test suite. Thus, the specification queue will never become full or wrap. However, the implementation queue will become full and wrap.


Write Index Update in Specification

We increment the write index on every write; we never wrap.

process (clk) begin
  if rising_edge(clk) then
    if (reset = '1') then
      wr_idx <= 0;
    elsif (do_wr = '1') then
      wr_idx <= wr_idx + 1;
    end if;
  end if;
end process;

Things to Notice

Things to notice in queue specification:

1. don’t care conditions ('-')

2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)


Don’t Care

rd_data <= data_array(rd_idx) when (do_rd = '1')
           else (others => '-');

4.5.8 Queue Testbench

Things to notice in queue testbench:

1. running multiple test sequences

2. uninitialized data 'U'

3. std_match to compare spec and impl data

   '0' ∼ '0'
   '0' ∼ 'L'
   '1' ∼ '1'
   '1' ∼ 'H'
   '-' ∼ everything
   everything else ≁ everything

With equality, '-' ≠ '1', but we want to use '-' to mean “don’t care” in the specification. The solution is to use std_match, rather than =, to check implementation signals against the specification.
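
The difference between = and std_match can be seen in a few lines of simulation code. This is a small sketch with made-up values: the entity name std_match_demo and the constants spec_data and impl_data are hypothetical. std_match is taken from ieee.numeric_std.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;  -- declares std_match

entity std_match_demo is  -- hypothetical entity name
end std_match_demo;

architecture main of std_match_demo is
  -- Hypothetical values: the spec leaves two bit positions as don't-care.
  constant spec_data : std_logic_vector(3 downto 0) := "1-0-";
  constant impl_data : std_logic_vector(3 downto 0) := "1100";
begin
  process begin
    -- Plain equality treats '-' as an ordinary value, so the vectors differ.
    assert not (impl_data = spec_data)
      report "unexpected: = treated '-' as a wildcard" severity error;
    -- std_match accepts any implementation value in a '-' position.
    assert std_match(impl_data, spec_data)
      report "unexpected: std_match rejected a don't-care position" severity error;
    -- Uninitialized data does not match a specific value.
    assert not std_match(std_logic'('U'), std_logic'('1'))
      report "unexpected: 'U' matched '1'" severity error;
    wait;
  end process;
end main;
```

If the code behaves as described, none of the assertions should fire.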


Stimulus Process Structure

The stimulus process runs multiple test vectors in a single simulation run.

stimulus : process
  type test_datum_ty is
    record
      r_reset, ... normal fields ...
    end record;
  type test_vectors_ty is
    array(natural range <>) of test_datum_ty;
  constant test_vectors : test_vectors_ty :=
    ( -- reset ... other signal ...
      ( '1', normal fields), -- test case 1
      ( '0', normal fields),
      ...
      ( '1', normal fields), -- test case 2
      ( '0', normal fields),
      ...
    );
begin
  for i in test_vectors'range loop
    if (test_vectors(i).r_reset = '1') then
      ... reset code ...
    end if;
    reset <= '0';
    ... normal sequence ...
    wait until rising_edge(clk);
  end loop;
end process;

4.6 Example: Microwave Oven

This question concerns the VHDL code microwave, which controls a simple microwave oven; the properties prop1...prop3; and two proposed changes to the VHDL code.

INSTRUCTIONS:

1. Assume that the code as currently written is correct: any change to the code that causes a change to the behaviour of the signals heat or count is a bug.

2. For each of the two proposed code changes, answer whether the code change will cause a bug.

3. If the code change will cause a bug, provide a test case that will exercise the bug and identify all of the given properties (prop1, prop2, and prop3) that will detect the bug with the test case you provide.

4. If none of the three properties can detect the bug, provide a property of your own that will detect the bug with the test case you provide.


Question: For each of the three properties prop1...prop3, answer whether the property is best checked as part of a testbench or as an assertion. For each property, justify why a testbench or an assertion is the best method to validate that property.

prop1 If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.

prop2 If the door is open, then heat is off.

prop3 If start is not pushed, reset is false, and count is greater than zero, then count is decremented.


Implementation

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity microwave is
  port (
    timer                 -- time input from user
      : in unsigned(7 downto 0);
    reset,                -- resets microwave
    clk,                  -- clock signal input
    is_open,              -- detects when door is open
    start                 -- start button input from user
      : in std_logic;
    heat : out std_logic  -- 1=on, 0=off
  );
end microwave;

architecture main of microwave is
  signal count  : unsigned(7 downto 0); -- internal time count
  signal x_heat : std_logic;
begin

  -- heat process ------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        x_heat <= '0';
      elsif (is_open = '0') and (start = '1') and -- region of
            (timer > 0)                           -- change #1
      then                                        --
        x_heat <= '1';                            --
      elsif (is_open = '0') and (count > 0) then  --
        x_heat <= x_heat;                         --
      else
        x_heat <= '0';
      end if;
    end if;
  end process;

  -- count process ------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        count <= to_unsigned(0, 8);
      elsif (start = '1') then  -- region of
        count <= timer;         -- change #2
      elsif (count > 0) then    --
        count <= count - 1;     --
      end if;
    end if;
  end process;

  heat <= x_heat;
end main;


Properties

prop1 If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.

prop2 If the door is open, then heat is off.

prop3 If start is not pushed, reset is false, and count is greater than zero, then count is decremented.


Change #1

From:

  elsif (start = '1') then
    count <= timer;
  elsif (count > 0) then
    count <= count - 1;

To:

  elsif (count > 0) then
    count <= count - 1;
  elsif (start = '1') then
    count <= timer;


Change #2

From:

  elsif (is_open = '0') and (start = '1') and (timer > 0)
  then x_heat <= '1';
  elsif (is_open = '0') and (count > 0)
  then x_heat <= x_heat;

To:

  elsif (is_open = '0')
        and ((start = '1') or (count > 0))
  then x_heat <= '1';
  else x_heat <= '0';


Coverage

Question: If msb of src1 is '1' and lsb of src2 is '0' or sum(3) is '1', then result is wrong. What is the minimum coverage needed to detect the bug? What is the minimum coverage needed to guarantee that the bug will be detected?


Chapter 5

Timing Analysis

5.1 Delays and Definitions

In this section we will look at the different timing parameters of circuits. Our focus will be on those parameters that limit the maximum clock speed at which a circuit will work correctly.

5.1.1 Background Definitions


5.1.2 Clock-Related Timing Definitions

5.1.2.1 Clock Skew

[Figure: clock distribution to clk1, clk2, clk3, and clk4, with waveforms showing the skew between them]

Definition Clock Skew: The difference in arrival times for the same clock edge at different flip-flops.

Clock skew is caused by the difference in interconnect delays to different points on the chip.


Clock Tree Design

Clock tree design is critical in high-performance designs to minimize clock skew. Sophisticated synthesis tools put lots of effort into clock tree design, and the techniques for clock tree design still generate PhD theses.


5.1.2.2 Clock Latency

[Figure: master clock, intermediate clock, and final clock along the clock tree, with waveforms showing the latency between them]

Definition Clock Latency: The difference in arrival times for the same clock edge at different levels of interconnect along the clock tree. (Intuitively, “different points in the clock generation circuitry.”)

Note: Clock latency does not affect the limit on the minimum clock period.


5.1.2.3 Clock Jitter

[Figure: ideal clock and clock with jitter]

Definition Clock Jitter: Difference between actual clock period and ideal clock period.


Causes of Clock Jitter

Clock jitter is caused by:

• temperature and voltage variations over time

• temperature and voltage variations across different locations on a chip

• manufacturing variations between different parts


5.1.3 Storage-Related Timing Definitions

5.1.3.1 Flops and Latches

[Figure: flop behaviour, with waveforms for d, clk, and q]

[Figure: latch behaviour, with waveforms for d, clk, and q]

Storage devices have two modes: load mode and store mode.

Flops are edge sensitive; they are in load mode just before the clock edge.

Latches are level sensitive; they are in load mode while their enable signal is asserted high (low for active-low latches).

Timing Parameters

[Figure: setup, hold, and clock-to-Q intervals on the d, clk, and q waveforms of a flip-flop]

[Figure: setup, hold, and clock-to-Q intervals on the d, clk, and q waveforms of an active-high latch]

[Figure: setup, hold, and clock-to-Q intervals on the d, clk, and q waveforms of an active-low latch]

Setup and hold define the window in which input data are required to be constant in order to guarantee that the storage device will store data correctly.

Clock-to-Q defines the delay from the clock edge to when the output is guaranteed to be stable.


5.1.4 Propagation Delays

Propagation delay: the time it takes a signal to travel from the source (driving) flop to the destination flop

propagation delay = load delay + interconnect delay

Load delay: combinational gates between the flops

Interconnect delay: wires between gates and flops

5.1.5 Timing Constraints

5.1.5.1 Minimum Clock Period

[Figure: flops a and b clocked by clk1 and clk2, with waveforms for clk1, clk2, a, and b over one clock period; legend: signal may change, signal is stable, signal may rise, signal may fall]

ClockPeriod >
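
A commonly used form of the minimum-clock-period constraint is sketched below. It is offered as an assumption built from the parameters defined in sections 5.1.2 to 5.1.4, not as the course's exact statement; the symbol names and the numeric values (in nanoseconds) are hypothetical, and the skew and jitter terms are taken at their worst case.

```latex
\begin{align*}
T_{clk} &\ge t_{clk \to Q} + t_{prop} + t_{setup} + t_{skew} + t_{jitter} \\
        &= 2 + 10 + 1 + 1 + 0.5 = 14.5\ \text{ns} \\
f_{max} &= \frac{1}{14.5\ \text{ns}} \approx 69\ \text{MHz}
\end{align*}
```

The data launched by one clock edge must leave the source flop, cross the combinational logic and wires, and be stable one setup time before the next edge arrives at the destination flop.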


5.1.5.2 Hold Constraint

5.1.5.3 Example Timing Violations

Good Timing

[Figure: two-flop circuit with signals a, b, c, d, and clk, and waveforms marking the clock-to-Q, prop, setup, and hold intervals]

Setup Violation

[Figure: the same circuit and waveforms, with signal c changing inside the setup window so that the stored value becomes unknown]

Hold Violation

[Figure: the same circuit and waveforms, with the clock-to-Q and prop delays short enough that signal c changes inside the hold window]

5.2 Timing Analysis of Latches and Flip Flops

In this section, we show how to find the clock-to-Q, setup, and hold times for latches, flip-flops, and other storage elements.

5.2.1 Simple Multiplexer Latch


5.2.1.1 Structure and Behaviour of Multiplexer Latch

[Figure: multiplexer latch with input i, output o, and select clk, shown in loading / pass-through mode and in storage mode]

Unfold Multiplexer to Simple Gates

[Figure: multiplexer symbol and implementation (inputs a and b, select s, output o)]

[Figure: latch implementation]


Latch Glitching

[Figure: latch circuit with inputs d and clk and output o]

Note: inverters on clk. Both of the inverters on the clk signal are needed. Together, they prevent a glitch on the OR gate when clk is deasserted. If there was only one inverter, a glitch would occur. For more on this, see section 5.2.1.6.


Loading and Storing Values

[Figure: latch circuit annotated with signal values for four cases: loading '0', loading '1', storing '0', and storing '1']

Storing ’1’


5.2.1.2 Strategy for Timing Analysis of Storage Devices

The key to calculating setup and hold times of a latch, flop, etc. is to identify:

1. how the data is stored when not connected to the input (often a pair of inverters in a loop)

2. the gate(s) that the clock uses to cause the stored data to drive the output (often a transmission gate or multiplexer)

3. the gate(s) that the clock uses to cause the input to drive the output (often a transmission gate or multiplexer)


5.2.1.3 Clock-to-Q Time of a Multiplexer Latch

[Figure: sequence of multiplexer latch circuit snapshots with signals clk, d, l1, l2, qn, q, s1, s2, cn, and c2]


5.2.1.4 Setup Timing of a Multiplexer Latch

[Figure: sequence of latch circuit snapshots annotated with signal values, with the following captions]

• Circuit is stable in load mode
• t=0: Clk transitions from load to store
• t=2: s1 propagates to s2, because cn turns on AND gate
• t=3: l2 is set to 0, because c2 turns off AND gate
• t=4: α from store path propagates to q
• t=5: α from store path completes cycle


Setup Violation

[Figure: sequence of latch circuit snapshots annotated with signal values (old value ω, new value α), with the following captions]

• Circuit is stable in load mode with ω
• t=-1: D transitions from ω to α
• t=0: α propagates through inverter; Clk transitions from load to store
• t=1: α propagates through AND; Clk propagates through inverter
• t=2: old ω propagates through AND
• t=4: ω/α from store path propagates to q
• t=5: ω/α from store path completes cycle
• t=5: Illustrate instability with ω=0, α=1

Trouble: inconsistent values on load path and store path. Old value (ω) still in store path when store path is enabled.

[Figure: waveforms for d, l1, l2, qn, q, s1, s2, clk, cn, and c2 from t=-3 to t=6, showing setup with negative margin]


We now repeat the analysis of setup violation, but illustrate the minimum violation (input transitions from ω to α 3 time-units before the clock edge).

[Figure: sequence of latch circuit snapshots annotated with signal values (old value ω, new value α), with the following captions]

• Circuit is stable in load mode with ω
• t=-3: D transitions from ω to α
• t=-2: α propagates through inverter
• t=-1: α propagates through AND
• t=1: Clk propagates through inverter
• t=2: old ω propagates through AND
• t=4: ω/α from store path propagates to q
• t=5: ω/α from store path completes cycle
• t=5: Illustrate instability with ω=0, α=1

Trouble: inconsistent values on load path and store path. Old value (ω) still in store path when store path is enabled.

[Figure: waveforms for d, l1, l2, qn, q, s1, s2, clk, cn, and c2 from t=-3 to t=6, showing setup with negative margin]


Minimum Setup Time

[Figure: latch circuit and waveforms for d, l1, l2, qn, q, s1, s2, clk, cn, and c2, with the setup interval marked]


5.2.1.5 Hold Time of a Multiplexer Latch

[Figure: multiplexer latch circuit with signals clk, d, l1, l2, qn, q, s1, s2, cn, and c2]


Hold Time Behaviour

[Figure: sequence of latch circuit snapshots with signals clk, d, l1, l2, qn, q, s1, s2, cn, and c2]


5.2.1.6 Example of a Bad Latch

[Figure: latch circuit and waveforms for d (changing from α to β), l1, l2, qn, q, s1, s2, clk, c2, and cn]

5.3 Critical Paths and False Paths

5.3.1 Introduction to Critical and False Paths

Definition critical path: The slowest path on the chip between flops or between flops and pins. The critical path limits the maximum clock speed.

Definition false path: A path along which an edge cannot travel from beginning to end.


Outline

The algorithm that we present comes from McGeer and Brayton in a DAC 198? paper. The algorithm to find the critical path through a circuit is presented in several parts.

1. Section 5.3.2: Find the longest path, ignoring the possibility of false paths.

2. Section 5.3.3: Almost-correct algorithm to test whether a candidate critical path is a false path.

3. Section 5.3.4: If a candidate path is a false path, then find the next candidate path, and repeat the false-path detection algorithm.

4. Section 5.3.5: Correct, complete, and complex algorithm to find the critical path in a circuit.

Notes

Note: The analysis of critical paths and false paths assumes that all inputs change values at exactly the same time. Timing differences between inputs are modelled by the skew parameter in timing analysis.

Throughout our discussion of critical paths, we will use the delay values for gates shown in the table below.

gate   delay
NOT    2
AND    4
OR     4
XOR    6


5.3.1.1 Example of Critical Path in Full Adder

Question: Find the critical path through the full-adder circuit shown below.

[Figure: full-adder circuit with signals ci, a, b, co, s, i, j, and k]


Alternative Excitation

Question: Do the input values of ci=0, a=↓, b=1 exercise the critical path?

[Figure: full-adder circuit with signals ci, a, b, co, s, i, j, and k]


5.3.1.2 Preliminaries for Critical Paths

5.3.1.3 Longest Path and Critical Path

The longest path through the circuit might not be the critical path, because the behaviour of the gates might prevent an edge (0→1 or 1→0) from travelling along the path.


Example False Path

Question: Determine whether the longest path in the circuit below is a false path.

[Figure: circuit with inputs a and b and output y, repeated for four input cases: a = 0, b = 0→1; a = 0, b = 1→0; a = 1, b = 0→1; a = 1, b = 1→0]

Question: How can we determine analytically that this is a false path?

[Figure: circuit with inputs a and b and output y]


Preview of Complete Example

Question: Find the critical path through the circuit below.

[Figure: circuit with signals a, b, c, d, e, f, and g, shown twice]


5.3.2 Longest Path

Outline of Algorithm to Find Longest Path

The basic idea is to annotate each signal with the maximum delay from it to an output.

• Start at destination signals and traverse through fanin to source signals.

  – Destination signals have a delay of 0.

  – At each gate, annotate the inputs by the delay through the gate plus the delay of the output.

  – When a signal fans out to multiple gates, annotate the output of the source (driving) gate with the maximum delay of the destination signals.

• The primary input signal with the maximum delay is the start of the longest path. The delay annotation of this signal is the delay of the longest path.

• The longest path is found by working from the source signal to the destination signals, picking the fanout signal with the maximum delay at each step.
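
The annotation procedure can be traced on a small hypothetical circuit that is not taken from these notes: m = NOT a, n = m AND b, y = n XOR c. The gate delays are those of the table in section 5.3.1 (NOT 2, AND 4, XOR 6). Working backward from the output y:

```latex
\begin{align*}
\text{delay}(y) &= 0 \\
\text{delay}(n) = \text{delay}(c) &= 6 + 0 = 6 && \text{(through the XOR gate)} \\
\text{delay}(m) = \text{delay}(b) &= 4 + 6 = 10 && \text{(through the AND gate)} \\
\text{delay}(a) &= 2 + 10 = 12 && \text{(through the NOT gate)}
\end{align*}
```

In this sketch the primary input with the largest annotation is a, so the longest path is a, m, n, y, with a delay of 12. No signal fans out to more than one gate here, so the maximum-over-fanout step is not exercised.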

5.3.3 Detecting a False Path

5.3.3.1 Preliminaries

The controlling value of a gate is the value such that if one of the inputs has this value, the output can be determined independently of the other inputs.

The controlled output value is the value produced by the controlling input value.

Gate   Controlling Value   Controlled Output
AND    0                   0
OR     1                   1
NAND   0                   1
NOR    1                   0
XOR    none                none


Path Input, Side Input

Definition path input: For a gate on a path (either a candidate critical path or a real critical path), the path input is the input signal that is on the path.

Definition side input: For a gate on a path (either a candidate critical path or a real critical path), the side inputs are the input signals that are not on the path.


Reconvergent Fanout

Definition reconvergent fanout: There are paths from signals in the fanout of a gate that reconverge at another gate.

[Figure: two example circuits, one with signals a, b, c, and y, and one with signals d, e, f, g, h, and z]

If a candidate path has reconvergent fanout, then the rising or falling edge on the input to the path might cause a side input along the path to have a rising or falling edge, rather than a stable '0' or '1'.


Rules for Propagating an Edge Along a Path

1 1

0 0

1 1

0 0

NOT

AND

OR

XOR


Missing Rules?

Question: Why do the rules not have falling edges for AND gates or rising edges for OR gates on the side input?

[Figure: gate with inputs a and b and output c, with waveforms for a, b, and c]


Viability Condition of a Path

Definition Viability condition: For a path (p) through a circuit, the viability condition is a Boolean expression in terms of the input signals that defines the cases where an edge will propagate along the path.

Based upon the rules for propagating an edge that we have seen so far, the viability condition for a path is: every side input has a non-controlling value.

As always, section 5.3.5 has the complete viability condition.


5.3.3.2 Almost-Correct Algorithm to Detect a False Path

1. Annotate each side input along the path with its non-controlling value. These annotations are the constraints that must be satisfied for the candidate path to be exercised.

2. Propagate the constraints backward from the side inputs of the path to the inputs of the circuit under consideration.

3. If there is a contradiction amongst the constraints, then the candidate path is a false path.

4. If there is no contradiction, then the constraints on the inputs give the conditions under which an edge will traverse along the candidate path from input to output.
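A minimal Python sketch of these four steps follows. It makes two simplifying assumptions that are not part of the slides: the circuit `CIRCUIT` is a made-up example, and step 2 (backward propagation of constraints) is replaced by brute-force enumeration of all input assignments, which finds the same contradictions on small circuits.

```python
from itertools import product

# Non-controlling value per gate type (XOR and NOT impose no constraint).
NON_CONTROLLING = {"AND": 1, "NAND": 1, "OR": 0, "NOR": 0}

EVAL = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
    "XOR": lambda a, b: a ^ b,
    "NOT": lambda a: 1 - a,
}

# Hypothetical circuit in topological order: gate output -> (type, inputs).
CIRCUIT = {
    "n": ("NOT", ["a"]),
    "d": ("AND", ["a", "b"]),
    "z": ("AND", ["d", "n"]),
}
INPUTS = ["a", "b"]


def side_input_constraints(path):
    """Step 1: pair each side input on the path with its non-controlling value."""
    constraints = []
    for path_in, gate_out in zip(path, path[1:]):
        kind, ins = CIRCUIT[gate_out]
        if kind in NON_CONTROLLING:
            constraints += [(s, NON_CONTROLLING[kind]) for s in ins if s != path_in]
    return constraints


def evaluate(assignment):
    """Compute every signal value from an assignment of the primary inputs."""
    values = dict(assignment)
    for out, (kind, ins) in CIRCUIT.items():
        values[out] = EVAL[kind](*(values[s] for s in ins))
    return values


def viable_inputs(path):
    """Steps 2 and 4: input assignments that satisfy every constraint."""
    wanted = side_input_constraints(path)
    found = []
    for bits in product([0, 1], repeat=len(INPUTS)):
        values = evaluate(zip(INPUTS, bits))
        if all(values[s] == v for s, v in wanted):
            found.append(dict(zip(INPUTS, bits)))
    return found


def is_false_path(path):
    """Step 3: the path is false if the constraints contradict each other."""
    return not viable_inputs(path)
```

In this made-up circuit, the path b, d, z requires side input a to be 1 and side input n (the inverse of a) to be 1, which is a contradiction. Like the algorithm it sketches, this check is only almost correct, because it assumes side inputs are stable.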

5.3.3.3 Examples of Detecting False Paths


False-Path Example 1

Question: Determine if the longest path in the circuit below is a false path.

[Figure: example circuit with signals a through k, annotated with the maximum delay from each signal to the output]

side input non-controlling value constraint


5.3.4 Finding the Next Candidate Path

If the longest path is a false path, we need to find the next longest path in the circuit, which will be our next candidate critical path. If this candidate fails, we continue to find the next longest of the remaining paths, ad infinitum.


5.3.4.1 Algorithm to Find Next Candidate Path

1. Initialize path table with primary inputs, their potential delay, and fanout.

2. Sort path table by potential delay

3. If the partial path with the max delay has just one unused fanout signal, then extend the partial path with this signal. Otherwise:

(a) Extend path through unused fanout with max delay.

(b) Delete this fanout signal from the list of unused fanout signals.

4. Compute constraint that side input has non-controlling value

5. If the new constraint does not cause a contradiction, then return to step 3. Otherwise:

(a) Mark this partial path as false.

(b) For each partial path that is a prefix of the false path:

• recalculate potential delay of path

(c) Return to step 2
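The ordering part of this algorithm (always extend the partial path with the largest potential delay) can be sketched as a best-first search. The code below is a swapped-in technique, not the path-table bookkeeping above: it uses a priority queue in place of the sorted table, and it omits the constraint checks of steps 4 and 5. The fanout table `FANOUT` is hypothetical.

```python
import heapq

# Hypothetical netlist: signal -> list of (gate delay, fanout signal).
FANOUT = {
    "a": [(4, "d")],
    "b": [(4, "d"), (2, "e")],
    "c": [(2, "e")],
    "d": [(6, "z")],
    "e": [(6, "z")],
    "z": [],
}


def remaining(signal):
    """Maximum delay from `signal` to an output."""
    return max((d + remaining(s) for d, s in FANOUT[signal]), default=0)


def candidate_paths(primary_inputs):
    """Yield (delay, path) for complete paths, longest first."""
    # Heap entries: (-potential delay, partial path, delay accumulated so far).
    heap = [(-remaining(s), [s], 0) for s in primary_inputs]
    heapq.heapify(heap)
    while heap:
        neg_potential, path, so_far = heapq.heappop(heap)
        last = path[-1]
        if not FANOUT[last]:
            yield -neg_potential, path  # reached an output
            continue
        for delay, nxt in FANOUT[last]:
            total = so_far + delay
            heapq.heappush(heap, (-(total + remaining(nxt)), path + [nxt], total))
```

Each call to `next()` on the generator produces the next candidate critical path; a false-path test would be applied to each candidate in turn until one is found to be viable.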


5.3.4.2 Examples of Finding Next Candidate Path

Next-Path Example 1

Question: Starting from the initial delay calculation and longest path, find the next candidate path and test if it is a false path.

[Figure: example circuit with signals a through k, annotated with the maximum delay from each signal to the output]


potential delay    unused fanout    path
10                 e                c
12                 h, g             b
16                 d                a


side input non-controlling value constraint


5.3.5 Correct Algorithm to Find Critical Path

We now remove the assumption that side inputs always arrive earlier than path inputs.

5.3.5.1 Rules for Late Side Inputs

[Figure: gate behaviour for an early side input and for a late side input, for each combination of path input (controlling or non-controlling) and side input (controlling or non-controlling). Outcomes shown: path input propagates, side input propagates, neither input propagates, path input causes glitch, side input causes glitch, monotone speedup.]

The complete and correct rule: a path input excites the gate if the side-input is non-controlling or the side-input arrives late and the path input is controlling.
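Written as a Boolean function, the rule looks like the sketch below. The function and parameter names are illustrative only.

```python
def path_input_excites(side_is_controlling, side_is_late, path_is_controlling):
    """Complete rule: the side input is non-controlling, or the side input
    arrives late and the path input has a controlling value."""
    return (not side_is_controlling) or (side_is_late and path_is_controlling)
```

With an early side input (`side_is_late=False`) the function reduces to the earlier viability condition: the side input must be non-controlling.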


5.3.5.2 Monotone Speedup

Definition monotonic: A function (f) is monotonic if increasing its input causes the output to increase or remain the same. Mathematically: x < y ⟹ f(x) ≤ f(y).

Definition monotonous: A lecture is monotonous if increasing the length of the lecture increases the number of people who are asleep.

Definition monotone speedup: The maximum clock speed of a circuit should be monotonic with respect to the speed of any gate or sub-circuit. That is, if we increase the speed of part of the circuit, we should either increase the clock speed of the circuit, or leave it unchanged.


5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation

5.3.5.4 Complete Algorithm

• If we find a contradiction on the path, check for side inputs that are on previously discovered false paths.

• If a gate and its side input are on a previously discovered false path, then the side input defines a prefix of a false path that is a late-arriving side input.

• For each late-arriving prefix, compute its viability (the conditions under which an edge will propagate along the prefix to the late side input).

• To the row of the late-arriving side input in the constraint table, add as a disjunction the constraint that: the path input has a controlling value and at least one of the prefixes is viable.


5.3.5.5 Complete Examples

Complete Example 1

Question: Find the critical path in the circuit below.

[Figure: circuit with signals a, b, c, d, e, f, g]

potential delay    unused fanout    path
false                               a,b,d,e,f,g
10                 g, c             a
10                                  a,c,f,g

side input    non-controlling value    constraint
f[e]          1                        a
g[a]          1                        a


Complete Example 2


[Figure: circuit with signals a through j, annotated with delays]

potential delay    unused fanout    path
false                               b,d,e,g,h,i,j
8                  f                a
12                 h                c
14                 f, g             b,d,e
14                                  b,d,e,g,i,j

side input    non-ctrl value    constraint
h[c]          0                 c
i[h]          0                 cb
j[f]          0                 ab


Complete Example 3

Monotone speedup

• Critical path 〈a,c,e,f〉

• Late side input e[d]

• Total delay 10

• Excitation: a = rising edge

[Figure: circuit with signals a, b, c, d, e, f, shown three times with timing annotations: rising edge excitation, falling edge excitation, and fast timing]


Complete Example 4

Late side inputs sometimes must have an edge.

Find the second-longest path with contradiction using early sides:

[Figure: circuit with signals a through k, shown three times with delay and value annotations]


Complete Example 5

Late side paths must be viable.


[Figure: circuit with signals a through k, shown twice]


5.3.6 Further Extensions to Critical Path Analysis

McGeer and Brayton’s paper includes extensions to the critical path algorithm presented here that we will not cover:

• gates with more than two inputs

• finding all input values that will exercise the critical path

• multiple paths with the same delay to the same gate

5.3.7 Increasing the Accuracy of Critical Path Analysis

When doing critical path calculations, it is often useful to strike a balance between accuracy and effort. In the examples so far, we assumed that all signals had the same wire and load delays. This assumption simplifies calculations, but reduces accuracy. Section 5.4 discusses how the analog world affects timing analysis.


5.4 Elmore Timing Model

5.4.1 RC-Networks for Timing Analysis

[Figure: P-transistor at four levels of abstraction: transistor level (gate, source, drain), mask level (poly, p-diff, contact), cross-section of fabricated transistor, and switch level]

[Figure: N-transistor at four levels of abstraction: transistor level (gate, source, drain), mask level (poly, n-diff, contact), cross-section of fabricated transistor, and switch level]


Different Levels of Abstraction for Inverter

[Figure: inverter with input a and output b at gate level, transistor level, and mask level (poly, n-diff, p-diff, metal, contact)]

[Figure: RC-network models of P- and N-transistors (Rpu, Rpd, Cp)]


RC-Network for Timing Analysis

[Figure: RC network for the inverter, with Rpu, Rpd, Cp, and load capacitance CL between VDD and GND]


A Pair of Inverters

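[Figure: pair of inverters with signals a, b, c at gate level, transistor level, and mask level]


A Pair of Inverters (Cont’d)

[Figure: mask-level layout and the corresponding RC network, with Rpu, Rpd, Cp, and CL for each inverter and wire parameters RW, CW, RV]

RC-Network for Timing Analysis (trimmed)

[Figure: trimmed RC network for node b: Rpu, Rpd, Cp, RW, CW, RV, CL]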


A Circuit with Fanout

[Figure: circuit with fanout (signals a, b, c, d) at gate level, gate level with physical layout, and transistor level]


A Circuit with Fanout (Cont’d)

[Figure: transistor-level and mask-level views of the circuit with fanout]


A Circuit with Fanout (Cont’d)

[Figure: mask-level layout and the corresponding RC network, with Rpu, Rpd, Cp, and CL for each gate and wire parameters RW1, CW1, RW2, CW2, RW3, CW3, RV]

RC-Network for Timing Analysis (trimmed)

[Figure: trimmed RC network for node b: Rpu, Rpd, Cp, RW1, CW1, RW2, CW2, RV, and two CL loads]


RC-Network for Timing Analysis (cleaned up)

[Figure: cleaned-up RC network for node b: Rpu, Rpd, Cp, RW1, CW1, RW2, CW2, RV, and two CL loads]


5.4.2 Derivation of Analog Timing Model

Real Waveforms

[Figure: input voltage and output voltage versus time, for a slow input and for a fast input]


Steps Toward Approximation

We begin with two simplifications as steps toward calculating a single delay value for a circuit.

1. Look at the circuit’s response to a step-function input.

2. Measure the delay to go from GND to 65% of VDD and from VDD to 35% of VDD.

Definition Trip Points: A high or ’1’ trip point is the voltage level where an upwards transition means the signal represents a ’1’.

A low or ’0’ trip point is the voltage level where a downwards transition means the signal represents a ’0’.



Node Numbering, Initial Conditions

• The source (VDD in our case) and each capacitor is a node. We number the nodes, capacitors, and resistors. Resistors are numbered according to the capacitor to their right. Multiple resistors in series without an intervening capacitor are lumped into a single resistor.

• All nodes except the source start at GND.

• We calculate the voltage at a node when we turn on the P-transistor (connect to VDD).

The process for analyzing a transition from VDD to GND on a node is the dual of the process just described. The source node is GND, all other nodes start at VDD, and we calculate the voltage when we turn on the N-transistor (connect it to GND).

[Figure: cleaned-up RC network with nodes numbered 0 (source) to 5 and lumped resistors R1 to R5]


Define: Path and Downstream

Definition path: The path from the source node to a node i is the set of all resistors between the source and i. Example: path(3) = {R1, R2, R3}

Definition down: The set of capacitors downstream from a node is the set of all capacitors where current would flow through the node to charge the capacitor. You can think of this as the set of capacitors that are between the node and ground. Example: down(2) = {C2, C3, C4, C5}. Example: down(3) = {C3, C4}
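The two definitions can be checked mechanically with a short Python sketch. The `PARENT` table is an assumption: it encodes a tree topology that is consistent with the examples above (R1 from node 0 to 1, R2 from 1 to 2, R3 from 2 to 3, R4 from 3 to 4, R5 from 2 to 5). Sets hold indices, so `{1, 2, 3}` stands for {R1, R2, R3} or {C1, C2, C3}.

```python
# Assumed tree: PARENT[k] is the node on the source side of resistor Rk.
PARENT = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}


def path(node):
    """Indices of the resistors between the source (node 0) and `node`."""
    resistors = set()
    while node != 0:
        resistors.add(node)
        node = PARENT[node]
    return resistors


def down(node):
    """Indices of the capacitors whose charging current flows through `node`."""
    return {k for k in PARENT if node in path(k)}
```

Because each resistor is numbered after the capacitor to its right, `down(r)` for a resistor index and `down(node)` for the node with the same index are the same set.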


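5.4.2.1 Example Derivation: Equation for Voltage at Node 3

$$V_3(t) = V_0(t) - (\text{voltage drop from Node 0 to Node 3})$$

The voltage drop is the sum of the voltage drops across the resistors on the path from Node 0 to Node 3:

$$V_3(t) = V_0(t) - \sum_{r \in \mathrm{path}(3)} R_r \, I_r(t) = V_0(t) - \bigl(R_1 I_1(t) + R_2 I_2(t) + R_3 I_3(t)\bigr)$$

The current through a resistor is the sum of the currents through all of the downstream capacitors:

$$I_r(t) = \sum_{c \in \mathrm{down}(r)} I_c$$

$$\begin{aligned}
I_1(t) &= I_{c1} + I_{c2} + I_{c3} + I_{c4} + I_{c5} \\
I_2(t) &= I_{c2} + I_{c3} + I_{c4} + I_{c5} \\
I_3(t) &= I_{c3} + I_{c4}
\end{aligned}$$


Substitute $I_r$ into the equation for $V_3$:

$$V_3(t) = V_0(t) - \bigl[\, R_1 (I_{c1} + I_{c2} + I_{c3} + I_{c4} + I_{c5}) + R_2 (I_{c2} + I_{c3} + I_{c4} + I_{c5}) + R_3 (I_{c3} + I_{c4}) \,\bigr]$$

Use associativity to group terms by currents:

$$V_3(t) = V_0(t) - \bigl[\, I_{c1} R_1 + I_{c2} (R_1 + R_2) + I_{c3} (R_1 + R_2 + R_3) + I_{c4} (R_1 + R_2 + R_3) + I_{c5} (R_1 + R_2) \,\bigr]$$


Current through a capacitor:

$$I_c(t) = C_c \frac{\partial V_c(t)}{\partial t}$$

Substitute $I_c$ into the equation for $V_3$:

$$\begin{aligned}
V_3(t) = V_0(t) - \Bigl[\, & R_1 \, C_{c1} \frac{\partial V_{c1}(t)}{\partial t}
 + (R_1 + R_2) \, C_{c2} \frac{\partial V_{c2}(t)}{\partial t}
 + (R_1 + R_2 + R_3) \, C_{c3} \frac{\partial V_{c3}(t)}{\partial t} \\
 & + (R_1 + R_2 + R_3) \, C_{c4} \frac{\partial V_{c4}(t)}{\partial t}
 + (R_1 + R_2) \, C_{c5} \frac{\partial V_{c5}(t)}{\partial t} \,\Bigr]
\end{aligned}$$


Define the resistance shared by the paths to nodes $i$ and $k$:

$$R_{i,k} = \sum_{r \in \mathrm{path}(i) \cap \mathrm{path}(k)} R_r$$

$$\begin{aligned}
R_{3,1} &= R_1 \\
R_{3,2} &= R_1 + R_2 \\
R_{3,3} &= R_1 + R_2 + R_3 \\
R_{3,4} &= R_1 + R_2 + R_3 \\
R_{3,5} &= R_1 + R_2
\end{aligned}$$

Substitute $R_{i,k}$ into $V_3$:

$$V_3(t) = V_0(t) - \Bigl[\, R_{3,1} C_{c1} \frac{\partial V_{c1}(t)}{\partial t} + R_{3,2} C_{c2} \frac{\partial V_{c2}(t)}{\partial t} + R_{3,3} C_{c3} \frac{\partial V_{c3}(t)}{\partial t} + R_{3,4} C_{c4} \frac{\partial V_{c4}(t)}{\partial t} + R_{3,5} C_{c5} \frac{\partial V_{c5}(t)}{\partial t} \,\Bigr]$$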

5.4.2.2 General Derivation

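$$V_i(t) = V_0(t) - (\text{voltage drop from Node 0 to Node } i)$$

The voltage drop is the sum of the voltage drops across the resistors on the path from Node 0 to Node $i$:

$$V_i(t) = V_0(t) - \sum_{r \in \mathrm{path}(i)} R_r \, I_r(t)$$


The current through a resistor is the sum of the currents through all of the downstream capacitors:

$$I_r(t) = \sum_{c \in \mathrm{down}(r)} I_c$$

Substitute $I_r$ into the equation for $V_i$:

$$V_i(t) = V_0(t) - \sum_{r \in \mathrm{path}(i)} R_r \sum_{c \in \mathrm{down}(r)} I_c$$

Use associativity to push $R_r$ into the summation over $c$:

$$V_i(t) = V_0(t) - \sum_{r \in \mathrm{path}(i)} \; \sum_{c \in \mathrm{down}(r)} R_r \, I_c$$


Current through a capacitor:

$$I_c(t) = C_c \frac{\partial V_c(t)}{\partial t}$$

Substitute $I_c$ into the equation for $V_i$:

$$V_i(t) = V_0(t) - \sum_{r \in \mathrm{path}(i)} \; \sum_{c \in \mathrm{down}(r)} R_r \, C_c \frac{\partial V_c(t)}{\partial t}$$

A little bit of handwaving to prepare for Elmore resistance:

$$V_i(t) = V_0(t) - \sum_{k \in \mathrm{Nodes}} \Bigl( \sum_{r \in \mathrm{path}(i) \cap \mathrm{path}(k)} R_r \Bigr) C_k \frac{\partial V_k(t)}{\partial t}$$


Define Elmore resistance $R_{i,k}$:

$$R_{i,k} = \sum_{r \in \mathrm{path}(i) \cap \mathrm{path}(k)} R_r$$

Substitute $R_{i,k}$ into $V_i$:

$$V_i(t) = V_0(t) - \sum_{k \in \mathrm{Nodes}} R_{i,k} \, C_k \frac{\partial V_k(t)}{\partial t}$$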

5.4.3 Elmore Timing Model

• Assume that V0(t) is a step function from 0 to 1 at time 0.

• Derive upper and lower bounds for Vi(t).

• Find RC time constants for upper and lower bounds.

• Elmore delay is guaranteed to be between upper and lower bounds.

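[Figure: three plots of voltage versus time: upper and lower bounds (breakpoints at TD − TRi and TP − TRi), Elmore model (time constant TD), and RC-network model (time constants TRi and TP)]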


Equations for Curves

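The time axis has breakpoints at $0$, $T_{Di} - T_{Ri}$, $T_P - T_{Ri}$, and $\infty$.

$$\text{Upper}(t) =
\begin{cases}
1 + \dfrac{t - T_{Di}}{T_P} & 0 \le t \le T_{Di} - T_{Ri} \\[6pt]
1 - \dfrac{T_{Ri}}{T_P} \, e^{(T_{Di} - T_P - t)/T_{Ri}} & t \ge T_{Di} - T_{Ri}
\end{cases}$$

$$\text{Elmore}(t) = 1 - e^{-t/T_{Di}}$$

$$\text{Lower}(t) =
\begin{cases}
0 & 0 \le t \le T_{Di} - T_{Ri} \\[6pt]
1 - \dfrac{T_{Di}}{t + T_{Ri}} & T_{Di} - T_{Ri} \le t \le T_P - T_{Ri} \\[6pt]
1 - \dfrac{T_{Di}}{T_P} \, e^{(T_P - T_{Ri} - t)/T_P} & t \ge T_P - T_{Ri}
\end{cases}$$

Fact: $0 \le T_{Ri} \le T_{Di} \le T_P$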


Definitions of Time Constants

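$$T_{Ri} = \sum_{k \in \mathrm{Nodes}} \frac{R_{k,i}^2 \, C_k}{R_{i,i}} \qquad \text{Mathematical artifact, no intuitive meaning}$$

$$T_{Di} = \sum_{k \in \mathrm{Nodes}} R_{k,i} \, C_k \qquad \text{Elmore delay}$$

$$T_P = \sum_{k \in \mathrm{Nodes}} R_{k,k} \, C_k \qquad \text{RC-time constant for lumped network}$$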
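As a numeric illustration, the three time constants can be computed for a small RC tree. Everything specific in this sketch is assumed: the tree topology in `PARENT` (chosen to be consistent with the node-3 derivation earlier) and the resistor and capacitor values in `R` and `C`.

```python
# Assumed tree: PARENT[k] is the node on the source side of resistor Rk.
PARENT = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}
R = {1: 1.0, 2: 2.0, 3: 1.0, 4: 1.0, 5: 3.0}  # hypothetical resistances (ohms)
C = {1: 1.0, 2: 1.0, 3: 2.0, 4: 1.0, 5: 1.0}  # hypothetical capacitances (farads)


def path(node):
    """Indices of the resistors between the source (node 0) and `node`."""
    resistors = set()
    while node != 0:
        resistors.add(node)
        node = PARENT[node]
    return resistors


def shared_resistance(i, k):
    """Elmore resistance R_{i,k}: resistance common to the paths to i and k."""
    return sum(R[r] for r in path(i) & path(k))


def elmore_delay(i):
    """T_Di = sum over k of R_{k,i} * C_k."""
    return sum(shared_resistance(k, i) * C[k] for k in C)


def lumped_time_constant():
    """T_P = sum over k of R_{k,k} * C_k."""
    return sum(shared_resistance(k, k) * C[k] for k in C)


def t_r(i):
    """T_Ri = sum over k of R_{k,i}^2 * C_k / R_{i,i}."""
    return sum(shared_resistance(k, i) ** 2 * C[k] for k in C) / shared_resistance(i, i)
```

For node 3 with these made-up values, the three results satisfy the ordering 0 ≤ TRi ≤ TDi ≤ TP stated earlier.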


Picking the Trip Point

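$$V_i(t) = V_{DD} \, (1 - e^{-t/T_{Di}})$$

Pick a trip point of $V_i(t) = 0.65\,V_{DD}$, then solve for $t$:

$$\begin{aligned}
0.65\,V_{DD} &= V_{DD} \, (1 - e^{-t/T_{Di}}) \\
0.35 &= e^{-t/T_{Di}} \\
\ln 0.35 &= \ln\bigl(e^{-t/T_{Di}}\bigr) && \text{(take ln of both sides)} \\
\ln 0.35 = -1.05 &\approx -1.0 \\
-1.0 &= -t/T_{Di} \\
t &= T_{Di}
\end{aligned}$$

By picking a trip point of 0.65 VDD, the time for Vi to reach the trip point is the Elmore delay.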


5.4.4 Examples of Using Elmore Delay

5.4.4.1 Interconnect with Single Fanout


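[Figure: gates G1 and G2 connected through antifuses Ra1 to Ra4 and wire segments Rw1 to Rw3 with capacitances C1 to C3, and the equivalent RC network from G1 (Rpu, Rpd, Cp) to the input capacitance CG2 of G2]

G*    gate
C*    capacitance on wire
Ra*   resistance through antifuse
Rw*   resistance through wire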


Question: Calculate delay from gate 1 to gate 2

[Figure: RC network from G1 (Rpu, Rpd, Cp) through Ra1 to Ra4, Rw1 to Rw3, and C1 to C3 to the input capacitance CG2 of G2]


Doubling Antifuses

Question: If you double the number of antifuses and wires needed to connect two gates, what will be the approximate effect on the wire delay between the gates?


5.4.4.2 Interconnect with Multiple Gates in Fanout

[Figure: gate G1 driving gates G2 and G3]

Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G2



Delay to G2 vs G3

Question: Assuming all wire segments at same level have roughly the same capacitance, which is greater, the delay to G2 or the delay to G3?

[Figure: interconnect from G1 to G2 and G3 with resistances R1 to R6 and capacitances C1 to C7, and the equivalent RC network with nodes n1 to n7]


5.5 Practical Usage of Timing Analysis

Speed Grading

• Fabs sort chips according to their speed (sorting is known as speed grading or speed binning)

• Faster chips are more expensive

• In FPGAs, sorting is usually based on propagation delay through an FPGA cell. As wires become a larger portion of delay, some analysis of wire delays is also being done.

• Propagation delay is the average of the rising and falling propagation delays.

• Typical speed grades for FPGAs:

Std   standard speed grade
1     15% faster than Std
2     25% faster than Std
3     35% faster than Std

Worst-Case Timing

• Maximum Delay in CMOS. When?


– Minimum voltage

– Maximum temperature

– Slow-slow conditions (process variation/corner which results in slow p-channel and slow n-channel). We could also have fast-fast, slow-fast, and fast-slow process corners

• Increasing temperature increases delay

– ⇑ Temp =⇒ ⇑ resistivity

– ⇑ resistivity =⇒ ⇑ electron vibration

– ⇑ electron vibration =⇒ ⇑ colliding with current electrons

– ⇑ colliding with current electrons =⇒ ⇑ delay

• Increasing supply voltage decreases delay

– ⇑ supply voltage =⇒ ⇑ current

– ⇑ current =⇒ ⇓ load capacitor charge time

– ⇓ load capacitor charge time =⇒ ⇓ total delay

• Derating factor is a number used to adjust timing numbers to account for voltage and temp conditions


• ASIC manufacturers’ classes, based on variety of environments:

              VDD         TA (ambient temp) or TC (case temp)
Commercial    5V ± 5%     0 to +70C
Industrial    5V ± 10%    –40 to +85C
Military      5V ± 10%    –55 to +125C

• What is important is the transistor temperature inside the chip, TJ (junction temperature)

5.5.1 Speed Binning

Speed binning is the process of testing each manufactured part to determine the maximum clock speed at which it will run reliably.

Manufacturers sell chips off of the same manufacturing line at different prices based on how fast they will run.

A “speed bin” is the clock speed that chips will be labeled with when sold.

Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your software crashes more frequently than your over-stressed hardware will).


5.5.1.1 FPGAs, Interconnect, and Synthesis

On FPGAs, 40-60% of the clock cycle is consumed by interconnect.

When synthesizing, increasing effort (number of iterations) of place and route can significantly reduce the clock period on large designs.


5.5.2 Worst Case Timing

5.5.2.1 Fanout delay

In Smith’s book, Table 5.2 (Fanout delay) combines two separate parameters:

• capacitive load delay

• interconnect delay

into a single parameter (fanout). This is common, and fine.

But, when reading a table such as this, you need to know whether fanout delay is combining both capacitive load delay and interconnect delay, or is just capacitive load.


5.5.2.2 Derating Factors

Delays are dependent upon supply voltage and temperature.

⇑ Temp =⇒ ⇑ Delay⇑ Supply voltage =⇒ ⇓ Delay


Temperature

• ⇑ Temp =⇒ ⇑ Delay

– ⇑ Temp =⇒ ⇑ Resistivity of wires

– As temp goes up, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current.


Supply Voltage

• ⇑ Supply voltage =⇒ ⇓ Delay

– ⇑ Supply voltage =⇒ ⇑ current (V = IR)

– ⇑ current =⇒ ⇓ time to charge load capacitors to threshold voltage


Derating Factor Definition

A “derating factor” is a number to adjust timing numbers to account for different temperature and voltage conditions.

Excerpt from table 5.3 in Smith’s book (Actel Act 3 derating factors):

Derating factor    Temp     Vdd
1.17               125C     4.5V
1.00               70C      5.0V
0.63               -55C     5.5V
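A derating factor is applied by simple multiplication. The sketch below uses the three rows of the excerpt above; the 10 ns nominal delay in the usage note is a made-up number.

```python
# (temperature in C, supply voltage in V) -> derating factor, from the excerpt.
DERATING = {
    (125, 4.5): 1.17,
    (70, 5.0): 1.00,
    (-55, 5.5): 0.63,
}


def derated_delay(nominal_ns, temp_c, vdd):
    """Scale a delay quoted at 70C / 5.0V to another listed operating point."""
    return nominal_ns * DERATING[(temp_c, vdd)]
```

For example, a path quoted at 10 ns under nominal conditions would be budgeted at about 11.7 ns at 125C and 4.5V, and at about 6.3 ns at -55C and 5.5V.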

Chapter 6

Power Analysis and Power-Aware Design

6.1 Overview

6.1.1 Importance of Power and Energy

• Laptops, PDAs, cell phones, etc.: obvious!

• For microprocessors in personal computers, every watt above 40W adds $1 to manufacturing cost

• Approx 25% of operating expense of server farm goes to energy bills

• (Dis)Comfort of Unix labs in E2

• Sandia Labs had to build a special sub-station when they took delivery of Teraflops massively parallel supercomputer (over 9000 Pentium Pros)

• High-speed microprocessors today can run so hot that they will damage themselves (Athlon reliability problems, Pentium 4 processor thermal throttling)

• In 2000, information technology consumed 8% of total power in US.

• Future power viruses: cell phone viruses cause cell phone to run in full power mode and consume battery very quickly; PC viruses that cause CPU to meltdown batteries

6.1.2 Industrial Names and Products

Note: Lots of links from E&CE 327 web pages under “Documentation”

6.1.3 Power vs Energy

Most people talk about “power” reduction, but sometimes they mean “power” and sometimes “energy.”

• Power minimization is usually about heat removal

• Energy minimization is usually about battery life or energy costs

Type      Units     Equivalent Types    Equations
Energy    Joules    Work                = Volts × Coulombs
                                        = ½ × C × Volts²
Power     Watts     Energy / Time       = Volts × I
                                        = Joules/sec


6.1.4 Batteries, Power and Energy

6.1.4.1 Do Batteries Store Energy or Power?

Energy = Volts × Coulombs

Power = Energy / Time

Batteries are rated in Amp-hours at a voltage.

battery = Amps × Seconds × Volts
        = (Coulombs / Seconds) × Seconds × Volts
        = Coulombs × Volts
        = Energy

Batteries store energy.


6.1.4.2 Battery Life and Efficiency

To extend battery life, we want to increase the amount of work done and/or decrease energy consumed.

Work and energy are same units, therefore to extend battery life, we truly want to improve efficiency.

“Power efficiency” of microprocessors is normally measured in MIPS/Watt. Is this a real measure of efficiency?

MIPS / Watts = (millions of instructions / Seconds) × (Seconds / Energy)
             = millions of instructions / Energy

Both instructions executed and energy are measures of work, so MIPS/Watt is a measure of efficiency.

Question: What is the weakness of this analysis?


6.1.4.3 Battery Life and Power

Question: Running a VHDL simulation requires executing an average of 1 million instructions per simulation step. My computer runs at 700MHz, has a CPI of 1.0, and burns 70W of power. My battery is rated at 10V and 2.5AH. Assuming all of my computer’s clock cycles go towards running VHDL simulations, how many simulation steps can I run on one battery charge?


Battery Life and Power

Question: If I use the SpeedStep feature of my computer, my computer runs at 600MHz with 60W of power. With SpeedStep activated, how much longer can I keep the computer running on one battery?


Battery Life and Power

Question: With SpeedStep activated, how many more simulation steps canI run on one battery?
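One way to set up the three battery questions is sketched below. It assumes an idealized battery that delivers its full rated energy at constant power, which real batteries do not; the function names are illustrative.

```python
def battery_energy_joules(volts, amp_hours):
    """Amp-hours x 3600 s/h gives coulombs; coulombs x volts gives joules."""
    return volts * amp_hours * 3600


def runtime_and_steps(volts, amp_hours, watts, clock_hz, cpi, instr_per_step):
    """Return (runtime in seconds, number of simulation steps) on one charge."""
    runtime_s = battery_energy_joules(volts, amp_hours) / watts
    instructions = runtime_s * clock_hz / cpi
    return runtime_s, instructions / instr_per_step


# Numbers from the questions: 10 V, 2.5 AH battery; 1 million instructions per step.
normal = runtime_and_steps(10, 2.5, 70, 700e6, 1.0, 1e6)
speedstep = runtime_and_steps(10, 2.5, 60, 600e6, 1.0, 1e6)
```

Under these assumptions the battery holds 90 000 J. SpeedStep lengthens the runtime, but because power and clock speed drop by the same ratio, the energy per instruction is unchanged, and so is the number of simulation steps per charge.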


6.2 Power Equations

$$\text{Power} = \underbrace{\text{SwitchPower} + \text{ShortPower}}_{\text{DynamicPower}} + \underbrace{\text{LeakagePower}}_{\text{StaticPower}}$$

Dynamic Power dependent upon clock speed

Switching Power useful — charges up transistors

Short Circuit Power not useful — both N and P transistors are on

Static Power independent of clock speed

Leakage Power not useful — leaks around transistor


Dynamic Power

Dynamic power is proportional to how often signals change their value (switch).

• Roughly 20% of signals switch during a clock cycle.

• Need to take glitches into account when calculating activity factor. Glitches increase the activity factor.

• Equations for dynamic power contain clock speed and activity factor.


6.2.1 Switching Power

[Figure: inverter charging its load capacitor CapLoad (input 1→0, output 0→1)]

[Figure: inverter discharging its load capacitor CapLoad (input 0→1, output 1→0)]

energy to (dis)charge capacitor = ½ × CapLoad × VoltSup²


Switching Power

When a capacitor C is charged to a voltage V, the energy stored in the capacitor is ½CV².

The energy required to charge the capacitor from 0 to V is CV². Half of the energy (½CV²) is dissipated as heat through the pullup resistance. Half of the energy is transferred to the capacitor.

When the capacitor discharges from V to 0, the energy stored in the capacitor (½CV²) is dissipated as heat through the pulldown resistance.


Switching Power

f′: frequency at which the inverter goes through a complete charge-discharge cycle (eqn 15.4 in Smith)

average switching power = f′ × CapLoad × VoltSup²

ClockSpeed   clock speed
ActFact      average number of times that signal switches from 0→1 or from 1→0 during a clock cycle

average switching power = ½ × ActFact × ClockSpeed × CapLoad × VoltSup²
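The second form of the equation translates directly into code. The numeric values in the usage note are made up for illustration.

```python
def switching_power(act_fact, clock_hz, cap_load_farads, volt_sup):
    """Average switching power in watts:
    1/2 x ActFact x ClockSpeed x CapLoad x VoltSup^2."""
    return 0.5 * act_fact * clock_hz * cap_load_farads * volt_sup ** 2
```

For example, `switching_power(0.2, 100e6, 10e-12, 3.3)` is roughly 1.1 mW. The factor of ½ appears because ActFact counts single transitions, and a complete charge-discharge cycle (frequency f′) takes two of them. Halving the supply voltage cuts the result to one quarter.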


6.2.2 Short-Circuited Power

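[Figure: inverter (Vi, Vo) with short-circuit current IShort, and gate-voltage waveform between GND and VoltSup showing the interval TimeShort, between VoltThresh and VoltSup − VoltThresh, during which both the P-transistor and the N-transistor are on]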

PwrShort = ActFact×ClockSpeed×TimeShort× IShort×VoltSup


6.2.3 Leakage Power

[Figure: cross section of inverter showing parasitic diode]

[Figure: I-V curve showing leakage current ILeak through parasitic diode]

$$\text{PwrLk} = \text{ILeak} \times \text{VoltSup}$$

$$\text{ILeak} \propto e^{\left(\frac{-q \times \text{VoltThresh}}{k \times T}\right)}$$


6.2.4 Glossary


6.2.5 Note on Power Equations


6.3 Overview of Power Reduction Techniques

We can divide power reduction techniques into two classes: analog and digital.


Analog Parameters

Power reduction parameters at the analog level.

capacitance for example, Silicon on Insulator (SOI)

resistance for example, copper wires

voltage low-voltage circuits


Analog Techniques

Power reduction techniques at the analog level.

dual-VDD Two different supply voltages: high voltage for performance-critical portions of design, low voltage for remainder of circuit. Alternatively, can vary voltage over time: high voltage when running performance-critical software and low voltage when running software that is less sensitive to performance.

dual-Vt Two different threshold voltages: transistors with low threshold voltage for performance-critical portions of design (can switch more quickly, but more leakage power), transistors with high threshold voltage for remainder of circuit (switches more slowly, but reduces leakage power).

exotic circuits Special flops, latches, and combinational circuitry that run at a high frequency while minimizing power

adiabatic circuits Special circuitry that consumes power on 0→1 transitions, but not 1→0 transitions. These sacrifice performance for reduced power.

clock trees Up to 30% of total power can be consumed in clock generation and clock tree


Digital Parameters

Power-reduction parameters at the digital level.

capacitance (number of gates)

activity factor

clock frequency


Digital Techniques

Power-reduction techniques at the digital level.

multiple clocks Put a high speed clock in performance-critical parts of design and a low speed clock for remainder of circuit

clock gating Turn off clock to portions of a chip when it’s not being used

data encoding Gray coding vs one-hot vs fully encoded vs ...

glitch reduction Adjust circuit delays or add redundant circuitry to reduce or eliminate glitches.

asynchronous circuits Get rid of clocks altogether....

Additional low-power design techniques for RTL from a Qualis engineer: http://home.europa.com/~celiac/lowpower.html

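6.4 Voltage Reduction for Power Reduction

If our goal is to reduce power, the most promising approach is to reduce the supply voltage, because, from:

$$\begin{aligned}
\text{Power} = {} & \left(\text{ActFact} \times \text{ClockSpeed} \times \tfrac{1}{2}\,\text{CapLoad} \times \text{VoltSup}^2\right) \\
 & + \left(\text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}\right) \\
 & + \left(\text{ILeak} \times \text{VoltSup}\right)
\end{aligned}$$

we observe:

$$\text{Power} \propto \text{VoltSup}^2$$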


Reducing Difference Between Supply and Threshold Voltage

As the supply voltage decreases, it takes longer to charge up the capacitive load, which increases the load delay of a circuit.

In the chapter on timing analysis, we saw that increasing the supply voltage will decrease the delay through a circuit. (From V = IR, increasing V causes an increase in I, which causes the capacitive load to charge more quickly.) However, it is more accurate to take into account both the value of the supply voltage, and the difference between the supply voltage and the threshold voltage.

$$\text{MaxClockSpeed} \propto \frac{(\text{VoltSup} - \text{VoltThresh})^2}{\text{VoltSup}}$$

Effect of Decreasing Supply Voltage on Delay

Question: If the delay along the critical path of a circuit is 20 ns, the supply voltage is 2.8 V, and the threshold voltage is 0.7 V, calculate the critical path delay if the supply voltage is dropped to 2.2 V.
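One way to work this question is to treat delay as inversely proportional to MaxClockSpeed and take the ratio of the two operating points, so the unknown proportionality constant cancels. The sketch below assumes the threshold voltage stays at 0.7 V; the function names are illustrative.

```python
def relative_speed(volt_sup, volt_thresh):
    """MaxClockSpeed up to a constant factor: (VoltSup - VoltThresh)^2 / VoltSup."""
    return (volt_sup - volt_thresh) ** 2 / volt_sup


def scaled_delay(delay, v_old, v_new, volt_thresh):
    """Delay after changing the supply voltage, with delay ~ 1 / MaxClockSpeed."""
    return delay * relative_speed(v_old, volt_thresh) / relative_speed(v_new, volt_thresh)


new_delay_ns = scaled_delay(20.0, 2.8, 2.2, 0.7)
```

The speed ratio is (2.1² / 2.8) / (1.5² / 2.2) = 1.54, so the 20 ns path slows to about 30.8 ns.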


Reducing Threshold Voltage Increases Leakage Current

If we reduce the supply voltage, we want to also reduce the threshold voltage, so that we do not increase the delay through the circuit. However, as threshold voltage drops, leakage current increases:

$$\text{ILeak} \propto e^{\left(\frac{-q \times \text{VoltThresh}}{k \times T}\right)}$$

And increasing the leakage current increases the power:

$$\text{Power} \propto \text{ILeak}$$

So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on reducing power), and increasing ILeak, which has a linear effect on increasing power.


6.5 Data Encoding for Power Reduction

6.5.1 How Data Encoding Can Reduce Power

Data encoding is a technique that chooses data values so that normal execution will have a low activity factor.

The most common example is “Gray coding” where exactly one bit changes value each clock cycle when counting.


Decimal    Gray    Binary
0          0000    0000
1          0001    0001
2          0011    0010
3          0010    0011
4          0110    0100
5          0111    0101
6          0101    0110
7          0100    0111
8          1100    1000
9          1101    1001
10         1111    1010
11         1110    1011
12         1010    1100
13         1011    1101
14         1001    1110
15         1000    1111
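The Gray column of the table can be generated with the standard reflected-binary conversion, and the switching activity of a counter can be counted directly. The helper names below are illustrative.

```python
def to_gray(n):
    """Reflected-binary Gray code of n."""
    return n ^ (n >> 1)


def toggles_per_cycle(encode, bits):
    """Total bit flips over one full count cycle, including the wrap-around
    from the last value back to zero."""
    size = 1 << bits
    codes = [encode(n) for n in range(size)]
    return sum(bin(codes[n] ^ codes[(n + 1) % size]).count("1") for n in range(size))


binary_flips = toggles_per_cycle(lambda n: n, 4)
gray_flips = toggles_per_cycle(to_gray, 4)
```

For the 4-bit table above, a binary counter makes 30 bit flips per 16 counts and a Gray-code counter makes 16. Only the flops holding the count are considered here; the next-state logic also contributes to power.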


8-bit Counter

Question: For an eight-bit counter, how much more power will a binary counter consume than a Gray-code counter?


Random Data

Question: For completely random eight-bit data, how much more power will a binary circuit consume than a Gray-code circuit?

6.5.2 Example Problem: Sixteen Pulser

6.5.2.1 Problem Statement

Your task is to do the power analysis for a circuit that should send out a one-clock-cycle pulse on the done signal once every 16 clock cycles. (That is, done is ’0’ for 15 clock cycles, then ’1’ for one cycle, then repeat with 15 cycles of ’0’ followed by a ’1’, etc.)

[Figure: required behaviour: waveforms for clk and done over clock cycles 1 to 33]

You have been asked to consider three different types of counters: a binary counter, a Gray-code counter, and a one-hot counter. (The table below shows the values from 0 to 15 for the different encodings.)

Question: What is the relative amount of power consumption for the different options?


6.5.2.2 Additional Information

Your implementation technology is an FPGA where each cell has a programmable combinational circuit and a flip-flop. The combinational circuit has 4 inputs and 1 output. The capacitive load of the combinational circuit is twice that of the flip-flop.

[Figure: FPGA cell containing a PLA and a flip-flop]

1. You may neglect power associated with clocks.

2. You may assume that all counters:

(a) are implemented on the same fabrication process

(b) run at the same clock speed

(c) have negligible leakage and short-circuit currents


Data Encoding

Decimal    Gray    One-Hot             Binary
0          0000    0000000000000001    0000
1          0001    0000000000000010    0001
2          0011    0000000000000100    0010
3          0010    0000000000001000    0011
4          0110    0000000000010000    0100
5          0111    0000000000100000    0101
6          0101    0000000001000000    0110
7          0100    0000000010000000    0111
8          1100    0000000100000000    1000
9          1101    0000001000000000    1001
10         1111    0000010000000000    1010
11         1110    0000100000000000    1011
12         1010    0001000000000000    1100
13         1011    0010000000000000    1101
14         1001    0100000000000000    1110
15         1000    1000000000000000    1111


6.5.2.3 Answer

Sketch the Circuitry

Name the output “done” and the count digits “d()”.


Capacitance

                         cap    number    subtotal cap
Gray      d()     PLAs
                  Flops
          done    PLAs
                  Flops
1-Hot     d()     PLAs
                  Flops
          done    PLAs
                  Flops
Binary    d()     PLAs
                  Flops
          done    PLAs
                  Flops


Activity Factors

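Gray Coding Activity Factor

[Figure: waveforms for clk, d(0) to d(3), and done with Gray coding]

Taking d(0) as the least significant bit:

signal    activity factor
d(0)      8/16
d(1)      4/16
d(2)      2/16
d(3)      2/16
done      2/16


One-Hot Activity Factor

[Figure: waveforms for clk, d(0), d(1), d(2), ..., and done with one-hot coding]

signal                  activity factor
d(0), d(1), d(2), ...   2/16 each
done                    2/16


Binary Coding Activity Factor

[Figure: waveforms for clk, d(0) to d(3), and done with binary coding]

Taking d(0) as the least significant bit:

signal    activity factor
d(0)      16/16
d(1)      8/16
d(2)      4/16
d(3)      2/16
done      2/16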


Putting it all Together

                         subtotal cap    act fact    power
Gray      d()     PLAs
                  Flops
          done    PLAs
                  Flops
          Total
1-Hot     d()     PLAs
                  Flops
          done    PLAs
                  Flops
          Total
Binary    d()     PLAs
                  Flops
          done    PLAs
                  Flops
          Total


6.6 Clock Gating

The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn’t needed. This reduces the activity factor.

6.6.1 Introduction to Clock Gating

Examples of Clock Gating

Condition                                            Circuitry turned off
O/S in standby mode                                  Everything except “core” state (PC, registers, caches, etc)
No floating point instructions for k clock cycles    Floating point circuitry
Instruction cache miss                               Instruction decode circuitry
No instruction in pipe stage i                       Pipe stage i


6.6.2 Implementing Clock Gating

Clock gating is implemented by adding a component that disables the clock when the circuit isn’t needed.

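[Figure: Without clock gating: circuit with inputs clk, i_data, i_valid and outputs o_data, o_valid]

[Figure: With clock gating: a clock enable state machine with inputs clk and i_wakeup produces clk_en, which gates clk into cool_clk for the circuit with i_data, i_valid, o_data, o_valid]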


6.6.3 Design Process

6.6.4 Effectiveness of Clock Gating

Parameters to characterize effectiveness of clock gating:

Eff      = effectiveness of clock gating
PctValid = percentage of clock cycles with valid data in the circuit (in these cycles the clock must be toggling)
PctClk   = percentage of clock cycles in which the clock toggles

Effectiveness measures the percentage of clock cycles with invalid data in which the clock is turned off. Equation for effectiveness of clock gating:

Eff = PctClkOff / PctInvalid = (1 − PctClk) / (1 − PctValid)


Clock Gating Effectiveness Questions

Question: What is the effectiveness if the clock toggles only when there is valid data?

Question: What is the effectiveness of a clock that always toggles?


Clock Gating Effectiveness Questions

Question: What does it mean for a clock gating scheme to be 75% effective?

Question: What happens if PctClk < PctValid?


Effect of Effectiveness

We can see the effect of the effectiveness of a clock-gating scheme on the activity factor:

[Figure: plot of the new activity factor A′ against Eff (from 0 to 1); A′ runs from A down to PctValid × A]

The new activity factor with a clock gating scheme is:

A′ = A− (1−PctValid)×Eff ×A
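The two formulas of this section can be checked numerically. The sketch below is illustrative only; the function names are hypothetical, and percentages are assumed to be fractions in the range 0..1 with PctValid < 1.

```python
def effectiveness(pct_clk, pct_valid):
    """Eff = (1 - PctClk) / (1 - PctValid); undefined when PctValid == 1."""
    return (1.0 - pct_clk) / (1.0 - pct_valid)

def gated_activity_factor(a, pct_valid, eff):
    """A' = A - (1 - PctValid) * Eff * A"""
    return a - (1.0 - pct_valid) * eff * a

# Clock toggles exactly when data is valid, and clock always toggles:
print(effectiveness(0.7, 0.7), effectiveness(1.0, 0.7))
# Perfect gating scales the activity factor by PctValid:
print(gated_activity_factor(0.5, 0.7, 1.0))
```

With Eff = 0 the activity factor is unchanged, and with Eff = 1 it drops to PctValid × A, which matches the two ends of the plot.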


6.6.5 Example: Reduced Activity Factor with Clock Gating

Question: How much power will be saved in the following clock-gating scheme?

• 70% of the time the main circuit has valid data

• clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock is off)

• clock gating circuit has 10% of the area of the main circuit

• clock gating circuit has same activity factor as main circuit

• neglect short-circuiting and leakage power
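One way to set up this calculation is sketched below. It is not the official solution: it assumes power is proportional to activity factor × capacitance, that capacitance is proportional to area, and that the clock gating circuit itself is never gated, so it runs at the original activity factor A. The function name is hypothetical.

```python
def relative_power_with_gating(pct_valid, eff, gating_area_ratio):
    """Power with clock gating, relative to the ungated main circuit (= 1.0)."""
    # Main circuit: A'/A = 1 - (1 - PctValid) * Eff
    main = 1.0 - (1.0 - pct_valid) * eff
    # Assumption: the gating circuit adds capacitance and toggles at activity A.
    gating = gating_area_ratio
    return main + gating

relative = relative_power_with_gating(pct_valid=0.70, eff=0.90, gating_area_ratio=0.10)
print(f"relative power {relative:.2f}, saving {1.0 - relative:.0%}")
```

Under these assumptions the main circuit drops to 0.73 of its original power, the gating circuit adds 0.10, and the net saving is about 17%.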



6.6.6 Clock Gating with Valid-Bit Protocol

6.6.6.1 Valid-Bit Protocol

Need a mechanism to tell circuit when to pay attention to data inputs

[Figure: circuit with inputs clk, i_valid, i_data and outputs o_data, o_valid; waveforms of clk, i_valid, i_data, o_data, and o_valid, with parcels α, β, γ appearing on i_data and later on o_data]

i_valid: high when i_data has valid data; signifies whether the circuit should pay attention to or ignore data.

o_valid: high when o_data has valid data; signifies whether the environment should pay attention to the output of the circuit.

For more on circuit protocols, see section 2.12.


Microscopic Analysis

Which clock edges are needed?

[Figure: circuit with i_valid, clk, and o_valid, and waveforms of clk, i_valid, and o_valid]


6.6.6.2 How Many Clock Cycles for Module?

Given a module with latency Lat, if the module receives a stream of NumPcls consecutive valid parcels, for how many clock cycles must the clock-enable signal be asserted?

[Table to fill in, with columns: Latency, NumPcls, NumClkEn]

[Figure: waveforms of i_valid, o_valid, and clk_en]








6.6.6.3 Adding Clock-Gating Circuitry

Before Clock Gating

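[Figure: circuit with inputs clk, data_in, valid_in and outputs data_out, valid_out; waveforms of clk, data_in, valid_in, data_out, and valid_out with parcels α, β, γ, δ, including don’t care and uninitialized regions]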


After Clock Gating: Circuitry

[Figure: clock enable state machine with inputs hot_clk and wakeup_in, producing clk_en and wakeup_out; clk_en gates hot_clk into cool_clk for the circuit with data_in, valid_in, data_out, valid_out]

• hot_clk: clock that always toggles

• cool_clk: gated clock, which sometimes toggles and sometimes stays low

• wakeup: alerts circuit that valid data will be arriving soon

• clk_en: turns on cool_clk


After Clock Gating: New Signals

[Figure: waveforms of data_in, valid_in, hot_clk, wakeup_in, clk_en, cool_clk, data_out, valid_out, and wakeup_out, with parcels α, β, γ, δ]

6.6.7 Example: Pipelined Circuit with Clock-Gating

Design a “clock enable state machine” for the pipelined component described below.

• capacitance of pipelined component = 200

• latency varies from 5 to 10 clock cycles, even distribution of latencies

• contains a maximum of 6 instructions (parcels of data).

• 60% of incoming parcels are valid

• average length of continuous sequence of valid parcels is 80

• use input and output valid bits for wakeup

• leakage current is negligible

• short-circuit current is negligible

• LUTs have a capacitance of 1, flops have a capacitance of 2


Waveforms for Parcel Count

[Figure: waveforms of i_valid, o_valid, parcel_count, and parcel_clk_en over clock cycles 1 to 24]

Waveforms for Cycle Count

[Figure: waveforms of i_valid, o_valid, cycle_count, and cycle_clk_en over clock cycles 1 to 24]

Summary of Design Process

Outline:

1. sketch out circuitry for parcel count and cycle count state machine

2. estimate capacitance of each state machine

3. estimate activity factor of main circuit, based on behaviour


Parcel Count Design

Need to count (0..6) parcels, therefore need 3 bits for counter.

Counter must be able to increment and decrement.

Equations for counter action (increment/decrement/no-change):

i_valid  o_valid  action
0        0        no change
0        1        decrement
1        0        increment
1        1        no change
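The action table can be read as a next-state function for the parcel counter. The following is a small behavioural sketch in Python rather than VHDL; the function name and the example trace are hypothetical, and no saturation at 0 or 6 is modelled.

```python
def next_parcel_count(count, i_valid, o_valid):
    """Next value of the parcel counter, following the action table."""
    if i_valid and not o_valid:
        return count + 1   # a parcel enters, none leaves
    if o_valid and not i_valid:
        return count - 1   # a parcel leaves, none enters
    return count           # both or neither: no change

# Hypothetical trace: three parcels enter, then they drain.
count = 0
for i_valid, o_valid in [(1, 0), (1, 0), (1, 1), (0, 1), (0, 1)]:
    count = next_parcel_count(count, i_valid, o_valid)
    print(count)
```

A count of zero means the pipeline holds no parcels, which is the condition the clock-enable logic can use to decide that the clock is not needed.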

Chapter 7

Fault Testing and Testability


7.1 Faults and Testing

7.1.1 Overview of Faults and Testing

7.1.1.1 Faults

During manufacturing, faults can occur that make the physical product behave incorrectly.

Definition: A fault is a manufacturing defect that causes a wire, poly, diffusion, or via to either break or connect to something it shouldn’t.

[Figure: good wires, shorted wires, open wire]

7.1.1.2 Causes of Faults

• Fabrication process (initial construction is bad)

– chemical mix, impurities, dust

• Manufacturing process (damage during construction)

– handling: probing, cutting, mounting

– materials: corrosion, adhesion failure, cracking, peeling

7.1.1.3 Testing

Definition Testing: The process of checking that the manufactured wafer/chip/board/system has the same functionality as the simulations.


7.1.1.4 Burn In

Definition Burn-in: The process of subjecting chips to extreme conditions (high and low temps, high and low voltages, high and low clock speeds) before and during testing.

[Figure: a wire that is about to break]

7.1.1.5 Bin Sorting

Each chip (or wafer) is run at a variety of clock speeds. The chips are grouped and labeled (binned) by the maximum clock frequency at which they will work reliably.

For example, chips coming off of the same production line might be labelled as 800 MHz, 900 MHz, and 1000 MHz.

7.1.1.6 Testing Techniques

7.1.1.7 Design for Testability (DFT)

7.1.2 Example Problem: Economics of Testing

Note: There is a tradeoff between the amount of money spent on testing chips vs dealing with (e.g. replacing) faulty chips. Usually the best tradeoff is to ship chips with a small, but non-zero probability that the chip has a fault.

7.1.3 Physical Faults


7.1.3.1 Types of Physical Faults

[Figure: a good circuit with signals a, b, c, d, next to bad circuits showing each type of physical fault:]

• open

• wired-AND bridging short

• wired-OR bridging short

• stronger-wins bridging short (b is stronger)

• short to VDD

• short to GND

7.1.3.2 Locations of Faults

Each segment of wire, poly, diffusion, via, etc is a potential fault location.

Different segments affect different gates in the fanout.

A potential fault location is a segment or segments where a fault at any position affects the same set of gates in the same way.


7.1.3.3 Layout Affects Locations

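[Figure: example circuit with signals a through i, and two alternative layouts of net b fanning out to e, g, and h: one with fault locations L1 to L4, the other with fault locations L1 to L5]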

7.1.3.4 Naming Fault Locations

Two ways to name a fault location:

pin-fault model: Faults are modelled as occurring on input and output pins of gates.

net-fault model: Faults are modelled as occurring on segments of wires.

In E&CE 327, we’ll use the net-fault model, because it is simpler to work with and is closer to what actually happens in hardware.

7.1.4 Detecting a Fault

To detect a fault, we compare the actual output of the circuit against the expected value.

7.1.4.1 Which Test Vectors will Detect a Fault?

Question: For the good circuit and faulty circuit shown below, which test vectors will detect the fault?

[Figure: Good circuit and Faulty circuit, each with signals a, b, c, d, e]


Answer:

a b c   good  faulty
0 0 0   0     0
0 0 1   1     1
0 1 0   0     0
0 1 1   1     1
1 0 0   0     0
1 0 1   1     1
1 1 0   1     0
1 1 1   1     1

Sometimes multiple test vectors will catch the same fault.

Sometimes a single test vector can catch multiple faults.

[Figure: the good circuit and a circuit with another fault, each with signals a, b, c, d, e]

Another fault

a b c   good  faulty
1 1 0   1     0      ←

The test vector 110 can catch both this fault and the previous one.

Note: Detect vs. diagnose. Testing detects faults. Testing does not diagnose which fault occurred.


7.1.5 Mathematical Models of Faults

Goal: develop reliable and predictable technique for detecting faults in circuits.

Observations:

• The possible faults in a circuit are dependent upon the physical layout of the circuit.

• A very wide variety of possible faults

• A single test vector can catch many different faults

Need: a mathematical model for faults that is abstracted from complexities of circuit layout and plethora of possible faults, yet still detects most or all possible faults.

7.1.5.1 Single Stuck-At Fault Model

Two simplifying assumptions:

1. A maximum of one fault per tested circuit (hence “single”)

2. All faults are either:

(a) stuck-at 1: short to VDD

(b) stuck-at 0: short to GND

hence, “stuck at”


Example of Stuck-At Faults

[Figure: example circuit with signals a, b, c, d, and i]

Question: If we consider all possible stuck-at faults, how many faulty circuits would we need to test for?

Question: If we consider only single-stuck-at faults, how many faulty circuits would we need to test for?
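A counting sketch for these two questions, under the assumption that the circuit has n fault locations and each location is independently fault-free, stuck-at 0, or stuck-at 1. The function names and the value n = 8 are hypothetical choices for illustration.

```python
def single_stuck_at_faults(n):
    """Exactly one location is faulty, stuck at 0 or at 1."""
    return 2 * n

def multiple_stuck_at_faults(n):
    """Every combination of ok / stuck-at-0 / stuck-at-1, minus the fault-free circuit."""
    return 3 ** n - 1

n = 8  # hypothetical number of fault locations
print(single_stuck_at_faults(n), multiple_stuck_at_faults(n))
```

The gap between linear and exponential growth is the motivation for the "single" assumption in the single stuck-at fault model.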

7.1.6 Generate Test Vector to Find a Mathematical Fault

Faults are detected by stimulating circuits (real, manufactured circuit, not a simulation!) with test-vectors and checking that the real circuit gives the correct output.

7.1.6.1 Algorithm

1. compute Karnaugh map for correct circuit

2. compute Karnaugh map for faulty circuit

3. find region of disagreement

4. any assignment in region of disagreement is a test vector that will detect fault

5. any assignment outside of region of disagreement will result in same output on both correct and faulty circuit
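For small circuits the region of disagreement can also be found by brute force, comparing the two functions on every input assignment. The sketch below is an illustration, not the slides' method: the helper name is hypothetical, and the two functions are read off the truth table in section 7.1.4.1 (good = ab + c, faulty = c).

```python
from itertools import product

def detecting_vectors(good, faulty, num_inputs):
    """All input assignments on which the good and faulty circuits disagree."""
    return [v for v in product((0, 1), repeat=num_inputs)
            if good(*v) != faulty(*v)]

def good(a, b, c):
    return (a & b) | c

def faulty(a, b, c):
    return c

print(detecting_vectors(good, faulty, 3))
```

Any vector in the returned list is a test vector for the fault; an empty list means the fault is undetectable.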


7.1.6.2 Example of Finding a Test Vector

[Figure: Good circuit and Faulty circuit, each with signals a, b, c, d, e, together with an empty Karnaugh map over a, b, c for each circuit]

Question: Find a test vector that will detect the faulty circuit.

7.1.7 Undetectable Faults

Not all faults are detectable.

1. If a circuit is irredundant then all single stuck-at faults can be detected.

A redundant circuit is one where one or more gates can be removed without affecting the functional behaviour.

2. If not trying to find all of the faults in a circuit, then a fault that you aren’t looking for can mask a fault that you are looking for.

7.1.7.1 Redundant Circuitry

Some faults are undetectable. Undetectable stuck-at faults are located in redundant parts of a circuit.


Timing Hazards

[Figure: waveforms illustrating a static hazard and a dynamic hazard]

Timing hazards are often removed by adding redundant circuitry.

Redundant Circuitry

[Figure: Irredundant circuit with signals a through g, annotated with signal value sequences, and an illustration of the timing hazard as waveforms of a through g]

Glitch on g is caused because the AND gate for e turns off before f turns on.

Redundant Circuitry

Question: Add one or more gates to the circuit so that the static hazard is guaranteed to be prevented, independent of the delay values through the gates.

[Figure: Karnaugh maps over a, b, c, and the circuit with signals a through g, annotated with signal value sequences]

Redundant Circuitry

Question: Has the redundant circuitry introduced any undetectable faults? If so, identify an undetectable fault.


7.1.7.2 Curious Circuitry and Fault Detection

Curiously, the stuck-at fault at L1 is undetectable, but faults at either L2 or L3 are detectable.

[Figure: circuit with inputs a, b, c, output z, and fault locations L1, L2, L3; a second circuit with inputs a, c and output z; Karnaugh map over a, b, c]

fault    eqn            K-map    diff w/ ckt
L2@0     a ⊕ (b ⊕ c)
L2@1     a ⊕ (b ⊕ c)


7.2 Test Generation

7.2.1 A Small Example

[Figure: circuit with inputs a, b, c, output z, and fault locations L2, L4, L5, with the Karnaugh map of the output function over a, b, c]

fault      eqn    K-map    diff w/ ckt    test vectors
1) L2@1
2) L4@1
3) L5@1


7.2.2 Choosing Test Vectors

The goal of test vector generation is to find the smallest set of test vectors that will detect the faults of interest.

Test vector generation requires analyzing the faults.

We can simplify the task of fault analysis by reducing the number of faults that we have to analyze.

Smith has examples of this in Figures 14.13 and 14.14.

7.2.2.1 Fault Domination

fault      eqn     K-map    Diff w/ ckt    test vectors
1) L5@1    ab+c                            101, 001
2) L6@1    1                               101, 001, 100, 010, 000

Definition dominates: f1 dominates f2: any test vector that detects f1 willalso detect f2.

When choosing test vectors, we can ignore the dominated fault, but must keep thedominant fault.
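Using the definition as stated on this slide, domination is a subset test on the sets of detecting test vectors. A minimal sketch with hypothetical helper names, using the two test-vector sets from the table above:

```python
def dominates(tv_f1, tv_f2):
    """f1 dominates f2 (slide definition): every vector detecting f1 also detects f2."""
    return set(tv_f1) <= set(tv_f2)

def equivalent(tv_f1, tv_f2):
    """Two faults are equivalent when exactly the same vectors detect them."""
    return set(tv_f1) == set(tv_f2)

l5_at_1 = {"101", "001"}
l6_at_1 = {"101", "001", "100", "010", "000"}

print(dominates(l5_at_1, l6_at_1))  # L5@1 dominates L6@1
print(dominates(l6_at_1, l5_at_1))  # but not the other way round
```

Keeping only L5@1 is safe because any vector chosen for it also detects L6@1, whereas a vector chosen only for L6@1 (for example 100) could miss L5@1.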

Question: To detect both L5@1 and L6@1, can we ignore one of the faults?

Question: What would happen if we ignored the “wrong” fault?


7.2.2.2 Fault Equivalence

fault      eqn    K-map    Diff w/ ckt
1) L1@1    b
2) L3@1    b

Definition fault equivalence: f1 is equivalent to f2: f1 and f2 are detected by exactly the same set of test vectors. That is, all of the test vectors that detect f1 will also detect f2, and vice versa.

When choosing test vectors we can ignore one of the faults and just include the other.

7.2.2.3 Gate Collapsing

A stuck-at-1 fault on the input to an OR gate is equivalent to a stuck-at-1 fault on the output of the OR gate.

Definition Gate collapsing: The technique of looking at the functionality of a gate and finding equivalent faults between inputs and outputs.

Sets of collapsible faults for common gates

[Figure: AND gate with @0 on each input and on the output; OR gate with @1 on each input and on the output]

Question: What is the set of collapsible faults for a NAND gate?

[Figure: NAND gate with fault labels left blank]
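Gate collapsing can be checked mechanically for a 2-input gate: inject each stuck-at fault on a pin, tabulate the resulting output function, and group the faults whose tables are identical. This is a sketch with hypothetical names (pins a, b, out), not a tool from the course.

```python
from itertools import product

GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
}

def faulty_table(gate, pin, value):
    """Truth table of a 2-input gate with one pin stuck at the given value."""
    rows = []
    for a, b in product((0, 1), repeat=2):
        if pin == "a":
            a = value
        if pin == "b":
            b = value
        out = gate(a, b)
        if pin == "out":
            out = value
        rows.append(out)
    return tuple(rows)

def collapsible_sets(gate):
    """Groups of two or more faults that produce the same faulty behaviour."""
    groups = {}
    for pin in ("a", "b", "out"):
        for value in (0, 1):
            groups.setdefault(faulty_table(gate, pin, value), []).append(f"{pin}@{value}")
    return [faults for faults in groups.values() if len(faults) > 1]

for name, gate in GATES.items():
    print(name, collapsible_sets(gate))
```

For the NAND gate the sketch groups a stuck-at-0 on either input with a stuck-at-1 on the output, since all three force the output to 1.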


7.2.2.4 Node Collapsing

Note: Node collapsing is relevant only for the pin-fault model

7.2.2.5 Fault Collapsing Summary

When calculating the test-vectors to detect a set of faults, apply the fault collapsing techniques of:

• gate collapsing

• node collapsing (if using pin-fault model)

• general fault equivalence (intelligent collapsing)

• fault domination

to reduce the number of faults that you must examine.

7.2.3 Fault Coverage

Definition Fault coverage: percentage of detectable faults that are detected by a set of test vectors.

FaultCoverage = DetectedFaults / DetectableFaults

Some people’s definition of fault coverage has a denominator of AllPossibleFaults, not just those that are detectable.


7.2.4 Test Vector Generation and Fault Detection

There are two ways to generate vectors and check results: built-in tests and scan testing.

Both require:

• generate test vectors

• override normal datapath to send test-vectors, rather than normal inputs, as inputs to flops

• compare outputs of flops to expected result

7.2.5 Generate Test Vectors for 100% Coverage

In this section we will find the test vectors to achieve 100% coverage of single stuck-at faults for the circuit of the day.

We will use a simple algorithm; there are much more sophisticated algorithms that are more efficient.

The problem of test vector generation is often called Automatic Test Pattern Generation (ATPG) and continues to be an active area of research.

[Figure: Example Circuit with Fault Locations and Karnaugh Map: inputs a, b, c, output z, fault locations L1 to L8, and the Karnaugh map of the output function over a, b, c]


7.2.5.1 Collapse the Faults

Initial circuit with potential faults:

[Figure: circuit with inputs a, b, c and output z; each fault location L1 to L8 is labelled with both stuck-at faults (@0,1)]


Gate Collapsing

[Table to fill in, with columns: gate, faults, kept fault]

For each set of equivalent faults, we will keep the fault shown in bold and eliminate the other faults. A good heuristic for choosing which fault to keep: keep the fault closest to the output. The closer a fault is to the output, the easier it is to analyze its behaviour, because the equation for the output will be simpler.


Intelligent Collapsing

1. delete faults that we previously decided could be ignored

2. by intelligent analysis of circuit, find equivalent faults

[Figure: circuit with inputs a, b, c and output z; each fault location L1 to L8 is labelled with both stuck-at faults (@0,1)]


7.2.5.2 Check for Fault Domination

fault      eqn     K-map    Diff w/ ckt
1) L2@1    a+c
2) L3@1    b
3) L4@1    a+bc
4) L5@1    ab+c
5) L6@0    bc
6) L7@0    ab
7) L8@0    0
8) L8@1    1


Remove dominated faults

Current faults:

[Figure: circuit with inputs a, b, c and output z; each fault location L1 to L8 is labelled with both stuck-at faults (@0,1)]

Dominated faults:


7.2.5.3 Required Test Vectors

Definition required test vector: A test vector tv is required if there is a fault for which tv is the only test vector that will detect the fault.


fault      eqn     K-map    Diff w/ ckt
1) L3@1    b
2) L4@1    a+bc
3) L5@1    ab+c
4) L6@0    bc
5) L7@0    ab


7.2.5.4 Faults Not Covered by Required Test Vectors


fault      eqn     K-map    Diff w/ ckt
1) L4@1    a+bc
2) L5@1    ab+c

Test vector(s) required to catch these faults:


7.2.5.5 Order to Run Test Vectors

The order in which the test vectors are run is important because it can affect how long a faulty chip stays in the tester before the chip’s fault is detected.

The first vector to run should be the one that detects the most faults.

Build a table for which faults each test vector will detect.


[Table: one row per fault (L1@0, L1@1, L2@0, L2@1, ..., L8@0, L8@1; 16 faults in total) and one column per test vector (110, 010, 011, 101), with a 1 marking each fault that the test vector detects]

Faults detected per test vector: 110: 5, 010: 5, 011: 5, 101: 6


7.2.5.6 Summary of Technique to Find and Order Test Vectors

1. identify all possible faults

2. gate collapsing

3. node collapsing

4. intelligent collapsing

5. fault domination

6. determine required test vectors

7. choose minimal set of test vectors to detect remaining faults

8. order test vectors based on number of faults detected (NOTE: when iterating through this step, need to take into account faults detected by earlier test vectors)
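Step 8 can be sketched as a greedy loop: repeatedly pick the vector that detects the most faults not yet detected by the vectors already chosen. The function name and the detection table below are hypothetical (the fault names f1..f5 are placeholders, not the faults of the example circuit).

```python
def order_test_vectors(detects):
    """Greedy ordering: most newly detected faults first.

    detects maps each test vector to the set of faults it detects.
    """
    remaining = dict(detects)
    covered = set()
    order = []
    while remaining:
        # Count only faults that earlier vectors have not already detected.
        best = max(remaining, key=lambda tv: len(remaining[tv] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order

# Hypothetical detection table, for illustration only.
detects = {
    "110": {"f1", "f2"},
    "010": {"f2", "f3", "f4"},
    "011": {"f4"},
    "101": {"f1", "f5"},
}
print(order_test_vectors(detects))
```

Ties are broken by the order in which vectors appear in the table; a real flow might break ties by test cost instead.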


7.2.6 One Fault Hiding Another

[Figure: circuit with inputs a, b, c, output z, and fault locations L1 to L8]

Assume that we are not trying to detect all faults: L1 is viewed as not being at risk for faults, but L3 is at risk for faults.

[Figure: two copies of the circuit, showing only fault locations L1 and L3]

Fault Hiding

[Figure: two copies of the circuit, showing only fault locations L1 and L3]

Problem: If L1 is stuck-at 1, the test vectors that normally detect L3@0 will not detect L3@0.

In the presence of other faults, the set of test vectors to detect a fault will change.

fault(s)        eqn    K-map    Diff w/ ckt
L3@0            ab
L1@1, L3@0      b


7.3 Scan Testing in General

7.3.1 Structure and Behaviour of Scan Testing

[Figure: Normal Circuit: the circuit under test between another circuit #0 and another circuit #1, with signals data_in(0..3) and zeta_in(0..3)]

[Figure: Circuit with Scan Chains Added: scan chain 0 (mode0, scan_in0, scan_out0) and scan chain 1 (mode1, scan_in1, scan_out1) placed around the circuit under test, between it and the neighbouring circuits]


7.3.2 Scan Chains

[Figure: Normal Circuit: the circuit under test between another circuit #0 and another circuit #1, with signals data_in(0..3) and zeta_in(0..3)]

[Figure: Circuit with Scan Chains Added: the same circuit with mode0, scan_in0, scan_out0 on one scan chain and mode1, scan_in1, scan_out1 on the other]

7.3.2.1 Circuitry in Normal and Scan Mode

[Figure: Normal Mode: the scan-chain circuitry (mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1) around the circuit under test, configured for normal operation]

[Figure: Scan Mode: the same circuitry configured for scanning]


7.3.2.2 Scan in Operation

[Figure: Circuit under test with scan chains (mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1)]

[Figure: Sequence of load; test; unload: waveforms of clk, mode0, scan_in0 (current vector 0), scan_out0, scan_in1, and scan_out1 (current results 1)]

[Figure: Load Test Vector (1 cycle per bit): current vector 0 is shifted into scan chain 0]

[Figure: Run Test Vector Through Circuit]

[Figure: Unload Result (1 cycle per bit): current results 1 are shifted out of scan chain 1]


Unload and Load at Same Time

[Figure: Unload Prev Result, Load Cur Test Vector (1 cycle per bit): current vectors 0 and 1 are shifted in while previous results 0 and 1 are shifted out]

[Figure: Run Cur Test Vector Through Circuit]

[Figure: Unload Cur Result, Load New Test Vector (1 cycle per bit): next test vectors 0 and 1 are shifted in while current results 0 and 1 are shifted out]

[Figure: Sequence of load; run; unload: waveforms of clk, mode0, scan_in0, scan_in1, scan_out0, and scan_out1, showing previous results, current vectors, current results, and next test vectors]


7.3.2.3 Scan in Operation with Example Circuit

[Figure: Circuit under test, with signals a, b, c, d, y, z]

[Figure: Circuit under test with scan test circuitry (mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1)]

The following sequence of figures steps through one test, one clock cycle per figure, showing clk, mode0, and the value held in each flop:

[Figure: Start Loading Test Vector (Load δ)]

[Figure: Load γ]

[Figure: Load β]

[Figure: Load α]

[Figure: Run Test Vector]

[Figure: Test Values Propagate: the circuit outputs become functions of α, β, γ, δ]

[Figure: Flop-In Result, Start (Un)loading Test Vector: the results are captured while the first element δ’ of the next vector is shifted in]

[Figure: Continue (Un)loading Test Vector: γ’ is shifted in]

[Figure: Continue (Un)loading Test Vector: β’ is shifted in]

[Figure: Finish (Un)loading Test Vector: α’ is shifted in]

[Figure: Run Next Test Vector: α’, β’, γ’, δ’ are applied to the circuit]

7.3.3 Summary of Scan Testing

• Adding scan circuitry

1. Registers around circuit to be tested are grouped into scan chains

2. Replace each flop with mux + flop

3. Flops and muxes wired together into scan chains

4. Each scan chain is connected to dedicated I/O pins for loading and unloading test vectors

• Running test vectors

1. Put scan chain in “scan” mode

2. Load in test vector (one element of vector per clock cycle)

3. Put scan chain in “normal” mode

4. Run circuit for one clock cycle — load result of test into flops

5. Unload results of current test vector while simultaneously loading in next test vector (one element of vector per clock cycle)
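The procedure above can be mimicked with a small behavioural model. This is a simplified sketch, not the slides' circuit: it assumes a single scan chain whose flops both drive the circuit inputs and capture its outputs, and all names (shift, run_scan_tests, the example circuit) are hypothetical.

```python
def shift(chain, scan_in):
    """One clock in scan mode: each flop takes the value of its predecessor."""
    return [scan_in] + chain[:-1], chain[-1]

def run_scan_tests(vectors, circuit):
    """Load each vector, clock the circuit once, unload while loading the next."""
    n = len(vectors[0])
    chain = [0] * n
    results, cycles = [], 0
    for index in range(len(vectors) + 1):
        # After the last vector, shift in zeros just to flush out the final result.
        load = vectors[index] if index < len(vectors) else [0] * n
        unloaded = []
        for bit in reversed(load):          # scan mode, one bit per clock cycle
            chain, out = shift(chain, bit)
            unloaded.append(out)
            cycles += 1
        if index > 0:
            results.append(unloaded[::-1])  # result of the previous vector
        if index < len(vectors):
            chain = list(circuit(chain))    # normal mode, one clock cycle
            cycles += 1
    return results, cycles

# Hypothetical 3-input, 3-output circuit under test.
def example_circuit(c):
    return [c[0] & c[1], c[0] | c[1], c[2] ^ 1]

results, cycles = run_scan_tests([[1, 1, 0], [0, 1, 1]], example_circuit)
print(results, cycles)
```

The cycle count returned by the model is NumVectors × (ScanLength + 1) + ScanLength, because every unload except the last overlaps with the next load.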

7.3.4 Time to Test a Chip

If the length (number of flops) of a scan chain is n, then it takes 2n+1 clock cycles to run a single test: n clock cycles to scan in the test vector, 1 clock cycle to execute the test vector, and n cycles to scan out the results. Once the results are scanned out, they can be compared to the expected results for a correctly working circuit.

If we run 2 or more tests (and chips generally are subjected to hundreds of thousands of tests), then we speed things up by scanning in the next test vector while we scan out the previous result.

ScanLength = number of flip flops in a scan chain
NumVectors = number of test vectors in test suite
TimeScan   = number of clock cycles to run test suite
           = NumVectors × (ScanLength + 1) + ScanLength


7.3.4.1 Example: Time to Test a Chip

An 800 MHz chip has scan chains of length 20,000 bits, 18,000 bits, 21,000 bits, 22,000 bits, and two of 15,000 bits.

500,000 test vectors are used for each scan chain.

The tests are run at 80% of full speed.

Question: Calculate the total test time.
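One way to set up this calculation is sketched below. It assumes that all scan chains are loaded and unloaded in parallel, so the longest chain determines the test time; if the chains were tested one after another, the per-chain times would be summed instead. The function name is hypothetical.

```python
def time_scan(num_vectors, scan_length):
    """TimeScan = NumVectors * (ScanLength + 1) + ScanLength, in clock cycles."""
    return num_vectors * (scan_length + 1) + scan_length

chains = [20_000, 18_000, 21_000, 22_000, 15_000, 15_000]
test_clock_hz = 0.80 * 800e6          # tests run at 80% of full speed

# Assumption: chains operate in parallel, so the longest one sets the time.
cycles = time_scan(500_000, max(chains))
seconds = cycles / test_clock_hz
print(cycles, round(seconds, 2))
```

Under the parallel-chain assumption this comes to roughly 17 seconds of scan time per chip.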

7.4 Boundary Scan and JTAG

Boundary scan originated as a technique to test wires on printed circuit boards (PCBs).

The goal was to replace “bed-of-nails” style testing with a technique that would work for high-density PCBs (lots of small wires close together).

Now used to test both boards and chip internals.

Used both on boundaries (I/O pins) and internal flops.


Boundary Scan with JTAG

Standardized by IEEE (1149) and previously by JTAG:

• 4 required signals (Scan Pins: TDI, TDO, TCK, TMS)

• 1 optional signal (Scan Pin: TRST)

• protocol to connect circuit under test to tester and other circuits

• state machine to drive test circuitry on chip

• Boundary Scan Description Language (BSDL): structural language used to describe which features of JTAG a circuit supports

JTAG circuitry is now commonly built into FPGAs and ASICs, or is part of a cell library. Rarely is a JTAG circuit custom-built as part of a larger part. So, you’ll probably be choosing and using JTAG circuits, not constructing new ones.

Using JTAG circuitry is usually done by giving a description of your printed circuit board (PCB) and the JTAG components on each chip (in BSDL) to test generation software. The software then generates a sequence of JTAG commands and data that can be used to test the wires on the circuit board for opens and shorts.

JTAG Structure

[Figure: High-level view: a chip containing the circuit under test, scan registers between the normal input pins, the circuit, and the normal output pins, and a control block driven by TDI, TDO, TCK, and TMS]

[Figure: Detailed view: boundary scan cells (BSC) forming the boundary scan register (BSR) around the circuit under test, plus the BR, IR, and IDCODE registers, the instruction decoder, and the TAP controller, connected to TDI, TDO, TCK, and TMS]


7.4.1 Scan Instructions

This is the set of required instructions; other instructions are optional.

EXTEST     Test board-level interconnect. Drive output pins of chip with hard-coded test vector. Sample results on inputs.
SAMPLE     Sample result data
PRELOAD    Load test vector
BYPASS     Directly connect TDI to TDO. This is used when several chips are daisy chained together to skip loading data into some chips.
IDCODE     Output manufacturer and part number

7.5 Built In Self Test

7.5.1 Block Diagram

[Figure: Circuit in Normal Mode: mode and i_data(0..3) enter the test generator, which drives d(0..3) into the circuit under test; outputs o_data(0..2) also feed the result checker, which produces all_ok]

[Figure: Circuit in Test Mode: the same blocks, with the test generator supplying d(0..3) and the result checker examining o_data(0..2)]


Circuit w/ BIST in Normal Mode

[Figure: BIST block diagram in normal mode: test generator (test gen LFSR) with inputs mode and i_data(0..3) and outputs d(0..3); circuit under test with outputs o_data(0..2); signature analyzers 0 to 2 producing ok(0..2); result checker producing all_ok]

Circuit w/ BIST in Test Mode

[Figure: the same BIST block diagram in test mode]


7.5.1.1 Components

Test Generator

[Figure: BIST block diagram: test generator (test gen LFSR), circuit under test, signature analyzers 0 to 2, and result checker]

• generates a pseudo-random set of test vectors

• for n output bits, generates all vectors from 1 to 2^n − 1 in a pseudo-random order

• built with a linear-feedback shift register (shift-register portion is the input flops)


Test Generator

[Figure: test generator with outputs q0, q1, q2]

Question: Why not just use a counter to generate 1..2^n − 1?


Signature Analyzer

[Figure: BIST block diagram: test generator (test gen LFSR), circuit under test, signature analyzers 0 to 2, and result checker]

• checks that the output it is examining has the correct results for the complete set of tests that are run

• only has a meaningful result at the end of the entire test sequence

• built with a linear-feedback shift register

• similar to a hash function or a lossy compression function

• if there are no faults, the signature analyzer will definitely say “ok” (no false negatives)

• if there is a fault, the signature analyzer might say “ok” or might say “bad” (false positives are possible)

• design tradeoff: more accurate signature analyzers require more hardware
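The hash-like behaviour of a signature analyzer can be illustrated with a behavioural model. This sketch assumes a 3-flop internal-XOR LFSR that starts at all zeros, with the observed bit XORed into the first flop and the feedback XORed into stage 1; these choices and all names are hypothetical, not the circuit from the slides.

```python
from itertools import product

def signature(bits, width=3, xor_stage=1):
    """Compress a stream of observed bits into a width-bit signature."""
    q = [0] * width
    for i in bits:
        fb = q[-1]
        nxt = [fb ^ i]                      # external input enters the first flop
        for k in range(1, width):
            nxt.append(q[k - 1] ^ (fb if k == xor_stage else 0))
        q = nxt
    return tuple(q)

def find_alias(length=4):
    """Find two different streams with the same signature."""
    seen = {}
    for stream in product((0, 1), repeat=length):
        sig = signature(stream)
        if sig in seen:
            return seen[sig], stream
        seen[sig] = stream
    return None

print(find_alias())
```

With 16 possible 4-bit streams and only 8 possible 3-bit signatures, two different streams must share a signature, which is why a faulty circuit can occasionally be reported as “ok”. A wider signature register makes such aliasing less likely at the cost of more hardware.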


Result Checker

[Figure: BIST block diagram: test generator (test gen LFSR), circuit under test, signature analyzers 0 to 2, and result checker]

• signature analyzers output “ok”/“bad” on every clock cycle, but the result is only meaningful at the end of running the complete set of test vectors

• the result checker looks at test vector inputs to detect the end of the test suite and outputs “all_ok” if all signature analyzers report “ok” at that moment

• implemented as an AND gate


7.5.1.2 Linear Feedback Shift Register (LFSR)

Basically, a shift register (sequence of flip-flops) with the output of the last flip-flop fed back into some of the earlier flip-flops with XOR gates.

Design parameters:

• number of flip-flops

• external or internal XOR

• feedback taps (coefficients)

• external-input or self-contained

• reset or set

[Figure: LFSR Example: three set/reset flip-flops (d0/q0, d1/q1, d2/q2) with external input i and a reset signal]


Example LFSRs

[Figure: External-XOR, input, reset: three flip-flops (d0/q0, d1/q1, d2/q2) with external input i and a reset signal]

[Figure: External-XOR, no input, set: three flip-flops with a set signal]

[Figure: Internal-XOR, input, set: three flip-flops with external input i and a set signal]

[Figure: Internal-XOR, input, reset: three flip-flops with external input i and a reset signal]

In E&CE 327, we use internal-XOR LFSRs, because the circuitry matches the mathematics of Galois fields.

External-XOR LFSRs work just fine, but they are more difficult to analyze, because their behaviour can’t be treated as Galois fields.


7.5.1.3 Maximal-Length LFSR

Definition maximal-length linear feedback shift register: An LFSR that outputs a pseudo-random sequence of all representable bit-vectors except 0...00.

Definition pseudo random: The same elements in the same order every time, but the relationship between consecutive elements is apparently random.

Maximal-length linear feedback shift registers are used to generate test vectors for built-in self test.


Maximal-Length LFSR Circuits

The figures below illustrate the two maximal-length internal-XOR linear feedback shift registers that can be constructed with 3 flops.

[Figures: two three-flop LFSRs (d0/q0, d1/q1, d2/q2) with set and no external input, each captioned “Maximal-length internal-XOR LFSR”.]

Question: Why do maximal-length LFSRs not generate the test vector 0...00?


Maximal Length LFSR Characteristics

Maximal-length LFSRs:

• set to all 1s initially

• self contained (no external i input)

[Figure: Timing diagram for a 3-flop maximal-length LFSR. Signals: clk, reset, d0, q0, d1, q1, q2, and val, over clock cycles 1 to 8. val steps through 7, 6, 4, 1, 2, 5, 3 and then repeats.]
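The behaviour of a self-contained maximal-length LFSR can be sketched in software. The sketch below is an illustration only: the function name `lfsr_states` and the integer encoding (bit k of `state` models flop q_k, bit k of `poly` is the coefficient of x^k) are assumptions, so the printed integers need not match the val row of the timing diagram. It uses the two degree-3 polynomials x^3 + x + 1 and x^3 + x^2 + 1.

```python
def lfsr_states(poly, nflops, start):
    """Return one period of a self-contained internal-XOR LFSR.

    poly   -- characteristic polynomial as a bit mask (bit k = coefficient of x^k)
    nflops -- number of flops (degree of poly)
    start  -- initial state; bit k models flop q_k
    """
    states = [start]
    state = start
    for _ in range(2 ** nflops):         # bound the loop in case start is not revisited
        state <<= 1                      # every flop passes its value to the next flop
        if state & (1 << nflops):        # last flop held a 1: the feedback line is active
            state ^= poly                # apply the XOR taps and clear the overflow bit
        if state == start:
            break
        states.append(state)
    return states


if __name__ == "__main__":
    for poly in (0b1011, 0b1101):        # x^3 + x + 1 and x^3 + x^2 + 1
        seq = lfsr_states(poly, 3, 0b111)
        print(bin(poly), seq, "period", len(seq))
    # Starting from all 0s, the feedback never fires and the state never changes.
    print(lfsr_states(0b1011, 3, 0))
```

Both polynomials visit all seven non-zero states before repeating, while an LFSR started at 0...00 stays there, which is why these LFSRs are set to all 1s initially.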

7.5.2 Test Generator

[Figure: BIST block diagram. Blocks: test generator (test gen LFSR), circuit under test, signature analyzers 0 to 2, result checker. Signals: mode, i_data(0..3), d(0..3), o_data(0..2), ok(0..2), all_ok.]

The test generator component is a maximal-length LFSR ...

[Figure: three-flop maximal-length LFSR (d0/q0, d1/q1, d2/q2) with set.]


Test Generator

The test generator component is a maximal-length LFSR with multiplexors on the inputs to each flip-flop. In test mode, the data input on each flip-flop is connected to the output of the previous flip-flop. In normal mode, the input of each flip-flop is connected to the environment.

[Figure: test generator. Three flip-flops (d0/q0, d1/q1, d2/q2) with set; mode selects between the LFSR connections and the external inputs i_d(0), i_d(1), i_d(2); outputs q0, q1, q2.]

Test Generator

[Figure: A test generator, reset not shown. Signals: mode, i_d(0), i_d(1), i_d(2), d0, d1, d2, q0, q1, q2.]


7.5.3 Signature Analyzer

There are four things that change between different signature analyzers:

• number of flops (⇑ flops =⇒ ⇑ area, ⇑ accuracy)

• choice of feedback taps: a good choice can improve accuracy (more isn’t necessarily better)

• bubbles on input to AND gate for “ok”: determined by expected result from simulating test sequence through circuit under test and LFSR of analyzer.

[Figure: BIST block diagram. Blocks: test generator (test gen LFSR), circuit under test, signature analyzers 0 to 2, result checker. Signals: mode, i_data(0..3), d(0..3), o_data(0..2), ok(0..2), all_ok.]

Signature Analyzer

This circuit:

• Two flops; most analyzers use more. The HP boards in the 1970s used 37 flops!

• Feedback taps on both flops. Different signature analyzers have different configurations of feedback taps.

• Also contains “ok” tester (AND gate). Expected output of LFSR at end of test sequence is: q0=1 and q1=1, or 01. (We know this because of the bubble on the AND gate. To see why this is the expected output of the signature analyzer, we would need to know the correct sequence of outputs of the circuit under test.)

[Figure: two-flop signature analyzer (d0/q0, d1/q1) with reset, input i, and an AND gate producing ok.]


Signature Analyzer Timing

[Figure: timing diagrams for the two-flop signature analyzer. Signals: clk, reset, i, d0, q0, d1, q1. Both flops start at 0 and the input bits arrive in the order i6, i5, i4, i3, i2, i1, i0. Flop values are XORs of input bits, abbreviated by index: 356 = i3⊕i5⊕i6, 2356 = i2⊕i3⊕i5⊕i6, etc.]


7.5.4 Result Checker

[Figure: BIST block diagram. Blocks: test generator (test gen LFSR), circuit under test, signature analyzers 0 to 2, result checker. Signals: mode, i_data(0..3), d(0..3), o_data(0..2), ok(0..2), all_ok.]

The purpose of the result checker is to check the “ok” circuit at the end of the test sequence.

[Figure: result checker. Signals: q0, q1, q2, reset, ok, all_ok.]

7.5.5 Arithmetic over Binary Fields

• Galois Fields!

• Two operations: “+” and “×”

• Two values: 0 and 1

• Bit vectors and shift-registers are written as polynomials in terms of x.

Addition

+ represents XOR

expression      result
$0 + 0$         $0$
$0 + 1$         $1$
$1 + 0$         $1$
$1 + 1$         $0$
$x + x$         $0$

Multiplication

× represents concatenating shift registers

expression          result
$x^4 \times 1$      $x^4$
$x^2 \times x^3$    $x^5$


Example

Calculate $(x^3 + x^2 + 1) \times (x^2 + x)$

$x^2 \times (x^3 + x^2 + 1) = x^5 + x^4 + x^2$

$x \times (x^3 + x^2 + 1) = x^4 + x^3 + x$

Sum: $x^5 + x^3 + x^2 + x$
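The same multiplication can be checked mechanically. This is a minimal sketch, assuming polynomials are stored as Python integers whose bit k is the coefficient of x^k; the name `gf2_mul` is made up for this illustration.

```python
def gf2_mul(a, b):
    """Multiply two polynomials over the binary field (carry-less multiply).

    Each set bit k of b contributes a copy of a shifted left by k places;
    the copies are combined with XOR, which is addition in this field.
    """
    result = 0
    shift = 0
    while b >> shift:
        if (b >> shift) & 1:
            result ^= a << shift
        shift += 1
    return result


if __name__ == "__main__":
    a = 0b1101   # x^3 + x^2 + 1
    b = 0b110    # x^2 + x
    print(bin(gf2_mul(a, b)))   # bits set at x^5, x^3, x^2, x
```

The two x^4 terms cancel because adding a term to itself gives 0.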

7.5.6 Shift Registers and Characteristic Polynomials

Each linear feedback shift register has a corresponding characteristic polynomial.

From polynomials to hardware:

• The maximum exponent denotes the number of flops

• The other exponents denote the flops that tap off of feedback line from last flop

• From the characteristic polynomial, we cannot determine whether the shift register has an external input. Stated another way, two shift registers that are identical except that one has an external input and the other does not will have the same characteristic polynomial.


Shift Regs and Polynomials

[Figures: internal-XOR shift registers with input i and reset, each labelled with its characteristic polynomial.

Three flops (q0, q1, q2):

• $p(x) = x^3$

• $p(x) = x^3 + x$

• $p(x) = x^3 + 1$

• $p(x) = x^3 + x + 1$

• $p(x) = x^3 + x^2 + x + 1$

Four flops (q0 to q3):

• $p(x) = x^4 + x^3 + x + 1$]


7.5.6.1 Circuit Multiplication

Redoing the multiplication example $(x^2 + x) \times (x^3 + x^2 + 1)$ as circuits:

[Figure: shift-register circuits for $x^2 + x$ and $x^3 + x^2 + 1$.]

$(x^2 + x) \times (x^3 + x^2 + 1)$

$= x \times (x^3 + x^2 + 1) + x^2 \times (x^3 + x^2 + 1)$

$= x^5 + x^3 + x^2 + x$

7.5.7 Bit Streams and Characteristic Polynomials

A bit stream, or bit sequence, can be represented as a polynomial.

The oldest (first) bit in a sequence of n bits is represented by $x^{n-1}$ and the youngest (last) bit is $x^0$.

The bit sequence 1010011 can be represented as $x^6 + x^4 + x + 1$:

$1010011 = 1x^6 + 0x^5 + 1x^4 + 0x^3 + 0x^2 + 1x^1 + 1x^0$

$= x^6 + x^4 + x + 1$
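As a small illustration of this convention, the sketch below converts a bit string (oldest bit first) into its polynomial. The helper name `bits_to_poly` and the `x^k` text notation are assumptions made for the example.

```python
def bits_to_poly(bits):
    """Write a bit sequence (oldest bit first) as a polynomial in x."""
    n = len(bits)
    terms = []
    for pos, bit in enumerate(bits):
        if bit == "1":
            exp = n - 1 - pos            # oldest bit has exponent n-1, youngest has 0
            if exp == 0:
                terms.append("1")
            elif exp == 1:
                terms.append("x")
            else:
                terms.append(f"x^{exp}")
    return " + ".join(terms) if terms else "0"


if __name__ == "__main__":
    print(bits_to_poly("1010011"))
```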


7.5.8 Division

With rules for multiplication and addition, we can define division.

A fundamental theorem of division defines q and r to be the quotient and remainder, respectively, of m ÷ p iff:

m(x) = q(x)× p(x)+ r(x)


Long Division

In Galois fields, we do division just as with long division in elementary school.

Given:

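$m(x) = x^6 + x^4 + x^3$

$p(x) = x^4 + x$

Calculate the quotient, q(x) and remainder r(x) for m(x) ÷ p(x):

Dividend: $x^6 + 0x^5 + 1x^4 + 1x^3 + 0x^2 + 0x^1 + 0x^0$, divisor: $x^4 + x$

1. $x^2 \times (x^4 + x) = x^6 + 1x^3$; subtracting this from the dividend leaves $1x^4$

2. $1 \times (x^4 + x) = 1x^4 + x$; subtracting this from $1x^4$ leaves $x$

Quotient $q(x) = x^2 + 1$

Remainder $r(x) = x$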


Long Division (Check)

Check result:

$m(x) = q(x) \times p(x) + r(x)$

$= (x^2 + 1) \times (x^4 + x) + x$

$= x^6 + x^3 + x^4 + x + x$

$= x^6 + x^4 + x^3$
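The long division can also be written as a short routine. This is a sketch under the assumption that polynomials are stored as integers with bit k holding the coefficient of x^k; `gf2_divmod` is a hypothetical name. Subtraction and addition are both XOR in this field.

```python
def gf2_divmod(m, p):
    """Return (quotient, remainder) of m(x) / p(x) over the binary field."""
    if p == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    deg_p = p.bit_length() - 1
    q, r = 0, m
    # Repeatedly cancel the leading term of r with a shifted copy of p.
    while r.bit_length() - 1 >= deg_p:
        shift = (r.bit_length() - 1) - deg_p
        q ^= 1 << shift                  # record x^shift in the quotient
        r ^= p << shift                  # subtract (XOR) x^shift * p(x)
    return q, r


if __name__ == "__main__":
    m = 0b1011000    # x^6 + x^4 + x^3
    p = 0b10010      # x^4 + x
    q, r = gf2_divmod(m, p)
    print(bin(q), bin(r))   # quotient x^2 + 1, remainder x
```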

7.5.9 Signature Analysis: Math and Circuits

The input to the signature analyzer is a “message”, m(x), which is a sequence of n bits represented as a polynomial.

After n shifts through an LFSR with l flops:

• The sequence of output bits forms a quotient, q(x), of length n− l

• The flops in the analyzer form a remainder, r(x), of length l

m(x) = q(x)× p(x)+ r(x)

The remainder is the signature.
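A cycle-by-cycle simulation makes the connection between the circuit and the division concrete. This is a sketch, not the course's reference model: the function name `signature_analyzer` and the integer encoding (bit k of `state` is flop q_k, bit k of `poly` is the coefficient of x^k) are assumptions. It reuses the division example m(x) = x^6 + x^4 + x^3 and p(x) = x^4 + x.

```python
def signature_analyzer(bits, poly, nflops):
    """Shift a message (oldest bit first) through an internal-XOR LFSR reset to 0.

    Returns (state, outputs): the final flop contents and the bit seen at the
    output of the last flop on each clock cycle.
    """
    state = 0
    outputs = []
    for bit in bits:
        out = (state >> (nflops - 1)) & 1    # value currently in the last flop
        outputs.append(out)
        state = (state << 1) | bit           # shift, new message bit enters flop 0
        if out:
            state ^= poly                    # feedback: XOR taps, clear overflow bit
    return state, outputs


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 0]          # x^6 + x^4 + x^3
    state, outputs = signature_analyzer(message, 0b10010, 4)
    print(bin(state))        # flops hold the remainder x
    print(outputs[4:])       # last n - l output bits form the quotient x^2 + 1
```

The first l output bits are zeros while the reset flops fill; the remaining n − l bits are the quotient, and the flop contents are the remainder.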


Test Generation: Math and Circuits

The mathematics for an LFSR without an input i :

• same polynomial as if the circuit had an input

• input sequence is all 0s


Input Streams and Error Polynomials

An input stream with an error can be represented as m(x)+ e(x)

• e(x) is the error polynomial

• bits in the message that are flipped have a coefficient of 1 in e(x)

m(x)+ e(x) = q′(x)× p(x)+ r′(x)


Input Streams and Error Polynomials

The error e(x) will be detected if it results in a different signature (remainder).

m(x) and m(x)+ e(x) will have the same remainder iff

e(x) mod p(x) = 0

That is, e(x) must be a multiple of p(x) for the error to go undetected.

The larger p(x) is, the smaller the chances that e(x) will be a multiple of p(x).
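The detection condition can be tried on concrete numbers. In this sketch the message, the polynomial x^3 + x + 1, and both error patterns are made-up values chosen for illustration, and `gf2_mod` is a hypothetical helper; polynomials are integers with bit k as the coefficient of x^k.

```python
def gf2_mod(m, p):
    """Remainder of m(x) divided by p(x) over the binary field."""
    deg_p = p.bit_length() - 1
    while m.bit_length() - 1 >= deg_p:
        m ^= p << (m.bit_length() - 1 - deg_p)   # cancel the leading term
    return m


P = 0b1011                 # p(x) = x^3 + x + 1 (assumed analyzer polynomial)
M = 0b1000011              # m(x), an arbitrary 7-bit message
E_MISSED = P << 2          # e(x) = x^2 * p(x), a multiple of p(x)
E_CAUGHT = 0b0110100       # e(x) = x^5 + x^4 + x^2, not a multiple of p(x)

if __name__ == "__main__":
    good = gf2_mod(M, P)
    for name, e in (("missed", E_MISSED), ("caught", E_CAUGHT)):
        bad = gf2_mod(M ^ e, P)              # signature of the corrupted stream
        print(name, "e mod p =", bin(gf2_mod(e, P)), "detected:", bad != good)
```

Flipping the bits of E_MISSED leaves the signature unchanged, so that error would be reported as “ok”.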


BIST for a Simple Circuit

Outline of steps to see if a fault will be detected by BIST:

1. Output sequence from test generator

2. Output sequence from correct circuit

3. Remainder for signature analyzer with correct output sequence

4. Output sequence from faulty circuit

5. Remainder for signature analyzer with faulty output sequence

6. Compare correct and faulty remainder, if different then fault detected


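Components

[Figure: components of the BIST example. Circuit under test with inputs a, b, c, output z, and lines L1 to L8; test generator flops t0, t1, t2; signature analyzer flops r0, r1, r2 with input z. Blank worksheet tables: t0 t1 t2 / a b c with correct and faulty z columns, and z r0 r1 r2 for the correct and faulty circuits.]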


Question: Determine if L2@1 will be detected

Test Generation Sequence

[Table: test generation sequence for t0, t1, t2, starting from the initial values. The final values are a repeat of the initial values.]

Technique is to shift; then compute result of XORs

Equation for correct circuit: ab+bc

Equation for faulty circuit: a+ c

Output sequences for correct and faulty circuits

[Table: the vectors from the test generation sequence (t0 t1 t2 = a b c) with the output z of the correct and faulty circuits. Output sequences from the circuits: correct 1110000, faulty 1111110.]


Signature analyzer sequence for correct circuit

[Table: columns z, r0, r1, r2. The output sequence from the correct circuit, 1110000, is shifted into the analyzer from initial values 0 0 0; the final r0 r1 r2 row is the remainder.]

Signature analyzer sequence for faulty circuit

[Table: columns z, r0, r1, r2. The output sequence from the faulty circuit is shifted into the analyzer from initial values 0 0 0; the final r0 r1 r2 row is the remainder.]


7.6 Scan vs Self Test

Scan

⇑ less hardware

⇓ slower

⇑ well defined coverage

⇑ test vectors are easy to modify

Self Test

⇓ more hardware

⇑ faster

⇓ ill defined coverage

⇓ test vectors are hard to modify

Chapter 8

Review

This chapter lists the major topics of the term. The “Topics List” section for each major area is meant to be relatively complete.

8.1 Overview of the Term

• The purely digital world

– VHDL

– design and optimization methods

– functional verification

– performance analysis

• Analog effects in the digital world

– timing analysis

– power

– faults and testing

8.2 VHDL

8.2.1 VHDL Topics

• simple syntax and semantics: things that you should know simply by having done the labs and project

• behavioural semantics of VHDL

• synthesis semantics of VHDL

• synthesizable and unsynthesizable code


8.2.2 VHDL Example Problems

• identify whether a particular signal will be the output of combinational circuitry or a flop

• identify whether a particular process is combinational or clocked

• legal, synthesizable, and good code

• perform delta-cycle simulation of VHDL

• perform RTL simulation of VHDL

• identify whether two VHDL fragments have same behaviour

• match VHDL code with waveforms

• match VHDL code with hardware

• choose the VHDL fragment that generates smaller or faster hardware

8.3 RTL Design Techniques

8.3.1 Design Topics

• coding guidelines

• generic FPGA hardware

• area estimation

• finite state machines

– implicit

– explicit-current

– explicit-current+next

• from algorithm to hardware

– dependency graph

– dataflow diagram

– scheduling

– input/output allocation

– register allocation

– datapath allocation

– hardware block diagram

– state machine

• memory dependencies

• memory arrays and dataflow diagrams


8.3.2 Design Example Problems

• choose design guidelines to follow in different situations

• estimate area to implement a circuit in an FPGA

• calculate resource usage for a dataflow diagram

• calculate performance data for a dataflow diagram

• given an algorithm, design a dataflow diagram

• given a dataflow diagram, design the datapath and finite state machine

• optimize a dataflow diagram to improve performance or reduce resource usage

• given a dataflow diagram, calculate the clock period that will result in the maximum performance

8.4 Functional Verification

8.4.1 Verification Topics

• test cases

• measuring coverage

• time for verification

• test benches

• assertions

• coverage monitors

• relational specification

• functional specification

• boundary conditions / corner cases


8.4.2 Verification Example Problems

• choose first cases to test

• identify corner cases

• choose technique to detect bug (test case, assertion/test bench)

• determine whether a code change will cause a bug

• identify a test case and either assertion or test bench to catch a bug

8.5 Performance Analysis and Optimization

8.5.1 Performance Topics

• time to execute a program

• definition of performance

• speedup

• n% faster

• calculating performance of different tasks and of average task

• choosing which task to optimize to best improve overall performance

• CPI calculations

• performance increase over time

• design tradeoffs (CPI vs NumInsts vs ClockSpeed vs time-to-market)

• MIPS calculations

• Clock speed vs. performance

• Optimality — performance / area tradeoffs


8.5.2 Performance Example Problems

• calculate performance / area tradeoffs

• calculate performance / time tradeoffs

• compare performance data between products

• evaluate performance criteria

8.6 Timing Analysis

8.6.1 Timing Topics

• circuit parameters that affect delay

– clock period

– clock skew

– clock jitter

– propagation delay

– load delay

– setup time

– hold time

– clock-to-Q time

• timing analysis of latch

• timing analysis of master-slave flip-flop

• timing analysis of hierarchical storage device

• critical path and false path

– algorithm to find critical path

– algorithm to determine if path is false or critical

– signal assignment to exercise critical path

• Elmore timing model

• derating factors


8.6.2 Timing Example Problems

• timing parameters for minimum clock period

• timing parameters for hold constraint

• find the critical path and assignment to exercise it

• compute Elmore delay constant

• compare accuracy of different timing models

• determine if a storage device will work correctly

• compute timing parameters of storage device

• identify timing violation, suggest remedy

• suggest design change to increase clock speed

8.7 Power

8.7.1 Power Topics

• power vs energy

• equations for power

– dynamic power

– static power

– switching power

– short circuit power

– leakage power

– activity factor

– leakage current

– threshold voltage

– supply voltage

• analog power reduction techniques

• RTL power reduction techniques

– data encoding

– clock gating


8.7.2 Power Example Problems

• predict effect of new fabrication process on power

• predict effect of environment change (temp, supply voltage, etc) on power consumption

• predict effect of design change on power consumption (capacitance, activity factor)

• design data-encoding scheme for a circuit, predict effect on power consumption

• design clock gating scheme for a circuit, predict effect on power consumption

• assess validity of various power- or energy-consumption metrics

8.8 Testing

8.8.1 Testing Topics

• causes of faults

• locations of faults

• physical faults

• single stuck-at fault model

• testable / untestable fault

• economics of testing

• fault coverage

• test vector generation

• order test vectors to reduce test time

• behaviour of a scan chain

• time to run a scan test

• JTAG

• built-in self-test

• linear feedback shift register

• signature analyzer

• Galois fields

• process and time to run a BIST test


8.8.2 Testing Example Problems

• compute optimal amount of testing to maximize profits

• compute coverage for a given set of test vectors

• find test vectors to catch a set of faults, choose order to run test vectors

• determine if a fault is detectable

• choose an LFSR to use for BIST test generation

• choose an LFSR to use for BIST signature analysis

• determine if a given BIST will catch a given fault

• determine probability that a given BIST technique will report that a faulty circuit is correct

• determine if a given fault-testing scheme will detect a physical fault

• match LFSR to characteristic polynomial

• match BIST hardware to Galois mathematics

• perform Galois field mathematics, compare to waveforms


8.9 Formulas to be Given on Final Exam

$$T = \frac{Ins \times C}{F}$$

$$Pf = \frac{W}{T}$$

$$S = \frac{T_1}{T_2}$$

$$M = \frac{F / 10^6}{\sum_{i=0}^{n} PI_i \times C_i}$$

Formulas II

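$$P = \frac{1}{2}(A \times C_L \times V^2 \times F) + (\tau \times A \times V \times I_{Sh} \times F) + (V \times I_L)$$

$$q = 1.60218 \times 10^{-19}\,\mathrm{C}$$

$$k = 1.38066 \times 10^{-23}\,\mathrm{J/K}$$

$$F \propto \frac{(V - V_{Th})^2}{V}$$

$$I_L \propto e^{\frac{-q \times V_{Th}}{k \times T}}$$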
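The power equation can be evaluated term by term, as in the sketch below. The symbol readings are assumptions (A as activity factor, C_L as load capacitance, V as supply voltage, F as clock frequency, τ as short-circuit duration, I_Sh as short-circuit current, I_L as leakage current), and every numeric value is invented for illustration.

```python
def power_terms(a, c_load, v, f, tau, i_short, i_leak):
    """Return (switching, short_circuit, leakage) power in watts."""
    switching = 0.5 * a * c_load * v ** 2 * f
    short_circuit = tau * a * v * i_short * f
    leakage = v * i_leak
    return switching, short_circuit, leakage


if __name__ == "__main__":
    # Hypothetical operating point: 1 nF switched capacitance at 100 MHz.
    base = dict(a=0.1, c_load=1e-9, f=100e6, tau=1e-10, i_short=1e-3, i_leak=1e-4)
    for v in (1.0, 2.0):
        sw, sc, lk = power_terms(v=v, **base)
        print(f"V={v}: switching={sw:.4g} short-circuit={sc:.4g} "
              f"leakage={lk:.4g} total={sw + sc + lk:.4g}")
```

With everything else held fixed, doubling V quadruples the switching term and doubles the other two terms.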
