Section I

Introduction to Programmable Logic Devices

Programmable Logic Device Families
Source: Dataquest

• Logic
  – Standard Logic
  – ASIC
    • Programmable Logic Devices (PLDs): SPLDs (PALs), CPLDs, FPGAs
    • Gate Arrays
    • Cell-Based ICs
    • Full Custom ICs

Acronyms
• SPLD = Simple Prog. Logic Device
• PAL = Prog. Array of Logic
• CPLD = Complex PLD
• FPGA = Field Prog. Gate Array

Common Resources
• Configurable Logic Blocks (CLB)
  – Memory Look-Up Table
  – AND-OR planes
  – Simple gates
• Input / Output Blocks (IOB)
  – Bidirectional, latches, inverters, pullup/pulldowns
• Interconnect or Routing
  – Local, internal feedback, and global

CPLDs and FPGAs

               CPLD (Complex Programmable Logic Device)   FPGA (Field-Programmable Gate Array)
Architecture   PAL/22V10-like; more combinational         Gate array-like; more registers + RAM
Density        Low-to-medium; 0.5-10K logic gates         Medium-to-high; 1K to 500K system gates
Performance    Predictable timing; up to 200 MHz today    Application dependent; up to 135 MHz today
Interconnect   “Crossbar”                                 Incremental

Not shown: Simple PLD (SPLD) Architecture

PLD Industry Growth
Programmable Logic vs. Semi-Custom ASIC Market

                               1996           2001
Mask Programmed Gate Arrays    $5.6B (59%)    $7.4B (47%)
Programmable Logic Share       $1.9B (20%)    $5.8B (37%)
Standard Logic                 $2.0B (21%)    $2.6B (16%)
Total Market                   $9.5B          $15.8B

Source: Dataquest, May 1997

Who is Xilinx?
• World’s leading innovator of complete programmable logic solutions
• Inventor of the Field Programmable Gate Array
• $600M annual revenues; 35+% annual growth
• Fabless* semiconductor and software company
  – UMC (Taiwan) {*Xilinx acquired an equity stake in UMC in 1996}
  – Yamaha (Japan)
  – Seiko Epson (Japan)
[Pictured: programmable logic chips; Foundation and Alliance Series design software.]

Xilinx vs. Competitors
1997 Calendar Year Revenues
[Bar chart: revenues in $ millions (axis 0 to 700) for Altera, Xilinx, Vantis, Lattice, Actel, Lucent, Cypress, Atmel, and QuickLogic.]
Source: Company reports & In-Stat. Includes SPLD, CPLD, FPGA revenues.

FPGA Market Share Q4 1997
• Xilinx: 55%
• Actel: 16%
• Altera: 14%
• Lucent: 10%
• Others: 5%
Source: In-Stat Research, March 1998. Altera number includes both 8K and 10K families.

Process & Density Leadership
[Chart: transistor count in millions (axis 7.5 to 75) by quarter, 4Q97 to 4Q98.]
• XC40125XV: industry’s 1st 0.25µ PLD, ~250K gates, 5 LM
• XC40150XV
• XC40250XV: ~500K gates, 0.25µ process
• Virtex: 1 million gates

Xilinx Integrated Circuit Products
• XC9500: Flash-based in-system programmable CPLDs
  – Lowest price, best pin locking, 600 - 7K gates
• XC4000: Industry’s largest & fastest FPGAs
  – XC4000E: 0.5µ, 5V, 5K - 40K gates
  – XC4000EX: 0.5µ, 5V, 45K - 60K gates
  – XC4000XL: 0.35µ, 3.3V devices, 5V compatible I/O, 3K - 180K gates
  – XC4000XV: 0.25µ, 2.5V / 3.3V, 5V compatible I/O, 250K - 500K gates
  – Spartan: 0.5µ, 5V, low cost, 10K - 40K gates
• Virtex: New FPGA architecture in 1998
  – 0.25µ, 5LM, 250K-1M gates, Select & Block-RAM
• XC6200: Reconfigurable Processing Unit
  – Dynamically and partially reconfigurable
• Low-cost solutions (Industry)
  – XC3000 (no RAM), XC5200 (no RAM), HardWire
[Table columns: Research, Upper Level Class, Core Class. The X marks per product are not recoverable from this copy.]
* Gates are in terms of system-level gates

XC9500 CPLDs

• 5 volt in-system programmable (ISP) CPLDs

• 5 ns pin-to-pin

• 36 to 288 macrocells (6400 gates)

• Industry’s best pin-locking architecture

• 10,000 program/erase cycles

• Complete IEEE 1149.1 JTAG capability

[Block diagram: JTAG port, JTAG controller, and in-system programming controller; I/O blocks; FastCONNECT switch matrix; Function Blocks 1 to 4; 3 global clocks, 1 global set/reset, 2 or 4 global tri-states.]

Xilinx XC4000 Architecture
[Block diagram: array of Configurable Logic Blocks (CLBs) joined by switch matrices and programmable interconnect, surrounded by I/O Blocks (IOBs). IOB detail: pad, input and output buffers, input and output flip-flops, delay, slew rate control, passive pull-up/pull-down. CLB detail: F, G, and H function generators; inputs F1-F4, G1-G4, and C1-C4 (H1, DIN, S/R, EC); clock K; two flip-flops with S/R control and clock enable; outputs X and Y.]
• High density: up to 1M system gates
• SRAM-based LUT for synchronous dual-port RAM or logic
• ASIC-like array structure
• Built-in tri-states
• Infinite reconfigurations, downloaded from PC or workstation in ~1 second

XC6200 Reconfigurable Processing Unit
[Diagram: CPU, memory, and I/O connected to the XC6200 RPU.]
• Up to 100,000 gates
• Microprocessor interface built in: “XC6200 is memory mapped to look like SRAM to a host processor”
• FastMAP™ assures high-speed direct access to all internal registers
  – All registers accessed via built-in low-skew FastMAP™ busses
• 1000x improvement in reconfiguration time from external memory
• Ultrafast partial reconfiguration (40 ns to 100’s of µsec)
• High-capacity distributed memory permits allocation of chip resources to logic or memory
  – 256 kbits in XC6264

Exponential Growth in Density
• Nov. 1997: shipping world’s largest FPGA, XC40125XV (10,982 logic cells, 250K system gates)
• 1 logic cell = 4-input LUT + FF
• 175,000 logic cells = 2.0M logic gates in 2001
[Chart: logic cells (1,000 to 1,000,000) and logic gates (12K to 12M) versus year, 1994 to 2002, reaching 2 million logic gates. Inset: a logic cell drawn as a LUT feeding a D flip-flop.]

Design Flow
1. Design entry in schematic, ABEL, VHDL, and/or Verilog. Vendors include Synopsys, Aldec (Xilinx Foundation), Mentor, Cadence, Viewlogic, and 35 others.
2. Implementation (M1 Technology) includes placement & routing and bitstream generation using Xilinx’s M1 Technology. Also analyze timing, view layout, and more.
3. Download directly to the Xilinx hardware device(s), such as the XC4000, with unlimited reconfigurations*.

*XC9500 has 10,000 write/erase cycles

Foundation Series Delivers Value & Ease of Use

• Complete, ready-to-use software solution

• Simple, easy-to-use design environment

• Easy-to-learn schematic, state-diagram, ABEL, VHDL, & Verilog design

• Synopsys FPGA Express Integration*

The Xilinx Student Edition
• Prentice Hall’s most requested new engineering product in Q1 ’98!
  – Complete, affordable, and practical digital design course environment for all students
  – Predeveloped and tested lab-based course
• Includes
  – Foundation Series 1.3 for students’ computers
  – Practical Xilinx Designer lab tutorial book
  – Coupon for XS40-005XL and XS95-108 boards ($129)
• Sold through bookstores by Prentice Hall and www.Amazon.com, listed at $79 (ISBN 0136716296)
• Integrated tutorial projects cover: TTL, Boolean logic, state machines, memories, flip-flops, timing, 4-bit and 8-bit processors
• Upgradeable for free to F1.4 Express with VHDL & Verilog, 40K gates, VHDL labs on the web

Section II

Basic PLD Architecture

Section II Agenda
• Basic PLD Architecture
  – XC9500 and XC4000 Hardware Architectures
  – Foundation and Alliance Series Software


XC9500 and XC4000 Hardware Architectures

XC9500 CPLDs

• 5 volt in-system programmable (ISP) CPLDs

• 5 ns pin-to-pin

• 36 to 288 macrocells (6400 gates)

• Industry’s best pin-locking architecture

• 10,000 program/erase cycles

• Complete IEEE 1149.1 JTAG capability

[Block diagram: JTAG port and JTAG controller; I/O blocks; Function Blocks 1 to 4; 3 global clocks, 1 global set/reset, 2 or 4 global tri-states.]

XC9500 - Architectural Features
• Uniform, all pins fast, PAL-like architecture
• FastCONNECT switch matrix provides 100% routing with 100% utilization
• Flexible function block
  – 36 inputs with 18 outputs
  – Expandable to 90 product terms per macrocell
  – Product term and global three-state enables
  – Product term and global clocks
  – Product term and global set/reset signals
• 3.3V/5V I/O operation
• Complete IEEE 1149.1 JTAG interface

XC9500 Function Block
[Diagram: 36 inputs from FastCONNECT feed an AND array and product-term allocator that drive Macrocells 1 to 18; macrocell outputs go to FastCONNECT and to the I/O. Global signals: 3 global clocks, 2 or 4 global tri-states.]
Each function block is like a 36V18!

XC9500 Product Family

Device        9536   9572   95108  95144  95216  95288
Macrocells    36     72     108    144    216    288
Usable Gates  800    1600   2400   3200   4800   6400
tPD (ns)      5      7.5    7.5    7.5    10     10
Registers     36     72     108    144    216    288
Max I/O       34     72     108    133    166    192

Packages
• 9536: VQ44, PC44
• 9572: PC44, PC84, TQ100, PQ100
• 95108: PC84, TQ100, PQ100, PQ160
• 95144: PQ100, PQ160
• 95216: PQ160, HQ208, BG352
• 95288: HQ208, BG352

XC4000 Architecture
[Block diagram: array of Configurable Logic Blocks (CLBs) joined by switch matrices and programmable interconnect, surrounded by I/O Blocks (IOBs). IOB detail: pad, input and output buffers, input and output flip-flops, delay, slew rate control, passive pull-up/pull-down. CLB detail: F, G, and H function generators; inputs F1-F4, G1-G4, and C1-C4 (H1, DIN, S/R, EC); clock K; two flip-flops with S/R control and clock enable; outputs X and Y.]

XC4000E/X Configurable Logic Blocks
[CLB diagram: inputs G1-G4, F1-F4, and C1-C4 (H1, DIN, S/R, EC); G, F, and H function generators; clock K; two D flip-flops with S/R control and clock enable; outputs Y, YQ, X, XQ.]
• 2 four-input function generators (Look Up Tables)
  – 16x1 RAM or logic function
• 2 registers
  – Each can be configured as flip-flop or latch
  – Independent clock polarity
  – Synchronous and asynchronous Set/Reset

Look Up Tables
• Combinatorial logic is stored in 16x1 SRAM Look Up Tables (LUTs) in a CLB
• Capacity is limited by number of inputs, not complexity
• Choose to use each function generator as 4-input logic (LUT) or as high-speed sync. dual-port RAM
• Example:

A B C D | Z
0 0 0 0 | 0
0 0 0 1 | 0
0 0 1 0 | 0
0 0 1 1 | 1
0 1 0 0 | 1
0 1 0 1 | 1
  . . .
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 0
1 1 1 1 | 1

[Diagram: combinatorial logic with inputs A, B, C, D and output Z maps onto the G function generator; G1-G4 form a 4-bit address, with WE used in RAM mode.]
• Number of possible 4-input functions: 2^(2^4) = 64K!
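The look-up table idea can be sketched in a few lines of Python: the four inputs form an address into 16 stored bits, so evaluating the function is a memory read. This is only an illustrative model, not Xilinx code. The names `KNOWN_ROWS`, `build_lut`, and `lut_read` are hypothetical, and the truth-table rows elided on the slide (". . .") are filled with 0 purely as a placeholder assumption.

```python
# Rows taken from the slide's truth table; the elided rows are NOT known.
KNOWN_ROWS = {
    0b0000: 0, 0b0001: 0, 0b0010: 0, 0b0011: 1,
    0b0100: 1, 0b0101: 1,
    0b1100: 0, 0b1101: 0, 0b1110: 0, 0b1111: 1,
}

def build_lut(rows, default=0):
    """Return the 16 stored bits; unknown rows get `default` (an assumption)."""
    return [rows.get(addr, default) for addr in range(16)]

def lut_read(lut, a, b, c, d):
    """Evaluate the function: A B C D is simply a 4-bit address (A is the MSB)."""
    return lut[(a << 3) | (b << 2) | (c << 1) | d]

lut = build_lut(KNOWN_ROWS)
print(lut_read(lut, 0, 0, 1, 1))  # row 0 0 1 1 of the table
print(2 ** (2 ** 4))              # count of distinct 4-input functions
```

Because any 16-bit pattern can be stored, the model also shows why capacity depends on the number of inputs rather than on how complex the function is.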

XC4000X I/O Block Diagram

Shaded areas are not included in XC4000E family.

Xilinx FPGA Routing
• 1) Fast Direct Interconnect - CLB to CLB
• 2) General Purpose Interconnect - Uses switch matrix
• 3) Long Lines
  – Segmented across chip
  – Global clocks, lowest skew
  – 2 tri-states per CLB for busses
• Other routing types in CPLDs and XC6200
[Diagram: CLBs connected through switch matrices.]

Other FPGA Resources
• Tri-state buffers for busses (BUFTs)
• Global clock & high-speed buffers (BUFGs)
• Wide decoders (DECODEx)
• Internal oscillator (OSC4)
• Global reset to all flip-flops, latches (STARTUP)
• CLB special resources
  – Fast carry logic built into CLBs
  – Synchronous dual-port RAM
  – Boundary scan

What’s Really In that Chip?
[Layout view color key: CLB (red), switch matrix, long lines (purple), direct interconnect (green), routed wires (blue), programmable interconnect points, PIPs (white).]

XC4000XL Family

                        4005XL   4010XL   4013XL   4020XL   4028XL
Logic Cells             466      950      1,368    1,862    2,432
Typ Gate Range*         3-9K     7-20K    10-30K   13-40K   18-50K
  (Logic + Select-RAM)
Max. RAM bits           6K       13K      18K      25K      33K
  (no Logic)
I/O                     112      160      192      224      256

Initial packages (listed by package row; the per-device column alignment is lost in this copy)
  PC84 PC84
  PQ100 PQ100
  PQ160 PQ160 PQ160 PQ160
  PQ208 PQ208 PQ208 PQ208 HQ208
  PQ240 PQ240 HQ240
  BG256 BG256 BG256
  BG352 BG352

                        4036XL   4044XL   4052XL   4062XL   4085XL   40125XV
Logic Cells             3,078    3,800    4,598    5,472    7,448    10,982
Typ Gate Range*         22-65K   27-80K   33-100K  40-130K  55-180K  78-250K
  (Logic + Select-RAM)
Max. RAM bits           42K      51K      62K      74K      100K     158K
  (no Logic)
I/O                     288      320      352      384      448      544

Initial packages (listed by package row; the per-device column alignment is lost in this copy)
  HQ208
  HQ240 HQ240 HQ240 HQ240
  BG352
  BG432 BG432 BG432 BG432
  PG411 PG411 PG411 PG475 PG559 PG559
  BG560 BG560 BG560 BG560

* 25-30% of CLBs as RAM

HardWire™
• Unique no-risk, 100% compatible, mask-programmed cost reduction of Xilinx FPGA
• Cost-effective for volume applications
  – Savings of 40% to 70%
• Architecture-equivalent mask-programmed version of any FPGA
  – Requires virtually no customer engineering resources, test vectors, or simulation
  – ALL FPGA features (e.g., Configuration, Power-On Reset, JTAG, etc.) are fully supported
[Diagram: FPGA converted to HardWire.]

HardWire Methodology vs. Gate Array Conversion
[Flow diagram. Typical gate array design phases (gate array redesign path, with iterations): capture, verification, place and route, verification, test development, ATPG, prototypes. Xilinx HardWire methodology: FPGA design, physical data base (.LCA file) conversion, production-ready prototypes.]

Cost Reduction & Density Increases
[Chart: cost versus density for 1996, 1997, and 1998. Density axis in logic cells / logic gates: 0.4K / 5,000; 3K / 36,000; 7.5K / 85,000; 20K / 250,000. Families shown: XC5200, XC4000E, XC4000EX (XC4036EX), XC4000XL (XC4085XL), XC4000XV (XC40250XV, 500K system-level gates), Virtex Series (1M gates*), and HardWire.]

* Starting with Virtex, Xilinx numbering scheme reflects approximate Logic + RAM gates rather than Logic gates only.

CPLD or FPGA?

CPLD
• Non-volatile
• JTAG testing
• Wide fan-in
• Fast counters, state machines
• Combinational logic
• Small student projects, lower level courses

FPGA
• SRAM reconfiguration
• Excellent for computer architecture, DSP, registered designs
• ASIC-like design flow
• Great for first year to graduate work
• More common in schools
• PROM required for non-volatile operation


Foundation and Alliance Series Software

Xilinx M1-Based Software
• Foundation Series: complete, ready-to-use
  – Includes schematic, simulation, VHDL and Verilog synthesis
• ALLIANCE Series: libraries and interfaces for leading EDA vendors
• Software backplane
• Core implementation software: map, place, route, bitstream generation, and analysis
• Graphical user interface is very similar to XACTStep v.6.0

Design Tools
• Standard CAE entry and verification tools
• Xilinx Implementation software implements the design
  – The design is optimized for best performance and minimal size
  – Graphical User Interface and Command Line Interface
  – Easy access to other Xilinx programs
  – Manages and tracks design revisions
[Flow diagram: Design Entry (schematic, state machine, HDL code, LogiBLOX, CORE Gen) in Foundation or Alliance; functional simulation in the simulator; Design Implementation in the Xilinx M1 Design Manager; back annotation; Verification (static timing analysis, in-circuit testing).]

Multi-Source Integration, Mixed-Level Flows
[Diagram labels: Design Source Integration (HDL, schematic, existing designs, cores); Standards Based (EDIF, VHDL, Verilog, SDF); Checkpoint Verification; Knowledge-Driven Implementation.]
• Enables multiple sources and multiple EDA vendors in the same flow
• Allows team development
• Reduces design source translations
• Design the way you are used to
• Enables rapid, accurate iterations
• Works well within existing ASIC flows
• Facilitates design reuse

3rd Party Support & Libraries
• Xilinx 3rd party design entry & simulation support
  – Synopsys, Cadence, Mentor Graphics, Aldec (Foundation)
  – Viewlogic, Synplicity, OrCad, Model Technologies, Synario, Exemplar and others supply libs & interfaces
  – Industry standard file formats:
    • VHDL, Verilog, and EDIF netlist formats
    • SDF Standard Delay files
    • VITAL library support
• Xilinx Libraries
  – Optimized components for use in any Xilinx FPGA or CPLD
  – Wide range of functions
    • Comparators, arithmetic functions, memory
    • DSP and PCI interfaces
  – Easy to use with ABEL, VHDL, Verilog, schematic entry

Libraries, Macros & Attributes
• Libraries are common design sets for all design entry tools (e.g., text, schematic, Foundation, Synopsys, Viewlogic, etc.)
• Library “interfaces” are specific to each front end
• Attributes are library element properties
• Online “Libraries Guide” has full listings and descriptions
  – Unified Libraries:
    • Boolean functions, TTL, flip-flops, adders, RAM, small functions
  – LogiBLOX Libraries:
    • Variable size blocks of adders, registers, RAM, ROM, etc.
    • Properties defined as attributes

Core Design Technology
Optimal Core Creation & Flexible Core Delivery
• Parameterizable cores
• Data sheets
• CoreLINX: web mechanism to download new cores
• SystemLINX: third-party system tools directly linked with Core Generator

Foundation Series Express Overview
• Easy to use, yet powerful
• Based on industry standards, not proprietary languages
• Features:
  – Schematic (partnership with Aldec)
  – IEEE VHDL, Verilog, ABEL
  – State Diagram Editor
  – Interactive simulation
  – Exclusive partnership with Synopsys, the synthesis leader
[Logos: Aldec, Synopsys, Xilinx.]

Foundation Project Manager
• Integrates all tools into one environment

Schematic Entry

ABEL and VHDL Text Entry
• From the schematic menu (or via the HDL Editor), select Hierarchy -> New Symbol Wizard… to create a symbol.
• Select HDL Editor & Language Assistant to learn by example, then define the block.
• Synthesize to EDIF.

State Machine Graphical Editor
• Graphical editor synthesizes into ABEL or VHDL code

Simulation - Easy to Use and Learn
• Generate stimulus easily and quickly
  – Keyboard toggling
  – Simple clock stimulus
  – Custom formulas
• Easy debugging
  – Waveform viewer
  – Signals easily added and removed
  – Simulator access from schematic
  – Color-coded values on schematic
• Script Editor

Foundation Express 1.4 Features
• Express Technology
  – Optimizes the design for Xilinx architectures
  – Optimized arithmetic functions
  – Automatic global signal mapping
  – Automatic I/O pad mapping
  – Resource sharing
  – Hierarchy control
  – Source code compatible with Synopsys Design Compiler and FPGA Compiler
  – Verilog (IEEE 1364) and VHDL (IEEE 1076-1987) support
  – Easy, graphical constraint entry
  – F1.4 is stand-alone
• F1.5: Sept / Oct ’98
  – Integrated into Foundation Project Manager
  – Replaces Metamor

Xilinx-Express Design Flow
[Flow diagram. Foundation design entry tools: HDL Editor, State Diagram Editor, Schematic Capture, DSP COREGen & LogiBLOX module generator, gate-level simulator. HDL sources (VHDL, Verilog; .V, .VHD) and timing requirements feed Express, which produces an EDIF/XNF netlist (.XNF) and reports. The Xilinx Implementation Tools take EDIF, XNF, .NGO, and .UCF inputs and produce BIT and JEDEC files, reports, and SDF, VHDL, Verilog, and EDIF outputs for simulation. Other labels: .VEI, .VHI, HDL simulation, behavioral simulation models.]

Express Input and Output
[Diagram: VHDL, Verilog, and timing requirements enter Express; Express produces an .XNF netlist and reports.]
• Input files may be VHDL or Verilog format
  – Mixed Verilog/VHDL modules are accepted
  – Schematics may also be used, but should not be input into Express
  – Schematic files in XNF or EDIF format will be merged into the design in Xilinx Design Manager
• Output netlists are in XNF format
• Timing specifications may be specified in Express
  – Timing specifications are not used during synthesis
  – Timing specifications can be included in the output netlist

Express Design Process

1. Analyze - Syntax check

2. Implement - Create generic logic design (Elaborate)

3. Enter constraints and options

4. Synthesize - Optimize the design for specific device

5. Export XNF Netlist

6. Implement layout with Xilinx Design Manager


Implementation - M1 Design Manager

• Manages design data

• Access reports

• Supports CPLDs, FPGAs

Flow Engine

Timing Analyzer

PROM File Formatter

Hardware Debugger

EPIC Design Editor

Terminology
• Project
  – Source file; has a defined working directory and family
• Version
  – A Xilinx netlist translation of the schematic
  – Multiple versions result from iterative schematic changes
• Revision
  – An implementation of a Xilinx netlist
  – Multiple revisions typically result from different options
• Part type
  – Specified at translation; can be changed in a new revision

Toolbox Programs
• Flow Engine
  – Controls start/stop points and custom options
• Timing Analyzer
  – Reports on net and path delays
• PROM File Formatter
  – Creates file to program configuration file into PROM
• Hardware Debugger
  – Downloads configuration file with XChecker, serial, or JTAG cable
• EPIC Design Editor
  – Device-level view of routing

Flow Engine

• View status of tools

• Control tool options

• Implements design to the bitstream

Section III

Advanced Hardware Design Techniques

Section III Agenda
• Advanced Hardware Design Techniques
  – General Hardware Information
  – Combinational Logic Design (Look Up Tables and other Resources)
  – Synchronous Logic (Flip-Flops and Latches)
  – Memory Design (RAM and ROM)
  – Input / Output Design

General Hardware Information

Resource Estimation
• Find comparable functions in macro library and XAPP application notes
  – Or, use other designs to estimate device utilization
• Or, quickly implement a design and view the MAP report file
  – Select Utilities -> Report Browser -> Map Report
  – IOBs, CLBs, Global Buffers, and other components listed separately
• For unfinished designs
  – Use save flags on unconnected nets, or
  – Deselect “Trim Unconnected Logic” in Implementation Options

Performance Estimation
• Use block delays as estimate of net delays
• Use desired clock frequency to determine allowed CLB depth
  – Compare to functional requirements and modify design to meet performance needs
• Example for 50 MHz clock frequency in XC4000XL-3:

  Clock period                      20 ns
  One level (tCO + tNET + tSU)     - 8 ns
  Delay allowance                   12 ns
  Each added level (tPD + tNET)    ÷ 6 ns
  Added levels of logic allowed     2 CLBs

[Timing path: tCO, tNET, tPD, tNET, tPD, tNET, tSU across four CLBs.]
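The 50 MHz worked example can be reproduced with a short Python sketch. The function name `allowed_added_levels` and its parameters are hypothetical; the 8 ns and 6 ns figures are the ones quoted on the slide for the XC4000XL-3, and other devices or speed grades would need their own numbers.

```python
def allowed_added_levels(clock_mhz, first_level_ns, added_level_ns):
    """Estimate how many extra CLB levels fit in one clock period.

    first_level_ns: tCO + tNET + tSU for a single level of logic.
    added_level_ns: tPD + tNET for each additional level.
    Returns (clock period, delay allowance, added levels allowed).
    """
    period_ns = 1000.0 / clock_mhz
    allowance_ns = period_ns - first_level_ns
    levels = max(0, int(allowance_ns // added_level_ns))
    return period_ns, allowance_ns, levels

period, allowance, levels = allowed_added_levels(50, 8, 6)
print(period, allowance, levels)
```

Rerunning the function with a higher clock frequency shows quickly when a design has to be restructured or pipelined to fit the allowed depth.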

Power Consumption
• Xilinx FPGAs have flexible routing
  – Power consumption can be half that of FPGAs with less flexible routing channels
• Power = kCV²F
  – How many nodes change state (hard to estimate)
  – Capacitive loading on CLB and IOB outputs (known)
• Power consumption is not a concern in regular course labs
• Power estimation methods
  – See application notes under http://www.xilinx.com/apps/3volt.htm
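As a rough illustration of the kCV²F relation, the Python sketch below compares dynamic power at two supply voltages while holding k, C, and F fixed. The function names are hypothetical, and treating k, C, and F as unchanged between a 5 V and a 3.3 V device is a simplifying assumption made only to isolate the V² term.

```python
def dynamic_power(k, c_farads, v_volts, f_hertz):
    """Power = k * C * V^2 * F, with k standing for switching activity."""
    return k * c_farads * v_volts ** 2 * f_hertz

def supply_scaling(v_old, v_new):
    """Ratio of new to old power when only the supply voltage changes."""
    return (v_new / v_old) ** 2

ratio = supply_scaling(5.0, 3.3)
print(f"3.3 V draws about {ratio:.0%} of the 5 V power, all else equal")
```

The quadratic dependence on voltage is why moving from a 5 V to a 3.3 V supply cuts dynamic power by more than half in this idealized comparison.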

XC4000XL 3.3 V, 0.35µ, 5 Volt Compatible
• Accepts 5 Volt inputs
• Drives standard TTL levels
• Totally compatible in 5 Volt environment
• 0.25µ XV family is also 5 Volt TTL compatible when used with 3.3 Volt I/O supply, 2.5 Volt core supply
[Diagram: any 5 V device connected to an XC4000XL FPGA (0.35µ, 3.3 V logic, 3.3 V I/O). The FPGA’s 3.3 V outputs meet TTL levels, and its inputs are 5 V tolerant.]

XC4000XV & Virtex 2.5 V, 0.25µ, 5 Volt Compatible
• Devices with 5V, 3.3V, and 2.5V power supplies can be interfaced


Combinational Logic Design (Look Up Tables and Other Resources)

XC4000X Configurable Logic Blocks
• G, F, H function generators
• 2 flip-flops
  – Individual clock polarity
  – Sync. and async. Set/Reset
• Delay from F1 to Y in the XC4000X-1 is ~1 nsec
[CLB diagram: inputs G1-G4, F1-F4, and C1-C4 (H1, DIN, S/R, EC); G, F, and H function generators; clock K; two D flip-flops with S/R control and clock enable; outputs Y, YQ, X, XQ.]


16-bit Adder Examples
• Many choices for implementing an adder
  – Speed vs. density trade-off controlled by user and PLD features

Family      Type         CLBs  Levels / Delay  AppLINX
XC3000A     Bit-Serial   16    16              XAPP 022
XC3000A     Parallel     24    8               XAPP 022
XC3000A     Lookahead    30    6               XAPP 022
XC3000A     Conditional  41    3               XAPP 022
XC4000E-3   Carry        8     10.1 ns         XAPP 018
XC5200-5    Carry        8     20 ns           5200 Data Sheet

Arithmetic Functions
• Arithmetic macros are optimized for density and speed with dedicated carry logic in CLBs
  – Example: Each CLB can form a two-bit full-adder
• Carry logic components have vertical orientation
  – Needed for speed and utilization
  – Known as RPM or “Relationally Placed Macro”
  – Examples:
    • ADDx adders
    • ADSUx adder/subtractors
    • CCx counters
    • COMPMCx magnitude comparators
[Diagram: ADD4 with inputs A<3:0> and B<3:0> and outputs Z<3:0>, one bit position per stage, stacked vertically.]

Three-State Buffers
• Each CLB is associated with two three-state buffers (BUFT)
  – BUFTs are used independently of LUTs and flip-flops
• Three-state library components:
  – Three-state buffers: BUFT, BUFT4, BUFT8, BUFT16
  – Wired AND (open drain): WAND1, WAND4, WAND8, WAND16
  – Two-input OR driving wired AND: WOR2AND
• Delay varies per family
  – 3.7 ns in the XC4005XL (-1)
  – 13.6 ns in the XC4085XL (-1)
• Use to multiplex signals onto long routing lines to use as buses

Use BUFT for Buses
[Diagram: inputs A0-A3 and B0-B3 drive BUS<0> to BUS<3> through BUFTs controlled by _ENABLE_A and _ENABLE_B.]

BUFTs for Multiplexers
• BUFTs can be used to build large MUXes
  – Large MUXes composed of LUTs need multiple levels of logic
  – Large MUXes composed of BUFTs have only one level of logic
    • CLB resources are not used
  – Use of BUFTs constrains placement
• Multiplexer macros use lookup tables
  – Example: M4_1E
• Create BUFT macros from three-state buffer components
  – BUFT, BUFT4, BUFT8, BUFT16
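The idea of a multiplexer built from three-state buffers can be modeled behaviorally in Python: every input has its own buffer on a shared line, and exactly one buffer is enabled at a time. This is a conceptual sketch only. The names `resolve_bus` and `buft_mux` are hypothetical, enables are modeled as plain booleans rather than the actual BUFT pin polarity, and `None` stands in for a floating (high-impedance) line.

```python
def resolve_bus(drivers):
    """Resolve a shared line from (enabled, value) pairs.

    Returns None when no buffer drives the line, and raises on contention
    (two enabled buffers driving different values).
    """
    active = [value for enabled, value in drivers if enabled]
    if not active:
        return None
    if len(set(active)) > 1:
        raise ValueError("bus contention: conflicting drivers enabled")
    return active[0]

def buft_mux(inputs, select):
    """N-to-1 mux: enable only the buffer whose index equals `select`."""
    drivers = [(index == select, value) for index, value in enumerate(inputs)]
    return resolve_bus(drivers)

print(buft_mux([0, 1, 1, 0], 2))
```

However many inputs are added, the selected value passes through a single buffer, which mirrors the "only one level of logic" point above; the cost is that all buffers must sit on the same line.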

Wide Decoders
• The Wide Decoder is a dedicated wired-AND
  – Useful for address decoding
• IOBs or CLBs can drive the Wide Decoder
  – Located along the periphery of the die
  – All IOB drivers must be on same edge as the decoder
  – Four decoder lines per edge
• Use DECODE macro
  – DECODE4/8/16/24
  – Must use a PULLUP primitive
[Diagram: DECODE8 with inputs A0-A7 and output O tied to a PULLUP.]

CLB Mapping Control in Schematic
• Allows user to force mapping of logic from schematic into a single CLB
• XC3000
  – CLBMap can specify entire CLB
• XC4000/XC5000
  – FMap specifies a function generator in a CLB
  – HMap specifies an XC4000 H function generator in a CLB
[Diagram: FMAP symbol with inputs I1 = A0, I2 = B0, I3 = A2, I4 = B2 and output O = C0.]


Synchronous Logic (Flip-Flops and Latches)

CLB Registers
• Each register can be configured as a flip-flop or latch
• Independent clock polarity
• Asynchronous Set or Reset
• Clock Enable
• Direct input from CLB input (connections bypass LUTs)
[Diagram: two registers with outputs QX and QY. Each D input selects among DIN, F, G, and H; both share clock K, clock enable EC, and S/R control driving SET or RESET.]

Library Offerings
• “Unified” library contains many standard functions
  – Pre-defined size and functionality
• LogiBLOX templates are available
  – Can be customized for bus size and function
• Types of LogiBLOX register functions
  – Shift registers
    • Left/right, arithmetic, logical, circular
  – Clock dividers
    • Output duty cycle
  – Counters
    • LFSR, binary, one-hot, carry logic
  – Accumulators
• Xilinx CORE Generator recommended for very complex functions (DSP, FFT, UARTs, multipliers...)

Naming Conventions

Flip-flop example: FDPE_1
• F: flip-flop
• D: type, one of D-type (D), JK-type (JK), toggle-type (T)
• P: asynchronous preset (P) or asynchronous clear (C); synchronous set (S) or synchronous reset (R)
• E: clock enable
• _1: inverted clock

Sized example: FD16RE
• FD: flip-flop, D-type
• 16: size
• R: synchronous reset
• E: clock enable

Latch example: LDCE_1
• LD: transparent D latch
• C: asynchronous preset (P) or asynchronous clear (C)
• E: gate enable
• _1: inverted gate
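The naming scheme can be captured in a small Python decoder. It is a sketch that covers only the fields listed on this slide. The regular expression, the `decode` function, and the assumption that every macro name follows exactly this field order are illustrative and are not taken from the Xilinx Libraries Guide.

```python
import re

# Fields: kind, optional size, control letters, optional enable, optional "_1".
_PATTERN = re.compile(r"^(FD|FJK|FT|LD)(\d*)([PCSR]*)(E?)(_1)?$")
_KIND = {
    "FD": "D-type flip-flop",
    "FJK": "JK-type flip-flop",
    "FT": "toggle-type flip-flop",
    "LD": "transparent D latch",
}
_CONTROL = {
    "P": "asynchronous preset",
    "C": "asynchronous clear",
    "S": "synchronous set",
    "R": "synchronous reset",
}

def decode(name):
    """Split a macro name such as FDPE_1 into human-readable features."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the assumed pattern: {name}")
    kind, size, controls, enable, inverted = match.groups()
    is_latch = kind == "LD"
    features = [_KIND[kind]]
    if size:
        features.append(f"{size} bits wide")
    features.extend(_CONTROL[letter] for letter in controls)
    if enable:
        features.append("gate enable" if is_latch else "clock enable")
    if inverted:
        features.append("inverted gate" if is_latch else "inverted clock")
    return features

for example in ("FDPE_1", "FD16RE", "LDCE_1"):
    print(example, "->", ", ".join(decode(example)))
```

The three names decoded in the loop are the examples shown on the slide; names outside this pattern raise an error rather than being guessed at.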

Counters
• Libraries support a wide variety of fast and efficient counters
  – Counters offer trade-offs between speed, density, and complexity
  – Example: LogiBLOX counter styles
    • Binary: predictable outputs, uses carry logic
    • Johnson: fastest practical counter, but uses more flip-flops; glitch-free decoding
    • LFSR: fast & dense, but pseudo-random outputs
    • One-Hot: useful for generating series of enables
    • Carry Chain: high speed and density
  – The LogiBLOX synthesizer will automatically pick the best implementation based on your design, or you can force an implementation with the STYLE parameter (schematic).

16 Bit Counter Examples
• The following are implemented in XC4000XL-3

Macro         CLBs     Clock
CB16CLE/D     18 - 20  23 - 24 ns
CC16CLED      19       19 ns
CC16CLE       9        16 ns
X-BLOX: LFSR  9        7 ns

• Simpler functions are faster and smaller
• Carry logic counters are generally faster (depends on size)

Global Clock Buffers
• Clock buffers are low-skew, high-drive buffers
  – Also known as global buffers
  – Drive low-skew, high-speed long line resources
  – Drive all flip-flops and latches in FPGA
  – Can also be used for high-fanout signals
• Additional clocks and high-fanout signals can be routed on long lines
• Instantiation: if the BUFG component is instantiated, software will select one of these buffers based on the design
• Synthesis: clocks are identified by different means depending on vendor
  – Example: Synopsys FPGA Compiler connects clock buffers to all fan-in of clock pins
    • Control clock buffer insertion with separate commands
    • Consult synthesis interface guide or vendor

Global Buffer Types
BUFGLS is used by default in the Xilinx software if a BUFG component is specified in the design.

Name     Buffer Description                       Applications                                        Limitations
BUFG     Global Clock (architecture independent)  M1 converts BUFG to most appropriate global buffer
BUFFCLK  Global Fast Clock                        Fastest way to bring clock on chip                  4 per chip, 4KX only; slower for CLBs
BUFGE    Global Low Early Clock                   Faster than BUFGLS; fast IO interface               8 per chip, 4KX only; drives only 1 quadrant
BUFGLS   Global Low Skew Clock                    Can access any CLB or IOB, best for CLBs            8 per chip, 4KX only
BUFGP    Primary Global Buffer                    Drives clocks or longlines                          4 per chip
BUFGS    Secondary Global Buffer                  Drives clocks or longlines                          4 per chip

Generating Clock On-Chip
• Internal configuration clock available after configuration
  – Use OSC4 primitive
  – Nominal values (approximately): 8 MHz, 500 kHz, 16 kHz, 490 Hz, 15 Hz
  – Very limited accuracy (+/- 50%)
[Symbol: OSC4 with outputs F8M, F500k, F16k, F490, and F15; one output drives a BUFGS.]

Global Reset
• All flip-flops are initialized during power up via the Global Set/Reset network
• You can access the Global Set/Reset network by instantiating the STARTUP primitive
  – Assert GSR for global set or reset
  – GSR is automatically connected to all CLB flip-flops using dedicated routing resources
  – Saves general-use routing resources for your design
  – DO NOT CONNECT GSR to set/reset inputs on flip-flops
• Any signal can source the global set/reset, but the source must be defined in the design
• Use Global Reset as much as possible
  – Limit the number of flip-flops with an asynchronous reset
  – Asynchronous resets use extra routing resources
[Symbol: STARTUP with inputs GR/GSR, GTS, and CLK and outputs Q1, Q2, Q3, Q4, and DoneIn.]

Avoid Gated-Clock or Asynch. Reset
• Move gating from the clock pin to prevent a glitch from affecting logic.
• Poor design [diagram: binary counter Q0-Q2 whose decoded TC drives the clock pin of a D flip-flop]
  – TC and Q may glitch during the transition of Q<0:2> from 011 to 100
• Improved designs [diagram: the same counter with Carry-1 and TC driving the flip-flop’s CE input; the flip-flop is clocked by CK]
  – TC will not glitch during the transition of Q<0:2> from 011 to 100
  – Or use MUXed data when using only 1-2 logic inputs

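Shift Registers are Fast & Dense
• The CLB can handle two bits of a shift register
• Fast and dense independent of size
  – Fast connections between adjacent lookup tables
[Diagram: two D flip-flops with clock enables (EC) holding Qi and Qi+1; a Left/Right control selects the neighboring bit (Qi-1 or Qi+2) as input.]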

Prescale Non-Loadable Counters
• Counter speed is determined by the carry delay from LSB to MSB
• Non-loadable counters can use prescaling
  – Pre-scaling restricts load timing
[Diagram: fast small counter whose TC drives the CE of a large dense counter with slower carry.]

Use One-Hot Encoding for State Machines
• Shift register is always fast and dense
  – “One-hot” uses one flip-flop for each count
  – Useful for state machine encoding in FPGAs
• Another alternative is a Johnson counter
  – Inverted output of last stage drives input of first stage
  – Doubles the number of states versus one-hot
• Binary encoding is best for CPLDs
[Diagram: chain of five D flip-flops.]
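The difference between one-hot and Johnson sequencing can be checked with a short Python simulation of a five-stage shift register, matching the five flip-flops drawn on the slide. The function names are hypothetical, and the starting states (a single 1 for one-hot, all zeros for Johnson) are conventional choices assumed here.

```python
def one_hot_sequence(n):
    """Circulate a single 1 through n flip-flops: n distinct states."""
    state = [1] + [0] * (n - 1)
    sequence = []
    for _ in range(n):
        sequence.append(tuple(state))
        state = [state[-1]] + state[:-1]  # last stage feeds the first stage
    return sequence

def johnson_sequence(n):
    """Feed the inverted last stage into the first stage: 2n distinct states."""
    state = [0] * n
    sequence = []
    for _ in range(2 * n):
        sequence.append(tuple(state))
        state = [1 - state[-1]] + state[:-1]  # inverted feedback
    return sequence

print(len(set(one_hot_sequence(5))), len(set(johnson_sequence(5))))
```

With the same five flip-flops, the Johnson counter visits twice as many states, and only one bit changes on each step, which is what makes its outputs easy to decode without glitches.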

State Machine Design Tips
• Split complex states
• Need to minimize number of inputs, not number of flip-flops, in FPGAs
  – Use one-hot encoding for medium-size state machines (~8-16 states)
• Complex states may be improved by breaking up into additional simpler states
[Diagram: State A to State B on cond1, redrawn as State A1 and State A2 leading to State B.]

Use Binary Sequence Only If Necessary
• CLB can generate any sequence desired at same speed
• Use pre-scaling on non-loadable counters to increase speed
  – LSBs toggle quickly
  – See Application Notes XAPP001 and XAPP014
• Use Gray code counters if decoding outputs
  – One bit changes per transition
• Consider Linear Feedback Shift Register for speed when terminal count is all that is needed
  – Or when any regular sequence is acceptable (e.g., FIFO)
[Diagrams: fast small counter whose TC drives the CE of a large dense counter with slower carry; 10-bit SR with taps Q0, Q6, and Q9.]
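The "one bit changes per transition" property of a Gray code counter can be verified with a few lines of Python. The sketch uses the binary-reflected Gray code, which is an assumption, since the slide does not say which Gray code is meant. The helper names `gray` and `bits_changed` are hypothetical.

```python
def gray(i):
    """Binary-reflected Gray code of the integer i."""
    return i ^ (i >> 1)

def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

WIDTH = 4
codes = [gray(i) for i in range(2 ** WIDTH)]
for count, code in enumerate(codes):
    print(f"{count:2d}  {code:0{WIDTH}b}")
```

For comparison, a plain binary count from 011 to 100 flips three bits at once, which is where decoded outputs can glitch.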

Pipeline for Speed
• Register-rich FPGAs encourage pipelining
• Pipelining improves speed
– Consider wherever latency is not an issue
– Use for terminal counts, carry lookahead, etc.
• How to estimate the clock period
– 2 x (number of combinatorial levels) x (speed grade)
– XC4000XL-3: 3 levels x 2 x 3 ns = 18 ns clock period
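The clock-period rule of thumb is simple enough to capture as a helper. The Python below is only a sketch of the arithmetic on the slide; the function names are hypothetical, and the factor of 2 is the slide's own allowance on top of the block delays.

```python
def estimate_clock_period_ns(levels, speed_grade_ns):
    """Rule of thumb: 2 x (combinatorial levels) x (speed grade in ns)."""
    return 2 * levels * speed_grade_ns

def max_clock_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns
```

For the XC4000XL-3 example, `estimate_clock_period_ns(3, 3)` gives 18 ns, or roughly 55 MHz. Halving the number of levels by pipelining roughly doubles the estimated clock rate.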


Memory Design (RAM and ROM)

ROM is Equivalent to Logic
• When using ROM, it is simply defining logic functions in a look-up table format
– Memory might be an easier way to define logic
– Xilinx provides ROM library cells
• FPGA lookup tables are essentially blocks of RAM
– Data is written during configuration
– Data is read after configuration
• Effectively operate as a ROM

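[Figure: the function O = I1*I2 implemented as gates (inputs I1, I2 on LUT pins F1, F2, output X) and as a ROM (address inputs A0, A1; contents DATA(0)=0, DATA(1)=0, DATA(2)=0, DATA(3)=1; output DOUT)]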
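The gate/ROM equivalence in the figure can be checked with a small Python model. This is an illustrative sketch with hypothetical names; the contents list mirrors the DATA(0) to DATA(3) values shown for the two-input AND.

```python
ROM_DATA = [0, 0, 0, 1]  # DATA(0)..DATA(3) for O = I1*I2

def rom_read(a1, a0, data=ROM_DATA):
    """Read the ROM word addressed by address bits A1 (MSB) and A0 (LSB)."""
    return data[(a1 << 1) | a0]

def lut_contents(func, num_inputs):
    """Tabulate func over all inputs; address bit i is the i-th argument."""
    table = []
    for addr in range(1 << num_inputs):
        bits = [(addr >> i) & 1 for i in range(num_inputs)]
        table.append(func(*bits))
    return table
```

Calling `lut_contents(lambda a0, a1: a0 & a1, 2)` reproduces `ROM_DATA`, which is the sense in which a look-up table "is" the logic function.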

RAM Provides 16X the Storage of Flip-Flops

• 32 bits versus 2 bits of storage
– Two 16x1 RAMs or one 32x1 single-port RAM fit in one CLB
– One 16x1 dual-port RAM fits in one CLB
• 32x8 shift register with RAM = 11 CLBs
– Using flip-flops, takes 128 CLBs for data alone
– Address decoders not included
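The CLB counts quoted for the 32x8 shift register can be reproduced with a little arithmetic. The Python sketch below is an interpretation: the slide does not break down the 11 CLBs, so the split into 8 CLBs of RAM plus 3 CLBs for a 5-bit address counter is an assumption, and the function names are hypothetical.

```python
from math import ceil

FF_PER_CLB = 2         # two flip-flops per CLB
RAM_BITS_PER_CLB = 32  # one 32x1 RAM per CLB

def clbs_with_flip_flops(depth, width):
    """CLBs needed to hold depth x width bits in flip-flops (data only)."""
    return ceil(depth * width / FF_PER_CLB)

def clbs_with_ram(depth, width):
    """CLBs for RAM storage plus an assumed flip-flop address counter."""
    data_clbs = width * ceil(depth / RAM_BITS_PER_CLB)
    address_bits = max(1, (depth - 1).bit_length())
    counter_clbs = ceil(address_bits / FF_PER_CLB)
    return data_clbs + counter_clbs
```

For a 32x8 structure this gives 128 CLBs with flip-flops and 11 CLBs with RAM, matching the figures above.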

[Figure: one CLB used as a 32-bit RAM (address A0-A4, data D1, WE, CLK, output O) compared with one CLB used as two flip-flops holding 2 bits (D1, D2, Q1, Q2)]

RAM Types
• Synchronous RAM (SYNC_RAM)
– Synchronous Write Operation
• Synchronous Dual-Port (DP_RAM)
– Can read & write to different addresses simultaneously
[Figure: single-port RAM with Data, Write Enable, Write Clock, and Address inputs and one Output; dual-port RAM with Data, Write Enable, Write Clock, Write Address/Single-Port Read Address, and Dual-Port Read Address inputs, and SP Output and DP Output]

RAM Guidelines
• 32 words or fewer is best
– 32x1 or 16x2 per RAM requires only one CLB
• Delays are short (one level of logic)
– Data and output MUXes are required to expand depth
• Less than 256 words recommended per RAM
– Use external memory for 256 words or more
• Width easily expanded
– Connect the address lines to multiple blocks
• Recommendation: Use less than 1/2 of max memory resources
– Maximum memory uses all logic resources of CLBs

Memory Use
• Most synthesis tools can synthesize ROM from behavioral HDL code, but RAMs must be instantiated
• Use library primitives and macros for standard size memory
– RAM/ROM16X1S to 32X8S
– Use S suffix for Synchronous RAM
– Use D suffix for Dual-Port RAM
• Use LogiBLOX to generate arbitrary size memories

[Figure: RAM32X1S symbol with inputs D, WE, A0-A4 and output O]

How to Generate Memory
• Use LogiBLOX utility to create arbitrary size RAM or ROM
– Select type: ROM, Synchronous, Asynchronous, or Dual Port RAM
– Specify Depth: number of words must be a multiple of 16, ranging from 16 to 256 words
– Specify Width: word size ranges from 1 to 64 bits
– Specify initialization values with attribute file
• LogiBLOX also creates RAM interface
– Entity and component declaration - cut and paste into the design (VHDL designs)
– Module declaration (Verilog designs)
– Symbol Graphic (schematic entry designs)
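The depth and width limits above lend themselves to a quick pre-check before opening the GUI. The Python below is a hypothetical helper, not part of LogiBLOX; it only encodes the two rules stated on this slide.

```python
def check_memory_size(depth, width):
    """Return a list of rule violations for a requested depth x width memory."""
    problems = []
    if not 16 <= depth <= 256:
        problems.append("depth must be between 16 and 256 words")
    elif depth % 16 != 0:
        problems.append("depth must be a multiple of 16")
    if not 1 <= width <= 64:
        problems.append("width must be between 1 and 64 bits")
    return problems
```

An empty list means the request fits the stated limits; for example, a 256x8 memory passes while a 40x8 memory is rejected because 40 is not a multiple of 16.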


Memory Generator Dialog

LogiBLOX function

Data file for initialization

Memory Function

Specify memory type, size, name and function in the LogiBLOX GUI

Instance Name


Input / Output Design

XC4000X IOB Block Diagram


How to specify IO blocks - Schematic
• User explicitly defines what resources in the IOB are to be used
• I/Os are defined with
– 1 pad primitive
– At least 1 function primitive:
• Buffer, F/F, or Latch
• 1 input element, 1 output element or both
– Inverters may also be pulled into IOBs
• IOBs are named by net between pad and function primitives

[Figure: IPAD driving IBUF over net IN1_PAD, giving IOB name IN1_PAD; IPAD driving ILD over net IN2_PAD, giving IOB name IN2_PAD]

Primary and Secondary Global Buffers
• Eight global buffers per FPGA
– Four primary (BUFGP), four secondary (BUFGS)
• Primary buffers must be driven by a semi-dedicated IOB
• Secondary buffers can be driven by a semi-dedicated IOB or internal logic and have more routing flexibility
– Use BUFGS if extra 1-2 ns of delay is acceptable
• Use generic BUFG primitive in your design
– Allows software to choose best type of buffer
– Allows easy migration across families

IPADBUFG

D

I/O Logic
• 4000E families have no boolean logic other than inverters in the IOBs
• XC4000EX adds optional output logic
– Can be used as a generic two-input function generator or MUX
– One input can be driven by IOB output clock signal
• Driving from FastCLK buffer provides less than 6 ns pin-to-pin delay
– Requires library components beginning with “O”

IPAD F OPAD

BUFFCLK

FROM INTERNAL LOGIC FAST

OAND2

Use Pull-ups/Pull-downs to Prevent Floating

• Unused IOBs:
– Outputs of unused IOBs are automatically disabled
– Pull-ups are automatically connected on unused IOBs
• Used IOBs:
– A PULLUP or PULLDOWN primitive can be connected to used IOBs
– Inputs should not be left floating
• Add a pull-up to design inputs that may be left floating to reduce power and noise

Output Three-State Control
• Output enable may be inverted
– Use OBUFE macro for active-high enable
– Use OBUFT primitive for active-low enable
• Three-state control also via a dedicated global net
– Controlled by same STARTUP primitive
• All I/O disabled during configuration

[Figure: STARTUP block with GTS input; OBUFE with active-high enable OE; OBUFT with active-low enable T]

Fast Capture Latch
• Additional latch on input driven by output’s clock signal
• Allows capture of input by very fast clock
– Followed by standard I/O storage element for synchronization to internal logic
– Very fast setup (6.8 ns for 4000EX-3), 0 ns hold
– Available on 4000X, not 4000E family
• Example
– ILDFFDX macro includes Fast Capture Latch and IFDX
– Connect BUFGE to fast capture latch
– Opposite edge of same clock via BUFGLS drives IFDX

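[Figure: ILDFFDX macro. Data IPAD feeds the fast capture latch (D, GF) and then the input flip-flop (D, CE, Q) to internal logic; Clock IPAD drives BUFGE to the latch and BUFGLS to the flip-flop]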

Decrease Setup Time with NODELAY

• NODELAY attribute
– Removes the delay element ahead of the IFD or ILD
– Decreases setup time, and creates hold time
– Available on IFD/ILD macros in XC5200 and XC4000E/X families

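[Figure: IOB input path from pad through input buffer and optional delay element to the D input of the input flip-flop; external clock pad with routing delay; external data delay]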

Output MUX

• OMUX2
– Fast output signal (from output clock pin) MUXes IOB output or clock enable pins to pad
– Effectively doubles the number of device outputs without requiring a larger, more expensive package
– Pin-to-pin delay is less than 6 ns

D0

D1

S0

O

OMUX2

OPAD

OPADOBUF

FAST

Slew Rate Control
• Slew rate controls output speed
• Two slew rates
– Default slow slew rate reduces noise
– Use fast slew rate wherever speed is important
– FAST slew rates are approximately 2x faster than SLOW slew rates
• Slew rate specification
– Instantiation: in the user constraint file:
• INST $1I87/obuf SLOW;
– Synthesis: vendor dependent
• Output drive varies by family
– 4KEX/XL families have 12 mA drive

Choose TTL or CMOS Thresholds
• Threshold is selected during configuration
• Default is TTL
– Global selection on inputs or outputs
– Change to CMOS in Configuration Template
– 3V devices need TTL threshold when interfacing to 5V devices

Section IV

Advanced Software Design with Xilinx M1-Based Software

Section IV Agenda• Design Entry Tips

• Library Types

• FPGA Express for VHDL & Verilog

• M1-Based Software Flow

• Implementation Options

• Design Verification

• PLD Configuration Settings

• Design Constraints


Design Entry Tips

Design Entry Tip - Label Nets
• Label as many nets as possible
– Net names are passed to report files
– Eases debugging
• Names may change due to hierarchy or optimization
• An IOB is named by the net between the pad and I/O function primitives
• A CLB is named by the net on the output
– Flip-flops are always outputs

[Figure: net IN1 between pad and input buffer gives IOB name IN1; net Q2 on a flip-flop output gives CLB name Q2]

Use Legal and Readable Names
• Allowable characters
– Alphanumeric: A - Z, a - z, 0 - 9
– Underline _, Dash -
– Reserved characters
• Angle brackets for buses <>
• Slash / for hierarchy
• Dollar sign $ for reference designators
• Names must contain at least one non-digit
• Avoid using names that correspond to device resources
– CLB row/column locations: AA, AB, etc.
– IOB pin locations: P1, P2, etc.
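These naming rules are easy to check mechanically. The Python below is a hypothetical lint helper, not a Xilinx tool; the resource-name test is a rough heuristic based only on the two examples given (two-letter CLB locations and P-numbered pins).

```python
import re

_ALLOWED = re.compile(r"[A-Za-z0-9_-]+")

def is_legal_name(name):
    """Allowed characters only, and at least one non-digit."""
    return bool(_ALLOWED.fullmatch(name)) and not name.isdigit()

def looks_like_device_resource(name):
    """Heuristic: two capital letters (CLB location) or P<number> (pin)."""
    return bool(re.fullmatch(r"[A-Z]{2}", name) or re.fullmatch(r"P\d+", name))
```

A name such as `IN1_PAD` passes both checks, `123` fails because it is all digits, `A/B` fails because the slash is reserved for hierarchy, and `P1` is legal but flagged as a likely pin location.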

Component Naming Conventions

• Common component names, pin names and functions for all families

• Basic format is <function><width><control_inputs>
– CB4CLE = Counter, Binary, 4 bits, Clear, Load, Enable
– FD16RE = Flip-flops, D-type, 16 bits, Reset, Enable
• Control inputs are referenced by a single letter
– C = asynchronous Clear, R = synchronous Reset
– Listed in order of precedence
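The <function><width><control_inputs> convention can be decoded with a short parser. This Python sketch is illustrative only: the function name is hypothetical, and it handles just the pattern shown in the two examples, not every component in the library.

```python
import re

def parse_component_name(name):
    """Split a name such as CB4CLE into (function, width, control letters)."""
    match = re.fullmatch(r"([A-Z]+)(\d+)([A-Z]*)", name)
    if match is None:
        raise ValueError("name does not follow <function><width><controls>")
    function, width, controls = match.groups()
    return function, int(width), list(controls)
```

`parse_component_name("CB4CLE")` yields `("CB", 4, ["C", "L", "E"])`, and the control letters come out in the order of precedence in which they are listed.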

Use Hierarchy in Design
• Adds structure to design
• Eases debug
• Users can build libraries of common functions
• Allows each design portion to be entered by most efficient method
• Facilitates incremental design and floorplanning
• Supports team design




Library Types

Xilinx Libraries Overview
• Libraries contain descriptions of each component with pin names, functionality, timing, etc.
• There are two libraries:
– The Unified Library contains “ready made” components with non-variable function and size
– The LogiBLOX Library contains templates which can be customized for function and size
• Both libraries allow easy design migration across Xilinx devices and families

LogiBLOX templates and GUI

• LogiBLOX is composed of two parts:– LogiBLOX Library containing templates of VARIABLE SIZE

• Templates are expanded or customized (Counters, Adders, Registers, RAM, ROM)

• Templates have many implementations (e.g. Binary, Johnson, LFSR counters)

– LogiBLOX GUI and Synthesizer to create• A design file for implementation

• Symbol for schematic capture tool

• HDL code for instantiation in your design

• Functional simulation model

Generic LogiBLOX Functions
• One generic model per function type (ex: counter)
– Attributes can be specified, ex: bus width, load, clock enable, etc.
• Arithmetic: COUNTER, ADDER, SUBTRACTOR, ACCUMULATOR
• Storage: SHIFT, DATA_REG, PROM, SRAM, DRAM
• Logic: ANDBUS, ORBUS, MUXBUS, DECODE, TRISTATE, COMPARATOR
• I/O: INPUTS, OUTPUTS, BIDIR_IO
• DSP and other complex functions are also available through CORE Generator

LogiBLOX Module Selector• Simple Combinatorial Logic

– Bus size from 2 to 32 bits – Supports AND, Invert, NAND,

NOR, OR, XNOR, XOR– Any of the inputs or output can be

inverted independently• Use Decode or MASK function

• Three-State Drivers– Bus size from 2 to 32 bits– Optional pull-up resistors

• Constants– Allows signals to be tied high or

low

How to use LogiBLOX in HDL code
• If a LogiBLOX function is inferred, there is nothing more to do!
– Check with the synthesis vendor. Most synthesis tools infer simple LogiBLOX components automatically
– Example: Synthesis tools will infer an adder for X <= A + B;
• To instantiate a LogiBLOX function, or if the synthesis tool does not infer LogiBLOX automatically
– Use LogiBLOX GUI from command-line in “stand-alone” mode: %lbgui -vendor
* Creates a LogiBLOX module for simulation
* Creates an entity or module declaration



FPGA Express for VHDL & Verilog Design

Section Agenda

• Overview

• Design Flow

• Instantiation Guidelines

• Coding Style Guidelines

Overview• Xilinx leads in FPGAs - 55% market share• Synopsys leads in VHDL/Verilog synthesis -

80% market share• One result of long term technology partnership is

FPGA Express– Xilinx is only silicon supplier with right to distribute FPGA

Express technology

– Integration into Foundation Series

Express Input and Output• Input files may be VHDL or Verilog format

– Mixed Verilog/VHDL modules are accepted

– Schematics may also be used, but should not be input into Express

– Schematic files in XNF or EDIF format will be merged into the design in Xilinx Design Manager

• Output netlists are in XNF format• Timing Specifications may be

specified in Express– Timing Specifications are not used during Synthesis

– Timing Specifications can be included in the output netlist

Reports

TimingRequirements

VHDLVerilog

Express

.XNF

Analyze the Design (1)• “Analyze” checks the HDL code for syntax errors

– Also creates internal files

• Files are automatically analyzed when selected for a project

• Do not select XNF or EDIF files– Will be merged

into the design by Design Manager

Synthesis -> Identify Sources

Analyze the Design (2)• As the design blocks are analyzed, status is

displayed:

• In this example, all blocks were analyzed successfully

Main Window

No Errors or Warnings

Warnings

Errors

Out of Date

Implement the Design• Express Implementation maps the HDL code to standard logic, creating a generic netlist.• At this stage, the design has not been optimized• To implement a design, select only the top level block, and

then select the Implement icon

Main Window

Check for Errors and Warnings

• After implementation is complete, the chip symbol plus status is displayed

• View errors, warnings, and messages

• Right click inside window to save information to a text file

Constraint Entry• Constraints are NOT applied to Synthesis

– Constraints are written to the output netlist (XNF) file for use by Design Manager (Xilinx Implementation Tools)

• Timing constraints control path delay• Specify paths with timing groups, or groups of IO or

sequential elements– The INPUT Group includes all input ports at the top level of the

design

– The OUTPUT Group includes all output ports at the top level of the design

– All flip-flops clocked by the same edge of a common clock belong to a group

– To define constraints: select Synthesis -> Edit Constraints forms

Define Clock Period

• Enter Period, Rise, and Fall Time– Select Clock entry -> Define

Synthesis -> Edit Constraints -> Clocks -> Define

Synthesis -> Edit Constraints -> Clocks

Define Global Synchronous Delays
• The clock period creates 3 types of global constraints with the same default value:
(1) All input ports to sequential elements
– Setup of flip-flop or latch is included
(2) Sequential element to all output ports
– Flip-flop clock-to-Q delay is included
(3) Sequential element to sequential element
[Figure: input port, logic, flip-flop, logic, flip-flop, logic, output port, with paths 1, 2, and 3 marked against the clock period]

Synthesis -> Edit Constraints -> Paths form

Define Individual Synchronous Delays

• Default delay from Clock specification is used in the Paths form

• Individual, or path specific delays can be defined on the Ports form– Port delays over-write the global delays from the Paths form

• Input delay, shown here, arrives 20 ns before the rising edge of the clock.

Synthesis -> Edit Constraints -> Ports

Define Key Port Features• Global Buffer defines the type of Clock Distribution

network - Use BUFG for most applications(default)

• Resistance specifies use of pullup or pulldown resistor on unused pads– Reduces power consumption and noise

• Use IO Reg allows use of sequential elements within IO Blocks to minimize Input or Output delay (default)– Dependent on device type

• Pad Location is used to specify pin number of the IO pad

Synthesis -> Edit Constraints -> Ports

Control the Hierarchy• Eliminate (default) or save hierarchical

boundaries

• Flat designs yield best results because more merging and sharing of boolean logic occurs

• However, small blocks are easier to debug– Easier to match source HDL code to synthesized design

• Synthesis goals (Speed or Area) and Effort level can be defined for each module

Synthesis -> Edit Constraints -> Modules (implemented design)

Optimize the Design• Optimization minimizes the design for speed or

area

• Select the implementation, and then select the Optimize icon

• After Optimization, check for errors and warnings again

Main Window

View Results• Select File -> Project Report to generate a

report

• Report file contains:– Files and libraries used

– Settings for Synthesis

– Chip type and speed grade

– Estimated Timing

– Warning: Circuit timing estimates tend to be optimistic. Run timing analysis after routing for most accurate timing analysis.

Report.txt file

Verify Results (1)• After Optimization, open Synthesis -> Edit Constraints to verify that

correct constraints were specified

• Results are based on estimated routing delays

Synthesis -> Edit Constraints -> Paths (for an optimized design)

Verify Results (2)• Review size of the design

• Resource use is displayed for each hierarchical block– Resources used per hierarchical block

– Black Box instantiations cannot be analyzed by Express

Synthesis -> Edit Constraints -> Modules (Optimized Design)

Export Netlist• Create the output netlist for use with the Xilinx Design Manager

(Xilinx Implementation Tools)– Output File format is XNF

• Select the optimized design, then select Synthesis -> Export Netlist to create the file– XNF file format

is used

• Enable Export Timing Specifications to include constraints in the output netlist

Synthesis -> Export Netlist

Simulation• Not covered in this workshop

• Free VHDL / Verilog simulators– See http://www.xilinx.com/xup/express/express1.htm

– Active VHDL Simulator, by Aldec (Most Recommended)– VHDL Tools from RASSP – Accolade Design Automation demo VHDL Simulator – SimuCAD Silos III (Recommended for Verilog)– Wellspring Verilog Simulator

• Model Technology Inc. (MTI) and major CAD vendors sell other HDL simulators

Instantiation and Hierarchy
• Hierarchy is created when one design is instantiated into another design
• All components in the Unified and LogiBLOX Libraries may be instantiated
– Unified library components are described in the Libraries Guide
– LogiBLOX components are described in the LogiBLOX Reference/User Guide
• Cells that must be instantiated with Express Synthesis
– RAM/ROM, Readback, OSC, Bscan, WOR, WAND, OAND… (all IOB combinatorial logic)

Black Box Instantiation
• What is a black box? Any element not analyzed by Express. Examples:
– Existing Design Modules or Elements (XNF, EDIF, .ngo)
– LogiBLOX Components
– Pre-Optimized Netlists (PCI Cores or LOGICOREs)
• Procedure for using a black box:
– Create a place holder in the HDL code
– Synthesize the design without the XNF, EDIF, or NGO files
– The Xilinx Implementation Tools will resolve (link in) all black box references
• Limitations
– Express cannot check timing constraints through a black box.
– Express cannot include black box resources in its reports.
– GSR nets are not automatically inferred within Black Boxes
• Instantiate STARTUP and explicitly connect GSR ports in HDL


LogiBLOX & CORE Generator Functions

• For HDL designs, LogiBLOX and CORE Gen generate:– Behavioral VHDL or Verilog model - for simulation only

– VHDL/Verilog Template - for component instantiation

– NGO file - for Xilinx implementation

• Most LogiBLOX functions can be inferred. Exceptions include READBACK and RAM blocks.

• Instantiation may provide better control of design implementation

How to Use LogiBLOX
1. Invoke LogiBLOX from Foundation

2. Select Setup

a. Specify VHDL or Verilog Template in the LogiBLOX Setup form

b. Other setup options may also be required*

3. Specify component features

4. Select OK to create component

5. (VHDL/Verilog) Use template file (.vhi / .vei) to easily instantiate the component
(Verilog) Add empty interface file to define busses.

6. Compile as usual

*To access Verilog options, invoke LogiBLOX directly from Start -> Programs -> Xilinx Foundation Series -> LogiBLOX

RAM Example
• Code is shown in the following slides:
• VHDL instantiation:
– Component and entity declarations were copied into top level design file from LogiBLOX VHI file
• Verilog instantiation:
– Module declaration is copied into top level design file from LogiBLOX VEI file
– Additional empty file is required to specify pin type (input or output)
• Do not try to Analyze the VHI or VEI file from LogiBLOX, but DO Analyze the top level design file
– Verilog users will synthesize the additional empty Verilog file

RAM Instantiation (VHDL)

Top-level entity and RAM component declaration (component declaration copied from VHI file):

library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.STD_LOGIC_UNSIGNED.all;

entity top is
  port (NOTCLR, CLKEN, NOTLD, UPCNT: in STD_LOGIC;
        CNT_DI, RAM_DI: in STD_LOGIC_VECTOR (7 downto 0);
        QO_LO: out STD_LOGIC_VECTOR (7 downto 0));
end top;

. . .

component ram256x8
  PORT(
    A: IN std_logic_vector(7 DOWNTO 0);
    DI: IN std_logic_vector(7 DOWNTO 0);
    WR_EN: IN std_logic;
    WR_CLK: IN std_logic;
    DO: OUT std_logic_vector(7 DOWNTO 0));
end component;

RAM Instantiation (VHDL) (2)

Last part of top architecture. Component declaration is copied from VHI file, and instance name is entered:

begin
  U1: OSC4
    port map (OSC_CK);
  U2: BUFG
    port map (OSC_CK, CLK);
  U3: CB8CLED
    port map (CLK, NOTCLR, CLKEN, NOTLD,
              UPCNT, CNT_DI, ADDR);
  xram : ram256x8 port map
    (A => ADDR,
     DI => RAM_DI,
     WR_EN => CLKEN,
     WR_CLK => CLK,
     DO => QO_LO);
end cr;

Coding for Performance

• FPGAs require better coding styles and more effective design methodologies
– Pipelining techniques allow FPGAs to reach gate array system speeds
• Gate Arrays can tolerate poor coding styles and design practices
– 66 MHz is easy for a Gate Array
• Designs coded for a Gate Array tend to perform 3x slower when converted to an FPGA
– Not uncommon to see up to 30 layers of logic and 10-20 MHz FPGA designs
– 6-8 FPGA Logic Levels = 50 MHz

Case vs If-Then-Else (Verilog)

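[Figure: case statement as a single 4-to-1 multiplexer (in0-in3, sel, mux_out); if-then-else as a priority chain of 2-to-1 selections (sel=00, sel=01, sel=10) producing p_encoder_out]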

module mux (in0, in1, in2, in3, sel, mux_out);
  input in0, in1, in2, in3;
  input [1:0] sel;
  output mux_out;
  reg mux_out;
  always @(in0 or in1 or in2 or in3 or sel) begin
    case (sel)
      2'b00: mux_out = in0;
      2'b01: mux_out = in1;
      2'b10: mux_out = in2;
      default: mux_out = in3;
    endcase
  end
endmodule

module p_encoder (in0, in1, in2, in3, sel, p_encoder_out);
  input in0, in1, in2, in3;
  input [1:0] sel;
  output p_encoder_out;
  reg p_encoder_out;
  always @(in0 or in1 or in2 or in3 or sel) begin
    if (sel == 2'b00)
      p_encoder_out = in0;
    else if (sel == 2'b01)
      p_encoder_out = in1;
    else if (sel == 2'b10)
      p_encoder_out = in2;
    else
      p_encoder_out = in3;
  end
endmodule

Reduce Logical Levels of Critical Path(Verilog)

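[Figure: gate networks for the two modules. In the first, critical enters at the first gate level with in0 and in1; in the second, critical enters at the last gate before out]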

module critical_bad (in0, in1, in2, in3, critical, out);
  input in0, in1, in2, in3, critical;
  output out;
  assign out = (((in0&in1) & ~critical) | ~in2) & ~in3;
endmodule

module critical_good (in0, in1, in2, in3, critical, out);
  input in0, in1, in2, in3, critical;
  output out;
  assign out = ((in0&in1) | ~in2) & ~in3 & ~critical;
endmodule
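Moving a late-arriving signal closer to the output only helps if the restructured logic still computes the same function, so an exhaustive check is worthwhile for small expressions. The Python below is a behavioral model of the two assign statements as printed (not HDL, and the helper names are hypothetical). It suggests the two expressions differ when critical=1 and in2=in3=0, so the pair is best read as an illustration of the structural idea rather than as drop-in equivalents.

```python
from itertools import product

def critical_bad(in0, in1, in2, in3, critical):
    """Model of: (((in0&in1) & ~critical) | ~in2) & ~in3"""
    return (((in0 & in1) & (1 - critical)) | (1 - in2)) & (1 - in3)

def critical_good(in0, in1, in2, in3, critical):
    """Model of: ((in0&in1) | ~in2) & ~in3 & ~critical"""
    return ((in0 & in1) | (1 - in2)) & (1 - in3) & (1 - critical)

def mismatches():
    """All input combinations where the two models disagree."""
    return [bits for bits in product((0, 1), repeat=5)
            if critical_bad(*bits) != critical_good(*bits)]
```

Each tuple returned by `mismatches()` is ordered (in0, in1, in2, in3, critical). An empty result would confirm equivalence; here the list is non-empty, which is the kind of discrepancy a quick check like this is meant to catch before synthesis.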

Resource Sharing (Verilog)

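[Figure: poor sharing uses two adders (a0+b0 and a1+b1) followed by a multiplexer on sel; good sharing multiplexes the operands on sel and uses one adder to produce sum]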

module poor_resource_sharing (a0, a1, b0, b1, sel, sum);
  input a0, a1, b0, b1, sel;
  output sum;
  reg sum;
  always @(a0 or a1 or b0 or b1 or sel) begin
    if (sel)
      sum = a1 + b1;
    else
      sum = a0 + b0;
  end
endmodule

module good_resource_sharing (a0, a1, b0, b1, sel, sum);
  input a0, a1, b0, b1, sel;
  output sum;
  reg sum;
  reg a_temp, b_temp;
  always @(a0 or a1 or b0 or b1 or sel) begin
    if (sel) begin
      a_temp = a1;
      b_temp = b1;
    end
    else begin
      a_temp = a0;
      b_temp = b0;
    end
    sum = a_temp + b_temp;
  end
endmodule

Register Duplication to Reduce Fan-Out(Verilog)

module low_fanout(in, en, clk, out);
  input [23:0] in;
  input en, clk;
  output [23:0] out;
  reg [23:0] out;
  reg tri_en1, tri_en2;
  // Duplicate the enable register so each copy drives 12 loads.
  // Non-blocking assignments avoid simulation races with the blocks below.
  always @(posedge clk) begin
    tri_en1 <= en;
    tri_en2 <= en;
  end
  always @(tri_en1 or in) begin
    if (tri_en1) out[23:12] = in[23:12];
    else out[23:12] = 12'bZ;
  end
  always @(tri_en2 or in) begin
    if (tri_en2) out[11:0] = in[11:0];
    else out[11:0] = 12'bZ;
  end
endmodule

module high_fanout(in, en, clk, out);
  input [23:0] in;
  input en, clk;
  output [23:0] out;
  reg [23:0] out;
  reg tri_en;
  // A single enable register drives all 24 loads.
  always @(posedge clk) tri_en <= en;
  always @(tri_en or in) begin
    if (tri_en) out = in;
    else out = 24'bZ;
  end
endmodule

[Figure: high_fanout, where one tri_en register drives 24 three-state loads; low_fanout, where duplicated registers tri_en1 and tri_en2 drive 12 loads each]

Design Partition - Reg at Boundary (Verilog)

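[Figure: adder followed by an output register on sum, clocked by clk; input registers on a0 and a1, clocked by clk, followed by an adder producing sum]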

module reg_at_boundary (a0, a1, clk, sum);
  input a0, a1, clk;
  output sum;
  reg sum;
  // The register sits on the module output.
  always @(posedge clk) begin
    sum <= a0 + a1;
  end
endmodule

module reg_in_module(a0, a1, clk, sum);
  input a0, a1, clk;
  output sum;
  reg sum;
  reg a0_temp, a1_temp;
  // Inputs are registered; the adder output leaves the module unregistered.
  always @(posedge clk) begin
    a0_temp <= a0;
    a1_temp <= a1;
  end
  always @(a0_temp or a1_temp) begin
    sum = a0_temp + a1_temp;
  end
endmodule

Managing FPGA Speed Booster Pipeline (Verilog)

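1 cycle:

module no_pipeline (a, b, c, clk, out);
  input a, b, c, clk;
  output out;
  reg out;
  reg a_temp, b_temp, c_temp;
  // Multiply and add share one clock period between registers.
  always @(posedge clk) begin
    out <= (a_temp * b_temp) + c_temp;
    a_temp <= a; b_temp <= b; c_temp <= c;
  end
endmodule

2 cycle:

module pipeline (a, b, c, clk, out);
  input a, b, c, clk;
  output out;
  reg out;
  reg a_temp, b_temp, c_temp, mult_temp;
  // Stage 1: registered multiply.
  always @(posedge clk) begin
    mult_temp <= a_temp * b_temp;
    a_temp <= a; b_temp <= b;
  end
  // Stage 2: registered add. Non-blocking assignments keep the two
  // clocked blocks free of simulation races on mult_temp.
  always @(posedge clk) begin
    out <= mult_temp + c_temp;
    c_temp <= c;
  end
endmodule

[Figure: 1-cycle datapath with multiplier and adder between the input registers (a, b, c) and out; 2-cycle datapath with a pipeline register between the multiplier and the adder]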


When to Use Tri-state Buffers (BUFTs)

• BUFTs can be used to implement:
– Internal Tri-state busses
– Muxes greater than 4-to-1 or Multiplexed Buses
• BUFTs can be inferred:
– Tri-states are inferred when a 'Z' can be assigned to a signal
• BUFTs can be instantiated:
– BUFT components
– LogiBLOX Tri-State Buffers
– Within a wide MUX: LogiBLOX Wired-AND MUX


4-to-1 Tri-State MUX Before (VHDL)

[Figure: four three-state buffers driving SIG from DATA(0) through DATA(3), enabled by SEL(0) through SEL(3)]

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity TST is
  port( DATA: in std_logic_vector(3 downto 0);
        SEL: in integer;
        SIG: out std_logic );
end TST;

architecture BEH of TST is
begin
  LOOP1: for I in 0 to 3 generate
    SIG <= DATA(I) when (SEL = I) else 'Z';
  end generate;
end BEH;

• Is there a problem with this example?


4-to-1 Tri-State MUX After (VHDL)

• How can this code be improved?
– Default integer is 32 bits
– Define a limit

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity TST is
  port( DATA: in std_logic_vector(3 downto 0);
        SELECTOR: in integer range 0 to 3;
        SELECTION: out std_logic );
end TST;
. . .

         Before   After
CLBs     8        4
IOBs     37       7
TBUFs    4        4
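The IOB counts in the table follow directly from the port widths. The Python below is a sketch of that arithmetic with hypothetical helper names; it assumes one IOB per port bit and that an unconstrained integer port is built as 32 bits, as the slide states.

```python
def select_port_bits(max_value):
    """Bits needed for an integer port constrained to 0..max_value."""
    return max(1, max_value.bit_length())

def iob_count(data_bits, select_bits, output_bits=1):
    """One IOB per port bit: data inputs, selector inputs, and outputs."""
    return data_bits + select_bits + output_bits
```

Before the change, `iob_count(4, 32)` gives 37 IOBs; after constraining the selector to `range 0 to 3`, `iob_count(4, select_port_bits(3))` gives 7.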


Flip-Flop Examples (VHDL)
• Flip-Flop inference driven by 'event in VHDL

-- D flip-flop (produces registered output)
FF: process (CLOCK)
begin
  if (CLOCK'event and CLOCK='1') then
    A_Q_OUT <= D_IN;
  end if;
end process; -- End FF

-- Flip-flop with asynchronous preset and clock enable
FF_CLOCK_ENABLE: process (ENABLE, PRESET, CLOCK)
begin
  if (PRESET = '1') then                   -- generates async preset
    D_Q_OUT <= "11111111";
  elsif (CLOCK'event and CLOCK='1') then
    if (ENABLE='1') then                   -- generates clock enable
      D_Q_OUT <= D_IN;
    end if;
  end if;
end process; -- End FF_CLOCK_ENABLE

Flip-Flops Vs. Latches
• Latch inference does not include an edge ('event or posedge)
• Latches are generated when:
– A signal is assigned in one branch of an if statement or case statement, but not all branches
– An if or case statement does not define all possible conditions
• Does not apply to case statements in VHDL
• Use Synopsys parallel_case and full_case directives for Verilog to avoid latches
• Or, include a default assignment before the if statement


Global SET/RESET • All Xilinx FPGAs have a built-in global synchronous reset

facility• Global SET/RESET sets or resets every sequential element

in the FPGA– GSR signal is accessed by instantiating the STARTUP block. – GSR will be inferred when the design has a net that sets / resets

all sequential elements in the design– Additionally, sequential elements may be set or reset individually

• These global nets exist outside of the general purpose routing within the device.


How to access Global SET/RESET

• The Global Set/Reset (GSR) signal is accessed by instantiating the STARTUP block. – Polarity may be inferred

• GSR will be inferred when the design has a net that sets / resets all sequential elements in the design


State Machine Encoding
• For FPGAs, use one-hot encoding for complex state machines
– Works well in Xilinx’ register-rich FPGAs
– Uses fewer wide-input functions
– Generally produces fast state machines
• For CPLDs, use Binary encoding
• One-hot and binary encoding can be selected in Express at Synthesis -> Options -> Project
– Other types of encoding such as BCD or Gray may be specified in the HDL code
• It's best to break up large state machines into smaller ones


Address Range Identification
• For the inequality operators, synthesis will infer two 12-bit comparators
• VHDL Example:

if ADDRESS(31 downto 20) <= "000000000110" and
   ADDRESS(31 downto 20) >= "000000000001" then

• More address ranges are synthesized to more comparators
• Better solution: look for patterns in address bits that can eliminate need for comparators

if (ADDRESS(31 downto 23) = "000000000") and
   (ADDRESS(22 downto 20) /= "111") and
   (ADDRESS(22 downto 20) /= "000") then
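The pattern-based decode can be checked against the comparator version by brute force over the 12 address bits involved. The Python below is a behavioral model, not HDL, with hypothetical function names; the argument stands for ADDRESS(31 downto 20) taken as an unsigned integer.

```python
def in_range_comparators(addr_hi):
    """Two magnitude comparisons on ADDRESS(31 downto 20)."""
    return 0b000000000001 <= addr_hi <= 0b000000000110

def in_range_pattern(addr_hi):
    """Equality tests only: upper 9 bits zero, low 3 bits not 000 or 111."""
    upper9 = addr_hi >> 3   # ADDRESS(31 downto 23)
    low3 = addr_hi & 0b111  # ADDRESS(22 downto 20)
    return upper9 == 0 and low3 != 0b111 and low3 != 0b000
```

Both functions agree for all 4096 values of the 12-bit field, so the second form selects the same range (1 through 6) using only equality tests, which map onto narrower logic than two full comparators.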

Arithmetic and Comparison Operators
• Use arithmetic and comparison operators whenever possible. Example:

if (Y > Z) then X <= A + B;

• Arithmetic and comparison operators give Express the most flexibility to optimize
– Multiplier
– Adder, Subtracter, and Adder/Subtracter
– Incrementer, decrementer, and incrementer/decrementer
– Comparator
– Multiplexer (select operator)
• Operators can be instantiated, but generally you will get the best performance with operator inference

Expressions

• Expressions– Use parentheses to indicate precedence. – Replace repetitive expressions with function

calls or continuous assignments

Last but not least….

• VHDL generate statements can cause long compile times unfolding the logic - Use wisely
– Be careful with generate statements nested in loops or within generate statements
– Generate example

-- Generate 3 instances of ALU2
GEN1: for N in 0 to 2 generate
  ALU2_X3: ALU2 port map (
    CTL(2 + N*3 downto N*3),
    A(7 + N*8 downto N*8),
    Y(7 + N*8 downto N*8));
end generate;
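It can help to see what the generate loop unfolds into before synthesis does the work. The Python below is a hypothetical sketch that simply evaluates the slice bounds from the example for each value of N; it does not model ALU2 itself.

```python
def generate_slices(count=3, ctl_width=3, data_width=8):
    """Slice bounds (high, low) that each generated ALU2 instance receives."""
    slices = []
    for n in range(count):
        slices.append({
            "N": n,
            "CTL": (ctl_width - 1 + n * ctl_width, n * ctl_width),
            "A": (data_width - 1 + n * data_width, n * data_width),
            "Y": (data_width - 1 + n * data_width, n * data_width),
        })
    return slices
```

For the three instances this yields CTL(2 downto 0), CTL(5 downto 3), and CTL(8 downto 6), with A and Y covering bits 7-0, 15-8, and 23-16. Nested generates multiply the number of unfolded instances in the same way, which is where the compile time goes.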

Resources
• Support Resources
– www.xilinx.com (Answers Search)
– Express Expert Journal: http://www.xilinx.com/support/techsup/journals/fpga_exp/index.htm
– Synthesis Design Guides: http://www.xilinx.com/apps/hdl.htm
• On-Line Documentation
– START -> Programs -> Xilinx Foundation Series -> VHDL Reference Manual
– START -> Programs -> Xilinx Foundation Series -> Verilog Reference Manual
– START -> Programs -> Xilinx Foundation Series -> On-Line Books -> Express User’s Guide and Express Application Supplement


M1-Based Software Flow

Logical Design Files
• Logical Design Files describe your design, and are composed of logical components
– Typically a netlist, generated by Schematic Capture or Synthesis
– Composed of Boolean Gates, FIFOs, RAMs
• Netlist input to XACT-Step M1 is in EDIF format
– XNF files are also accepted
• EDIF format files are translated to NGD (Native Generic Design) format
– NGD files have varying extensions
– Ex: NGD, NGM, NGA, NGO
• NGD files can be translated to other formats for simulation

Physical Design Files
• Physical design files are composed of components found in a Xilinx FPGA such as look-up tables and flip-flops
– Physical design files have .ncd extension
– Map creates an NCD file from an NGD file
– NCD files contain varying pieces of information
• Mapping, placement, and routing tools each concatenate data to the bottom of the NCD file

M1-Based Design Flow

[Figure: M1-based design flow. .XNF or EDIF netlist and UCF (User Constraint File) -> NGDBUILD (flatten hierarchical design) -> .NGD -> MAP (logical to physical translation; groups LUTs and FFs into CLBs) -> .NCD and .PCF -> TRCE (static timing estimates) -> PAR (layout and routing of physical design) -> .NCD -> TRCE (static timing analysis) -> BITGEN (generates configuration file) -> .BIT]

*Design entry tool flows to M1 are shown in the Appendix.

Design Flow Programs (1)
Translate -> Map -> Place & Route -> Configure
• NGDBUILD
– Merges hierarchical EDIF or XNF files into one hierarchical file
– Creates internal netlist .ngd (Native Generic Design) files
– Contains logical components: combinatorial gates, RAMs, flip-flops, etc.
• MAP
– Maps logical components to physical components found in Xilinx FPGA: look up tables, flip-flops, three state buffers, etc.
– Packs physical components into COMPS
– Creates internal .ncd (Native Circuit Design) file

Design Flow Programs (2)
• TRCE
– Analyzes Timing
• Use before PAR to analyze constraints
• PAR
– Places COMPS on FPGA
– Routes the FPGA
• TRCE
– Analyzes Timing
• Use after PAR to check delays
• NGDANNO
– Back-annotate timing delays for Simulation
• BITGEN
– Create file to configure FPGA

Key M1 Browser Reports

• Map Report
– Displays the result of the DRC (Design Rule Check)
– Indicates whether the design will fit into the specified part
– Identifies ways to improve the design
– Reports nets with no source or load

• Logic Level Timing Report provides delay estimates
– Reports the longest paths in the design
– Created before placement
– Based on block delays and minimum net delays

Key Report Files

• Placement and Routing Report includes a resource summary
– Indicates the percentage of utilization
– Specifies the number of I/Os and flip-flops
– Reports whether the design routed
– Gives an overall timing score
  • A score of zero indicates all timing specifications were met

• Post Layout Timing Report
– Based on block delays and net delays after routing
– Used for detailed delay analysis after implementation

• Pad Report
– Cross-reference of input/output components and package pins

BEL and Comp Terminology

• XACTstep M1 uses two new terms for FPGA resources: “Comps” and “Bels”
– A comp may refer to a CLB, IOB, TBUF, or Decoder
– A BEL may refer to the contents of a comp, such as F-LUT, H-LUT, FFX, FFY, RAM, or PAD

• The Graphic Design Editor (EPIC) and TRCE timing reports refer to BELs

[Figure: 4000X CLB] The COMP shown here is a CLB, which contains the BELs F_LUT, G_LUT, H_LUT, FFX, and FFY.


Implementation Options

Main Implementation Menu Options

• Guide Option
– Use a previous implementation as a template for the current implementation
– Specify constraint file (optional)

• MAP, PAR, and configuration options
– Implementation has four sub-menus: Optimize and Map, Place and Route, Timing, and Interface

Optimization and Map Options (1)

Map optimizes your design before it is partitioned into LUTs, flip-flops, etc. The GUI includes these options:

• Trim Unconnected Signals (default is On)
– Trims all fan-out/fan-in from unconnected pins
– Turn off to implement hierarchical blocks separately

• Replicate Logic (default is On)
– Duplicates logic with high fan-out
– Increases utilization, decreases delay

Optimization and Map Options (2)

• Optimization Strategy (default is Off)
– Minimizes logic to optimize for speed, area, or both
– Synthesized designs have been optimized already

• Packing Strategy (default is minimum density)
– Informs Map of how to pack COMPS with logic
– Minimum Density: Map only puts related logic into the same COMP
– Fit Device: packs components more tightly into COMPS
– Can adversely affect timing and routability

• Generate 5-I/P Functions
– Reduces block levels but increases area

Place and Route Options (1)

• Runtime (default is 2)
– Trades off placement effort versus CPU time

• Router Passes (default is Auto)
– The router runs until no further improvement is made toward meeting timing constraints.
– Specify a number to avoid very long run times for difficult designs.
– Start with 3 passes

Utilities -> Template Manager -> Edit Implementation Template -> Place and Route

Place and Route Options (2)

• Workstation users may run PAR LOOP on multiple workstations simultaneously
– Create a list of available workstations
  • One name per line, no comments
– Include the file name in the Nodelist field

• Many other options for advanced users, not shown here

Implementation Options for Fast Runtime versus PAR Effort

[Figure: implementation options dialogs]

• Deselect the 3 checkboxes indicated in the dialog
• Select the fast placement option, 1-2 routing passes, and 0 clean-up passes, and deselect “Use Timing Constraints”

Other hints:
- 4KX and 9500 families give the fastest runtimes.
- Save this as an implementation template.

Timing Report Options

• Enable the creation of the Timing Report
– Logic Level Timing Report is created before PAR
  • Has minimal net delays
  • Used to predict realistic constraints
– Post Layout Timing Report is created after PAR
  • Verify that the design meets constraints

Timing Report Options (2)

• These options limit the information placed in the report file
• All options list paths in order of delay length; longest paths are listed first

Design Performance Summary (Default)
– Displays longest clock-to-setup, pad-to-setup, and setup-to-pad delays for each clock in the design

Default Timing Constraints
– Lists longest flip-flop-to-flip-flop, pad-to-flip-flop, and flip-flop-to-pad paths

User Timing Constraints
– Reports longest paths for each constraint

Design -> Implement -> Options -> Edit Template -> Timing

Controlling the Back Annotation Netlist Format

Format options:
- VHDL
- Verilog
- XNF
- EDIF

EDIF formats:
- Standard (2.0.0)
- Viewlogic
- Mentor EDIF
- LogicModelling

How to Start and Stop the Flow Engine

• Select Flow Engine -> Setup Advanced to select the starting state

• Select Flow Engine -> Setup -> Stop After to set stopping point

Create a Script from the GUI

• M1 can create a script file from the GUI session
– Available from the Flow Engine or Design Manager
– Select Utilities -> Command History -> Command Line
– Select Utilities -> Project Notes
  • Copy, paste, and save text from the Command History window

The Guide Option

• Allows use of a previously placed and routed design to guide a new placement
– Can be useful if there are few design changes

• Guide is used for Map, Place, and Route
– Map may take much longer to execute, but PAR will be faster

• Recommended alternative is to use location constraints in the design

[Figure: the previous design guides place and route of the new design]

Effective Use of Guide

• Guide uses signal and component names to determine which parts of the design were edited
• Name all nets
– Do not change names

• Minimize changes to the design
– Any new level of hierarchy changes all names below it
– Avoid any changes to synthesized logic
  • Synthesis users: try to freeze the design with “set_dont_touch” or a similar command
– Otherwise, the guide option may not be useful


Design Verification

Recommended Verification Flow

[Figure: recommended verification flow. Design steps: Netlist, Implement, Bitgen / Prom File Formatter, Download. Verification steps: FUNCTIONAL SIMULATION, Timing Analysis, TIMING SIMULATION, IN-CIRCUIT VERIFICATION.]

Timing Analyzer

• Analyze delays before and after implementation

Timing Analyzer Benefits

• Combines block delays from the data book with net delays from implementation files
• Quickly identifies critical paths and timing hazards
• Report shows all elements in a path, each element's delay, and the cumulative delay
– Can determine whether slow paths are due to block delays (design) or net delays (implementation)

Element               Delay   Total
PAD to IOB.I           2.2     2.2
IOB.I to CLB1.F1       1.1     3.3
CLB1.F1 to CLB1.X      2.7     6.0
CLB1.X to CLB2.F3      1.2     7.2
CLB2.F3 to Clock       2.1     9.3

[Figure: path from the IOB (I) through CLB1 (F1 to X) to CLB2 (F3) and its flip-flop, drawn as alternating block and net delays]
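The running total in the example report can be reproduced with a few lines of Python. This is only an illustrative sketch, not output from the Timing Analyzer: the delays are copied from the table above, while the "block" and "net" labels on each row are an assumption read off the figure.

```python
from itertools import accumulate

# (element, delay, kind) rows copied from the example report.
# The "block"/"net" labels are an assumption based on the figure.
PATH = [
    ("PAD to IOB.I", 2.2, "block"),
    ("IOB.I to CLB1.F1", 1.1, "net"),
    ("CLB1.F1 to CLB1.X", 2.7, "block"),
    ("CLB1.X to CLB2.F3", 1.2, "net"),
    ("CLB2.F3 to Clock", 2.1, "block"),
]


def cumulative_totals(path):
    """Return the running total after each element, rounded to 0.1."""
    return [round(total, 1) for total in accumulate(d for _, d, _ in path)]


def delay_by_kind(path):
    """Sum the delays separately for block and net elements."""
    totals = {}
    for _, delay, kind in path:
        totals[kind] = round(totals.get(kind, 0.0) + delay, 1)
    return totals


for (name, delay, _), total in zip(PATH, cumulative_totals(PATH)):
    print(f"{name:<20}{delay:>6.1f}{total:>7.1f}")
print(delay_by_kind(PATH))
```

Splitting the total by kind is one way to see whether a slow path is dominated by block delays (a design issue) or net delays (an implementation issue).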

Output Files for Simulation

[Figure: ngdanno and ngd2xxx produce EDIF, XNF, Verilog / SDF, or VHDL / SDF output]

• Before implementation, the output netlist has unit delays and no back-annotation (use for functional simulation)

• After implementation, post-route delays are back-annotated
– EDIF or XNF output files include back-annotated delays
– SDF files are created in addition to Verilog and VHDL netlists

• VHDL and Verilog output netlists do not contain delays

M1 HDL Simulation Flow

VHDL & Verilog Simulation Libraries

• UNISIM
– New for A1.4, allowing RTL and post-synthesis simulation

• SIMPRIM
– Family/architecture independent models
– Used for post-M1 simulation including full timing
– VHDL and Verilog

• Standard Delay Format (SDF) files
– Separate file used to specify design timing (delays) to VHDL and Verilog simulators
– Xilinx software version 1.4 supports SDF version 2.1

Hardware Configuration Readback

• Can occur while the FPGA runs
• Requires XChecker cable
• Readback Trigger input starts serial readback
• XC3000: controlled via Bitstream Generator
– Default is enabled
– Data and trigger connected to Mode pins
• XC4/5000: controlled via schematic and Bitstream Generator
– Include Readback symbol in schematic
  • Connect TRIG and DATA to I/O pins
  • Can use MD0 and MD1
• See Appendix for more information

[Figure: READBACK symbol with CLK, TRIG, DATA, and RIP pins, connected through IPAD/IBUF and OBUF/OPAD (MD0, MD1) to the XChecker RT and XChecker RD lines]


PLD Configuration Settings

Bitstream Generator Options - Configuration

• Controlled via the Configuration Template
• Increase Configuration Rate if not concerned about compatibility with earlier families
• Add Pull-Up or Pull-Down to avoid having to connect external resistors
• All configuration controls are set in the template.

Bitstream Generator Options - Startup

The “Start-Up Clock” switch enables the designer to synchronize startup with the FPGA's own configuration clock or an external clock signal.

Start-up can also begin when the “Done” pin goes high.

To program the “Output Events”, refer to the Implementation Options section of the “Design Manager User Guide” included on the Documentation CD.

Bitstream Generator Options - Readback

The Hardware Debugger can verify the downloaded configuration and probe the internal states of the device by using the Readback feature.

To use this feature, check the “Enable Bitstream Verification” box, connect the XChecker Cable to your device, and insert the “Readback” symbol into your design.

For more information, refer to the Xilinx Data Book and the Hardware Debugger Reference Guide on the Documentation CD.

Choose a Configuration Method

Configuration Mode       Data         Characteristics
Master Parallel          Byte-Wide    FPGA loads itself from external byte-wide PROM
Master Serial            Bit-Serial   FPGA loads itself from external serial PROM
Peripheral               Byte-Wide    FPGA loaded under microprocessor control
Synchronous Peripheral   Byte-Wide    FPGA loaded by user's configuration clock
Express                  Byte-Wide    Fastest configuration mode; 4000EX devices only
Slave                    Bit-Serial   FPGA loaded by microprocessor or DMA controller; used by XChecker Download Cable
Daisy Chain              Bit-Serial   FPGAs load themselves from PROM; PROM Formatter creates bitstream

M[2:0] pins control the configuration mode setting.



Design Constraints

Section Agenda

• Overview
• Location and implementation constraints
• General timing constraints
• Specific timing constraints
– Path and block specific constraints
– Path and block grouping
– Advanced constraint commands
– Priority

Constraint Entry Overview

• All constraints can be entered in the User Constraint File (UCF)
– Maximum allowable delay
– Placement of package pins
– Implementation options
– Bitstream generation / PROM configuration

• Timing constraints may also be defined in the schematic
– Advantage: easy entry for hierarchical blocks
  • UCF files must have hierarchical net and component names
– Disadvantage: not all constraints are supported
– See the Libraries Guide for schematic syntax and availability

• Some synthesis tools allow entry of constraints
– Constraint files may be generated by the synthesis tools, or constraints may be written in the output netlist
– FPGA Express puts constraints into the XNF file

UCF Syntax

• Use uppercase letters for keywords
– Keywords include names used in constraints, such as:
  AFTER  BEFORE  IN  LOC  NET  OFFSET  OUT  PERIOD

• Use quotes around names with non-alphanumeric characters

• Two types of wildcards may be used:
– “?” is a wildcard for a single character
– “*” is a wildcard for any number of characters
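The two wildcards behave like shell-style globbing, so their effect can be previewed outside the Xilinx tools. The sketch below uses Python's `fnmatch` module as a stand-in; it is not the UCF parser, the instance names are hypothetical, and `fnmatch` additionally treats `[...]` as a character class, which the UCF rules above do not mention.

```python
from fnmatch import fnmatchcase

# Hypothetical instance names, chosen only for illustration.
INSTANCES = ["SLOWFF1", "SLOWFF2", "FASTFF1", "FASTFF2", "REG1", "REG2", "COMB3"]


def match_names(pattern, names):
    """Return the names matched by a pattern using '?' and '*' wildcards."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(match_names("SLOWFF*", INSTANCES))  # '*' matches any number of characters
print(match_names("REG?", INSTANCES))     # '?' matches exactly one character
```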

Pin Location, Implementation Constraints

• Pads can be assigned to a package pin
– Ex: Assign a bus signal to pin 32

INST "QOUT<3>" LOC = P32;

• Physical implementation may be controlled in the UCF file, such as:
– FAST: Set fast I/O slew rate
  Example: INST "$1I87/OBUF" FAST;
– PART: Define part type to be used
  Example: CONFIG PART=4005E-PQ160C-5;

Simple Combinatorial Path

• Consider the following path:

[Figure: inputs A and B<9:0> pass through 2 levels of logic to output OUT2; the pad-to-pad delay is marked 27 NS]

• Assume system requirements dictate a delay of 27 ns for all input-to-output pin paths

• The TIMESPEC constraint communicates this requirement to the software:

TIMESPEC TS01 = FROM PADS TO PADS 27 NS;

• PAD-to-PAD TIMESPECs constrain the delay of the input and output pads, and all net and block delays in the path

Synchronous I/O Constraints

• Timing requirements for the design are described by defining system delays
• Defining the system delays means answering these questions:
– What is the clock period?
– When do inputs arrive at IC2?
– When must outputs be stable to meet setup at IC3?

[Figure: IC1 drives IC2 (the FPGA under development), which drives IC3; all three share CLOCK]

Input Arrival Calculation

• Inputs are constrained by their arrival time.
• Example: When does data arrive at pin D1?
– After the clock edge, the data delay is Tcko + Tnet + Tpad + Tc1
– Tc1 represents net delays or other combinatorial elements on the board
– Tcd is the delay through the FPGA clock distribution network

Tarrival = Tcko + Tnet + Tpad + Tc1

[Figure: IC1 driving pin D1 of IC2 (the device under development), with Tcko, Tnet, Tpad, Tc1, and Tcd marked; Tarrival is shown on a CLK waveform running from 0 to 50]

Output Stability Calculation

• When does output data need to be stable?
– Data must be stable in order to meet the setup requirement for IC3
– How long must the data be stable before it is latched in IC3?

• Tstable = Tc3 + Tpad + Tnet + Tc4 + Tsetup

• Tcd is the delay through the clock distribution network

[Figure: output O1 of IC2 (the device under development) driving IC3, with Tc3, Tpad, Tnet, Tc4, and Tsetup marked; Tstable is shown on a CLK waveform running from 0 to 50]

Period and Offset Constraints

• Two commands are used to describe synchronous delays
– PERIOD defines the clock
– OFFSET constraints define input arrival time and output stability time relative to the clock

• Xilinx software determines internal FPGA delays from PERIOD and OFFSET constraints

• Syntax:

NET clock_name PERIOD = some_delay time_unit;
NET input_name OFFSET = IN Tarrival time AFTER clock_name;
NET output_name OFFSET = OUT Tstable BEFORE clock_name;

(input_name and output_name are the names of nets connecting to the I/O pads)

Clock Constraint Example

• Use the Period Command to define the clock

• Given that the clock frequency is 20 MHz for the example:

NET "CLK" PERIOD = 50 ns;

[Figure: example waveform for CLK, marked at 0, 50, and 100 ns]

Synchronous Constraint Example

• OFFSET defines the delay of a signal external to the chip, relative to a clock. Internal clock delays are determined by the software.

[Figure: ADD0_IN enters the FPGA toward FF1, and FF2 drives OUT1. The clock period is 40 ns, Tarrival is 14 ns, and Tstable is 12 ns; the delays inside the FPGA are determined by software. The CLK waveform is marked at 0, 14, 20, 28, and 40.]

NET "CLK" PERIOD = 40;
NET "ADD0_IN" OFFSET = IN 14 AFTER CLK;
NET "ADD0_OUT" OFFSET = OUT 12 BEFORE CLK;
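The arithmetic behind the example can be sketched as follows. This is a simplified illustration, not how the Xilinx software computes its budgets: it assumes the time left for the FPGA is simply the clock period minus the external offset, and it ignores the internal clock distribution delay that the software accounts for. The function name is hypothetical.

```python
def io_budgets(period_ns, t_arrival_ns, t_stable_ns):
    """Return (input_budget, output_budget) in ns.

    input_budget:  time left after an input arrives (OFFSET IN ... AFTER)
    output_budget: time from the clock edge until the output must be
                   stable (OFFSET OUT ... BEFORE the next edge)
    Simplifying assumption: internal clock delay is ignored.
    """
    if not (0 <= t_arrival_ns < period_ns and 0 <= t_stable_ns < period_ns):
        raise ValueError("offsets must be non-negative and shorter than the period")
    return period_ns - t_arrival_ns, period_ns - t_stable_ns


# Values from the example: PERIOD = 40, OFFSET IN 14, OFFSET OUT 12
input_budget, output_budget = io_budgets(40, 14, 12)
print(input_budget, output_budget)
```

With the example values, the output must be valid 28 ns after the clock edge, which matches the 28 mark on the waveform.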

Constraint Recommendations

• Use a given TIMESPEC name for only one path
• Keep constraints in one source
– Either in the UCF file or in schematics, but not both

• Avoid over-constraining the design
– Design performance suffers
  • Critical timing paths get the best placement and fastest routing options
  • As the number of critical paths increases, routability decreases
– Run times increase

• More information in the On-Line Docs:
– Libraries Guide
– Development Systems Reference Guide, Using Timing Constraints, UCF sections

• Schematic users: for path-specific constraints, vendor documentation may be necessary

Question

• Given the following:

Clock Frequency = 20 MHz
Tarrival = 31 ns = delay from CLK to input pin D1 of IC2
Tstable = 27 ns = delay (including setup) from O1 to D pin of FF3 (IC3)

Fill in the constraints below:

NET _____ PERIOD = _____ NS;
NET _____ OFFSET = IN _____ AFTER CLK;
NET _____ OFFSET = OUT _____ BEFORE CLK;

[Figure: IC1, IC2 (the device under development), and IC3, all clocked by CLK, with combinatorial blocks C1 to C4, input pin D1, and output pin O1]

Answers:
CLK, 50
D1, 31 ns
O1, 27 ns

Path and Block Specific Constraints

• Why use path or block specific constraints?
– To decrease speed requirements wherever possible
– To increase routability and overall speed of the design
– To decrease software run-time

• General Methodology
– Use PERIOD and OFFSET to constrain the design globally
– Use specific “FROM-TO” constraints to modify timing for specific blocks or paths

“FROM-TO” Constraint Example

• Consider the example shown below with TIMESPEC:

TIMESPEC TS01 = FROM PADS TO PADS 21;

• TS01 is applied to both paths, Y to OUT1 and Z to OUT2.
• TS01 over-constrains the path from Z to OUT2.
– Tight constraints decrease routability and increase run time

[Figure: pad-to-pad paths Y to OUT1 and Z<0:31> to OUT2, one with 1 level of logic and one with 2 levels of logic, both constrained to 21 ns; the design also contains X, CLK, FF1, and FF2]

“FROM-TO” Constraints

• The two paths could be constrained with two commands:

TIMESPEC TS01 = FROM PADS(Y) TO PADS(OUT1) 21;
TIMESPEC TS02 = FROM PADS(Z) TO PADS(OUT2) 28;

• “FROM:TO” constraints can start and stop at flip-flops (use “FFS”), LATCHES, PADS, or RAMS
• Examples:
– Constrain all inputs to all flip-flops in block NEWFIE:

TIMESPEC TS03 = FROM PADS TO FFS(NEWFIE) 18 ns;

– Constrain all flip-flop to flip-flop paths in the design:

TIMESPEC TS04 = FROM FFS TO FFS 15 ns;

– Constrain all flip-flop to output paths in the design:

TIMESPEC TS05 = FROM FFS TO PADS 25 ns;

Creating Groups with TNM

• The TNM constraint creates a group of individual components
• Example: divide flip-flops into two groups based on instance name

INST SLOWFF* TNM = SLO;
INST FASTFF* TNM = FST;

• TIMESPECs are assigned to the new groups:

TIMESPEC TS14 = FROM FFS TO SLO 40 NS;
TIMESPEC TS15 = FROM FFS TO FST 20 NS;

• Greater flexibility in routing is achieved by creating a different timing requirement for each of these two groups

[Figure: registers REG1 and REG2 and combinatorial block COMB3 driving flip-flops SLOWFF1, SLOWFF2, FASTFF1, and FASTFF2]

Pre-Scaled Counter Example

• Highest speed is required in the pre-scaled block
– Constrain the two counter blocks separately to avoid over-constraining COUNT12

• Define two groups for use in TIMESPEC. Example UCF file:

INST FFS(PRE2) TNM = PRE;
INST COUNT12 TNM = UPPER;
TIMESPEC TS_PRE = FROM PRE TO PRE 60 MHZ;
TIMESPEC TS_TC2CE = FROM PRE TO UPPER 60 MHZ;
TIMESPEC TS_UPPER = FROM UPPER TO UPPER 15 MHZ;

[Figure: 2-bit prescaler PRE2 (Q0, Q1) whose TC output drives the CE input of 12-bit counter COUNT12 (Q2 to Q13)]
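Converting the frequency-based TIMESPECs above into equivalent delay budgets makes the benefit of separate groups easier to see. The helper below is a hypothetical illustration of that conversion, not a Xilinx utility.

```python
def mhz_to_ns(freq_mhz):
    """Convert a frequency in MHz to the equivalent period in ns."""
    if freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 / freq_mhz


# Frequencies taken from the example UCF file above.
TIMESPECS_MHZ = {"TS_PRE": 60, "TS_TC2CE": 60, "TS_UPPER": 15}

for name, freq in TIMESPECS_MHZ.items():
    print(f"{name}: {freq} MHz = {mhz_to_ns(freq):.2f} ns")
```

Paths inside the upper counter get four times the delay budget of the prescaler paths, which leaves the fastest routing resources for the logic that needs them.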

Creating Groups with TIMEGRP

• Another way to constrain this design is by creating smaller groups of endpoints
• The TIMEGRP constraint is used to create new groups from other groups
• FFS, LATCHES, RAMS, and PADS are predefined groups
• Example: the ALL_FFS group contains all flip-flops whose instance name begins with SLOWFF or FASTFF:

INST SLOWFF* TNM = SLO;
INST FASTFF* TNM = FST;
TIMEGRP ALL_FFS = FFS (FST* : SLO*) ;

Select One Path From Many Paths

• Use to constrain one path among several parallel paths
• First identify the path to be constrained with TPTHRU, then use THRU in the TIMESPEC constraint
• Example: constrain the path through component ABC

NET RED TPTHRU = ABC;
TIMESPEC TS_FIFOS = FROM RAMS(FIFORAM) THRU ABC TO FFS(MY_REG*) 25;

[Figure: fiforam driving registers my_reg00 to my_reg03 over parallel paths; net RED is marked TPTHRU=ABC]

Forward Tracing

• Forward tracing occurs when a constraint is assigned to a net
• The constraint is applied to all global endpoints driven by the net
• Example: constrain nets driven by DATA0 to flip-flops in block CNT25:

NET "DATA0" TNM = MYBUS;
TIMESPEC TS_REGCNT = FROM MYBUS TO FFS(CNT25) 30 NS;

[Figure: net DATA0 fans out to blocks CHEW, BONE, BARK, and CNT25; TS_REGCNT covers the path to CNT25]

Ignoring Paths with TIG and NET

• The Timespec Ignore (“TIG”) attribute ignores a TIMESPEC for a specific path or net

• Ex: Assume that net DOG_SLOW is covered by two constraints, TS01 and TS02. The following specification ignores TS01, so only TS02 is applied to DOG_SLOW.

NET "DOG_SLOW" TIG = TS01;

• Example to ignore a slow path between registers:

INST REGA* TNM = REGA;
INST REGB* TNM = REGB;
TIMESPEC TS_TIG01 = FROM FFS(REGA) TO FFS(REGB) TIG;

• TIG improves software run-time and routability of the design

Other Constraint Constructs

• Use “EXCEPT” to filter a group of endpoints.

INST FASTFF* TNM = FST;
TIMEGRP SLO = FFS EXCEPT FST;

• TPSYNC allows definition of end points that are not FFS, RAMS, PADS, or LATCHES.

NET "BLUE" TPSYNC = BLUE_S;
TIMESPEC TS_1A = FROM FFS TO BLUE_S 15 NS;

• Signal skew for logic driven by clocks can be constrained using the MAXSKEW constraint

NET "$1I3245/$SIG_6" MAXSKEW = 3;

Specifies a maximum 3 ns difference between the arrival times at all destinations of net $1I3245/$SIG_6. The skew of global nets cannot be constrained (their skew is fixed).
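Grouping with TNM, TIMEGRP, and EXCEPT amounts to set operations on endpoint names. The sketch below models that idea with Python sets; the flip-flop names and the `tnm_by_prefix` helper are hypothetical, and the model says nothing about how the Xilinx tools store groups internally.

```python
# Hypothetical flip-flop instance names; FFS stands in for the predefined group.
FFS = {"SLOWFF1", "SLOWFF2", "FASTFF1", "FASTFF2", "REG1", "REG2"}


def tnm_by_prefix(prefix, instances):
    """Model 'INST <prefix>* TNM = ...': collect instances by name prefix."""
    return {name for name in instances if name.startswith(prefix)}


FST = tnm_by_prefix("FASTFF", FFS)
SLO_BY_NAME = tnm_by_prefix("SLOWFF", FFS)

ALL_FFS = FST | SLO_BY_NAME   # union of two named groups
SLO_BY_EXCEPT = FFS - FST     # models: TIMEGRP SLO = FFS EXCEPT FST

print(sorted(ALL_FFS))
print(sorted(SLO_BY_EXCEPT))
```

Note the difference between the two definitions of a slow group: EXCEPT sweeps in every flip-flop that is not in FST (here REG1 and REG2 as well), whereas a name-based TNM only picks up instances that match the pattern.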

Constraint Priority

• Not all constraints are created equal
– Highest priority: Timing ignores (TIG)
  FROM:THRU:TO specs
  FROM:TO specs
– Lowest priority: PERIOD specs

• “FROM:TO” constraints are further prioritized:
– Highest: FROM PATH-SPECIFIC TO PATH-SPECIFIC
  FROM PATH-SPECIFIC TO GLOBAL
– Lowest: FROM GLOBAL TO GLOBAL
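The ordering above can be expressed as a small lookup that picks which of several overlapping constraints governs a path. This is a sketch of the ranking only, with hypothetical constraint names; the actual resolution inside the Xilinx software may consider more detail, such as the FROM:TO sub-priorities.

```python
# Constraint kinds listed from highest to lowest priority, as on the slide.
PRIORITY = ["TIG", "FROM:THRU:TO", "FROM:TO", "PERIOD"]


def winning_constraint(constraints):
    """Pick the highest-priority (name, kind) pair covering one path."""
    if not constraints:
        raise ValueError("no constraints cover this path")
    return min(constraints, key=lambda c: PRIORITY.index(c[1]))


# Hypothetical constraints that all cover the same path.
covering = [("TS_CLK", "PERIOD"), ("TS04", "FROM:TO"), ("TS_FIFOS", "FROM:THRU:TO")]
print(winning_constraint(covering))
```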

Section V

Special Topics

Section V Agenda

• DSP Design with FPGAs

• New Developments in Programmable Logic

• Virtex, XC6200 and Reconfigurable Logic

• FPGA versus ASIC costs

• Xilinx Student Edition

• Xilinx University Program participation


DSP Design with FPGAs

FPGAs Provide Outstanding DSP Performance

[Figure: a DSP processor with a single multiplier and adder compared with an FPGA containing N parallel multipliers]

DSP Processor: sequential processing, fixed architecture, complex real-time software

FPGA: parallel processing, configurable to specific needs, no software programming

FPGAs Lower the Cost of High Performance DSP

[Figure: cost ($, 100 to 500) versus relative performance (5 to 20) for µP/PDSP and FPGA-Based DSP]

Customer Successes

• TIM40 Module using FPGAs (XC4010): 3 times the price at 175 times the TI TMS320C40 performance
• DNA Matching (XC4010): similar performance at 1/20th the price
• 128-Track Audio Recording Studio (XC3190): 3 times the functionality at 1/10th the price

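FIR Filter Example

[Figure: FIR filter implementing the sum of products equation. Sample data, N bits wide, passes through a delay line K taps long (X0, X1, X2, ...); each tap is multiplied by one of K coefficients (multipliers labeled XC0, XC1, XC2, ...), and the products are summed (K multiplies, K sums) to form the output data.]

• CLOCK = Multiply Time
• Sample Rate = Clock Rate
• IMPLEMENTATION ???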

Traditional FIR Filter Implementation

General-Purpose DSP

– PERFORMANCE = 1 / (MAC cycle time × Number of Taps)

– TMS320: MAC cycle time = one clock cycle
  10-bit, 20-tap filter with 50 MHz TMS320 = 2.5 MHz
  Additional filter taps slow performance

– Pentium: MAC cycle time = 11 clock cycles
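The performance formula can be checked numerically. The sketch below is a direct transcription of the formula with hypothetical function and parameter names; it assumes the MAC cycle time is expressed as a whole number of clock cycles.

```python
def sequential_fir_sample_rate_hz(clock_hz, mac_cycles, taps):
    """Sample rate of a processor that performs one MAC at a time.

    Each output sample needs `taps` multiply-accumulates, and each MAC
    takes `mac_cycles` clock cycles.
    """
    return clock_hz / (mac_cycles * taps)


# 20-tap filter on a 50 MHz TMS320 with a one-cycle MAC (values from the slide)
print(sequential_fir_sample_rate_hz(50e6, 1, 20))

# Doubling the number of taps halves the achievable sample rate
print(sequential_fir_sample_rate_hz(50e6, 1, 40))
```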

Distributed Arithmetic (DA) Filter Design

[Figure: sample data (n bits) enters a parallel-in, serial-out shift register; the serial bits form the address (ADRS) of a look-up table whose DATA output feeds an adder (inputs A and B) and a register with a 2^-1 scaler in the feedback path, producing the filtered data out]

8 WORD X N BIT LOOK UP TABLE

Address   Contents
000       0
001       C0
010       C1
011       C1 + C0
100       C2
101       C2 + C0
110       C2 + C1
111       C2 + C1 + C0

PERFORMANCE = Clock Frequency / Number of Bits in Sample

10-bit, 20-tap filter using XC4000 at 50 MHz = 5 MHz
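The look-up table contents and the performance formula above can be generated programmatically. This is a minimal sketch under two assumptions: address bit i selects coefficient Ci, as in the table, and the coefficient values 4, 5, and 6 are borrowed from the 3-bit example purely for illustration.

```python
def build_da_lut(coeffs):
    """Build the distributed arithmetic look-up table.

    coeffs[i] is Ci. Entry `addr` holds the sum of every Ci whose
    address bit i is set, so the table has 2**len(coeffs) words.
    """
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_sample_rate_hz(clock_hz, sample_bits):
    """Bit-serial DA processes one sample bit per clock, for all taps at once."""
    return clock_hz / sample_bits


lut = build_da_lut([4, 5, 6])  # C0 = 4, C1 = 5, C2 = 6 (illustrative values)
for addr, word in enumerate(lut):
    print(f"{addr:03b} -> {word}")

print(da_sample_rate_hz(50e6, 10))
```

Unlike the sequential MAC approach, the sample rate here depends on the sample width rather than on the number of taps, although more taps need a larger table or several tables.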

Distributed Arithmetic - 3 bit Example

Data x Coefficient:

D2 x C2 = 100 x 110
D1 x C1 = 111 x 101
D0 x C0 = 011 x 100

Coefficient x Data, written as partial products (one row per data bit, least significant bit first):

C2 x D2 = 110 x 100: 000, 000, 110
C1 x D1 = 101 x 111: 101, 101, 101
C0 x D0 = 100 x 011: 100, 100, 000

The least significant bits of D2, D1, D0 are 0 1 1 = LUT Address ==> (C1 + C0) from the previous slide
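The 3-bit example can be worked through in code. The sketch below assumes unsigned data and coefficients and shifts each table output left by its bit position; the hardware on the previous slide instead scales the accumulator by 2^-1 each cycle, which gives the same result up to a fixed scale factor. Function names are hypothetical.

```python
def da_dot(coeffs, data, bits):
    """Sum of products computed one data bit position at a time.

    coeffs[i] is Ci and data[i] is Di (unsigned, `bits` wide). For each
    bit position, the corresponding bit of every Di forms a LUT address.
    """
    lut = [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << len(coeffs))
    ]
    acc = 0
    for b in range(bits):
        addr = sum(((d >> b) & 1) << i for i, d in enumerate(data))
        acc += lut[addr] << b  # weight the partial sum by its bit position
    return acc


COEFFS = [0b100, 0b101, 0b110]  # C0, C1, C2 from the example
DATA = [0b011, 0b111, 0b100]    # D0, D1, D2 from the example

print(da_dot(COEFFS, DATA, 3))
print(sum(c * d for c, d in zip(COEFFS, DATA)))  # direct multiply-accumulate
```

Both lines print the same value, and the first LUT access uses address 011, selecting C1 + C0 as stated above.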

Resource Tradeoffs for Higher Performance

[Figure: CLBs (100 to 400) versus number of filter taps (16 to 80) for four implementations: Serial Sequential, Bit-Serial Distributed Arithmetic, Double-Rate Distributed Arithmetic, and Fully-Parallel Distributed Arithmetic. Sample-rate labels on the chart: 100 Hz to 100 kHz, 8.1 MHz, 16.2 MHz, and 66 MHz.]

XC4085XL 10 Times Faster Than TMS320C6x

16 bit FIR Filter Benchmark

[Figure: billions of multiply-accumulates (MACs) per second, on a scale of 1 to 8, for the TMS320C6x (0.25u, 200 MHz) and the 4005XL, 4013XL, 4036XL, 4062XL, and 4085XL; XC4000XL using 80 MHz clock rate]

FPGA DSP is Lower Cost

[Figure: price per million MACs per second (16-bit word), on a scale of $0.05 to $0.25, for the TMS320C6x (25,000 pcs) and a Xilinx FPGA (25,000 pcs)]

Where FPGA-Based DSP is Used

• High Data Rates
– 1 to 70 M samples/sec
• High Complexity
– 10’s to 100’s of MACs in a single chip
• Fixed-Point Data
• Audio, Video, Radio & Voiceband Modems, HDTV

[Figure: data rate (samples per second, 1k to 1G) versus algorithm complexity (less complex to more complex), with regions for MPU/MCU, Single-Chip DSP, Multiple DSP Cores or Chips, FPGA-Based DSP, and ASIC]

DSP / FPGA Design Methodology

Xilinx CORE Generator 1.4 available now!

[Figure: coefficients from third-party DSP software feed the CORE Generator; the generated core is instantiated into a schematic or HDL design, followed by place and route, post-route simulation, and a bit stream for download cable or EPROM]

XC4000 Resource Cross Reference Chart (Bit-Serial Implementation)

Number of XC4000 CLBs, by number of taps (rows) and word size in bits (columns):

TAPS      6     8    10    12    14    16    18    20    22    24
  8      17    20    23    26    29    36    39    42    45    48
 16      37    44    51    58    65    80    87    94   101   108
 24      57    68    79    90    91   124   135   146   157   168
 32      77    92   107   122   137   168   183   198   213   228
 40      97   116   135   154   173   212   231   250   269   288
 48     117   140   163   186   209   256   275   302   325   348
 56     137   164   191   218   245   300   327   354   367   408

M samples/sec @ 50 MHz:
        8.3   6.3   5.0   4.2   3.6   3.1   2.8   2.5   2.3   2.1

[Figure: roadmap relating process technology (0.5u, 0.35u, 0.25u, 0.18u, 0.15u), density (25K, 100K, 500k, 1 Million, and 10 Million system gates), performance (50, 100, 133, 150, and 300 MHz), and design methodology (Schematic, HDL, Cores, Modular, Team-Based)]

The Road Ahead: New Developments in Programmable Logic

Process Technology and Supply Voltage

Xilinx leads the PLD industry in fab technology.

Fab partners use FPGAs to drive their process.

• Lower cost
• Faster speed
• Higher density
• Lower power

[Figure: feature size (0 to 1.2) versus year (1990 to 2002), with supply voltage stepping down from 5 V through 3.3 V, 2.5 V, and 1.8 V to 1.3 V; “Today” is marked on the time axis]

Advanced Process Technology

0.5u Process             0.25u UMC Process
- LOCOS isolation        - shallow trench isolation
- bird's beak            - 0.9u metal pitch
- no planarization       - CMP
- only contact plug      - plug for all vias

Process & Density Leadership

10 Million System Gates in 2002!

[Figure: density (system gates: 180k, 250k, 500k, 1M, 2M, 10M) versus year (1997 to 2002). Labeled devices: XC4085XL; XC40125XV (industry’s 1st 0.25u PLD, 25M transistors, 5LM); XC40250XV; Virtex (75+M transistors); Virtex II; 10M gates in 2002]

Architecture Innovation & Leadership

[Figure: features versus year (1998 to 2002), in three groups]

• Distributed Dual Port RAM, IO Registers, Internal Bussing, 5V Tolerant I/O, 3.3V and 5V PCI
• Block Dual Port RAM, Multiple Standard I/O, Vector Based Interconnect, Phase Locked Loops, 66 MHz 64-Bit PCI
• Reconfigurable Logic, On-Chip AD/DA, Embedded Functions, 1GHz Diff. Interface, Built-in Logic Analyzer

Performance Leadership

[Figure: system clock rate* (MHz, 0 to 300) versus year (1995 to 2002), annotated with three groups of target applications]

• 100 MHz SDRAM I/F, 100 MHz DSP for Wireless Base Station, 33 MHz PCI
• 133 MHz SDRAM I/F, 155 MHz SONET, 66 MHz PCI
• 233 MHz UP, 300 MHz RAM I/F, 133 MHz PCI

* 1/(Tsetup+Tclock-to-out)

Packaging Leadership

[Figure: pin count (100 to 1000) versus year (1998, 2000, 2002). Packages shown: PLCC, PGA, PQFP, HQFP, BGA, SBGA, Fine Pitch BGA, Chip Scale, and Flip Chip Technology. Pitch labels: 1.27mm, 1.0mm, <0.8mm.]

Compile Time Leadership

1999 Goal: 1 Million Gates in 45 minutes!

• With Faster CPUs
• Faster Compile Times
• Modular Compile

[Figure: compile time in minutes* (0 to 250) by software release (1.3, 1.4, 1.5, 2.1, 2.2)]

* 100k system gate designs (200MHz Pentium)

F1.5 Features

• Tight integration
– FPGA Express inside Foundation Project Manager
– Single Project Management / Flow Engine environment

• Improved ease of use
– Complete pushbutton

• New Virtex, XC9500XL support
• Improved FPGA Express synthesis runtimes & performance
• Improved PAR runtimes and performance

Xilinx Smart-IP Delivers...

• High Flexibility
• High Predictability
• High Performance

Xilinx Smart-IP Technology: Intelligent Software Implementation, Architectures tailored to cores, Flexible Core Technology

Performance + Time to Market

Leader in Core Solutions

Xilinx and Partners’ COREs

By 2002: Virtually All Functions Available as Cores

1998
• Standard Bus Interfaces: PCMCIA, USB, CAN Bus, ISA PnP, I2C, PCI 32bit
• DSP Functions: Add, Subtract, Integrate; Correlators; Filters: FIR, Comb; Multipliers; Transforms: FFT, DFT; Sin/Cos
• Communication & Networking: ATM Cell Assembly/Delineation; CRC-16/32; T1 Framer; HDLC; Reed-Solomon, Viterbi; UTOPIA, 25/33/50 MHz
• Base Level Functions: 82xx, UARTs, DMA; 66 MHz DRAM/SDRAM I/F; Memory (RAM, ROM, FIFO); Micro Sequencer (2901); Proprietary RISC Processors

1999
• Standard Bus Interfaces: CardBus, FireWire (100-400 Mbps), PCI 64bit/66MHz, PC104, VME
• DSP Functions: DCT, Cordic, DES, Divider, JPEG, NCO
• Communication & Networking: 10/100 Ethernet; 1Gb Ethernet; ADSL, HDSL, XDSL; ATM/IP Over SONET; SONET OC3/12
• Base Level Functions: Microprocessor I/Fs, 8051/8031, IEEE 1284, MIPS, 133+ MHz SDRAM I/F

2000
• Emerging High-Speed Standard Interfaces
• DSP Processor I/Fs
• DSP Functions > 200 MSPS
• Programmable DSP Engines
• QAM
• Modems
• SONET OC48
• Emerging Telecom and Networking Standards
• Satellite decoders
• Speech Recognition
• Advanced processors

Architecture Tailored to Cores: Segmented Architecture

Segmented Routing
• Efficient Routing
• Predictable Timing
• Low Power Consumption

Non-Segmented Routing
• Wasted Routing
• Unpredictable Timing
• High Power Consumption

[Figure: Core1 and Core2 placed in each routing architecture]

Architecture Tailored to Cores: Distributed RAM

• Portable RAM Based Cores
• Improves Logic Efficiency by 16X
• High Performance Cores

RAM Available Locally To The Core

Intelligent Software: Pre-defined Placement & Routing

Enhances Performance & Predictability

• Relative Placement: guarantees I/O & logic predictability
• Fixed Placement: guarantees performance
• Fixed Placement & Pre-defined Routing: other logic has no effect on the core

Smart-IP Delivers Performance

Smart-IP Performance Is Independent of Number of Cores in a Design

[Figure: speed (MHz, 50 to 80) of a 12x12 multiplier versus number of cores (1, 2, 4, 8), Xilinx segmented versus non-Xilinx non-segmented]

Smart-IP Delivers Portability

Smart-IP Performance Is Independent of a Core’s Placement in the Device

[Figure: the core runs at 80 MHz in each of four placements]

Smart-IP Delivers Transportability

Smart-IP Performance Is Independent of Device Size

[Figure: the core runs at 80 MHz in each of three device sizes]

Non-Segmented Architecture May Experience 30% Performance Degradation

Xilinx Architecture for Fastest Performance

Segmented Interconnect Structure Provides Faster Logic Cell Connections

[Figure: Xilinx segmented interconnect versus non-segmented interconnect. Each side shows logic blocks 1 to n in a row, a logic block in the next row, and a connection across the chip, annotated with relative delays (1x, 3x, 4x, 6x).]

High Value Cores with Spartan

Core Function                            XCS30XL Price*   Percentage of Device Used   Effective Function Cost
UART                                     $6.95            17%                         $1.20
16-bit RISC Processor                    $6.95            36%                         $2.50
16-bit, 16-tap Symmetrical FIR Filter    $6.95            27%                         $1.90
Reed-Solomon Encoder                     $6.95            6%                          $0.40
PCI Interface (w/ faster speed grade)    $12.00           45%                         $5.40

*100,000 units, mid-1999 projection
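The effective function cost column appears to be the device price multiplied by the fraction of the device the core occupies, rounded to the nearest ten cents. The sketch below checks that reading against the table; the rounding rule is an assumption inferred from the numbers, not something the slide states.

```python
# (core, device price in dollars, percent of device used) from the table above
CORES = [
    ("UART", 6.95, 17),
    ("16-bit RISC Processor", 6.95, 36),
    ("16-bit, 16-tap Symmetrical FIR Filter", 6.95, 27),
    ("Reed-Solomon Encoder", 6.95, 6),
    ("PCI Interface", 12.00, 45),
]


def effective_cost(price, percent_used):
    """Device price scaled by utilization, rounded to the nearest $0.10."""
    return round(price * percent_used / 100, 1)


for name, price, percent in CORES:
    print(f"{name:<40}${effective_cost(price, percent):.2f}")
```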


Virtex, XC6200 and Reconfigurable Logic

It’s About Time!

[Figure: “TRADITIONAL THINKING” graphic with the labels RAD, D, DSP, and XC6200]

Virtex Enables System on a Programmable Chip

[Figure: Designer #1 and Designer #2, working in VHDL and Verilog design environments with CoreGen, combine new modules and IP modules (LogiCore, AllianceCore, design reuse) such as DSP, 133 MHz SDRAM, Gbit Ethernet, 66 MHz PCI, FIFO, and CPU in a single Virtex device]

• 160 MHz I/O Performance
• 133 MHz Memory Performance
• 1 Million System Gates

Virtex Series Overview

• New FPGA architecture, similar to XC4000
• 0.25 and 0.18 micron 5LM process
• Segmented routing
• SelectRAM+ offers 3 types of RAM
– Distributed SelectRAM
– Block SelectRAM (new)
– High-speed access to external memory (new)
• Traditional and low-voltage support
– CMOS, TTL
– LVTTL, LVCMOS, GTL+, and SSTL3
• 250K - 1M system gates in 1998
• Some XC6200-like features
– Ideal for reconfigurable logic
– Dynamic and partial reconfiguration

Virtex Functional Block Diagram

[Figure: Virtex functional block diagram showing CLBs with segmented routing, SelectI/O pins (66 MHz PCI, SSTL3), Distributed SelectRAM memory, Block SelectRAM memory, Phase Locked Loop (PLL), and vector-based interconnect (delay = f(vector))]

Xilinx 0.25u 5 Volt-Compatible FPGAs

• 4KXL / 4KXV family migration is possible if you plan for:
– Additional power/ground pins
– Dedicated clock and configuration pins

• Voltage migration guide to help users

[Figure: any 5 V device (XC4000E), any 3.3 V device (XC4000XL), and Virtex & XC4000XV (2.5 V logic, 3.3 V I/O) connected together, with each device's I/O supply and logic supply marked; the interfaces are labeled “Meets TTL levels” and “Accepts 5 V levels”]

Virtex FPGA Performance

• 100+ MHz internal speeds
– 155 MHz SONET data stream processing
– 100+ MHz pipelined multipliers
– 66 MHz PCI
• 100+ MHz system interface speeds

                         without PLL   with PLL
Tco (output register)    6 ns          3.5 ns
Tsu (input register)     3 ns          3 ns
Th (input register)      0 ns          0 ns
Max I/O performance      110 MHz       160 MHz

Segmented Routing Interconnect

[Figure: CLBs, each with two pairs of logic cells (2 LCs + 2 LCs) and carry logic, connected through a switch matrix and 3-state busses.]

• Fast local routing within CLBs
• General purpose routing between CLBs
• Fast interconnect
– 8 ns across 250,000 system gates
• Predictable for early design analysis
• Optimized for five-layer-metal process

Virtex Configurable Logic Block

[Figure: CLB made of two pairs of logic cells (2 LCs + 2 LCs). Each logic cell has a 4-input LUT (inputs I3-I0, output O, write inputs WI and DI), carry and control logic (CI, CO), and a register (D, CE, CLK, Q, with PR and RS).]

• Polarity of all control signals selectable
• Fast arithmetic and multiplier circuitry
• Optimized for synthesis

SelectRAM+ Memory Features

• Distributed SelectRAM Memory
– Pioneered in XC4000 family
– 16x1 synchronous SRAM implemented in LUT
– Ideal for DSP applications
– Access over 100 billion bytes/sec
• Block SelectRAM Memory
– Up to 32 4,096-bit blocks of dual-port synchronous SRAM
– Configurable widths of 1, 2, 4, 8, and 16
– Ideal for data buffers and FIFOs
– Up to 17 gigabytes/sec access
• Fast Access to External RAM
– Direct interface to SSTL3, 3.3V synchronous DRAM standard
– 133 MHz

Block RAM

• Configure as: 4096 bits with variable aspect ratio
• 8-32 blocks across family devices
• True dual-port, fully synchronous operation
– Cycle time <10 ns
• Flexible block RAM configuration
– 5 blocks: 2K x 10 video line buffer
– 1 block: 512 x 8 ATM buffer (9 frames)
– 4 blocks: 2K x 8 FIFO
– 9 blocks: 4K x 9 FIFO with parity

[Figure: RAMB4 dual-port block. Port A: WEA, ENA, CLKA, ADDRA, DINA, DOA. Port B: WEB, ENB, CLKB, ADDRB, DINB, DOB.]
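The block counts quoted for the example buffers follow from the 4096-bit block size and the configurable widths of 1, 2, 4, 8, and 16. The sketch below is one way to reproduce them, assuming each block is set to the widest aspect ratio that still covers the required depth and that blocks sit side by side to build up the word width. The helper name `blocks_needed` is hypothetical, and memories deeper than one block are not modelled.

```python
import math

BLOCK_BITS = 4096            # bits per block RAM, from the slide
WIDTHS = (16, 8, 4, 2, 1)    # configurable data widths, widest first


def blocks_needed(depth: int, width: int) -> int:
    """Blocks required for a depth x width memory built from 4096-bit blocks."""
    for block_width in WIDTHS:
        block_depth = BLOCK_BITS // block_width
        if block_depth >= depth:
            # Widest configuration that is deep enough: tile it across the word.
            return math.ceil(width / block_width)
    raise ValueError("depth exceeds one block; cascading in depth is not modelled")


print(blocks_needed(2048, 10))  # video line buffer
print(blocks_needed(512, 8))    # ATM buffer
print(blocks_needed(2048, 8))   # FIFO
print(blocks_needed(4096, 9))   # FIFO with parity
```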

XC6200 Reconfigurable Processing Unit

[Figure: XC6200 RPU connected to a CPU, memory, and I/O.]

• 1000x improvement in reconfiguration time from external memory
• FastMAP™ assures high-speed access to all internal registers
• All registers accessed via built-in low-skew FastMAP™ busses
• Microprocessor interface built in
• High-capacity distributed memory permits allocation of chip resources to logic or memory
• Ultrafast partial reconfiguration fully supported
• XC6264 - up to 100,000 gates

XC6200 Architecture

[Figure: function cells grouped into 4x4 blocks and 16x16 tiles*, surrounded by user I/Os on all four sides, with the FastMAP™ interface (address, data, control).]

*Number of tiles varies between devices in family

How Dynamic Reconfiguration Helps
Example: DSP

[Figure: 3D graphics DSP algorithms (texture, shadow, reflections, perspective, edge) on three platforms. PDSP: one function at a time. FPGA: two or more functions at a time. Reconfiguration-optimized FPGAs: some functions run while others are loading, so all functions are done in time.]

Reconfiguration advantages: lower cost by reusing silicon for multiple functions over time, OR 10-500x performance increase in hardware versus software implementation

Reconfigurable Logic - Research vs. Component $

[Figure: performance vs. problem size, with regions for Computer and Embedded Microprocessor, "Zillions of Research Dollars" (1), region (2), and "Zillions of Component Dollars" (3). Graph courtesy of Nick Tredennick.]

Reconfigurable Logic research has typically focused on reconfigurable computing (1). But there are really two potential markets: high-end embedded computing (2) and the low-cost embedded market (3).

XC6200 Dynamic & Partial Reconfiguration

[Figure: reconfiguration time on a scale from ns to s, comparing XC4013 and XC6216 for design swapping, block swapping, circuit updates, and rewiring. Marked times: 40 ns, 200 us, 250 ms.]

Directions in Reconfigurable Logic

• XC6200 was first Xilinx product to [...]
– XC6200 chips & XACT6000 software are available, but no further product development
– Divergent architecture and incomplete tools support
– XUP support for research only, not classes: Adaptive or Reconfigurable Logic, Place & Route algorithms
• Key XC6200 features brought into mainstream families (Virtex)!
– Dynamic & Partial reconfiguration
– Full industry and software support
– Easier to design to
– New Reconfigurable Logic curriculum should use Virtex
• Virtex-ready PCI board available from Virtual Computer Corp.
• Further info: http://www.xilinx.com/xup/6200rc.htm


FPGA versus ASIC Costs

Pad-Limited Die Size

[Figure: two dies, each a core surrounded by I/O pads.]

• Core-limited (mid-high density): gate count determines die size
• Pad-limited (low density): I/O count determines die size
• As processes migrate, FPGA cost = gate array cost

FPGA Price Leadership

Year   Family                    Price*     Process     Supply
1998   Spartan                   $3.95      0.5 3LM     5 Volt
1999   SpartanXL                 $2.95      0.35 5LM    3.3 Volt
2000   Spartan-II                < $2.00    0.25 5LM    2.5 Volt
2002   Spartan Next Generation   < $1.50    0.18        1.8 Volt

More Features Without Compromises
• Pricing competitive with ASICs
• High Performance
• On-chip SelectRAM™

*Prices are for 5K system gates, 100K units, -3 speed, Lowest Cost Package

CPLD Price Leadership

[Figure: price vs. year, 1998 to 2002. XC9536 falls from $1.80 to $0.80; XC95216 falls from $15 to $9.]

Without Compromises
• Flexible ISP
• Highest Performance
• Pin-Locking
• Full JTAG

* Prices are based on 100Ku+, slowest speed grade, lowest cost package

Priced for High-Volume Leadership

[Figure: density (system gates) available at the $10 and $20 price points, 1997 to 2002, from 100K unit volume price projections. Plotted densities: 15K, 25K, 40K, 60K, 100K, 200K. Callout: 10K gates/$ in 2002!]

New Applications
• Set Top Box
• DVD
• Digital Camera
• PC Peripherals
• Consumer Electronics

The Real Cost of Ownership

• Even in mid & high density, FPGAs often have cost advantage
• FPGA vs ASIC goes far beyond obvious unit cost calculations
• Real comparison includes real factors

Programmable FPGA
(-) Higher unit cost
(+) Standard product
(+) Off-the-shelf delivery
(+) Fast time to market
(+) No Non-Recurring Engineering fee
(+) No inventory risk
(+) Fully factory tested
(+) Simulation helpful
(+) In-circuit verification

Gate Array (Application Specific Integrated Circuit)
• Lower unit cost
• Custom product
• Months to manufacture
• Slow time to market
• NRE+
• Customer specific
• User test development
• Simulation critical
• No in-circuit verification

Cost Calculations - Basic Model

• Breakeven - Solve for X (units)

ASIC Cost = FPGA Cost

$25K NRE + $79K Engineering & Tools + X * $10 = $0 NRE + $25K Engineering & Tools + X * $30

X = 54K / 20

X = 2,700 units
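The breakeven volume is the point where the two straight-line cost curves cross. A small sketch of that calculation is below; `breakeven_units` is a hypothetical helper, not part of any Xilinx tool. The slide's worked line divides a $54K fixed-cost gap by the $20 unit-cost difference, whereas the line items as printed ($25K NRE + $79K against $25K) sum to a $79K gap, so the sketch evaluates both.

```python
def breakeven_units(asic_fixed: float, fpga_fixed: float,
                    asic_unit: float, fpga_unit: float) -> float:
    """Volume X where asic_fixed + X*asic_unit == fpga_fixed + X*fpga_unit."""
    if fpga_unit <= asic_unit:
        raise ValueError("no breakeven: FPGA unit cost must exceed ASIC unit cost")
    return (asic_fixed - fpga_fixed) / (fpga_unit - asic_unit)


# Fixed-cost gap of $54K, as used in the slide's worked line (54K / 20).
slide_x = breakeven_units(asic_fixed=79_000, fpga_fixed=25_000,
                          asic_unit=10, fpga_unit=30)

# Line items exactly as printed: $25K NRE + $79K versus $25K.
printed_x = breakeven_units(asic_fixed=25_000 + 79_000, fpga_fixed=25_000,
                            asic_unit=10, fpga_unit=30)

print(slide_x)    # 2700.0
print(printed_x)  # 3950.0
```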

Cost Calculations - Market Model

Being late to market costs Real $$

[Figure: revenue vs. time for on-time and delayed entry, with product life = 2W. The gap between maximum available revenue and maximum revenue from delayed entry is the lost revenue.]

Total ASIC Development = 32 weeks
Total FPGA Development = 11 weeks

% of Lost Revenue = (Delay * (3W - Delay) / (2W)^2) * 100
= (5.25 * (3*18 - 5.25) / 36^2) * 100
= 19.75%

Net Profit = Volume * (System Price - System Cost)
= ($2K - $1.1K) * (1K + 12K + 5K)
= $16,200,000

ASIC Cost = $25K NRE + $79K Engineering + .1975 * $16.2M Lost Profit (= $3.2M) + X * $10

FPGA Cost = $25K Engineering + X * $30

Breakeven, X = 162,700 units
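The market-model figures can be reproduced step by step. The sketch below follows the slide's numeric line, which divides by 36^2 (that is, (2W)^2 with W = 18), and reuses the $54K fixed-cost gap from the basic model together with the slide's rounded $3.2M lost profit. Function and variable names are illustrative. Reading the 5.25 delay as the 21-week development gap expressed in 4-week months is an assumption.

```python
def lost_revenue_fraction(delay: float, w: float) -> float:
    """Fraction of revenue lost to a late entry, as evaluated on the slide:
    delay * (3W - delay) / (2W)^2."""
    return delay * (3 * w - delay) / (2 * w) ** 2


W = 18          # half of the product life, in months
DELAY = 5.25    # months late (assumed: (32 - 11) weeks / 4 weeks per month)

fraction = lost_revenue_fraction(DELAY, W)
net_profit = (2_000 - 1_100) * (1_000 + 12_000 + 5_000)
lost_profit = fraction * net_profit

# Breakeven with the slide's rounded $3.2M and the $54K fixed-cost gap.
breakeven = (54_000 + 3_200_000) / (30 - 10)

print(f"lost revenue: {fraction * 100:.2f}%")
print(f"net profit:   ${net_profit:,}")
print(f"lost profit:  ${lost_profit:,.0f}")
print(f"breakeven:    {breakeven:,.0f} units")
```

Using the unrounded lost profit instead of $3.2M moves the breakeven by only a few tens of units.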

Hardwire Technology Model

• ASIC re-spin delay & expense risk: 30%
• PLD price reductions: 25% vs. 5% per year
• Hardwire Technology lowers FPGA cost 40-60%
– No additional design work or test vectors
– Preserves nets, placement, routing
– All FPGA characteristics maintained

Total ASIC Cost = $25K + $79K + $5.3M + $22.8K + $18.7K + X * $10

FPGA/Hardwire Cost = $25K Engineering + 1K * $30 + $18K NRE + (X - 1K) Units * $18

Breakeven, X = 674,000 units !!!

Download the Xilinx ASIC Estimator program at http://www.xilinx.com/products/hardwire/hardwire.htm to compare costs or learn more.
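The Hardwire breakeven can be checked directly from the two cost expressions. This is a minimal sketch using only the dollar figures printed on the slide; the function names are illustrative and it is not the ASIC Estimator program mentioned above.

```python
def asic_total(units: float) -> float:
    """Total ASIC cost: fixed terms from the slide plus $10 per unit."""
    return 25_000 + 79_000 + 5_300_000 + 22_800 + 18_700 + units * 10


def fpga_hardwire_total(units: float) -> float:
    """First 1K units as FPGAs at $30, then Hardwire at $18 after an $18K NRE."""
    return 25_000 + 1_000 * 30 + 18_000 + (units - 1_000) * 18


# Both totals are linear in `units`, so solve fixed_gap = unit_gap * X.
fixed_gap = asic_total(0) - fpga_hardwire_total(0)
unit_gap = 18 - 10
breakeven = fixed_gap / unit_gap

print(f"breakeven: {breakeven:,.1f} units")  # about 674,000 units
```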

Total Cost of Ownership - ASIC vs. FPGA

[Figure: total cost ($0 to $14,000,000) vs. units (0 to 800,000) for five curves: FPGA (B, M), ASIC (B), ASIC (M), FPGA (H), ASIC (H).]

B = Basic analysis
M = Market model
H = Hardwire model


Xilinx Student Edition

The Xilinx Student Edition

• Prentice Hall’s most requested new engineering product in Q1 ’98!
– Complete, affordable, and practical digital design course environment for all students
– Predeveloped and tested lab-based course
• Includes
– Foundation Series 1.3 for students’ computers
– Practical Xilinx Designer lab tutorial book
– Coupon for XS40-005XL and XS95-108 boards ($129)
• Sold through bookstores by Prentice Hall and www.Amazon.com, listed at $79 (ISBN 0136716296)
• Integrated tutorial projects cover: TTL, Boolean Logic, State Machines, Memories, Flip Flops, Timing, 4-bit and 8-bit processors
• Upgradeable for free to F1.4 Express with VHDL & Verilog, 40K gates, VHDL labs on the web Aug. 1

The Practical Xilinx Designer

• The Digital Design Process - basic concepts and TTL logic
• Programmable Logic Design Techniques - programmable logic introduction and Foundation tutorial
• Programmable Logic Architectures - XC9500 CPLD and XC4000 FPGA
• Combinatorial Logic Design - LED decoder circuit with both CPLDs and FPGAs
• Modular Designs and Hierarchy - step-wise refinement using Foundation
• Electrical Characteristics of Programmable Logic - I/O drivers, timing/delay models, and power consumption
• Flip-Flops - introduces sequential logic
• State Machine Design - design examples for counters, drink machine, etc.
• Memories - how to build memory with flip-flops and logic gates
• The GNOME Microcomputer - construction and improvement of a simple 8-bit microcomputer

Xilinx Student Edition Development Boards


Xilinx University Program Participation

Section Agenda

• Course recommendations

• How to learn more

• Contacts & Support

• Why use Xilinx?

• Products & Ordering
– Software
– Hardware

Course Recommendations

• See http://www.xilinx.com/programs/univ.htm

Trends in Teaching with PLDs

• Increasing density and Cores enable System-level design and test on an FPGA
– LogiCOREs available to all universities
– PCI, DSP, math, other complex functions
• VHDL or Verilog design is commonplace
• PLDs in many subjects beyond Digital Design and Computer Engineering
– System Level Design and Test
– Dynamically Reconfigurable Logic
– Digital Signal or Video Processing
– Network Design
• Prevalent usage in required EE, CS, CE courses
• Students use their own computers

How To Learn More (1)

• AppLinx CD / Xilinx data book
• On-line books, On-line Help
• Excellent on-line tutorials in Foundation & Express
• Xilinx Web Site
– Application notes
– Latest technical information and status
– Fast Technical Help
– Whatever it is, it’s probably there!
• Subscribe to XCELL Journal
• Xilinx Student Edition is a great practical guide

XUP Contacts & Support

• XUP Staff:
– Jason Feinsmith, XUP Manager ([email protected], USA 408-879-4961)
– Anna Acevedo, XUP Coordinator ([email protected], USA 408-879-5338)
– Chris Grundy, XUP European Liaison ([email protected], UK +44-1-932-333-523)
– XUP Website: http://www.xilinx.com/programs/univ.htm
• Xilinx commercial or university distributors
– Channel for product distribution, updates
– http://www.xilinx.com for listing of commercial distributors
– Europractice, Chip Implementation Center (Taiwan ROC), IDEC (S. Korea), Canadian MicroElectronics Corp.
• Technical Support
– Answers Database http://www.xilinx.com/support/searchtd.htm
– For Instructors: [email protected], USA 800-255-7778

Xilinx Donation Policy

“If a new or expanded course with lab or a research project is being added and funding is not adequate to purchase the required products at the University Program discounts, Xilinx encourages any university or college to submit a donation request.”

To Purchase or To Request a Donation - What's Practical for you?

If you have sufficient budget to purchase Xilinx software, development boards, and/or chips, then we encourage you to do so. We offer significant discounts for Xilinx software and Xilinx development boards. However, we recognize that very often, schools simply do not have the funding even for the discounted products. In some cases, a school might have some funding, but not enough to obtain everything that is needed for the lab. We encourage you to make the choice that you feel is right for your situation. Most importantly, if money is any barrier to your immediate use of Xilinx products, you should request a donation for what you need.

Why Xilinx?

• Xilinx is world’s leading Programmable Logic innovator with 55% commercial FPGA market share
• Xilinx is nearly twice as popular in the academic market as its nearest competitor
• Best PLD Software: Foundation; Alliance; & Synopsys partnership
• Best PLD hardware architectures
– Xilinx FPGAs and CPLDs are all reprogrammable in-system
– Tri-state and dual-port RAMs in FPGAs are best for computer structures, DSP, research, etc.
– Only vendor with dynamically & partially reconfigurable RPUs
• Prentice Hall / Xilinx Student Edition includes best tools on the market with fully integrated hardware environment
• If you don’t have the budget, request a donation.

[Figure: complexity / functionality / course level vs. speed, positioning the FPGA (3K, 5K, 4K) and CPLD (9500) families.]

Exciting Research areas:

• Reconfigurable Computing: Virtex, XC6200
• Digital Signal Processing: XC4000X
• Networking, PCI, Computer Architectures, Neural Nets, etc.

Computer Lab Requirements

• Win ’95, Win NT, HP, Sun, Solaris: use Xilinx software version 1.3, available now
– Foundation Series Express recommended for all PC users
– Other design entry tools OK too, especially on workstation

v1.4          RAM      Hard Drive   Processor
Minimum       32MB     200MB        486DX2
Much Better   32+MB    500MB        Pentium 120+

Typical Lab Setup

• Primary and Additional licenses *
• Cables vs. PROM Programmers
• Foundation Series Express package recommended for lab
– Software updates
– Full range of devices supported
– Additional license scheme

Qty   Part Number       Description
1     US-FND-EXP-PC     Primary Foundation package
9     UA-FND-EXP-PC     Additional FND licenses
10    XS40-010XL        XC4010XL FPGA board & cable
2     XS95-108          XS9500 CPLD board & cable

* Workstation users: use Ux-ALI-STD-WS, and substitute these for the 10 XS40-010XL’s:
10    UW-FPGABOARD      3K/4K Development boards
10    UW-XCHCBL-PC      XChecker cables

CPLD or FPGA?

CPLD

• Non-volatile

• JTAG Testing

• Wide fan-in

• Fast counters, state machines

• Combinational Logic

• Small student projects, lower level courses

FPGA

• More common in schools

• Great for first year to graduate work

• Excellent for computer architecture, DSP, registered designs

• ASIC like design flow

• SRAM reconfiguration

• PROM required for non-volatile operation

Since the software is integrated, you can teach with both !

Hardware Boards for PCs

XSTEND
- Plug-in extension for XS40 & XS95’s
- Purchase from XESS Corp.

XS40 & XS95 Boards
- Purchase from XESS or donation from Xilinx
- Access to I/O pins for easy prototyping

Hardware Boards (2)

• H.O.T. II PCI Board (1)
• UW-FPGABOARD (2)
- Access to I/O pins for easy prototyping
- Battery not included!

(1) Purchase HOT II from VCC
(2) Most popular board for the workstation. Purchase or donation from Xilinx

Summary

• Enhance Your Lab Curriculum with Xilinx
• Students get better job offers
• Great products for your lab
– Leading, industry standard software
– IEEE Standard VHDL & Verilog
– Innovative hardware solutions
• Ideal from intro to graduate courses
• Great publications from Prentice Hall
• Areas of strength for research
– DSP, Reconfigurable Logic

Xilinx = Long-term Programmable Logic Solutions Leader

Appendix A: Xilinx Configurable Logic Blocks

XC4000 CLB

[Figure: XC4000 CLB. Function generators F (inputs F1-F4) and G (inputs G1-G4) feed function generator H; control inputs C1-C4 supply H1, DIN, S/R, and EC. Multiplexers select among F', G', H', and DIN for two flip-flops (D, Q, SD, RD, EC, with S/R control) clocked by K. Outputs: X, Y, XQ, YQ.]

XC4000X I/O Block Diagram


XC9500 CPLDs

[Figure: XC9500 architecture. I/O blocks and Function Blocks 1 to 4 with a JTAG port and JTAG controller, 3 global clocks, 1 global set/reset, and 2 or 4 global tri-states.]

[Figure (2nd view): Function Block logic. 36 inputs from FastCONNECT drive an AND array and product-term allocator feeding Macrocells 1 to 18 (D/T flip-flops); outputs go to FastCONNECT and to the I/O pins, with 3 global clocks and 2 or 4 global tri-state controls.]

Appendix B: FPGA Family Comparisons

Xilinx Spartan Series

5 Volt -> XCS05 XCS10 XCS20 XCS30 XCS40

3.3 Volt -> XCS05XL XCS10XL XCS20XL XCS30XL XCS40XL

System Gates 2K-5K 3K-10K 7K-20K 10K-30K 13K-40K

Logic Cells 238 466 950 1368 1862

Max Logic Gates 3,000 5,000 10,000 13,000 20,000

Flip-Flops 360 616 1120 1536 2016

Max RAM bits 3,200 6,272 12,800 18,432 25,088

Max I/O 80 112 160 192 224

Performance 80MHz 80MHz 80MHz 80MHz 80MHz

XC4000E 5V FPGA Family

4003E 4005E 4006E 4008E 4010E 4013E 4020E 4025E

Logic Cells 238 466 608 770 950 1,368 1,862 2,432

Max Logic Gates 3K 5K 6K 8K 10K 13K 20K 25K

Typ Gate Range* 2-5K 3-9K 4-12K 6-15K 7-20K 10-30K 13-40K 15-45K
*(Logic + Select-RAM)

Max I/O 80 112 128 144 160 192 224 256

Packages: PC84 PC84 PC84 PC84 PC84TQ100PQ100 PQ100

TQ144 TQ144 PQ160 PQ160 PQ160 PQ160

PQ208 PQ208 PQ208 PQ208 PQ208 HQ208HQ240 HQ240 HQ240

HQ304PG120 PG156 PG156 PG191 PG191 PG223 PG223 PG223

BG225 BG225 PG299


100% Footprint Compatible

Spartan/XC4000E/XC5200 Density

Spartan/XL XC4000E XC5200

Logic Cells 238 - 1,862 238 - 2,432 256 - 1,936

Typ Gate Range (Logic + SelectRAM) 2,000 - 40,000 2,000 - 45,000 2,000 - 23,000

I/O 77 - 205 80 - 256 84 - 244

Number of Devices 5 8 5

Power Supply 5V / 3.3V 5V 5V

I/O Interface 5V / 3.3V 5V 5V

XC4000X Series Density

XC4000EX XC4000XL XC4000XV

Logic Cells 2,432 - 3,078 152 - 7,448 10,982 - 20,102

Typ Gate Range (Logic + SelectRAM) 18,000 - 65,000 1,000 - 180,000 80,000 - 500,000

I/O 256 - 288 64 - 448 448

Number of Devices 2 11 4

Power Supply 5V 3.3V 3.3V + 2.5V

I/O Interface 5V 5V / 3.3V 5V / 3.3V / 2.5V

Common Features

Spartan XC4000 XC5200

Function Generators/CLB 3 3 4

Flip-flops/CLB 2 2 4

Global Nets 8 8 4

Global Three-State Control Yes Yes Yes

Carry Logic Yes Yes Yes

Internal Three-State Buffers Yes Yes Yes

Boundary Scan Logic Yes Yes Yes

Output Drive (Sink) 12 mA 12 mA 8 mA

Differentiating Features

             Spartan   XC4000         XC5200
LCs/CLB      2.375     2.375          4
RAM          Sync.     Sync./Async.   None
PCI          Yes       Yes            No
Decode       No        Yes            No
Wired-AND    No        Yes            No
I/O FFs      Yes       Yes            No
Config       Ser       Par/Ser        Par/Ser
Packages     6         16             18

• Complete pinout compatibility within Spartan Series
• Not directly pinout-compatible with XC4000/XC5200
- Spartan has only one MODE pin
- Mode pin cannot be used as I/O

Xilinx XC4000-based Architecture Comparison

Spartan/XL XC4000X XC4000E

Extended Routing No Yes No

Fast Capture Latch No Yes No

Global Early Buffers No Yes No

Output Mux No Yes No

CLB Latches No Yes No

Asynchronous RAM No Yes Yes

Edge Decoders No Yes Yes

Wired-AND Function No Yes Yes

Density Comparison

Xilinx XC4000 Series                              Competing Product: Altera FLEX 10K
Device      Max I/O   Max RAM Bits   Logic Cells   Max RAM Bits   Max I/O   Device
XC4085XL    448       100K           7,448
XC4062XL    384       74K            5,472
                                     4,992         25K            406       EPF10K100
XC4052XL    352       62K            4,598
XC4044XL    320       51K            3,800
                                     3,774         18K            358       EPF10K70
XC4036EX    288       42K            3,078
                                     2,880         20K            310       EPF10K50
XC4028EX    256       33K            2,432
XC4025E     256       33K            2,432
                                     2,304         16K            278       EPF10K40
XC4020E     224       25K            1,862
                                     1,728         12K            246       EPF10K30
XC4013E     192       18K            1,368
                                     1,152         12K            198       EPF10K20

Xilinx University Workshops

Appendix C: Design Tool Flows

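Xilinx-Express Design Flow

[Figure: Foundation design entry tools (HDL editor, state diagram editor, schematic capture, gate-level simulator) and the DSP COREGen & LogiBLOX module generator supply VHDL/Verilog source (.V, .VHD; module files .VEI, .VHI, XNF, .NGO) and schematic netlists (EDIF, XNF). Express takes the VHDL/Verilog source and timing requirements and writes an EDIF/XNF netlist (.XNF) plus reports. The Xilinx implementation tools read the netlists and a .UCF file and produce BIT/JEDEC programming files, reports, and SDF, VHDL, Verilog, and EDIF output for HDL simulation.]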

[Figure: Xilinx Design Manager Flow 1.4, FPGA Implementation]

[Figure: Xilinx Design Manager Flow 1.4, CPLD Implementation]

Design Flow (M1)

[Figure: design flow stages and tools.
Design Entry: Concept (mixed-level schematic/HDL)
Functional Simulation / Verification: Verilog-XL, Leapfrog (simulation libraries)
Design Synthesis & Retargetability: Synergy HDL/VHDL Synthesis (synthesis libraries)
Design Optimization / Partitioning for PLDs: PLD Designer
Design Optimization for FPGAs: FPGA Designer
Netlist Creation: VerilogLink/VHDLLink (netlist information: *EDIF, XNF)
Place & Route: Implementation Tools (device programming files)
Post-Implementation Netlist & SDF: PLD & FPGA Designer (Verilog, VHDL, **SDF, *EDIF)
Timing Simulation: Verilog-XL, Leapfrog (simulation libraries, timing backannotation)
Also shown: OpenSIM BackPlane, Schematic Redraw.]

*Standard Interface Netlist Format
**Standard Delay Format


Design Flow (M1)

[Figure: Viewlogic design flow stages and tools.
Schematic Entry / View Schematic: Viewlogic ViewDraw
VHDL Entry & Compile, VHDL Synthesis: Viewlogic ViewSyn
Behavioral Simulation: Viewlogic Speedwave
Structural Simulation / Functional Simulation: Viewlogic ViewSim
Optional inputs: ABEL HDL, LogiBlox, LogiCores (VHDL, *XNF)
Netlist (XNF or *EDIF): Netlist Launcher, NGDBUILD
Place & Route: PAR (Place & Route)
Timing Simulation: Viewlogic ViewSim (timing-annotated EDIF netlist, **SDF)
Waveform Analysis: Viewlogic ViewTrace]

Design Flow (M1)

[Figure: Mentor design flow, launched from Mentor Design Manager.
HDL design flow: VHDL / Verilog HDL entry (Notepad / QuickHDL); functional simulation (QuickHDL); synthesis & optimization (Autologic II); place & route; timing simulation (QuickHDL) from VHDL or Verilog plus *SDF.
Schematic design flow: design entry (Design Architect); simulation preparation (Design View Editor); simulation (QuickSim II); place & route; timing simulation (QuickSim II) from *EDIF with timing and *SDF.
Optional inputs to both flows: ABEL HDL, LogiBlox, LogiCores (*EDIF). Output: device programming files.]

Synopsys Design Compiler Design Flow (M1)

[Figure: HDL source file (VHDL or Verilog HDL) -> simulation (Synopsys VHDL System Simulator or 3rd-party VHDL/Verilog simulator) -> synthesis (Synopsys FPGA Compiler or Design Compiler, with synthesis library and constraints file) -> Netlist Launcher / NGDBUILD -> PAR (Place & Route) -> static timing verification (static timing report) and post-layout verification by timing simulation (Synopsys VSS Simulator or 3rd-party VHDL/Verilog simulator; VHDL, Verilog, *SDF). Simulation library: Xilinx Unified Libraries VHDL/Verilog models. Optional inputs: LogiBlox, LogiCores.]


Design Flow (M1)

[Figure: OrCAD design flow. Schematic entry (OrCAD/ESP Design Environment) -> Netlist Launcher / NGDBUILD -> PAR (Place & Route) -> simulation (OrCAD Simulate, XSimMake, *SDF). Optional inputs: ABEL HDL, LogiBlox, LogiCores, and XNF modules created by HDL synthesis tools (VHDL, *XNF).]

Synplicity Design Flow

[Figure: the HDL editor and the DSP COREGen & LogiBLOX module generator supply VHDL/Verilog source (Verilog & VHDL instantiation files .VEI, .VHI; module netlists .NGO, XNF). The compile & map engine reads the source with timing & design constraints (.SDC) and writes XNF or EDIF netlists, place & route constraints (.NCF), and structured Verilog and VHDL netlists (.VM, .VHM). HDL Analyst offers an RTL view and a technology view with crossprobing. The Xilinx implementation tools (user constraints file; -route-improve) produce BIT/JEDEC files, reports, and EDIF, SDF, VHDL, and Verilog output for 3rd-party simulation.
Functional simulation flow: HDL test bench, command file, or test vectors, with Unisim libraries.
Timing simulation flow: simprim libraries (VITAL, Verilog, gate).]

.NGO = Xilinx binary netlist

Xilinx University Workshop

Appendix D: XChecker Cable and Configuration

*Note: Although differences are very minimal, this information has not been updated to reflect M1 information.

Use XChecker Cable to Simplify Verification

• Downloading allows quick verification of design in circuit
– Bitstream downloaded via computer’s serial port directly into FPGA
– No PROM programming required
– Design changes and verifications made quickly
• Readback sends configuration data and flip-flop values back out of chip
– Verifies correct configuration
– Allows in-circuit “probing” of all signals
– Can occur while the FPGA is running
– Uses no CLBs or routing resources

Enabling Configuration Readback

• Readback Trigger input starts serial readback
• XC3000 controlled via Bitstream Generator
– Default is enabled
– Data and trigger connected to Mode pins
• XC4/5000 controlled via schematic and Bitstream Generator
– Include Readback symbol in schematic
– Connect TRIG and DATA to I/O pins
– Can use MD0 and MD1

[Figure: READBACK symbol with CLK, TRIG, DATA, and RIP pins. TRIG is driven from an IPAD/IBUF on MD0 (XChecker RT); DATA drives an OBUF/OPAD on MD1 (XChecker RD).]

Available Readback Data

• Data includes all storage elements in device
– XC4000/XC5000 readback data includes all outputs of CLBs and IOBs
• XC4000/XC5000 data is captured when readback is triggered
• XC3000 data is captured as readback progresses
– May want to stop system clock for logic verification
– Requires XChecker control of system clock

Control Panel Defines Debug Session (XACT™step v6)

• Opens automatically for Debug
• Allows direct control of:
– System clock source definition and application
– Readback trigger source definition and application
– Number of readbacks
– Display options

How to Use Programmable Logic to Build Fast and Efficient DSP Functions

XUP Workshop Appendix E

Originally created by: Greg Goslin, Xilinx Corporate Applications

Constraint Driven Design Methodology

• Constraints
– System Requirements
– Hardware Limitations
• Data Rate
– Inputs
– Outputs
– Multi-Channel I/O
• Quality
– Number of Bits/Taps
– Number of Operations
– Error Tolerance
• Processor Power
• Clock Rate

[Figure: Constraint driven design methodologies. Clock rate, data rate, quality, and processor power bound the options, trading performance against efficiency.]

Building Fast and Efficient Filters in FPGAs

• Efficient Filter Algorithms for FPGAs
– Distributed Arithmetic: Bit-Serial and n-Bit Parallel
• Using Distributed Arithmetic for Filter Designs
– Serial FIR Filter Example
– Two-Bit Parallel FIR Example
– Full Parallel FIR Example

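FIR FILTER EXAMPLE

[Figure: K-tap FIR filter. Sample data, N bits wide, passes through a delay line K taps long; each tap value X0, X1, X2, ... is multiplied by its coefficient C0, C1, C2, ... (K coefficients), and the K products are summed to form the output data.]

• K multiplies, K sums
• Clock = multiply time
• Sample rate = clock rate
• Implementation ???

Sum of Products Equation

[Equation: output = sum over the K taps of X_k * C_k]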

2’s Complement Math

• The 2’s complement of a number: invert (1’s complement), then add 1.
  11111010 (-6): invert gives 00000101, add 1 gives 00000110 (+6)
• Leading 1’s and 0’s are only place holders: sign extending a 2’s complement number doesn’t change its value.
  XMSB ... X2 X1 X0 equals XMSB XMSB XMSB ... X2 X1 X0
  The following 2’s complement pairs are the same: FFFF = FF, 0001 = 01, 11111111101 = 1101
• Adding 2’s complement numbers: sign extend. The MSB (sign bit) must be extended to allow for word growth.

     111010    -6         0000110    +6
   + 001101   +13       + 0001101   +13
   --------             ---------
    1000111    +7         0010011   +19
  (Note: Ignore Overflow)
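The three rules above (negate by invert-and-add-one, sign extension preserves value, ignore the carry out when adding) can be checked with plain integer masks. The helper names below are illustrative and not taken from the slides.

```python
def twos_complement(value: int, bits: int) -> int:
    """Negate a `bits`-wide pattern: invert every bit, then add 1."""
    mask = (1 << bits) - 1
    return ((value ^ mask) + 1) & mask


def to_signed(value: int, bits: int) -> int:
    """Interpret a `bits`-wide pattern as a 2's complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Copy the sign bit into the new upper bits."""
    if value & (1 << (from_bits - 1)):
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


print(format(twos_complement(0b11111010, 8), "08b"))      # 00000110
print(to_signed(0b1101, 4), to_signed(0b11111111101, 11))  # -3 -3
# 6-bit add of -6 and +13, discarding the carry out of the top bit
print(to_signed((0b111010 + 0b001101) & 0b111111, 6))      # 7
```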

8-Bit X 8-Bit Signed Multiply

S = B x A, with B = B7 B6 B5 B4 B3 B2 B1 B0 and A = A7 A6 A5 A4 A3 A2 A1 A0

[Figure: eight partial products A0(B7...B0) through A7(B7...B0), each shifted one bit further left than the one before and sign extended, are added to give S15 S14 ... S1 S0.]

8-Bit X 8-Bit Signed Multiply

S = B x A, summing the sign-extended (SE) partial products:

  SE(B7 B6 B5 B4 B3 B2 B1 B0) * A7 * 2^7
+ SE(B7 B6 B5 B4 B3 B2 B1 B0) * A6 * 2^6
+ SE(B7 B6 B5 B4 B3 B2 B1 B0) * A5 * 2^5
+ SE(B7 B6 B5 B4 B3 B2 B1 B0) * A4 * 2^4
+ SE(B7 B6 B5 B4 B3 B2 B1 B0) * A3 * 2^3
+ SE(B7 B6 B5 B4 B3 B2 B1 B0) * A2 * 2^2
+ (rows for A1 and A0 follow the same pattern)
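A shift-and-add signed multiply along these lines can be sketched as follows. One detail is an assumption drawn from standard 2's complement arithmetic rather than from this slide: the partial product for the sign bit of A carries weight -2^(bits-1), so it is subtracted, which matches the -A3 term shown for the 4-bit tree multiplier. The function name is hypothetical.

```python
def signed_multiply(a: int, b: int, bits: int = 8) -> int:
    """Multiply two signed `bits`-wide integers by summing sign-extended
    partial products of b, one per bit of a."""
    out_mask = (1 << (2 * bits)) - 1
    a_pattern = a & ((1 << bits) - 1)   # 2's complement bit pattern of a
    b_extended = b & out_mask           # b sign extended to 2*bits
    total = 0
    for i in range(bits):
        if (a_pattern >> i) & 1:
            partial = (b_extended << i) & out_mask
            if i == bits - 1:
                total -= partial        # sign bit of a has negative weight
            else:
                total += partial
    total &= out_mask
    # Convert the 2*bits-wide pattern back to a signed Python int.
    if total & (1 << (2 * bits - 1)):
        total -= 1 << (2 * bits)
    return total


print(signed_multiply(-6, 13))   # -78
print(signed_multiply(-6, -13))  # 78
```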


4-Bit Signed Tree Multiplier

[Figure: two first-stage adders and one second-stage adder with registers between stages. Operand B (3:0) is gated by A0, A1, A2, and A3 to form B * A0, B * A1, B * A2, and B * A3, each sign extended. Labels in the figure: Sign Extend, LSB, REG, A/2, A/4, -B, 1 CARRY IN, { 1/2 B*A0 - B*A1 }, { 1/2 B*A2 - B*A3 }.]

First stage:
  +A1 * { B3 B3 B2 B1 B0 }
  +A0 * { B3 B3 B3 B2 B1 B0 }
  = { P5 P4 P3 P2 P1 P0 }

  -A3 * { B3 B3 B2 B1 B0 }
  +A2 * { B3 B3 B3 B2 B1 B0 }
  = { P7 P6 P5 P4 P3 P2 }

Second stage:
  -A3 * { B3 B3 B2 B1 B0 }
  +A2 * { B3 B3 B3 B2 B1 B0 }
  +A1 * { B3 B3 B3 B3 B2 B1 B0 }
  +A0 * { B3 B3 B3 B3 B3 B2 B1 B0 }
  = { P7 P6 P5 P4 P3 P2 P1 P0 }

Resources:
  16 Gated Bits and Reg = 8 CLBs
  5-bit Signed Adder & Reg = 3 CLBs (two instances)
  6-bit Signed Adder & Reg = 4 CLBs
  Total = 18 CLBs

D.A. ONE TAP FIR FILTER = D0 C0

REDUCES TO MULTIPLYING A VARIABLE TIMES A CONSTANT

[Figure: sample data D0, N bits wide (bits X0, X1, X2, X3 ... Xn), shifts serially onto lookup table address A0. The table data goes to input B of a scaling accumulator (add/subtract, with the register output scaled by 2^-1 and fed back to input A), which produces the filtered data out.]

Lookup table:
  A[0] = 0: ...000000
  A[0] = 1: C0

Accumulation, one table output (B7...B0) per sample bit, each shifted one place left of the previous:
  X0(B7...B0) + X1(B7...B0) + X2(B7...B0) + X3(B7...B0) + X4(B7...B0) + X5(B7...B0) + X6(B7...B0) + X7(B7...B0)
  The running sum grows one bit per step: S9...S0, S10...S0, S11...S0, S12...S0, S13...S0, S14...S0

D.A. TWO TAP FIR FILTER = D0 C0 + D1 C1

[Figure: sample data D0 and D1, each N bits wide (bits X0, X1, X2 ... XN), shift serially onto lookup table addresses A0 and A1. The table data goes to input B of a scaling accumulator (add/subtract, with the register output scaled by 2^-1 and fed back to input A), which produces the filtered data out.]

Lookup table, address A[1:0]:
  00: ...000000
  01: C0
  10: C1
  11: C0 + C1

Accumulation, one table output (B7...B0) per bit position, addressed by that bit of both samples and shifted one place left of the previous:
  (X0,0 , X1,0)(B7...B0) + (X0,1 , X1,1)(B7...B0) + (X0,2 , X1,2)(B7...B0) + ... + (X0,7 , X1,7)(B7...B0)
  The running sum grows one bit per step, from S11...S0 up to S15...S0

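D.A. THREE TAP FIR FILTER

[Figure: sample data D0, D1, and D2, each N bits wide (bits X0, X1, X2 ... XN), shift serially onto lookup table addresses A0, A1, and A2. The table data goes to input B of a scaling accumulator (add/subtract, with the register output scaled by 2^-1 and fed back to input A), which produces the filtered data out.]

Lookup table, address A[2:0]:
  000: ...000000
  001: C0
  010: C1
  011: C1 + C0
  100: C2
  101: C2 + C0
  110: C2 + C1
  111: C2 + C1 + C0

Accumulation, one table output (B7...B0) per bit position:
  (X0,0 , X1,0 , X2,0)(B7...B0) + (X0,1 , X1,1 , X2,1)(B7...B0) + (X0,2 , X1,2 , X2,2)(B7...B0) + ... + (X0,N , X1,N , X2,N)(B7...B0)
  = S(N+M) ... S13 S12 S11 S10 S9 S8 S7 S6 S5 S4 S3 S2 S1 S0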
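The lookup tables for the one, two, and three tap filters follow a single rule: the entry at each address is the sum of the coefficients whose address bit is set. A minimal sketch of that construction is below, with A0 selecting C0, A1 selecting C1, and so on; `build_da_lut` is an illustrative name and the coefficient values are arbitrary.

```python
def build_da_lut(coeffs: list[int]) -> list[int]:
    """Distributed-arithmetic lookup table: entry `addr` holds the sum of
    coeffs[k] for every address bit k that is set in `addr`."""
    taps = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << taps)
    ]


C0, C1, C2 = 3, 5, 7  # arbitrary example coefficients
for addr, entry in enumerate(build_da_lut([C0, C1, C2])):
    print(format(addr, "03b"), entry)
```

The table doubles in size with every added tap, which is why the later 10-tap example halves the address width before the lookup.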

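The Development of a Distributed Arithmetic FIR Filter

10-Bit 10-Tap - XC4000 Family Example

10 BIT 10 TAP SYMMETRICAL FIR FILTER

[Figure: sample data enters a 100-bit shift register made of ten 10-bit shift registers (taps D0 to D9). Five serial adders add the tap pairs D0+D9, D1+D8, D2+D7, D3+D6, and D4+D5 to form lookup table addresses A0 to A4. The lookup table (32 x 10 memory, 320 bits) drives input B(9:0) of an 11-bit scaling accumulator through an XOR stage that complements on the last bit and adds 1 through the carry in (C_I). B is sign extended to B10, the register loads on the first bit (LD), and SUM(10:1) is fed back to input A with sign extension (A10, A9, A8 from S10, S10, S9). SUM(0) shifts into an optional 10-bit shift register that holds the least significant byte for double precision; the register holds the most significant byte as the filtered data out.]

• Look Up Table is only 32 words by 10 bits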


SERIAL TIME SKEW BUFFER

[Figure: sample data, N bits wide, is loaded in parallel into the first of K N-bit shift registers (N x K bits in total); the registers shift serially and provide outputs D_0, D_1, ... D_k-1.]

SAMPLE DATA WORD SIZE = N BITS
NUMBER OF TAPS = K
# OUTPUTS = # TAPS

• One N Bit Shift Register Per Tap
• Use 4000 RAM to build Shift Register
• One 16 Bit Shift Register Per 1/2 CLB

SHIFT REGISTER IMPLEMENTED IN RAM

[Figure: each shift register built from a 16 x 1 RAM with data in, address A3-A0, write clock, and data out.]

10 BIT 10 TAP = 50 CLBs with flip-flop shift registers
10 BIT 10 TAP = 10 CLBs with RAM-based shift registers
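One common way to make a RAM behave like a shift register is to step an address counter through the RAM, reading the oldest bit and overwriting it with the new one on every clock. The sketch below models that idea in software. It is an assumption about the technique, not a description of the exact circuit on the slide, and the class name is hypothetical.

```python
class RamShiftRegister:
    """Fixed-length bit delay line built on a small RAM and an address counter."""

    def __init__(self, length: int = 16):
        self.ram = [0] * length   # models one 16 x 1 RAM (or fewer words used)
        self.addr = 0             # models the shared address counter

    def shift(self, bit_in: int) -> int:
        """One clock: read the oldest bit, store the new bit in its place."""
        bit_out = self.ram[self.addr]
        self.ram[self.addr] = bit_in
        self.addr = (self.addr + 1) % len(self.ram)
        return bit_out


# A 10-bit register delays the input stream by exactly 10 clocks.
sr = RamShiftRegister(10)
stream = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
delayed = [sr.shift(bit) for bit in stream]
print(delayed)
```

Because every tap can share the same address counter, only the RAM itself is repeated per tap, which is consistent with the lower CLB count quoted for the RAM-based version.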

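Serial Adder

[Figure: five serial adders add the tap pairs D0+D9, D1+D8, D2+D7, D3+D6, and D4+D5. Each adder computes A + B + Carry one bit per clock: the sum bit is registered in a flip-flop as SUM, and the carry is held in a second flip-flop that supplies Carry In on the next clock. The carry flip-flop has a CLR input; the figure also shows CNT=10.]

• 1 CLB Per 2 Taps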
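A bit-serial adder processes one bit pair per clock, least significant bit first, and keeps the carry in a flip-flop between clocks. The following sketch models that behaviour; the helper names are illustrative, and clearing the carry at the start of each word is assumed to correspond to the CLR input in the figure.

```python
def to_bits(value: int, width: int) -> list[int]:
    """LSB-first bit list of a `width`-bit pattern."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: list[int]) -> int:
    """Inverse of to_bits: rebuild the integer pattern."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_add(a_bits: list[int], b_bits: list[int]) -> tuple[list[int], int]:
    """Add two LSB-first bit streams one bit per clock."""
    carry = 0                       # carry flip-flop, cleared before each word
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry


total, carry_out = serial_add(to_bits(6, 7), to_bits(13, 7))
print(from_bits(total), carry_out)  # 19 0
```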

DISTRIBUTED ARITHMETIC LOOK-UP TABLE

[Figure: look-up table with address inputs A0 to A4 and a 10-bit DATA output: 32 x 10 memory, 320 bits.]

• HOLDS ALL PARTIAL PRODUCTS
• LUT IS AS WIDE AS COEFF
• CAN USE MEMGEN TO BUILD LUT

1’s COMPLEMENTER

• INVERTS DATA ON LAST CYCLE
• 2 BITS PER CLB

[Figure: data bits D0, D1, ... each pass through a gate controlled by INVERT and into a D flip-flop.]

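SCALING ACCUMULATOR

• ADDS DATA TO (1/2) * (SUM OUT)
• 2 BITS PER CLB
• NEED N+1 BITS
• DOUBLE PRECISION WITH SR
• CAN USE XBLOX FOR RPM
• FORCE CARRY-IN ON LAST BIT

[Figure: 11-bit scaling accumulator. Input B(9:0) is the 10-bit DATA, sign extended to B10. Input A takes the register output shifted right one place with sign extension (A10, A9, A8 from S10, S10, S9 down to A0 from S1). The register loads on the first bit (LD); C_I is the forced carry in. SUM(0) shifts out through DIN into a 10-bit shift register; the register holds the most significant byte as SUM OUT.]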
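Putting the lookup table, the last-cycle complement, and the scaling accumulator together gives the bit-serial distributed-arithmetic dot product. The sketch below models it with exact fractions so the halving step loses nothing. It assumes signed 2's complement samples processed LSB first, and it subtracts the table output on the sign-bit cycle, which is equivalent to the complement plus forced carry-in described above. The function name is hypothetical, and no hardware word-length limits are modelled.

```python
from fractions import Fraction


def da_fir_output(samples: list[int], coeffs: list[int], bits: int) -> int:
    """Compute sum(coeffs[k] * samples[k]) one sample bit per clock."""
    taps = len(coeffs)
    lut = [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << taps)
    ]
    acc = Fraction(0)
    for j in range(bits):                      # LSB first
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> j) & 1) << k        # bit j of every tap forms the address
        if j == bits - 1:
            acc = acc / 2 - lut[addr]          # sign bit: subtract on the last cycle
        else:
            acc = acc / 2 + lut[addr]          # add data to (1/2) * (sum out)
    return int(acc * (1 << (bits - 1)))        # undo the accumulated scaling


print(da_fir_output([-6, 13, 7], [3, -5, 2], bits=8))  # -69
```

The halving step is what lets the accumulator stay N+1 bits wide in hardware; the sketch keeps full precision only to make the check against a direct multiply exact.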

10 BIT 10 TAP SYMMETRICAL FIR FILTER

[Figure: block diagram of the filter with CLB counts per block.
Serial time skew buffer (RAM based shift register), 10-bit sample data in: 10 CLBs
Five 2-bit adders (2 to 1 reduction due to symmetry): 5 CLBs
RAM or ROM look up table, 32 x 10 (FIR filter coefficients and multiply look up): 10 CLBs
1’s complement (XOR, complement on last cycle): 5 CLBs
Scaling accumulator (adder and register, filter out): 7 CLBs
Timing and control (50 MHz CLK, counter outputs A3-A0, CNTEQ9, CNTEQ10): 7 CLBs]

10 BIT 10 TAP FIR FILTER

• TOTAL OF 44 CLBS: FITS IN A 4002A (WITH 20 CLBS EXTRA FOR SYSTEM DESIGN)

• ABOUT 1300 EQUIVALENT GATES - LITTLE INTERCONNECT BETWEEN BLOCKS

NUMBER OF 10 BIT 10 TAP SYMMETRICAL FIR FILTERS PER XC4000 DEVICE

XC4000 PART           4002A  4003A  4004A  4005A  4006  4008  4010  4013  4025
NUMBER OF INSTANCES       1      2      3      5     6     8    10    15    23

9 Most Significant Bits

FIR10B10T

DATA IN DATA OUT

WORD_CLKCLK_OUT

DIN_ DOUT_

Relatively Placed Macro

BIT_CLK 10X_CLK

PERFORMANCE

• FIR10B10T MACRO CAN BE CLOCKED AT 66 MHZ @XC4000E-3

• 10 BIT WORD REQUIRES 11 CLOCKS

• 10 BIT SAMPLE WORD RATE IS 6 MHZ

• 8 BIT WORD REQUIRES 9 CLOCKS, ETC

• 8 BIT SAMPLE WORD RATE IS 8 MHZ

WORD SIZE (BITS)      6      8      10     12     14     16
SAMPLE RATE (MSPS)    11.1   7.4    6.1    5.1    4.4    3.9
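The bullets above give the relation between bit clock and word rate for the serial filter: an N-bit word needs N + 1 clocks. A small sketch of that arithmetic, with a hypothetical function name, is shown here. It reproduces the 6 MHz figure for 10-bit words at 66 MHz. The tabulated values differ slightly, so they may assume a marginally different clock or rounding.

```python
def serial_da_sample_rate_mhz(clock_mhz, word_bits):
    """Word rate of a bit-serial filter that spends word_bits + 1 clocks per sample."""
    return clock_mhz / (word_bits + 1)


for bits in (8, 10, 12, 14, 16):
    rate = serial_da_sample_rate_mhz(66, bits)
    print(f"{bits:2d}-bit words: {rate:.2f} MHz")
```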

FIR Filter Macro

Double-Rate DA FIR Filters

• Process 2 Bits per Clock

• # of Clocks = (N/2) + 1

• Twice as fast

Two Bit Parallel Distributed Arithmetic FIR Filter

SAMPLE DATA

N BITS WIDE

A3A2

A

B

ScalingAccum.

REGISTER

FILTEREDDATA OUT

2 -2

+ -

LOOKUP

TABLE

ADRS

DATA

A1

D1

X0

X2

X1

XN

X0

X2

X1

XN

D0N

A0

...000000

C0


2C0

3C0

0000

0001

0010

0011

0100

0101

0110

0111

A[3210]

C1

C2 + 2C1

C1 + 3C0

C2 + C1

1000

1001

1010

1011

2C1

2C1 + 2C0

2C1 + 3C0

2C1 + C0

Double Sample Rate D.A. FIR Filters

• Twice the I/O Data Sample Rate

• Two Taps Requires 4 Input LUT without Symmetry

• Four Taps Requires 4 Input LUT with Symmetrical FIR

• Time Skew Buffer uses Twice as many CLBs

• LUTs are the same, if equal bit weights are used to address the LUTs.

• 2-Bit PDA Performance, Clocked at 66 MHz @XC4000E-3

WORD SIZE (BITS)      6      8      10     12     14     16
SAMPLE RATE (MSPS)    22.2   14.8   12.2   10.2   8.8    7.8

(Double Precision)

Full Parallel D.A. FIR Filters

• One 8-Bit Tap Requires two 4 Input LUTs and an ADDER with an offset for bit weighting.

• Time Skew Buffer must use REGs

• Maximum I/O Data Sample Rate

• Full PDA Performance, in a XC4000E-3/-2, 50-70 MHz.

– Pipelining can further increase sample rate

• LUTs are the same, if equal bit weights are used to address the 4-Coefficients in the LUT.

WORD SIZE (BITS)      6      8      10     12     14     16
SAMPLE RATE (MSPS)    70     70     70     70     66     66

(Double Precision)

FPGA-Based DSP Coprocessor

Design Implementation

• Performance

– Programmable DSP (DSP56300)
• 24 clock cycles
• 360 nsec @ 66 MHz

– FPGA-Based Coprocessor
• 9 clock cycles
• 135 nsec @ 66 MHz

• Results:

– 37.5% of original processing time

– 2.67X Increase in throughput

– System Requirements:
• Before: 4-DSPs, 12-RAMs
• After: 2-DSPs, 6-RAMs, 1-XC4013E
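The quoted results follow directly from the cycle counts. The sketch below reworks the arithmetic, assuming the "66 MHz" clock corresponds to a 15 ns period (an assumption, chosen because it reproduces the 360 ns and 135 ns figures exactly).

```python
CLOCK_PERIOD_NS = 15          # assumption: "66 MHz" taken as a 15 ns clock period


def processing_time_ns(cycles, period_ns=CLOCK_PERIOD_NS):
    """Time to complete the operation for a given number of clock cycles."""
    return cycles * period_ns


dsp_ns = processing_time_ns(24)       # programmable DSP alone
fpga_ns = processing_time_ns(9)       # FPGA-based coprocessor

print(dsp_ns, fpga_ns)                              # 360 135
print(f"{fpga_ns / dsp_ns:.1%} of original time")   # 37.5%
print(f"{dsp_ns / fpga_ns:.2f}X throughput")        # 2.67X
```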

+-

+

-

Old_1

INC

Old_2

-+

+

-

++

++

MUX

MUX

New_1

Diff_2

Diff_1

New_2

MSB

MSB

Prestate Buffer Bit

24-bit 24-bit24-bit

1 0

REG

REG

REG

REG

REG

REG

REG

REG

I/O BusI/O Bus

[Chart: Relative Performance, scale 0 to 3. Two 66 MHz DSPs, Six 15 ns RAMs (360 ns) versus 66 MHz DSP + FPGA, Three 15 ns RAMs (135 ns).]

2.67 times better performance with FPGA-assisted DSP

Number of TAPS

# CLBs

100

200

300

16 32 48 64 80

• • • • • Serial SequentialDistributed Arithmetic

• •

• •

•

SerialDistributedArithmetic8

MHz

8 Bit Word FIR Filter Structures

1000 to 50 KHz

16 MHz

Two-Bit ParallelDistributedArithmetic

55MHz

ParallelDistributedArithmetic

FIR Filter Implementation Options

           Serial*               Serial* Distributed    Parallel* Distributed
           Sequential            Arithmetic             Arithmetic

8 Taps     36 CLBs, 1.08 MHz     44 CLBs, 8.1 MHz       250 CLBs, 60 MHz

16 Taps    36 CLBs, 0.46 MHz     70 CLBs, 8.1 MHz       400 CLBs, 55 MHz

32 Taps    44 CLBs, 0.23 MHz     122 CLBs, 8.1 MHz

48 Taps    62 CLBs, 0.15 MHz     178 CLBs, 8.1 MHz

64 Taps    70 CLBs, 0.11 MHz     228 CLBs, 8.1 MHz

8 Bit Word Example

* Note: These designs are NOT Pipelined

Lower Sample Rate Applications:

Efficient CLB Counts

Large Number of TAPs

Moderate Sample Rates

Non Symmetrical FIR OK

Serial Sequential Architecture

32 Tap 8 Bit Example

CoefficientTable

REGISTER

ADD

2-1 Scale

32 x 8 LUT

32 - 8 Bit Coefficients

8 CLBsSDB Out

PSR

Parallel to SerialConverter4 CLBs

8

8

9

5 CLBs

24 CLBs Total

Clk50 Mhz

Serial Multiplier

Serial Sequential - FIR Filter

Select

08

SampleData

SAMPLEDATA

BUFFER

ACC

REG

SERIAL MULTIPLYCoefficient

Select

REG

FilteredData Out

5-BITCNTR

5

3 CLBs

64-TAP SerialSequential FIR Filter

ACC

REG


Select

SampleData

SAMPLEDATA

BUFFER

ACC

REG


Select

SAMPLEDATA

BUFFER

ADD

REGISTER

ACC

REG


Select

SampleData

SAMPLEDATA

BUFFER

REG

FilteredData Out

Number CLBs vs. Taps / Word Size

TAPS       8 Bit   10 Bit   12 Bit   14 Bit   16 Bit
8 Tap         36       43       50       57       64
16 Tap        36       43       50       57       64
32 Tap        44       53       62       71       80
48 Tap        62       77       92      107      122
64 Tap        70       85      100      115      130
80 Tap        97      115      133      151      169
96 Tap        97      115      133      151      169
128 Tap      112      137      162      187      212

• 4002 = 64 CLBs

• 4005 = 196 CLBs

• 4013 = 576 CLBs

• 4025 = 1024 CLBs


Maximum Sample Rate / Word Size

TAPS       8 Bit      10 Bit     16 Bit
8 Tap      781 kHz    625 kHz    390 kHz
16 Tap     390 kHz    312 kHz    195 kHz
32 Tap     195 kHz    156 kHz    97 kHz
48 Tap     130 kHz    104 kHz    65 kHz
64 Tap     97 kHz     78 kHz     48 kHz
80 Tap     78 kHz     62 kHz     39 kHz
96 Tap     65 kHz     52 kHz     32 kHz
128 Tap    48 kHz     39 kHz     24 kHz
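The tabulated rates are consistent with one clock per bit per tap at the 50 MHz clock shown for the serial sequential filter. That relation is inferred from the numbers, not stated on the slides, so the sketch below should be read as a plausibility check. The function name is hypothetical.

```python
def serial_sequential_rate_khz(clock_mhz, taps, word_bits):
    """Sample rate when every tap costs word_bits clocks and taps run in sequence."""
    return clock_mhz * 1000 / (taps * word_bits)


for taps in (8, 16, 32, 48, 64, 80, 96, 128):
    rates = [int(serial_sequential_rate_khz(50, taps, bits)) for bits in (8, 10, 16)]
    print(f"{taps:3d} Tap: {rates[0]} kHz, {rates[1]} kHz, {rates[2]} kHz")
```

Truncating each result to whole kilohertz gives the same figures as the table, for example 781 kHz for 8 taps of 8-bit data.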

• Serial Mult. Limitations

• Can Use Multiple 16 Tap


Building Blocks

• 8X Faster at 128 Taps

ACC

REG


Select

SampleData

SAMPLEDATA

BUFFER

REG

FilteredData Out

58 CLBs for Function plus about 10 CLBs for Control

Total = 68 CLBs

32 WORD X 12 BIT LOOK UP TABLE A.

M = (A + B + C + D + E)/5

LSB

LUT-A

ADRS

DATA11 Bit

LUT-A

ADRS

DATA11 Bit

D1

D0

C1

C0

E1

E0

8 BITS WIDE

C 8

X0

X2

X1

X8

4xCLK

8 BITS WIDE

D 8

X0

X2

X1

X8

4xCLK

8 BITS WIDE

E 8

X0

X2

X1

X8

4xCLK

SIGN EXTEND

MSB

2xA

B

REG A

B

REG

4x M(A,B,C,D,E)

14SIGN EXTEND

MSB

B 8

X0

X2

X1

X8

A 8

X0

X2

X1

X8

4xCLK

4xCLK

B0

B1

A0

A1

0 0 0 0 0

0 0 0 0 1

0 0 0 1 0

0 0 0 1 1

0 0 1 0 0

0 0 1 0 1

0 0 1 1 0

0 0 1 1 1

0 1 0 0 0

0 1 0 0 1

0 1 0 1 0

0 1 0 1 1

0 1 1 0 0

0 1 1 0 1

0 1 1 1 0

0 1 1 1 1

M(A,B,C,D,E)

001100110011

000110011001

1 0 0 0 0

1 0 0 0 1

1 0 0 1 0

1 0 0 1 1

1 0 1 0 0

1 0 1 0 1

1 0 1 1 0

1 0 1 1 1

1 1 0 0 0

1 1 0 0 1

1 1 0 1 0

1 1 0 1 1

1 1 1 0 0

1 1 1 0 1

1 1 1 1 0

1 1 1 1 1

M(A,B,C,D,E)

4-CLBs per 8-Bit Shift Reg: 4 x 5 ea = 20 CLBs

1-CLB per Bit, 12-Bit Partial Sums, MSB bit weight = 1: 12 x 2 ea = 24 CLBs

6-CLBs for Add of 12-Bit Partial Sums, 1-CLB for [ Carryout + LSB ]: 6 + 1 = 7 CLBs

7-CLBs for 14-Bit Add of 14-Bit Partial Product Sums, no Carryout and LSBs are dropped: 7 = 7 CLBs

000110011001

000110011001

000110011001

000110011001

001100110011

001100110011

001100110011

001100110011

001100110011

001100110011

001100110011

001100110011

001100110011

010011001100

010011001100

010011001100

010011001100

010011001100

010011001100

010011001100

010011001100

010011001100

010011001100

011001100110

011001100110

011001100110

011001100110

011001100110

100000000000

000000000000

12-Bits

14-Bits

13.5MHz Median Filter, 5-Point, 2-Bit PDA
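The 32-word table for M = (A + B + C + D + E)/5 can be regenerated from the equation. The sketch below assumes each 12-bit entry has an MSB of weight 1 followed by 11 fraction bits, and that values are truncated rather than rounded. Both assumptions are inferred from the listed bit patterns, and the names are illustrative.

```python
FRACTION_BITS = 11            # assumption: 12-bit words whose MSB has weight 1


def build_mean_lut(inputs=5, fraction_bits=FRACTION_BITS):
    """Entry for each address = (number of address bits set) / inputs,
    stored as a truncated fixed-point fraction."""
    scale = 1 << fraction_bits
    return [bin(addr).count("1") * scale // inputs for addr in range(1 << inputs)]


lut = build_mean_lut()
for addr in (0b00000, 0b00001, 0b00011, 0b00111, 0b01111, 0b11111):
    print(f"{addr:05b} -> {lut[addr]:012b}")
```

Because all five inputs share the same weight of 1/5, the entry depends only on how many address bits are set, which is why the same six bit patterns repeat throughout the table.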

Design the following Application:

• Equations:

– Y(R,G,B) = 0.299*R + 0.587*G + 0.114*B

– U(R,G,B) = -0.169*R - 0.331*G + 0.500*B

– V(R,G,B) = 0.500*R - 0.419*G - 0.081*B

• R, G, B Data is 8-Bits at 13.5 MHz. The circuit already has a 2x Clk (27 MHz).

• Draw a functional schematic diagram of the circuit.

How do you implement the three multipliers or MACs?

What is the estimated size of the final design?

What is the estimated speed of the final design?

How long would it take to turn over this design?

Video Coding Application with 4x Clock

8 WORD X 10 BITLOOK UP TABLE A.

000

001

010

011

100

101

110

111

f(RGB)

8 BITS WIDE

G8

X0

X2

X1

X8

8 BITS WIDE

R8

X0

X2

X1

X8

8 BITS WIDE

B8

LUT-A

ADRS

DATA10 Bit

LUT-A

ADRS

DATA10 Bit

G1

G0

R1

R0

B1

B0

X0

X2

X1

X8

Y = 0.299*R + 0.587*G + 0.114*B

U = -0.169*R - 0.331*G + 0.500*B

V = 0.500*R - 0.419*G - 0.081*B

4xCLK

4xCLK

4xCLK

PARALLEL LOAD 2-BIT SHIFT REG: 4 CLBs EA, = 12 CLBs

...000000

CG

CG + CB

CR

CR + CG

CR + CG + CB

CR + CB

CB

SIGN EXTEND

MSB

2xA

B

REG

A

B

REG

4x

Y(R,G,B)U(R,G,B)V(R,G,B)

12

LUTs are the same: 5 CLBs EA, = 10 CLBs

10 Bit ADDER + REG: 5.5 CLBs

12 Bit ADDER: 6 CLBs

12 BITS WIDE

The total design would use about 110 CLBs with control logic.
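One way to picture the 4x-clock datapath is a software model that consumes two bits of R, G and B per clock through two identical 8-word tables. The sketch below is a behavioural approximation only. The coefficient scaling of 2**9, the address bit order and all function names are assumptions made for the example, not details taken from the slides.

```python
SCALE_BITS = 9                # assumption: coefficients stored as round(c * 2**9)


def quantize(coeffs, scale_bits=SCALE_BITS):
    """Convert real coefficients to scaled integers."""
    return [round(c * (1 << scale_bits)) for c in coeffs]


def build_lut(cr, cg, cb):
    """8-word table addressed by one bit each of R, G, B (R is the high address bit)."""
    return [cr * ((a >> 2) & 1) + cg * ((a >> 1) & 1) + cb * (a & 1) for a in range(8)]


def lut_address(r, g, b, bit):
    """Form the 3-bit table address from the same bit position of R, G and B."""
    return (((r >> bit) & 1) << 2) | (((g >> bit) & 1) << 1) | ((b >> bit) & 1)


def da_two_bits_per_clock(lut, r, g, b, word_bits=8):
    """Process unsigned inputs two bits per clock: the odd-bit table output
    is doubled, added to the even-bit output, and weighted by 4 per clock."""
    acc = 0
    for k in range(0, word_bits, 2):          # 4 clocks for 8-bit data
        partial = lut[lut_address(r, g, b, k)] + 2 * lut[lut_address(r, g, b, k + 1)]
        acc += partial << k
    return acc


cr, cg, cb = quantize((0.299, 0.587, 0.114))   # Y coefficients
lut_y = build_lut(cr, cg, cb)
y_scaled = da_two_bits_per_clock(lut_y, 10, 200, 33)
print(y_scaled >> SCALE_BITS)                  # integer part of Y
```

The same function serves U and V by building the table from their coefficients. Since R, G and B are unsigned here, no sign-bit correction is needed on the last cycle.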

LSB SIGN EXTEND

MSB

Video Coding Application with 2x Clock

A

B

REG

Y(R,G,B)U(R,G,B)V(R,G,B)

12

12 BITS WIDE8 BITS WIDE

G 8

X0

X2

X1

X8

8 BITS WIDE

R 8

X0

X2

X1

X8

8 BITS WIDE

B 8

X0

X2

X1

X8

2xCLK

2xCLK

2xCLK

LUT-A

ADRS

DATA10 Bit

LUT-A

ADRS

DATA10 Bit

G0R0

B0SIGN EXTEND

MSB

2xA

B

REG

LUT-A

ADRS

DATA10 Bit

LUT-A

ADRS

DATA10 Bit

SIGN EXTEND

MSB

2xA

B

REG 4x

G1R1

B1

G2R2

B2

G3R3

B3

A

B

REG 16x

PARALLEL LOAD 4-BIT SHIFT REG: 4 CLBs EA, = 12 CLBs

LUTs are the same: 5 CLBs EA, = 20 CLBs

10 Bit ADDER + REG: 5.5 CLBs EA, = 11 CLBs

12 Bit ADDER + 2 REGs: 7 CLBs

14 Bit ADDER: 7 CLBs

The total design would use about 180 CLBs with control logic.

All four LUTs are the same.

LSB

LSB

LSB

SIGN EXTEND

MSB

Xilinx Introduces First Fully Programmable

System Solution

First FPGA Architecture Designed for Intellectual Property

FPGA Technology Roadmap

[Chart: Density/Performance versus Year, 1995 to 1999]

• XC4000E: Largest Device XC4025, 0.5 µ

• XC4000EX: Largest Device XC4036EX, 0.5 µ

• XC4000XL: Largest Device XC4085XL, 0.35 µ

• XC4000XV: Largest Device XC40250XV, 0.25 µ

• Generation 3 architecture: 1 Million+ system gates, System Solution, 0.25/0.18 µ

Virtex FPGAs Leverage Xilinx Process Technology Leadership

[Chart: Process Technology and Supply Voltage. Feature Size (µ), axis 0 to 1.2, and supply Voltage (5, 3.3, 2.5, 1.8, 1.3) plotted against year, 1990 to 2002.]

• Lower cost

• Faster speed

• Higher density

• Lower power

Virtex FPGAs Ship

Voltage and Family Migration

• Virtex FPGAs and XC4000XV share common process (0.25 µ)

– 2.5 V logic, 3.3 V I/O with 5 V tolerance

• Family migration from XC4000XL possible

– Voltage migration guide will assist users

• Design with XC4000XL now and plan ahead for XC4000XV and Virtex FPGAs

Xilinx 0.25 µ 5 Volt-Compatible FPGAs

• Family migration possible if you plan for:

– Additional power/ground pins

– Dedicated clock and configuration pins

• Voltage migration guide to help users

Any 5 V

device(XC4000E)

Virtex&

XC4000XV2.5 V logic3.3 V I/O

Any 3.3 V

device(XC4000XL)

5 V3.3 V

2.5 V

5 V

3.3 V 3.3 V

3.3 V

I/OSupply

LogicSupply

Meets TTLLevels

Accepts5 V levels

System Level Design Trend

DSP

CustomLogic

BusI/F

RAMI/F

High-DensityHigh-PerformanceCustom Device

PCI

Scratch PadSRAM

PC Board

Introducing Xilinx Virtex FPGAs

Segmented Routing, 4-Input LUT FPGA Architecture

Fast, Flexible I/Os

System Building Blocks

Software

IP

Leading Edge Process Technology

World’s first fully programmable system-level architecture

Advanced Process Technology

0.5u Process            0.25u UMC Process
- locos isolation       - shallow trench isolation
- birds beak            - 0.9u metal pitch
- no planarization      - CMP
- only contact plug     - plug for all vias

Family Overview

• 0.25um, 5 layer metal process

• Density: 50 thousand to 1 million system gates

• Performance

– 100+ MHz performance
• 3 to 4 LUT levels

– 160 MHz system performance
• Clock to output + input setup

• First device in 2Q98

– 250,000 system gates

– One million system gate device by end of 1998

Virtex FPGA Performance

• 100+ MHz internal speeds

– 155 MHz SONET data stream processing

– 100+ MHz Pipelined Multipliers

– 66 MHz PCI

• 100+ MHz system interface speeds

                          without PLL    with PLL
Tco (output register)     6 ns           3.5 ns
Tsu (input register)      3 ns           3 ns
Th (input register)       0 ns           0 ns
Max I/O performance       110 MHz        160 MHz

Functional Block DiagramCLB Segmented routing

SelectI/OPins

DistributedSelectRAMMemory

BlockSelectRAMMemory

PLL

66 MHz PCI SSTL3

Vector BasedInterconnectdelay=f(vector)

Virtex Clocking

Clocking and PLL

• 4 low skew clock resources

– 3ns setup, 0ns hold clock pad -> IOB input FF

– 6ns clock to out clock pad -> IOB output FF

• 24 Additional low skew globals

– clocks, enables, resets, etc

– faster than 4KXL secondary global buffer

• PLL for system clock deskew and fast clock to out.

Virtex CLB

Segmented Routing Interconnect

3-STATE BUSSES

SWITCHMATRIX

2 LCs 2 LCs

CA

RR

Y

CA

RR

Y

CLB

CA

RR

Y

CA

RR

Y

• Fast local routing within CLBs

• General purpose routing between CLBs

• Fast Interconnect

– 8ns across 250,000 system gates

• Predictable for early design analysis

• Optimized for five layer metal process

2 LCs 2 LCs

CLB

4 InputLUT

RegisterCarryand

Control

I3I2I1I0

O

WI DI

DCE

CLK

Q

CO

CI

4 InputLUT

RegisterCarryand

Control

I3I2I1I0

O

WI DI

DCE

CLK

Q

CO

CI

PR

RS

PR

RS

Polarity of all control signals selectable

Fast arithmetic and multiplier circuitry

Optimized for synthesis

Virtex Configurable Logic Block

Virtex IO

Simplified IOB

• Fast I/O drivers

• Registered input, output, 3-state enable control

• Programmable slew rate, pull-up, input delay, etc.

• Selectable I/O Standards

– SSTL, GTL, LVTTL...

DCE

S/R

Q

DFF/LATCH

DCE

S/R

Q

DFF/LATCH

DCE

S/R

Q

DFF/LATCH

PAD

Virtex Memory

SelectRAM+ Memory Features

• Distributed SelectRAM Memory

– Pioneered in XC4000 family

– 16x1 synchronous SRAM implemented in LUT

– Ideal for DSP applications

– Access over one hundred billion bytes/sec

• Block SelectRAM Memory

– 4096 bit blocks of dual port synchronous SRAM

– Configurable widths of 1, 2, 4, 8, and 16

– Ideal for data buffers and FIFOs

– Up to 17 gigabytes/sec access

• Fast Access to External RAM

– Direct interface to SSTL3, 3.3V synchronous DRAM standard

– 133 MHz

Block RAM

• Configure as: 4096 bits with variable aspect ratio

• 8-32 blocks across family devices

• True dual-port, fully synchronous operation

– Cycle time <10 ns

• Flexible block RAM configuration

– 5 blocks: 2K x 10 video line buffer

– 1 block: 512 x 8 ATM buffer (9 frames)

– 4 blocks: 2K x 8 FIFO

– 9 blocks: 4K x 9 FIFO with parity
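The block counts quoted for each buffer follow from the 4096-bit block size. The sketch below is a simple bit-count estimate under that assumption. It ignores the per-block width options (1, 2, 4, 8, 16), which happen not to change the result for these four cases, and the function name is hypothetical.

```python
BLOCK_BITS = 4096             # bits per block RAM, as stated on the slide


def blocks_needed(depth, width, block_bits=BLOCK_BITS):
    """Smallest number of blocks whose total capacity covers depth x width bits."""
    total_bits = depth * width
    return -(-total_bits // block_bits)       # ceiling division


for name, depth, width in [
    ("2K x 10 video line buffer", 2048, 10),
    ("512 x 8 ATM buffer", 512, 8),
    ("2K x 8 FIFO", 2048, 8),
    ("4K x 9 FIFO with parity", 4096, 9),
]:
    print(f"{name}: {blocks_needed(depth, width)} block(s)")
```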

[Figure: RAMB4 block. Port A: WEA, ENA, CLKA, ADDRA, DINA, DOA. Port B: WEB, ENB, CLKB, ADDRB, DINB, DOB.]

High SpeedSynchronous

DRAM(Mbytes)

VideoData In

Frame DataBlock

SelectRAMMemory(Kbytes)

BlockSelectRAM

Memory(kbytes)

DistributedSelectRAM

Memory(bytes)

Line Data

DistributedSelectRAM

Memory(bytes)

Pixel Data

Video PixelProcessing

Function(logic)

ProcessedVideo Out

Real Time Video Processor

Virtex FPGA

Hierarchy of RAM provides efficient and very high bandwidth data processing

Virtex FPGA Summary

• 1 Million+ system gates

• 100+ MHz performance from all devices

• Building blocks for system level design

• ASIC design flow software

• Platform for CORE reuse

First fully programmable system solution
