Section I Introduction to Programmable Logic Devices
Programmable Logic Device Families
• Logic divides into Standard Logic and ASIC
• ASIC divides into Programmable Logic Devices (PLDs), Gate Arrays, Cell-Based ICs, and Full Custom ICs
• PLDs divide into SPLDs (PALs), CPLDs, and FPGAs
Source: Dataquest
Acronyms
• SPLD = Simple Programmable Logic Device
• PAL = Programmable Array Logic
• CPLD = Complex PLD
• FPGA = Field Programmable Gate Array
Common Resources
• Configurable Logic Blocks (CLB)
– Memory Look-Up Table
– AND-OR planes
– Simple gates
• Input / Output Blocks (IOB)
– Bidirectional, latches, inverters, pullup/pulldowns
• Interconnect or Routing
– Local, internal feedback, and global
CPLDs and FPGAs
CPLD = Complex Programmable Logic Device; FPGA = Field-Programmable Gate Array
               CPLD                                       FPGA
Architecture   PAL/22V10-like; more combinational         Gate array-like; more registers + RAM
Density        Low-to-medium; 0.5-10K logic gates         Medium-to-high; 1K to 500K system gates
Performance    Predictable timing; up to 200 MHz today    Application dependent; up to 135 MHz today
Interconnect   “Crossbar”                                 Incremental
Not shown: Simple PLD (SPLD) Architecture
PLD Industry Growth
Programmable Logic vs. Semi-Custom ASIC Market
Segment                        1996           2001
Mask Programmed Gate Arrays    $5.6B (59%)    $7.4B (47%)
Programmable Logic Share       $1.9B (20%)    $5.8B (37%)
Standard Logic                 $2.0B (21%)    $2.6B (16%)
Total Market                   $9.5B          $15.8B
Source: Dataquest, May 1997
Who is Xilinx?
• World’s leading innovator of complete programmable logic solutions
• Inventor of the Field Programmable Gate Array
• $600M Annual Revenues; 35+% annual growth
• Fabless* Semiconductor and Software Company
– UMC (Taiwan) {*Xilinx acquired an equity stake in UMC in 1996}
– Yamaha (Japan)
– Seiko Epson (Japan)
• Products: Programmable Logic Chips; Foundation and Alliance Series Design Software
Xilinx vs. Competitors
1997 Calendar Year Revenues
[Bar chart: 1997 revenues in $ millions, on a 0 to 700 scale, for Altera, Xilinx, Vantis, Lattice, Actel, Lucent, Cypress, Atmel, and QuickLogic]
Source: Company reports & In-Stat. Includes SPLD, CPLD, FPGA revenues.
FPGA Market Share Q4 1997
• Xilinx: 55%
• Actel: 16%
• Altera: 14% (Altera number includes both 8K and 10K families)
• Lucent: 10%
• Others: 5%
Source: In-Stat Research, March 1998
Process & Density Leadership
[Chart: transistor count in millions (axis ticks 7.5, 25, 50, 75) by quarter, 4Q97 through 4Q98]
• XC40125XV: industry’s 1st 0.25u PLD; ~250K gates, 5 LM
• XC40150XV
• XC40250XV: ~500K gates
• 0.25u process
• Virtex: 1 Million Gates
Xilinx Integrated Circuit Products
• XC9500: Flash-based In-System Programmable CPLDs
– Lowest price, best pin locking, 600 - 7K gates
• XC4000: Industry’s largest & fastest FPGAs
– XC4000E: 0.5u, 5V, 5K - 40K gates
– XC4000EX: 0.5u, 5V, 45K - 60K gates
– XC4000XL: 0.35u, 3.3V devices, 5V compatible I/O, 3K - 180K gates
– XC4000XV: 0.25u, 2.5V / 3.3V, 5V compatible I/O, 250K - 500K gates
– Spartan: 0.5u, 5V, Low Cost, 10K - 40K gates
• Virtex: New FPGA architecture in 1998
– 0.25u, 5LM, 250K-1M gates, Select & Block-RAM
• XC6200: Reconfigurable Processing Unit
– Dynamically and partially reconfigurable
• Low-cost solutions (Industry)
– XC3000 (no RAM), XC5200 (no RAM), HardWire
[Table: each family marked for Research, Upper Level Class, and Core Class use; the individual marks are not recoverable from this copy]
* Gates are in terms of system-level gates
XC9500 CPLDs
• 5 volt in-system programmable (ISP) CPLDs
• 5 ns pin-to-pin
• 36 to 288 macrocells (6400 gates)
• Industry’s best pin-locking architecture
• 10,000 program/erase cycles
• Complete IEEE 1149.1 JTAG capability
[Block diagram: JTAG Port, JTAG Controller, and In-System Programming Controller; I/O Blocks; FastCONNECT Switch Matrix; Function Blocks 1 through 4; 3 Global Clocks, 1 Global Set/Reset, and 2 or 4 Global Tri-States]
Xilinx XC4000 Architecture
[Diagram: array of Configurable Logic Blocks (CLBs) with Switch Matrix and Programmable Interconnect, surrounded by I/O Blocks (IOBs). IOB detail: pad, input buffer, output buffer, slew rate control, passive pull-up/pull-down, delay, and input and output flip-flops. CLB detail: F, G, and H function generators (inputs F1-F4, G1-G4), control inputs C1-C4 (H1, DIN, S/R, EC), clock K, two flip-flops with S/R control and clock enable, outputs X and Y]
• High Density -> 1M System Gates
• SRAM Based LUT for Synchronous Dual Port RAM or Logic
• ASIC-like array structure
• Built-in Tri-States
• Infinite reconfigurations, downloaded from PC or workstation in ~1 second
XC6200 Reconfigurable Processing Unit
[Diagram: CPU, XC6200 RPU, Memory, and I/O]
• 1000x improvement in reconfiguration time from external memory
• FastMAP™ assures high speed direct access to all internal registers
• All registers accessed via built-in low-skew FastMAP™ busses
• Microprocessor interface built-in: “XC6200 is memory mapped to look like SRAM to a host processor”
• High capacity distributed memory permits allocation of chip resources to logic or memory
– 256kbits in XC6264
• Ultrafast Partial Reconfiguration (40ns to 100’s of usec)
• Up to 100,000 gates
Exponential Growth in Density
• Nov. 1997 - shipping world’s largest FPGA, XC40125XV (10,982 logic cells, 250K System Gates)
• 1 Logic cell = 4-input LUT + FF
• 175,000 Logic cells = 2.0 M logic gates in 2001
[Chart: logic cells (1,000 to 1,000,000) and logic gates (12K to 12M) on a logarithmic scale versus year, 1994 to 2002, reaching 2 million logic gates]
Design Flow
1. Design Entry in schematic, ABEL, VHDL, and/or Verilog. Vendors include Synopsys, Aldec (Xilinx Foundation), Mentor, Cadence, Viewlogic, and 35 others.
2. Implementation includes Placement & Routing and bitstream generation using Xilinx’s M1 Technology. Also, analyze timing, view layout, and more.
3. Download directly to the Xilinx hardware device(s) with unlimited reconfigurations*
*XC9500 has 10,000 write/erase cycles
Foundation Series Delivers Value & Ease of Use
• Complete, ready-to-use software solution
• Simple, easy-to-use design environment
• Easy-to-learn schematic, state-diagram, ABEL, VHDL, & Verilog design
• Synopsys FPGA Express Integration*
The Xilinx Student Edition
• Prentice Hall’s most requested new engineering product in Q1 ‘98
– Complete, affordable, and practical digital design course environment for all students
– Predeveloped and tested lab-based course
• Includes
– Foundation Series 1.3 for students’ computers
– Practical Xilinx Designer lab tutorial book
– Coupon for XS40-005XL and XS95-108 boards ($129)
• Sold through bookstores by Prentice Hall and www.Amazon.com, listed at $79 (ISBN 0136716296)
• Integrated tutorial projects cover: TTL, Boolean Logic, State Machines, Memories, Flip Flops, Timing, 4-bit and 8-bit processors
• Upgradeable for free to F1.4 Express with VHDL & Verilog, 40K gates, VHDL labs on the web
Section II Basic PLD Architecture
Section II Agenda
• Basic PLD Architecture
– XC9500 and XC4000 Hardware Architectures
– Foundation and Alliance Series Software
XC9500 and XC4000 Hardware Architectures
XC9500 - Architectural Features
• Uniform, all pins fast, PAL-like architecture
• FastCONNECT switch matrix provides 100% routing with 100% utilization
• Flexible function block
– 36 inputs with 18 outputs
– Expandable to 90 product terms per macrocell
– Product term and global three-state enables
– Product term and global clocks
– Product term and global set/reset signals
• 3.3V/5V I/O operation
• Complete IEEE 1149.1 JTAG interface
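XC9500 Function Block
[Diagram: 36 inputs from FastCONNECT feed the AND Array and Product-Term Allocator, which drive Macrocell 1 through Macrocell 18; macrocell outputs go to I/O and back to FastCONNECT; 3 Global Clocks and 2 or 4 Global Tri-State signals]
Each function block is like a 36V18!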
XC9500 Product Family
               9536   9572   95108   95144   95216   95288
Macrocells     36     72     108     144     216     288
Usable Gates   800    1600   2400    3200    4800    6400
tPD (ns)       5      7.5    7.5     7.5     10      10
Registers      36     72     108     144     216     288
Max I/O        34     72     108     133     166     192
Packages
• 9536: VQ44, PC44
• 9572: PC44, PC84, TQ100, PQ100
• 95108: PC84, TQ100, PQ100, PQ160
• 95144: PQ100, PQ160
• 95216: PQ160, HQ208, BG352
• 95288: HQ208, BG352
XC4000 Architecture
[Diagram: array of Configurable Logic Blocks (CLBs) with Switch Matrix and Programmable Interconnect, surrounded by I/O Blocks (IOBs), with IOB and CLB detail views]
XC4000E/X Configurable Logic Blocks
[CLB diagram: F, G, and H function generators (inputs F1-F4, G1-G4); control inputs C1-C4 carrying H1, DIN, S/R, and EC; clock K; two flip-flops with S/R control and clock enable; outputs X, Y, XQ, and YQ]
• 2 Four-input function generators (Look Up Tables)
– 16x1 RAM or Logic function
• 2 Registers
– Each can be configured as Flip Flop or Latch
– Independent clock polarity
– Synchronous and asynchronous Set/Reset
Look Up Tables
• Combinatorial Logic is stored in 16x1 SRAM Look Up Tables (LUTs) in a CLB
• Capacity is limited by number of inputs, not complexity
• Choose to use each function generator as 4-input logic (LUT) or as high speed sync. dual port RAM
• Example:
A B C D | Z
0 0 0 0 | 0
0 0 0 1 | 0
0 0 1 0 | 0
0 0 1 1 | 1
0 1 0 0 | 1
0 1 0 1 | 1
. . .
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 0
1 1 1 1 | 1
[Diagram: the combinatorial logic with inputs A, B, C, D and output Z maps onto the G function generator; A, B, C, D form the 4-bit address on G1-G4, and WE is the write enable]
• Number of distinct functions of four inputs: 2^(2^4) = 64K
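The look-up-table idea can be sketched in a few lines of Python. This is an illustrative model only, not Xilinx software: the names `make_lut` and `lut_read`, the address bit order (A as the most significant bit), and the example function are all assumptions chosen for the sketch.

```python
from itertools import product

def make_lut(func):
    """Fill a 16x1 table by evaluating func(a, b, c, d) for every input combination."""
    return [func(a, b, c, d) & 1 for a, b, c, d in product((0, 1), repeat=4)]

def lut_read(lut, a, b, c, d):
    """Read the stored bit; the four inputs form the 4-bit address (A is the MSB here)."""
    return lut[(a << 3) | (b << 2) | (c << 1) | d]

# Hypothetical example function: Z = (A AND B) XOR (C OR D).
example = make_lut(lambda a, b, c, d: (a & b) ^ (c | d))

# Any 4-input function costs the same 16 bits, however complex its equation.
NUM_FUNCTIONS = 2 ** (2 ** 4)  # 65,536 = 64K distinct 4-input functions
```

The table is filled once (as configuration would do) and then only read, which is why capacity depends on the number of inputs rather than on how complicated the equation looks.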
XC4000X I/O Block Diagram
Shaded areas are not included in XC4000E family.
Xilinx FPGA Routing
• 1) Fast Direct Interconnect - CLB to CLB
• 2) General Purpose Interconnect - Uses switch matrix
• 3) Long Lines
– Segmented across chip
– Global clocks, lowest skew
– 2 Tri-states per CLB for busses
• Other routing types in CPLDs and XC6200
[Diagram: CLBs connected through Switch Matrices]
Other FPGA Resources
• Tri-state buffers for busses (BUFT’s)
• Global clock & high speed buffers (BUFG’s)
• Wide Decoders (DECODEx)
• Internal Oscillator (OSC4)
• Global Reset to all Flip-Flops, Latches (STARTUP)
• CLB special resources
– Fast Carry logic built into CLBs
– Synchronous Dual Port RAM
– Boundary Scan
What’s Really In that Chip?
CLB(Red)
Switch Matrix
Long Lines(Purple)
Direct Interconnect (Green)
Routed Wires (Blue)
Programmable Interconnect Points, PIPs (White)
XC4000XL Family
* 25-30% of CLBs as RAM
* 20-25% of CLBs as RAM
                                        4005XL   4010XL   4013XL   4020XL   4028XL
Logic Cells                             466      950      1,368    1,862    2,432
Typ Gate Range* (Logic + Select-RAM)    3-9K     7-20K    10-30K   13-40K   18-50K
Max. RAM bits (no Logic)                6K       13K      18K      25K      33K
I/O                                     112      160      192      224      256
Initial packages listed for these devices: PC84, PQ100, PQ160, PQ208, HQ208, PQ240, HQ240, BG256, BG352 (per-device columns are not recoverable from this copy)

                                        4036XL   4044XL   4052XL   4062XL   4085XL   40125XV
Logic Cells                             3,078    3,800    4,598    5,472    7,448    10,982
Typ Gate Range* (Logic + Select-RAM)    22-65K   27-80K   33-100K  40-130K  55-180K  78-250K
Max. RAM bits (no Logic)                42K      51K      62K      74K      100K     158K
I/O                                     288      320      352      384      448      544
Initial packages listed for these devices: HQ208, HQ240, BG352, BG432, PG411, PG475, PG559, BG560 (per-device columns are not recoverable from this copy)
HardWire™
• Unique no-risk 100% compatible mask-programmed cost reduction of Xilinx FPGA
• Cost-effective for volume applications
– Savings of 40% to 70%
• Architecture-equivalent mask-programmed version of any FPGA
– Requires virtually no customer engineering resources, test vectors, or simulation
– ALL FPGA features (e.g., Configuration, Power-On Reset, JTAG, etc.) are fully supported
[Diagram: FPGA converted to HardWire]
HardWire Methodology vs. Gate Array Conversion
[Flow diagram. Typical Gate Array Design Phases shown: Capture, Verification, Place and Route, Verification, Test Development, ATPG, and Prototypes, with Iterations along the Gate Array Redesign Path. Xilinx HardWire Methodology shown: FPGA Design, Physical Data Base (.LCA File) Conversion, and Production Ready Prototypes]
Cost Reduction & Density Increases
[Chart: cost versus density, 1996 to 1998, for XC5200, XC4000E, XC4000EX, XC4000XL, XC4000XV, Virtex Series, and HardWire. Density axis pairs logic cells with logic gates: 0.4K / 5,000; 3K / 36,000 (XC4036EX); 7.5K / 85,000 (XC4085XL); 20K / 250,000 (XC40250XV, 500K System-level Gates); Virtex at 1M Gates*]
* Starting with Virtex, Xilinx numbering scheme reflects approximate Logic + RAM gates rather than Logic gates only.
CPLD or FPGA?
CPLD
• Non-volatile
• JTAG Testing
• Wide fan-in
• Fast counters, state machines
• Combinational Logic
• Small student projects, lower level courses
FPGA
• SRAM reconfiguration
• Excellent for computer architecture, DSP, registered designs
• ASIC like design flow
• Great for first year to graduate work
• More common in schools
• PROM required for non-volatile operation
Foundation and Alliance Series Software
Xilinx M1-Based Software
• ALLIANCE Series: Libraries and Interfaces for Leading EDA Vendors
• Foundation Series: Complete, Ready-to-Use
– Includes Schematic, Simulation, VHDL and Verilog Synthesis
• Software Backplane: Core Implementation Software - Map, Place, Route, Bitstream generation, and analysis
• Graphical User Interface is very similar to XACTStep v.6.0
Design Tools
• Standard CAE entry and verification tools
• Xilinx Implementation software implements the design
– The design is optimized for best performance and minimal size
– Graphical User Interface and Command Line Interface
– Easy access to other Xilinx programs
– Manages and tracks design revisions
[Flow diagram: Design Entry (Schematic, State Mach., HDL Code, LogiBLOX, CORE Gen) and Functional Simulation in Foundation or Alliance; Design Implementation in the Xilinx M1 Design Manager, with Back Annotation to the Simulator; Verification by Static Timing Analysis and In-Circuit Testing]
Multi-Source Integration: Mixed-Level Flows
[Diagram labels: Design Source Integration (HDL, Schematic, Existing Designs, Cores); Standards Based (EDIF, VHDL, Verilog, SDF); Knowledge Driven Implementation; Check Point Verification]
• Enables multiple sources and multiple EDA vendors in the same flow
• Allows team development
• Reduces design source translations
• Design the way you are used to
• Enables rapid, accurate iterations
• Works well within existing ASIC flows
• Facilitates Design Reuse
3rd Party Support & Libraries
• Xilinx 3rd Party Design Entry & Simulation Support
– Synopsys, Cadence, Mentor Graphics, Aldec (Foundation)
– Viewlogic, Synplicity, OrCad, Model Technologies, Synario, Exemplar and others supply libs & interfaces
– Industry standard file formats: VHDL, Verilog, and EDIF netlist formats; SDF Standard Delay files; VITAL library support
• Xilinx Libraries
– Optimized components for use in any Xilinx FPGA or CPLD
– Wide range of functions: Comparators, Arithmetic functions, memory; DSP and PCI interfaces
– Easy to use with ABEL, VHDL, Verilog, schematic entry
Libraries, Macros & Attributes
• Libraries are common design sets for all design entry tools (e.g., text, schematic, Foundation, Synopsys, Viewlogic, etc.)
• Library “interfaces” are specific to each front end
• Attributes are library element properties
• Online “Libraries Guide” has full listings and descriptions
– Unified Libraries: Boolean functions, TTL, Flip-Flops, Adders, RAM, small functions
– LogiBLOX Libraries: Variable size blocks of adders, registers, RAM, ROM, etc.; properties defined as attributes
Core Design Technology
Optimal Core Creation & Flexible Core Delivery
• Parameterizable Cores
• Data sheets
• CoreLINX: Web Mechanism to Download New Cores
• SystemLINX: Third Party System Tools Directly Linked With Core Generator
Foundation Series Express Overview
• Easy to use, yet powerful
• Based on Industry Standards, not proprietary languages
• Features:
– Schematic (partnership with Aldec)
– IEEE VHDL, Verilog, ABEL
– State Diagram Editor
– Interactive Simulation
– Exclusive partnership with Synopsys, the synthesis leader
Foundation Project Manager
• Integrates all tools into one environment
Schematic Entry
ABEL and VHDL Text Entry
• From schematic menu (or via HDL Editor), select Hierarchy -> New Symbol Wizard… to create symbol.
• Select HDL Editor & Language Assistant to learn by example, then define block.
• Synthesize to EDIF.
State Machine Graphical Editor
Graphical editor synthesizes into ABEL or VHDL code
Simulation - Easy to Use and Learn
• Generate stimulus easily and quickly
– Keyboard toggling
– Simple clock stimulus
– Custom formulas
• Easy debugging
– Waveform viewer
– Signals easily added and removed
– Simulator access from schematic
– Color-coded values on schematic
• Script Editor
Foundation Express 1.4 Features
• Express Technology
– Optimizes the design for Xilinx Architectures
– Optimized arithmetic functions
– Automatic Global Signal Mapping
– Automatic I/O Pad Mapping
– Resource Sharing
– Hierarchy Control
– Source Code Compatible With Synopsys Design Compiler and FPGA Compiler
– Verilog (IEEE 1364) and VHDL (IEEE 1076-1987) Support
– Easy, graphical constraint entry
– F1.4 is stand-alone
• F1.5: Sept / Oct ’98
– Integrated into Foundation Project Manager
– Replaces Metamor
Xilinx-Express Design Flow
[Flow diagram. Blocks shown: Foundation Design Entry Tools (HDL Editor, State Diagram Editor, Schematic Capture); DSP COREGen & LogiBLOX Module Generator; Express (inputs VHDL/Verilog and Timing Requirements; outputs .XNF and Reports); Xilinx Implementation Tools; Gate Level Simulator; HDL Simulation with Behavioral Simulation Models. File types shown: .V, .VHD, .VEI, .VHI, .UCF, .NGO, XNF, EDIF, SDF, BIT, JEDEC]
Express Input and Output
• Input files may be VHDL or Verilog format
– Mixed Verilog/VHDL modules are accepted
– Schematics may also be used, but should not be input into Express
– Schematic files in XNF or EDIF format will be merged into the design in Xilinx Design Manager
• Output netlists are in XNF format
• Timing Specifications may be specified in Express
– Timing Specifications are not used during Synthesis
– Timing Specifications can be included in the output netlist
[Diagram: VHDL/Verilog and Timing Requirements enter Express, which produces .XNF and Reports]
Express Design Process
1. Analyze - Syntax check
2. Implement - Create generic logic design (Elaborate)
3. Enter constraints and options
4. Synthesize - Optimize the design for specific device
5. Export XNF Netlist
6. Implement layout with Xilinx Design Manager
Implementation - M1 Design Manager
• Manages design data
• Access reports
• Supports CPLDs, FPGAs
[Toolbox: Flow Engine, Timing Analyzer, PROM File Formatter, Hardware Debugger, EPIC Design Editor]
Terminology
• Project
– Source file; has a defined working directory and family
• Version
– A Xilinx netlist translation of the schematic
– Multiple Versions result from iterative schematic changes
• Revision
– An implementation of a Xilinx netlist
– Multiple revisions typically result from different options
• Part type
– Specified at translation; can be changed in a new revision
Toolbox Programs
• Flow Engine
– Controls start/stop points and custom options
• Timing Analyzer
– Report on net and path delays
• PROM File Formatter
– Create file to program configuration file into PROM
• Hardware Debugger
– Download configuration file with XChecker, Serial or JTAG Cable
• EPIC Design Editor
– Device-level view of routing
Flow Engine
• View status of tools
• Control tool options
• Implements design to the bitstream
Section III
Advanced Hardware Design Techniques
Section III Agenda
• Advanced Hardware Design Techniques
– General Hardware Information
– Combinational Logic Design (Look Up Tables and other Resources)
– Synchronous Logic (Flip Flops and Latches)
– Memory Design (RAM and ROM)
– Input / Output Design
General Hardware Information
Resource Estimation
• Find comparable functions in macro library and XAPP application notes
– Or, use other designs to estimate device utilization
• Or, quickly implement a design and view the MAP report file
– Select Utilities -> Report Browser -> Map Report
– IOBs, CLBs, Global Buffers, and other components listed separately
• For unfinished designs
– Use save flags on unconnected nets, or
– Deselect “Trim Unconnected Logic” in Implementation Options
Performance Estimation
• Use block delays as estimate of net delays
• Use desired clock frequency to determine allowed CLB depth
– Compare to functional requirements and modify design to meet performance needs
• Example for 50 MHz clock frequency in XC4000XL-3:
– Clock period: 20 ns
– One level: 8 ns (tCO + tNET + tSU)
– Delay allowance: 12 ns
– Each added level: about 6 ns (tPD + tNET)
– Added levels of logic allowed: 2 CLBs
[Timing path through four CLBs: tCO, tNET, tPD, tNET, tPD, tNET, tSU]
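The 50 MHz example above can be reproduced as a small calculation. This is a sketch of the slide's rule of thumb only; the function name `added_levels_allowed` and its parameters are hypothetical, and the 8 ns and 6 ns figures are the slide's estimates, not guaranteed timing.

```python
import math

def added_levels_allowed(clock_mhz, first_level_ns, added_level_ns):
    """Return how many extra CLB levels fit in one clock period."""
    period_ns = 1000.0 / clock_mhz             # 50 MHz -> 20 ns
    allowance_ns = period_ns - first_level_ns  # time left after tCO + tNET + tSU
    if allowance_ns < 0:
        return 0                               # even one level does not fit
    return math.floor(allowance_ns / added_level_ns)

# Slide example: 20 ns period, 8 ns for the first level, 6 ns per added level.
levels = added_levels_allowed(clock_mhz=50, first_level_ns=8, added_level_ns=6)
```

Real designs should still be checked with the Timing Analyzer after implementation, since net delays vary with placement.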
Power Consumption
• Xilinx FPGAs have flexible routing
– Power consumption can be half that of FPGAs with less flexible routing channels
• Power = kCV²F
– How many nodes change state (hard to estimate)
– Capacitive loading on CLB and IOB outputs (known)
• Power consumption is not a concern in regular course labs
• Power estimation methods
– See application notes under http://www.xilinx.com/apps/3volt.htm
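Because the activity factor k in Power = kCV²F is hard to estimate, one practical use of the formula is to compare two operating points rather than compute absolute power. The sketch below does only that; the function name `relative_power` is hypothetical, and it assumes k is the same in both cases.

```python
def relative_power(v_new, v_old, f_new=1.0, f_old=1.0, c_new=1.0, c_old=1.0):
    """Ratio P_new / P_old under P = k*C*V^2*F, assuming the same activity factor k."""
    return (c_new / c_old) * (v_new / v_old) ** 2 * (f_new / f_old)

# Moving the same design from a 5 V to a 3.3 V supply, with C and F unchanged,
# scales dynamic power by (3.3 / 5)^2, roughly 0.44.
ratio_3v3_vs_5v = relative_power(3.3, 5.0)
```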
XC4000XL 3.3 V, 0.35u, 5 Volt Compatible
• Accepts 5 Volt inputs
• Drives standard TTL levels
• Totally compatible in 5 Volt environment
• 0.25u XV family is also 5 Volt TTL compatible when used with 3.3 Volt I/O supply, 2.5 Volt core supply
[Diagram: XC4000XL FPGA (0.35u, 3.3 V logic, 3.3 V I/O) connected to any 5 V device; the FPGA has 5 V tolerant inputs and its 3.3 V outputs meet TTL levels]
XC4000XV & Virtex 2.5 V, 0.25u, 5 Volt Compatible
• Devices with 5V, 3.3V, and 2.5V power supplies can be interfaced
Combinational Logic Design (Look Up Tables and Other Resources)
XC4000X Configurable Logic Blocks
• G, F, H function generators
• 2 Flip-Flops
– Individual clock polarity
– Sync. and async. Set/Reset
• Delay from F1 to Y in the XC4000X-1 is ~1 nsec
[CLB diagram: F, G, and H function generators, two flip-flops with S/R control and clock enable, outputs X, Y, XQ, and YQ]
16-bit Adder Examples
• Many choices for implementing an adder
– Speed vs. density trade-off controlled by user and PLD features
Family      Type          CLBs   Levels    AppLINX
XC3000A     Bit-Serial    16     16        XAPP 022
XC3000A     Parallel      24     8         XAPP 022
XC3000A     Lookahead     30     6         XAPP 022
XC3000A     Conditional   41     3         XAPP 022
XC4000E-3   Carry         8      10.1ns    XAPP 018
XC5200-5    Carry         8      20ns      5200 DataSheet
Arithmetic Functions
• Arithmetic Macros are optimized for density and speed with dedicated carry logic in CLBs
– Example: Each CLB can form a two-bit full-adder
• Carry Logic components have vertical orientation
– Needed for speed and utilization
– Known as RPM or “Relationally Placed Macro”
– Examples: ADDx adders, ADSUx adder/subtractors, CCx counters, COMPMCx magnitude comparators
[Symbol: ADD4 with inputs A<3:0> and B<3:0> and outputs Z<3:0>]
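A behavioral sketch of what an ADD4-style adder computes may help here. It models a plain ripple-carry chain in Python and is not the Xilinx macro itself; the names `full_adder`, `ripple_add`, and `clbs_for_adder` are hypothetical, and the CLB estimate uses only the slide's figure of two adder bits per CLB.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_add(a, b, width=4, cin=0):
    """Add two unsigned integers bit by bit, as a chain of full adders would."""
    result, carry = 0, cin
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry  # carry is the final carry-out

def clbs_for_adder(width, bits_per_clb=2):
    """Rough CLB count for the sum bits alone, at two adder bits per CLB."""
    return -(-width // bits_per_clb)  # ceiling division
```

The carry passes from each bit to the next, which is why the dedicated vertical carry path matters for speed.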
Three-State Buffers
• Each CLB is associated with two Three-State buffers (BUFT)
– BUFTs are used independently of LUTs and Flip-Flops
• Three-State library components:
– Three-state buffers: BUFT, BUFT4, BUFT8, BUFT16
– Wired AND (open drain): WAND1, WAND4, WAND8, WAND16
– Two input OR driving Wired AND: WOR2AND
• Delay varies per family
– 3.7 ns in the XC4005XL (-1)
– 13.6 ns in the XC4085XL (-1)
• Use to multiplex signals onto long routing lines to use as buses
Use BUFT for Buses
[Diagram: A3-A0 and B3-B0 each drive BUS<3:0> through BUFTs, controlled by _ENABLE_A and _ENABLE_B respectively]
BUFTs for Multiplexers
• BUFTs can be used to build large MUXes
– Large MUXes composed of LUTs need multiple levels of logic
– Large MUXes composed of BUFTs have only one level of logic
– CLB resources are not used
– Use of BUFTs constrains placement
• Multiplexer macros use lookup tables
– Example: M4_1E
• Create BUFT macros from Three-State buffer components
– BUFT, BUFT4, BUFT8, BUFT16
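The way three-state buffers form a one-level multiplexer on a shared line can be sketched behaviorally. This Python model is an illustration under stated assumptions: `resolve_bus` and `bus_mux` are hypothetical names, an abstract "enabled" flag stands in for the real enable pin (its polarity is not modeled), and contention is reported as an error rather than modeled electrically.

```python
def resolve_bus(drivers):
    """Resolve one bus line from (enabled, value) pairs, one pair per three-state buffer.

    Returns the driven value, or None when no buffer is enabled (floating line).
    Raises ValueError on contention (enabled buffers driving different values).
    """
    active = {value for enabled, value in drivers if enabled}
    if not active:
        return None
    if len(active) > 1:
        raise ValueError("bus contention: enabled buffers disagree")
    return active.pop()

def bus_mux(inputs, select):
    """One-level multiplexer: enable exactly the selected buffer on a shared line."""
    return resolve_bus([(i == select, v) for i, v in enumerate(inputs)])
```

However many inputs share the line, selection still takes one buffer delay, which is the advantage over LUT-based MUXes that need several logic levels.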
Wide Decoders
• The Wide Decoder is a dedicated wired-AND
– Useful for address decoding
• IOBs or CLBs can drive the Wide Decoder
– Located along the periphery of the die
– All IOB drivers must be on same edge as the decoder
– Four decoder lines per edge
• Use DECODE macro
– DECODE4/8/16/24
– Must use a PULLUP primitive
[Symbol: DECODE8 with inputs A0-A7 and output O connected to a PULLUP]
CLB Mapping Control in Schematic
• Allows user to force mapping of logic from schematic into a single CLB
• XC3000
– CLBMap can specify entire CLB
• XC4000/XC5000
– FMap specifies a function generator in a CLB
– HMap specifies an XC4000 H function generator in a CLB
[Symbol: FMAP with I1 = A0, I2 = B0, I3 = A2, I4 = B2, and O = C0]
Synchronous Logic (Flip-Flops and Latches)
CLB Registers
• Each register can be configured as a Flip-Flop or Latch
• Independent clock polarity
• Asynchronous Set or Reset
• Clock Enable
• Direct input from CLB input (Connections bypass LUTs)
[Diagram: two CLB registers with outputs QX and QY; each D input selects among DIN, F, G, and H; K (CLOCK), EC (CLOCK ENABLE), and S/R Control driving SET or RESET]
Library offerings
• “Unified” library contains many standard functions
– Pre-defined size and functionality
• LogiBLOX templates are available
– Can be customized for bus size and function
• Types of LogiBLOX register functions
– Shift Registers: Left/Right, Arithmetic, Logical, Circular
– Clock Dividers: Output Duty Cycle
– Counters: LFSR, Binary, One_Hot, Carry Logic
– Accumulators
• Xilinx CORE Generator recommended for very complex functions (DSP, FFT, UARTs, Multipliers...)
Naming Conventions
Flip-Flop name fields
• Type: D-Type (D), JK-Type (JK), Toggle-Type (T)
• Asynchronous Preset (P), Asynchronous Clear (C)
• Synchronous Set (S), Synchronous Reset (R)
• Clock Enable (E)
• Inverted Clock (_1)
Examples
• FDPE_1: Flip-Flop, D Type (FD); Asynchronous Preset (P); Clock Enable (E); Inverted Clock (_1)
• FD16RE: Flip-Flop, D Type (FD); Size (16); Synchronous Reset (R); Clock Enable (E)
• LDCE_1: Transparent D Latch (LD); Asynchronous Clear (C); Gate Enable (E); Inverted Gate (_1)
Counters
• Libraries support a wide variety of fast and efficient counters
– Counters offer trade-offs between speed, density, and complexity
• Example: LogiBLOX counter styles
– Binary: predictable outputs, uses carry logic
– Johnson: fastest practical counter, but uses more flip-flops; glitch free decoding
– LFSR: fast & dense, but pseudo-random outputs
– One-Hot: useful for generating series of enables
– Carry Chain: High speed and density
• The LogiBLOX synthesizer will automatically pick the best implementation based on your design, or you can force an implementation with the STYLE parameter (schematic).
16 Bit Counter Examples
• The following are implemented in XC4000XL-3
Macro          CLBs      Clock
CB16CLE/D      18 - 20   23 - 24 ns
CC16CLED       19        19 ns
CC16CLE        9         16 ns
X-BLOX: LFSR   9         7 ns
• Simpler functions are faster and smaller
• Carry Logic Counters are generally faster (depends on size)
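The "pseudo-random outputs" of an LFSR counter are easy to see in a small model. The sketch below is a generic 4-bit Fibonacci LFSR in Python, not the X-BLOX or LogiBLOX implementation; the function name `lfsr_sequence`, the tap choice (4, 3), and the seed are assumptions for illustration.

```python
def lfsr_sequence(width, taps, seed=1):
    """Yield successive states of a Fibonacci LFSR until the seed state recurs.

    taps are 1-based bit positions whose XOR feeds the new low-order bit.
    The taps must include the top bit (position == width) so the sequence cycles.
    """
    mask = (1 << width) - 1
    state = seed
    while True:
        yield state
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask
        if state == seed:
            return

# 4-bit example: visits all 15 non-zero states in a scrambled order.
states = list(lfsr_sequence(width=4, taps=(4, 3)))
```

Only one XOR sits between flip-flops regardless of counter length, which is consistent with the LFSR being the fastest entry in the table; the price is that the count order is not binary.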
Global Clock Buffers
• Clock Buffers are low-skew, high drive buffers
– Also known as Global Buffers
– Drive low-skew, high-speed long line resources
– Drive all Flip-Flops and Latches in FPGA
– Can also be used for high-fanout signals
• Additional clocks and high fanout signals can be routed on long lines
• Instantiation: if the BUFG component is instantiated, software will select one of these buffers based on the design
• Synthesis: Clocks are identified by different means depending on Vendor
– Example: Synopsys FPGA compiler connects clock buffers to all fan-in of clock pins
– Control clock buffer insertion with separate commands
– Consult Synthesis interface guide or vendor
Global Buffer Types
BUFGLS is used by default in the Xilinx software if a BUFG component is specified in the design
• BUFG: Global Clock (architecture independent). Applications: M1 converts BUFG to most appropriate global buffer.
• BUFFCLK: Global Fast Clock. Applications: fastest way to bring clock on chip. Limitations: 4 per chip, 4KX only; slower for CLBs.
• BUFGE: Global Low Early Clock. Applications: faster than BUFGLS; fast IO interface. Limitations: 8 per chip, 4KX only; drives only 1 quadrant.
• BUFGLS: Global Low Skew Clock. Applications: can access any CLB or IOB, best for CLBs. Limitations: 8 per chip, 4KX only.
• BUFGP: Primary Global Buffer. Applications: drives Clocks or Longlines. Limitations: 4 per chip.
• BUFGS: Secondary Global Buffer. Applications: drives Clocks or Longlines. Limitations: 4 per chip.
Generating Clock On-Chip
• Internal configuration clock available after configuration
– Use OSC4 primitive
– Nominal values (approximately): 8 MHz, (500 kHz, 16 kHz, 490 Hz, 15 Hz)
– Very limited accuracy (+/- 50%)
[Symbol: OSC4 with outputs F8M, F500k, F16k, F490, and F15, driving a BUFGS]
Global Reset
• All flip-flops are initialized during power up via Global Set/Reset network
• You can access Global Set/Reset network by instantiating the STARTUP primitive
– Assert GSR for global set or reset
– GSR is automatically connected to all CLB flip-flops using dedicated routing resources
– Saves general use routing resources for your design
– DO NOT CONNECT GSR to set/reset inputs on Flip-Flops
• Any signal can source the global set/reset, but the source must be defined in the design
[Symbol: STARTUP with inputs GR/GSR, GTS, and CLK and outputs Q1, Q2, Q3, Q4, and DoneIn]
• Use Global Reset as much as possible
– Limit the number of flip-flops with an asynchronous reset
– Extra routing resources are used
Avoid Gated-Clock or Asynch. Reset
• Move gating from clock pin to prevent glitch from affecting logic.
Poor Design:
[Diagram: Binary Counter outputs Q0, Q1, Q2 decoded into TC, which gates the clock of a D flip-flop]
TC and Q may glitch during the transition of Q<0:2> from 011 to 100
Improved Designs:
[Diagram: Binary Counter outputs Q0, Q1, Q2 decoded (Carry-1) and registered to form TC, which drives the CE input of a D flip-flop clocked by CK]
TC will not glitch during the transition of Q<0:2> from 011 to 100
Or use MUXed data when using only 1-2 logic inputs
Shift Registers are Fast & Dense
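• The CLB can handle two bits of a shift register
• Fast and dense independent of size
– Fast connections between adjacent lookup tables
[Diagram: two flip-flops in one CLB with clock enable EC and a Left/Right control; stage Qi takes Qi-1 or Qi+1, and stage Qi+1 takes Qi or Qi+2]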
Prescale Non-Loadable Counters
• Counter speed is determined by the carry delay from LSB to MSB
• Non-loadable counters can use prescaling
– Pre-scaling restricts load timing
[Diagram: the TC output of a Fast Small Counter drives the CE input of a Large Dense Counter with Slower Carry]
Use One-Hot Encoding for State Machines
• Shift register is always fast and dense
– “One-hot” uses one flip-flop for each count
– Useful for state machine encoding in FPGAs
• Another alternative is a Johnson Counter
– Inverted output of last stage drives input of first stage
– Doubles the number of states versus one-hot
• Binary encoding is best for CPLDs
[Diagram: chain of five D flip-flops]
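The state counts of the two shift-register encodings can be checked with a short model. This Python sketch is illustrative only; `one_hot_states` and `johnson_states` are hypothetical names, and the all-zeros start state for the Johnson counter is an assumption.

```python
def one_hot_states(n):
    """States of an n-flip-flop ring counter with a single 1 circulating."""
    return [1 << i for i in range(n)]

def johnson_states(n):
    """States of an n-flip-flop Johnson counter (inverted last stage feeds the first)."""
    mask = (1 << n) - 1
    state, seen = 0, []
    while state not in seen:
        seen.append(state)
        last = (state >> (n - 1)) & 1
        state = ((state << 1) | (last ^ 1)) & mask
    return seen
```

With five flip-flops, as in the diagram, the one-hot ring gives 5 states and the Johnson counter gives 10, and successive Johnson states differ in exactly one bit, which is what makes its decoding glitch free.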
State Machine Design Tips
• Split complex states
• Need to minimize number of inputs, not number of flip-flops, in FPGAs
– Use one-hot encoding for medium-size state machines (~8-16 states)
• Complex states may be improved by breaking up into additional simpler states
[Diagram: State A with transition cond1 to State B, redrawn as State A1 and State A2 leading to State B]
Use binary sequence only if necessary
• CLB can generate any sequence desired at same speed
• Use Pre-Scaling on non-loadable counters to increase speed
– LSBs toggle quickly
– See Application Notes XAPP001 and XAPP014
• Use Gray code counters if decoding outputs
– One bit changes per transition
• Consider Linear Feedback Shift Register for speed when terminal count is all that is needed
– Or when any regular sequence is acceptable (e.g., FIFO)
[Diagrams: a Fast Small Counter whose TC drives the CE of a Large Dense Counter with Slower Carry; a 10-bit SR with outputs Q0, Q6, and Q9 marked]
Pipeline for Speed
• Register-rich FPGAs encourage pipelining
• Pipelining improves speed
– Consider wherever latency is not an issue
– Use for terminal counts, carry lookahead, etc.
• How to estimate the clock period
– 2 x (number of combinatorial levels) x (speed grade)
– XC4000XL-3: 3 levels x 2 x 3ns = 18 ns clock period
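The clock-period rule of thumb above is simple enough to express directly. This is a sketch of the slide's estimate only; the function name `estimated_clock_period_ns` is hypothetical, and the pipelined case assumes a register is placed after every logic level.

```python
def estimated_clock_period_ns(levels, speed_grade_ns):
    """Rule of thumb from the slide: 2 x combinatorial levels x speed grade."""
    return 2 * levels * speed_grade_ns

# Slide example for XC4000XL-3: three levels between registers.
period = estimated_clock_period_ns(levels=3, speed_grade_ns=3)
max_mhz = 1000.0 / period

# Assumed pipelined version: one level between registers, at the cost of latency.
pipelined = estimated_clock_period_ns(levels=1, speed_grade_ns=3)
```

Under this estimate the unpipelined path gives an 18 ns period, while one level per stage gives 6 ns, with the result arriving two clocks later.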
Memory Design (RAM and ROM)
ROM is Equivalent to Logic
• Using a ROM simply defines logic functions in a look-up table format
– Memory might be an easier way to define logic
– Xilinx provides ROM library cells
• FPGA lookup tables are essentially blocks of RAM
– Data is written during configuration
– Data is read after configuration
– Effectively operate as a ROM
[Diagram, As Gates: inputs I1 and I2 on F1 and F2, output O on X, O = I1*I2]
[Diagram, As ROM: address inputs A0 and A1 on F1 and F2, output DOUT on X, with DATA(0)=0, DATA(1)=0, DATA(2)=0, DATA(3)=1]
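The equivalence between the gate and the four-entry ROM can be checked exhaustively. The Python sketch below uses the DATA values from the slide; the function names and the choice of A1 as the high address bit are assumptions (for an AND function the bit order does not change the result).

```python
# ROM contents from the slide: DATA(0)=0, DATA(1)=0, DATA(2)=0, DATA(3)=1.
ROM_DATA = [0, 0, 0, 1]

def rom_read(a1, a0):
    """Look up the stored bit; A1 is taken as the high address bit (an assumption)."""
    return ROM_DATA[(a1 << 1) | a0]

def and_gate(i1, i2):
    """The same function written as a gate: O = I1 * I2."""
    return i1 & i2

# The two descriptions agree on every input combination.
EQUIVALENT = all(rom_read(a, b) == and_gate(a, b) for a in (0, 1) for b in (0, 1))
```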
RAM Provides 16X the Storage of Flip-Flops
• 32 bits versus 2 bits of storage
– Two 16x1 RAMs or one 32x1 single-port RAM fit in one CLB
– One 16x1 dual-port RAM fits in one CLB
• 32x8 shift register with RAM = 11 CLBs
– Using flip-flops, takes 128 CLBs for data alone
– Address decoders not included
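[Figure: one CLB used as RAM (address inputs A0-A4, output O, 32 bits) compared with one CLB used as two flip-flops (inputs D1, D2, outputs Q1, Q2, 2 bits); RAM control inputs WE and CLK]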
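The data-storage part of this comparison can be reproduced with a short calculation. The sketch below is in Python and uses hypothetical names; it counts only the CLBs needed to hold the data bits, so the remaining CLBs in the 11-CLB figure (presumably addressing logic) are not modeled.

```python
import math

RAM_BITS_PER_CLB = 32       # one CLB as a 32x1 single-port RAM (from the slide)
FLIPFLOP_BITS_PER_CLB = 2   # two flip-flops per CLB (from the slide)


def clbs_for_data(depth, width, bits_per_clb):
    """CLBs needed to store depth x width bits, ignoring control logic."""
    return math.ceil(depth * width / bits_per_clb)


ram_clbs = clbs_for_data(32, 8, RAM_BITS_PER_CLB)
ff_clbs = clbs_for_data(32, 8, FLIPFLOP_BITS_PER_CLB)
print("RAM:", ram_clbs, "CLBs for data")
print("Flip-flops:", ff_clbs, "CLBs for data")
```

The ratio of the two results is the 16x advantage in the slide title.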
• Synchronous RAM (SYNC_RAM)
– Synchronous Write Operation
• Synchronous Dual-Port (DP_RAM)
– Can read & write to different addresses simultaneously
RAM Types
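[Figure: single-port RAM with Data, Write Enable, Write Clock, and Address inputs and one Output; dual-port RAM with Data, Write Enable, Write Clock, Write Address/Single-Port Read Address, and Dual-Port Read Address inputs and SP Output and DP Output]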
RAM Guidelines
• Less than 32 words is best
– 32x1 or 16x2 per RAM requires only one CLB
• Delays are short (one level of logic)
– Data and output MUXes are required to expand depth
• Less than 256 words recommended per RAM
– Use external memory for 256 words or more
• Width easily expanded
– Connect the address lines to multiple blocks
• Recommendation: Use less than 1/2 of max memory resources
– Maximum memory uses all logic resources of CLBs
Memory Use
• Most synthesis tools can synthesize ROM from behavioral HDL code, but RAMs must be instantiated
• Use library primitives and macros for standard size memory
– RAM/ROM16X1S to 32X8S
– Use S suffix for Synchronous RAM
– Use D suffix for Dual-Port RAM
• Use LogiBLOX to generate arbitrary size memories
[Figure: RAM32X1S symbol with inputs D, WE, and A0-A4 and output O]
• Use LogiBLOX utility to create arbitrary size RAM or ROM
– Select type: ROM, Synchronous, Asynchronous, or Dual Port RAM
– Specify Depth: number of words must be a multiple of 16, ranging from 16 to 256 words
– Specify Width: word size ranges from 1 to 64 bits
– Specify initialization values with attribute file
• LogiBLOX also creates RAM interface
– Entity and component declaration - cut and paste into the design (VHDL designs)
– Module declaration (Verilog designs)
– Symbol Graphic (schematic entry designs)
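The depth and width limits listed above are easy to check before opening the GUI. The following Python sketch encodes only the limits stated on this slide; the function name and message wording are hypothetical, and it does not call or imitate any LogiBLOX interface.

```python
def check_memory_size(depth, width):
    """Return a list of problems with a requested memory size.

    Limits taken from the slide: depth is a multiple of 16 between
    16 and 256 words, and width is between 1 and 64 bits.
    """
    problems = []
    if depth % 16 != 0:
        problems.append("depth must be a multiple of 16")
    if not 16 <= depth <= 256:
        problems.append("depth must be between 16 and 256 words")
    if not 1 <= width <= 64:
        problems.append("width must be between 1 and 64 bits")
    return problems


print(check_memory_size(256, 8))    # a 256x8 memory passes
print(check_memory_size(24, 8))     # depth is not a multiple of 16
print(check_memory_size(512, 128))  # depth and width both out of range
```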
How to Generate Memory
Memory Generator Dialog
[Figure: LogiBLOX dialog with callouts for Instance Name, LogiBLOX function, Memory Function, and Data file for initialization]
Specify memory type, size, name and function in the LogiBLOX GUI
Section III Advanced Hardware Design Techniques
Input / Output Design
XC4000X IOB Block Diagram
Shaded areas are not included in XC4000E family.
How to specify IO blocks - Schematic
• User explicitly defines what resources in the IOB are to be used
• I/Os are defined with
– 1 pad primitive
– At least 1 function primitive:
• Buffer, F/F, or Latch
• 1 input element, 1 output element or both
– Inverters may also be pulled into IOBs
• IOBs are named by net between pad and function primitives
[Figure: IPAD driving IBUF through net IN1_PAD (IOB IN1_PAD); IPAD driving ILD through net IN2_PAD (IOB IN2_PAD)]
Primary and Secondary Global Buffers
• Eight global buffers per FPGA
– Four primary (BUFGP), four secondary (BUFGS)
• Primary buffers must be driven by a semi-dedicated IOB
• Secondary buffers can be driven by a semi-dedicated IOB or internal logic and have more routing flexibility
– Use BUFGS if extra 1-2 ns of delay is acceptable
• Use generic BUFG primitive in your design
– Allows software to choose best type of buffer
– Allows easy migration across families
[Figure: IPAD driving a BUFG]
I/O Logic
• 4000E families have no Boolean logic other than inverters in the IOBs
• XC4000EX adds optional output logic
– Can be used as a generic two-input function generator or MUX
– One input can be driven by IOB output clock signal
• Driving from FastCLK buffer provides less than 6 ns pin-to-pin delay
– Requires library components beginning with “O”
[Figure: IPAD through BUFFCLK (FAST) and a signal from internal logic feeding an OAND2 (F), which drives an OPAD]
Use Pull-ups/Pull-downs to Prevent Floating
• Unused IOBs:
– Outputs of unused IOBs are automatically disabled
– Pull-ups are automatically connected on unused IOBs
• Used IOBs:
– A PULLUP or PULLDOWN primitive can be connected to used IOBs
– Inputs should not be left floating
• Add a pull-up to design inputs that may be left floating to reduce power and noise
• Output enable may be inverted
– Use OBUFE macro for active-high enable
– Use OBUFT primitive for active-low enable
• Three-state control also via a dedicated global net
– Controlled by same STARTUP primitive
• All I/O disabled during configuration
Output Three-State Control
[Figure: STARTUP block with GTS input; OBUFE with enable OE; OBUFT with three-state control T]
Fast Capture Latch
• Additional latch on input driven by output’s clock signal
• Allows capture of input by very fast clock
– Followed by standard I/O storage element for synchronization to internal logic
– Very fast setup (6.8 ns for 4000EX-3), 0 ns hold
– Available on 4000X, not 4000E family
• Example
– ILDFFDX macro includes Fast Capture Latch and IFDX
– Connect BUFGE to fast capture latch
– Opposite edge of same clock via BUFGLS drives IFDX
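[Figure: ILDFFDX macro. The Data IPAD feeds the fast capture latch (D, GF) and then the input flip-flop (D, CE, Q), whose output goes to internal logic; the Clock IPAD drives BUFGE and BUFGLS]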
Decrease Setup Time with NODELAY
• NODELAY attribute
– Removes the delay element from the input of the IFD or ILD
– Decreases setup time, and creates hold time
– Available on IFD/ILD macros in XC5200 and XC4000E/X families
[Figure: IOB input timing model with labels Pad, Input Buffer, Delay, flip-flop (D, Q), External Clock, Routing Delay, and External Delay]
Output MUX
• OMUX2
– Fast output signal (from output clock pin) MUXes IOB output or clock enable pins to pad
– Effectively doubles the number of device outputs without requiring a larger, more expensive package
– Pin-to-pin delay is less than 6 ns
[Figure: OMUX2 with data inputs D0 and D1, select S0 (FAST), and output O driving an OPAD; an OBUF driving an OPAD]
Slew Rate Control
• Slew rate controls output speed
• Two slew rates
– Default slow slew rate reduces noise
– Use fast slew rate wherever speed is important
– FAST slew rates are approximately 2x faster than SLOW slew rates
• Slew rate specification
– Instantiation: in the user constraint file:
• INST $1I87/obuf SLOW;
– Synthesis: vendor dependent
• Output drive varies by family
– 4KEX/XL families have 12 mA drive
Choose TTL or CMOS Thresholds
• Threshold is selected during configuration
• Default is TTL
– Global selection on inputs or outputs
– Change to CMOS in Configuration Template
– 3V devices need TTL threshold when interfacing to 5V devices
Section IV
Advanced Software Design with Xilinx M1-Based Software
Section IV Agenda
• Design Entry Tips
• Library Types
• FPGA Express for VHDL & Verilog
• M1-Based Software Flow
• Implementation Options
• Design Verification
• PLD Configuration Settings
• Design Constraints
Section IV Advanced Software Design
with Xilinx M1-Based Software
Design Entry Tips
Design Entry Tip - Label Nets
• Label as many nets as possible
– Net names are passed to report files
– Eases debugging
• Names may change due to hierarchy or optimization
• An IOB is named by the net between the pad and I/O function primitives
• A CLB is named by the net on the output
– Flip-flops are always outputs
[Figure: net IN1 names IOB IN1; flip-flop output net Q2 names CLB Q2]
Use Legal and Readable Names
• Allowable characters
– Alphanumeric: A - Z, a - z, 0 - 9
– Underline _, Dash -
– Reserved characters
• Angle brackets for buses <>
• Slash / for hierarchy
• Dollar sign $ for reference designators
• Names must contain at least one non-digit
• Avoid using names that correspond to device resources
– CLB row/column locations: AA, AB, etc.
– IOB pin locations: P1, P2, etc.
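The naming rules above can be turned into a quick pre-check. The Python sketch below is illustrative only: the function name is hypothetical, and the patterns used to spot resource-like names (two capital letters for CLB locations, P followed by digits for pins) are crude assumptions based on the examples on this slide, not the tools' actual rules.

```python
import re

LEGAL_CHARS = re.compile(r"[A-Za-z0-9_-]+")


def check_name(name):
    """Return a list of issues found with a proposed net or instance name."""
    issues = []
    if not LEGAL_CHARS.fullmatch(name):
        issues.append("contains an illegal or reserved character")
    if re.fullmatch(r"[0-9]+", name):
        issues.append("must contain at least one non-digit")
    # Assumed patterns for names that look like device resources.
    if re.fullmatch(r"[A-Z]{2}", name) or re.fullmatch(r"P[0-9]+", name):
        issues.append("looks like a device resource name")
    return issues


for candidate in ("IN1_PAD", "123", "P1", "AB", "bus<3>"):
    print(candidate, check_name(candidate))
```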
Component Naming Conventions
• Common component names, pin names and functions for all families
• Basic format is <function><width><control_inputs>
– CB4CLE = Counter, Binary, 4 bits, Clear, Load, Enable
– FD16RE = Flip-flops, D-type, 16 bits, Reset, Enable
• Control inputs are referenced by a single letter
– C = asynchronous Clear, R = synchronous Reset
– Listed in order of precedence
Use Hierarchy in Design
• Adds structure to design
• Eases debug
• Users can build libraries of common functions
• Allows each design portion to be entered by most efficient method
• Facilitates incremental design and floorplanning
• Supports team design
Section IV Advanced Software Design
with Xilinx M1-Based Software
Library Types
Xilinx Libraries Overview
• Libraries contain descriptions of each component with pin names, functionality, timing, etc.
• There are two libraries:
– The Unified Library contains “ready made” components with non-variable function and size
– The LogiBLOX Library contains templates which can be customized for function and size
• Both libraries allow easy design migration across Xilinx devices and families
LogiBLOX templates and GUI
• LogiBLOX is composed of two parts:
– LogiBLOX Library containing templates of VARIABLE SIZE
• Templates are expanded or customized (Counters, Adders, Registers, RAM, ROM)
• Templates have many implementations (e.g. Binary, Johnson, LFSR counters)
– LogiBLOX GUI and Synthesizer to create
• A design file for implementation
• Symbol for schematic capture tool
• HDL code for instantiation in your design
• Functional simulation model
• One generic model per function type (e.g., counter)
– Attributes can be specified, e.g., bus width, load, clock enable, etc.
• Arithmetic: COUNTER, ADDER, SUBTRACTOR, ACCUMULATOR
• Storage: SHIFT, DATA_REG, PROM, SRAM, DRAM
• Logic: ANDBUS, ORBUS, MUXBUS, DECODE, TRISTATE, COMPARATOR
• I/O: INPUTS, OUTPUTS, BIDIR_IO
• DSP and other complex functions are also available through CORE Generator
Generic LogiBLOX Functions
LogiBLOX Module Selector
• Simple Combinatorial Logic
– Bus size from 2 to 32 bits
– Supports AND, Invert, NAND, NOR, OR, XNOR, XOR
– Any of the inputs or output can be inverted independently
• Use Decode or MASK function
• Three-State Drivers
– Bus size from 2 to 32 bits
– Optional pull-up resistors
• Constants
– Allows signals to be tied high or low
How to use LogiBLOX in HDL code
• If a LogiBLOX function is inferred, there is nothing more to do!
– Check with the synthesis vendor. Most synthesis tools infer simple LogiBLOX components automatically
– Example: Synthesis tools will infer an adder for X <= A + B;
• To instantiate a LogiBLOX function, or if the synthesis tool does not infer LogiBLOX automatically
– Use LogiBLOX GUI from command-line in “stand-alone” mode: %lbgui -vendor
* Creates a LogiBLOX module for simulation
* Creates an entity or module declaration
Section IV Advanced Software Design
with Xilinx M1-Based Software
FPGA Express for VHDL & Verilog Design
Section Agenda
• Overview
• Design Flow
• Instantiation Guidelines
• Coding Style Guidelines
Overview
• Xilinx leads in FPGAs - 55% market share
• Synopsys leads in VHDL/Verilog synthesis - 80% market share
• One result of long term technology partnership is FPGA Express
– Xilinx is only silicon supplier with right to distribute FPGA Express technology
– Integration into Foundation Series
Express Input and Output
• Input files may be VHDL or Verilog format
– Mixed Verilog/VHDL modules are accepted
– Schematics may also be used, but should not be input into Express
– Schematic files in XNF or EDIF format will be merged into the design in Xilinx Design Manager
• Output netlists are in XNF format
• Timing Specifications may be specified in Express
– Timing Specifications are not used during Synthesis
– Timing Specifications can be included in the output netlist
[Figure: VHDL or Verilog source and Timing Requirements enter Express, which produces Reports and an .XNF netlist]
Analyze the Design (1)
• “Analyze” checks the HDL code for syntax errors
– Also creates internal files
• Files are automatically analyzed when selected for a project
• Do not select XNF or EDIF files
– Will be merged into the design by Design Manager
Synthesis -> Identify Sources
Analyze the Design (2)
• As the design blocks are analyzed, status is displayed:
• In this example, all blocks were analyzed successfully
[Figure: Main Window with status icons for No Errors or Warnings, Warnings, Errors, and Out of Date]
Implement the Design
• Express Implementation maps the HDL code to standard logic, creating a generic netlist.
• At this stage, the design has not been optimized
• To implement a design, select only the top level block, and then select the Implement icon
Main Window
Check for Errors and Warnings
• After implementation is complete, the chip symbol plus status is displayed
• View errors, warnings, and messages
• Right click inside window to save information to a text file
Constraint Entry
• Constraints are NOT applied to Synthesis
– Constraints are written to the output netlist (XNF) file for use by Design Manager (Xilinx Implementation Tools)
• Timing constraints control path delay
• Specify paths with timing groups, or groups of IO or sequential elements
– The INPUT Group includes all input ports at the top level of the design
– The OUTPUT Group includes all output ports at the top level of the design
– All flip-flops clocked by the same edge of a common clock belong to a group
– To define constraints: select Synthesis -> Edit Constraints forms
Define Clock Period
• Enter Period, Rise, and Fall Time
– Select Clock entry -> Define
Synthesis -> Edit Constraints -> Clocks -> Define
Synthesis -> Edit Constraints -> Clocks
Define Global Synchronous Delays
• The clock period creates 3 types of global constraints with the same default value:
(1) All input ports to sequential elements
– Setup of flip-flop or latch is included
(2) Sequential element to all output ports
– Flip-flop clock-to-Q delay is included
(3) Sequential element to sequential element
[Figure: paths 1, 2, and 3 through logic between input ports, two flip-flops, and output ports, each constrained to the clock period]
Synthesis -> Edit Constraints -> Paths form
Define Individual Synchronous Delays
• Default delay from Clock specification is used in the Paths form
• Individual, or path-specific, delays can be defined on the Ports form
– Port delays override the global delays from the Paths form
• Input delay, shown here, arrives 20 ns before the rising edge of the clock.
Synthesis -> Edit Constraints -> Ports
Define Key Port Features
• Global Buffer defines the type of Clock Distribution network
– Use BUFG for most applications (default)
• Resistance specifies use of pullup or pulldown resistor on unused pads
– Reduces power consumption and noise
• Use IO Reg allows use of sequential elements within IO Blocks to minimize Input or Output delay (default)
– Dependent on device type
• Pad Location is used to specify pin number of the IO pad
Synthesis -> Edit Constraints -> Ports
Control the Hierarchy
• Eliminate (default) or save hierarchical boundaries
• Flat designs yield best results because more merging and sharing of Boolean logic occurs
• However, small blocks are easier to debug
– Easier to match source HDL code to synthesized design
• Synthesis goals (Speed or Area) and Effort level can be defined for each module
Synthesis -> Edit Constraints -> Modules (implemented design)
Optimize the Design
• Optimization minimizes the design for speed or area
• Select the implementation, and then select the Optimize icon
• After Optimization, check for errors and warnings again
Main Window
View Results
• Select File -> Project Report to generate a report
• Report file contains:
– Files and libraries used
– Settings for Synthesis
– Chip type and speed grade
– Estimated Timing
– Warning: Circuit timing estimates tend to be optimistic. Run timing analysis after routing for the most accurate results.
Report.txt file
Verify Results (1)
• After Optimization, open Synthesis -> Edit Constraints to verify that correct constraints were specified
• Results are based on estimated routing delays
Synthesis -> Edit Constraints -> Paths (for an optimized design)
Verify Results (2)
• Review size of the design
• Resource use is displayed for each hierarchical block
– Black Box instantiations cannot be analyzed by Express
Synthesis -> Edit Constraints -> Modules (Optimized Design)
Export Netlist
• Create the output netlist for use with the Xilinx Design Manager (Xilinx Implementation Tools)
– Output file format is XNF
• Select the optimized design, then select Synthesis -> Export Netlist to create the file
• Enable Export Timing Specifications to include constraints in the output netlist
Synthesis -> Export Netlist
Simulation
• Not covered in this workshop
• Free VHDL / Verilog simulators
– See http://www.xilinx.com/xup/express/express1.htm
– Active VHDL Simulator, by Aldec (Most Recommended)
– VHDL Tools from RASSP
– Accolade Design Automation demo VHDL Simulator
– SimuCAD Silos III (Recommended for Verilog)
– Wellspring Verilog Simulator
• Model Technology Inc. (MTI) and major CAD vendors sell other HDL simulators
Instantiation and Hierarchy
• Hierarchy is created when one design is instantiated into another design
• All components in the Unified and LogiBLOX Libraries may be instantiated
– Unified library components are described in the Libraries Guide
– LogiBLOX components are described in the LogiBLOX Reference/User Guide
• Cells that must be instantiated with Express Synthesis:
– RAM/ROM, Readback, OSC
– Bscan, WOR, WAND
– OAND… (all IOB combinatorial logic)
Black Box Instantiation
• What is a black box? Any element not analyzed by Express. Examples:
– Existing Design Modules or Elements (XNF, EDIF, .ngo)
– LogiBLOX Components
– Pre-Optimized Netlists (PCI Cores or LOGICOREs)
• Procedure for using a black box:
– Create a place holder in the HDL code
– Synthesize the design without the XNF, EDIF, or NGO files
– The Xilinx Implementation Tools will resolve (link in) all black box references
• Limitations
– Express cannot check timing constraints through a black box.
– Express cannot include black box resources in its reports.
– GSR nets are not automatically inferred within Black Boxes
• Instantiate STARTUP and explicitly connect GSR ports in HDL
M1 - Introduction 152
LogiBLOX & CORE Generator Functions
• For HDL designs, LogiBLOX and CORE Generator generate:
– Behavioral VHDL or Verilog model - for simulation only
– VHDL/Verilog Template - for component instantiation
– NGO file - for Xilinx implementation
• Most LogiBLOX functions can be inferred. Exceptions include READBACK and RAM blocks.
• Instantiation may provide better control of design implementation
How to Use LogiBLOX
1. Invoke LogiBLOX from Foundation
2. Select Setup
a. Specify VHDL or Verilog Template in the LogiBLOX Setup form
b. Other setup options may also be required*
3. Specify component features
4. Select OK to create component
5. (VHDL/Verilog) Use template file (.vhi / .vei) to easily instantiate the component
(Verilog) Add empty interface file to define busses.
6. Compile as usual
*To access Verilog options, invoke LogiBLOX directly from Start -> Programs -> Xilinx Foundation Series -> LogiBLOX
RAM Example
• Code is shown in the following slides:
• VHDL instantiation:
– Component and entity declarations were copied into top level design file from LogiBLOX VHI file
• Verilog instantiation:
– Module declaration is copied into top level design file from LogiBLOX VEI file
– Additional empty file is required to specify pin type (input or output)
• Do not try to Analyze the VHD or VEI file from LogiBLOX, but DO Analyze the top level design file
– Verilog users will synthesize the additional empty Verilog file
RAM Instantiation (VHDL)
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.STD_LOGIC_UNSIGNED.all;
entity top is
port (NOTCLR, CLKEN, NOTLD, UPCNT: in STD_LOGIC;
CNT_DI, RAM_DI: in STD_LOGIC_VECTOR (7 downto 0);
QO_LO: out STD_LOGIC_VECTOR (7 downto 0));
end top;
. . .
component ram256x8
PORT(
A: IN std_logic_vector(7 DOWNTO 0);
DI: IN std_logic_vector(7 DOWNTO 0);
WR_EN: IN std_logic;
WR_CLK: IN std_logic;
DO: OUT std_logic_vector(7 DOWNTO 0));
end component;
[Callouts: top-level entity and RAM component declaration; component declaration copied from VHI file]
RAM Instantiation (VHDL) (2)
begin
U1: OSC4
port map (OSC_CK);
U2: BUFG
port map (OSC_CK, CLK);
U3: CB8CLED
port map (CLK, NOTCLR, CLKEN, NOTLD,
UPCNT, CNT_DI, ADDR);
xram : ram256x8 port map
(A => ADDR ,
DI => RAM_DI,
WR_EN => CLKEN,
WR_CLK => CLK ,
DO => QO_LO );
end cr;
[Callouts: last part of the top-level architecture; component declaration is copied from VHI file, and instance name is entered]
Coding for Performance
• FPGAs require better coding styles and more effective design methodologies
– Pipelining techniques allow FPGAs to reach gate array system speeds
• Gate arrays can tolerate poor coding styles and design practices
– 66 MHz is easy for a gate array
• Designs coded for a gate array tend to perform 3x slower when converted to an FPGA
– Not uncommon to see up to 30 layers of logic and 10-20 MHz FPGA designs
– 6-8 FPGA logic levels = 50 MHz
Case vs If-Then-Else (Verilog)
[Figure: 4-to-1 multiplexer (inputs in0-in3, select sel, output mux_out) versus a priority-encoded chain that tests sel=00, sel=01, and sel=10 in turn (output p_encoder_out)]

module mux (in0, in1, in2, in3, sel, mux_out);
  input in0, in1, in2, in3;
  input [1:0] sel;
  output mux_out;
  reg mux_out;
  always @(in0 or in1 or in2 or in3 or sel) begin
    case (sel)
      2'b00: mux_out = in0;
      2'b01: mux_out = in1;
      2'b10: mux_out = in2;
      default: mux_out = in3;
    endcase
  end
endmodule

module p_encoder (in0, in1, in2, in3, sel, p_encoder_out);
  input in0, in1, in2, in3;
  input [1:0] sel;
  output p_encoder_out;
  reg p_encoder_out;
  always @(in0 or in1 or in2 or in3 or sel) begin
    if (sel == 2'b00)
      p_encoder_out = in0;
    else if (sel == 2'b01)
      p_encoder_out = in1;
    else if (sel == 2'b10)
      p_encoder_out = in2;
    else
      p_encoder_out = in3;
  end
endmodule
Reduce Logical Levels of Critical Path(Verilog)
[Figure: two gate networks with inputs in0, in1, in2, in3, and critical and output out; in the second, the critical signal enters at the last logic level]

module critical_bad (in0, in1, in2, in3, critical, out);
  input in0, in1, in2, in3, critical;
  output out;
  assign out = (((in0&in1) & ~critical) | ~in2) & ~in3;
endmodule

module critical_good (in0, in1, in2, in3, critical, out);
  input in0, in1, in2, in3, critical;
  output out;
  assign out = ((in0&in1) | ~in2) & ~in3 & ~critical;
endmodule
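When a critical path is restructured by hand, a quick exhaustive check of the before and after expressions is a cheap safeguard. The Python sketch below transcribes the two assign statements and compares them over all 32 input combinations. As transcribed here, the two expressions differ when critical = 1 and in2 = in3 = 0, so it is worth confirming that the difference is acceptable in the surrounding design. `critical_late` is a hypothetical alternative, not taken from the slide, that keeps the original function while still using critical only at the final level.

```python
from itertools import product


def critical_bad(in0, in1, in2, in3, critical):
    return (((in0 & in1) & (1 - critical)) | (1 - in2)) & (1 - in3)


def critical_good(in0, in1, in2, in3, critical):
    return ((in0 & in1) | (1 - in2)) & (1 - in3) & (1 - critical)


def critical_late(in0, in1, in2, in3, critical):
    """Hypothetical variant: precompute both cases, select with critical last."""
    when_low = ((in0 & in1) | (1 - in2)) & (1 - in3)
    when_high = (1 - in2) & (1 - in3)
    return when_high if critical else when_low


vectors = list(product((0, 1), repeat=5))
mismatches = [v for v in vectors if critical_bad(*v) != critical_good(*v)]
print("bad vs good mismatches (in0, in1, in2, in3, critical):", mismatches)
print("bad vs late identical:",
      all(critical_bad(*v) == critical_late(*v) for v in vectors))
```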
Resource Sharing (Verilog)
[Figure: two adders (a0 + b0 and a1 + b1) followed by a multiplexer selected by sel, versus multiplexers on the operands (a0/a1 and b0/b1, selected by sel) followed by a single adder producing sum]

module poor_resource_sharing (a0, a1, b0, b1, sel, sum);
  input a0, a1, b0, b1, sel;
  output sum;
  reg sum;
  always @(a0 or a1 or b0 or b1 or sel) begin
    if (sel)
      sum = a1 + b1;
    else
      sum = a0 + b0;
  end
endmodule

module good_resource_sharing (a0, a1, b0, b1, sel, sum);
  input a0, a1, b0, b1, sel;
  output sum;
  reg sum;
  reg a_temp, b_temp;
  always @(a0 or a1 or b0 or b1 or sel) begin
    if (sel) begin
      a_temp = a1;
      b_temp = b1;
    end
    else begin
      a_temp = a0;
      b_temp = b0;
    end
    sum = a_temp + b_temp;
  end
endmodule
Register Duplication to Reduce Fan-Out(Verilog)
module low_fanout (in, en, clk, out);
  input [23:0] in;
  input en, clk;
  output [23:0] out;
  reg [23:0] out;
  reg tri_en1, tri_en2;
  always @(posedge clk) begin
    tri_en1 = en;
    tri_en2 = en;
  end
  always @(tri_en1 or in) begin
    if (tri_en1) out[23:12] = in[23:12];
    else out[23:12] = 12'bZ;
  end
  always @(tri_en2 or in) begin
    if (tri_en2) out[11:0] = in[11:0];
    else out[11:0] = 12'bZ;
  end
endmodule

module high_fanout (in, en, clk, out);
  input [23:0] in;
  input en, clk;
  output [23:0] out;
  reg [23:0] out;
  reg tri_en;
  always @(posedge clk) tri_en = en;
  always @(tri_en or in) begin
    if (tri_en) out = in;
    else out = 24'bZ;
  end
endmodule

[Figure: high_fanout, where the registered enable tri_en drives 24 loads, versus low_fanout, where tri_en1 and tri_en2 each drive 12 loads]
Design Partition - Reg at Boundary (Verilog)
[Figure: an adder whose output sum is registered at the module boundary, versus registers on a0 and a1 inside the module followed by an unregistered adder]

module reg_at_boundary (a0, a1, clk, sum);
  input a0, a1, clk;
  output sum;
  reg sum;
  always @(posedge clk) begin
    sum = a0 + a1;
  end
endmodule

module reg_in_module (a0, a1, clk, sum);
  input a0, a1, clk;
  output sum;
  reg sum;
  reg a0_temp, a1_temp;
  always @(posedge clk) begin
    a0_temp = a0;
    a1_temp = a1;
  end
  always @(a0_temp or a1_temp) begin
    sum = a0_temp + a1_temp;
  end
endmodule
Managing FPGA Speed Booster Pipeline (Verilog)
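[Figure: 1-cycle version, with the multiply and add between the input registers and the output register; 2-cycle version, with a register between the multiplier and the adder]

// 1 cycle: multiply and add in the same clock period
module no_pipeline (a, b, c, clk, out);
  input a, b, c, clk;
  output out;
  reg out;
  reg a_temp, b_temp, c_temp;
  always @(posedge clk) begin
    out <= (a_temp * b_temp) + c_temp;
    a_temp <= a;
    b_temp <= b;
    c_temp <= c;
  end
endmodule

// 2 cycle: mult_temp registers the product before the add
// Nonblocking assignments keep the two clocked blocks free of simulation races
module pipeline (a, b, c, clk, out);
  input a, b, c, clk;
  output out;
  reg out;
  reg a_temp, b_temp, c_temp, mult_temp;
  always @(posedge clk) begin
    mult_temp <= a_temp * b_temp;
    a_temp <= a;
    b_temp <= b;
  end
  always @(posedge clk) begin
    out <= mult_temp + c_temp;
    c_temp <= c;
  end
endmodule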
When to Use Tri-state Buffers (BUFTs)
• BUFTs can be used to implement:
– Internal Tri-state busses
– Muxes greater than 4-to-1 or Multiplexed Buses
• BUFTs can be inferred:
– Tri-states are inferred when a ‘Z’ can be assigned to a signal
• BUFTs can be instantiated:
– BUFT components
– LogiBLOX Tri-State Buffers
– Within a wide MUX: LogiBLOX Wired-AND MUX
4-to-1 Tri-State MUX Before (VHDL)
[Figure: four tri-state buffers driving SIG from DATA(0)-DATA(3), enabled by SEL(0)-SEL(3)]

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity TST is
  port (
    DATA: in std_logic_vector(3 downto 0);
    SEL: in integer;
    SIG: out std_logic
  );
end TST;

architecture BEH of TST is
begin
  LOOP1: for I in 0 to 3 generate
    SIG <= DATA(I) when (SEL = I) else 'Z';
  end generate;
end BEH;

• Is there a problem with this example?
4-to-1 Tri-State MUX After (VHDL)
• How can this code be improved?
– Default integer is 32 bits
– Define a limit

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity TST is
  port (
    DATA: in std_logic_vector(3 downto 0);
    SELECTOR: in integer range 0 to 3;
    SELECTION: out std_logic
  );
end TST;
. . .

          Before  After
CLBs      8       4
IOBs      37      7
TBUFs     4       4
Flip-Flop Examples (VHDL)
• Flip-Flop inference driven by 'event in VHDL

-- D flip-flop (produces registered output)
FF: process (CLOCK)
begin
  if (CLOCK'event and CLOCK='1') then
    A_Q_OUT <= D_IN;
  end if;
end process; -- End FF

-- Flip-flop with asynchronous preset and clock enable
FF_CLOCK_ENABLE: process (ENABLE, PRESET, CLOCK)
begin
  if (PRESET = '1') then                    -- generates async preset
    D_Q_OUT <= "11111111";
  elsif (CLOCK'event and CLOCK='1') then
    if (ENABLE='1') then                    -- generates clock enable
      D_Q_OUT <= D_IN;
    end if;
  end if;
end process; -- End FF_CLOCK_ENABLE
Flip-Flops vs. Latches
• Latch inference does not include an edge ('event or posedge)
• Latches are generated when:
– A signal is assigned in one branch of an if statement or case statement, but not all branches
– An if or case statement does not define all possible conditions
• Does not apply to case statements in VHDL
• Use Synopsys parallel_case and full_case directives for Verilog to avoid latches
• Or, include a default assignment before the if statement
Global SET/RESET
• All Xilinx FPGAs have a built-in global set/reset facility
• Global SET/RESET sets or resets every sequential element in the FPGA
– GSR signal is accessed by instantiating the STARTUP block.
– GSR will be inferred when the design has a net that sets / resets all sequential elements in the design
– Additionally, sequential elements may be set or reset individually
• These global nets exist outside of the general purpose routing within the device.
How to access Global SET/RESET
• The Global Set/Reset (GSR) signal is accessed by instantiating the STARTUP block.
– Polarity may be inferred
• GSR will be inferred when the design has a net that sets / resets all sequential elements in the design
State Machine Encoding
• For FPGAs, use one-hot encoding for complex state machines
– Works well in Xilinx’s register-rich FPGAs
– Uses fewer wide-input functions
– Generally produces fast state machines
• For CPLDs, use binary encoding
• One-hot and binary encoding can be selected in Express at Synthesis -> Options -> Project
– Other types of encoding such as BCD or Gray may be specified in the HDL code
• It’s best to break up large state machines into smaller ones
Address Range Identification
• For the inequality operators, synthesis will infer two 12-bit comparators
• VHDL Example:
if ADDRESS(31 downto 20) <= "000000000110" and
   ADDRESS(31 downto 20) >= "000000000001" then
• More address ranges are synthesized to more comparators
• Better solution: look for patterns in address bits that can eliminate need for comparators
if (ADDRESS(31 downto 23) = "000000000") and
   (ADDRESS(22 downto 20) /= "111") and
   (ADDRESS(22 downto 20) /= "000") then
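Before replacing comparators with a bit-pattern test, it helps to confirm that both forms select exactly the same addresses. The Python sketch below models the two VHDL conditions on a 32-bit integer and checks every value of the top 12 address bits; the function names are hypothetical.

```python
def in_range_comparators(addr):
    """Range test using two 12-bit comparisons on ADDRESS(31 downto 20)."""
    top = (addr >> 20) & 0xFFF
    return 0b000000000001 <= top <= 0b000000000110


def in_range_pattern(addr):
    """Same range expressed as bit patterns, avoiding magnitude comparators."""
    upper = (addr >> 23) & 0x1FF  # ADDRESS(31 downto 23)
    lower = (addr >> 20) & 0x7    # ADDRESS(22 downto 20)
    return upper == 0 and lower != 0b111 and lower != 0b000


# Only bits 31..20 matter, so sweep all 4096 values of those bits.
assert all(
    in_range_comparators(top << 20) == in_range_pattern(top << 20)
    for top in range(4096)
)
print("both forms select the same addresses")
```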
Arithmetic and Comparison Operators
• Use arithmetic and comparison operators whenever possible. Example:
if (Y > Z) then X <= A + B;
• Arithmetic and comparison operators give Express the most flexibility to optimize
– Multiplier
– Adder, Subtracter, and Adder/Subtracter
– Incrementer, decrementer, and incrementer/decrementer
– Comparator
– Multiplexer (select operator)
• Operators can be instantiated, but generally you will get the best performance with operator inference
Last but not least….
• Expressions
– Use parentheses to indicate precedence.
– Replace repetitive expressions with function calls or continuous assignments
• VHDL generate statements can cause long compile times unfolding the logic - Use wisely
– Be careful with generate statements nested in loops or within generate statements
– Generate example

-- Generate 3 instances of ALU2
GEN1: for N in 0 to 2 generate
  ALU2_X3: ALU2 port map (
    CTL(2 + N*3 downto N*3),
    A(7 + N*8 downto N*8),
    Y(7 + N*8 downto N*8));
end generate;
Resources
• Support Resources
– www.xilinx.com (Answers Search)
– Express Expert Journal: http://www.xilinx.com/support/techsup/journals/fpga_exp/index.htm
– Synthesis Design Guides: http://www.xilinx.com/apps/hdl.htm
• On-Line Documentation
START -> Programs -> Xilinx Foundation Series -> VHDL Reference Manual
START -> Programs -> Xilinx Foundation Series -> Verilog Reference Manual
START -> Programs -> Xilinx Foundation Series -> On-Line Books -> Express User’s Guide and Express Application Supplement
Section IV Advanced Software Design with Xilinx M1-Based Software
M1-Based Software Flow
Logical Design Files
• Logical Design Files describe your design, and are composed of logical components
– Typically a netlist, generated by Schematic Capture or Synthesis
– Composed of Boolean Gates, FIFOs, RAMs
• Netlist input to XACT-Step M1 is in EDIF format
– XNF files are also accepted
• EDIF format files are translated to NGD (Native Generic Design) format
– NGD files have varying extensions, e.g., NGD, NGM, NGA, NGO
• NGD files can be translated to other formats for simulation
Physical Design Files
• Physical design files are composed of components found in a Xilinx FPGA such as look-up tables and flip-flops
– Physical design files have .ncd extension
– Map creates an NCD file from an NGD file
– NCD files contain varying pieces of information
• Mapping, placement, and routing tools each append data to the bottom of the NCD file
M1-Based Design Flow
[Figure: M1 design flow]
.XNF or EDIF netlist, UCF (User Constraint File)
-> NGDBUILD: Flatten hierarchical design -> .NGD
-> MAP: Logical to physical translation; groups LUTs and FFs into CLBs -> .NCD, .PCF
-> TRCE: Static timing estimates
-> PAR: Layout of physical design; routes physical design -> .NCD
-> TRCE: Static timing analysis
-> BITGEN: Generates configuration file -> .BIT
*Design entry tool flows to M1 are shown in the Appendix.
Design Flow Programs (1)
• NGDBUILD
– Merges hierarchical EDIF or XNF files into one hierarchical file
– Creates internal netlist .ngd (Native Generic Design) files
– Contains logical components: combinatorial gates, RAMs, flip-flops, etc.
• MAP
– Maps logical components to physical components found in Xilinx FPGA: look-up tables, flip-flops, three-state buffers, etc.
– Packs physical components into COMPS
– Creates internal .ncd (Native Circuit Design) file
[Flow: Translate -> Map -> Place & Route -> Configure]
Design Flow Programs (2)
• TRCE
– Analyzes timing
• Use before PAR to analyze constraints
• PAR
– Places COMPS on FPGA
– Routes the FPGA
• TRCE
– Analyzes timing
• Use after PAR to check delays
• NGDANNO
– Back-annotates timing delays for simulation
• BITGEN
– Creates file to configure FPGA
Key M1 Browser Reports
• Map Report
– Displays result of DRC (Design Rule Check)
– Indicates if the design will fit into the specified part
– Identifies ways to improve the design
– Reports nets with no source or load
• Logic Level Timing Report provides delay estimates
– Reports longest paths in the design
– Created before placement
– Based on block delays and minimum net delays
Key Report Files
• Placement and Routing Report includes resource summary
– Indicates the percentage of utilization
– The number of I/O and flip-flops is specified
– Reports if the design routed
– Gives an overall timing score
• Score of zero indicates all timing specifications were met
• Post Layout Timing Report
– Based on block delays and net delays after routing
– Used for detailed delay analysis after implementation
• Pad report
– Cross reference of Input/Output components and package pins
BEL and Comp Terminology
• XACTstep M1 uses two new terms for FPGA resources: “Comps” and “Bels”
  – A comp may refer to a CLB, IOB, TBUF, or Decoder
  – A BEL may refer to the contents of a comp, such as F-LUT, H-LUT, FFX, FFY, RAM, or PAD
• The Graphic Design Editor (EPIC) and TRCE timing reports refer to BELs
[Figure: 4000X CLB containing G_LUT, F_LUT, H_LUT, FFX, and FFY]
The COMP shown here is a CLB, which contains BELs: F_LUT, G_LUT, H_LUT, FFX, and FFY
Section IV Advanced Software Design with Xilinx M1-Based Software
Implementation Options
Main Implementation Menu Options
• Guide Option –Use a previous implementation as template for current implementation
–Specify constraint file (optional)
• MAP, PAR, and configuration options
–Implementation has four sub-menus: Optimize and Map, Place and Route, Timing, and Interface
Optimization and Map Options (1)
Map optimizes your design before it is partitioned into LUTs, flip-flops, etc. The GUI includes these options:
• Trim Unconnected Signals (default is On)
  – Trims all fan-out/fan-in from unconnected pins
  – Turn off to implement hierarchical blocks separately
• Replicate Logic (default is On)
  – Duplicates logic with high fan-out
  – Increases utilization, decreases delay
Optimization and Map Options (2)
• Optimization Strategy (default is Off)
  – Minimizes logic to optimize for speed, area, or both
  – Synthesized designs have been optimized already
• Packing Strategy (default is Minimum Density)
  – Informs Map of how to pack COMPS with logic
  – Minimum Density: Map only puts related logic into the same COMP
  – Fit Device: packs components more tightly into COMPS
  – Can adversely affect timing and routability
• Generate 5-I/P Functions
  – Reduces block levels but increases area
Place and Route Options (1)
• Runtime (default is 2)
  – Trades off placement effort versus CPU time
• Router Passes (default is Auto)
  – The router runs until no further improvement is made toward meeting timing constraints
  – Specify a number to avoid very long run times for difficult designs
  – Start with 3 passes
Utilities -> Template Manager -> Edit Implementation Template -> Place and Route
Place and Route Options (2)
• Workstation users may run PAR LOOP on multiple workstations simultaneously
  – Create a list of available workstations
    • One name per line, no comments
  – Include the file name in the Nodelist field
• Many other options for advanced users, not shown here
Implementation Options for Fast Runtime versus PAR Effort
Other hints: - 4KX and 9500 families give fastest runtimes. - Save this as an implementation template
Deselect these 3 checkboxes
Select fast placement option, 1-2 routing passes, 0 clean-up passes, and deselect “Use Timing Constraints”
Timing Report Options
• Enable the creation of the Timing Report
  – Logic Level Timing Report is created before PAR
    • Has minimal net delays
    • Used to predict realistic constraints
  – Post Layout Timing Report is created after PAR
    • Verify that the design meets constraints
Timing Report Options (2)
• These options limit the information placed in the report file
• All options list paths in order of delay length; longest paths are listed first
Design Performance Summary (Default)
  – Displays longest clock-to-setup, pad-to-setup, and setup-to-pad delays for each clock in the design
Default Timing Constraints
  – Lists longest flip-flop-to-flip-flop, pad-to-flip-flop, and flip-flop-to-pad paths
User Timing Constraints
  – Reports longest paths for each constraint
Design -> Implement -> Options -> Edit Template -> Timing
Controlling the Back Annotation Netlist Format
Format options:
- VHDL
- Verilog
- XNF
- EDIF
EDIF formats:
- Standard (2.0.0)
- Viewlogic
- Mentor EDIF
- LogicModelling
How to Start and Stop the Flow Engine
• Select Flow Engine -> Setup Advanced to select the starting state
• Select Flow Engine -> Setup -> Stop After to set stopping point
Create a Script from the GUI
• M1 can create a script file from the GUI session
  – Available from the Flow Engine or Design Manager
  – Select Utilities -> Command History -> Command Line
  – Select Utilities -> Project Notes
    • Copy, paste, and save text from Command History window
The Guide Option
• Allows use of a previously placed and routed design to guide a new placement
  – Can be useful if there are few design changes
• Guide is used for Map, Place, and Route
  – Map may take much longer to execute, but PAR will be faster
• Recommended alternative is to use location constraints in design
[Figure: previous design guides place & route of the new design]
Effective Use of Guide
• Guide uses signal and component names to determine edited parts of the design
• Name all nets
  – Do not change names
• Minimize changes to the design
  – Any new hierarchy changes all names below
  – Avoid any changes to synthesized logic
• Synthesis users: try to freeze the design with “set_dont_touch” or a similar command
  – Otherwise, guide option may not be useful
Section IV Advanced Software Design with Xilinx M1-Based Software
Design Verification
Recommended Verification Flow
[Figure: design flow steps (Netlist, Implement, Bitgen / PROM File Formatter, Download) paired with verification steps (functional simulation, timing analysis, timing simulation, in-circuit verification)]
Timing Analyzer
• Analyze delays before and after implementation
Timing Analyzer Benefits
• Combines block delays from data book with net delays from implementation files
• Quickly identifies critical paths and timing hazards
• Report shows all elements in path, each element's delay, and cumulative delay
  – Can determine if slow paths are due to block delays (design) or net delays (implementation)

Element              Delay   Total
PAD to IOB.I          2.2     2.2
IOB.I to CLB1.F1      1.1     3.3
CLB1.F1 to CLB1.X     2.7     6.0
CLB1.X to CLB2.F3     1.2     7.2
CLB2.F3 to Clock      2.1     9.3

[Figure: path from IOB (I) through CLB1 (F1 to X) to CLB2 (F3), labeled block, net, block, net, block]
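The cumulative column in the example report is a running sum of the element delays. A minimal Python sketch of that bookkeeping follows; the delays are the five values from the example, while the block/net label on each element is read off the figure and should be treated as an assumption.

```python
from itertools import accumulate

# (element, delay in ns, kind). Delays come from the example report;
# the block/net labels are an assumption taken from the figure labels.
PATH = [
    ("PAD to IOB.I", 2.2, "block"),
    ("IOB.I to CLB1.F1", 1.1, "net"),
    ("CLB1.F1 to CLB1.X", 2.7, "block"),
    ("CLB1.X to CLB2.F3", 1.2, "net"),
    ("CLB2.F3 to Clock", 2.1, "block"),
]

def running_totals(path):
    """Cumulative delay after each element, rounded to 0.1 ns."""
    return [round(t, 1) for t in accumulate(d for _, d, _ in path)]

def total_by_kind(path, kind):
    """Sum of the delays of one kind ('block' or 'net'), rounded to 0.1 ns."""
    return round(sum(d for _, d, k in path if k == kind), 1)

totals = running_totals(PATH)
for (name, delay, kind), total in zip(PATH, totals):
    print(f"{name:<20}{delay:>6.1f}{total:>7.1f}  {kind}")
print("block:", total_by_kind(PATH, "block"), "net:", total_by_kind(PATH, "net"))
```

Splitting the total this way mirrors the point made on the slide: a path dominated by block delay calls for a design change, while one dominated by net delay calls for a different implementation.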
Output Files for Simulation
[Figure: ngdanno and ngd2xxx produce EDIF, XNF, Verilog / SDF, and VHDL / SDF output files]
• Before implementation, output netlist has unit delays, no back-annotation (use for functional simulation)
• After implementation, post-route delays are back-annotated
  – EDIF or XNF output files include back-annotated delays
  – SDF files are created in addition to Verilog & VHDL netlists
• VHDL and Verilog output netlists do not contain delays
M1 HDL Simulation Flow
VHDL & Verilog Simulation Libraries
• UNISIM
  – New for A1.4, allowing RTL and post-synthesis simulation
• SIMPRIM
  – Family/architecture independent models
  – Used for post-M1 simulation including full timing
  – VHDL and Verilog
• Standard Delay Format (SDF) files
  – Separate file used to specify design timing (delays) to VHDL and Verilog simulators
  – Xilinx software version 1.4 supports SDF version 2.1
Hardware Configuration Readback
• Can occur while FPGA runs
• Requires XChecker cable
• Readback Trigger input starts serial readback
• XC3000 controlled via Bitstream Generator
  – Default is enabled
  – Data and trigger connected to Mode pins
• XC4/5000 controlled via schematic and Bitstream Generator
  – Include Readback symbol in schematic
    • Connect TRIG and DATA to I/O pins
    • Can use MD0 and MD1
• See Appendix for more information
[Figure: READBACK symbol with CLK, TRIG, DATA, and RIP pins; XChecker RT and XChecker RD connect through IPAD/IBUF and OBUF/OPAD (MD0, MD1)]
Section IV Advanced Software Design with Xilinx M1-Based Software
PLD Configuration Settings
Bitstream Generator Options - Configuration
• Controlled via Configuration Template
• Increase Configuration Rate if not concerned about compatibility with earlier families
• Add Pull-Up or Pull-Down to avoid having to connect external resistors
• All configuration controls are set in template.
Bitstream Generator Options - Startup
The “Start-Up Clock” switch enables the designer to synchronize start-up with the FPGA's own configuration clock or an external clock signal.
Start-up can also begin when the “Done” pin goes high.
To program the “Output Events”, refer to the Implementation Options section of the “Design Manager User Guide” included with the Documentation CD.
Bitstream Generator Options - Readback
The Hardware Debugger can verify the downloaded configuration and probe the internal states of the device by using the Readback feature.
To use this feature, check the “Enable Bitstream Verification” box, connect the XChecker Cable to your device, and insert the “Readback” symbol into your design.
For more information, refer to the Xilinx Data Book and the Hardware Debugger Reference Guide on the Documentation CD.
Choose a Configuration Method
Configuration Mode       Data         Characteristics
Master Parallel          Byte-Wide    FPGA loads itself from external byte-wide PROM
Master Serial            Bit-Serial   FPGA loads itself from external serial PROM
Peripheral               Byte-Wide    FPGA loaded under microprocessor control
Synchronous Peripheral   Byte-Wide    FPGA loaded by user's configuration clock
Express                  Byte-Wide    Fastest configuration mode; 4000EX devices only
Slave                    Bit-Serial   FPGA loaded by microprocessor or DMA controller; used by XChecker Download Cable
Daisy Chain              Bit-Serial   FPGAs load themselves from PROM; PROM Formatter creates bitstream
M[2:0] pins control configuration mode setting.
Section IV Advanced Software Design
with Xilinx M1-Based Software
Design Constraints
Section Agenda
• Overview
• Location and implementation constraints
• General timing constraints
• Specific timing constraints
  – Path and block specific constraints
  – Path and block grouping
  – Advanced constraint commands
  – Priority
Constraint Entry Overview
• All constraints can be entered in User Constraint File (UCF)
  – Maximum allowable delay
  – Placement of package pins
  – Implementation options
  – Bitstream generation / PROM configuration
• Timing constraints may also be defined in schematic
  – Advantage: easy entry for hierarchical blocks
    • UCF files must have hierarchical net and component names
  – Disadvantage: not all constraints are supported
  – See Libraries Guide for schematic syntax and availability
• Some synthesis tools allow entry of constraints
  – Constraint files may be generated by the synthesis tools, or constraints may be written in output netlist
  – FPGA Express puts constraints into XNF file
UCF Syntax
• Use uppercase letters for keywords
  – Keywords include names used in constraints, such as: AFTER, BEFORE, IN, LOC, NET, OFFSET, OUT, PERIOD
• Use quotes around names with non-alphanumeric characters
• Two types of wildcards may be used:
  – “?” is a wildcard for a single character
  – “*” is a wildcard for any number of characters
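The two wildcards behave like shell-style patterns. As a rough analogy only (this is Python's `fnmatch` module, not the Xilinx matcher, and `fnmatch` additionally treats `[...]` as a character set), the sketch below shows which hypothetical instance names each pattern would select:

```python
from fnmatch import fnmatchcase

def select(names, pattern):
    """Return the names matched by a pattern built from '?' and '*'."""
    return [n for n in names if fnmatchcase(n, pattern)]

# Hypothetical instance names, used only for illustration.
instances = ["SLOWFF1", "SLOWFF2", "FASTFF1", "FASTFF2", "REG1", "REG10"]

print(select(instances, "SLOWFF*"))  # '*' stands for any number of characters
print(select(instances, "REG?"))     # '?' stands for exactly one character
```

Note that `REG?` does not pick up `REG10`, which is the practical difference between the two wildcards.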
Pin Location, Implementation Constraints
• Pads can be assigned to a package pin
  – Ex: Assign a bus signal to pin 32
    INST "QOUT<3>" LOC = P32;
• Physical implementation may be controlled in the UCF file, such as:
  – FAST: Set fast I/O slew rate
    Example: INST "$1I87/OBUF" FAST;
  – PART: Define part type to be used
    Example: CONFIG PART=4005E-PQ160C-5;
Simple Combinatorial Path
• Consider the following path:
[Figure: inputs A and B<9:0> pass through 2 levels of logic to OUT2; the pad-to-pad delay is marked 27 ns]
• Assume system requirements dictate a delay of 27 ns for all input to output pins
• The TIMESPEC constraint communicates this requirement to software:
  TIMESPEC TS01 = FROM PADS TO PADS 27 NS;
• PAD-to-PAD TIMESPECs constrain the delay of input and output pads, and all net and block delays in the path
Synchronous I/O Constraints
• Timing requirements for the design are described by defining system delays
• Defining system delays means answering these questions:
  – What is the clock period?
  – When do inputs arrive at IC2?
  – When must outputs be stable to meet setup at IC3?
[Figure: IC1 drives IC2 (FPGA under development), which drives IC3; all three share CLOCK]
Input Arrival Calculation
• Inputs are constrained by their input arrival time.
• Example: When does data arrive at pin D1?
  – After the clock trigger, data delay is Tcko + Tnet + Tpad + Tc1
  – Delay Tc1 represents net delays or other combinatorial elements on the board
  – Delay Tcd is the delay through the FPGA clock distribution network
Tarrival = Tcko + Tnet + Tpad + Tc1
[Figure: IC1 flip-flop (Tcko), C1 (Tc1), Tnet, and Tpad on the path to pin D1 of IC2 (device under development); CLK reaches the IC2 flip-flop through Tcd; the waveform marks Tarrival within the 0 to 50 ns clock period]
Output Stability Calculation
• When does output data need to be stable?
  – Data must be stable in order to meet the setup requirement for IC3
  – How long must the data be stable before data is latched in IC3?
• Tstable = Tc3 + Tpad + Tnet + Tc4 + Tsetup
• Tcd is the delay through the clock distribution network
[Figure: IC2 (device under development) drives pad O1 through C3 (Tc3, Tpad); Tnet and C4 (Tc4) lead to the IC3 flip-flop (Tsetup); the waveform marks Tstable within the 0 to 50 ns clock period]
Period and Offset Constraints
• Two commands are used to describe synchronous delays
  – Period defines the clock
  – Offset constraints define input arrival time and output stability time relative to the clock
• Xilinx software determines internal FPGA delays from Period and Offset constraints
• Syntax:
  NET clock_name PERIOD = some_delay time_unit;
  NET input_name OFFSET = IN Tarrival time AFTER clock_name;
  NET output_name OFFSET = OUT Tstable BEFORE clock_name;
  (input_name and output_name are the names of nets connecting to the I/O pad)
Clock Constraint Example
• Use the PERIOD command to define the clock
• Given that the clock frequency is 20 MHz for the example:
  NET "CLK" PERIOD = 50 ns;
[Figure: example waveform for CLK with edges at 0, 50, and 100 ns]
Synchronous Constraint Example
• OFFSET defines the delay of a signal external to the chip, relative to a clock. Internal clock delays are determined by software.
[Figure: FF1 and FF2 between ADD0_IN and OUT1 with a 40 ns clock; Tarrival = 14 ns and Tstable = 12 ns are external delays, and the remaining delays are determined by software; the waveform marks 0, 14, 20, 28, and 40 ns]
NET "CLK" PERIOD = 40;
NET "ADD0_IN" OFFSET = IN 14 AFTER CLK;
NET "ADD0_OUT" OFFSET = OUT 12 BEFORE CLK;
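One way to read the example is as a time budget: whatever part of the clock period is not consumed outside the chip is left for the FPGA. The sketch below works that arithmetic for the 40 ns example. It is an illustration under simplifying assumptions (a single clock, edge-to-edge period, clock distribution delay ignored), not a description of how the Xilinx timing tools compute their results, and the function name is hypothetical.

```python
def internal_budgets(period_ns, t_arrival_ns, t_stable_ns):
    """Time left inside the FPGA once external delays are subtracted.

    Returns (input_budget, output_budget) in ns:
      input_budget  = period - Tarrival  (input pad to flip-flop, setup included)
      output_budget = period - Tstable   (clock edge to output pad)
    """
    return period_ns - t_arrival_ns, period_ns - t_stable_ns

# Values from the example: PERIOD = 40, OFFSET IN 14 AFTER, OFFSET OUT 12 BEFORE.
in_budget, out_budget = internal_budgets(40, 14, 12)
print(f"input path budget:  {in_budget} ns")
print(f"output path budget: {out_budget} ns")
```

The output budget corresponds to the 28 ns mark on the example waveform: the output must be valid 12 ns before the next clock edge at 40 ns.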
Constraint Recommendations
• Use a given TIMESPEC name for only one path
• Keep constraints in one source
  – Either UCF file or in schematics, but not both
• Avoid over-constraining the design
  – Design performance suffers
    • Critical timing paths get the best placement and fastest routing options
    • As the number of critical paths increases, routability decreases
  – Run times increase
• More information in the On-Line Docs:
  – Libraries Guide
  – Development Systems Reference Guide, Using Timing Constraints, UCF sections
• Schematic users: for path-specific constraints, vendor documentation may be necessary
Question
• Given the following:
  Clock Frequency = 20 MHz
  Tarrival = 31 ns = delay from CLK to input pin D1 of IC2
  Tstable = 27 ns = delay (including setup) from O1 to D pin of FF3 (IC3)
[Figure: IC1, IC2 (device under development), and IC3 on a common CLK; data enters IC2 at D1 and leaves at O1]
Fill in the constraints below:
NET _____ PERIOD = _____ NS;
NET _____ OFFSET = IN _____ AFTER CLK;
NET _____ OFFSET = OUT _____ BEFORE CLK;
Answers:
NET CLK PERIOD = 50 NS;
NET D1 OFFSET = IN 31 NS AFTER CLK;
NET O1 OFFSET = OUT 27 NS BEFORE CLK;
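The blanks in the question follow mechanically from the given values: the period is the reciprocal of the clock frequency, and the two offsets are Tarrival and Tstable. A small Python sketch, with hypothetical helper names, that builds the three lines in the form used on these slides:

```python
def period_ns(freq_mhz):
    """Clock period in ns for a frequency given in MHz."""
    return 1000.0 / freq_mhz

def sync_constraints(clock, freq_mhz, input_net, t_arrival_ns, output_net, t_stable_ns):
    """Build PERIOD and OFFSET constraint lines as plain strings."""
    return [
        f"NET {clock} PERIOD = {period_ns(freq_mhz):g} NS;",
        f"NET {input_net} OFFSET = IN {t_arrival_ns:g} NS AFTER {clock};",
        f"NET {output_net} OFFSET = OUT {t_stable_ns:g} NS BEFORE {clock};",
    ]

# Values from the question: 20 MHz clock, Tarrival = 31 ns at D1, Tstable = 27 ns at O1.
lines = sync_constraints("CLK", 20, "D1", 31, "O1", 27)
print("\n".join(lines))
```

Names that contain non-alphanumeric characters would additionally need quotes, as noted in the UCF syntax rules; the sketch leaves that out.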
Path and Block Specific Constraints
• Why use path or block specific constraints?
  – To decrease speed requirements wherever possible
  – To increase routability and overall speed of the design
  – To decrease software run-time
• General methodology
  – Use PERIOD and OFFSET to constrain the design globally
  – Use specific “FROM-TO” constraints to modify timing for specific blocks or paths
“FROM-TO” Constraint Example
• Consider the example shown below with TIMESPEC:
  TIMESPEC TS01 = FROM PADS TO PADS 21;
• TS01 is applied to both Y to OUT1 and Z to OUT2.
• TS01 over-constrains the path from Z to OUT2.
  – Tight constraints decrease routability and increase run time
[Figure: X feeds FF1 and FF2 clocked by CLK; pad-to-pad paths Y to OUT1 and Z<0:31> to OUT2 pass through 1 and 2 levels of logic; 21 ns is marked on both paths]
“FROM-TO” Constraints
• The two paths could be constrained with two commands:
  TIMESPEC TS01 = FROM PADS(Y) TO PADS(OUT1) 21;
  TIMESPEC TS02 = FROM PADS(Z) TO PADS(OUT2) 28;
• “FROM:TO” constraints can start and stop at flip-flops (use “FFS”), LATCHES, PADS, or RAMS
• Examples:
  – Constrain all inputs to all flip-flops in block NEWFIE:
    TIMESPEC TS03 = FROM PADS TO FFS(NEWFIE) 18 ns;
  – Constrain all flip-flop to flip-flop paths in the design:
    TIMESPEC TS04 = FROM FFS TO FFS 15 ns;
  – Constrain all flip-flop to output paths in the design:
    TIMESPEC TS05 = FROM FFS TO PADS 25 ns;
Creating Groups with TNM
• The TNM constraint creates a group of individual components
• Example: divide flip-flops into two groups based on instance name
  INST SLOWFF* TNM = SLO;
  INST FASTFF* TNM = FST;
• TIMESPECs are assigned to the new groups:
  TIMESPEC TS14 = FROM FFS TO SLO 40 NS;
  TIMESPEC TS15 = FROM FFS TO FST 20 NS;
• Greater flexibility in routing is achieved by creating a different timing requirement for these two groups
[Figure: REG1, REG2, COMB3, SLOWFF1, SLOWFF2, FASTFF1, and FASTFF2]
Pre-Scaled Counter Example
• Highest speed is required in the pre-scaled block
  – Constrain the two counter blocks separately to avoid over-constraining COUNT12
• Define two groups for use in TIMESPEC. Example UCF file:
  INST FFS(PRE2) TNM = PRE;
  INST COUNT12 TNM = UPPER;
  TIMESPEC TS_PRE = FROM PRE TO PRE 60 MHZ;
  TIMESPEC TS_TC2CE = FROM PRE TO UPPER 60 MHZ;
  TIMESPEC TS_UPPER = FROM UPPER TO UPPER 15 MHZ;
[Figure: prescaler PRE2 (Q0, Q1) drives COUNT12 (Q2 to Q13) from TC to CE]
Creating Groups with TIMEGRP
• Another way to constrain this design is by creating smaller groups of endpoints:
• The TIMEGRP constraint is used to create new groups from other groups.
• FFS, LATCHES, RAMS, and PADS are predefined groups
• Example: ALL_FFS group contains all flip-flops whose instance name begins with SLOWFF or FASTFF:
  INST SLOWFF* TNM = SLO;
  INST FASTFF* TNM = FST;
  TIMEGRP ALL_FFS = FFS (FST* : SLO*) ;
Select One Path From Many Paths
• Use to constrain one path among several parallel paths
• First identify the path to be constrained with TPTHRU, then use THRU in the TIMESPEC constraint
• Example: constrain the path through component ABC
  NET RED TPTHRU = ABC;
  TIMESPEC TS_FIFOS = FROM RAMS(FIFORAM) THRU ABC TO FFS(MY_REG*) 25;
[Figure: fiforam drives my_reg00 through my_reg03 over parallel paths; net RED is tagged TPTHRU=ABC]
Forward Tracing
• Forward tracing occurs when a constraint is assigned to a net
• Constraint is applied to all global endpoints driven by the net
• Example: constrain nets driven by DATA0 to flip-flops in block CNT25:
  NET "DATA0" TNM = MYBUS;
  TIMESPEC TS_REGCNT = FROM MYBUS TO FFS(CNT25) 30 NS;
[Figure: net DATA0 with blocks CHEW, BONE, BARK, and CNT25; TS_REGCNT marks the path into CNT25]
Ignoring Paths with TIG and NET
• Timespec Ignore (“TIG”) attribute ignores a TIMESPEC for a specific path or net
• Ex: Assume that net DOG_SLOW was constrained by two constraints, TS01 and TS02. The following specification ignores TS01, so only TS02 is applied to DOG_SLOW.
  NET "DOG_SLOW" TIG = TS01;
• Example to ignore a slow path between registers:
  INST REGA* TNM = REGA;
  INST REGB* TNM = REGB;
  TIMESPEC TS_TIG01 = FROM FFS(REGA) TO FFS(REGB) TIG;
• TIG improves software run-time and routability of the design
Other Constraint Constructs
• Use “EXCEPT” to filter a group of endpoints.
  INST FASTFF* TNM = FST;
  TIMEGRP SLO = FFS EXCEPT FST;
• TPSYNC allows definition of end points that are not FFS, RAMS, PADS or LATCHES.
  NET "BLUE" TPSYNC = BLUE_S;
  TIMESPEC TS_1A = FROM FFS TO BLUE_S 15 NS;
• Signal skew for logic driven by clocks can be constrained using the MAXSKEW constraint
  NET "$1I3245/$SIG_6" MAXSKEW = 3;
  – Specifies a 3 ns difference between the arrival times at all destinations of net $1I3245/$SIG_6
  – Cannot constrain skew of global nets (skew is fixed)
Constraint Priority
• Not all constraints are created equal
  – Highest priority: timing ignores (TIG)
  – FROM:THRU:TO specs
  – FROM:TO specs
  – Lowest priority: PERIOD specs
• “FROM:TO” constraints are further prioritized:
  – Highest: FROM PATH-SPECIFIC TO PATH-SPECIFIC
  – FROM PATH-SPECIFIC TO GLOBAL
  – Lowest: FROM GLOBAL TO GLOBAL
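The first priority list can be read as a lookup: when several constraints cover the same path, the one highest in the list wins. The Python sketch below models only that ordering; the constraint names are hypothetical, and it does not model the finer FROM:TO ranking or anything about the real tools' internals.

```python
# Priority order as listed on the slide, highest first.
PRIORITY = ["TIG", "FROM:THRU:TO", "FROM:TO", "PERIOD"]

def winning_constraint(constraints):
    """Pick the highest-priority (kind, name) pair among those covering one path."""
    return min(constraints, key=lambda c: PRIORITY.index(c[0]))

# Hypothetical constraints that all cover the same path.
covering = [
    ("PERIOD", "TS_CLK"),
    ("FROM:TO", "TS04"),
    ("FROM:THRU:TO", "TS_FIFOS"),
]
print(winning_constraint(covering))
print(winning_constraint(covering + [("TIG", "TS_TIG01")]))
```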
Section V
Special Topics
Section V Agenda
• DSP Design with FPGAs
• New Developments in Programmable Logic
• Virtex, XC6200 and Reconfigurable Logic
• FPGA versus ASIC costs
• Xilinx Student Edition
• Xilinx University Program participation
Section V Special Topics
DSP Design with FPGAs
FPGAs Provide Outstanding DSP Performance
[Figure: a DSP processor with a single multiplier and adder beside an FPGA with multipliers 1 to N and adders working side by side]
DSP Processor
• Sequential processing
• Fixed architecture
• Complex real time software
FPGA
• Parallel processing
• Configurable to specific needs
• No software programming
FPGAs Lower the Cost of High Performance DSP
[Figure: cost ($100 to $500) versus relative performance (5 to 20) for µP/PDSP and FPGA-based DSP]
Customer Successes
• TIM40 Module using FPGAs (XC4010): 3 times the price at 175 times the TI TMS320C40 performance
• DNA Matching (XC4010): similar performance at 1/20th the price
• 128-Track Audio Recording Studio (XC3190): 3 times the functionality at 1/10th the price
FIR Filter Example
[Figure: sample data, N bits wide, enters a delay line K taps long; taps X0, X1, X2, ... are multiplied by coefficients C0, C1, C2, ... and the K products are summed to form the output data]
• Sum of products equation: K multiplies, K sums
• Clock = multiply time
• Sample rate = clock rate
• Implementation?
Traditional FIR Filter Implementation
General-Purpose DSP
– PERFORMANCE = 1 / (MAC cycle time × Number of Taps)
– TMS320: MAC cycle time = one clock cycle
  10-bit, 20-tap filter with 50 MHz TMS320 = 2.5 MHz
  Additional filter taps slow performance
– Pentium: MAC cycle time = 11 clock cycles
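The performance formula can be checked with a few lines of Python. This is a sketch of the arithmetic only; the function name is hypothetical, and the second call simply illustrates the statement that additional taps slow performance.

```python
def mac_sample_rate_mhz(clock_mhz, mac_cycles, taps):
    """Sample rate in MHz = 1 / (MAC cycle time x number of taps)."""
    return clock_mhz / (mac_cycles * taps)

# TMS320 case from the slide: 50 MHz clock, one clock per MAC, 20 taps.
print(mac_sample_rate_mhz(50, 1, 20))
# Doubling the number of taps halves the achievable sample rate.
print(mac_sample_rate_mhz(50, 1, 40))
```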
Distributed Arithmetic (DA) Filter Design
[Figure: sample data passes through a parallel-in, serial-out stage; the serial bits form the address (ADRS) of a look-up table, whose DATA output feeds an adder (A, B) and register with a 2^-1 scaler (binary shift) to produce the filtered data out]
8 word × N bit look-up table:
Address   Contents
000       0
001       C0
010       C1
011       C1 + C0
100       C2
101       C2 + C0
110       C2 + C1
111       C2 + C1 + C0
PERFORMANCE = Clock Frequency / Number of Bits in Sample
10-bit, 20-tap filter using XC4000 at 50 MHz = 5 MHz
Distributed Arithmetic - 3 bit Example
Data × Coefficient:
  D2 × C2 = 100 × 110
  D1 × C1 = 111 × 101
  D0 × C0 = 011 × 100
Coefficient × Data, with partial products listed from the data LSB upward:
  C2 × D2 = 110 × 100: 000, 000, 110
  C1 × D1 = 101 × 111: 101, 101, 101
  C0 × D0 = 100 × 011: 100, 100, 000
The LSBs of D2, D1, D0 are 0 1 1 = LUT Address ==> (C1 + C0) from previous slide
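The 3-bit example can be replayed in software. The sketch below is a bit-serial distributed arithmetic model for unsigned values, written for illustration: it precomputes the 8-word table of coefficient sums, then on each "clock" takes one bit from every data word as the table address and adds the table output at that bit's weight. The slide's hardware scales the accumulator by 2^-1 each cycle instead, which amounts to the same weighting up to a fixed scale factor; function and variable names are hypothetical.

```python
def build_lut(coeffs):
    """LUT[address] = sum of the coefficients whose address bit is 1.

    Address bit k selects coeffs[k], so address 0b011 holds C1 + C0.
    """
    return [
        sum(c for k, c in enumerate(coeffs) if (address >> k) & 1)
        for address in range(1 << len(coeffs))
    ]

def da_sum_of_products(data, coeffs, bits):
    """Sum of data[k] * coeffs[k] computed one data bit per step (unsigned)."""
    lut = build_lut(coeffs)
    acc = 0
    for b in range(bits):  # LSB first, one step per sample bit
        address = sum(((d >> b) & 1) << k for k, d in enumerate(data))
        acc += lut[address] << b  # weight the table output by 2**b
    return acc

def da_sample_rate_mhz(clock_mhz, sample_bits):
    """Sample rate = clock frequency / number of bits in sample."""
    return clock_mhz / sample_bits

COEFFS = [0b100, 0b101, 0b110]  # C0, C1, C2 from the example
DATA = [0b011, 0b111, 0b100]    # D0, D1, D2 from the example

result = da_sum_of_products(DATA, COEFFS, bits=3)
print(result, sum(c * d for c, d in zip(COEFFS, DATA)))
print(da_sample_rate_mhz(50, 10))
```

Because the loop runs once per sample bit rather than once per tap, the sample rate depends on the word size and not on the number of taps, which is the point of the performance formula on the DA slide.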
Resource Tradeoffs for Higher Performance
[Figure: CLBs (100 to 400) versus number of filter taps (16 to 80) for serial sequential, bit-serial distributed arithmetic, double-rate distributed arithmetic, and fully-parallel distributed arithmetic; sample rates marked on the chart: 100 Hz to 100 kHz, 8.1 MHz, 16.2 MHz, and 66 MHz]
XC4085XL 10 Times Faster Than TMS320C6x
[Figure: 16 bit FIR filter benchmark in billions of multiply-accumulates (MACs) per second, scale 1 to 8, comparing TMS320C6x (0.25u, 200 MHz) with 4005XL, 4013XL, 4036XL, 4062XL, and 4085XL; XC4000XL using 80 MHz clock rate]
FPGA DSP is Lower Cost
[Figure: price per million MACs per second, 16-bit word, scale $0.05 to $0.25, for TMS320C6x (25,000 pcs) and Xilinx FPGA (25,000 pcs)]
Where FPGA-Based DSP is Used
• High data rates
  – 1 to 70 M samples/sec
• High complexity
  – 10's to 100's of MACs in a single chip
• Fixed-point data
• Audio, video, radio & voiceband modems, HDTV
[Figure: data rate (1k to 1G samples per second) versus algorithm complexity (less complex to more complex), with regions for MPU/MCU, single-chip DSP, multiple DSP cores or chips, FPGA-based DSP, and ASIC]
DSP / FPGA Design Methodology
Xilinx CORE Generator 1.4 available now!
[Figure: coefficients and third-party DSP software feed the CORE Generator; the result is instantiated into schematic or HDL, followed by place and route, post route simulation, and a bit stream for download cable or EPROM]
XC4000 Resource Cross Reference Chart (Bit-Serial Implementation)
Number of XC4000 CLBs by number of taps (rows) and word size in bits (columns)

TAPS \ WORD SIZE      6     8    10    12    14    16    18    20    22    24
8                    17    20    23    26    29    36    39    42    45    48
16                   37    44    51    58    65    80    87    94   101   108
24                   57    68    79    90    91   124   135   146   157   168
32                   77    92   107   122   137   168   183   198   213   228
40                   97   116   135   154   173   212   231   250   269   288
48                  117   140   163   186   209   256   275   302   325   348
56                  137   164   191   218   245   300   327   354   367   408
M samples/sec
@ 50 MHz            8.3   6.3   5.0   4.2   3.6   3.1   2.8   2.5   2.3   2.1
[Figure: roadmap relating process technology (0.5u, 0.35u, 0.25u, 0.18u, 0.15u), density (25K, 100K, 500K, 1 million, 10 million system gates), performance (50, 100, 133, 150, 300 MHz), and design methods (schematic, HDL, cores, modular, team-based)]
Section V Special Topics
The Road Ahead New Developments in
Programmable Logic
Process Technology and Supply Voltage
Xilinx leads PLD industry in fab technology.
Fab partners use FPGAs to drive their process.
• Lower cost
• Faster speed
• Higher density
• Lower power
[Figure: feature size (0 to 1.2) versus year (1990 to 2002), with supply voltage steps of 5 V, 3.3 V, 2.5 V, 1.8 V, and 1.3 V; "Today" is marked on the time axis]
Advanced Process Technology
0.5u Process             0.25u UMC Process
- LOCOS isolation        - shallow trench isolation
- bird's beak            - 0.9u metal pitch
- no planarization       - CMP
- only contact plug      - plug for all vias
Process & Density Leadership
[Figure: density (system gates; axis marks 180k, 250k, 500k, 1M, 2M, 10M) versus year (1997 to 2002) for XC4085XL, XC40125XV (industry's 1st 0.25u PLD, 25M transistors, 5LM), XC40250XV, Virtex (75+M transistors), and Virtex II; 10M gates in 2002]
10 Million System Gates in 2002!
Architecture Innovation & Leadership
[Figure: features by year, 1998 to 2002, in three groups]
• Distributed dual port RAM, IO registers, internal bussing, 5V tolerant I/O, 3.3V and 5V PCI
• Block dual port RAM, multiple standard I/O, vector based interconnect, phase locked loops, 66 MHz 64-bit PCI
• Reconfigurable logic, on-chip AD/DA, embedded functions, 1GHz diff. interface, built-in logic analyzer
Performance Leadership
[Figure: system clock rate* (0 to 300 MHz) versus year (1995 to 2002), with three groups of applications]
• 100 MHz SDRAM I/F, 100 MHz DSP for wireless base station, 33 MHz PCI
• 133 MHz SDRAM I/F, 155 MHz SONET, 66 MHz PCI
• 233 MHz UP, 300 MHz RAM I/F, 133 MHz PCI
* 1/(Tsetup + Tclock-to-out)
Packaging Leadership
[Figure: pin count (100 to 1000 pins) versus year (1998, 2000, 2002) for PLCC, PGA, PQFP, HQFP, BGA, SBGA, chip scale, fine pitch BGA, and flip chip technology; pitch marks of 1.27mm, 1.0mm, and <0.8mm]
Compile Time Leadership
[Figure: compile time in minutes* (0 to 250) by software release (1.3, 1.4, 1.5, 2.1, 2.2)]
• With faster CPUs
• Faster compile times
• Modular compile
1999 Goal: 1 Million Gates in 45 minutes!
* 100k system gate designs (200MHz Pentium)
F1.5 Features
• Tight integration
  – FPGA Express inside Foundation Project Manager
  – Single Project Management / Flow Engine environment
• Improved ease of use
  – Complete pushbutton
• New Virtex, XC9500XL support
• Improved FPGA Express synthesis runtimes & performance
• Improved PAR runtimes and performance
Xilinx Smart-IP Delivers...
[Figure: Xilinx Smart-IP Technology combines intelligent software implementation, architectures tailored to cores, and flexible core technology to deliver high flexibility, high predictability, and high performance]
Performance + Time to Market
Leader in Core Solutions: Xilinx and Partners’ COREs
1998
• Standard bus interfaces: PCMCIA, USB, CAN Bus, ISA PnP, I2C, PCI 32bit
• DSP functions: add, subtract, integrate; correlators; filters (FIR, comb); multipliers; transforms (FFT, DFT); sin/cos
• Communication & networking: ATM cell assembly/delineation; CRC-16/32; T1 framer; HDLC; Reed-Solomon, Viterbi; UTOPIA, 25/33/50 MHz
• Base level functions: 82xx, UARTs, DMA; 66 MHz DRAM/SDRAM I/F; memory (RAM, ROM, FIFO); micro sequencer (2901); proprietary RISC processors
1999
• Standard bus interfaces: CardBus; FireWire (100-400 Mbps); PCI 64bit/66MHz; PC104; VME
• DSP functions: DCT, Cordic, DES, divider, JPEG, NCO
• Communication & networking: 10/100 Ethernet; 1Gb Ethernet; ADSL, HDSL, XDSL; ATM/IP over SONET; SONET OC3/12
• Base level functions: microprocessor I/Fs; 8051/8031; IEEE 1284; MIPS; 133+ MHz SDRAM I/F
2000
• Standard bus interfaces: emerging high-speed standard interfaces
• DSP functions: DSP processor I/Fs; DSP functions > 200 MSPS; programmable DSP engines; QAM
• Communication & networking: modems; SONET OC48; emerging telecom and networking standards
• Base level functions: satellite decoders; speech recognition; advanced processors
By 2002: Virtually All Functions Available as Cores
Architecture Tailored to Cores: Segmented Architecture
Segmented Routing
• Efficient routing
• Predictable timing
• Low power consumption
Non-Segmented Routing
• Wasted routing
• Unpredictable timing
• High power consumption
[Figure: Core1 and Core2 placed in each architecture]
Architecture Tailored to Cores: Distributed RAM
• Portable RAM based cores
• Improves logic efficiency by 16X
• High performance cores
[Figure: RAM available locally to the core]
Intelligent Software: Pre-defined Placement & Routing
• Relative placement: guarantees I/O & logic predictability
• Fixed placement & pre-defined routing: other logic has no effect on the core
• Fixed placement: guarantees performance
Enhances Performance & Predictability
Smart-IP Delivers Performance
[Figure: 12x12 multiplier speed (50 to 80 MHz) versus number of cores (1, 2, 4, 8) for Xilinx segmented and non-Xilinx non-segmented architectures]
Smart-IP Performance Is Independent of Number of Cores in a Design
Smart-IP Delivers Portability
[Figure: the same core marked 80 MHz at four different placements in the device]
Smart-IP Performance Is Independent of a Core’s Placement in the Device
Smart-IP Delivers Transportability
[Figure: the same core marked 80 MHz in three device sizes]
Smart-IP Performance is Independent of Device Size
Non-Segmented Architecture May Experience 30% Performance Degradation
Xilinx Architecture for Fastest Performance
[Figure: relative connection delays (labels 1x, 3x, 4x, 6x) between logic blocks 1 to n, to a logic block in the next row, and across the chip, for Xilinx segmented interconnect and for non-segmented interconnect]
Segmented Interconnect Structure Provides Faster Logic Cell Connections
High Value Cores with Spartan
Core Function                            XCS30XL Price*   Percentage of Device Used   Effective Function Cost
UART                                     $6.95            17%                         $1.20
16-bit RISC Processor                    $6.95            36%                         $2.50
16-bit, 16-tap Symmetrical FIR Filter    $6.95            27%                         $1.90
Reed-Solomon Encoder                     $6.95            6%                          $0.40
PCI Interface (w/ faster speed grade)    $12.00           45%                         $5.40
*100,000 units, mid-1999 projection
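The effective function cost column appears to be the device price multiplied by the fraction of the device the core occupies, rounded to the nearest ten cents. That rounding rule is an inference from the printed numbers, not something the slide states. A short Python check under that assumption:

```python
# (core, device price in dollars, fraction of device used), from the table.
CORES = [
    ("UART", 6.95, 0.17),
    ("16-bit RISC Processor", 6.95, 0.36),
    ("16-bit, 16-tap Symmetrical FIR Filter", 6.95, 0.27),
    ("Reed-Solomon Encoder", 6.95, 0.06),
    ("PCI Interface", 12.00, 0.45),
]

def effective_cost(price, fraction_used):
    """Device price scaled by the fraction used, rounded to $0.10 (assumed rule)."""
    return round(price * fraction_used, 1)

costs = [effective_cost(price, frac) for _, price, frac in CORES]
for (name, _, _), cost in zip(CORES, costs):
    print(f"{name:<40}${cost:.2f}")
```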
Section V Special Topics
Virtex, XC6200 and Reconfigurable Logic
RAD
D
DSP XC
6200
TRADITIONAL THINKING
It’s About Time!
Virtex Enables System on a Programmable Chip
[Figure: a Virtex device (1 million system gates, 160 MHz I/O performance, 133 MHz memory performance) brings together new modules from designer #1 and designer #2 (VHDL and Verilog design environments, CoreGen), IP modules (LogiCore, AllianceCore), and design reuse; blocks shown include DSP, 133 MHz SDRAM, Gbit Ethernet, 66 MHz PCI, FIFO, and CPU]
Virtex Series Overview
• New FPGA architecture, similar to XC4000
• 0.25 and 0.18 micron 5LM process
• Segmented routing
• SelectRAM+ offers 3 types of RAM
  – Distributed SelectRAM
  – Block SelectRAM (new)
  – High-speed access to external memory (new)
• Traditional and low voltage support
  – CMOS, TTL
  – LVTTL, LVCMOS, GTL+, and SSTL3
• 250K - 1M system gates in 1998
• Some XC6200-like features
  – Ideal for reconfigurable logic
  – Dynamic & partial reconfiguration
Virtex Functional Block Diagram
[Figure: CLBs with segmented routing, SelectI/O pins (66 MHz PCI, SSTL3), distributed SelectRAM memory, block SelectRAM memory, phase locked loop (PLL), and vector based interconnect, delay = f(vector)]
Xilinx 0.25u 5 Volt-Compatible FPGAs
• 4KXL / 4KXV family migration possible if you plan for:
  – Additional power/ground pins
  – Dedicated clock and configuration pins
• Voltage migration guide to help users
[Figure: Virtex & XC4000XV (2.5 V logic supply, 3.3 V I/O supply) connected to any 5 V device (XC4000E) and any 3.3 V device (XC4000XL); outputs meet TTL levels and inputs accept 5 V levels]
Virtex FPGA Performance
• 100+ MHz internal speeds
  – 155 MHz SONET data stream processing
  – 100+ MHz pipelined multipliers
  – 66 MHz PCI
• 100+ MHz system interface speeds

                          without PLL   with PLL
Tco (output register)     6 ns          3.5 ns
Tsu (input register)      3 ns          3 ns
Th (input register)       0 ns          0 ns
Max I/O performance       110 MHz       160 MHz
Segmented Routing Interconnect
[Figure: CLBs, each with two pairs of logic cells (2 LCs) and carry chains, connected through a switch matrix and 3-state busses]
• Fast local routing within CLBs
• General purpose routing between CLBs
• Fast interconnect
  – 8 ns across 250,000 system gates
• Predictable for early design analysis
• Optimized for five layer metal process
Virtex Configurable Logic Block
[Figure: two logic cells, each with a 4-input LUT (inputs I0 to I3, output O, WI, DI), carry and control logic (CI, CO), and a register (D, CE, CLK, Q, PR, RS)]
• Polarity of all control signals selectable
• Fast arithmetic and multiplier circuitry
• Optimized for synthesis
SelectRAM+ Memory Features
• Distributed SelectRAM Memory
  – Pioneered in XC4000 family
  – 16x1 synchronous SRAM implemented in LUT
  – Ideal for DSP applications
  – Access over 100 billion bytes/sec
• Block SelectRAM Memory
  – Up to 32 4,096-bit blocks of dual port synchronous SRAM
  – Configurable widths of 1, 2, 4, 8, and 16
  – Ideal for data buffers and FIFOs
  – Up to 17 gigabytes/sec access
• Fast Access to External RAM
  – Direct interface to SSTL3, 3.3V synchronous DRAM standard
  – 133 MHz
Block RAM
• Configure as: 4096 bits with variable aspect ratio
• 8-32 blocks across family devices
• True dual-port, fully synchronous operation
– Cycle time <10 ns
• Flexible block RAM configuration
– 5 blocks: 2K x 10 video line buffer
– 1 block: 512 x 8 ATM buffer (9 frames)
– 4 blocks: 2K x 8 FIFO
– 9 blocks: 4K x 9 FIFO with parity
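The block counts in the list above follow from tiling 4,096-bit blocks at one of the configurable widths. The sketch below is only an illustration of that arithmetic: `blocks_needed` is a hypothetical helper, not a Xilinx tool, and it assumes blocks are simply placed side by side for width and end to end for depth.

```python
import math

BLOCK_BITS = 4096            # bits per block RAM (from the slide)
WIDTHS = (1, 2, 4, 8, 16)    # configurable data widths (from the slide)

def blocks_needed(depth, width):
    """Smallest number of 4,096-bit blocks for a depth x width memory.

    Hypothetical helper: tries every aspect ratio and tiles blocks
    side by side (for width) and end to end (for depth).
    """
    best = None
    for w in WIDTHS:
        block_depth = BLOCK_BITS // w
        count = math.ceil(depth / block_depth) * math.ceil(width / w)
        if best is None or count < best:
            best = count
    return best

examples = [
    ("2K x 10 video line buffer", 2048, 10),
    ("512 x 8 ATM buffer", 512, 8),
    ("2K x 8 FIFO", 2048, 8),
    ("4K x 9 FIFO with parity", 4096, 9),
]
for name, depth, width in examples:
    print(name, "->", blocks_needed(depth, width), "blocks")
```

Under these assumptions the four examples come out at 5, 1, 4, and 9 blocks, matching the slide.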
[Figure: RAMB4 dual-port block RAM symbol. Port A: WEA, ENA, CLKA, ADDRA, DINA, DOA. Port B: WEB, ENB, CLKB, ADDRB, DINB, DOB]
[Figure: CPU, memory, and I/O connected to XC6200 RPUs]
1000x improvement in reconfiguration time from external memory
FastMAP™ assures high speed access to all internal registers
All registers accessed via built-in low-skew FastMAP™ busses
Microprocessor interface built-in
High capacity distributed memory permits allocation of chip resources to logic or memory
Ultrafast Partial Reconfiguration fully supported
XC6264 - Up to 100,000 gates
XC6200 Reconfigurable Processing Unit
XC6200 Architecture
[Figure: array of function cells grouped into 4x4 blocks and 16x16 tiles, surrounded by user I/Os on all sides, with a FastMAP™ interface carrying address, data, and control]
*Number of tiles varies between devices in family
How Dynamic Reconfiguration HelpsExample: DSP
3D Graphics Reconfiguration- DSP Algorithms PDSP FPGA Optimized FPGAs
- Texture- Shadow- Reflections- Perspective- Edge
Some functionsrun while othersare loading
One function at a time
Two or more functions at a time
All functions done in time
Reconfiguration Advantages:
Lower cost by reusing silicon for multiple functions over time
OR
10-500x performance increase in hardware versus software implementation
Reconfigurable Logic - Research vs. Component $
[Graph: performance versus problem size, with curves labeled Computer and Embedded Microprocessor. Region (1) is marked "Zillions of Research Dollars", region (3) is marked "Zillions of Component Dollars", and region (2) is also marked.]
Reconfigurable Logic research has typically focused on reconfigurable computing (1). But there are really two potential markets: high-end embedded computing (2) and the low-cost embedded market (3).
[Graph is courtesy of Nick Tredennick.]
XC6200 Dynamic & Partial Reconfiguration
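[Figure: reconfiguration time on a scale of ns, us, ms, s for XC6216 and XC4013, with marked values 40 ns, 200 us, and 250 ms. Reconfiguration types shown: design swapping, block swapping, circuit updates, rewiring]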
Directions in Reconfigurable Logic
• XC6200 was first Xilinx product to
• XC6200 chips & XACT6000 software are available, but no further product development
– Divergent architecture and incomplete tools support
– XUP support for Research only, not classes: Adaptive or Reconfigurable Logic, Place & Route algorithms
• Key XC6200 features brought into mainstream families (Virtex)!
– Dynamic & Partial reconfiguration
– Full industry and software support
– Easier to design to
– New Rec.Logic curriculum should use Virtex
• Virtex-ready PCI board available from Virtual Computer Corp.
• Further info: http://www.xilinx.com/xup/6200rc.htm
Section V Special Topics
FPGA versus ASIC Costs
Pad-Limited Die Size
[Figure: two die outlines, each a core surrounded by I/O pads]
Core-limited (mid-high density): gate count determines die size
Pad-limited (low density): I/O count determines die size
As processes migrate, FPGA cost = gate array cost
[Chart: Spartan price roadmap, price versus year]
Year   Family                    Price    Supply     Process
1998   Spartan                   $395     5 Volt     0.5 3LM
1999   SpartanXL                 $295     3.3 Volt   0.35 5LM
2000   Spartan-II                < $200   2.5 Volt   0.25 5LM
2002   Spartan Next Generation   < $150   1.8 Volt   0.18
More Features Without Compromises
• Pricing competitive with ASICs
• High Performance
• On-chip SelectRAM™
*Prices are for 5K system gates, 100K units, -3 speed, Lowest Cost Package
FPGA Price Leadership
[Chart: price versus year, 1998 to 2002, for XC9536 and XC95216. Price labels: $15, $9, $1.80, $0.80]
Without Compromises
• Flexible ISP
• Highest Performance
• Pin-Locking
• Full JTAG
* Prices are based on 100Ku+, slowest speed grade, lowest cost package
CPLD Price Leadership
[Chart: density (system gates) available at the $10 and $20 price points versus year, 1997 to 2002. Density labels: 15K, 25K, 40K, 60K, 100K, 200K. 100K unit volume price projections]
10K gates/$ in 2002!
New Applications
• Set Top Box
• DVD
• Digital Camera
• PC Peripherals
• Consumer Electronics
Priced for High-Volume Leadership
The Real Cost of Ownership• Even in mid & high density, FPGAs often have cost advantage
• FPGA vs ASIC goes far beyond obvious unit costs calculations
• Real Comparison includes Real factors
     Programmable FPGA            Gate Array (Application Specific Integrated Circuit)
(-)  Higher unit cost             Lower unit cost
(+)  Standard Product             Custom Product
(+)  Off the shelf delivery       Months to manufacture
(+)  Fast Time to Market          Slow Time to Market
(+)  No Non-Recurring Eng. Fee    NRE+
(+)  No inventory risk            Customer specific
(+)  Fully factory tested         User Test Development
(+)  Simulation helpful           Simulation Critical
(+)  In-Circuit verification      No In-Circuit verification
Cost Calculations - Basic Model
• Breakeven - Solve for X (units)
ASIC Cost = FPGA Cost
$25K NRE + $79K Engineering & Tools + X * $10
= $0 NRE + $25K Engineering & Tools + X * $30
54K / 20 = X
2,700 units = X
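A minimal sketch of the breakeven arithmetic, using only the dollar figures on this slide. The helper name `breakeven_units` is hypothetical. The slide's 54K / 20 step corresponds to a fixed-cost difference of $79K - $25K; if the $25K ASIC NRE is kept in, as the printed equation has it, the same formula gives a somewhat higher volume, so both cases are shown.

```python
def breakeven_units(fixed_a, unit_a, fixed_b, unit_b):
    """Volume X at which fixed_a + X*unit_a equals fixed_b + X*unit_b."""
    return (fixed_a - fixed_b) / (unit_b - unit_a)

# Equation as printed: ASIC fixed cost = $25K NRE + $79K, FPGA fixed cost = $25K.
with_nre = breakeven_units(25_000 + 79_000, 10, 25_000, 30)

# The slide's 54K / 20 step: ASIC fixed cost taken as $79K only.
as_on_slide = breakeven_units(79_000, 10, 25_000, 30)

print(with_nre)      # 3950.0
print(as_on_slide)   # 2700.0
```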
Cost Calculations - Market Model
[Figure: revenue over a product life of 2W, comparing maximum available revenue with maximum revenue from delayed entry]
Total ASIC Development = 32 weeks
Total FPGA Development = 11 weeks
% of Lost Revenue = (Delay * (3W - Delay) / 2W^2) * 100
= (5.25 * (3*18 - 5.25) / 36^2) * 100
= 19.75%
Net Profit = Volume * (System Price - System Cost)
= (1K + 12K + 5K) * ($2K - $1.1K)
= $16,200,000
ASIC Cost = $25K NRE + $79K Engineering + .1975 * $16.2M Lost Profit (= $3.2M) + X * $10
FPGA Cost = $25K Engineering + X * $30
Breakeven, X = 162,700 units
Being late to market costs Real $$
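The market-model figures can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the denominator is taken as (2W)^2, because that is what the slide's numeric step (36^2 for W = 18) uses, and the delay of 5.25 appears to be the 21-week schedule difference expressed in four-week months. Function and variable names are hypothetical.

```python
def lost_revenue_fraction(delay, w):
    """Fraction of revenue lost to a late market entry.

    Assumption: denominator is (2W)^2, matching the slide's 36^2 for W = 18.
    """
    return delay * (3 * w - delay) / (2 * w) ** 2

# Values from the slide: delay = 5.25, W = 18 (product life = 2W).
fraction = lost_revenue_fraction(5.25, 18)

# Net profit = volume * (system price - system cost), values from the slide.
net_profit = (1_000 + 12_000 + 5_000) * (2_000 - 1_100)

lost_profit = fraction * net_profit

print(f"lost revenue: {fraction * 100:.2f}%")
print(f"net profit:   ${net_profit:,}")
print(f"lost profit:  ${lost_profit:,.0f}")
```

The lost-profit term comes out close to the $3.2M that the slide adds to the ASIC cost.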
Hardwire Technology Model
• ASIC Re-spin delay & expense risk 30%
• PLD price reductions 25% vs. 5% per year
• Hardwire Technology lowers FPGA cost 40-60%
– No additional design work or test vectors
– Preserves nets, placement, routing
– All FPGA characteristics maintained
Total ASIC Cost = $25K + $79K + $5.3M + $22.8K + $18.7K + X * $10
FPGA/HWire Cost = $25K Engineering + 1K * $30 + $18K NRE + (X - 1K) Units * $18
Breakeven, X = 674,000 units !!!
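As a check on the Hardwire breakeven, the two cost lines can be written as functions of the volume X and solved directly. This is only a sketch of the slide's arithmetic; the function names are hypothetical and every dollar figure is copied from the slide.

```python
ASIC_FIXED = 25_000 + 79_000 + 5_300_000 + 22_800 + 18_700

def asic_cost(x):
    """Total ASIC cost for x units at $10 each."""
    return ASIC_FIXED + 10 * x

def hardwire_cost(x):
    """$25K engineering, first 1K units as FPGAs at $30,
    $18K HardWire NRE, remaining units at $18 each."""
    return 25_000 + 1_000 * 30 + 18_000 + (x - 1_000) * 18

# Both costs are linear in x, so the crossover has a closed form.
hardwire_fixed = hardwire_cost(0)          # fixed part of the HardWire line
breakeven = (ASIC_FIXED - hardwire_fixed) / (18 - 10)

print(round(breakeven))
```

The result is about 673,800 units, which the slide rounds to 674,000.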
Download the Xilinx ASIC Estimator program at http://www.xilinx.com/products/hardwire/hardwire.htm to compare costs or learn more.
Total Cost of Ownership - ASIC vs. FPGA
[Chart: total cost ($, 0 to 14,000,000) versus units (0 to 800,000) for FPGA (B, M), ASIC (B), ASIC (M), FPGA (H), and ASIC (H)]
B = Basic analysis
M = Market model
H = Hardwire model
Section V Special Topics
Xilinx Student Edition
The Xilinx Student Edition
• Prentice Hall’s most requested new engineering product in Q1 ’98!
– Complete, affordable, and practical digital design course environment for all students
– Predeveloped and tested lab-based course
• Includes
– Foundation Series 1.3 for students’ computers
– Practical Xilinx Designer lab tutorial book
– Coupon for XS40-005XL and XS95-108 boards ($129)
• Sold through bookstores by Prentice Hall and www.Amazon.com, listed at $79 (ISBN 0136716296)
• Integrated tutorial projects cover: TTL, Boolean Logic, State Machines, Memories, Flip Flops, Timing, 4-bit and 8-bit processors
• Upgradeable for free to F1.4 Express with VHDL & Verilog, 40K gates, VHDL labs on the web Aug. 1
The Practical Xilinx Designer
• The Digital Design Process - Basic concepts and TTL logic
• Programmable Logic Design Techniques - Programmable logic introduction and Foundation tutorial
• Programmable Logic Architectures - XC9500 CPLD and XC4000 FPGA
• Combinatorial Logic Design - LED decoder circuit with both CPLDs and FPGAs
• Modular Designs and Hierarchy - step-wise refinement using Foundation
• Electrical Characteristics of Programmable Logic - I/O drivers, timing/delay models, and power consumption
• Flip-Flops - introduces sequential logic
• State Machine Design - design examples for counters, drink machine, etc.
• Memories - how to build memory with flip-flops, logic gates
• The GNOME Microcomputer - construction and improvements of simple, 8-bit microcomputer
Xilinx Student Edition Development Boards
Section V Special Topics
Xilinx University ProgramParticipation
Section Agenda
• Course recommendations
• How to learn more
• Contacts & Support
• Why use Xilinx?
• Products & Ordering
– Software
– Hardware
Course Recommendations
• See http://www.xilinx.com/programs/univ.htm
Trends in Teaching with PLDs
• Increasing density and Cores enable System-level design and test on an FPGA
– LogiCOREs available to all universities
– PCI, DSP, math, other complex functions
• VHDL or Verilog design is commonplace
• PLDs in many subjects beyond Digital Design and Computer Engineering
– System Level Design and Test
– Dynamically Reconfigurable Logic
– Digital Signal or Video Processing
– Network Design
• Prevalent usage in required EE, CS, CE courses
• Students use their own computers
How To Learn More (1)• AppLinx CD / Xilinx data book
• On-line books, On-line Help
• Excellent on-line tutorials in Foundation & Express
• Xilinx Web Site• Application notes
• Latest technical information and status
• Fast Technical Help
• Whatever it is, it’s probably there!
• Subscribe to XCELL Journal
• Xilinx Student Edition is great practical guide
XUP Contacts & Support• XUP Staff:
– Jason Feinsmith, XUP Manager ([email protected], USA 408-879-4961)
– Anna Acevedo, XUP Coordinator ([email protected], USA 408-879-5338)
– Chris Grundy, XUP European Liaison ([email protected], UK +44-1-932-333-523)
– XUP Website: http://www.xilinx.com/programs/univ.htm
• Xilinx commercial or university distributors
– Channel for product distribution, updates
– http://www.xilinx.com for listing of commercial distributors
– Europractice, Chip Implementation Center (Taiwan ROC), IDEC (S.Korea), Canadian MicroElectronics Corp.
• Technical Support
– Answers Database http://www.xilinx.com/support/searchtd.htm
– For Instructors: [email protected], USA 800-255-7778
Xilinx Donation Policy
“If a new or expanded course with lab or a research project is being added and funding is not adequate to purchase the required products at the University Program discounts, Xilinx encourages any university or college to submit a donation request.”
To Purchase or To Request a Donation - What's Practical for you? If you have sufficient budget to purchase Xilinx software, development boards, and/or chips, then we encourage you to do so. We offer significant discounts for Xilinx software and Xilinx development boards. However, we recognize that very often, schools simply do not have the funding even for the discounted products. In some cases, a school might have some funding, but not enough to obtain everything that is needed for the lab. We encourage you to make the choice that you feel is right for your situation. Most importantly, if money is any barrier to your immediate use of Xilinx products, you should request a donation for what you need.
Why Xilinx?
• Xilinx is world’s leading Programmable Logic innovator with 55% commercial FPGA market share
• Xilinx is nearly twice as popular in the academic market as its nearest competitor
• Best PLD Software: Foundation; Alliance; & Synopsys partnership
• Best PLD hardware architectures
– Xilinx FPGAs and CPLDs all Reprogrammable In-System.
– Tri-state and dual port RAMs in FPGAs are best for computer structures, DSP, research, etc.
– Only vendor with dynamically & partially reconfigurable RPU’s
• Prentice Hall / Xilinx Student Edition includes best tools on the market with fully integrated hardware environment
• If you don’t have the budget, request a donation.
[Figure: Xilinx families plotted by complexity, functionality, and course level versus speed. CPLD: 9500. FPGA: 3K, 5K, 4K]
Exciting Research areas:
• Reconfigurable Computing: Virtex, XC6200
• Digital Signal Processing: XC4000X
• Networking, PCI, Computer Architectures, Neural Nets, etc.
Computer Lab Requirements
• Win ’95, Win NT, HP, Sun, Solaris: use Xilinx software version 1.3, available now
– Foundation Series Express recommended for all PC users
– Other design entry tools OK too, especially on workstation
• v1.4
              RAM     Hard Drive   Processor
Minimum       32MB    200MB        486DX2
Much Better   32+MB   500MB        Pentium 120+
Typical Lab Setup
• Primary and Additional licenses *
• Cables vs. PROM Programmers
• Foundation Series Express package recommended for lab
– Software updates
– Full range of devices supported
– Additional license scheme
Qty   Part number     Description
1     US-FND-EXP-PC   Primary Foundation package
9     UA-FND-EXP-PC   Additional FND licenses
10    XS40-010XL      XC4010XL FPGA board & cable
2     XS95-108        XS9500 CPLD board & cable
* Workstation users: use Ux-ALI-STD-WS, and substitute these for the 10 XS40-010XL’s:
10    UW-FPGABOARD    3K/4K Development boards
10    UW-XCHCBL-PC    XChecker cables
CPLD or FPGA?
CPLD
• Non-volatile
• JTAG Testing
• Wide fan-in
• Fast counters, state machines
• Combinational Logic
• Small student projects, lower level courses
FPGA
• More common in schools
• Great for first year to graduate work
• Excellent for computer architecture, DSP, registered designs
• ASIC like design flow
• SRAM reconfiguration
• PROM required for non-volatile operation
Since the software is integrated, you can teach with both !
Hardware Boards for PCs
XSTEND
- Plug-in extension for XS40 & XS95’s
- Purchase from XESS Corp.
XS40 & XS95 Boards
- Purchase from XESS or donation from Xilinx
Access to I/O pins for easy prototyping
Hardware Boards (2)
• H.O.T. II PCI Board (1)
• UW-FPGABOARD (2)
Access to I/O pins for easy prototyping
Battery not included!
(1) Purchase HOT II from VCC
(2) Most popular board for the workstation. Purchase or donation from Xilinx
Summary
• Enhance Your Lab Curriculum with Xilinx
• Students get better job offers
• Great products for your lab
– Leading, industry standard software
– IEEE Standard VHDL & Verilog
– Innovative hardware solutions
• Ideal from intro to graduate courses
• Great publications from Prentice Hall
• Areas of strength for research
– DSP, Reconfigurable Logic
Xilinx = Long term Programmable Logic Solutions Leader
Appendix A: Xilinx Configurable Logic Blocks
[Figure: XC4000 CLB. Function generators F (inputs F1 to F4), G (inputs G1 to G4), and H feed two flip-flops, each with set/reset control (SD, RD) and clock enable (EC). Control inputs C1 to C4 supply H1, DIN, S/R, and EC; K is the clock; outputs are X, Y, XQ, and YQ]
XC4000 CLB
XC4000X I/O Block Diagram
Shaded areas are not included in XC4000E family.
XC9500 CPLDs
[Figure: XC9500 architecture. I/O blocks and function blocks 1 to 4 connect through the FastCONNECT switch matrix. Global signals: 3 global clocks, 1 global set/reset, 2 or 4 global tri-states. A JTAG port feeds the JTAG controller and the in-system programming controller]
XC9500 Function Block
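[Figure: XC9500 function block. 36 inputs from the FastCONNECT feed an AND array and product-term allocator that drive macrocells 1 to 18; macrocell outputs go to the FastCONNECT and to the I/O; 3 global clocks and 2 or 4 global tri-state signals]
XC9500 Function Block (2nd View)
[Figure: 36 inputs from the FastCONNECT switch matrix enter the function block logic, which drives a D/T flip-flop and a fixed output pin]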
Appendix B: FPGA Family Comparisons
Xilinx Spartan Series
5 Volt -> XCS05 XCS10 XCS20 XCS30 XCS40
3.3 Volt -> XCS05XL XCS10XL XCS20XL XCS30XL XCS40XL
System Gates 2K-5K 3K-10K 7K-20K 10K-30K 13K-40K
Logic Cells 238 466 950 1368 1862
Max Logic Gates 3,000 5,000 10,000 13,000 20,000
Flip-Flops 360 616 1120 1536 2016
Max RAM bits 3,200 6,272 12,800 18,432 25,088
Max I/O 80 112 160 192 224
Performance 80MHz 80MHz 80MHz 80MHz 80MHz
XC4000E 5V FPGA Family
4003E 4005E 4006E 4008E 4010E 4013E 4020E 4025E
Logic Cells 238 466 608 770 950 1,368 1,862 2,432
Max Logic Gates 3K 5K 6K 8K 10K 13K 20K 25K
Typ Gate Range* 2-5K 3-9K 4-12K 6-15K 7-20K 10-30K 13-40K 15-45K(Logic + Select-RAM)
Max I/O 80 112 128 144 160 192 224 256
Packages: PC84 PC84 PC84 PC84 PC84TQ100PQ100 PQ100
TQ144 TQ144 PQ160 PQ160 PQ160 PQ160
PQ208 PQ208 PQ208 PQ208 PQ208 HQ208HQ240 HQ240 HQ240
HQ304PG120 PG156 PG156 PG191 PG191 PG223 PG223 PG223
BG225 BG225 PG299
* 20-25% of CLBs as RAM
100%Footprint
Compatible
Spartan/XC4000E/XC5200 Density
Spartan/XL XC4000E XC5200
Logic Cells 238 - 1,862 238 - 2,432 256 - 1,936
Typ Gate Range 2,000 - 40,000 2,000 - 45,000 2,000 - 23,000(Logic + SelectRAM)
I/O 77 - 205 80 - 256 84 - 244
Number of Devices 5 8 5
Power Supply 5V / 3.3V 5V 5V
I/O Interface 5V / 3.3V 5V 5V
XC4000X Series Density
XC4000EX XC4000XL XC4000XV
Logic Cells 2,432 - 3,078 152 - 7,448 10,982 - 20,102
Typ Gate Range 18,000 - 65,000 1,000 - 180,000 80,000 - 500,000(Logic + SelectRAM)
I/O 256 - 288 64 - 448 448
Number of Devices 2 11 4
Power Supply 5V 3.3V 3.3V + 2.5V
I/O Interface 5V 5V / 3.3V 5V / 3.3V / 2.5V
Common Features
Spartan XC4000 XC5200
Function Generators/CLB 3 3 4
Flip-flops/CLB 2 2 4
Global Nets 8 8 4
Global Three-State Control Yes Yes Yes
Carry Logic Yes Yes Yes
Internal Three-State BuffersYes Yes Yes
Boundary Scan Logic Yes Yes Yes
Output Drive (Sink) 12 mA 12 mA 8 mA
Differentiating Features
            Spartan   XC4000         XC5200
LCs/CLB     2.375     2.375          4
RAM         Sync.     Sync./Async.   None
PCI         Yes       Yes            No
Decode      No        Yes            No
Wired-AND   No        Yes            No
I/O FFs     Yes       Yes            No
Config      Ser       Par/Ser        Par/Ser
Packages    6         16             18
• Complete pinout compatibility within Spartan Series
• Not directly pinout-compatible with XC4000/XC5200
- Spartan has only one MODE pin
- Mode pin cannot be used as I/O
Xilinx XC4000-based Architecture Comparison
Spartan/XL XC4000X XC4000E
Extended Routing No Yes No
Fast Capture Latch No Yes No
Global Early Buffers No Yes No
Output Mux No Yes No
CLB Latches No Yes No
Asynchronous RAM No Yes Yes
Edge Decoders No Yes Yes
Wired-AND Function No Yes Yes
Density Comparison
Xilinx Device (XC4000 Series)        Logic    Competing Product (Altera FLEX 10K)
Device     Max I/O   Max RAM Bits    Cells    Max RAM Bits   Max I/O   Device
XC4085XL   448       100K            7,448
XC4062XL   384       74K             5,472
                                     4,992    25K            406       EPF10K100
XC4052XL   352       62K             4,598
XC4044XL   320       51K             3,800
                                     3,774    18K            358       EPF10K70
XC4036EX   288       42K             3,078
                                     2,880    20K            310       EPF10K50
XC4028EX   256       33K             2,432
XC4025E    256       33K             2,432
                                     2,304    16K            278       EPF10K40
XC4020E    224       25K             1,862
                                     1,728    12K            246       EPF10K30
XC4013E    192       18K             1,368
                                     1,152    12K            198       EPF10K20
Xilinx University Workshops Appendix C
Design Tool Flows
Xilinx-Express Design Flow
.VEI
.VHI
.UCF Reports
DSP COREGen & LogiBLOX
Module Generator
XNF.NGO
HDL Editor
State DiagramEditor
VHDLVerilog
.V.VHD
Foundation Design Entry Tools
Gate LevelSimulator
SchematicCapture
EDIFXNF
TimingRequirements
VHDLVerilog
Express
EDIF/XNF .XNF
BIT, JEDEC
SDFVHDL
Verilog
Reports
EDIF
Xilinx Implementation Tools
HDL
SIMULATION
VHDLVerilog
Behavioral Simulation Models
Xilinx Design Manager Flow 1.4
FPGA Implementation
Xilinx Design Manager Flow 1.4
CPLD Implementation
Design Entry
Concept
Mixed-LevelSchematic/HDL
NetlistInformation
Design Synthesis & Retargetability
Synergy HDL/VHDLSynthesis
SynthesisLibraries
Design Optimization / Partitioning for PLDs
PLD Designer
Design Optimizationfor FPGAs
FPGA Designer
Post ImplementationNetlist & SDF
PLD & FPGA Designer
Device Programming Files
Schematic Redraw
Functional Simulation
Verilog XL,Leapfrog
Functional Simulation / Verification
OpenSIM BackPlane
Netlist Creation
VerilogLink/VHDLLink
TimingSimulation
Verilog-XL,Leapfrog
SimulationLibraries
Timing Backannotation
*EDIF, XNF
Verilog, VHDL**SDF, *EDIF
*Standard Interface Netlist Format** Standard Delay Format
SimulationLibraries
Place & Route
Implementation Tools
VHDL, VERILOG
Design Flow M1M1
*Standard Interface Netlist Format** Standard Delay Format
Schematic Entry / View Schematic
ViewlogicViewDraw
Place & Route
PAR (Place & Route)
Structural Simulation / Functional Simulation
ViewlogicViewSim
**SDF
Netlist(XNF or *EDIF)
Netlist Launcher
NGDBUILD
Implementation Tools
ABEL HDL
LogiBlox
LogiCores
Optional
VHDL,*XNF
Waveform Analysis
ViewlogicViewTrace
TimingSimulation
ViewlogicViewSim
Device Programming Files
Timing AnnotatedEDIF Netlist
VHDL Entry & Compile
ViewlogicViewSyn
VHDL Synthesis
ViewlogicViewSyn
Behavioral Simulation
ViewlogicSpeedwave
Design FlowM1M1
Design Flow
*Standard Interface Netlist Format** Standard Delay Format
HDL Design Flow Schematic Design Flow
Timing Simulation
QuickHDL
*EDIF
LogiBlox
LogiCores
VHDL / Verilog HDL
Notepad / QuickHDL
Mentor Design Manager
Optional
FunctionalSimulation
QuickHDL
Place & Route
Implementation Tools
Synthesis & Optimization
Autologic II
VHDL or VERILOG*SDF
Design Entry
Design Architect
Simulation Preparation
Design View Editor
Mentor Design Manager
Functional Simulation
QuickSim II
Place & Route
Implementation Tools
ABEL HDL
LogiBlox
LogiCores
Optional*EDIF
Timing Simulation
QuickSim II
*EDIF w/ Timing*SDFDevice Programming Files
Device Programming Files
M1M1
Synthesis
Synopsys FPGA Compiler orDesign Compiler
ConstraintsFile
*Standard Interface Netlist Format** Standard Delay Format
LogiBlox
LogiCores
Optional
Place & Route
PAR (Place & Route)
Functional Simulation
SynopsysVHDL System Simulator
or3rd Party
VHDL/VERILOG Simulator
VHDL,VERILOG,
*SDF
Static Timing Report
Static TimingVerification
SynopsysVSS Simulator
SynopsysVSS Simulator
or3rd Party
VHDL/VERILOG Simulator
Timing Simulation
Post-layout Verification
Simulation Library
Synthesis Library
Netlist(XNF or *EDIF)
Netlist Launcher
NGDBUILD
Implementation Tools
Device Programming Files
HDL Source File(VHDL or
Verilog HDL)
Xilinx Unified Libraries VHDL/VERILOG Models
Synopsys Design Compiler Design Flow M1M1
*Standard Interface Netlist Format** Standard Delay Format
Schematic Entry
OrCAD/ESPDesign Environment
Place & Route
PAR (Place & Route)
Functional Simulation
OrCAD SimulateXSimMake
*SDF
Netlist(XNF or *EDIF)
Netlist Launcher
NGDBUILD
Implementation Tools
ABEL HDL
LogiBlox
LogiCores
Optional
XNF modules(Created by HDLSynthesis tools)
VHDL,*XNF
Device Programming Files
Design Flow M1M1
Synplicity Design Flow
VHDLVerilog
XNFEDIF
HDL Editor VHDLVerilog
.SDC
HDL Analyst
DSP COREGen& LogiBLOX
Module Generator
VHDLVerilog
.VEI.VHI
.NGOXNF
Behavioral Simulation Models
3rd PartySimulation
VHDLVerilog
XNF, VM,VHM, EDIF
VHDLVerilog
SDF
EDIF
Xilinx Implementation ToolsUser
ConstraintsFile
BIT, JEDEC
VHDLVerilog
EDIF SDFReports
Functional Simulation Flow
Timing Simulation Flow
Verilog & VHDL Instantiation
.VM.VHM
Unified
simprim
Command Fileor
Test Vectors
HDLTestBench
Unisim
VITAL, Verilog, Gate
VITAL & Verilog
StructuredVerilog andVHDL netlists
Gate
.NGO = Xilinx binary netlist
-route-improve
Compile & Map Engine
.NCF
Place & Route Constraints
cross probing
Timing &Design
Constraints
Technology View
RTL View
Xilinx University WorkshopAppendix D
XChecker Cable and Configuration
*Note: Although differences are very minimal, this information has not been updated to reflect M1 information.
Use XChecker Cable to Simplify Verification
• Downloading allows quick verification of design in circuit
– Bitstream downloaded via computer’s serial port directly into FPGA
– No PROM programming required
– Design changes and verifications made quickly
• Readback sends configuration data and flip-flop values back out of chip
– Verifies correct configuration
– Allows in-circuit “probing” of all signals
– Can occur while the FPGA is running
– Uses no CLBs or routing resources
Enabling Configuration Readback
• Readback Trigger input starts serial readback
• XC3000 controlled via Bitstream Generator
– Default is enabled
– Data and trigger connected to Mode pins
• XC4/5000 controlled via schematic and Bitstream Generator
– Include Readback symbol in schematic
– Connect TRIG and DATA to I/O pins
– Can use MD0 and MD1
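[Figure: READBACK symbol with CLK, TRIG, DATA, and RIP pins. An IPAD and IBUF on MD0 carry the XChecker RT signal to TRIG; DATA drives an OBUF and OPAD on MD1 for the XChecker RD signal]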
Available Readback Data
• Data includes all storage elements in device
– XC4000/XC5000 readback data includes all outputs of CLBs and IOBs
• XC4000/XC5000 data is captured when readback is triggered
• XC3000 data is captured as readback progresses
– May want to stop system clock for logic verification
– Requires XChecker control of system clock
Control Panel Defines Debug Session (XACT™step v6)
• Opens automatically for Debug
• Allows direct control of:
– System clock source definition and application
– Readback trigger source definition and application
– Number of readbacks
– Display options
How to Use Programmable Logic to Build Fast and Efficient DSP Functions
XUP WorkshopAppendix E
Originally created by: Greg Goslin
Xilinx, Corporate Applications
Constraint Driven Design Methodology
• Constraints
– System Requirements
– Hardware Limitations
• Data Rate
– Inputs
– Outputs
– Multi-Channel I/O
• Quality
– Number of Bits/Taps
– Number of Operations
– Error Tolerance
• Processor Power
• Clock Rate
Constraint Driven Design methodologies
Clock Rate
Data Rate
Quality
Processor Power
Options
PerformanceEfficiency
Building Fast and Efficient Filters in FPGAs
• Efficient Filter Algorithms for FPGAs
– Distributed Arithmetic:
  • Bit-Serial
  • n-Bit Parallel
• Using Distributed Arithmetic for Filter Designs
– Serial FIR Filter Example
– Two-Bit Parallel FIR Example
– Full Parallel FIR Example
FIR FILTER EXAMPLE
[Figure: FIR filter. Sample data, N bits wide, shifts through K taps; each tap X0, X1, X2, ... is multiplied by its coefficient C0, C1, C2, ... and the K products are summed (SUM 0 to K) into the output data]
K multiplies, K sums
Clock = multiply time
Sample rate = clock rate
Implementation ???
Sum of Products Equation
2’s Complement Math
• The 2’s Complement of a number: Invert (1’s Complement), then Add 1.
  11111010 (-6): the 2’s Comp. is (Invert) 00000101, (Add 1) Equals: 00000110 (+6)
• Leading 1’s and 0’s are only place holders (sign extending a 2’s Comp. number doesn’t change its value):
  XMSB ... X2 X1 X0 equals XMSB XMSB XMSB ... X2 X1 X0
  The following 2’s Complement pairs are the same: FFFF = FF, 0001 = 01, 11111111101 = 1101
• Adding 2’s Complement numbers:
- Sign Extend: the MSB (sign bit) must be extended to allow for word growth:
  SE 111010    -6        0000110    +6
   +  001101   +13     + 0001101   +13
   = 1000111    +7     = 0010011   +19
  (Note: Ignore Overflow)
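The rules above can be tried out on bit strings. The following is a small illustrative sketch; the helper names (`to_signed`, `negate`, `sign_extend`, `add`) are hypothetical and chosen only for this example.

```python
def to_signed(bits):
    """Interpret a string of '0'/'1' characters as a 2's complement integer."""
    value = int(bits, 2)
    if bits[0] == "1":
        value -= 1 << len(bits)
    return value

def negate(bits):
    """Invert every bit (1's complement), then add 1, keeping the word width."""
    width = len(bits)
    inverted = int(bits, 2) ^ ((1 << width) - 1)
    return format((inverted + 1) % (1 << width), f"0{width}b")

def sign_extend(bits, width):
    """Repeat the sign bit on the left until the word is `width` bits long."""
    return bits[0] * (width - len(bits)) + bits

def add(a, b):
    """Add two equal-width words and ignore the carry out of the top bit."""
    width = len(a)
    return format((int(a, 2) + int(b, 2)) % (1 << width), f"0{width}b")

print(negate("11111010"), to_signed(negate("11111010")))   # 00000110 6
print(to_signed("1101"), to_signed(sign_extend("1101", 11)))  # -3 -3
print(add("111010", "001101"), to_signed(add("111010", "001101")))  # 000111 7
```

The last line repeats the slide's -6 + 13 example: the seventh bit of the raw sum is the overflow that the note says to ignore.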
8-Bit X 8-Bit Signed Multiply
B7B6B5B4B3B2B1B0
S
X A7A6A5A4A3A2A1A0
SIGN EXTEND
A0(B7B6B5B4B3B2B1B0)A1(B7B6B5B4B3B2B1B0)
A2(B7B6B5B4B3B2B1B0)A3(B7 B6B5B4B3B2B1B0)
A4(B7 B6 B5B4B3B2B1B0)A5(B7 B6 B5 B4B3B2B1B0)
A6(B7 B6 B5 B4 B3B2B1B0)A7(B7 B6 B5 B4 B3 B2B1B0)+
S15S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0
8-Bit X 8-Bit Signed Multiply
B7B6B5B4B3B2B1B0
S
X A7A6A5A4A3A2A1A0
SIGN EXTEND
SE(B7 B6 B5 B4 B3 B2 B1 B0) * A7 * 2^7
SE(B7 B6 B5 B4 B3 B2 B1 B0) * A6 * 2^6
SE(B7 B6 B5 B4 B3 B2 B1 B0) * A5 * 2^5
SE(B7 B6 B5 B4 B3 B2 B1 B0) * A4 * 2^4
SE(B7 B6 B5 B4 B3 B2 B1 B0) * A3 * 2^3
SE(B7 B6 B5 B4 B3 B2 B1 B0) * A2 * 2^2
SE(B7 B6 B5 B4 B3 B2 B1 B0) * A1 * 2^1
SE(B7 B6 B5 B4 B3 B2 B1 B0) * A0 * 2^0
+
S15S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0
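The partial-product picture above can be modelled in a few lines. This is a behavioural sketch, not the CLB implementation: it assumes the usual 2's complement convention that the sign bit of A carries negative weight, so its partial product is subtracted (the tree multiplier slide shows the same thing as a "-A3" term). The function name `signed_multiply` is hypothetical.

```python
def signed_multiply(a, b, bits=8):
    """Shift-and-add multiply of two `bits`-wide 2's complement values.

    One partial product per bit of A: B (sign-extended, which Python
    integers do implicitly) is weighted by 2**i and gated by bit A_i.
    The sign bit of A has weight -2**(bits-1), so that partial product
    is subtracted instead of added.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operand does not fit in the given word width")
    a_bits = a & ((1 << bits) - 1)        # 2's complement bit pattern of A
    product = 0
    for i in range(bits):
        if (a_bits >> i) & 1:
            partial = b << i
            if i == bits - 1:
                product -= partial        # sign bit of A: negative weight
            else:
                product += partial
    return product

print(signed_multiply(-6, 13))    # -78
print(signed_multiply(-128, -128))  # 16384
```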
4-Bit Signed Tree Multiplier
3:0BA1
Sign Extend
3:0
LSB
REG
A/2
B
3:0
A0
B * A1
B * A0
Sign Extend
3:1
B3
B2
B0
1 CARRY IN{ 1/2 B*A0 - B*A1 }
3:0BA3
Sign Extend
3:0
LSB
REG
A/2
-B
3:0
A2
B * A3
B * A2
Sign Extend
3:1
B3
B2
B0
1 CARRY IN{ 1/2 B*A2 - B*A3 }
B3
B3
5-bit Signed Adder & Reg = 3 CLBs
5-bit Signed Adder & Reg = 3 CLBs16 Gated Bits and Reg = 8 CLBs
+A1 *{ B3B3B2B1B0 }+A0 *{ B3B3B3B2B1B0 }
{ P5P4P3P2P1P0 }
REG
A/4
B
7:2
7:2
5:0
Sign Extend
5:2
B5
B5
B1LSB
B0
-A3 *{ B3B3B2B1B0 }+A2 *{ B3B3B3B2B1B0 }
{ P7P6P5P4P3P2 } -A3 *{ B3B3B2B1B0 }+A2 *{ B3B3B3B2B1B0 }+A1 *{ B3B3B3B3B2B1B0 }+A0 *{ B3B3B3B3B3B2B1B0 }
{ P7P6P5P4P3P2P1P0 }
7:0
Total = 18 CLBs
6-bit Signed Adder & Reg = 4 CLBs
B
X0
SAMPLE DATA
N BITS WIDE
A
B
ScalingAccum.
REGISTER
FILTEREDDATA OUT
2 -1
+ -
LOOKUP
TABLE
ADRS
DATA
D.A. ONE TAP FIR FILTER = D0 C0
REDUCES TO MULTIPLYING A VARIABLE TIMES A CONSTANT
...000000
C0
2 WORD X N BITLOOK UP TABLE
A0
A[0]0
1
1
X1
X2
X3
Xn
DINN
X0(B7B6B5B4B3B2B1B0)+X1(B7B6B5B4B3B2B1B0)
+X2(B7B6B5B4B3B2B1B0)
+X3(B7 B6B5B4B3B2B1B0)
+X7(B7 B6 B5 B4 B3 B2B1B0)
S15S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0
S9S8S7S6S5S4S3S2S1S0
S10S9S8S7S6S5S4S3S2S1S0
S11S10S9S8S7S6S5S4S3S2S1S0
+X4(B7 B6 B5B4B3B2B1B0)S12S11S10S9S8S7S6S5S4S3S2S1S0
+X5(B7 B6 B5 B4B3B2B1B0)S13S12S11S10S9S8S7S6S5S4S3S2S1S0
+X6(B7 B6 B5 B4 B3B2B1B0)S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0
D.A. TWO TAP FIR FILTER = D0 C0 + D1 C1
A
B
ScalingAccum.
REGISTER
FILTEREDDATA OUT
2 -1
+ -
LOOKUP
TABLE
ADRS
DATA
4 WORD X N BIT LOOK UP TABLE
A[10]   DATA
00      ...000000
01      C0
10      C1
11      C0 + C1
X0
X2
X1
XN
D0
SAMPLE DATA
N BITS WIDE
D1
A0
A1X0
X2
X1
XN
N
(X0,0,X1,0)(B7B6B5B4B3B2B1B0)+(X0,1,X1,1)(B7B6B5B4B3B2B1B0)
+(X0,2,X1,2)(B7B6B5B4B3B2B1B0)
+(X0,3,X1,3)(B7 B6B5B4B3B2B1B0)
+(X0,7,X1,7)(B7 B6 B5 B4 B3 B2B1B0)S15S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0
S9S8S7S6S5S4S3S2S1S0
S10S9S8S7S6S5S4S3S2S1S0
S11S10S9S8S7S6S5S4S3S2S1S0+(X0,4,X1,4)(B7 B6 B5B4B3B2B1B0)
S12S11S10S9S8S7S6S5S4S3S2S1S0+(X0,5,X1,5)(B7 B6 B5 B4B3B2B1B0)
S13S12S11S10S9S8S7S6S5S4S3S2S1S0
+(X0,6,X1,6)(B7 B6 B5 B4 B3B2B1B0)S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0
A
B
ScalingAccum.
REGISTER
FILTEREDDATA OUT
2 -1
+ -
LOOKUP
TABLE
ADRS
DATA
D.A. THREE TAP FIR FILTER
8 WORD X N BIT LOOK UP TABLE
A[210]   DATA
000      ...000000
001      C0
010      C1
011      C1 + C0
100      C2
101      C2 + C0
110      C2 + C1
111      C2 + C1 + C0
(X0,0,X1,0,X2,0)(B7B6B5B4B3B2B1B0)+(X0,1,X1,1,X2,1)(B7B6B5B4B3B2B1B0)
+(X0,2,X1,2,X2,2)(B7B6B5B4B3B2B1B0)
+(X0,N,X1,N,X2,N)(B7B6B5B4B3B2B1B0)S(N+M) ... S13S12S11S10S9S8S7S6S5S4S3S2S1S0
S9S8S7S6S5S4S3S2S1S0
S10S9S8S7S6S5S4S3S2S1S0
X0
X2
X1
XNSAMPLE DATA
N BITS WIDE
A1
D0
D2
D1
A0
X0
X2
X1
XN
A2X0
X2
X1
XN
N
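The one-, two-, and three-tap diagrams above all follow the same pattern: one bit from each tap forms the LUT address, the LUT returns the matching sum of coefficients, and the scaling accumulator combines the results, subtracting on the sign-bit cycle. The sketch below models that behaviour with integers. It is an illustration under stated assumptions, not the macro itself: the hardware halves the running sum each cycle (the 2^-1 feedback), whereas this model shifts the LUT output left by the bit position, which gives the same result without fractions. All names are hypothetical.

```python
def build_lut(coeffs):
    """LUT[addr] = sum of the coefficients whose address bit is 1."""
    k = len(coeffs)
    return [sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(1 << k)]

def da_sum_of_products(samples, coeffs, nbits):
    """Bit-serial distributed arithmetic sum of products.

    samples: signed integers that fit in `nbits` 2's complement bits.
    One LUT access per bit position, LSB first; the last (sign) bit
    has negative weight, so its LUT output is subtracted.
    """
    lut = build_lut(coeffs)
    mask = (1 << nbits) - 1
    acc = 0
    for bit in range(nbits):
        addr = 0
        for j, x in enumerate(samples):
            addr |= (((x & mask) >> bit) & 1) << j   # bit `bit` of tap j
        term = lut[addr] << bit
        acc = acc - term if bit == nbits - 1 else acc + term
    return acc

coeffs = [3, -5, 7]          # hypothetical C0, C1, C2
samples = [-6, 13, 100]      # hypothetical D0, D1, D2 (8-bit signed)
print(build_lut(coeffs))     # [0, 3, -5, -2, 7, 10, 2, 5]
print(da_sum_of_products(samples, coeffs, 8))   # 617
```

With K taps the table has 2^K words, which is why symmetric filters, where tap pairs are pre-added, are attractive.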
The Development of aDistributed Arithmetic FIR Filter
10-Bit 10-Tap - XC4000 Family Example
DATA
LOOK UPTABLE
PARALLEL INSERIAL OUT
SAMPLEDATA
XOR
COMPLEMENT ON LAST BIT & ADD 1
A
B
REGISTER
100 BITSHIFT
REGISTER
FILTEREDDATA OUT
SHIFT
D0
D1
D9
D9
D1
D8
D2
D7
D3
D6
D4
D5
ADD
ADD
ADD
ADD
ADD
A0
A1
A2
A3
A4
10 BIT 10 TAP SYMMETRICAL FIR FILTER
32 X 10 MEMORY
10 10 BITSHIFT
REGISTER
SUM(10,1)
10
11
10
ScalingAccum.
A10A9A8
S1
SUM(0)
DIN
ShiftReg. 10
Least SignificantBYTE
MostSignificantBYTE
OPTIONALDOUBLEPRECISION
S10S9
A0
10
320 BITS
Look Up Table is only 32 words by 10 bits
SerialAdders
C_I
B(9:0)
SIGN EXT B10
LD
LOAD ONFIRST BIT
PARALLEL INSERIAL OUT
SAMPLEDATA
N K BITSHIFT
REGISTER
SHIFT
D_0
D_1
D_k-1
N N BITSHIFT
REGISTER
SAMPLE DATA WORD SIZE = N BITSNUMBER OF TAPS = K
• One N Bit Shift Register Per Tap
• Use 4000 RAM to build Shift Register
• One 16 Bit Shift Register Per 1/2 CLB
•
# OUTPUTS = # TAPS
PARALLEL IN
SAMPLEDATA
N K BITSHIFT
REGISTER
D_0
N N BITSHIFT
REGISTER
•
RAM16X1RDATA_I
A3A2A1A0
WRCLK
DATA_O
RAM16X1RDATA_I
A3A2A1A0
WRCLK
DATA_O
SHIFT REGISTERIMPLEMENTED IN RAM
SERIAL TIME SKEWBUFFER
D_k-1
D_1
10 BIT 10 TAP = 50 CLBs 10 BIT 10 TAP = 10 CLBs
Serial Adder
D9
D1
D8
D2
D7
D3
D6
D4
D5
ADD
ADD
ADD
ADD
ADD
SerialAdders
D0
AB
D
Clk
FF
A+B+Carry
A + B
Carry In Carry
CLR
1 CLB Per 2 Taps
D
Clk
FF
CNT=10
SUM
DATA
LOOK UPTABLE
A0
A1
A2
A3
A4
32 X 10 MEMORY
320 BITS
DISTRIBUTED ARITHMETIC LOOK-UP TABLE
• HOLDS ALL PARTIAL PRODUCTS
• LUT IS AS WIDE AS COEFF
• CAN USE MEMGEN TO BUILD LUT
1’s COMPLEMENTER
• INVERTS DATA ON LAST CYCLE
• 2 BITS PER CLB
D Q
D Q
INVERT
D0
D1
SCALING ACCUMULATOR
• ADDS DATA TO (1/2) *(SUMOUT)
• 2 BITS PER CLB
• NEED N+1 BITS
• DOUBLE PRECISION WITH SR
• CAN USE XBLOX FOR RPM
FORCE CARRY-IN ON LAST BIT
A
B
REGISTER
SUM OUT
10
11ScalingAccum.
A10A9A8
S1
SUM(0)
DIN
ShiftReg. 10
Least SignificantBYTE
MostSignificantBYTE
OPTIONALDOUBLEPRECISION
S10S9
A0
10
C_I
B(9:0)
SIGN EXT B10
LD
LOAD ONFIRST BIT
DATA
DATA
LOOK UPTABLE
PARALLEL INSERIAL OUT
SAMPLEDATA
XOR
COMPLEMENT ON LAST BIT & ADD 1
A
B
REGISTER
100 BITSHIFT
REGISTER
FILTEREDDATA OUT
SHIFT
D0
D1
D9
D9
D1
D8
D2
D7
D3
D6
D4
D5
ADD
ADD
ADD
ADD
ADD
A0
A1
A2
A3
A4
10 BIT 10 TAP SYMMETRICAL FIR FILTER
32 X 10 MEMORY
10 10 BITSHIFT
REGISTER
SUM(10,1)
10
11
10
ScalingAccum.
A10A9A8
S1
SUM(0)
DIN
ShiftReg. 10
Least SignificantBYTE
MostSignificantBYTE
OPTIONALDOUBLEPRECISION
S10S9
A0
10
320 BITSSerialAdders
C_I
B(9:0)
SIGN EXT B10
LD
LOAD ONFIRST BIT
(RAM)
10
10
5FIVE 2 BITADDERS
2 TO 1 REDUCTION DUE TO SYMMETRY
SERIAL TIME SKEW BUFFER
RAM BASED SHIFT REGISTER
RAM OR ROMLOOK UP TABLE
10ADRS DATA
FIR FILTER COEFFICIENTSAND MULTIPLY LOOK UP
32 X 10
10
10
9
REGISTER
ADDER
A
B
10
FILTER OUT
COMPLEMENT ON LASTCYCLE
XOR
SCALING ACCUMULATOR
1’S COMPLEMENT
5 CLBs10 CLBs 10 CLBs
5 CLBs 7 CLBs
SAMPLE DATA
7 CLBs
TIMING AND CONTROL
50 MHz CLK
CLK
A3
A2
A1
A0
CNTEQ10
CNTEQ9
A3
A2
A1
A0
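The "2 to 1 reduction due to symmetry" in the block diagram above relies on the fact that a symmetric FIR filter uses each coefficient twice, so the two samples sharing a coefficient can be added first. A minimal sketch of that identity, with hypothetical coefficient and sample values:

```python
def fir_direct(window, coeffs):
    """Plain sum of products over all taps."""
    return sum(c * x for c, x in zip(coeffs, window))

def fir_symmetric(window, half_coeffs):
    """Pre-add the sample pairs that share a coefficient (D0+D9, D1+D8, ...),
    then multiply each pair sum by its single coefficient."""
    k = len(half_coeffs)
    pair_sums = [window[i] + window[2 * k - 1 - i] for i in range(k)]
    return sum(c * s for c, s in zip(half_coeffs, pair_sums))

half = [1, -3, 5, 9, 12]                        # hypothetical coefficients
full = half + half[::-1]                        # 10-tap symmetric filter
window = [4, -7, 0, 15, -2, 8, 1, -9, 3, 6]     # hypothetical samples D0..D9

print(fir_direct(window, full), fir_symmetric(window, half))
print(2 ** len(half))    # 32 LUT words instead of 2**10
```

Five pair sums address a 32-word table, which agrees with the 32 X 10 memory shown in the diagram; each pair sum needs one extra bit of word width.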
10 BIT 10 TAP FIR FILTER
• TOTAL OF 44 CLBS: FITS IN A 4002A (WITH 20 CLBS EXTRA FOR SYSTEM DESIGN)
• ABOUT 1300 EQUIVALENT GATES - LITTLE INTERCONNECT BETWEEN BLOCKS
NUMBER OF 10 BIT 10 TAP SYMMETRICAL FIR FILTERS PER XC4000 DEVICE
XC4000 PART           4002A   4003A   4004A   4005A   4006   4008   4010   4013   4025
NUMBER OF INSTANCES   1       2       3       5       6      8      10     15     23
9 Most Significant Bits
FIR10B10T
DATA IN DATA OUT
WORD_CLKCLK_OUT
DIN_ DOUT_
Relatively Placed Macro
BIT_CLK 10X_CLK
PERFORMANCE
• FIR10B10T MACRO CAN BE CLOCKED AT 66 MHZ @XC4000E-3
• 10 BIT WORD REQUIRES 11 CLOCKS
• 10 BIT SAMPLE WORD RATE IS 6 MHZ
• 8 BIT WORD REQUIRES 9 CLOCKS, ETC
• 8 BIT SAMPLE WORD RATE IS 8 MHZ
WORD SIZE (BITS)     6      8     10    12    14    16
SAMPLE RATE (MSPS)   11.1   7.4   6.1   5.1   4.4   3.9
FIR Filter Macro
Double-Rate DA FIR Filters
• Process 2 Bits per Clock
• # of Clocks = (N/2) + 1
• Twice as fast
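The clock counts quoted for the serial and double-rate filters follow one pattern: the word is processed one or two bits per clock, plus one extra clock. The sketch below simply evaluates that rule; rounding up for odd word sizes is an assumption, the function names are hypothetical, and the results are not meant to reproduce every entry of the rate tables on the slides.

```python
import math

def clocks_per_sample(word_bits, bits_per_clock=1):
    """Clocks needed per sample word: N + 1 serial, (N / 2) + 1 at two bits per clock."""
    return math.ceil(word_bits / bits_per_clock) + 1

def sample_rate_mhz(clock_mhz, word_bits, bits_per_clock=1):
    """Sample word rate for a given bit clock."""
    return clock_mhz / clocks_per_sample(word_bits, bits_per_clock)

print(clocks_per_sample(10))            # 11 clocks for a 10-bit word
print(clocks_per_sample(8))             # 9 clocks for an 8-bit word
print(clocks_per_sample(10, 2))         # 6 clocks at two bits per clock
print(sample_rate_mhz(66, 10))          # 6.0 MHz word rate at a 66 MHz clock
```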
Two Bit Parallel Distributed Arithmetic FIR Filter
[Block diagram. Labelled elements: SAMPLE DATA, N BITS WIDE; shift registers X0, X1, X2 ... XN with two-bit outputs (D0, D1); address lines A3, A2, A1, A0 into the LOOKUP TABLE (ADRS, DATA); SCALING ACCUM. (A, B, + / -, 2^-2 scaling, REGISTER); FILTERED DATA OUT.]
16 WORD X N BIT LOOK UP TABLE
A[3210]   DATA
0000      ...000000
0001      C0
0010      2C0
0011      3C0
0100      C1
0101      C1 + C0
0110      C1 + 2C0
0111      C1 + 3C0
1000      2C1
1001      2C1 + C0
1010      2C1 + 2C0
1011      2C1 + 3C0
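The look-up table for the two-bit structure can be generated programmatically. This is a minimal sketch, assuming that address bits A[3:2] carry the current two-bit slice of one tap and A[1:0] the slice of the other, and that samples are unsigned; the function names and the bit assignment are illustrative, not taken from the slides.

```python
def two_bit_pda_lut(c0, c1):
    """16-entry LUT: entry = (two-bit slice of tap 1) * c1 + (two-bit slice of tap 0) * c0."""
    return [(addr >> 2) * c1 + (addr & 0b11) * c0 for addr in range(16)]


def two_bit_pda_dot(x0, x1, c0, c1, word_bits=8):
    """Accumulate LUT outputs, consuming two bits of each unsigned sample per step."""
    lut = two_bit_pda_lut(c0, c1)
    acc = 0
    for shift in range(0, word_bits, 2):
        addr = (((x1 >> shift) & 0b11) << 2) | ((x0 >> shift) & 0b11)
        acc += lut[addr] << shift  # each step carries 4x the weight of the previous one
    return acc


if __name__ == "__main__":
    print(two_bit_pda_lut(3, 5))
    print(two_bit_pda_dot(200, 77, 3, 5), 200 * 3 + 77 * 5)
```

An 8-bit word takes four table look-ups instead of eight, which is where the doubled sample rate comes from.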
Double Sample Rate D.A. FIR Filters
• Twice the I/O Data Sample Rate
• Two Taps Require a 4 Input LUT without Symmetry
• Four Taps Require a 4 Input LUT with Symmetrical FIR
• Time Skew Buffer uses Twice as many CLBs
• LUTs are the same, if equal bit weights are used to address the LUTs.
• 2-Bit PDA Performance, Clocked at 66 MHz @XC4000E-3
WORD SIZE (BITS)     6      8      10     12     14     16
SAMPLE RATE (MSPS)   22.2   14.8   12.2   10.2   8.8    7.8
(Double Precision)
Full Parallel D.A. FIR Filters
• One 8-Bit Tap Requires two 4 Input LUTs and an ADDER with an offset for bit weighting.
• Time Skew Buffer must use REGs
• Maximum I/O Data Sample Rate
• Full PDA Performance, in a XC4000E-3/-2, 50-70 MHz.
– Pipelining can further increase sample rate
• LUTs are the same, if equal bit weights are used to address the 4-Coefficients in the LUT.
WORD SIZE (BITS)     6      8      10     12     14     16
SAMPLE RATE (MSPS)   70     70     70     70     66     66
(Double Precision)
FPGA-Based DSP Coprocessor
Design Implementation
• Performance
– Programmable DSP (DSP56300)
  • 24 clock cycles
  • 360 nsec @ 66 MHz
– FPGA-Based Coprocessor
  • 9 clock cycles
  • 135 nsec @ 66 MHz
• Results:
– 37.5% of original processing time
– 2.67X Increase in throughput
– System Requirements:
  • Before: 4-DSPs, 12-RAMs
  • After: 2-DSPs, 6-RAMs, 1-XC4013E
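The figures above follow from the cycle counts alone. The sketch below assumes a 15 ns clock period (about 66 MHz), which is the value that reproduces the 360 ns and 135 ns on the slide; the function name is illustrative.

```python
CLOCK_PERIOD_NS = 15  # assumed period for the "66 MHz" clock quoted on the slide


def processing_time_ns(cycles, period_ns=CLOCK_PERIOD_NS):
    """Time for one operation given its cycle count."""
    return cycles * period_ns


dsp_ns = processing_time_ns(24)    # programmable DSP
fpga_ns = processing_time_ns(9)    # FPGA-based coprocessor
fraction = fpga_ns / dsp_ns        # share of the original processing time
throughput_gain = dsp_ns / fpga_ns

if __name__ == "__main__":
    print(dsp_ns, fpga_ns, f"{fraction:.1%}", f"{throughput_gain:.2f}X")
```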
[Block diagram of the coprocessor datapath. Labelled elements: I/O Bus (in and out), Old_1, Old_2, INC, add/subtract blocks, two MUXes selected by the MSBs, Prestate Buffer Bit, 24-bit data paths, eight REGs, New_1, New_2, Diff_1, Diff_2.]
[Bar chart: Relative Performance. Two 66 MHz DSPs, Six 15 ns RAMs: 360 ns. 66 MHz DSP+FPGA, Three 15 ns RAMs: 135 ns.]
2.67 times better performance with FPGA-assisted DSP
8 Bit Word FIR Filter Structures
[Chart: # CLBs (100 to 300) versus Number of TAPS (16 to 80). Curves: Serial Sequential, 1000 to 50 KHz; Serial Distributed Arithmetic, 8 MHz; Two-Bit Parallel Distributed Arithmetic, 16 MHz; Parallel Distributed Arithmetic, 55 MHz.]
FIR Filter Implementation Options
8 Bit Word Example
           Serial*               Serial* Distributed    Parallel* Distributed
           Sequential            Arithmetic             Arithmetic
8 Taps     36 CLBs, 1.08 MHz     44 CLBs, 8.1 MHz       250 CLBs, 60 MHz
16 Taps    36 CLBs, 0.46 MHz     70 CLBs, 8.1 MHz       400 CLBs, 55 MHz
32 Taps    44 CLBs, 0.23 MHz     122 CLBs, 8.1 MHz
48 Taps    62 CLBs, 0.15 MHz     178 CLBs, 8.1 MHz
64 Taps    70 CLBs, 0.11 MHz     228 CLBs, 8.1 MHz
* Note: These designs are NOT Pipelined
Lower Sample Rate Applications:
Efficient CLB Counts
Large Number of TAPs
Moderate Sample Rates
Non Symmetrical FIR OK
Serial Sequential Architecture
32 Tap 8 Bit Example
[Block diagram: Serial Multiplier. Labelled elements: Coefficient Table, 32 x 8 LUT (32 - 8 Bit Coefficients); PSR, Parallel to Serial Converter, 4 CLBs; ADD, REGISTER, 2^-1 Scale; SDB Out; bus widths 8, 8, 9; Clk 50 MHz; block sizes marked 8 CLBs and 5 CLBs; 24 CLBs Total.]
Serial Sequential - FIR Filter
[Block diagram. Labelled elements: Sample Data; SAMPLE DATA BUFFER; SERIAL MULTIPLY; Coefficient Select; ACC; REG; 5-BIT CNTR, 3 CLBs; Filtered Data Out.]
64-TAP Serial Sequential FIR Filter
[Block diagram: multiple serial sequential sections, each with SAMPLE DATA BUFFER, SERIAL MULTIPLY, Coefficient Select, ACC and REG, combined through ADD and REGISTER stages into Filtered Data Out.]
Number CLBs vs. Taps / Word Size
           8 Bit   10 Bit   12 Bit   14 Bit   16 Bit
8 Tap      36      43       50       57       64
16 Tap     36      43       50       57       64
32 Tap     44      53       62       71       80
48 Tap     62      77       92       107      122
64 Tap     70      85       100      115      130
80 Tap     97      115      133      151      169
96 Tap     97      115      133      151      169
128 Tap    112     137      162      187      212
• 4002 = 64 CLBs
• 4005 = 196 CLBs
• 4013 = 576 CLBs
• 4025 = 1024 CLBs
Serial Sequential - FIR Filter
Maximum Sample Rate / Word Size
TAPS       8 Bit     10 Bit    16 Bit
8 Tap      781 kHz   625 kHz   390 kHz
16 Tap     390 kHz   312 kHz   195 kHz
32 Tap     195 kHz   156 kHz   97 kHz
48 Tap     130 kHz   104 kHz   65 kHz
64 Tap     97 kHz    78 kHz    48 kHz
80 Tap     78 kHz    62 kHz    39 kHz
96 Tap     65 kHz    52 kHz    32 kHz
128 Tap    48 kHz    39 kHz    24 kHz
• Serial Mult. Limitations
• Can Use Multiple 16 Tap Building Blocks
• 8X Faster at 128 Taps
[Diagram of one building block: Sample Data, SAMPLE DATA BUFFER, SERIAL MULTIPLY, Coefficient Select, ACC, REG, Filtered Data Out.]
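The sample rates in the table are consistent with one serial multiplier that spends one clock per bit per tap. The sketch below assumes that reading, together with the 50 MHz clock shown on the serial multiplier slide, and truncates to whole kHz as the table appears to do. The formula is inferred from the tabulated values, not stated in the slides.

```python
def serial_sequential_rate_khz(taps, word_bits, clock_hz=50_000_000):
    """Max sample rate if every sample costs taps * word_bits clocks (truncated to kHz)."""
    return clock_hz // (taps * word_bits) // 1000


def parallel_block_gain(total_taps, block_taps=16):
    """Rate gain from splitting a long filter into parallel blocks of block_taps taps."""
    return total_taps // block_taps


if __name__ == "__main__":
    for taps in (8, 16, 32, 48, 64, 80, 96, 128):
        print(taps, [serial_sequential_rate_khz(taps, b) for b in (8, 10, 16)])
    print(parallel_block_gain(128))
```

Under the same assumption, eight 16-tap blocks running in parallel sustain the 16-tap rate for a 128-tap filter, which matches the "8X Faster at 128 Taps" bullet.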
13.5MHz Median Filter, 5-Point, 2-Bit PDA
M = (A + B + C + D + E)/5
[Block diagram. Labelled elements: five inputs A, B, C, D, E, each 8 BITS WIDE, loaded into shift registers (X0, X1, X2 ... X8) clocked by 4xCLK; two-bit outputs A0/A1, B0/B1, C0/C1, D0/D1, E0/E1; two look up tables LUT-A (ADRS, DATA 11 Bit); adder with SIGN EXTEND (MSB, LSB, 2x weighting) and REG; second adder and REG with 4x weighting; 12-Bit and 14-Bit data paths; output M(A,B,C,D,E).]
32 WORD X 12 BIT LOOK UP TABLE A (addresses 00000 to 11111)
Address bits set   M(A,B,C,D,E)
0                  000000000000
1                  000110011001
2                  001100110011
3                  010011001100
4                  011001100110
5                  100000000000
• 4 CLBs per 8-Bit Shift Reg, 4 x 5 ea = 20 CLBs
• 1 CLB per Bit, 12-Bit Partial Sums, MSB bit weight = 1, 12 x 2 ea = 24 CLBs
• 6 CLBs for Add, 12-Bit Partial Sums, 1 CLB for [Carryout + LSB], 6 + 1 = 7 CLBs
• 7 CLBs for 14-Bit Add, 14-Bit Partial Product Sums, no Carryout and LSBs are dropped, 7 = 7 CLBs
58 CLBs for Function plus about 10 CLBs for Control. Total = 68 CLBs
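The 12-bit table values can be regenerated from the formula M = (A + B + C + D + E)/5. The sketch assumes each LUT entry is the number of address bits that are 1, divided by 5, in a fixed-point format where the MSB has weight 1 (so 1.0 is 2048), truncated. This interpretation is inferred from the bit patterns on the slide.

```python
def mean_lut(inputs=5, one=2048):
    """LUT entry = (number of address bits set) / inputs, scaled so that 1.0 == one."""
    return [bin(addr).count("1") * one // inputs for addr in range(1 << inputs)]


if __name__ == "__main__":
    lut = mean_lut()
    for addr in (0b00000, 0b00001, 0b00011, 0b00111, 0b01111, 0b11111):
        print(format(addr, "05b"), format(lut[addr], "012b"))
```

Each bit slice of the five inputs addresses the table once; weighting and summing the outputs per bit position gives the average without a divider.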
Design the following Application:
• Equations:
– Y(R,G,B) = 0.299*R + 0.587*G + 0.114*B
– U(R,G,B) = -0.169*R - 0.331*G + 0.500*B
– V(R,G,B) = 0.500*R - 0.419*G - 0.081*B
• R, G, B Data is 8-Bits at 13.5 MHz. The circuit already has a 2x Clk (27 MHz).
• Draw a functional schematic diagram of the circuit.
How do you implement the three multipliers or MACs?
What is the estimated size of the final design?
What is the estimated speed of the final design?
How long would it take to turn over this design?
Video Coding Application with 4x Clock
Y = 0.299*R + 0.587*G + 0.114*B
U = -0.169*R - 0.331*G + 0.500*B
V = 0.500*R - 0.419*G - 0.081*B
[Block diagram, one per output Y(R,G,B), U(R,G,B), V(R,G,B). Labelled elements: R, G, B inputs, each 8 BITS WIDE, in shift registers (X0, X1, X2 ... X8) clocked by 4xCLK; two-bit outputs R0/R1, G0/G1, B0/B1; two look up tables LUT-A (ADRS, DATA 10 Bit); adder with SIGN EXTEND (MSB, LSB, 2x weighting) and REG; second adder and REG with 4x weighting; output 12 BITS WIDE.]
8 WORD X 10 BIT LOOK UP TABLE A: f(RGB) for addresses 000 to 111. Entries, one per combination of the R, G, B address bits: ...000000, CR, CG, CB, CR + CG, CR + CB, CG + CB, CR + CG + CB.
• PARALLEL LOAD 2-BIT SHIFT REG: 4 CLBs EA, = 12 CLBs
• LUTs are the same: 5 CLBs EA, = 10 CLBs
• 10 Bit ADDER + REG: 5.5 CLBs
• 12 Bit ADDER: 6 CLBs
The total design would use about 110 CLBs with control logic.
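The look-up-table approach to the three multiply-accumulates can be checked in software. This is a minimal sketch of distributed arithmetic for the color equations, assuming unsigned 8-bit R, G, B samples, one bit per step (the slides process two or four bits per clock with duplicated LUTs), and coefficients scaled by 1024. The scale factor, function names and address bit order are assumptions, not taken from the slides.

```python
def build_lut(coeffs):
    """Entry for each address: sum of the coefficients whose address bit is 1."""
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << len(coeffs))]


def da_dot(samples, coeffs, width=8):
    """Bit-serial distributed arithmetic for unsigned samples, LSB first."""
    lut = build_lut(coeffs)
    acc = 0
    for bit in range(width):
        addr = 0
        for i, s in enumerate(samples):
            addr |= ((s >> bit) & 1) << i
        acc += lut[addr] << bit  # weight the partial sum by 2**bit
    return acc


SCALE = 1024  # assumed fixed-point scale, not taken from the slides
Y = [round(k * SCALE) for k in (0.299, 0.587, 0.114)]
U = [round(k * SCALE) for k in (-0.169, -0.331, 0.500)]
V = [round(k * SCALE) for k in (0.500, -0.419, -0.081)]

if __name__ == "__main__":
    rgb = (200, 100, 50)
    for name, coeffs in (("Y", Y), ("U", U), ("V", V)):
        print(name, da_dot(rgb, coeffs) / SCALE)
```

No multiplier appears anywhere: each output needs only an 8-entry table, a shifter and an adder, which is why the CLB estimates stay small.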
Video Coding Application with 2x Clock
[Block diagram, one per output Y(R,G,B), U(R,G,B), V(R,G,B). Labelled elements: R, G, B inputs, each 8 BITS WIDE, in shift registers (X0, X1, X2 ... X8) clocked by 2xCLK; four-bit outputs R0 to R3, G0 to G3, B0 to B3; four look up tables LUT-A (ADRS, DATA 10 Bit); two adders with SIGN EXTEND (MSB, LSB, 2x weighting) and REG; a 4x weighted adder and REG; a 16x weighted adder and REG; output 12 BITS WIDE.]
• PARALLEL LOAD 4-BIT SHIFT REG: 4 CLBs EA, = 12 CLBs
• All four LUTs are the same: 5 CLBs EA, = 20 CLBs
• 10 Bit ADDER + REG: 5.5 CLBs EA, = 11 CLBs
• 12 Bit ADDER + 2 REGs: 7 CLBs
• 14 Bit ADDER: 7 CLBs
The total design would use about 180 CLBs with control logic.
Xilinx Introduces First Fully Programmable
System Solution
First FPGA Architecture Designed for Intellectual Property
FPGA Technology Roadmap
[Chart: Density/Performance versus Year, 1995 to 1999.]
• XC4000E: Largest Device XC4025, 0.5µ
• XC4000EX: Largest Device XC4036EX, 0.5µ
• XC4000XL: Largest Device XC4085XL, 0.35µ
• XC4000XV: Largest Device XC40250XV, 0.25µ
• Generation 3 architecture: 1 Million+ system gates, System Solution, 0.25/0.18µ
Process Technology and Supply Voltage
Virtex FPGAs Leverage Xilinx Process Technology Leadership
[Chart: Feature Size (µ), 0 to 1.2, versus year, 1990 to 2002, with supply Voltage steps of 5, 3.3, 2.5, 1.8 and 1.3 V and a marker labelled "Virtex FPGAs Ship".]
• Lower cost
• Faster speed
• Higher density
• Lower power
Voltage and Family Migration
• Virtex FPGAs and XC4000XV share common process (0.25µ)
– 2.5 V logic, 3.3 V I/O with 5 V tolerance
• Family migration from XC4000XL possible
– Voltage migration guide will assist users
• Design with XC4000XL now and plan ahead for XC4000XV and Virtex FPGAs
Xilinx 0.25µ 5 Volt-Compatible FPGAs
• Family migration possible if you plan for:
– Additional power/ground pins
– Dedicated clock and configuration pins
• Voltage migration guide to help users
[Diagram: Any 5 V device (XC4000E), 5 V I/O supply and 5 V logic supply; Virtex & XC4000XV, 3.3 V I/O supply and 2.5 V logic supply; any 3.3 V device (XC4000XL), 3.3 V I/O supply and 3.3 V logic supply. Signal arrows labelled "Meets TTL Levels" and "Accepts 5 V levels".]
System Level Design Trend
[Diagram labels: PC Board; DSP; Custom Logic; Bus I/F; RAM I/F; PCI; Scratch Pad SRAM; High-Density, High-Performance Custom Device.]
Introducing Xilinx Virtex FPGAs
Segmented Routing, 4-Input LUT FPGA Architecture
Fast, Flexible I/Os
System Building Blocks
Software
IP
Leading Edge Process Technology
World’s first fully programmable system-level architecture
Advanced Process Technology
0.5u Process            0.25u UMC Process
- locos isolation       - shallow trench isolation
- birds beak            - 0.9u metal pitch
- no planarization      - CMP
- only contact plug     - plug for all vias
Family Overview
• 0.25um, 5 layer metal process
• Density: 50 thousand to 1 million system gates
• Performance
– 100+ MHz performance
  • 3 to 4 LUT levels
– 160 MHz system performance
  • Clock to output + input setup
• First device in 2Q98
– 250,000 system gates
– One million system gate device by end of 1998
Virtex FPGA Performance
• 100+ MHz internal speeds
– 155 MHz SONET data stream processing
– 100+ MHz Pipelined Multipliers
– 66 MHz PCI
• 100+ MHz system interface speeds
                          without PLL   with PLL
Tco (output register)     6 ns          3.5 ns
Tsu (input register)      3 ns          3 ns
Th (input register)       0 ns          0 ns
Max I/O performance       110 MHz       160 MHz
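The "Max I/O performance" row can be related to the timing rows with a simple model: the clock period must cover clock-to-output of the sending register plus setup of the receiving register. The sketch below assumes that model with zero board delay and zero clock skew, which is a simplification and not stated on the slide.

```python
def max_io_mhz(tco_ns, tsu_ns):
    """Clock limit when the period must cover Tco + Tsu (no board delay or skew)."""
    return 1000.0 / (tco_ns + tsu_ns)


if __name__ == "__main__":
    print(f"without PLL: {max_io_mhz(6.0, 3.0):.1f} MHz")
    print(f"with PLL:    {max_io_mhz(3.5, 3.0):.1f} MHz")
```

Without the PLL this gives about 111 MHz, in line with the 110 MHz quoted. With the PLL it gives about 154 MHz, slightly below the quoted 160 MHz, so the slide's figure presumably rounds up or rests on assumptions not shown.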
Functional Block Diagram
[Diagram labels: CLB; Segmented routing; SelectI/O Pins; Distributed SelectRAM Memory; Block SelectRAM Memory; PLL; 66 MHz PCI; SSTL3; Vector Based Interconnect, delay = f(vector).]
Virtex Clocking
Clocking and PLL
• 4 low skew clock resources
– 3ns setup, 0ns hold clock pad -> IOB input FF
– 6ns clock to out clock pad -> IOB output FF
• 24 Additional low skew globals
– clocks, enables, resets, etc
– faster than 4KXL secondary global buffer
• PLL for system clock deskew and fast clock to out.
Virtex CLB
Segmented Routing Interconnect
[Diagram: CLBs with CARRY chains connected through a SWITCH MATRIX, with 3-STATE BUSSES; each CLB shown as 2 LCs + 2 LCs.]
• Fast local routing within CLBs
• General purpose routing between CLBs
• Fast Interconnect
– 8ns across 250,000 system gates
• Predictable for early design analysis
• Optimized for five layer metal process
Virtex Configurable Logic Block
[Diagram: a CLB shown as 2 LCs + 2 LCs. Each logic cell drawn has a 4 Input LUT (I3, I2, I1, I0, O, WI, DI), Carry and Control (CI, CO) and a Register (D, CE, CLK, PR, RS, Q).]
Polarity of all control signals selectable
Fast arithmetic and multiplier circuitry
Optimized for synthesis
Virtex IO
Simplified IOB
• Fast I/O drivers
• Registered input, output, 3-state enable control
• Programmable slew rate, pull-up, input delay, etc.
• Selectable I/O Standards
– SSTL, GTL, LVTTL...
[Diagram: three DFF/LATCH elements (D, CE, S/R, Q) and the PAD.]
Virtex Memory
SelectRAM+ Memory Features
• Distributed SelectRAM Memory
– Pioneered in XC4000 family
– 16x1 synchronous SRAM implemented in LUT
– Ideal for DSP applications
– Access over one hundred billion bytes/sec
• Block SelectRAM Memory
– 4096 bit blocks of dual port synchronous SRAM
– Configurable widths of 1, 2, 4, 8, and 16
– Ideal for data buffers and FIFOs
– Up to 17 gigabytes/sec access
• Fast Access to External RAM
– Direct interface to SSTL3, 3.3V synchronous DRAM standard
– 133 MHz
Block RAM
• Configure as: 4096 bits with variable aspect ratio
• 8-32 blocks across family devices
• True dual-port, fully synchronous operation
– Cycle time <10 ns
• Flexible block RAM configuration
– 5 blocks: 2K x 10 video line buffer
– 1 block: 512 x 8 ATM buffer (9 frames)
– 4 blocks: 2K x 8 FIFO
– 9 blocks: 4K x 9 FIFO with parity
[Symbol: RAMB4. Port A pins WEA, ENA, CLKA, ADDRA, DINA, DOA; port B pins WEB, ENB, CLKB, ADDRB, DINB, DOB.]
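The block counts in the configuration list follow from the 4096-bit block size. A minimal sketch, assuming a memory of depth x width bits is simply divided by the block capacity and rounded up; it counts raw bits only and does not check that the per-block widths (1, 2, 4, 8, 16) actually tile the requested shape, so a real mapping could need more blocks.

```python
BLOCK_BITS = 4096  # capacity of one block RAM, from the slide


def blocks_needed(depth, width, block_bits=BLOCK_BITS):
    """Lower bound on block RAMs for a depth x width memory (ceiling of total bits / block)."""
    return -(-(depth * width) // block_bits)


if __name__ == "__main__":
    for depth, width in ((2048, 10), (512, 8), (2048, 8), (4096, 9)):
        print(f"{depth} x {width}: {blocks_needed(depth, width)} block(s)")
```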
Real Time Video Processor
[Block diagram inside a Virtex FPGA, with external memory. Memory labels: High Speed Synchronous DRAM (Mbytes); Block SelectRAM Memory (Kbytes), shown twice; Distributed SelectRAM Memory (bytes), shown twice. Data labels: Video Data In, Frame Data, Line Data, Pixel Data, Processed Video Out. Logic label: Video Pixel Processing Function (logic).]
Hierarchy of RAM provides efficient and very high bandwidth data processing
Virtex FPGA Summary
• 1 Million+ system gates
• 100+ MHz performance from all devices
• Building blocks for system level design
• ASIC design flow software
• Platform for CORE reuse
First fully programmable system solution