11/25/2010
CS411 Digital System Design
Dr. Arshad Aziz
Basic FPGA Architecture
Technology Timeline
The Design Warrior’s Guide to FPGAs: Devices, Tools, and Flows. ISBN 0750676043
• Carry logic runs vertically, up only
  – Two independent carry chains per CLB
Detailed Slice Structure
• The next few slides discuss the slice features
  – LUTs
  – MUXF5, MUXF6, MUXF7, MUXF8 (only the F5 and F6 MUX are shown in this diagram)
  – Carry Logic
  – MULT_ANDs
  – Sequential Elements
SRAM Cell (Pass Transistor)
• An SRAM cell can drive the gate (G) terminal of an NMOS transistor.
• If SRAM (M) = 1, the signal passes from source (S) to drain (D).
• An SRAM cell can be attached to the select line of a MUX to control it.
Combinatorial Logic
(Figure: combinatorial logic with inputs A, B, C, D and output Z.)
Look-Up Tables
• Combinatorial logic is stored in Look-Up Tables (LUTs)
  – Also called Function Generators (FGs)
  – Capacity is limited by the number of inputs, not by the complexity
• Delay through the LUT is constant

A B C D | Z
0 0 0 0 | 0
0 0 0 1 | 0
0 0 1 0 | 0
0 0 1 1 | 1
0 1 0 0 | 1
0 1 0 1 | 1
. . .
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 0
1 1 1 1 | 1
Look Up Table (LUT)
• The LUT is used to realize any Boolean function.
• Assume the function to be realized is y = (a&b) | !c
• This could be achieved by loading the LUT with the appropriate output values
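To make the "loading the LUT" step concrete, here is a minimal Python sketch that fills a 3-input table for y = (a&b) | !c and then reads it back. The helper names (build_lut, lut_read) and the choice of a as address bit 0 are assumptions made for illustration; they are not taken from any vendor tool.

```python
def build_lut(func, n_inputs):
    """Return the 2**n_inputs configuration bits that realize func."""
    table = []
    for address in range(2 ** n_inputs):
        # Assumption: input i drives address bit i (a is bit 0, b is bit 1, ...).
        bits = [(address >> i) & 1 for i in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table


def lut_read(table, *inputs):
    """Look up the stored output bit; no logic is evaluated at read time."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return table[address]


def y(a, b, c):
    # The function from the slide: y = (a & b) | !c
    return (a & b) | (1 - c)


LUT3 = build_lut(y, 3)
print(LUT3)
```

Reading the table costs the same for every function, which is why the delay through a LUT is constant regardless of the logic it holds.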
LUT (Look-Up Table) Functionality
• Look-up tables are the primary elements for logic implementation
• MUXF5 combines 2 LUTs to create
  – Any 5-input function (LUT5)
  – Or selected functions up to 9 inputs
  – Or a 4x1 multiplexer
• MUXF6 combines 2 slices to form
  – Any 6-input function (LUT6)
  – Or selected functions up to 19 inputs
  – Or an 8x1 multiplexer
• Dedicated muxes are faster and more space efficient
Connecting Look-Up Tables
(Figure: a CLB with slices S0 to S3 and their F5, F6, F7, and F8 multiplexers.)
MUXF8 combines the two MUXF7 outputs (from the CLB above or below)
MUXF6 combines slices S2 and S3
MUXF7 combines the two MUXF6 outputs
MUXF6 combines slices S0 and S1
MUXF5 combines LUTs in each slice
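The way MUXF5 turns two 4-input LUTs into any 5-input function can be sketched in Python. This is a behavioral model only; the function names and the use of the fifth input e as the mux select are illustrative assumptions.

```python
def split_into_two_lut4(func5):
    """Build one 16-entry table for e = 0 and one for e = 1."""
    tables = []
    for e in (0, 1):
        table = []
        for address in range(16):
            a, b, c, d = [(address >> i) & 1 for i in range(4)]
            table.append(func5(a, b, c, d, e) & 1)
        tables.append(table)
    return tables[0], tables[1]


def lut4_read(table, a, b, c, d):
    return table[a | (b << 1) | (c << 2) | (d << 3)]


def lut5_read(table_e0, table_e1, a, b, c, d, e):
    """Both LUTs see a, b, c, d; the mux (MUXF5 role) picks one output using e."""
    f = lut4_read(table_e0, a, b, c, d)
    g = lut4_read(table_e1, a, b, c, d)
    return g if e else f


def parity5(a, b, c, d, e):
    # Example 5-input function: odd parity.
    return a ^ b ^ c ^ d ^ e


TABLE_E0, TABLE_E1 = split_into_two_lut4(parity5)
```

The same idea repeats one level up: MUXF6 selects between two such 5-input results to cover any 6-input function.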
Programmable Logic Block
• Early devices were based on the concept of a programmable logic block, which comprised
  – a 3-input lookup table (LUT),
  – a register that could act as a flip-flop or a latch,
  – a multiplexer, along with a few other elements.
3-, 4-, 5-, or 6-input LUTs?
• The key feature of an n-input LUT is that it can implement any possible n-input combinational logic function.
• Adding more inputs allows you to represent more complex functions, but every time you add an input, you double the number of SRAM cells!
• The first FPGAs were based on 3-input LUTs.
• FPGA vendors and researchers studied the relative merits of 3-, 4-, 5-, and even 6-input LUTs.
• The current consensus is that 4-input LUTs offer the optimal balance of pros and cons.
• In the past, some devices were created using a mixture of different LUT sizes because this offered the promise of optimal device utilization.
• However, current logic synthesis tools prefer uniformity and regularity.
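The doubling of SRAM cells per added input follows from the table size of 2^n entries. A short Python sketch of the arithmetic (the function names are illustrative):

```python
def lut_sram_cells(n_inputs):
    """An n-input LUT stores one bit per input combination."""
    return 2 ** n_inputs


def distinct_functions(n_inputs):
    """Each of the 2**n cells can be 0 or 1 independently."""
    return 2 ** lut_sram_cells(n_inputs)


for n in (3, 4, 5, 6):
    print(n, "inputs:", lut_sram_cells(n), "SRAM cells")
```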
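FPGA Function generators
• LUT example: implement the function F = ABD + BCD + ABC (complement bars, if any, did not survive in this transcript) using:
  – 2-input LUTs
  – 3-input LUTs
  – 4-input LUTs
(Figure: the product terms ABD, BCD, and ABC are combined by the LUT networks to produce F.)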
Fast Carry Logic
• Each CLB contains separate logic and routing for the fast generation of sum & carry signals
  – Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
• Carry logic is independent of normal logic and routing resources
(Figure: carry logic and routing running from LSB to MSB.)
Fast Carry Logic
• Simple, fast, and complete arithmetic logic
  – Dedicated XOR gate for single-level sum completion
  – Uses dedicated routing resources
  – All synthesis tools can infer carry logic
(Figure: two carry chains in a CLB, each running from CIN to COUT. One passes through slices S0 and S1 and continues to S0 of the next CLB; the other passes through slices S2 and S3 and continues to CIN of S2 of the next CLB.)
Accessing Carry Logic
• All major synthesis tools can infer carry logic for arithmetic functions
  – Addition (SUM <= A + B)
  – Subtraction (DIFF <= A - B)
  – Comparators (if A < B then…)
  – Counters (count <= count + 1)
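What the tools infer for SUM <= A + B can be modeled bit by bit in Python. The sketch assumes the common arrangement in which the LUT computes the propagate signal, a dedicated XOR produces the sum bit, and a dedicated carry multiplexer passes the carry upward; the carry multiplexer is an assumption here, since the slides only name the dedicated XOR gate.

```python
def carry_chain_add(a, b, width, carry_in=0):
    """Add two width-bit numbers one bit per stage, LSB to MSB."""
    carry = carry_in
    total = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        propagate = a_i ^ b_i        # computed in the LUT
        sum_i = propagate ^ carry    # dedicated XOR gate
        # Carry mux: pass the incoming carry if propagating,
        # otherwise generate (a_i = b_i = 1) or kill (a_i = b_i = 0).
        carry = carry if propagate else a_i
        total |= sum_i << i
    return total, carry


print(carry_chain_add(5, 3, 4))
```

Only the propagate term uses general logic; the carry itself never leaves the dedicated vertical path, which is where the speed comes from.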
Flexible Sequential Elements
• Either flip-flops or latches
• Two in each slice; eight in each CLB
• Inputs come from LUTs or from an independent CLB input
• Separate set and reset controls
  – Can be synchronous or asynchronous
• All controls are shared within a slice
  – Control signals can be inverted locally within a slice
(Figure: example storage primitives FDCPE (D, CE, PRE, CLR, Q), FDRSE (D, CE, S, R, Q), and LDCPE (D, CE, G, PRE, CLR, Q).)
Shift Register
(Figure: a LUT drawn as a chain of D flip-flops with CE, with inputs IN, CE, CLK, and DEPTH[3:0] and output OUT.)
• Each LUT can be configured as a shift register
  – Serial in, serial out
• Dynamically addressable delay up to 16 cycles
• For programmable pipeline
• Cascade for greater cycle delays
• Use CLB flip-flops to add depth
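A behavioral Python sketch of a LUT used as a 16-deep shift register with a dynamically addressable tap. The class name and the convention that depth 0 selects a one-cycle delay are assumptions for illustration.

```python
class ShiftRegisterLut:
    """16-bit serial-in, serial-out shift register with an addressable tap."""

    def __init__(self):
        self.bits = [0] * 16  # bits[0] holds the most recently shifted-in value

    def clock(self, data_in, ce=1):
        """Rising clock edge: shift by one position when clock enable is high."""
        if ce:
            self.bits = [data_in & 1] + self.bits[:-1]

    def read(self, depth):
        """DEPTH[3:0] selects the tap: 0 is a 1-cycle delay, 15 a 16-cycle delay."""
        return self.bits[depth & 0xF]


srl = ShiftRegisterLut()
for bit in (1, 0, 0, 0):
    srl.clock(bit)
print(srl.read(3))  # the 1 shifted in four clocks ago
```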
Shift Register
• Register-rich FPGA
  – Allows for addition of pipeline stages to increase throughput
• Data paths must be balanced to keep desired functionality
(Figure: two 64-bit paths. Operation A (4 cycles) followed by Operation B (8 cycles) takes 12 cycles; Operation C takes 3 cycles, leaving a 9-cycle imbalance.)
Shift Register LUT Example
(Figure: the same two 64-bit paths. Operation A (4 cycles) plus Operation B (8 cycles) takes 12 cycles; Operation C (3 cycles) is followed by Operation D, a 9-cycle NOP, so both paths take 12 cycles. Paths are statically balanced.)
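The balancing arithmetic from the example can be written out as a small Python sketch. It assumes each LUT-based shift register is one bit wide and up to 16 cycles deep, as stated on the earlier Shift Register slide; the function name is illustrative.

```python
import math


def balance(path_a_cycles, path_b_cycles, bus_width, srl_depth=16):
    """Return (cycles of delay to add, LUTs needed to delay the whole bus)."""
    imbalance = abs(path_a_cycles - path_b_cycles)
    luts_per_bit = math.ceil(imbalance / srl_depth)
    return imbalance, luts_per_bit * bus_width


# Operation A + Operation B against Operation C, on a 64-bit data path.
print(balance(4 + 8, 3, 64))
```

For the slide's numbers the 9-cycle NOP fits in a single shift-register LUT per bit, so the 64-bit path needs 64 LUTs and no flip-flops.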
Distributed RAM
• CLB LUT configurable as Distributed RAM
  – A LUT equals 16x1 RAM
  – Cascade LUTs to increase RAM size
• Synchronous write
• Asynchronous read
  – Can create a synchronous read by using extra flip-flops
  – Naturally, distributed RAM read is asynchronous
• Two LUTs can make
  – 32 x 1 single-port RAM
  – 16 x 2 single-port RAM
  – 16 x 1 dual-port RAM
(Figure: one LUT = RAM16X1S (D, WE, WCLK, A0 to A3, O). Two LUTs = RAM32X1S (D, WE, WCLK, A0 to A4, O), or RAM16X2S (D0, D1, WE, WCLK, A0 to A3, O0, O1), or RAM16X1D (D, WE, WCLK, A0 to A3, DPRA0 to DPRA3, SPO, DPO).)
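A behavioral Python sketch of the synchronous-write, asynchronous-read behavior, and of two 16x1 LUT RAMs cascaded into a 32x1 RAM. The class names echo the primitive names on the slide, but the code is only a model; using A4 to pick the bank is an assumption about how the cascade is wired.

```python
class Ram16x1S:
    """One LUT used as a 16 x 1 single-port RAM."""

    def __init__(self):
        self.cells = [0] * 16

    def read(self, address):
        # Asynchronous read: the output follows the address with no clock.
        return self.cells[address & 0xF]

    def clock(self, address, data, we):
        # Synchronous write: the cell changes only on a WCLK edge with WE high.
        if we:
            self.cells[address & 0xF] = data & 1


class Ram32x1S:
    """Two LUTs cascaded; address bit A4 selects the bank (assumed wiring)."""

    def __init__(self):
        self.banks = [Ram16x1S(), Ram16x1S()]

    def read(self, address):
        return self.banks[(address >> 4) & 1].read(address)

    def clock(self, address, data, we):
        self.banks[(address >> 4) & 1].clock(address, data, we)


ram = Ram32x1S()
ram.clock(21, 1, we=1)
print(ram.read(21), ram.read(5))
```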
Embedded RAM Blocks
• A lot of applications require the use of memory, so FPGAs now include relatively large chunks of embedded RAM called e-RAM or Block RAM (BRAM).
• Depending on the architecture of the component, these blocks might be positioned around the periphery of the device or organized as columns.
• These blocks can be used for a variety of purposes, such as implementing standard single- or dual-port RAMs, FIFOs, etc.
Block RAM
(Figure: Spartan-3 dual-port block RAM with Port A and Port B.)
• Most efficient memory implementation
  – Dedicated blocks of memory
• Ideal for most memory requirements
  – 4 to 104 memory blocks
• 18 kbits = 18,432 bits per block (16 k without parity bits)
  – Use multiple blocks for larger memories
• Builds both single and true dual-port RAMs
• Synchronous write and read (different from distributed RAM)
Spartan-3 Block RAM Amounts
Block RAM Port Aspect Ratios
Block RAM can have various configurations (port aspect ratios):

Configuration    Address range    Data width (bits)
16k x 1          0 to 16,383      1
8k x 2           0 to 8,191       2
4k x 4           0 to 4,095       4
2k x (8+1)       0 to 2,047       8 + 1
1024 x (16+2)    0 to 1,023       16 + 2

Single-Port Block RAM
Dual-Port Block RAM
Dual-Port Bus Flexibility
(Figure: RAMB4_S16_S8. Port A: 1K depth, 18-bit width, with WEA, ENA, RSTA, ADDRA[9:0], CLKA, DIA[17:0], DOA[17:0]. Port B: 2K depth, 9-bit width, with WEB, ENB, RSTB, ADDRB[10:0], CLKB, DIB[8:0], DOB[8:0].)
• Each port can be configured with a different data bus width
• Provides easy data width conversion without any additional logic
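Width conversion through a dual-port RAM can be sketched in Python as two views onto one bit array. This model ignores the parity bits (so the 18-bit and 9-bit ports become 16 and 8 bits) and assumes that the narrow port sees the low-order bits of each wide word at the lower address; both simplifications are assumptions, not a statement of the device's actual address mapping.

```python
class DualPortRamModel:
    """One shared bit array that each port addresses at its own width."""

    def __init__(self, total_bits=16 * 1024):
        self.bits = [0] * total_bits

    def write(self, width, address, value):
        base = address * width
        for i in range(width):
            self.bits[base + i] = (value >> i) & 1

    def read(self, width, address):
        base = address * width
        return sum(self.bits[base + i] << i for i in range(width))


dpram = DualPortRamModel()
dpram.write(16, 3, 0xABCD)                      # wide port writes one word
low, high = dpram.read(8, 6), dpram.read(8, 7)  # narrow port reads two bytes
print(hex(low), hex(high))
```

No logic sits between the ports: the conversion falls out of how each port decodes its own address.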
Two Independent Single-Port RAMs
• To access the lower RAM
  – Tie the MSB address bit to Logic Low
• To access the upper RAM
  – Tie the MSB address bit to Logic High
• Added advantage of True Dual-Port
  – No wasted RAM bits
• Can split a Dual-Port 16K RAM into two Single-Port 8K RAMs
  – Simultaneous independent access to each RAM
(Figure: RAMB4_S1_S1. Port A: 8K depth, 1-bit width, address formed as 0, ADDR[12:0], with WEA, ENA, RSTA, ADDRA[12:0], CLKA, DIA[0], DOA[0]. Port B: 8K depth, 1-bit width, address formed as 1, ADDR[12:0], with WEB, ENB, RSTB, ADDRB[12:0], CLKB, DIB[0], DOB[0].)
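The MSB trick can be modeled in a few lines of Python: one 16K x 1 array, with each port's top address bit tied to a constant so the two ports can never touch the same cell. The class and method names are illustrative.

```python
class SplitDualPortRam:
    """A 16K x 1 dual-port RAM used as two independent 8K x 1 RAMs."""

    HALF = 8 * 1024  # 13 address bits per half

    def __init__(self):
        self.cells = [0] * (2 * self.HALF)

    def _full_address(self, msb, address):
        # The tied-off MSB is prepended to the 13-bit user address.
        return (msb << 13) | (address & (self.HALF - 1))

    def write(self, msb, address, bit):
        self.cells[self._full_address(msb, address)] = bit & 1

    def read(self, msb, address):
        return self.cells[self._full_address(msb, address)]


split = SplitDualPortRam()
split.write(0, 5, 1)   # port with MSB tied low: lower RAM
split.write(1, 9, 1)   # port with MSB tied high: upper RAM
print(split.read(0, 5), split.read(1, 5), split.read(1, 9))
```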
Embedded Multipliers
• Some functions, like multipliers, are inherently slow if they are implemented by connecting a large number of programmable logic blocks together.
• Current FPGAs incorporate special hard-wired multiplier blocks, which are typically located in close proximity to the embedded RAM blocks (arithmetic-based applications).
18 x 18 Embedded Multiplier
• Fast arithmetic functions
  – Optimized to implement multiply / accumulate modules
• 18 x 18 signed multiplier
• Fully combinational
• Optional registers with CE & RST (pipeline)
• Independent from adjacent block RAM
18 x 18 Multiplier
• Embedded 18-bit x 18-bit multiplier
  – 2’s complement signed operation
• Multipliers are organized in columns
(Figure: 18 x 18 multiplier with inputs Data_A (18 bits) and Data_B (18 bits) and a 36-bit output.)
Positions of Multipliers
Asynchronous 18-bit Multiplier
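The 2’s complement behavior of the 18 x 18 multiplier, with its 36-bit result, can be checked with a short Python model. The function names are illustrative; the model works on raw bit patterns the way the hardware ports would.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mult18x18(data_a, data_b):
    """Multiply two 18-bit signed operands; return the 36-bit result pattern."""
    product = to_signed(data_a, 18) * to_signed(data_b, 18)
    return product & ((1 << 36) - 1)


minus_three = -3 & 0x3FFFF            # 18-bit pattern for -3
result = mult18x18(minus_three, 5)
print(to_signed(result, 36))
```

36 bits are always enough: the largest magnitude, (-2^17) x (-2^17) = 2^34, still fits in a signed 36-bit word.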
18-bit Multiplier with Register
A simple clock tree
Digital Clock Managers (DCM)
• The clock pin is usually connected to a special hard-wired function called a clock manager, which generates “daughter clocks”.
• The daughter clocks may be used to drive internal clock trees or external output pins that can be used to provide clocking services to other devices on the host circuit board.
• There might be multiple clock managers, each supporting only a subset of features (jitter removal, frequency synthesis, …)
DCM: Jitter Removal
• In the real world, clock edges may arrive a little early or a little late.
• The result is a fuzzy clock (jitter).
• The FPGA clock manager can be used to detect and correct for this jitter and provide a “clean” daughter clock signal for use inside the device.
DCM: Frequency Synthesis
• The frequency of the clock signal being presented to the FPGA from the outside world might not be exactly what the design engineer wishes for.
• The clock manager can be used to generate daughter clocks with frequencies that are derived by multiplying or dividing the original signal.
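The multiply/divide relationship can be written as f_out = f_in x M / D. A minimal Python sketch, where the 50 MHz input and the M and D values are made-up example numbers rather than limits of any particular device:

```python
from fractions import Fraction


def synthesized_frequency(f_in_hz, multiply, divide):
    """Daughter clock frequency derived from the input clock: f_in * M / D."""
    return Fraction(f_in_hz) * multiply / divide


# Hypothetical example: a 50 MHz board oscillator, M = 7, D = 2.
print(synthesized_frequency(50_000_000, 7, 2))
```

Using Fraction keeps ratios such as 2/3 exact instead of rounding them to floating point.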
DCM: Phase Shifting
• Certain designs require the use of clocks that are phase shifted (delayed) with respect to each other.
• Some clock managers allow you to select from fixed phase shifts of common values such as 120° and 240° (for a three-phase clocking scheme)
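A phase shift is a fraction of one clock period, so the delay is (phase / 360) x (1 / f). A short Python sketch with a hypothetical 100 MHz clock:

```python
from fractions import Fraction


def phase_delay_seconds(frequency_hz, phase_degrees):
    """Delay that corresponds to a phase shift at the given clock frequency."""
    period = Fraction(1, frequency_hz)
    return Fraction(phase_degrees, 360) * period


# Three-phase scheme on an assumed 100 MHz clock (10 ns period).
for phase in (0, 120, 240):
    print(phase, "degrees:", float(phase_delay_seconds(100_000_000, phase)) * 1e9, "ns")
```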
Basic I/O Block Structure
(Figure: three flip-flops, each with D, CE, Q, and SR pins, serving the three-state control, the output path, and the input path. Signals shown: Three-State, Output, Clock, Set/Reset, an FF Enable per flip-flop, Direct Input, and Registered Input.)
IOB Functionality
• IOB provides interface between the package pins and CLBs
• Each IOB can work as uni- or bi-directional I/O
• Outputs can be forced into High Impedance
• Inputs and outputs can be registered
  – advised for high-performance I/O
• Inputs can be delayed
Configurable I/O Impedances
• The signals used to connect devices on today’s circuit boards often have fast edge rates.
• In order to prevent signals reflecting back, it is necessary to apply appropriate terminating resistors to the FPGA input and output pins.
• In the past, resistors were applied as discrete components (outside the FPGA).
• Today's FPGAs allow the use of internal terminating resistors whose value can be configured by the user.
Spartan 3 Family Attributes
FPGA Nomenclature
Spartan-3 FPGA Family Members
2001 – Virtex-II FPGA Family
• Virtex-II FPGA introduced, followed by Virtex-II Pro in 2003
  – 444 18x18 multipliers & 18 kbit block RAMs introduced
  – Gbit serial I/O communications & PowerPC processors introduced
  – Complex floating-point algorithm implementation now possible
Virtex-II Pro (Selection)
• 3,000 to 50,000 logic cells
• PowerPCs
• Up to 16 serial transceivers
  – 622 Mbps to 3.125 Gbps
• 12 to 216 multipliers
• 200k to 4M bits RAM
• 204 to 852 I/Os
Embedded Processor Cores (Hard and Soft)
• The majority of designs make use of microprocessors.
• These appeared as discrete devices on the circuit board.
• Lately, high-end FPGAs have become available that contain one or more embedded microprocessors (referred to as microprocessor cores).
• There are two types of cores:
  – A hard microprocessor core is implemented as a dedicated predefined block (two approaches)
  – A soft microprocessor core is implemented by configuring a group of programmable logic blocks to act as a microprocessor.
Embedded Core (Inside)
• Xilinx and Altera tend to embed one or more microprocessor cores directly into the main FPGA fabric (PowerPC)
• In this case the design tools have to be able to take account of the presence of these blocks in the fabric (any memory used by the core is formed from the embedded RAM blocks).
• The main advantage of this scheme is the speed gained from having the processor core in close proximity to the FPGA fabric.
Soft Core
• As opposed to embedding a microprocessor physically into the fabric of the chip, it is possible to configure a group of programmable logic blocks to act as a microprocessor.
• Soft cores are simpler (more primitive) and slower than their hard-core counterparts.
ADVANTAGE?
1. The main advantage of this scheme is that the user need only implement a core if he/she needs it.
2. Also, the user can instantiate as many cores as they require until they run out of resources!
Basic Architecture 74
Virtex Architectures
Other families include
• Virtex-II Pro
• Virtex-4
• Virtex-5
The latest family is
• Virtex-6
Built for high-performance applications
Virtex-II Pro Architecture
• High-performance true dual-port RAM: 8 Mb
• SelectIO™-Ultra Technology: 1164 I/O
• Advanced FPGA logic: 99k logic cells
• XtremeDSP functionality: embedded multipliers
• RocketIO™ and RocketIO X high-speed serial transceivers: 622 Mbps to 3.125 Gbps
• PowerPC™ processors, 400+ MHz clock rate: 2
• XCITE digitally controlled impedance: any I/O
• DCM™ digital clock management: 12
• 130 nm, 9-layer copper in 300 mm wafer technology
Contains embedded processors and multi-gigabit transceivers
Virtex-4 Family
Advanced Silicon Modular Block (ASMBL) Architecture
Optimized for logic, embedded, and signal processing

Resource        LX              FX               SX
Logic           14K–200K LCs    12K–140K LCs     23K–55K LCs
Memory          0.9–6 Mb        0.6–10 Mb        2.3–5.7 Mb
DCMs            4–12            4–20             4–8
DSP Slices      32–96           32–192           128–512
SelectIO        240–960         240–896          320–640
RocketIO        N/A             0–24 Channels    N/A
PowerPC         N/A             1 or 2 Cores     N/A
Ethernet MAC    N/A             2 or 4 Cores     N/A
Virtex-4 Architecture
• 1 Gbps SelectIO™: ChipSync™ source synch, XCITE active termination
• Smart RAM: new block RAM/FIFO
• Xesium clocking technology: 500 MHz
• PowerPC™ 405 with APU interface: 450 MHz, 680 DMIPS
• Tri-mode Ethernet MAC: 10/100/1000 Mbps
• RocketIO™ multi-gigabit transceivers: 622 Mbps–10.3 Gbps
• XtremeDSP™ technology slices: 256 18x18 GMACs
• Advanced CLBs: 200K logic cells
Virtex-5 Family
• SelectIO with ChipSync Technology and XCITE DCI
• RocketIO™ transceiver options
  – Low-power GTP: up to 3.75 Gbps
  – High-performance GTX: up to 6.5 Gbps
• Increased memory capacity and performance
  – Also important for embedded processing, complex IP, etc.
XtremeDSP DSP48A Slice
DSP48 Comparison
Function | DSP48 | DSP48E | DSP48A | Benefit
Multiplier | 18 x 18 | 25 x 18 | 18 x 18 | Reduces FPGA resource needs for DSP algorithms.
Pre-Adder | No | No | Yes | Reduces critical-path timing in FIR filter applications, giving better performance. Important in FIR filter construction.
Cascade Inputs | One | Two | One | Enables fast data path chaining of DSP48 blocks for larger filters.
Cascade Output | Yes | Yes | Yes | Enables fast data path chaining of DSP48 blocks for larger filters.
Dedicated C input | No | Yes | Yes | The C input supports many 3-input mathematical functions, such as 3-input addition and 2-input multiplication with a single addition, and the very valuable rounding of multiplication away from zero.
Adder | 3 input 48 bit | 3 input 48 bit | 2 input 48 bit | Supports simple add and accumulate functions.
Dynamic Opmodes | Yes | Yes | Yes | One DSP48 can provide more than one function: multiply, multiply-add, multiply-accumulate, etc.
ALU Logic Functions | No | Yes | No | Similar to the ALU of a microprocessor. Enables selection of the ALU function (add, subtract, or compare) on a clock-cycle basis.
Pattern Detect | No | Yes | No | Supports convergent rounding, underflow/overflow detection for saturation arithmetic, and auto-resetting counters/accumulators.
SIMD ALU Support | No | Yes | No | Enables parallel ALU operations on multiple data sets.
Carry Signals | Carry In | Carry In & Out | Carry In & Out | Supports fast carry functions between DSP blocks. Often a speed-limiting path.
Enables IP Portability, Protects Design Investments
High-performance Clocking
Addressing the Broad Range of Technical Requirements
(Figure: market size plotted against application market segments, + 100s more.)
• Spartan-6 LX: lowest cost logic + DSP
• Spartan-6 LXT: lowest logic + high-speed serial
• Virtex-6 LXT: high logic density + serial connectivity
• Virtex-6 SXT: DSP + logic + serial connectivity
• Virtex-6 HXT: ultra high-speed serial connectivity + logic
Designers Eccentrics
• Higher System Performance
  – More design margin to simplify designs
  – Higher integrated functionality
• Lower System Cost
  – Reduce BOM
  – Implement design in a smaller device & lower speed-grade
• Lower Power
  – Help meet power budgets
  – Eliminate heat sinks & fans
  – Prevent thermal runaway
Logic Cells | Up to 55K | Up to 150K
LUT Design | 4-input LUT + FF | 6-input LUT + 2FF
Block RAM (Mbit) | Up to 2 Mbit | Up to 5 Mbit
Transceiver Count / Speed | no | Up to 8 / Up to 3.125 Gbps
Voltage Scaling | No (1.2V only) | Yes (1.2V, 1.0V)
Static Power (typ mW) | 11 mW (smallest density) | Up to 60% less!
Memory Interface | 400 Mbps | DDR3 800 Mbps
Max Differential IO | 640 Mbps | 1050 Mbps
Multipliers/DSP | Up to 126 Multipliers / DSP | Up to 184 DSP48 Blocks
Memory Controllers | no | Up to 4 Hard Blocks
Clock Management | DCM Only | DCM & PLL
PCI Express Endpoint | no | Yes, Gen 1
Security | Device DNA Only | Device DNA & AES
Spartan-6 LX / LXT FPGAs
** All memory controllers support a x16 interface, except in the CS225 package, where only x8 is supported
FPGA Design Flow
Design process (1)
Specification:
Design and implement a simple unit that speeds up encryption with an RC5-like cipher with a fixed key set on an 8031 microcontroller. Unlike in experiment 5, this time your unit has to be able to perform the encryption algorithm by itself, executing 32 rounds…..

Verilog description (Your Verilog Source Files):

entity AES_core is
  port (
    clock, reset, encr_decr : in  std_logic;
    data_input  : in  std_logic_vector(31 downto 0);
    data_output : out std_logic_vector(31 downto 0);
    out_full    : in  std_logic;
    key_input   : in  std_logic_vector(31 downto 0);
    key_read    : out std_logic
  );
end AES_core;

Flow: Specification, then source description, then functional simulation, then synthesis, then post-synthesis simulation.
Design process (2)
• Implementation (Mapping, Placing & Routing)
• Timing simulation
• Configuration
• On chip testing
Design Process control from Active-HDL
Logic Synthesis
VHDL description to circuit netlist

architecture MLU_DATAFLOW of MLU is
  signal A1 : STD_LOGIC;
  signal B1 : STD_LOGIC;
  signal Y1 : STD_LOGIC;
  signal MUX_0, MUX_1, MUX_2, MUX_3 : STD_LOGIC;
begin
  A1 <= A when (NEG_A = '0') else not A;
  B1 <= B when (NEG_B = '0') else not B;
  Y  <= Y1 when (NEG_Y = '0') else not Y1;

  MUX_0 <= A1 and B1;
  MUX_1 <= A1 or B1;
  MUX_2 <= A1 xor B1;
  MUX_3 <= A1 xnor B1;

  with (L1 & L0) select
    Y1 <= MUX_0 when "00",
          MUX_1 when "01",
          MUX_2 when "10",
          MUX_3 when others;
end MLU_DATAFLOW;
Synthesis Tools
… and others
XST
Features of synthesis tools
• Interpret RTL code
• Synplify Pro: produces synthesized circuit netlist in a standard EDIF (.edf) format
  – Can optionally produce a .VHM (VHDL code merged into one) file for post-synthesis simulation
• XST: produces synthesized circuit netlist in NGC format
• Netlist is composed of gates in the particular Xilinx implementation library
  – http://toolbox.xilinx.com/docsan/xilinx9/books/manuals.pdf has information on libraries
• Give preliminary performance estimates
• Some can display circuit schematics corresponding to the EDIF netlist
Timing report after synthesis
Performance Summary
*******************

Design Information
------------------
Command Line   : c:\Xilinx\bin\nt\map.exe -p 2S200FG256-6 -o map.ncd -pr b -k 4 -cm area -c 100 -tx off exam1.ngd exam1.pcf
Target Device  : xc2s200
Target Package : fg256
Target Speed   : -6
Mapper Version : spartan2 -- $Revision: 1.26.6.4 $
Mapped Date    : Wed Nov 02 11:15:15 2005
Map report
Design Summary
--------------
Number of errors: 0
Number of warnings: 0
Logic Utilization:
  Number of Slice Flip Flops: 144 out of 4,704 3%
  Number of 4 input LUTs: 173 out of 4,704 3%
Logic Distribution:
  Number of occupied Slices: 145 out of 2,352 6%
  Number of Slices containing only related logic: 145 out of 145 100%
  Number of Slices containing unrelated logic: 0 out of 145 0%
  *See NOTES below for an explanation of the effects of unrelated logic
Total Number 4 input LUTs: 210 out of 4,704 4%
  Number used as logic: 173
  Number used as a route-thru: 5
  Number used as 16x1 RAMs: 32
Number of bonded IOBs: 74 out of 176 42%
Number of GCLKs: 1 out of 4 25%
Number of GCLKIOBs: 1 out of 4 25%
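The figures in the map report can be cross-checked with a few lines of Python. The percentages in the report are consistent with integer truncation rather than rounding; that is an inference from the numbers above, not a documented property of the tool.

```python
def utilization_percent(used, available):
    """Whole-number utilization, truncated toward zero."""
    return (100 * used) // available


# (resource, used, available) taken from the map report above.
REPORT = [
    ("Slice Flip Flops", 144, 4704),
    ("4 input LUTs (logic)", 173, 4704),
    ("Occupied Slices", 145, 2352),
    ("Total 4 input LUTs", 210, 4704),
    ("Bonded IOBs", 74, 176),
    ("GCLKs", 1, 4),
]

for name, used, available in REPORT:
    print(f"{name}: {used} out of {available} = {utilization_percent(used, available)}%")

# The LUT total is the sum of logic, route-thru, and 16x1 RAM uses.
TOTAL_LUTS = 173 + 5 + 32
```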
Place & route report
Timing Score: 0

Asterisk (*) preceding a constraint indicates it was not met.
This may be due to a setup or hold violation.

--------------------------------------------------------------------------------
Constraint | Requested | Actual | Logic

Minimum input required time before clock: 11.442ns
Minimum output required time after clock: 11.491ns
Post-place-and-route simulation
• After place-and-route is performed, you can do post-place-and-route simulation
  – Now have real timing information!
  – Also can do static timing analysis: shows the worst-case critical path in the circuit
Configuration
• Once a design is implemented, you must create a file that the FPGA can understand
  – This file is called a bit stream: a BIT file (.bit extension)
• The BIT file can be downloaded directly to the FPGA, or can be converted into a PROM file which stores the programming information
Configuration of SRAM based FPGAs
System Gates vs. Real Gates
• One common metric used to measure the size of a device in the ASIC world is that of equivalent gates (e-gates)
• Convention used:
  – A 2-input NAND function represents one equivalent gate.
  – An equivalent gate consists of an arbitrary number of transistors.
• Different vendors provide different functions in their cell libraries, where each implementation of each function requires a different number of transistors (difficult to compare capacity/complexity)
• Solution: assign each function an equivalent gate value and sum all these values.
• How can we establish a basis for comparison between FPGAs and ASICs?
• Can an ASIC of 500,000 equivalent gates that needs to be migrated into an FPGA fit into a particular FPGA?
FPGAs: System Gates
• System gates: a 4-input LUT can be used to represent anywhere between one and more than twenty 2-input primitive logic gates.
• Rule of thumb?
  – Divide the system gates value by three, so three million FPGA system gates would equate to one million ASIC equivalent gates!
• However, to make comparisons between two different implementations on an FPGA (e.g. floating-point adder vs. fixed-point adder), designers should use the resources available in an FPGA:
  – Number of 4-input LUTs used
  – Number of embedded multipliers
  – Number of embedded RAM blocks
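The divide-by-three rule of thumb, applied to the migration question from the previous slide, can be sketched in Python. The function names are illustrative, and the rule is only a rough estimate, as the slide itself cautions.

```python
def system_to_equivalent_gates(system_gates):
    """Rule of thumb: three FPGA system gates per ASIC equivalent gate."""
    return system_gates // 3


def asic_fits(asic_equivalent_gates, fpga_system_gates):
    """Rough check of whether an ASIC design might fit a given FPGA."""
    return asic_equivalent_gates <= system_to_equivalent_gates(fpga_system_gates)


print(system_to_equivalent_gates(3_000_000))
print(asic_fits(500_000, 1_500_000))
```

By this estimate a 500,000 e-gate ASIC needs an FPGA rated at about 1.5 million system gates or more; LUT, multiplier, and block RAM counts remain the more reliable basis for comparison.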
State-of-the-Art FPGAs
• 65-90 nm process on 300 mm wafers
  – Lower cost per function (LUT + register)
  – Smaller and faster transistors: higher speed
• System speed up to 500 MHz
  – Mainly through smart interconnects, clock management,
• Delay increases due to programmable switches in the FPGA routing architecture
• Area
  – Configuration cells and programmable resources incur substantial area penalty
• Power
  – Typically not suited for low-power applications
(Figure: ASIC vs. FPGA compared on performance, cost, and time to market; FPGA marked "Need to improve".)
Conclusion
• FPGAs are the main enabler of Reconfigurable Computing Systems (RCS)
• FPGAs fill the gap between Instruction Set Processors (GPs) and ASICs.
  – Advantages: flexible, programmable
  – Disadvantages: power dissipation, performance w.r.t. ASICs
• Applicability of FPGAs relies on CAD tools provided by different vendors such as Xilinx and Altera
• RCS can be realized with several technologies:
  – FPGAs: Fine/Medium Grain
  – Coarse Grain Reconfigurable Architectures: CGRAs