Xcell Journal Issue 52 - Xilinx

ISSUE 52, FIRST QUARTER 2005, XCELL JOURNAL

XILINX, INC.

THE AUTHORITATIVE JOURNAL FOR PROGRAMMABLE LOGIC USERS


Achieving Breakthrough Performance at the Lowest Cost


Special Edition

In case you haven’t heard, Xilinx recently announced its new Virtex-4™ family of FPGAs. In this special edition of Xcell Journal, you’ll find articles devoted exclusively to Virtex-4 business viewpoints, system design challenges, engineering solutions, and engineering references.

Our “View from the Top” article is by Erich Goetting, Xilinx Vice President and General Manager of the Advanced Products Division. Erich presents an overview of the new Virtex-4 family, and gives you a guided tour of some of the new Virtex-4 technologies, as well as the inspiration and rationale behind them. Other articles in the Business Viewpoint section discuss how these new Virtex-4 devices, based on 90 nm technology, have greatly expanded high-performance processing and system integration.

You’ll also find technical articles written by Xilinx marketing, applications, and development staff, as well as our partners and customers, including:

• System Design Challenges articles emphasize the Virtex-4 family advantages and leadership themes. These technical articles outline design challenges and demonstrate how the Virtex-4 solution addresses these challenges.

• Engineering Solutions articles demonstrate some of the key capabilities of Virtex-4 FPGAs and how they are used in a design. These articles provide in-depth descriptions of Virtex-4 features, IP, and tools.

• The Engineering Reference section describes some of the Virtex-4 hardware development platforms and other design solutions, to help you determine which platform is best for your application and design task.

It’s Time to Re-Subscribe

This issue marks the 16th anniversary of our Xcell Journal. From its humble beginnings in the fourth quarter of 1988 as an eight-page, two-color newsletter, the journal has grown into an award-winning publication printed in five languages and distributed in 144 countries with a circulation of more than 60,000 readers.

Periodically, we must clean our mailing database. Beginning January 1, 2005, you must re-subscribe to continue receiving the Xcell Journal FREE. If you subscribed after January 1, 2005, you do not have to re-subscribe. If you subscribed before this date, please visit our site at www.xilinx.com/xcell/subscribe and take a minute to renew your FREE subscription and ensure its uninterrupted delivery.

I want to thank all of you, our readers, for your continued interest and support of the Xcell Journal. Please feel free to drop me a note at [email protected] about your suggestions on how we may improve. I’d like to hear from you.

LETTER FROM THE EDITOR

Xilinx, Inc.
2100 Logic Drive
San Jose, CA 95124-3400
Phone: 408-559-7778
FAX: 408-879-4780

© 2005 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. PowerPC is a trademark of IBM, Inc. All other trademarks are the property of their respective owners.

The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.

Continuing Excellence

Forrest Couch
Managing Editor

EDITOR IN CHIEF Carlis [email protected]

MANAGING EDITOR Forrest [email protected]

ASSISTANT MANAGING EDITOR Charmaine Cooper Hussain

XCELL ONLINE EDITOR Tom [email protected]

ADVERTISING SALES Dan Teie 1-800-493-5551

ART DIRECTOR Scott Blair

TABLE OF CONTENTS

BUSINESS VIEWPOINTS
This section discusses how the new Virtex-4 devices, based on 90 nm technology, have greatly expanded high-performance processing and system integration.

SYSTEM DESIGN CHALLENGES
This section emphasizes the Virtex-4 family advantages and leadership themes. These articles outline design challenges and demonstrate how the Virtex-4 solution addresses these challenges.

ENGINEERING SOLUTIONS
This section demonstrates some of the key capabilities of Virtex-4 FPGAs and how they are used in a design. These articles provide in-depth descriptions of Virtex-4 features, IP, and tools.

ENGINEERING REFERENCE
This section describes some of the Virtex-4 hardware development platforms and other design solutions, to help you determine which platform is best for your application and design task.

The latest FPGAs from Xilinx set new records in capacity, capability, performance, power efficiency, and value.

Introducing the New Virtex-4 FPGA Family

VIEW FROM THE TOP

View from the Top . . . 6

BUSINESS VIEWPOINTS
Will the Evolution of Platform FPGAs Mean the End for ASICs and ASSPs? . . . 10
EasyPath FPGAs Beat ASIC Prices . . . 12

SYSTEM DESIGN CHALLENGES
The Virtex-4 Power Play . . . 16
Deliver Efficient SPI-4.2 Solutions with Virtex-4 FPGAs . . . 20
Virtex-4 Memory Interfaces . . . 24
Designing with the Virtex-4 XtremeDSP Slice . . . 28
Designing for Signal Integrity . . . 32
Accelerated System Performance with APU-Enhanced Processing . . . 36
Solving the Signal Integrity Challenge . . . 40
Using FPGAs in Wireless Base Station Designs . . . 42
Implementing a Cable Modem Termination System . . . 46
Developing Next-Generation Telecommunication Networks . . . 50
Virtex-4 FPGAs for Software Defined Radio . . . 54
Virtex-4 FPGAs in Rugged LCD Monitors . . . 56

ENGINEERING SOLUTIONS
ISE 6.3 Software – Unleash the Power of Virtex-4 FPGAs . . . 58
FIFOs Made Easy . . . 61
Digital Clock Management . . . 64
Virtex-4 Clocking Resources . . . 66
Alpha Blending Two Data Streams Using a DSP48 Technique . . . 68
Dynamic Phase Alignment with ChipSync Technology . . . 72
Lock Your Design with the Virtex-4 Security Solution . . . 75
Dynamic Reconfiguration of Functional Blocks . . . 78
Designing with Virtex-4 Embedded Tri-Mode Ethernet MAC . . . 80
Emerging Design Methodologies Elicit the Power of Virtex-4 FPGAs . . . 84
Integrate EDK-Created Embedded Processor Subsystems . . . 88
Optimizing Virtex-4 High-Performance Designs . . . 92
Selecting Connectors for Multi-Gigabit Transceiver Designs . . . 96
Xilinx/Micron Partner to Provide High-Speed Memory Interfaces . . . 100
Harvesting the Flexibility of Virtex-4 RocketIO Transceivers . . . 102
Optimize Memory Subsystem Performance with Network FCRAM . . . 105

GENERAL
Using Spartan-3 FPGAs to Implement High-Performance DSP . . . 108

ENGINEERING REFERENCES
Virtex-4 ML401 Evaluation Platform . . . 112
Virtex-4 FPGA Source-Synchronous Interfaces Design Kit . . . 113
Virtex-4 ML461 Advanced Memory Development System . . . 114
Memec Virtex-4 Board Solutions . . . 115
Avnet Virtex-4 Evaluation Kits . . . 116
Nu Horizons Virtex-4 Development Platform . . . 117
TED DDR2 Memory Evaluation Board . . . 118

Welcome to the Xilinx® Virtex-4™ edition of the Xcell Journal. We’ve created this special issue to show you the new Virtex-4 FPGA family, and how its innovations enable the creation of next-generation systems that do more than was thought possible only a few years ago.

In this article, I’ll take you behind the scenes for a guided tour of some of the new technologies, as well as a bit of the inspiration and rationale behind them.

With more than 100 innovations, the Virtex-4 family represents a new milestone in the evolution of FPGA technology. After conducting extensive interviews with leading design engineers worldwide, we knew that they wanted the following things in an advanced next-generation FPGA family:

• Higher performance

• Higher logic density

• Lower power

• Lower cost

• More advanced capabilities


by Erich Goetting
Vice President & General Manager, Advanced Products Division
Xilinx, [email protected]

It’s relatively easy to deliver on one or two of these items – our challenge was to deliver all of them at the same time. We did this through a combination of innovative process and circuit design, process development, the ASMBL architectural approach, and the use of advanced embedded functions.

Development work on the Virtex-4 family (code-named “Whitney” after the highest mountain in the continental United States) began more than two years ago. It represents the creativity and dedication of hundreds of engineers, spanning integrated circuit design and layout, software and IP development, process development, testing and characterization, systems and applications engineering, technical documentation, and product marketing.

One of the most remarkable developments embodied in the new Virtex-4 FPGA family is the ASMBL architecture, which represents a fundamentally new way of constructing the FPGA floor plan and its interconnect to the package. First of all, ASMBL enables I/O pins, clock pins, and power and ground pins to be located anywhere on the silicon chip, not just along the periphery as with previous approaches. This in turn allows power and ground pins to be brought directly into the center of the silicon die, thereby significantly reducing on-chip IR drops that can occur with the largest FPGAs running at the highest frequencies.

Clock input pins are also located in the center of the die, which reduces clock latency. This is because clock networks need to have equal delay to all endpoints (that is, minimum skew), and thus the clock must emanate from the center. With periphery-connected clock input pins, the signal first traverses from the edge of the die to the center, and is then distributed to all regions. The Virtex-4 ASMBL design eliminates this traversal completely, and thus directly reduces the clock network propagation delay.

In addition to its electrical advantages, ASMBL provides another significant benefit in that it allows a more flexible – and thus more precise – allocation of on-chip resources.

material, which is copper rather than aluminum (the traditional material). More layers provide more routing in less space and shorter connection distances. Copper reduces resistance compared to aluminum, and thus speeds signal interconnect and reduces on-chip power-distribution IR drop. As clock rates go up and voltages go down, these considerations have become increasingly important, and have driven the industry-wide shift to copper interconnect.

The Virtex-4 logic fabric was completely re-engineered to fully take advantage of the 90 nm triple-oxide CMOS process, resulting in the highest performance fabric ever, with system clock rates in excess of 500 MHz (at three LUT levels). At the same time, static power was cut in half compared to 130 nm Virtex-II Pro™ devices, as was dynamic power.

Thus, while some industry pundits were proclaiming that the future of deep submicron CMOS devices was getting hotter and hotter, with chip temperatures destined to reach that of rocket nozzles and the surface of the sun, the Virtex-4 design’s creative approach has turned that conventional wisdom on its head, resulting in overall power reductions of 50% compared to our previous 130 nm generation. In many applications, such as DSP functions, power levels are reduced even more – as much as 90%. No wonder design engineers say that Virtex-4 FPGAs are cool – they literally are.

High-Performance Clocking

Clocks were rated as one of the most important and critical FPGA resources in our surveys of design engineers. Quantity, quality, connectivity, frequency, duty cycle, jitter, and skew all made a big difference.

To take clocking to the next level in Virtex-4 devices, all global clock resources were made fully differential, thereby reducing skew, jitter, and duty-cycle distortion. This marks the first implementation of differential clocking in a programmable logic device. Not only that, but the number of global clocks was increased to 32, for every

That in turn has enabled us to offer Virtex-4 devices in three unique platforms, each with a different mix of on-chip resources:

• The LX platform, optimized for logic applications

• The SX platform, optimized for high-end DSP applications

• The FX platform, optimized for embedded processing and high-speed serial applications

A Look Inside the Virtex-4 FPGA

At the heart of the Virtex-4 FPGA is our next-generation 90 nm triple-oxide 10-layer copper CMOS process technology. While that’s quite a lot of adjectives, every one of them is incredibly important. The first, 90 nm, refers to the “drawn” gate length of the smallest transistors. As transistors get smaller, they get faster, use less dynamic power, and enable higher complexity at lower price points. Chip designers think in terms of “transistor budgets,” which are now in the billion transistor range.

Triple-Oxide 90 nm CMOS Technology

Triple-oxide technology refers to the number of transistor oxide thicknesses available in the process. More oxide thicknesses allow more tuning of performance and power in the device circuitry, and enable Virtex-4 devices to deliver industry-leading performance while dramatically lowering power consumption.

One of our key inputs from many engineers was that performance and power were very important constraints in their systems designs, and that they needed both high performance and low power. With a dual-oxide 90 nm process, we would have had to choose performance or power. This wasn’t good enough. By employing a triple-oxide 90 nm process, we achieved high performance and low power.

The 10-layer copper refers to the number of metal interconnect layers and their


device, and internal connectivity options enhanced to allow any region to use any 8 clocks simultaneously.

500 MHz Synchronous Memories and FIFOs

On-chip synchronous block RAM was enhanced to run at 500 MHz. Built-in support for first-in first-out (FIFO) memories was included directly in the block RAM unit, enabling the same 500 MHz operation for FIFOs (approximately a 2X speedup over fabric-based FIFOs), while eliminating the need for any additional logic cells or complex FIFO designs.
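To make the FIFO behavior concrete, here is a minimal behavioral sketch in Python of a fixed-depth FIFO with full and empty flags. The class name `SyncFifo`, the pointer names, and the depth are illustrative assumptions; the sketch models only ordering and flag behavior, not timing or the actual block RAM circuitry.

```python
class SyncFifo:
    """Behavioral model of a fixed-depth FIFO (hypothetical, for illustration)."""

    def __init__(self, depth):
        self.depth = depth
        self.mem = [None] * depth   # storage, standing in for the block RAM
        self.wr_ptr = 0             # next location to write
        self.rd_ptr = 0             # next location to read
        self.count = 0              # number of words currently stored

    @property
    def empty(self):
        return self.count == 0

    @property
    def full(self):
        return self.count == self.depth

    def write(self, word):
        """Store one word; return False (word not stored) if the FIFO is full."""
        if self.full:
            return False
        self.mem[self.wr_ptr] = word
        self.wr_ptr = (self.wr_ptr + 1) % self.depth   # wrap around
        self.count += 1
        return True

    def read(self):
        """Return the oldest word, or None if the FIFO is empty."""
        if self.empty:
            return None
        word = self.mem[self.rd_ptr]
        self.rd_ptr = (self.rd_ptr + 1) % self.depth   # wrap around
        self.count -= 1
        return word
```

The pointer and flag logic shown here is the part that a fabric-based FIFO would otherwise have to build out of logic cells.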

If you’re designing systems requiring ECC (error checking and correcting) memory, Virtex-4 devices have built-in ECC support, with single-bit correct and double-bit detect. ECC is common in infrastructure equipment in networking, telecom, storage, servers, instrumentation, and aerospace applications, and provides the highest levels of data integrity. Like the integrated FIFO support, the integrated ECC eliminates the cost and delay of fabric-based solutions.
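Single-bit correct, double-bit detect behavior can be illustrated with a small extended Hamming code. The Python sketch below protects a 4-bit value with 4 check bits; this word size and bit layout are assumptions chosen to keep the example short, and they are not the word width or encoding used by the Virtex-4 hardware.

```python
def encode(data4):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(data4 >> i) & 1 for i in range(4)]       # d[0] is the least significant bit
    code = [0] * 8                                 # index 0: overall parity, 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]          # covers positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]          # covers positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]          # covers positions with bit 2 set
    code[0] = sum(code[1:]) % 2                    # makes the overall parity even
    return code


def decode(code):
    """Return (data, status); status is 'ok', 'corrected', or 'double-error'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                        # XOR of the positions holding a 1
    overall = sum(code) % 2                        # 0 if the overall parity is still even
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1                        # single error; syndrome 0 is the parity bit itself
        status = "corrected"
    else:
        return None, "double-error"                # nonzero syndrome with even parity: two bits flipped
    data4 = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return data4, status
```

Flipping any one bit of a codeword yields the original data with status `corrected`; flipping any two bits is reported as `double-error` rather than being silently miscorrected.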

Speaking of on-chip memory, Virtex-4 devices continue to offer SelectRAM™ memory, whereby each LUT is transformed into a 16 x 1 RAM, ideally suited for building high-speed register files and local buffers.
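A four-input LUT is a 16-entry, one-bit-wide table, which is why it can double as a 16 x 1 RAM. The Python sketch below models that idea and builds an 8-bit-wide, 16-deep register file from eight such cells. The names `Lut16x1Ram`, `regfile_write`, and `regfile_read` are hypothetical, and the model ignores clocking and port details.

```python
class Lut16x1Ram:
    """Behavioral model of a LUT used as a 16 x 1 RAM (illustrative only)."""

    def __init__(self, init=0):
        self.bits = init & 0xFFFF            # 16 one-bit cells packed into an integer

    def read(self, addr):
        return (self.bits >> (addr & 0xF)) & 1

    def write(self, addr, bit):
        addr &= 0xF
        if bit & 1:
            self.bits |= 1 << addr           # set the addressed cell
        else:
            self.bits &= ~(1 << addr) & 0xFFFF   # clear the addressed cell


def regfile_write(luts, addr, value):
    """Write one word: bit i of the value goes to LUT i at the same address."""
    for i, lut in enumerate(luts):
        lut.write(addr, (value >> i) & 1)


def regfile_read(luts, addr):
    """Read one word by collecting one bit from each LUT."""
    return sum(lut.read(addr) << i for i, lut in enumerate(luts))
```

For example, `luts = [Lut16x1Ram() for _ in range(8)]` gives a 16-entry register file of 8-bit words.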

At the other end of the spectrum, interfaces to external memory devices such as DDR, DDR2, QDR-II, and RLDRAM-II are dramatically enhanced through our new ChipSync™ technology, which offers memory interface speeds at rates limited only by the speed of the external memory devices.

The new Virtex-4 ML461 Advanced Memory Development System contains fully functional and hardware-proven reference designs for all of today’s most popular memory technologies. If you plan to use external memory, I highly recommend that you check this out.

DSP Performance of 256 GigaMAC/s

In the DSP domain, we incorporated some of the world’s fastest multiply accumulate (MAC) technology. The XtremeDSP™ slice can perform an 18 x 18 signed multiply and 48-bit accumulate every 2 ns.
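The multiply-accumulate operation described above can be modeled in a few lines of Python. This is a numerical sketch only: the function names are hypothetical, and the choice to wrap around on accumulator overflow is an assumption, since the article does not say how overflow is handled.

```python
ACC_BITS = 48       # accumulator width (from the article)
OPERAND_BITS = 18   # multiplier operand width (from the article)


def to_signed(value, bits):
    """Interpret the low `bits` of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac(acc, a, b):
    """One multiply-accumulate step: acc + a*b, wrapped to 48 bits (assumed behavior)."""
    a = to_signed(a, OPERAND_BITS)
    b = to_signed(b, OPERAND_BITS)
    return to_signed(acc + a * b, ACC_BITS)


def dot(xs, ys):
    """Dot product computed as a chain of MAC steps, one per pair of operands."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc
```

A FIR filter tap sum or a dot product maps directly onto this pattern, with one MAC step per coefficient.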

The Virtex-4 LX, FX, and SX platforms include the breakthrough XtremeDSP technology. With the new SX platform we did something completely new – we dramatically increased the ratio of DSP units to logic cells. Given the highly integrated nature of XtremeDSP slices, they need only small amounts of logic fabric to implement most common DSP functions, and thus increasing the ratio provides a significant increase in DSP compute power per unit silicon area. In fact, SX devices provide a 10X performance increase per unit cost over previous solutions.

Power is dramatically reduced as well, with more than a 10X reduction for multiply/add functions from previous FPGA solutions. The Virtex-4 SX55 contains 512 XtremeDSP slices, providing an aggregate DSP compute performance of 256 GigaMAC/s, making it one of the most powerful DSP devices ever manufactured.
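The 256 GigaMAC/s figure follows from the numbers already given: 512 slices, each completing one MAC every 2 ns. A short Python check of that arithmetic (the function name is hypothetical, and the calculation assumes every slice is busy on every cycle):

```python
def aggregate_gmacs(slices, mac_period_ns):
    """Peak aggregate throughput in GigaMAC/s, assuming every slice is busy every cycle."""
    macs_per_second_per_slice = 1e9 / mac_period_ns   # 2 ns per MAC is 500 million MAC/s
    return slices * macs_per_second_per_slice / 1e9


print(aggregate_gmacs(512, 2.0))   # 256.0
```

This is a peak number; sustained throughput in a real design depends on how well the algorithm keeps all slices fed.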

The state-of-the-art XtremeDSP slice employs new “silicon algorithms” developed by a company called Arithmatica™. Many different architectures exist for implementing multiplication, and the Arithmatica architecture is truly a breakthrough. We are excited to see it available for the first time to FPGA users. For more information, visit Arithmatica’s website at www.arithmatica.com.

The Evolution of Advanced I/O Technology

I/O continues to be a critical success factor for today’s systems designers. During the last decade, we have seen four major changes in I/O. First was the shift away from 5V, the result of the need to scale voltages as we scaled the transistor. This in turn led to the plethora of I/O standards that we are all familiar with today: SSTL, HSTL, LVDS, and LVCMOS 1.5. The Virtex-4 SelectIO™ resource continues to lead the industry, supporting virtually every I/O standard in use today on every pin.

XCITE On-Chip Termination

The second major change was the transition from lumped loads to transmission line loads – again the direct result of Moore’s Law. As transistors got faster and clock rates increased, I/O edge rates increased as well. But because the propagation speed of signals is a constant, dictated by the speed of light, we entered the realm in which a signal on one end of a wire was no longer the same as the signal on the other end of the same wire. This is what transmission lines are all about, and their appearance during the last few years has driven a sea change in all aspects of signal interconnect and I/O design.

To make sure that these signal “waves” don’t start “splashing” uncontrollably, transmission lines need to be driven, built, and received using proper signal integrity approaches, the most critical of which is termination. Traditionally implemented with discrete resistors on the PCB, termination layouts can become exceedingly difficult around high-density pinouts like those used in FPGAs. This often dictates more PCB layers and thus more system cost.

Virtex-4 FPGAs include our third generation of XCITE™ integrated digitally controlled termination technology. Offering a precisely controlled source impedance at the output drive pin, it is designed to enable the driving of transmission lines without external components, with maximum speed and signal integrity, and with straightforward PCB layout and layer stack-ups.

Likewise, on inputs, XCITE offers parallel termination for single-ended inputs and true differential termination for differential inputs. Termination occurs on the end of the transmission line at the die, not on the way there on the PCB, offering maximum signal integrity. Many customers report that the XCITE technology has saved them many PCB layers, increased PCB packing density, and saved them substantial dollars in their bill of materials.

Source-Synchronous Interfaces

The third major change was the shift from system-synchronous to source-synchronous interfaces. Traditional system-synchronous interfaces work by distributing a single clock to all transmitters and receivers in the system, and transmitting data between source and destination within a single clock cycle. This makes the data rate inversely proportional to the sum of clock-to-out, transmission line delay, and input setup time.

Typically, system-synchronous interfaces top out at speeds in the range of 100 MHz. To go faster, source-synchronous interfaces transmit a clock along with the data, and the receiver uses this clock to capture the data. Using this technique, along with double-data-rate transmissions, enables parallel I/O data rates in excess of 1 Gbps.
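The system-synchronous limit can be put into numbers. Because the data must travel from source to destination within one clock cycle, the minimum clock period is the sum of the three delays named above. The Python sketch below uses made-up delay values (4 ns, 3 ns, 3 ns) and a 500 MHz forwarded clock purely as illustrative assumptions; the function names are hypothetical.

```python
def system_sync_fmax_mhz(t_clock_to_out_ns, t_flight_ns, t_setup_ns):
    """Maximum clock rate when data must arrive within a single clock cycle."""
    min_period_ns = t_clock_to_out_ns + t_flight_ns + t_setup_ns
    return 1000.0 / min_period_ns          # a period in ns gives a frequency in MHz


def ddr_rate_mbps(clock_mhz):
    """Per-signal data rate with double-data-rate signaling (both clock edges used)."""
    return 2 * clock_mhz


# Assumed example delays: 4 ns clock-to-out, 3 ns line delay, 3 ns setup.
print(system_sync_fmax_mhz(4.0, 3.0, 3.0))   # 100.0 (MHz)

# Assumed example: a 500 MHz forwarded clock with DDR signaling.
print(ddr_rate_mbps(500))                    # 1000 (Mbps), that is 1 Gbps
```

With a forwarded clock, the absolute delays largely drop out of the budget and what matters is the relative alignment of clock and data at the receiver.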

The challenge of source-synchronous interfaces is that each interface generates a new clock domain at the receiver. On top of this, to operate at high speeds, the precise alignment of clock and data at the receiver is paramount. To address this new world of source-synchronous interfaces, Virtex-4 devices include the breakthrough ChipSync technology. ChipSync units lie between the SelectIO technology and the core FPGA fabric, are available on every I/O pin on the device, and serve to transmit and receive high-speed source-synchronous data and clocks, achieving speeds of 1 Gbps per pin pair.

On the receiver, precise digital delay lines work internally to align data signals to each other, and then to align these to the received clock. The captured data is synchronized and transferred to the selected FPGA core clock domain.
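The article does not describe how the delay setting is chosen. One common approach, shown here as an assumption and not as the ChipSync algorithm itself, is to sweep the delay taps, record at which taps the data is captured correctly, and then pick the middle of the longest working window. A Python sketch with a hypothetical function name:

```python
def pick_center_tap(samples_ok):
    """Return the middle tap of the longest run of good taps, or None if none work.

    samples_ok[i] is True when data sampled with delay tap i was captured correctly.
    """
    best_start, best_len = None, 0
    run_start = None
    for tap, ok in enumerate(list(samples_ok) + [False]):   # trailing False closes the last run
        if ok and run_start is None:
            run_start = tap                                  # a new run of good taps begins
        elif not ok and run_start is not None:
            run_len = tap - run_start
            if run_len > best_len:
                best_start, best_len = run_start, run_len
            run_start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2                  # center of the widest window
```

For a sweep in which taps 3 through 9 work, the function returns tap 6, which leaves equal margin on both sides of the sampling point.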

To operate at maximum data rates, the transmit and receive units include parallel-to-serial and serial-to-parallel conversion units, respectively. Using ChipSync technology is virtually automatic for most designs, as it is already built into the various Xilinx IP cores and reference designs.

Networking interfaces such as SPI-4.2 and HyperTransport™, and memory interfaces such as DDR, DDR2 SDRAM, and QDR II SRAM, all employ the Virtex-4 ChipSync technology. And if you’re designing your own source-synchronous interface, the ChipSync wizard gives you complete control and an easy-to-use GUI that lets you dial in exactly what you want to build.

Multi-Gigabit Serial Interfaces

The fourth major change in I/O has been the rapid adoption of high-speed serial interfaces. For years, serial interfaces were limited to long-distance communications, such as those used in fiber-optic links in the SONET/SDH world and the Ethernet links like 100BASE-T.

A key breakthrough occurred in the late 1990s, in which high-speed serial transceivers (which traditionally had been designed using complex process technology such as GaAs [Gallium-Arsenide]) were for the first time created using advanced design techniques using standard CMOS. Once implemented in CMOS, these transceivers had lower cost and much lower power, and could even be integrated into complex CMOS chips.

Virtually overnight, gigabit serial technology changed from a rare, expensive, and power-hungry technology to a common, low-cost, and very power-efficient technology. This has been the economic and technical impetus behind the industry’s “Serial Tsunami,” in which interface after interface has shifted from parallel to gigabit serial links. Two common examples are visible in today’s computer architectures, with the shift from parallel PCI to 2.5 Gbps serial PCI-Express™, and the shift from the parallel ATA drive interface to the Serial ATA interface.

There are more than a dozen multi-gigabit serial interfaces in widespread use today, with more being introduced every year. The Virtex-4 FX family provides our third-generation RocketIO™ multi-gigabit serial transceiver technology. Spanning speeds from 622 Mbps to more than 10 Gbps, each Virtex-4 RocketIO transceiver is programmable and can implement a myriad of speeds and serial standards. Link-layer IP is available for such standards as PCI Express, Serial-ATA, FibreChannel, Gigabit Ethernet, and Aurora, to name a few.

In addition, Virtex-4 FX devices each include multiple embedded tri-mode (or 10/100/1000) Ethernet MACs, making implementation of compliant Ethernet devices simpler and faster than ever.

Application-Specific Embedded Processing

Virtex-4 embedded processing solutions include full support for both MicroBlaze™ 32-bit soft CPUs on all devices, and embedded PowerPC™ 32-bit RISC CPUs on all Virtex-4 FX devices. The versatile MicroBlaze soft CPU runs at clock rates over 165 MHz on Virtex-4 devices, and delivers more than 140 DMIPS.

The number of CPUs in one device is limited only by your imagination, and of course by the available logic cells. The powerful PowerPC CPU runs at clock rates up to 450 MHz and delivers up to 702 DMIPS each. The first PowerPC processor available from any manufacturer on 90 nm, the PowerPC processor is incredibly power-efficient, using only 29 mW/DMIPS. This makes it among the lowest power microprocessors available from any manufacturer worldwide.

New Auxiliary Processing Unit (APU) technology connects the CPU to the FPGA fabric, enabling implementation of acceleration hardware for virtually any application. Once only the domain of high-budget ASIC and ASSP design teams, the Virtex-4 FPGA’s architectural ability to combine application-specific hardware acceleration with high-performance RISC CPUs shatters traditional barriers of cost, time-to-market, and risk.

During the next few years, I expect to see more and more instances of application-specific acceleration, as it truly offers the ability to deliver very high performance at low cost and low power. A recent research program completed within Xilinx Research Labs, led by Dr. Kees Vissers, demonstrated a 20-fold speedup for an encryption/decryption (IPSEC) application over the base PowerPC processor. Using only 135 mW, it outperforms a 3.2 GHz Pentium™-4, while at the same time reducing power by 99%. That, in my opinion, is what state-of-the-art embedded processing is all about.

Conclusion

I hope that you’ve enjoyed reading a bit about the Virtex-4 Platform FPGA and the factors that drove its design. From the breakthrough ASMBL architecture and the triple-oxide 90 nm CMOS process technology, to the world’s most capable embedded processing and multi-gigabit serial solutions, Virtex-4 devices offer an unparalleled set of enabling technologies for your next-generation systems designs. I look forward to seeing the creativity of the world’s designers in tomorrow’s products.



by Richard Sevcik
Executive Vice President, Programmable Logic Systems and Intellectual Property/Cores and Software Solutions Groups
Xilinx, [email protected]

The debate over FPGAs as a viable alternative to ASICs and ASSPs has been ongoing for nearly a decade. Industry analysts iSuppli, Gartner Dataquest™, and others have documented the trend in decreasing ASIC design starts and the increase in FPGA design starts.

Next-generation platform FPGA devices based on 90 nm have greatly expanded high-performance processing and system integration options. They continue to push ASIC design starts lower as additional application solutions are defined.

With the beginning of the new millennium, the debate continued with the introduction of Xilinx® Virtex-II™ and Virtex-II Pro™ devices – the industry’s first platform FPGAs. These high-performance devices, with their flexible device integration capability, programmable I/O, and significantly lower overall design cost, helped to usher in and establish SoC design methodology and quickly took over innumerable ASIC SoC designs.

Will the Evolution of Platform FPGAs Mean the End for ASICs and ASSPs?

Today’s multi-platform FPGAs shake up ASIC/ASSP suppliers.

BUSINESS VIEWPOINTS

The addition of high-performance RISC CPUs, block RAM, multi-gigabit high-speed serial I/Os, dedicated DSP functions, and other system enhancements introduced technological advances that further solidified the rise of platform FPGAs over their ASIC SoC counterparts. However, to get high-performance DSP, processing, or connectivity features for a specific application domain, designers were typically forced to purchase the largest, costliest devices. The larger parts had the biggest helpings of advanced features, while the smaller parts had reduced portions of the same.

Today, a new breed of domain-optimized, multi-platform FPGAs from Xilinx – the Virtex-4™ family – promises multi-dimensional application scaling based on required features and cost goals. By combining the economic benefits of an innovative columnar architectural approach with advances in process technology (90 nm/300 mm), Xilinx is poised to move beyond the $5.1 billion programmable logic market to capture additional share in the $84 billion ASIC and ASSP markets (Source: Gartner Dataquest 2007).

Just the Right Mix

Based on the revolutionary Advanced Silicon Modular Block (ASMBL) columnar architectural approach, Xilinx can now cost-effectively develop multiple FPGA platforms, each with different combinations of feature sets. Thus, a specific platform can be optimized specifically for a certain domain of applications – such as logic, DSP, connectivity, and embedded processing – to meet application requirements previously delivered only by ASICs, ASSPs, and similar devices, while remaining programmable at heart.

Not only does the designer or design team have a choice in selecting the ideal platform, they also have a choice in choosing the device size with just the right feature mix to best achieve needed capability and performance at the lowest possible cost.

This unique flexibility and ability to create optimal application domain subsystems sets even higher standards for FPGAs. Devices that are both hardware- and

technology is used throughout the world.No two people use the same technology,systems, or software, nor do they subscribeto or want the same content.

Higher costs and longer design times forASICs and ASSPs relegate their primaryuses to proven lower-risk, very-high-vol-ume applications. The rapid and significantincrease in ASIC development costs clearly

gives the advantage to platform FPGAs intoday’s leading-edge applications. Theoverall cost benefit of zero NRE pushes thehigh-volume ASIC or ASSP crossoverpoint upwards, locking in FPGAs likenever before.

ConclusionDomain-optimized multi-platform FPGAsare revolutionary in their ability to acceler-ate the deployment of FPGA technologyinto many more application areas. Thecombined leverages of reduced risk, dra-matically shorter design cycles, and zeroNRE will soon move all but the highestvolume applications away from cell-basedASIC implementation toward more flexi-ble, forgiving architectures like today’sdomain-optimized FPGAs. For more infor-mation, visit www.xilinx.com/virtex4/.

software-programmable enable more flexi-ble implementation options than eitherASIC or ASSP devices. Reinvestigating,changing, or enhancing system architectureat any time in the development processprovides the ultimate tool kit to meetapplication requirements.

Designers can use this same capabilityto evolve hardware in the field to meet new

requirements or avoid expensive hardwareupgrades. This flexibility becomes para-mount given today’s many emerging andcompeting standards.

The “Total Cost” AdvantageFPGAs have demonstrated a clear and con-sistent trend in reducing cost and makingFPGA technology more suitable for a widerrange of applications. The combination of90 nm silicon fabrication technology with300 mm wafers results in a cumulativeeffect: increasing the number die-per-waferfive times over previous devices. Increasingthe die-per-wafer together with architectur-al integration enables substantially lowersystem costs.

A key and often overlooked componentin favor of programmable logic’s economicadvantage is clearly demonstrated in how



by Gokul Krishnan
Sr. Marketing Manager, Market Specific Products Group
Xilinx, [email protected]

Balaji Thirumalai
Sr. Marketing Manager, Worldwide Marketing
Xilinx, [email protected]

The risk of deploying ASIC solutions has worsened in magnitude with the move to smaller process geometries. As design complexity increases, customers are looking for a viable solution that offers low design, unit, and total costs, high-level system integration, design flexibility, easy-to-use design tools, a rich selection of IP, and fastest time to market.

Customers are increasingly turning to other alternatives to avoid the pitfalls of ASICs – high NRE and re-spin expenses, slow turnaround times, complex design environments, and hidden conversion, verification, and development costs. In this article, we’ll analyze two such alternatives: Xilinx® EasyPath™ FPGAs and structured ASICs.

Structured ASIC product offerings tend to be similar to FPGAs in that they have predefined combinations of gates, memory, and I/Os. However, their architectures tend to trade off flexibility in favor of reduced area to achieve their cost targets. The reality remains that a vast majority of designs intended for ASICs are originally prototyped in an FPGA, yet there are still problems with FPGA-to-structured-ASIC conversions. EasyPath FPGAs offer the best migration path to high-volume production at the lowest cost possible.

EasyPath FPGAs

EasyPath FPGAs are the industry’s only customer-specific and flexible solution for volume production priced lower than structured ASICs.

EasyPath FPGAs are identical to our standard FPGA offerings but use patented testing techniques and customer-specific test patterns to significantly improve FPGA yields. You can reap the benefits of these improved yields in the form of lower costs, because Xilinx only tests those parts of an FPGA that are actually used in your design.

With EasyPath FPGAs, you can realize a 30-80% reduction in prices when you move to high volume, as compared to standard FPGAs. EasyPath FPGAs are available across six platforms, four different product families, and 28 different devices over a range of gate and memory counts.

EasyPath FPGAs are identical to their standard FPGA counterparts, effectively eliminating any conversion work. Once you have frozen your design, Xilinx can deliver EasyPath parts in high volume in eight weeks. This compares favorably against structured ASIC companies, which typically take 12-14 weeks from prototype signoff to production.

Structured ASICs

Structured ASICs are a variant of the gate arrays of yesteryear, but they use a “sea of modules” approach as opposed to a “sea of gates” approach. The architecture of each module varies depending on the vendor, but in general is some combination of NAND gates, inverters, flip-flops, and muxes.

EasyPath FPGAs Beat ASIC Prices


EasyPath is the most comprehensive volume production solution in the industry.


Structured ASICs promise cost savings primarily as a result of customizing fewer mask layers per design, unlike standard cell ASICs that use all-custom metal layers. Structured ASICs use only the top few (typically two to four) metal layers; the base modules are all buried in the lower layers, with their ports coming up to the programmable layers. During the fabrication phase, the connections between various ports are made to realize the requisite logic.

The Lowest Total Cost Solution

Figure 1 shows the comparative economics of standard cell ASICs, structured ASICs, FPGAs, and EasyPath FPGAs. FPGAs have traditionally offered a zero-NRE solution, which has led to their broad adoption. Standard cell ASICs have a high NRE and a relatively low unit cost, but with the overhead discussed earlier. Structured ASICs promise to lower the NRE at a unit cost that is higher than that of standard cell ASICs, but lower than that of standard FPGAs.

With next-generation EasyPath FPGAs, you can now enjoy unit prices as well as NREs that are lower than structured ASICs. The combination of the industry’s lowest NRE charges (starting at $75K); low-cost design tools and IP; prices below structured ASICs; fastest times to production; and no hidden conversion charges shows how EasyPath FPGAs are the industry’s lowest total cost solution for volume production.

Figure 1 – EasyPath FPGAs offer the lowest total cost solution.

Unmatched Choice of Platforms

Structured ASIC vendors can roughly be grouped into two camps based on their ability to address IP-centric designs. On the one hand are those that have a wide portfolio of IP; on the other are companies that typically can only address generic designs. With the recent announcement of next-generation EasyPath FPGAs from Xilinx, both of these segments can be addressed economically and efficiently.

Xilinx now offers four families and six platforms, with 28 devices from which to choose. This comes with all the benefits of the FPGA ecosystem that Xilinx customers are already used to – hard IP such as the IBM™ PowerPC™, MGTs, and XtremeDSP™ blocks, as well as 600+ proven soft IP cores and low-cost design tools.

Some structured ASIC vendors focus exclusively on generic designs or logic-heavy designs. This class of design tends to be very price competitive. Xilinx is now able to offer a more compelling solution than any structured ASIC vendor with its Spartan-3™ EasyPath FPGAs, which are priced below these structured ASICs.

For designs that require a lot of IP and system integration such as PowerPC processors, DSP, high-speed I/O, or Ethernet MACs, translation to a structured ASIC vendor often requires a re-validation of the IP on the vendor’s silicon platform of choice. With Xilinx Virtex-4™ EasyPath solutions, you get the same wide range of validated IP as with standard FPGAs. There is no additional fee required to migrate the IP to a volume solution.

The bottom line is that whether it is a generic design or an IP-centric design, EasyPath FPGAs offer very competitive and cost-effective solutions for high-volume migration when compared to structured ASICs, all from a single trusted supplier. Migration to structured ASICs, on the other hand, can pose a number of challenges.

Conversion-Free Methodology

The vast majority of IC design starts begin with FPGA prototyping, followed by a conversion to a volume solution. This carries the inherent risk of redesigning and re-verifying the design in the target architecture, along with the related costs of re-spins, conversions, and a host of other design issues. The conversion from FPGA to structured ASIC is not seamless; rather, it is fraught with risks.

One issue faced by structured ASIC companies revolves around the mapping of memories from an FPGA to a structured ASIC. FPGAs generally tend to have columnar memory architectures and offer an efficient means to form larger memory structures when required. On the other hand, the use of distributed memory blocks in some structured ASIC architectures can pose problems when large contiguous blocks are required by the design.

The need to join together blocks that are physically separated to form a larger block that is logically monolithic can increase routing congestion. This can not only degrade the access times of those memory structures but also leave fewer routing resources available for logic, thus impacting design performance.

With EasyPath FPGAs, there is no conversion. EasyPath FPGAs are exactly the same as the standard FPGAs on which a design is prototyped – the only difference is that the latter are completely programmable, while the former are not. As a result, the memory mapping and performance achieved in an EasyPath FPGA are identical to those achieved in a standard FPGA.

Another problem that some structured ASIC companies face has to do with pad limitations. It is fairly well known that as process nodes shrink, more and more designs become pad-limited in ASICs. To get an adequate number of pads, structured ASIC vendors sometimes have to grow their die size and increase the effective cost to end customers. This problem is compounded by the fact that structured ASIC I/Os tend not to be as flexible as FPGA I/Os.

To keep I/O structures small and less area-intensive, structured ASIC vendors have to make some difficult choices about what standards they want to address and how. In cases where designs require large buses of input and output I/Os (for example, SSTL2 buses for SDRAM, or HSTL buses for certain telecom protocols), the limitations in the design of I/O structures can make it difficult to achieve pin compatibility in the FPGA-to-ASIC conversion. The end result is that customers have to either re-spin their board or migrate to a larger device – both unpalatable options. None of these are issues with EasyPath FPGAs because of the one-to-one mapping between them and standard FPGAs.

Apart from memory and I/Os, there is a whole other host of issues, including difficulties with IP translation and testing, when moving from FPGAs to structured ASICs. FPGA cost reduction plans that involve converting to structured ASICs in order to get a smaller die are likely to trigger design changes and schedule risks.

The EasyPath solution, on the other hand, is neither an ASIC conversion nor a mask-programmed FPGA. No conversion or silicon differences are involved, so there are no long lead times, no timing or pinout changes, no need for product qualification, no lost feature support, and no risk of a design failure. In addition to eliminating any hidden design or qualification expenses and the risks of ASIC conversions, EasyPath FPGAs are delivered in eight weeks in production volume, allowing you the benefits of faster time to market or more time to perfect your designs.

Unprecedented Flexibility

One of the major advantages of FPGAs over ASICs is the flexibility to make design changes in case of a specification change or a design error. Traditionally, customers have had to forgo this advantage as they move from FPGAs to an inflexible custom solution like standard cell or structured ASICs. Now, with EasyPath FPGAs, Xilinx offers two flexibility features that allow you to enjoy some of the FPGA advantages when you go to volume production at prices below structured ASICs.

Spartan-3 and Virtex-4 EasyPath FPGAs enable you to buy a custom device that supports two applications – one for diagnostic testing and one for the actual application. EasyPath FPGAs can now be tested for two designs, or two variations of the same design. This means that you can now enjoy greater flexibility while also saving on BOM and inventory costs. For example, you can use one bitstream to perform system diagnostics on the entire system and then load the second application-specific bitstream. This reduces associated manufacturing system costs.

Xilinx offers EasyPath FPGA devices with LUTs and I/Os tested for drive strengths and slew rates, allowing revisions like engineering change orders at the LUT or I/O level. In many instances, even after the customer design is fully functional and certified, flexibility with I/O drive strengths and slew rates is critical.

For instance, a line card in a router might need to have the drive strength (and slew rate) adjusted a notch or two depending on what load it sees. EasyPath customers can choose to have a range of drive strengths available to them for certain I/Os. The unique flexibility is implemented on an as-needed basis. This eliminates any re-spin and conversion-related engineering effort, delay, and expenses associated with ASICs and structured ASICs.

Conclusion

EasyPath FPGAs from Xilinx offer a seamless one-for-one, no-conversion volume production solution across an industry-leading portfolio of product families. The comparison between EasyPath FPGAs and structured ASICs shown in Table 1 illustrates why EasyPath is a much superior solution. Unlike structured ASICs, EasyPath customers can get to production volumes much faster and now can do so at lower prices as well.

For more information about the next-generation EasyPath FPGAs, please visit www.xilinx.com/easypath/, where you can find information on platform support and flexibility features, and use an online cost calculator to find out why EasyPath FPGAs are the lowest total cost solution in the industry.


Selection Criteria | Structured ASICs* | EasyPath FPGAs
Time to Prototype Samples | 4-8 weeks | 0 weeks
Total Time to Volume Production | 12-15 weeks | 8 weeks
Vendor NRE/Mask Costs | $100K-$200K | $75K
Design Costs for Conversion | $250K-$300K | $0
Additional Cost of Tools for Conversion | $100K-$200K | $0
Unit Costs | Low | Low
Risk | High | Low
Flexibility to Make Changes In-System | Inflexible | Flexible
Design Conversion from Prototype to Production | Additional Engineering | Conversion Free

*Xilinx market analysis

Table 1 – EasyPath FPGAs versus structured ASICs
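To make the comparison in Table 1 concrete, the up-front cost rows can simply be added up. The following is a minimal sketch that uses only the dollar ranges quoted in Table 1. The names `UPFRONT_COSTS` and `total_upfront` are illustrative, and treating the three cost rows as additive is an assumption made here, not a Xilinx costing method.

```python
# Up-front (non-unit) costs in US dollars, copied from Table 1 as (low, high) ranges.
# Assumption: the three rows are independent and can be summed.
UPFRONT_COSTS = {
    "Structured ASIC": {
        "vendor_nre_mask": (100_000, 200_000),
        "design_conversion": (250_000, 300_000),
        "conversion_tools": (100_000, 200_000),
    },
    "EasyPath FPGA": {
        "vendor_nre_mask": (75_000, 75_000),
        "design_conversion": (0, 0),
        "conversion_tools": (0, 0),
    },
}


def total_upfront(solution):
    """Return the (low, high) sum of all up-front cost rows for one solution."""
    rows = list(UPFRONT_COSTS[solution].values())
    low = sum(lo for lo, _ in rows)
    high = sum(hi for _, hi in rows)
    return low, high


if __name__ == "__main__":
    for name in UPFRONT_COSTS:
        low, high = total_upfront(name)
        print(f"{name}: ${low:,} to ${high:,} before the first production unit")
```

Under these assumptions the structured ASIC path carries $450K to $700K of up-front cost, against $75K for EasyPath. Unit prices are not included because Table 1 gives them only qualitatively.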



by Matt Klein
Sr. Staff Engineer, Applications Engineering, Advanced Products Division
Xilinx, [email protected]

Device power consumption is a primary issue in the semiconductor industry – as process technologies get smaller and faster, they normally consume more power, putting power concerns and performance at odds. The new Virtex-4™ FPGA family from Xilinx® employs innovative architectural features and clever IC design techniques that dramatically reduce power consumption, without compromising performance. This bucks expected trends normally associated with the reduced feature sizes of 90 nm process technology.

In this article, we’ll explore how Xilinx IC designers achieved remarkable power efficiency in the high-performance Virtex-4 FPGA.

Components of Power Consumption

There are two main components to power consumption: static and dynamic. Static or quiescent power is mainly dominated by transistor leakage current. When this current is listed in data sheets, it is listed as ICCINTQ and is the current drawn through the VCCINT supply powering the FPGA core.

Dynamic or active power has components from both the switching power of the core of the FPGA and the I/O being switched. The dynamic power consumption is determined by the node capacitance, supply voltage, and switching frequency and governed by the basic formula $P = CV^2 f$.

Both static and dynamic power have been significantly reduced in Virtex-4 devices, even when compared to Virtex-II Pro™ devices.

Dramatic Power Reduction

The Virtex-4 product family has reduced power consumption in several key areas. The power-per-CLB has been cut in half, with static power reduced by 40% and dynamic power reduced by 50% when compared to the 130 nm Virtex-II Pro FPGA and other 90 nm FPGAs. Furthermore, certain hard-logic silicon functions in the Virtex-4 FPGA reduce power consumption by 80-95%, a whopping factor when compared to the same functions implemented in configurable logic blocks and programmable interconnect routing.

Additionally, comprehensive power planning tools are available to help you get an idea, up front, of power consumption for your Xilinx FPGA under its operating conditions.

Reduced Power Consumption Benefits

The benefits of reduced power consumption cut across several areas of product design, including reduced thermal concerns and easier power supply design (see Figure 1).

• Reduced thermal concerns – When you reduce power consumption in a device or system, you can use smaller heat sinks, or no heat sinks at all in some cases. Thermal system design is also simpler because airflow and fan size requirements are reduced.

• Easier power supply design – You can also use smaller supply circuitry and reduce the number of components in the power supply. Using less PCB space allows you to reduce the cost of the power system. Plus, by not having your device consume as much power, you can achieve higher reliability by lowering the temperature of the FPGA die.

The Virtex-4 Power Play


The latest Xilinx FPGA offers revolutionary power innovations.

SYSTEM DESIGN CHALLENGES

Static Power Trends in 90 nm Technology

The reduction in transistor size in 90 nm technology has several effects on power consumption. The biggest potential problem is in the area of static power.

Scaling Trends for Static Power

As we mentioned earlier, static power is dominated by transistor leakage current. Unfortunately, channel leakage increases as transistor size decreases. This is especially true for low VT transistors, where VT refers to the threshold voltage between the gate and source.

Low VT transistors are the fastest transistors – the ones with the shortest turn-on and propagation delay – that IC designers use inside the FPGA when the highest speed performance is needed. Regular VT transistors are also used when less performance is acceptable, but this only helps so much with leakage.

Figure 2 shows that leakage goes up dramatically when moving from 130 nm to 90 nm technology. The Virtex-II Pro device uses 130 nm process technology, whereas the new Virtex-4 device uses 90 nm process technology.

Triple-Oxide – The Savior of Static Power

Triple-oxide simply means that we use a third thickness of oxide in making some of the transistors in the FPGA (two oxide thicknesses are used in devices like the Virtex-II Pro FPGA). Most transistors in the past had a thin oxide layer. Those thin-oxide transistors could be low VT, regular VT, NMOS, or PMOS transistors. Thick-oxide transistors are mostly used for I/O drivers and a few other functions.

Oxide deposition thickness is a very stable and controllable process in the semiconductor industry because it depends on temperature, concentration, and exposure time. Figures 3a and 3b show the Virtex-4 transistor with the middle oxide thickness used in the triple-oxide process. You may notice that the oxide thickness is still very, very thin, but this thicker oxide transistor has much lower leakage than the standard thin-oxide low VT and regular VT transistors used in Virtex-II Pro FPGAs and in various parts of Virtex-4 FPGAs.

Why Doesn’t Everyone Use Triple-Oxide?

If triple-oxide is such a great process, why don’t other companies like Intel™ or IBM™ use it in their own ASICs?

They probably would if it benefited them. The reason they don’t is that all of their transistors need to run at speed; hence, they must use the leakier low VT transistors for everything. FPGAs can have many different transistor types, which can be selected for function, power, or performance.

FPGAs can use different transistor types for different functions, and Xilinx designers have accomplished this balance.

Optimizing Performance and Leakage

Our IC designers have many things that they can do to adjust the mix to optimize for certain factors. The Virtex-4 FPGA is the first Platform FPGA designed for high speed and low power.

Low VT transistors are used only where necessary for maximum speed, while the middle thickness of oxide from the triple-oxide process may be used for less aggressive performance with very low leakage. You may use different sizes and types of transistors for performance and function. Combinations are also possible, such as small and medium-sized low VT fast transistors and small and medium-sized middle oxide thickness transistors. It is not a one-size-fits-all procedure.

Xilinx IC designers were given a directive to reduce power, among other things, in the Virtex-4 platform while maintaining the highest system performance. These transistors are used across the various FPGA functions of LUTs, I/O, interconnect, and configuration memory cells. Even within a given FPGA function, all transistors don’t need to be the same, and that is up to the Xilinx IC designers (see Figure 4).

The surprising result of this balancing is that the overall static current in Virtex-4 devices with 90 nm process is reduced by 40% when compared to Virtex-II Pro devices with 130 nm process. Table 1 shows a chart of the weighted average changes to the transistors in the Virtex-4 die compared to the Virtex-II Pro die, which allows you to arrive at the reduced transistor leakage in the Virtex-4 FPGA.
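The weighted-average ratios that the power article lists in its Table 1 can be cross-checked with a few multiplications. This is a rough sketch, assuming that leakage per transistor scales as channel width times leakage per unit width, and that static power per transistor is leakage current times VCCINT. The function and constant names are illustrative only.

```python
# Virtex-4 values relative to Virtex-II Pro (= 1.00), as listed in Table 1.
CHANNEL_WIDTH_RATIO = 0.64
LEAKAGE_PER_UNIT_WIDTH_RATIO = 1.14
VCCINT_RATIO = 0.80  # 1.2 V core supply relative to 1.5 V


def leakage_per_transistor(width_ratio, leakage_per_width_ratio):
    """Relative leakage of one transistor (assumed: width x leakage per unit width)."""
    return width_ratio * leakage_per_width_ratio


def static_power_per_transistor(leakage_ratio, vccint_ratio):
    """Relative static power of one transistor (leakage current x VCCINT)."""
    return leakage_ratio * vccint_ratio


leak = leakage_per_transistor(CHANNEL_WIDTH_RATIO, LEAKAGE_PER_UNIT_WIDTH_RATIO)
power = static_power_per_transistor(leak, VCCINT_RATIO)

if __name__ == "__main__":
    print(f"relative leakage per transistor:      {leak:.3f}")
    print(f"relative static power per transistor: {power:.3f}")
```

The products land within rounding distance of the 0.74 and 0.59 given in the table; the small gap is presumably because the published ratios are themselves rounded weighted averages.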


[Plot: “Transistor IOFF Trend” – IOFF (nA/µm, logarithmic axis) versus technology node, with curves for low VT and regular VT transistors.]

Figure 1 – Virtex-4 devices reduce thermal concerns and simplify power supply design.

Figure 2 – Transistor leakage trends due to process scaling

Figure 3a, 3b – Middle oxide thickness Virtex-4 transistor used in triple-oxide process and with highlighted portions of the transistors


Dynamic Power Reduction

Static power reduction, while dramatic, is not the only power improvement that you can take advantage of. Dynamic power is also reduced by 50% when compared to Virtex-II Pro FPGAs.

The dynamic power in the FPGA is governed by the following equation:

$P_{\text{Dynamic}} = \text{FPGA}_{\text{Core}}(CV^2 f) + \text{FPGA}_{\text{I/O}}(CV^2 f)$

The Virtex-4 family of FPGAs has the following:

• Reduced FPGA core dynamic power

– Internal operating voltage is the dominant factor

– Secondary scaling by frequency (f) and node capacitance (C)

• Constant FPGA I/O dynamic power

– Unchanged voltage swing (VI/O), toggle rate (f), and pin/pad capacitance (C) for a given I/O standard

So you can see that we are able to affect dynamic power inside the device, but the dynamic power consumed by I/O switching remains unchanged.

When we go from the 130 nm process of the Virtex-II Pro FPGA to the 90 nm process of the Virtex-4 FPGA, the internal supply voltage changes from 1.5V to 1.2V. This reduces the dynamic power consumption for every internal transistor by 36%, that is, $1 - (1.2/1.5)^2$, of that in the Virtex-II Pro FPGA.

Additionally, the FPGA internal composite capacitance is reduced in the Virtex-4 FPGA. This internal capacitance comprises transistor parasitic capacitances and trace-to-metal and trace-to-trace capacitances for the interconnecting metal traces. Figure 5 shows the capacitances involved relative to their structures.

Does low-K reduce power? Low-K refers to the dielectric insulating material between the metal traces in the FPGA. Lower K dielectric insulating layers do reduce internal capacitances per unit trace length, but “low-K” is a relative term. Xilinx has reduced-K insulating materials, and in the past has used low-K itself; we may do so again in the future.

As mentioned earlier, dynamic power is related to the bulk capacitance and internal voltage levels being switched, $P = CV^2 f$. All things being equal, having a lower internal capacitance for the interconnects would be a benefit for dynamic power and reduced resistor-capacitor delay, but other factors contribute to interconnect capacitance, such as distance above the metal plane, interconnect width, and interconnect length.

Additionally, other parasitic capacitances such as gate-to-drain and gate-to-source are also part of the equation. Total capacitance for a path is based on a complex combination of parasitic capacitance in the transistors; the architecture of the interconnect paths and actual path lengths; and the number of hops through interconnect switches. Xilinx has reduced the overall capacitance for those components in the Virtex-4 FPGA.

The overall effect is mostly due to reduced gate capacitance and lowers capacitance by 20% for Virtex-4 FPGAs when compared to Virtex-II Pro FPGAs. Table 2 shows a dynamic power reduction of 50% for the Virtex-4 FPGA when compared to the Virtex-II Pro FPGA. We have a 23% reduction in dynamic power when running at a 50% higher frequency.
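The Table 2 figures follow directly from the dynamic power relation. As a minimal sketch, the snippet below evaluates the product of capacitance, voltage squared, and frequency using the relative capacitance, core voltage, and relative frequency quoted in the text and in Table 2. The function names are illustrative, and the result is in the same arbitrary relative units as the table.

```python
def dynamic_power(c_rel, vccint, f_rel):
    """Relative dynamic power: capacitance x voltage squared x frequency."""
    return c_rel * vccint ** 2 * f_rel


def percent_change(new, old):
    """Signed percentage change from old to new."""
    return 100.0 * (new - old) / old


# Virtex-II Pro baseline: C = 1.0, VCCINT = 1.5 V, f = 1.0
v2pro = dynamic_power(1.0, 1.5, 1.0)

# Virtex-4: 20% lower capacitance, VCCINT = 1.2 V
v4_same_f = dynamic_power(0.8, 1.2, 1.0)  # same clock frequency
v4_fmax = dynamic_power(0.8, 1.2, 1.5)    # 50% higher clock frequency

# Contribution of the supply voltage change alone
voltage_only_saving = 100.0 * (1 - (1.2 / 1.5) ** 2)

if __name__ == "__main__":
    print(f"Virtex-II Pro:            {v2pro:.2f}")
    print(f"Virtex-4, same frequency: {v4_same_f:.2f} "
          f"({percent_change(v4_same_f, v2pro):+.0f}%)")
    print(f"Virtex-4, 1.5x frequency: {v4_fmax:.2f} "
          f"({percent_change(v4_fmax, v2pro):+.0f}%)")
    print(f"Voltage scaling alone:    {voltage_only_saving:.0f}% saving")
```

The voltage term accounts for most of the saving; the 20% capacitance reduction supplies the rest of the roughly 50% figure at equal frequency.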

Because the Virtex-4 FPGA is a much higher performance device than the Virtex-II Pro FPGA, you may need to operate it at higher clock speeds to meet newer demanding performance targets that could never be achieved in previous systems.


Parameters | Virtex-II Pro | Virtex-4 | Change
Channel Width Ratio | 1.00 | 0.64 | -36%
Channel Length Ratio | 1.00 | 0.71 | -29%
Leakage Current per Unit Width Ratio | 1.00 | 1.14 | +14%
Leakage Current per Transistor | 1.00 | 0.74 | -26%
VCCINT Ratio | 1.00 | 0.80 | -20%
Static Power per Transistor Ratio (ILEAKAGE × VCCINT) | 1.00 | 0.59 | -41%

Table 1 – Overall weighted average transistor leakage and parameter comparisons for 90 nm Virtex-4 transistors relative to 130 nm Virtex-II Pro transistors

Figure 4 – Optimal transistor mix for minimizing leakage and maximizing performance

Figure 5 – Internal FPGA capacitance comprises parasitic transistor and interconnect capacitances


Embedded Blocks

Another major area of improvement in power consumption is embedded functions. This has always been a strength of Xilinx FPGAs, but it is more so in the Virtex-4 FPGA, even when compared to the feature-rich Virtex-II Pro FPGA.

In Virtex-4 FPGAs you can take further advantage of both static and dynamic power reduction by using the embedded functions, which are built as hard-logic functions.

When embedded functions are implemented as hard-logic functions instead of in configurable logic blocks and programmable interconnects, much less static and dynamic power is consumed. This is because far fewer transistors are used for hard, fixed logic than for programmable logic. Additionally, there are no transistors needed to make connections for interconnects in the embedded functions, because there are no programmable interconnects.

Xilinx has carefully studied some of the functions that engineers like you have struggled with and that we have also found tedious to implement within the FPGA programmable logic. The new embedded functions lower power by 80-95% compared to their configurable logic block and routed counterparts in programmable silicon.

Comprehensive Power Planning Tools

One useful aid in planning power is that Xilinx data sheets show you both typical and maximum power consumption numbers. Maximum numbers are for worst-case process, temperature, and voltage, but many designers are very happy to work with typical numbers, depending on their application and the number of parts being used in one system.

You can also take advantage of power planning tools when planning for power consumption in Xilinx FPGAs. Xilinx web power tools are available for estimating power early in the design cycle. Also, as part of the Xilinx design flow, XPower looks in more detail at a mapped or routed design. Both can be found, along with power application notes, by searching the Xilinx website for the phrase “Xilinx Power Tools.”

Conclusion

Xilinx has made profound improvements in both static and dynamic power in the Virtex-4 90 nm family of FPGAs when compared to Virtex-II Pro FPGAs – and (we believe) in comparison to our competitors. We have done this through a multi-pronged, purposeful approach in the areas of reduced leakage current, reduced dynamic power consumption, and embedded functions, without compromising performance. These, along with comprehensive power planning tools, make the Virtex-4 device an excellent choice for a high-performance FPGA system.

For more information about power consumption in Virtex-4 and other Xilinx FPGAs, visit www.xilinx.com/products/design_resources/design_tool/grouping/power_tools.htm.


Parameters | Virtex-II Pro | Virtex-4 | Change
VCCINT | 1.5 | 1.2 | -20%
CTOTAL (rel.) | 1.0 | 0.8 | -20%
fMAX (rel.) | 1.0 | 1.5 | +50%
Power at Same Frequency | 2.25 | 1.15 | -49%
Power at fMAX | 2.25 | 1.73 | -23%

Table 2 – Chart showing changes in internal FPGA parameters in Virtex-4 devices compared to Virtex-II Pro devices and the effect on dynamic power

Parameters | Virtex-II Pro | Virtex-4 | Logic Slice Reduction | Logic Slice Power Reduction
QDR II SRAM Interface | 550 slices | 125 slices | 77% | 89%
SPI-4.2 Core | 5000 slices | 3900 slices | 22% | 61%

$\text{Logic slice power reduction} = 100 \times \left(1 - 0.5 \times \frac{\text{Virtex-4 slice count}}{\text{Virtex-II Pro slice count}}\right)\%$

Note: The factor of 0.5 above comes from the fact that Virtex-4 power per slice is 1/2 of the Virtex-II Pro power per slice because of the 50% dynamic power reduction in Virtex-4 devices compared to Virtex-II Pro devices.

Table 3 – QDR II SRAM and SPI-4.2 core benefit in reduced power consumption from significant logic cell reduction due to new Virtex-4 ChipSync block
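The Table 3 percentages can be reproduced from the slice counts and the formula given with the table. The sketch below is illustrative only: the function names are made up here, and the default factor of 0.5 is the per-slice power ratio stated in the table note.

```python
def slice_reduction_pct(v2pro_slices, v4_slices):
    """Percentage of logic slices saved by the Virtex-4 implementation."""
    return 100.0 * (1 - v4_slices / v2pro_slices)


def slice_power_reduction_pct(v2pro_slices, v4_slices, power_per_slice_ratio=0.5):
    """Logic slice power reduction per the Table 3 formula.

    power_per_slice_ratio is Virtex-4 power per slice relative to
    Virtex-II Pro (0.5, per the note under the table).
    """
    return 100.0 * (1 - power_per_slice_ratio * v4_slices / v2pro_slices)


# (Virtex-II Pro slices, Virtex-4 slices) from Table 3
DESIGNS = {
    "QDR II SRAM interface": (550, 125),
    "SPI-4.2 core": (5000, 3900),
}

if __name__ == "__main__":
    for name, (old, new) in DESIGNS.items():
        print(f"{name}: {slice_reduction_pct(old, new):.0f}% fewer slices, "
              f"{slice_power_reduction_pct(old, new):.0f}% less slice power")
```

Note that the formula covers logic slice power only; any power drawn by the hard blocks that replace those slices is outside its scope.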

Virtex-4 Embedded Functions and Reduction of Dynamic Power

• PowerPC – 50% power reduction compared to Virtex-II Pro PowerPC

– 10:1 power reduction over FPGA fabric-built version

• DSP – XtremeDSP™ slice greatly reduces the logic cells previously needed for many filtering functions

– 20:1 power reduction over Virtex-II Pro separated multiply/accumulate functions

• SSIO – New ChipSync™ block reduces logic cell count for SSIO (source synchronous I/O) designs

– Significant logic cell savings for various memory and networking interface designs lead to a reduction in overall power of up to 9:1 for selected designs (see Table 3)

• Embedded Ethernet MAC(s) – No need to use logic and interconnect for MAC function, which saves >3,000 logic cells for the Xilinx implementation

• FIFO – SmartRAM™ memory includes built-in FIFO controllers, which can save hundreds of logic cells per FIFO and greatly simplify design as well


by Chris Ebeling
Principal Engineer
Xilinx, [email protected]

Krista Marks
Sr. Manager, IP Solutions Division
Xilinx, [email protected]

SPI-4.2 (System Packet Interface Level 4 Phase 2) is the Optical Internetworking Forum’s recommended interface for the interconnection of devices for aggregate bandwidths of OC-192 (ATM and POS) and 10 Gbps (Ethernet), as illustrated in Figure 1.

In the last few years, this interface has become the de facto standard on all leading 10 Gbps framer ASSPs and has been implemented directly on many next-generation network processors. SPI-4.2 has been broadly adopted because of its efficient interface, which offers high bandwidth with a low pin count and seamless handling of typical system requirements such as flow control, error insertion/detection, synchronization, and bus re-alignment.

The Xilinx® Virtex-4™ architecture provides an ideal platform for implementing SPI-4.2. The Xilinx SPI-4.2 LogiCORE™ IP targeting Virtex-4 devices provides a solution with one-third fewer resources, dramatic power savings, 1+ Gbps LVDS double-data-rate (DDR) I/O, and complete pin assignment flexibility.

SPI-4.2 LogiCORE IP

Xilinx has improved on its Virtex-II™ and Virtex-II Pro™ SPI-4.2 solution, already one of the smallest in the industry, and made it 30% smaller by leveraging new ChipSync™ technology in the Virtex-4 FPGA. ChipSync technology is supported on every pin of the Virtex-4 device family; thus the new SPI-4.2 LogiCORE IP can be targeted to any device pin-out. This allows you to select I/O pins that best fit your system and PCB requirements.

In addition, for those applications requiring multiple SPI-4.2 interfaces, the Virtex-4 FPGA’s logic density, high pin count, and extensive clocking resources will support four or more full-duplex cores in a single device. Regardless of the performance your application requires, Virtex-4 devices fully support the entire SPI-4.2 operating range, with high-speed LVDS support of data rates greater than 1 Gbps per pin.

ChipSync Technology

Xilinx introduced ChipSync technology in Virtex-4 FPGAs to enhance I/O capability when used for source-synchronous applications like SPI-4.2. ChipSync features are supported in every Virtex-4 I/O pin and include:

• New serializer and deserializer (OSERDES and ISERDES) features. These enable logic built in the fabric to interface to the I/O at a fraction of the source-synchronous clock rate. The ISERDES also includes a Bitslip function. Bitslip allows you to shift the starting bit of deserialized data to achieve proper word alignment when linking multiple pins together (bus deskew).

• A new input delay (IDELAY) feature. This allows you to precisely adjust the input delay of each bit of a bus independently, in 78 ps increments. This provides a mechanism for tuning the interface timing to the system environment.
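The word-alignment idea behind Bitslip can be illustrated with a small behavioral sketch. This is not a model of the ISERDES hardware: the function names, the 4-bit word width, and the training word are assumptions chosen for illustration, and each "slip" is simply modeled as moving the word boundary by one bit.

```python
def deserialize(bits, width=4, offset=0):
    """Group a serial bit list into words of `width` bits, starting at `offset`."""
    usable = bits[offset:]
    n_words = len(usable) // width
    return [tuple(usable[i * width:(i + 1) * width]) for i in range(n_words)]


def align_with_bitslip(bits, training_word, width=4):
    """Return how many one-bit slips make every word match training_word.

    Each slip moves the word boundary by one bit (a simplified stand-in for
    the Bitslip function). Returns None if no alignment is found.
    """
    target = tuple(training_word)
    for slips in range(width):
        words = deserialize(bits, width, offset=slips)
        if words and all(w == target for w in words):
            return slips
    return None


# Hypothetical repeating training word received with an unknown 3-bit lead-in.
training = [1, 1, 0, 0]
stream = [0, 1, 0] + training * 6
print(align_with_bitslip(stream, training))
```

In a multi-pin bus, the same search would run per pin so that all pins present words on the same boundary.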

Deliver Efficient SPI-4.2 Solutions with Virtex-4 FPGAs

Virtex-4 devices offer an ideal platform for source-synchronous designs like the widely adopted SPI-4.2 interface.


Additional DDR registers are now fully integrated into the input (ILOGIC) and output (OLOGIC) pins, simplifying the interface between the FPGA fabric and I/O blocks and supporting data transfer to and from the I/O logic on a single clock edge.

SPI-4.2 and ChipSync Technology

The SPI-4.2 interface has a DDR source-synchronous data bus that comprises 18 LVDS pairs (16 data bits, 1 control bit, and 1 clock). The SPI-4.2 source-synchronous clock varies from 311 MHz to 500 MHz.

As the frequency of the source-synchronous clock increases, data recovery at the receiving (sink) device becomes more challenging. The SPI-4.2 protocol provides calibration data, or a training pattern, that permits a receiving device to adjust its data sampling to the system interface timing. The process of tuning the interface to its particular timing is referred to as dynamic phase alignment (DPA).

Before Virtex-4 devices, Xilinx DPA solutions worked by over-sampling the input data and choosing the best sample from the group. This required valuable FPGA resources and careful control of the input data path in the FPGA fabric, restricting the SPI-4.2 interface pin placement. In Virtex-4 FPGAs, the IDELAY feature present in every I/O is ideally suited to perform this function, as shown in Figure 2. (See “Dynamic Phase Alignment with ChipSync Technology in Virtex-4 FPGAs,” also in this issue of the Xcell Journal).

The IDELAY features have two primary benefits for the SPI-4.2 core in Virtex-4 FPGAs:

• Integrating the IDELAY feature into the input pin (ILOGIC) reduces the FPGA resources required for DPA to less than 350 slices.

• The IDELAY function’s ability to adjust the data sampling point enables DPA to be implemented in the I/O – except for a small control state machine, which is implemented in the fabric. The state machine portion is fully synchronous and does not require a complex macro. Thus, there are no restrictions on SPI-4.2 pin assignments.
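One plausible way for the control state machine to pick a delay setting is to sweep the taps while the training pattern is being received and settle on the middle of the widest error-free run. The sketch below illustrates that idea only; it is an assumption, not the LogiCORE algorithm, and `sample_ok` is a hypothetical callback standing in for "the training pattern was received correctly at this tap." The 64-tap range follows Figure 2 and the 78 ps increment is the figure quoted earlier in this article.

```python
def choose_idelay_tap(sample_ok, num_taps=64):
    """Pick the tap at the center of the widest run of good taps.

    sample_ok(tap) -> bool is a hypothetical callback reporting whether the
    training pattern was received correctly at that delay setting.
    """
    best_start, best_len = None, 0
    run_start, run_len = None, 0
    for tap in range(num_taps):
        if sample_ok(tap):
            if run_len == 0:
                run_start = tap
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    if best_len == 0:
        return None  # no tap sampled the pattern correctly
    return best_start + (best_len - 1) // 2


TAP_PS = 78  # tap increment quoted in the article

# Toy channel: data is stable for taps 10..29, so those taps sample correctly.
tap = choose_idelay_tap(lambda t: 10 <= t <= 29)
print(tap, tap * TAP_PS)
```

Because each bit of the bus has its own delay element, the same search can run independently per bit, which is what removes the pin-placement restrictions of the fabric-based oversampling approach.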

Clocking Resources

Virtex-4 FPGAs provide an unprecedented number of clock resources for implementing multiple SPI-4.2 interfaces in a single device. With the Virtex-II and Virtex-II Pro architectures, implementing more than two SPI-4.2 interfaces posed a clock management challenge. The abundance and flexibility of clock distribution in the Virtex-4 family solves this challenge, supporting as many SPI-4.2 interfaces as the device logic and I/O will allow.

For example, a typical OC-192 framer will require an aggregate bandwidth of 10 Gbps, which for a 16-bit double-data-rate bus would require a data clock of at least 311 MHz, with 350 MHz a typical clock rate. The Xilinx SPI-4.2 LogiCORE IP easily meets your application requirements, regardless of performance, and with Virtex-4 ChipSync technology delivers a solution that is smaller and more flexible than prior FPGA implementations.

The SPI-4.2 core uses ChipSync technology to serialize egress data and de-serialize ingress data to a four-word (bus cycle) SPI-4.2 data stream at a lower clock rate. Operation of the core logic at a lower internal clock rate allows you to implement high-frequency SPI-4.2 interfaces in the slowest speed grade Virtex-4 device.

The ISERDES and OSERDES functions allow the core logic to time multiplex and de-multiplex these four words to and from the I/O logic without using any CLB logic resources. The core logic need only operate at half the source-synchronous DDR clock rate. For example, a SPI-4.2 interface with a 500 MHz DDR reference clock would only require an FPGA fabric clock of 250 MHz – easily achievable in the Virtex-4 architecture.
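The rate arithmetic in this section can be checked with a few lines. The helper name and its return layout are illustrative assumptions; the 16 data bits, the DDR signaling, and the four words per fabric cycle come from the text above.

```python
def spi42_rates(ddr_clock_mhz, data_bits=16, words_per_fabric_cycle=4):
    """Return (per-pair Mbps, aggregate Gbps, fabric clock MHz)."""
    per_pair_mbps = 2 * ddr_clock_mhz              # DDR: two bits per clock period
    aggregate_gbps = per_pair_mbps * data_bits / 1000
    # Four bus words are handled per fabric cycle, two words arrive per DDR clock.
    fabric_clock_mhz = per_pair_mbps / words_per_fabric_cycle
    return per_pair_mbps, aggregate_gbps, fabric_clock_mhz


for clk in (311, 350, 500):
    print(clk, spi42_rates(clk))
```

At 311 MHz the 16 data pairs carry just under 10 Gbps, which is why that figure is the floor for an OC-192 framer, and at 500 MHz the fabric clock works out to the 250 MHz quoted above.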


Figure 1 – Typical SPI-4.2 application (diagram labels: SPI-4.2 PHY layer device or MPU; Rx data path, Rx status path, Tx data path, Tx status path; Virtex-4 device containing the SPI-4.2 sink core and SPI-4.2 source core, each with a SPI-4.2 interface and a user interface; user's logic)

Figure 2 – DPA implementation in I/O logic for Virtex-II devices versus Virtex-4 devices (diagram labels, Virtex-II or Virtex-II Pro FPGA: receive LVDS DDR I/O, time-sliced (delay chain) oversampling at 8 times/bit, per-bit sample selection state machine, 4:1 data de-serialization, bus de-skew state machine; Virtex-4 FPGA: receive LVDS DDR I/O, IDELAY multi-tap delay line multiplexer with one of 64 choices, 4:1 data de-serialization, per-bit sample selection state machine, bus de-skew state machine; legend: implemented in the FPGA fabric, implemented in the I/O block)


All Virtex-4 devices have 32 global clock resources. No restrictions exist on global clock distribution other than a maximum of eight global clocks per clock region. All clock regions have access to any 8 of the 32 total global buffers, regardless of the requirements of other clock regions.

In addition to the eight global clocks, each region in the device has two regional clock buffers. The regional clock resources are ideal for interface clocking, like the source-synchronous clock scheme used by SPI-4.2. Note that even the smallest Virtex-4 device has a total of 48 available clock resources, each designed for low-skew clock distribution and clock power management. The SPI-4.2 LogiCORE IP can be configured to use either global or regional clock resources.

In Virtex-4 FPGAs, the global clock trees and associated buffers are implemented differentially, for best duty-cycle fidelity and greater common-mode noise rejection. With Virtex-II and Virtex-II Pro devices, if the SPI-4.2 interface operates above 350 MHz, you must route the high-speed reference clock using two clock buffers to minimize duty-cycle distortion at the DDR registers. Because each global clock tree in Virtex-4 FPGAs is implemented differentially, only one clock buffer is required.

Not only does the Virtex-4 architecture have considerably more clock resources, but because they are distributed differentially, the SPI-4.2 LogiCORE IP requires fewer of them. These high-performance clock resources support as many as four SPI-4.2 interfaces in a mid-range device (LX40/LX60) and more than four SPI-4.2 interfaces in the larger devices (Figure 3). The Virtex-4 clocking capability opens up a whole new class of SPI-4.2 applications, and provides an ideal platform for applications such as multiplexing and de-multiplexing, bridges, and switches.

Higher Performance at Lower Power

Virtex-4 silicon is manufactured with a triple-oxide process that reduces static power consumption by 40%. This will have a positive impact for all designs, including the SPI-4.2 interface, where the power savings are dramatic, as summarized in Table 1.

With Virtex-4 devices, SPI-4.2 uses significantly less power than its Virtex-II and Virtex-II Pro predecessors, both because of the enhanced 90 nm semiconductor process and because the LogiCORE IP uses 30% fewer fabric resources. At the same time, Virtex-4 FPGAs support 30% higher internal performance for SPI-4.2, with a maximum frequency of 250 MHz in the lowest speed grade (compared to 175 MHz in the lowest speed grade of Virtex-II and Virtex-II Pro devices). In addition, Virtex-4 FPGAs support 1+ Gbps LVDS for every I/O on the device.

This means that not only can you place multiple SPI-4.2 interfaces anywhere on the device, but for each implemented interface you get an aggregate bandwidth as high as 16+ Gbps. Designs that do not require this level of performance (such as more typical framer interfaces running at 10-12 Gbps) automatically get additional performance overhead that ensures ease of design integration and timing closure.

Conclusion

The Xilinx SPI-4.2 LogiCORE IP, coupled with Virtex-4 features, provides a highly efficient SPI-4.2 solution. We developed ChipSync technology, which is supported on every I/O pin, specifically for source-synchronous interfaces like SPI-4.2.

This technology enables you to design the most efficient SPI-4.2 solution, which uses significantly fewer resources (35% fewer), allows fully flexible device pin assignments (you choose the pinout), and supports extremely high interface speeds (1+ Gbps LVDS DDR I/O).

The higher performance is even more compelling because Virtex-4 FPGAs deliver it with lower power and significantly higher internal operating rates. The wealth of Virtex-4 clocking resources, combined with full pin assignment flexibility, opens up the possibility for new applications with multiple SPI-4.2 interfaces.

For more information about SPI-4.2 LogiCORE IP targeting Virtex-4 devices, please refer to this site at the Xilinx IP Center: www.xilinx.com/xlnx/xebiz/designResources/ip_product_details.jsp?key=DO-DI-POSL4MC. A hardware demonstration is also available; for more information, contact your Xilinx representative.


| | Virtex-II | Virtex-II Pro | Virtex-4 |
|---|---|---|---|
| Power: Static Alignment @ 700 Mbps per LVDS Pair | 1.9 W | 1.75 W | 1.55 W |
| Power: Dynamic Alignment (Performance per LVDS Pair) | 2.6 W @ 800 Mbps | 2.8 W @ 944 Mbps | 2.0 W @ 1 Gbps |
| Speed Grades Supporting 800 Mbps per LVDS Pair | -6 | -6, -7 | -10, -11, -12 |

Figure 3 – Illustration of four SPI-4.2 LogiCORE IP implemented on a Virtex-4 XC4VLX60 device

Table 1 – SPI-4.2 power estimates for Virtex-II, Virtex-II Pro, and Virtex-4 FPGAs


by Maria George Senior Product Applications Engineer Xilinx, [email protected]

Xilinx® Virtex-4™ devices have a 64-tap absolute delay element built in each I/O, making high-speed memory interface read data capture very easy. This feature also provides the flexibility to adopt different read data capture schemes where clock/strobe or data can be delayed.

During a write to the external memory device, the clock/strobe must be transmitted center-aligned with respect to data. A memory write is easy to implement with Virtex-4 devices by means of the quadrature phase outputs of the DCM (CLK0, CLK90, CLK180, CLK270), ensuring that the clock/strobe is center-aligned with data. Figure 1 illustrates the clock/strobe and data phase relationship during read and write transactions.

For most memory interfaces, such as DDR2 SDRAM, RLDRAM II, FCRAM II, and QDR II SRAM, the data rate is twice the clock rate because data is received and transmitted on both the rising and falling edges of the forwarded clock/strobe. Virtex-4 devices have both input and output DDR flip-flops, making DDR operation extremely simple.

Write Data and Clock/Strobe Transmission

During a write operation, the clock/strobe is generated using the output DDR registers clocked by a DCM clock output (CLK0) on the global clock network. The write data is transmitted using the output DDR registers clocked by a DCM clock output that is 90 degrees phase ahead (CLK270) of the clock used to generate clock/strobe. This meets the memory vendor specification of centering the clock/strobe in the data window.

Another innovative feature of the output DDR registers is the SAME_EDGE mode of operation. In this mode, a third register clocked by a rising edge is placed on the input of the falling edge register (Figure 2). Using this mode, both rising edge and falling edge data can be presented to the output DDR registers on the same clock edge (CLK270), thereby allowing higher DDR performance with minimal register-to-register delay.

Read Data Capture

Most memory interfaces are source-synchronous interfaces, where the clock/strobe is received edge-aligned with data during a read from the external memory device. This makes read data capture challenging because the read clock/strobe must be delayed to capture read data.

Read data capture is challenging because the read data and the incoming memory read clock/strobe are received edge-aligned from the external device.

Virtex-4 Memory Interfaces

Virtex-4 devices make challenging memory interface requirements simple.

Figure 1 – Clock/strobe and data during read and write (timing diagram labels: received read data at FPGA, words 0 through 3, with the read clock/strobe; transmitted write data from FPGA, words 0 through 3, with the transmitted clock/strobe)


The traditional technique to capture read data is to register it in the delayed memory clock/strobe domain. This entails:

• Ensuring that the memory clock/strobe and the associated data have matched PCB trace delays between the memory device and the FPGA

• Delaying the clock/strobe signals such that the edges of the clock/strobe center in the valid data window, as shown in Figure 3

• Registering the read data with the delayed memory clock/strobe

• Synchronizing registered read data to the system (FPGA) clock domain

An alternate and simpler technique, currently used in Xilinx reference designs, is to capture read data directly in the system (FPGA) clock domain. This entails:

• Ensuring that the memory clock/strobe and the associated data have matched PCB trace delays between the memory device and the FPGA

• Determining the phase difference between the memory clock/strobe and the system (FPGA) clock by detecting two memory clock/strobe transitions in the system clock domain

• Detecting transitions of memory clock/strobe after the memory initialization sequence by delaying memory clock/strobe with respect to the system (FPGA) clock in unit increments

• Delaying read data based on memory clock/strobe to system (FPGA) phase information such that the system (FPGA) clock is centered in the valid data window

Both techniques require delay elements to delay the clock/strobe or data.
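The second technique can be sketched as a tap sweep: step the strobe delay one tap at a time, note where the level sampled by the system clock changes, and place the data delay midway between the first two transitions. The code below is a behavioral illustration under assumptions, not the Xilinx reference design; `sample_strobe` is a hypothetical callback returning the strobe level (0 or 1) seen by the system clock at a given tap, and the toy strobe is invented for the example.

```python
def find_transitions(sample_strobe, num_taps=64):
    """Return the tap indices at which the sampled strobe level changes."""
    transitions = []
    previous = sample_strobe(0)
    for tap in range(1, num_taps):
        level = sample_strobe(tap)
        if level != previous:
            transitions.append(tap)
        previous = level
    return transitions


def data_delay_taps(sample_strobe, num_taps=64):
    """Choose a data delay midway between the first two strobe transitions."""
    edges = find_transitions(sample_strobe, num_taps)
    if len(edges) < 2:
        return None  # two transitions were not observable within the delay range
    return (edges[0] + edges[1]) // 2


# Toy strobe: sampled low for taps 0..11, high for taps 12..43, low afterwards.
def toy_strobe(tap):
    return 1 if 12 <= tap < 44 else 0


print(find_transitions(toy_strobe), data_delay_taps(toy_strobe))
```

Because data arrives edge-aligned with the strobe, the two transition taps mark the edges of the data window, so the midpoint puts the system clock edge near the center of valid data.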

The 64-tap, 80 ps absolute delay element available in each Virtex-4 I/O allows center alignment of memory clock/strobe in the data window or data centering with the system (FPGA) clock. Each Virtex-4 I/O also has input DDR flip-flops that are required for read data capture, either in the delayed memory clock/strobe domain or the system (FPGA) clock domain.

You can use the input DDR flip-flops in the SAME_EDGE or SAME_EDGE_PIPELINED modes. In the SAME_EDGE mode, the falling edge data is output on the following rising edge of the clock (Figure 4). In the SAME_EDGE_PIPELINED mode, both the rising edge and falling edge data are output together on the same rising edge of the clock (Figure 5). With these modes you can achieve higher design performance by avoiding half-clock cycle data paths in the FPGA fabric.

In the first technique, read data is captured in the delayed memory clock/strobe domain and must be re-captured in the system (FPGA) clock domain. The transfer of captured read data from the delayed memory clock/strobe domain to the internal system (FPGA clock) domain is defined as read data re-capture. Read data is re-captured within the I/O block.

Using the second technique, implemented in the Xilinx reference designs, you can directly capture read data in the system (FPGA) clock domain by delaying read data to meet the setup/hold time of the flip-flops in the system (FPGA) clock domain. A simple state machine is sufficient to implement the center alignment of the delayed read data with respect to the system (FPGA) clock after the initialization period.

Figure 3 – Clock/strobe delayed in FPGA to center in read data window (timing diagram labels: clock/strobe from memory, delayed clock/strobe in FPGA, read data words 0 through 7)

Figure 2 – Output DDR in SAME_EDGE mode (diagram labels: output DDR registers with inputs D1, D2, C, CE, S, and R driving output OQ through the DDR MUX; timing diagram with D1A through D1D on D1, D2A through D2D on D2, and OQ alternating D1A, D2A, D1B, D2B, D1C, D2C, D1D)

This “run time” adjustment after the memory initialization sequence has significant advantages over other methods that set the required delay or phase shift during “compile time.” The 64-tap absolute delay element compensates for variations in process, temperature, or voltage, and hence increases the timing margins – resulting in a more reliable system.

The read data is re-captured and stored directly into the block RAM FIFO, a Virtex-4 feature that saves additional logic resources.

Conclusion

Virtex-4 architectural features enable you to easily and reliably implement high-speed memory interfaces. You can use the 64-tap, 80 ps absolute delay elements to capture read data by either delaying the memory clock/strobe or the data. Built in each I/O, the 64-tap absolute delay elements provide you the flexibility to select any I/O for memory interfaces. The “run time” adjustment after memory initialization improves design margins.

The input and output DDR registers enable you to receive and transmit clock/strobe and data at high frequencies; the differential clocking resource provides higher performance with better duty cycle and lower global clock buffer utilization; and the block RAM FIFO feature enables you to store transmitted or received data without additional logic resources.

For more information about the implementation and design details of different memory interfaces in Virtex-4 devices, visit the following websites:

• DDR2 SDRAM (XAPP701 and XAPP702) and DDR SDRAM (XAPP709): www.xilinx.com/products/design_resources/mem_corner/resource/xaw_dram_ddr.htm

• RLDRAM (XAPP710): www.xilinx.com/products/design_resources/mem_corner/resource/rldram.htm

• QDR II SRAM (XAPP703): www.xilinx.com/products/design_resources/mem_corner/resource/xaw_sram_qdr.htm

Figure 4 – Input DDR in the SAME_EDGE mode (diagram labels: input DDR registers with inputs D, C, CE, S, and R and outputs Q1 and Q2; timing diagram data labels D0A, D2A, ... D10A and D1A, D3A, ... D11A, with an initial "Don't care" interval)

Figure 5 – Input DDR in SAME_EDGE_PIPELINED mode (diagram labels: input DDR registers with inputs D, C, CE, S, and R and outputs Q1 and Q2; timing diagram data labels D0A through D13A)



by Niall Battson DSP Applications Engineer Xilinx, [email protected]

With the introduction of Xilinx® Virtex-4™ FPGAs in September 2004, the world of DSP design witnessed a dramatic leap in programmable logic DSP: higher performance, lower cost, lower power, and maximum flexibility.

At the same time, this leap asks DSP hardware engineers to change their traditional way of designing and embrace a different approach. These great improvements have been made possible by the XtremeDSP™ slice.

The XtremeDSP Slice

The XtremeDSP slice (also referred to as the DSP48) is a high-performance multiplier and arithmetic unit with great flexibility that can form the building block of many DSP algorithms implemented in FPGAs. A detailed diagram of the DSP48 structure is shown in Figure 1.

The XtremeDSP slice comprises four main sections:

• I/O registers

• 18 x 18 signed multiplier

• Three-input adder/subtractor

• Op-mode multiplexers

The I/O registers ensure a maximum clock performance of 500 MHz in the fastest speed grade device (400 MHz in the slowest speed grade), also ensuring support for higher sample rates. The dynamic op-mode multiplexers are key to the functionality of the structure; they are responsible for the DSP48’s great flexibility. For example, in a simple MACC engine, you set the X and Y MUX to multiply and select the feedback path from the registered output P as the Z MUX input to the arithmetic unit.

In the Virtex-4 architecture, XtremeDSP slices are arranged in columns. The most important aspect of the column is the cascade logic and routing between each block, which exists on both the input and output stages of each slice. This dedicated routing enables a number of filters and other functions to be built entirely within the XtremeDSP slice, thus removing the need for signals to be routed through the FPGA interconnect or logic fabric.

Designing with the Virtex-4 XtremeDSP Slice


Harness the full capabilities of the XtremeDSP slice in filter design.


However, you must take this adder-chain configuration into account when designing functions that exploit the XtremeDSP slice. Herein lies the fundamental change in the approach to filter design. The simple, traditional adder-tree approach limited the performance and extensibility of a given filter implementation. By using adder-chain-style implementations, these limitations are lifted and the huge benefits Virtex-4 FPGAs offer are possible.

The embedded nature of the XtremeDSP slice has also had a radical impact on reducing the power consumed by high-speed multiply and add functions. Figure 2 illustrates this dramatic reduction, showing that the dynamic power consumption, at approximately 2.3 mW/100 MHz, is 1/17 that of Virtex-II Pro™ devices. As a designer, you should migrate as much functionality into these embedded functions as possible.

Filter Techniques

During the last ten years, hardware and FPGA designers have created a wide variety of filter architectures to efficiently exploit the building blocks that the current generation of technology offers. With the introduction of Virtex-4 FPGAs and the XtremeDSP slice, filter implementations must change to most efficiently exploit this latest FPGA offering. Filters are prolific in DSP designs and nearly always form the starting point for analyzing an architecture.

The Semi-Parallel FIR Filter

Even within the filter world, you can implement a wide variety of filters. The key parameters that tell us which FIR filter implementation we will construct are:

• Number of coefficients (N)

• Sample rate (Fs)

Let’s examine a particular filter structure to demonstrate the key design techniques that can help you maximize the benefits of Virtex-4 devices. Our filter has 20 coefficients and a sample rate of 74.25 MHz.

As noted earlier, the maximum clock speed of the XtremeDSP slice is 400 MHz in the slowest speed grade (-10). Because 400 MHz / 74.25 MHz is approximately 5.4, we have a total of five whole clock cycles per input sample to perform the required 20 multiplies and adds to form the result.

This equation determines how many multipliers to use for a particular semi-parallel architecture:

Number of Multipliers = (Maximum Input Sample Rate x Number of Coefficients) / Clock Speed

For our example, the required number of multipliers will be four (74.25 x 20 / 400 is approximately 3.7, rounded up). Once we have determined the required number of multipliers, there is an extendable architecture using the XtremeDSP slices that can serve as the basis for the filter.
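The sizing arithmetic can be written as a short helper. The function name and return layout are illustrative assumptions; the formula is the one given above, with the result rounded up to a whole number of multipliers and the cycles per sample rounded down.

```python
import math


def semi_parallel_plan(sample_rate_mhz, num_coeffs, clock_mhz):
    """Return (whole clock cycles per input sample, multipliers required)."""
    cycles_per_sample = math.floor(clock_mhz / sample_rate_mhz)
    multipliers = math.ceil(sample_rate_mhz * num_coeffs / clock_mhz)
    return cycles_per_sample, multipliers


# The article's example: 20 coefficients at 74.25 MHz with a 400 MHz clock.
print(semi_parallel_plan(74.25, 20, 400))
```

For this example the two views agree: five cycles per sample and four multipliers cover all 20 multiply-adds.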

The general FIR filter equation is a summation of products (also known as an inner product), defined as:

$$y_n = \sum_{i=0}^{N-1} x_{n-i}\, h_i$$

In this equation, a set of N coefficients is multiplied by N respective data samples, and the results are summed to form an individual result. The values of the coefficients determine the characteristics of the filter: low-pass, band-pass, or high-pass.
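As a software reference point, the inner product can be evaluated directly. This minimal sketch assumes samples before the start of the input are zero; the function name is chosen for illustration.

```python
def fir(x, h):
    """Evaluate y[n] = sum_{i=0}^{N-1} x[n-i] * h[i], taking x[k] = 0 for k < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for i, coeff in enumerate(h):
            if n - i >= 0:
                acc += x[n - i] * coeff
        y.append(acc)
    return y


# An impulse at the input reproduces the coefficients at the output.
print(fir([1, 0, 0, 0], [3, 2, 1]))
```

A reference like this is useful for checking any hardware-oriented restructuring of the filter, since every architecture must produce the same output sequence.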


Figure 1 – Simplified diagram of the XtremeDSP slice (diagram labels: inputs A, B, and C; cascade ports BCIN, BCOUT, PCIN, and PCOUT; A, B, M, and P registers; X, Y, and Z multiplexers; 17-bit shift; Subtract; Carry In; OpMode)

Figure 2 – Dynamic power consumption of the XtremeDSP slice (plot of average power in mW versus frequency in MHz: Virtex-4 ~2.3 mW/100 MHz; Virtex-II Pro* ~39 mW/100 MHz; Virtex-II* ~47 mW/100 MHz. Conditions: TT, 25C, nominal voltage, fully pipelined multiply-add mode, random vectors. * Based on power estimator spreadsheet, uses slice logic)


XtremeDSP arithmetic units are designed to be chained together easily and efficiently thanks to dedicated routing between slices. Figure 3 illustrates how the four XtremeDSP multiply and add elements are cascaded together to form the main part of the filter.

It is critical to highlight the usage of the adder chain here rather than the more traditional adder tree. The adder chain has a profound impact on the control logic required for the filter, as well as its efficiency, because of the mapping to the XtremeDSP slice.

Continuing to analyze the filter structure, an extra XtremeDSP slice is required to perform the accumulation of the partial results, thus creating the final result. A new result is created every five clock cycles. This means that for every five cycles the accumulation must be reset to the first inner product of the next result. This reset (or load) is achieved by changing the op-mode value of the XtremeDSP slice for a single cycle, from 0010010 to 0010000 (this is just a single bit change). At the same time, the capture register is enabled and the final result stored on the output.
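A functional sketch can confirm that four multiply-add elements plus an accumulator, run for five cycles, reproduce the direct FIR result. This is a behavioral model only: it ignores pipeline latency, and the assumption that element m holds coefficients h[5m] through h[5m+4] is inferred from the grouping in Figure 3 rather than stated in the text.

```python
def direct_fir_sample(x, h, n):
    """Reference inner product for output sample n (x[k] = 0 for k < 0)."""
    return sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)


def semi_parallel_sample(x, h, n, multipliers=4):
    """Compute y[n] over N/M cycles with M multiply-add elements and an accumulator."""
    cycles = len(h) // multipliers            # 20 / 4 = 5 cycles per result
    acc = 0
    for cycle in range(cycles):
        # Adder chain: each element adds its product to the running partial sum.
        partial = 0
        for m in range(multipliers):
            i = m * cycles + cycle            # coefficient used by element m this cycle
            if n - i >= 0:
                partial += h[i] * x[n - i]
        # Accumulator slice: load on the first cycle, accumulate on the rest.
        acc = partial if cycle == 0 else acc + partial
    return acc


x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3] * 3
h = list(range(1, 21))
print(all(semi_parallel_sample(x, h, n) == direct_fir_sample(x, h, n)
          for n in range(len(x))))
```

The load on the first cycle plays the role of the single-cycle op-mode change described above: it discards the previous result instead of adding to it.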

The Control Logic

The control is the most important and complicated aspect of semi-parallel FIR filters; getting it right is crucial to filter operation. Because the XtremeDSP slice is most efficiently used in adder chains, memory addressing is necessary to provide the delay for each multiply-add element that the adder chain causes. Figure 4 illustrates the control logic required to create memory addressing.

The counter creates the fundamental zero through four count. This is then delayed by one cycle by the use of a register in the control path. Each successive delay is used to address both the coefficient memory and the data buffer – and their respective multiply-add elements. Hence, a single delay is required for the second multiply-add element, two delays for the third multiply-add element, and so on. Note that this is extensible control logic for M number of multipliers.
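The delayed addressing can be tabulated with a short sketch: element m sees the counter value from m cycles earlier. The function name and the use of `None` for cycles before the count has propagated are illustrative assumptions.

```python
def control_schedule(multipliers=4, cycles_per_result=5, total_cycles=10):
    """Return, per clock cycle, the address seen by each multiply-add element.

    Element m sees the counter value delayed by m cycles; None marks cycles
    before the delayed count has reached that element.
    """
    schedule = []
    for t in range(total_cycles):
        row = []
        for m in range(multipliers):
            row.append((t - m) % cycles_per_result if t >= m else None)
        schedule.append(row)
    return schedule


for t, row in enumerate(control_schedule()):
    print(t, row)
```

Adding a fifth multiplier only adds one more column with one more cycle of delay, which is what makes the control extensible.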

Figure 4 also shows write enable sequencing. A relational operator is required to determine when the count-limited counter resets its count. This signal is high for one clock cycle every five cycles, reflecting the input and output data rates. The clock enable signal is delayed by a single register just like the coefficient address; each delayed version of the signal is tied to the respective section of the filter.

The filter and control logic are extremely cascadable. The address for each SRL16E data buffer and coefficient memory pair is identical within the pair and is a delayed version of the previous element’s address.

The performance and resource utilization for our filter is specified in Table 1. In the table, you can see how logic slice utilization dramatically drops when using the XtremeDSP slice. Clock frequency performance approximately doubles over Virtex-II Pro FPGAs.


Figure 3 – The four-multiplier semi-parallel systolic FIR filter (diagram labels: DSP48 slices with opmode = 0000101, opmode = 0010101, and opmode = 0010010; input x(n) and output y(n); bus widths 18 and 40; coefficients h0 through h19)

Figure 4 – Control logic for the four-multiplier semi-parallel FIR filter (diagram labels: counter 0 -> (N/M-1); compare = (N/M-2); chain of registers with D, Q, and CE; coefficient and data buffer 0 with Address and WE; delayed Address outputs; WE1, WE2, WE3, LOAD; Z-5)

| Four-Multiplier 20-Tap Semi-Parallel FIR Filter, 18-Bit Data, 18-Bit Coefficients | Virtex-4 (-11) | Virtex-II Pro (-7) |
|---|---|---|
| Logic Slices | 108 | 309 |
| XtremeDSP Slices | 5 | - |
| Embedded Multipliers | - | 7 |
| Performance (Sample Rate) | 90 MHz | 77 MHz |
| Performance (Clock Frequency) | 450 MHz | 231 MHz |

Table 1 – Resource utilization and performance of four-multiplier 20-tap semi-parallel FIR filter


Three Important Design PointsThis new filter architecture, along withVirtex-4 devices and the XtremeDSP slice,addresses the demanding needs of current andfuture DSP designs. However, it is only onefilter in an extremely large array of possibleimplementations, not to mention other DSPfunctions such as IIRs, FFTs, and DCTs.

Knowing this, you can take away threevery important design questions that willenable you to exploit the XtremeDSP sliceand Virtex-4 device as designed.

1. Is the design running as fast as possible?

The fastest speed grade (-12) shouldrun at 500 MHz. If your design isrunning at 50 MHz, you’ve got theroom to reduce your resource utiliza-tion by increasing performance (andreducing cost) by making more effi-cient use of the FPGA resources. Thefaster a particular function operates,the smaller it becomes. Our semi-parallel FIR filter, for example, usedfive XtremeDSP slices running at 375MHz instead of 20 XtremeDSP slicesrunning at 74.25 MHz.

2. Are there any XtremeDSP slices left?

If you are not using them all up, youcan probably add some functionality.This can lead to logic slice reductionand lower power consumption.

3. Are you using adder chains instead of adder trees?

DSP algorithms must aim to exploit adder chain-based implementations wherever possible, as this will lead to the best utilization of the XtremeDSP slice. Such implementations will result in performance gains, power reduction, and logic slice reduction.
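The arithmetic behind the first and third questions can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function names multipliers_needed and fir_adder_chain are hypothetical, rates are passed in kHz so that 74.25 MHz is an exact integer, and the adder chain is modeled in software rather than in XtremeDSP slices.

```c
#include <assert.h>
#include <stddef.h>

/* Multipliers needed when one multiplier is time-shared across several taps:
 * ceil(taps * sample_rate / clock_rate). Rates are in kHz so that
 * 74.25 MHz can be passed exactly as 74250. */
unsigned multipliers_needed(unsigned taps, unsigned sample_khz, unsigned clock_khz)
{
    assert(taps > 0 && sample_khz > 0 && clock_khz >= sample_khz);
    unsigned long long macs = (unsigned long long)taps * sample_khz;
    return (unsigned)((macs + clock_khz - 1) / clock_khz);
}

/* Adder chain: every stage adds its own product to the partial sum handed
 * on by the previous stage, so no separate adder tree is required. */
long long fir_adder_chain(const int *coeff, const int *sample, size_t taps)
{
    long long partial = 0;
    for (size_t k = 0; k < taps; ++k)
        partial += (long long)coeff[k] * sample[k];
    return partial;
}
```

For the filter above, multipliers_needed(20, 74250, 375000) evaluates to 4. Adding one further slice, assumed here to be the accumulator, is consistent with the five XtremeDSP slices quoted in Table 1.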

Conclusion

For more information, see the XtremeDSP Slice Design Considerations User Guide, which provides in-depth details on other filter implementations and DSP functions, at www.xilinx.com/bvdocs/userguides/ug073.pdf. There are also other HDL and System Generator for DSP reference designs to get you started.



Designing For Signal Integrity

You can use the Xilinx/Ansoft 10 Gbps Backplane Design Kit to predict interconnect performance.

by Suresh Sivasubramaniam
Senior Design Engineer
Xilinx, Inc.
[email protected]

Lisa Murphy
Application [email protected]

The Xilinx® Virtex-4™ FX family of devices contains up to 24 RocketIO™ multi-gigabit transceivers, each capable of operating anywhere from 622 Mbps to 11 Gbps. This seamless scalability, coupled with support for various emerging standards (Figure 1), allows you tremendous flexibility to upgrade today’s designs to meet increasing bandwidth requirements.

To realize the full potential of this upgradeability to high-bandwidth processing applications, you must carefully design the serial interconnect channels on the PCB, be it line card or backplane.

Once the transfer characteristics of the physical channel are well understood, you can effectively employ features such as transmit pre-emphasis/voltage swing and receive equalization (Figure 2) to overcome losses and attenuation in the channel, thus ensuring high signal integrity at the receiver.

Figure 1 – Seamless scaling from 622 Mbps to 10 Gbps
(Chart of serial standards covered by the Virtex-II Pro, Virtex-II Pro X, Rocket PHY, and Virtex-4 families over rates from 0.622 to 11 Gbps. Storage: 1GFC, 2GFC, 4GFC, 8GFC, 10GFC, SATA, SATA 2, SATA 3. Networking: GbE, XAUI, CEI (OIF), 10GbE. Telecom: OC-12, OC-48, OC-192. Computing: GbE, SATA, PCIE, SATA 2. Video: HD-SDI.)

Figure 2 – Programmable pre-emphasis and equalization features in the Virtex-4 FX family
• Programmable Termination (Yes) – Reduces reflections
• Programmable Voltage Swing (Yes) – Reduces power
• Transmit Pre-Emphasis (Yes) – Equalizes simple channels
• Integrated AC Coupling (Yes) – Direct interface to other devices, reduces component count
• Receive Equalization (Linear and DFE) – Equalizes stringent channel; allows legacy backplanes to be upgraded
• Automatic EQ Settings Algorithm (Yes) – Automatically finds optimum EQ setting for a given channel; eases design and ensures signal integrity

MK322 Evaluation Board Case Study

The MK322 platform is the primary board used for the electrical evaluation and characterization of the RocketIO X high-speed serial multi-gigabit transceivers in Virtex-II Pro™ X FPGAs. This board was specifically designed to evaluate and test the RocketIO X transceiver and is available for sale.

The SMA connectors on the board allow you to interface the board to a scope or to other boards, or to run loopback tests. The physical channel for each transceiver is carefully optimized to ensure the highest signal quality at the SMAs (on the transmit path) or at the FPGA (on the receive path).

The data can significantly degrade after it has passed through the transmission path. Degradation includes loss of signal amplitude, slower signal edges (longer rise time), and a spreading at the zero crossings. It is critical to model the transmission path when designing a high-performance, high-speed serial interconnect system. The transmission path may include long transmission lines, connectors, vias, and crosstalk from adjacent interconnect.

MK322 Board Stackup

The MK322 is a 12-layer board. The stack and trace geometries are designed for 100 Ohm differential and 50 Ohm single-ended signaling. The board material is standard FR4 (Er = 4.2 and tanδ = 0.02). All trace and plane layers are 0.5 oz. copper (0.65 mil thick). The electrical channel of interest for our case study is routed as follows: microstrip on the top layer, transitioning to layer 10 stripline through a GSSG differential via.
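As a rough illustration of what the quoted dielectric constant implies, the propagation delay of the stripline segment can be estimated from Er alone. This assumes an ideal stripline fully embedded in a homogeneous dielectric; the symbols t_pd and c are introduced here, and the microstrip segment, which is partly in air, would be somewhat faster.

```latex
t_{pd} = \frac{\sqrt{\varepsilon_r}}{c}
       = \frac{\sqrt{4.2}}{2.998 \times 10^{8}\ \mathrm{m/s}}
       \approx 6.84\ \mathrm{ns/m}
       \approx 174\ \mathrm{ps/in}
```

At 10 Gbps one bit lasts 100 ps, so under these assumptions each inch of stripline holds a little under two bits in flight.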

Differential Signal Topology

The differential signals are routed into and out of the board using Rosenberger™ high-performance coax-to-board SMA connectors. The signals are routed from the top-mounted connector to the FPGA using stripline transmission lines (layer 10), which transition to microstrip before interfacing with the FPGA BGA package. The actual trace layout for one Tx and Rx pair is shown in Figure 3.

Modeling and Simulation

The electrical channel comprises five main sections (Figure 4):

• The BGA package

• Microstrip transmission line

• Differential via (GSSG configuration; G = ground, S = signal)

• Stripline transmission line

• Connector

Let’s look at each piece in turn.

BGA Package

The package model and the specific Tx pair of interest were extracted from the Cadence™ APD database and simulated using Ansoft HFSS. Figure 5 is a plot of the differential insertion loss (red) and return loss (blue) as computed by Ansoft HFSS.

For this particular differential pair, return loss is better than 15 dB, up to 22 GHz. Ansoft HFSS can output the differential S-parameters as Touchstone files. Typically, companies are reluctant to give out their package databases except under an NDA, because they contain sensitive design information. However, you can use S-parameters derived from the model for channel simulations.
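To put the 15 dB figure in linear terms, return loss can be converted to a reflection coefficient magnitude and a reflected power fraction. This is the standard definition rather than anything specific to the MK322 models; RL and Γ are symbols introduced here.

```latex
|\Gamma| = 10^{-RL/20} = 10^{-15/20} \approx 0.178,
\qquad
|\Gamma|^{2} = 10^{-RL/10} = 10^{-1.5} \approx 0.032
```

A 15 dB return loss therefore means that roughly 3% of the incident power is reflected at the package.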

Figure 3 – Physical structure of a Tx and Rx differential pair on the MK322 board

Figure 4 – The individual pieces comprising the full channel

Figure 5 – Package model insertion loss (red) and return loss (blue) as computed by Ansoft HFSS

Microstrip and Stripline Interconnect

We performed simulations for the stripline and microstrip structures using the two-dimensional quasistatic finite element simulator within Ansoft SI 2D Extractor. The stripline geometries were designed to provide nominally 100 Ohms differential impedance. Simulations confirmed that the impedance was within 7% of the nominal value (see Figure 6).

You can model PCB interconnects using various methods within Ansoft Designer™. The simplest is to use a coupled-line circuit model (like those found in popular high-frequency circuit simulators such as Ansoft Designer). In this instance, the interconnect is modeled with a uniform differential coupled transmission line without any discontinuities. On the other end of the modeling spectrum is the utilization of a full-wave planar EM field simulator based on the method of moments (MoM). Although accurate, MoM simulations are also the most computationally expensive method to predict interconnect performance.

A compromise that offers the accuracy of planar EM simulations with some of the speed of circuit simulation is offered by using a combination of the two. Figure 7 provides a comparison of the simulation results using the three different methods. As you can see in the figure, all methods predict similar performance. For an extended discussion of the trade-offs of the different approaches, please refer to the white paper accompanying the kit, available on the Xilinx SI Central website.

In addition, we parameterized each of the interconnect models. For example, in the microstrip interconnect model, the width, spacing, metal thickness, and physical length are parameters that can vary. For the initial simulations, these values were set to geometries specific to the MK322 board.

Differential Via

In keeping with good design practices that minimize unterminated stubs, layer 10 was used to transition from the microstrip to stripline using the through-hole differential via. The actual geometries for the ground-signal-signal-ground configuration were taken from Appendix D of the XFP specification (see pages 160-163 of the specification).

Several key variables for the via are parameterized, including spacing between signal vias, via radius, and antipad radius. Simulation results for the differential via structure are shown in Figure 8. The via structure shows excellent broadband insertion and return loss (> -10 dB) well beyond 20 GHz.

SMA Connector

The SMA connector used on the MK322 board is manufactured by Rosenberger (Part # 32K153-400). Rosenberger was gracious enough to provide us with the HFSS model for the connector, along with the optimized PCB footprint. The critical parameters for optimization involve the pad and antipad radii, as well as placement and spacing of several ground return vias around the center conductor. The ground vias around the center conductor allow the signal to transition from a radial coaxial field to a transverse electromagnetic mode (TEM) transmission line field in such a way that it minimizes any impedance mismatches. Figure 9 shows the insertion and return loss (> -10 dB up to 12 GHz) for the optimized SMA launch.

Full Channel Simulation

It is possible to cascade results generated from EM and circuit simulations on each of the individual components to get a full system simulation. Figure 10 is a snapshot of the schematic of the full channel, from the SMA connector, through the board to the Xilinx Virtex-II Pro X BGA package, set up for frequency domain analysis.

Figure 11 is a plot of the system simulation results displaying the insertion and return loss up to 40 GHz. As expected, the channel has a response similar to a low-pass filter. The majority of the energy for a baseband digital binary signal is contained within the first null of its power spectrum. For the rise time and signaling rate of this channel (30 ps, 10 Gbps), we are most concerned with the response up to 17 GHz. As seen in the plot, the insertion loss is roughly -10 dB and the return loss is below -10 dB up to 17 GHz.
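One common rule of thumb that is consistent with the 17 GHz figure places the bandwidth of interest for a digital edge at half the reciprocal of its rise time. This is an assumption; the article does not say how the number was derived. The symbols f_knee, f_null, t_r, and T_b are introduced here for illustration.

```latex
f_{\mathrm{knee}} \approx \frac{0.5}{t_r} = \frac{0.5}{30\ \mathrm{ps}} \approx 16.7\ \mathrm{GHz},
\qquad
f_{\mathrm{null}} = \frac{1}{T_b} = \frac{1}{100\ \mathrm{ps}} = 10\ \mathrm{GHz}
```

The first spectral null of a 10 Gbps NRZ stream falls at 10 GHz, while the 30 ps edges push meaningful energy out to roughly 17 GHz.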

You can also perform time domain simulations (see Figure 12) using the system simulator in Ansoft Designer. This simulator uses a convolution algorithm to process the frequency domain channel data with user-defined input bitstreams. Insertion and return loss are included in the simulation.
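The idea of convolving a bitstream with a channel response can be sketched with the textbook discrete convolution below. This is not the Ansoft Designer algorithm, whose internals the article does not describe; the function name convolve and any impulse response passed to it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Discrete convolution: out[n] = sum over k of h[k] * x[n - k].
 * x   : input samples (for example, an oversampled bitstream)
 * h   : channel impulse response (hypothetical values in this sketch)
 * out : caller-provided buffer holding x_len + h_len - 1 samples */
void convolve(const double *x, size_t x_len,
              const double *h, size_t h_len, double *out)
{
    assert(x_len > 0 && h_len > 0);
    for (size_t n = 0; n < x_len + h_len - 1; ++n) {
        double acc = 0.0;
        for (size_t k = 0; k < h_len; ++k) {
            /* Only accumulate where x[n - k] is inside the input. */
            if (n >= k && n - k < x_len)
                acc += h[k] * x[n - k];
        }
        out[n] = acc;
    }
}
```

A channel whose impulse response has a long tail smears each bit into its neighbors, which is what closes the simulated eye.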

An ideal 10 Gbps pseudo-random bit source with a 0.5V p-p amplitude and 30 ps rise time was applied to the channel.


Layer   B        W       S       Zse     Zd      Zcomm
S10     20.650   6.750   7.500   54.73   93.64   31.32

εr = 4.2, tanδ = 0.02. All dimensions are in mils.

Figure 6 – Impedance for the stripline traces as extracted using Ansoft SI 2D Extractor
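The "within 7%" statement can be checked directly against the extracted differential impedance in Figure 6. The sketch below is illustrative only; the function names are hypothetical, and the reflection coefficient assumes a simple junction between a 100 Ohm nominal line and the extracted 93.64 Ohm line.

```c
#include <assert.h>

/* Percent deviation of an extracted impedance from its nominal value. */
double impedance_deviation_pct(double z, double z_nominal)
{
    assert(z_nominal > 0.0);
    return 100.0 * (z - z_nominal) / z_nominal;
}

/* Reflection coefficient for a wave travelling from a line of impedance
 * z_source into a line (or load) of impedance z_load. */
double reflection_coefficient(double z_load, double z_source)
{
    assert(z_load + z_source > 0.0);
    return (z_load - z_source) / (z_load + z_source);
}
```

With z = 93.64 and z_nominal = 100, the deviation is about -6.4%, and the corresponding reflection coefficient is about -0.033.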

Figure 7 – A comparison of the three methods to simulate interconnects

Figure 8 – Differential S-parameters for the via as computed by Ansoft HFSS

Figure 9 – Differential S-parameters for the SMA connector

Figure 10 – Schematic of the full channel setup for frequency domain analysis within Ansoft Designer

Figure 11 – Insertion and return loss for the full channel


The channel was terminated in single-ended 50 Ohm impedances. The resulting eye diagram is shown in Figure 13, along with a measured eye diagram. There is excellent correlation between the measurement and simulation results. A very clear and open eye is achieved, as is expected from the frequency domain results.

For comparison to the measured eye, the driver capacitance was added to the channels. These capacitors are not part of the package model, because the passive channel will eventually be used with actual driver/receiver models that already include the capacitance. No pre-emphasis was used in the simulation. It should be anticipated that some pre-emphasis would sharpen up the time-domain response.

Extension of the Methodology

In creating the models, we emphasized that the critical variables that make up the physical structure are parameterized. Why parameterize? Although there are many reasons for doing so, let’s show through some examples the power and utility of models that allow manipulation of critical variables.

A Longer Stripline Segment

In the original model, the nominal length for the stripline segment of the channel is 2.5 in. For whatever reason (board routing congestion is an obvious one), suppose that the stripline segment now needed to be 5 in. You can easily investigate the channel performance for this new scenario by changing the physical length variable (SL_L) in the model. Examples of such an analysis, for various trace lengths, are shown in Figure 14.
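A first-order feel for why doubling the stripline length hurts can be had from the dielectric loss alone. The estimate below assumes the FR4 values quoted for the MK322 (Er = 4.2, tanδ = 0.02), evaluates the loss at 5 GHz (the fundamental of an alternating 10 Gbps pattern, chosen here for illustration), and ignores skin-effect and via losses, so it understates the total. The symbol α_d is introduced here.

```latex
\alpha_d = 20\log_{10}(e)\,\frac{\pi f \sqrt{\varepsilon_r}\,\tan\delta}{c}
         \approx 8.686 \times \frac{\pi\,(5\times10^{9})\,\sqrt{4.2}\,(0.02)}{2.998\times10^{8}}
         \approx 18.7\ \mathrm{dB/m} \approx 0.47\ \mathrm{dB/in}

\alpha_d \times 2.5\ \mathrm{in} \approx 1.2\ \mathrm{dB},
\qquad
\alpha_d \times 5\ \mathrm{in} \approx 2.4\ \mathrm{dB}
```

Loss in dB scales linearly with length, so every added inch costs the same number of dB at a given frequency.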

Increasing the length of the stripline segments results in significant eye degradation. Because every component of the channel is parameterized, you can explore the performance impact of different variables in each section of the channel when investigating design trade-offs. In fact, with exactly this intent in mind, we have made these models available as a Xilinx/Ansoft 10 Gbps Backplane Design Kit at www.gigabitbackplanedesign.com. Complete details on each of the models and the parameterized variables are available at this site.

Conclusion

Modern platform FPGA devices provide wide bandwidth processing and high-speed I/O. Serial I/O with speeds in the gigabit realm creates new challenges for PCB designers.

Models associated with this effort have been assembled into a 10 Gbps backplane design kit that you can use to predict performance of circuit board designs.

The design kit is available on the Xilinx “SI Central” website, enabling you to rapidly evaluate your own board designs. Visit www.gigabitbackplanedesign.com for more information.


Figure 12 – Schematic showing setup for time-domain simulations

Figure 13 – Simulated (left) and measured (right) eye diagram for the full channel; the simulated eye is in excellent agreement with measurements

Figure 14 – Channel performance degrades due to losses in the transmission line as the trace length increases


Accelerated System Performance with APU-Enhanced Processing

The Auxiliary Processor Unit (APU) controller is a key embedded processing feature in the Virtex-4 FX family.

by Ahmad Ansari
Senior Staff Systems Architect
Xilinx, [email protected]

Peter Ryser
Manager, Systems Engineering
Xilinx, [email protected]

Dan Isaacs
Director, APD Embedded Marketing
Xilinx, [email protected]

The APU controller provides a flexible high-bandwidth interface between the reconfigurable logic in the FPGA fabric and the pipeline of the integrated IBM™ PowerPC™ 405 CPU. Fabric co-processor modules (FCM) implemented in the FPGA fabric are connected to the embedded PowerPC processor through the APU controller interface to enable user-defined configurable hardware accelerators. These hardware accelerator functions operate as extensions to the PowerPC 405, thereby offloading the CPU from demanding computational tasks.

APU Instructions

The APU controller allows you to extend the native PowerPC 405 instruction set with custom instructions that are executed by the soft FCM; the primary capabilities are shown in Figure 1. This provides a more efficient integration between an application-specific function and the processor pipeline than is possible using a memory-mapped coprocessor and shared bus implementation.

The instructions supported by the APU are classified into three main categories:

• User-defined instructions (UDI)

• PowerPC floating-point instructions

• APU load/store instructions

The UDIs are programmed into the controller either dynamically through the PowerPC 405 device control register (DCR) or statically when the FPGA is configured through its bitstream. The APU controller allows you to optimize your system architecture by decoding instructions either internally or in the FCM.

The floating-point unit (FPU) is an example of an FCM. The PowerPC floating-point instruction set is decoded in the APU controller, whereas the computational functionality is implemented in the FPGA fabric. To support FPUs with different complexities, the APU controller allows you to select subgroups of the PowerPC floating-point instructions. These instructions are executed in the FCM while other subgroups of instructions are either computed through software FPU emulation or ignored completely. This fine-tuning optimizes FPGA resources while accelerating the most critical calculations with dedicated logic.

The APU controller also decodes high-performance load and store instructions between the processor data cache or system memory and the FPGA fabric. A single instruction transfers up to 16 bytes of data – four times greater than a load or store instruction for one of the general purpose registers (GPR) in the processor itself. Thus, this capability creates a low-latency and high-bandwidth data path to and from the FCM.

APU Controller Operation

Figure 2 identifies the key modules of the APU controller and the 405 CPU in relation to the FCM soft coprocessor module implemented in FPGA logic. To explain the operation of the APU controller and the processor interactions related to the execution units in soft logic, we can trace the step-by-step sequence of events that occur when an instruction is fetched from cache or memory.

Once the instruction reaches the decode stage, it is simultaneously presented to both the CPU and APU decode blocks. If the instruction is detected as a CPU instruction, the CPU will continue to execute the instruction as it would normally. Otherwise, within the same cycle, the CPU will look for a response from the APU controller. If the APU controller recognizes the instruction, it will provide the necessary information back to the CPU.

If the APU controller does not respond within that same cycle, an invalid instruction exception will be generated by the CPU. If the instruction is a valid and recognized instruction, the necessary operands are fetched from the processor and passed to the FCM for processing.
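The decode-stage decision described above can be summarized as a small piece of C. This is a behavioral sketch only, not a model of the actual hardware signals; the type decode_result and the functions decode_instruction and decode_result_name are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    EXEC_ON_CPU,          /* native PowerPC 405 instruction */
    EXEC_ON_FCM,          /* claimed by the APU controller, operands go to the FCM */
    INVALID_INSTRUCTION   /* nobody claimed it within the decode cycle */
} decode_result;

/* Both decoders see the instruction in the same cycle. The CPU decode has
 * priority; the APU controller's response is consulted only otherwise. */
decode_result decode_instruction(bool cpu_recognizes, bool apu_responds_same_cycle)
{
    if (cpu_recognizes)
        return EXEC_ON_CPU;
    if (apu_responds_same_cycle)
        return EXEC_ON_FCM;
    return INVALID_INSTRUCTION;
}

/* Human-readable name for a decode outcome, useful when tracing. */
const char *decode_result_name(decode_result r)
{
    switch (r) {
    case EXEC_ON_CPU:         return "executed by CPU";
    case EXEC_ON_FCM:         return "forwarded to FCM";
    case INVALID_INSTRUCTION: return "invalid instruction exception";
    }
    assert(!"unreachable decode_result");
    return "";
}
```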

Because the PowerPC processor and the FCM reside in two separate clock domains, synchronization modules of the APU controller manage the clock frequency difference. This allows the FCM to operate at a slower frequency than the processor. In this instance, the APU controller would receive the resultant data from the coprocessor and at the proper execution time send the data back to the processor. The APU controller knows in advance, based on instruction type, if or when it will get the result.

Figure 1 – APU expanded processing capabilities
• Extends PPC 405 Instruction Set – Floating Point Support (with soft auxiliary processor), User-Defined Instructions
• Offloads CPU-Intensive Operations – Matrix Calculations, Video Processing, Floating-Point Mathematics, 3D Data Processing
• Direct Interface to HW Accelerators – High Bandwidth, Low Latency

Figure 2 – APU controller processing operative block diagram
(Block diagram of the processor block: the 405 core pipeline with fetch, decode, execute, and write-back stages; the APU controller with decode control, decode registers, APU decode, pipeline control, and buffers and synchronization; and the soft coprocessor module with optional decode, execution units, and register file.)

Autonomous and Non-Autonomous Instructions

Two major categories of instructions exist: autonomous and non-autonomous. For autonomous instructions, the CPU continues issuing instructions and does not stall while the FCM is operating on an instruction. This overlap of execution allows you to achieve high performance through techniques such as software pipelining.

For non-autonomous instructions, on the other hand, execution is synchronized: the CPU pipeline stalls while the FCM is operating on an instruction. This feature allows you to implement synchronization semantics to pace the software execution with the hardware FCM latency.

Non-autonomous instruction types are further divided into blocking and non-blocking. If blocking, asynchronous exceptions or interrupts are blocked until the FCM instruction completes. Otherwise, if non-blocking, the exception or interrupt is taken and the FCM is flushed.

Software Description

Software engineers can access the FCM from within assembler or C code. On one side, Xilinx has enabled the GCC compiler (which is contained in the Embedded Development Kit) to generate code that uses an FCM floating-point unit to calculate floating-point operations. Furthermore, assembler mnemonics are available for UDIs and the pre-defined load/store instructions, enabling you to place hardware-accelerated functions into the regular program flow. For the ultimate level of flexibility, you can define your own instructions designed specifically for the hardware functionality of the FCM.

You can easily use the pre-defined load/store instructions through high-level C macros. For example, in an application where the FCM is used to convert pixel data into the frequency domain, 8 pixels of 16 bits are transferred from main memory to an FCM register with a simple program:

unsigned short pixel_row[8]; // 8 pixels, each pixel has a size of 16 bits

lqfcm(0, pixel_row); // transfer a row of pixels to FCM register zero

The quadword load operation maintains cache coherency as the data is moved through the cache, if caching is enabled for the corresponding address space.

The FCM operation on the pixel data can start on an explicit command; for example, a UDI. However, for many applications the operation starts immediately after the FCM hardware detects the completion of the load instruction.

The latter approach has many advantages:

• Simple software – A load operation moves the data from the memory to the FCM and starts the operation. A subsequent store instruction retrieves the result of the operation and stores it back to main memory.

• High data transfer rates – Quadword load and store operations take just a few cycles to complete. A single operation moves 16 bytes within that timeframe.

• Low latency – FCM load operations are simple to use. The processor completes the operation in a single cycle.

RISC architectures apply a number of simple instructions to data stored in general-purpose registers (GPR) to compute complex operations. User-defined instructions fall into this category but take the concept a step further in that the system architect defines the complexity of the operation on data stored in GPRs and FCM registers (FCR). Again, from a software point of view, the engineer codes user-defined instructions through C macros. GCC recognizes mnemonics such as udi0fcm as a user-defined operation of the general form:

udi0fcm <FCRT5/RT5>, <FCRA5/RA5/imm>, <FCRB5/RB5/imm>

The target of the operation is either a GPR or an FCR. The operands are GPRs, FCRs, immediate values, or a combination. As you can see, the semantics are not defined by the instruction and depend on your intentions and the implementation in the FCM.

This code sequence demonstrates the use of a user-defined instruction as an example of a complex add operation:

struct complex {
    int r, i; // 32-bit integers for real and imaginary parts
};

struct complex a, b, r;

ldfcm(0, &a); // load complex number a into FCM register 0
ldfcm(1, &b); // load complex number b into FCM register 1
udi0fcm(2, 1, 0); // udi0fcm computes r = a + b, where r is stored in FCM register 2
stdfcm(&r, 2); // store complex result from FCM register 2 to variable r

To increase the readability of the code, you can redefine the user-defined instruction with regular C preprocessor constructs. Instead of using the udi0fcm() macro, you can redefine it to a more comprehensible complex_add() macro with #define complex_add(r, a, b) udi0fcm(r, a, b) and change the listing to call complex_add(2, 1, 0) instead of udi0fcm(2, 1, 0).
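When the hardware is not at hand, the same call sequence can be exercised against a plain C stand-in. The sketch below is a hypothetical host-side model, not the Xilinx macros: model_ldfcm, model_udi0fcm, and model_stdfcm are invented names that mimic the argument order used in the listing, and the four-entry register file is an arbitrary choice.

```c
#include <assert.h>

struct complex {
    int r, i;   /* 32-bit integers for real and imaginary parts */
};

/* Hypothetical software stand-in for the FCM register file. */
enum { FCM_REGS = 4 };
static struct complex fcm_reg[FCM_REGS];

/* Models ldfcm(reg, addr): copy a complex value into an FCM register. */
void model_ldfcm(int reg, const struct complex *src)
{
    assert(reg >= 0 && reg < FCM_REGS);
    fcm_reg[reg] = *src;
}

/* Models udi0fcm(rt, ra, rb) when the FCM implements a complex add. */
void model_udi0fcm(int rt, int ra, int rb)
{
    assert(rt >= 0 && rt < FCM_REGS);
    assert(ra >= 0 && ra < FCM_REGS);
    assert(rb >= 0 && rb < FCM_REGS);
    struct complex sum = { fcm_reg[ra].r + fcm_reg[rb].r,
                           fcm_reg[ra].i + fcm_reg[rb].i };
    fcm_reg[rt] = sum;
}

/* Models stdfcm(addr, reg): copy an FCM register back to memory. */
void model_stdfcm(struct complex *dst, int reg)
{
    assert(reg >= 0 && reg < FCM_REGS);
    *dst = fcm_reg[reg];
}

/* Readable alias, in the spirit of the complex_add() macro above. */
#define complex_add(rt, ra, rb) model_udi0fcm(rt, ra, rb)
```

Keeping the model's argument order identical to the real macros means that application code can be switched between the two with a single set of #define lines.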

Therefore, system architects can partition their tasks into hardware- and software-executed pieces that are efficiently and precisely interfaced to one another through the use of the APU controller. This partitioning can be done statically during the initial system configuration or dynamically during the program execution. Using the direct processor/FPGA coupling presented by the APU controller and its high throughput interfaces, hardware/software synchronization is greatly simplified and performance significantly improved.

Accelerating System Performance

The following examples showcase key advantages the APU provides based on two different scenarios. The first scenario is essentially a benchmarking comparison of a finite impulse response (FIR) filter using a soft FPU core, implemented as an FCM attached directly to the APU controller (as compared to software emulation used to calculate the filter function). The second scenario implements a two-dimensional inverse discrete cosine transform (2D-IDCT) typically used as one of the processing blocks in MPEG-2 video decompression, again compared to emulating the 2D-IDCT function in software.

The two use cases are different in that the FPU implements a set of registers in the FPGA fabric upon which the FPU instructions operate. The 2D-IDCT only requires load and store operations, while the functionality of the operation on the data stream is fixed. In either case the operations are complex enough to justify offloading into the FPGA fabric.

Thus, the combination of using the APU and FPGA hardware acceleration clearly provides a significant performance advantage over software emulation – or the conventional method involving the processor and processor local bus architecture with a soft co-processing function.

FIR Filter

The implementation of floating-point calculations in hardware yields an improvement by a factor of 20 over software emulation. Connecting the FPU as an FCM to the APU controller improves performance because the latency to access the floating-point registers is reduced and dedicated load and store instructions move the operands and results between the FPU registers and the system memory.


Figure 3 – Utilizing APU to decode pixel data for display output
(MPEG decode flow exploiting spatial redundancy: an 8 x 8 block of pixel DCT values is decoded with the IDCT into pixel amplitude values. APU function: decompresses encoded pixel data for output display; utilizes FPGA resources with less overhead logic and fast data transfer.)


2D-IDCT

The 2D-IDCT transforms a block of 8 x 8 data points from the frequency domain into pixel information. A high-level diagram depicting the pixel decode by the APU controller, along with advantages, is shown in Figure 3. In this example, each data point has a resolution of 12 bits and is represented as a 16-bit integer value. The data structure is defined where each row of 8 pixels consumes 16 bytes. This is an ideal size that allows optimal use of the FCM load and store instructions described earlier. In other words, eight FCM quadword load instructions are needed to load a data block into the 2D-IDCT hardware. Eight FCM quadword store instructions are sufficient to copy the pixel data back into the system memory.
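The transfer counts follow from simple arithmetic, sketched below. The helper transfers_needed and the enum names are hypothetical; the 4-byte case is included only to contrast quadword transfers with ordinary 32-bit loads.

```c
#include <assert.h>

/* Geometry of one 2D-IDCT block as described in the text. */
enum { IDCT_ROWS = 8, IDCT_COLS = 8, BYTES_PER_POINT = 2 };
enum { IDCT_BLOCK_BYTES = IDCT_ROWS * IDCT_COLS * BYTES_PER_POINT };

/* Number of load (or store) instructions needed to move block_bytes
 * when each instruction moves bytes_per_transfer bytes (rounded up). */
unsigned transfers_needed(unsigned block_bytes, unsigned bytes_per_transfer)
{
    assert(bytes_per_transfer > 0);
    return (block_bytes + bytes_per_transfer - 1) / bytes_per_transfer;
}
```

One block is 128 bytes, so transfers_needed(IDCT_BLOCK_BYTES, 16) gives the eight quadword loads mentioned above, whereas 32-bit loads would need 32 instructions for the same data.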

The calculation of the 2D-IDCT in the FCM starts immediately after the first load, and the pixel data is available shortly after the last load operation. As shown in Figure 4, the 2D-IDCT makes use of the new XtremeDSP™ slices in the Virtex-4 architecture that offer multiply-and-accumulate functionality.

A software-only implementation of a 2D-IDCT takes 11 multiplies and 29 additions together with a number of 32-bit load and store operations, while the hardware-accelerated version takes 8 load and 8 store operations. The reduced number of operations results in a speed-up of 20X in favor of a 2D-IDCT FCM attached through the APU controller.

By comparison, if you connect the 2D-IDCT hardware block to the processor local bus, as is done conventionally, the system performance will be reduced. The increased latency is mainly caused by the bus arbitration overhead and the large number of 32-bit load and store instructions. This is illustrated schematically in Figure 5.

Conclusion

The low-latency and high-bandwidth fabric coprocessor module interface of the APU controller enables you to accelerate algorithms through the use of dedicated hardware. Where operations are complex enough to justify the offloading into the FPGA fabric, or when acceleration of a specific algorithm is desired to achieve optimal performance, the combination of the APU controller and FPGA hardware acceleration provides a definitive performance advantage over software emulation or the conventional method of attaching coprocessors to the processor memory bus.

Generating the accelerated functions called by user-defined instructions is easily performed through GUI-based wizards. This functionality will be included in subsequent releases of the powerful Embedded Development Kit or Platform Studio.

If you are more comfortable working at the source code or assembly level, the APU controller allows you to define your own instructions written specifically for the hardware functionality of the FCM, or you can easily use the pre-defined load/store instructions through high-level C macros.

The APU controller provides a close coupling between the PowerPC processor and the FPGA fabric. This opens up an entire range of applications that can immediately benefit customers by achieving increases in system performance that were previously unattainable.

For additional details on the APU controller in Virtex-4-FX devices, including detailed descriptions and timing waveforms, refer to the Virtex-4 PowerPC 405 Processor Block Reference Guide at www.xilinx.com/bvdocs/userguides/ug018.pdf.


Figure 4 – Accelerated system performance with APU
Example: Video Application – MPEG De-Compression Algorithm
• Leverages Integrated Features – PowerPC, APU, XtremeDSP Blocks
• HW Acceleration Over Software – Lower Latency and High Bandwidth
• Efficient HW/SW Design Partitioning – Optimized Implementation
• Significant Performance Increase – Over 20X Performance Improvement Compared to Software Emulation

Figure 5 – Comparison of implementation models for 2D-IDCT
• Software Emulation – Software only; >200 lines code; several 100 instructions
• Processor w/Soft IDCT – Accelerated w/soft core; multiple load/store operations per IDCT
• APU Accelerated – APU w/XtremeDSP slices; single instruction execution; leverages APU and soft logic


by Ryan Carlson
Director of Marketing, High Speed Serial I/O
Xilinx, [email protected]

The industry is moving away from parallel buses and relatively slow differential signals toward higher speed differential signaling schemes. These high-speed signals solve many design challenges: they offer new levels of bandwidth, they lower overall system cost, and they make designs easier by addressing the skew issues of large parallel buses.

However, with these improvements comes a new challenge: maintaining signal integrity. As signals push the limits of the media across which they are transmitted, the challenge of dealing with signal impairments becomes non-trivial, to say the least. The new Xilinx® Virtex-4™ RocketIO™ transceivers have incorporated multiple new features designed to solve this challenge.

Frequency-Dependent Loss

Several factors contribute to the frequency-dependent loss of a typical channel. Figure 1 shows the frequency response of 1 m of FR-4 trace. Dielectric loss and skin effect combine to create a significant loss above 1 GHz. With today’s serial I/O standards approaching 10 Gbps, this loss becomes a critical design issue.

As a signal travels across a channel (like the one with a transfer function shown in Figure 1), a bit is degraded to the point where it interferes with neighboring bits; this is known as inter-symbol interference (ISI). Figure 2 shows the effect of ISI on a signal transmitted across a typical backplane channel. The high-frequency components are subject to losses that are greater than those of the low-frequency components. The edges that contain the high-frequency components are degraded, resulting in added jitter and eye closure. Additional techniques are needed to compensate for these losses.

Signal Integrity Features

The Virtex-4 RocketIO transceivers contain several features aimed at solving this problem. The first is transmit pre-emphasis. By modifying the signal before it is transmitted through a channel, transmit pre-emphasis can proactively compensate for some of the frequency-dependent loss of the channel.

Although most existing solutions use two-tap transmit pre-emphasis (addressing only the post-cursor ISI shown in Figure 2), the Virtex-4 RocketIO transceivers employ three-tap transmit pre-emphasis to address both pre- and post-cursor ISI. For signal rates above 3 Gbps, pre-cursor ISI becomes a non-negligible effect, and three taps of transmit pre-emphasis are needed to solve the problem.
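The idea behind three-tap pre-emphasis can be sketched as a small FIR filter applied to the transmitted symbols. This is a minimal illustration only: the function name and the tap weights below are assumptions chosen for readability, not RocketIO register settings, and the article does not describe how the transceiver implements its taps.

```python
def pre_emphasize(bits, pre=-0.1, main=0.7, post=-0.2):
    """Apply an illustrative three-tap FIR to bits mapped to +/-1 symbols.

    pre, main, and post are assumed example weights for the pre-cursor,
    cursor, and post-cursor taps; they are not device settings.
    """
    symbols = [1.0 if b else -1.0 for b in bits]
    shaped = []
    for n, current in enumerate(symbols):
        # Pre-cursor tap weights the symbol that follows the current one.
        following = symbols[n + 1] if n + 1 < len(symbols) else 0.0
        # Post-cursor tap weights the symbol that preceded the current one.
        previous = symbols[n - 1] if n >= 1 else 0.0
        shaped.append(pre * following + main * current + post * previous)
    return shaped


levels = pre_emphasize([0, 0, 1, 1, 1, 0])
```

In this sketch the first bit after a transition (index 2) is driven harder than a repeated bit (index 3), which is the boost of high-frequency content that pre-emphasis is meant to provide.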

In addition to transmit pre-emphasis, Virtex-4 RocketIO transceivers provide two different types of receive equalization. These options can be used in conjunction with transmit pre-emphasis to further improve signals degraded by lossy channels.

The first type of receive equalization works by amplifying the high-frequency components of the signal that have been attenuated by the channel (Figure 1). The transfer functions of this equalizer are programmable, and are shown in Figure 3.

The second type of receive equalization is called decision feedback equalization (DFE). This technique removes ISI effects by looking at consecutive bits and choosing the amount of equalization needed.
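A one-tap version of decision feedback can be sketched in a few lines. This is a textbook-style simplification under stated assumptions, not the RocketIO circuit: the channel model (a main cursor plus one post-cursor term), the helper names, and the coefficient values are all hypothetical.

```python
def channel(symbols, main=0.5, post=0.6):
    """Toy channel (assumed): each symbol leaks into the next one as post-cursor ISI."""
    received = []
    previous = 0.0
    for s in symbols:
        received.append(main * s + post * previous)
        previous = s
    return received


def slicer(samples):
    """Plain threshold decisions with no equalization."""
    return [1.0 if s >= 0 else -1.0 for s in samples]


def dfe_decide(samples, post_cursor):
    """One-tap DFE: subtract the ISI predicted from the previous decision, then slice."""
    decisions = []
    previous = 0.0
    for s in samples:
        corrected = s - post_cursor * previous
        decision = 1.0 if corrected >= 0 else -1.0
        decisions.append(decision)
        previous = decision
    return decisions


tx = [1.0, -1.0, -1.0, 1.0, -1.0, 1.0, 1.0]
rx = channel(tx)
plain = slicer(rx)
equalized = dfe_decide(rx, post_cursor=0.6)
```

With this deliberately harsh toy channel, the plain slicer misreads several bits at transitions, while the feedback version recovers the transmitted sequence. Because the correction is built from decided bits rather than from the amplified input, noise and crosstalk riding on the input are not boosted.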

Both forms of receive equalization described above seek to amplify the high-frequency components of the desired signal. An advantage of DFE is that it does not amplify any crosstalk that may be associated with the signal. This technique can therefore be useful for increasing the speed of legacy backplanes, where extensive crosstalk may exist.

Solving the Signal Integrity Challenge

Virtex-4 RocketIO transceivers bring blazing speed, and the ability to use it.

All of these signal integrity features are fully programmable; they can be used independently or together, and each has multiple settings to equalize any channel. To fully take advantage of these hardware-based features, Xilinx also provides software-based reference designs that use bit error rate tests (BERT) to find the optimal settings for each unique application.
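A BERT-driven search of this kind can be pictured as a sweep over setting combinations that keeps the one with the lowest measured bit error rate. The sketch below is an assumption about the general approach, not the Xilinx reference design: `measure_ber` is a hypothetical callback standing in for a real hardware measurement, and the setting ranges are invented.

```python
from itertools import product


def find_best_settings(measure_ber, pre_emphasis_levels, equalizer_settings):
    """Try every combination and return (ber, pre_emphasis, equalizer) with the lowest BER."""
    best = None
    for pre, eq in product(pre_emphasis_levels, equalizer_settings):
        ber = measure_ber(pre, eq)  # hypothetical: one BERT pass at these settings
        if best is None or ber < best[0]:
            best = (ber, pre, eq)
    return best


def fake_ber(pre, eq):
    """Made-up error surface with its minimum at pre=2, eq=1 (stand-in for hardware)."""
    return 1e-12 + 1e-9 * ((pre - 2) ** 2 + (eq - 1) ** 2)


best_ber, best_pre, best_eq = find_best_settings(fake_ber, range(4), range(3))
```

An exhaustive sweep is practical only when the number of combinations is small; a real tool might search more selectively.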

Integrated Receive Side AC-Coupling Capacitors

Many applications require AC-coupling capacitors to ensure compatibility between different Tx and Rx blocks. These capacitors require their own vias; at high speeds, vias present yet another discontinuity to impair signal quality.

The Virtex-4 RocketIO transceivers integrate the AC-coupling capacitors on chip. This not only reduces external component count and design effort, but more importantly improves signal integrity by removing the need for extra vias in the board. These integrated AC-coupling capacitors can be optionally bypassed.

Conclusion

Signal integrity is an engineering challenge that accompanies the move to high-speed serial signaling. Once the system design has been optimized to minimize the physical effects of connectors, board materials, traces, vias, coupling capacitors, and cables, the remaining losses and channel effects need to be addressed by advanced silicon features.

Virtex-4 RocketIO transceivers are the industry’s fastest integrated transceivers. Along with these leading-edge speeds, the RocketIO transceivers deliver multiple features designed to simultaneously address the signal integrity challenge that comes with them.

Xilinx has detailed information about high-speed design challenges, and the solutions available to solve them, at www.xilinx.com/signalintegrity. Instructional DVDs that describe various aspects of the signal integrity challenge can be purchased from the Xilinx online store by visiting www.xilinx.com/store/.


[Figure 1 – Frequency-dependent loss. Plot of amplitude (1.0 down to 0, with -3 dB, -6 dB, -12 dB, and -20 dB levels marked) versus frequency (1 MHz to 10 GHz), with curves for dielectric loss, conductor loss (skin effect), and total loss.]

[Figure 2 – A transmitted bit (left) and the result of inter-symbol interference (right). A transmitted pulse passes through an example backplane; the received pulse is attenuated and dispersed over time. Around the cursor, the pre-cursor causes ISI (secondary effect) and the post-cursor causes ISI (primary effect).]

[Figure 3 – Virtex-4 RocketIO receive equalization transfer functions. Plot of amplification (dB, 0.0 to 16) versus frequency (Hz, 10 M to 10 G).]


by David Gamba
Senior Manager, Strategic Solutions Marketing
Xilinx, [email protected]

Wireless infrastructure revenue continues to experience phenomenal growth, increasing from approximately $27 billion in 2003 to an estimated $35 billion in 2004. Industry analysts are predicting that 2004 will be the peak revenue year, as forecasts show the revenue figure dropping back to $27 billion in 2005, eventually settling into the $10-$15 billion range by the end of the decade. This revenue decline is driven both by lower prices and by a drop in base station deployments, from nearly 500,000 stations in 2004 to fewer than 200,000 in 2010.

As the industry transitions from a high-growth phase to a more mature state, cost pressures will increasingly mount in all facets of the infrastructure, including the wireless base station. Next-generation base station deployments must conquer the challenge of continually reducing cost (as measured by cost per channel) while adding functionality to support new services, protocols, and changing subscriber usage patterns.

Using FPGAs in Wireless Base Station Designs

Wireless base station design trends benefit from Virtex-4 device features.


To begin addressing this challenge, wireless base station designs are shifting from ASIC technology to more readily available off-the-shelf components such as FPGAs. This shift is driven both by declining annual base station unit volumes and by FPGA technology improvements that increase processing power and enable a much lower cost per channel.

The migration to FPGAs is not just an attempt to reduce costs and create a common platform to achieve commoditization; it is also being driven by time-to-market pressures, along with the need to make in-field upgrades of base station deployments. This shift away from ASICs has enabled significant new design opportunities for Xilinx® Virtex-4™ devices to fill the void.

Wireless Base Station Module Building Blocks

Inside a wireless base station are fairly distinct module blocks performing different functions, such as radio, baseband processing, transport network interfacing, and control (Figure 1). Traditional base station designs used ASICs, along with DSPs and other discrete components, to implement these various architectural features and functions.

[Figure 1 – Wireless base station module block diagram. Blocks shown: amplifiers (antenna, multichannel power amp, low noise amp); TX/RX (analog RF TX and RX, ADC, DAC, digital up conversion, digital down conversion, pre-distortion and digital filtering, digital filtering and antenna diversity); baseband processing (baseband interface bus, symbol encoding, symbol decoding, modulation and spreading, chip-rate demodulation and despreading, symbol detection and combining, channel estimation); network interface (backplane, circuit switched network control, packet switched network control, BTS to RNC interface to E1, T1, Frame Relay, or IP network such as Gigabit Ethernet); main processor (central processor, control interface); timing and clock generation; and power supply (AC/DC power).]

This design approach is rapidly giving way to more cost-effective and flexible designs that use FPGAs. With lower costs and increased flexibility, product delivery is accelerated and inventory control is much more manageable, avoiding some of the multi-million dollar inventory obsolescence issues that base station manufacturers have faced with ASIC solutions fabricated to support the 3G launch.

Standardizing the Wireless Base Station

Another significant step taken by the wireless industry is the launch of industry organizations focused on standardizing the non-differentiated features inside a base station. The most notable development for Xilinx is the migration to a standardized high-speed serial interconnect solution between the different base station module blocks, such as the Open Base Station Architecture Initiative (OBSAI) Reference Point 3 (RP3) and Common Public Radio Interface (CPRI) interconnects for baseband and radio module connectivity.

Many leading base station manufacturers are members of these organizations and are rapidly preparing to adopt one of these two standard interconnect solutions in their upcoming design implementations. Xilinx is fully prepared to support these standards, and has both OBSAI and CPRI IP solutions and reference designs available for implementing in Virtex-II Pro™, Virtex-II Pro X, and Virtex-4 FX FPGA devices, using the integrated RocketIO™ multi-gigabit transceivers (MGTs) in association with the logic building blocks.

Extending Current Design Lifecycles

Standardization is the first step towards the commoditization of base station design and will eventually lead to a phasing out of ASICs from wireless base stations. In the interim, companies are inserting discrete devices next to their current ASICs to support new functionality that cannot be added in a timely or cost-effective manner to the current design.

For instance, the Third Generation Partnership Project (3GPP), which is a collaboration agreement between several telecommunications bodies, is actively creating additional standards for the wireless industry. 3GPP has added a high-speed downlink packet access (HSDPA) feature as a new Universal Mobile Telecommunications System (UMTS) requirement in its latest baseband processing specification, Release 5, for Wideband Code Division Multiple Access (W-CDMA).

ASICs in current base stations do not support this new variant for UMTS. This creates a hole in the service offerings for UMTS, which forecasters are predicting will represent approximately 80% of the wireless traffic in the next few years. This deficiency must be addressed before future field deployments, and it can be, without exceeding the system power budget, by using a Virtex-4 LX device next to the ASIC and implementing HSDPA using the available Xilinx HSDPA IP offering.

Next-Generation Base Station Designs

But adding external devices to patch design holes created by the limitations of existing ASIC designs is purely a stopgap solution. Future base station designs must be able to quickly adapt to changes in subscriber traffic patterns, as well as support the upcoming convergence of new services and emerging cellular technologies such as W-CDMA, TD-SCDMA, EDGE, 1xEV-DO, and WiMAX.

As shown in Figure 2, the number of cellular technologies is expected to continue to proliferate, leading base stations down the path of having to support many more technologies. Current issues such as multi-user detection and antenna selection will be augmented by new technical challenges, such as channel provisioning and base station tuning, that will need to be resolved appropriately to reduce a service provider’s customer turnover. The fundamental expectation to receive the same high-quality wireless service wherever a customer roams must be completely addressed.

These customer expectations would benefit from substantial flexibility in the base station. Fortunately, many of the baseband processing functions and radio module functions are well suited for implementation in Virtex-4 devices, taking advantage of the integrated XtremeDSP™ slices in the product architecture.

For instance, quite a few baseband processing tasks, such as call initiation and set-up and multi-path signal detection and monitoring, are heavily based on mathematical algorithms. You can very efficiently implement these algorithms by using the integrated multiplier capabilities available in Virtex-4 devices, along with the readily available intellectual property components such as the Random Access Channel (RACH), Searcher, and 3G Turbo Convolutional Codecs (3GTCC) that Xilinx has implemented as reference designs to demonstrate these capabilities.

The integrated DSP capability in the Virtex-4 SX device enables a very low power implementation of these functions. Radio functions can be expanded by using a Virtex-4 SX device to enable more channel support.

Several enabling pieces of intellectual property targeted at radio functions, such as digital pre-distortion (DPD), crest factor reduction (CFR), and digital up/down conversion (DUC/DDC), are supported by the Virtex-4 SX device. Not only does this help increase the number of channels supported in a base station, but it also helps reduce the cost per channel. Table 1 gives an overview of the different capabilities offered by Xilinx baseband and radio module IP offerings.
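Of the radio functions named here, digital down conversion is easy to illustrate, and it shows why multiplier-rich hardware suits these tasks: every input sample costs a complex multiply. The sketch below is a generic textbook form, not the Xilinx DDC IP; the function name, the boxcar (block average) filter, and the sample values are assumptions made for the example.

```python
import cmath
import math


def digital_down_convert(samples, carrier_hz, sample_rate_hz, decimation):
    """Mix real samples to baseband, then average and decimate (assumed boxcar filter)."""
    # One complex multiply per sample shifts the carrier down to 0 Hz.
    mixed = [
        s * cmath.exp(-2j * math.pi * carrier_hz * n / sample_rate_hz)
        for n, s in enumerate(samples)
    ]
    # Averaging each block acts as a crude low-pass filter; one output per block.
    baseband = []
    for start in range(0, len(mixed) - decimation + 1, decimation):
        block = mixed[start:start + decimation]
        baseband.append(sum(block) / decimation)
    return baseband


# A unit-amplitude 1 Hz cosine sampled at 8 Hz, reduced by a factor of 4.
tone = [math.cos(2 * math.pi * 1.0 * n / 8.0) for n in range(16)]
baseband = digital_down_convert(tone, carrier_hz=1.0, sample_rate_hz=8.0, decimation=4)
```

For this input, each baseband output settles at half the cosine amplitude, the other half having been moved to twice the carrier frequency and averaged away. A production design would use a properly designed decimation filter in place of the block average.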

System Generator for DSP Development Tool

Xilinx complements its Virtex-4 product offerings with the System Generator for DSP tool. This is a complete integrated DSP design environment that simplifies the development, debug, and verification of high-performance DSP designs targeting wireless base stations. This tool also helps designers interface with complementary general-purpose and DSP processors used in wireless base station designs.

System Generator for DSP provides high-level abstractions that are automatically compiled into Virtex-4 devices at the push of a button, with no loss in performance over designs implemented in lower-level languages such as VHDL. System Generator is part of the XtremeDSP solution, which combines state-of-the-art FPGAs, design tools, intellectual property cores, and design and education services.

Conclusion

To learn more about the key markets and end applications of Xilinx wireless solutions, visit www.xilinx.com/esp/, or e-mail [email protected]. For more details about Virtex-4 FPGAs, visit www.xilinx.com/virtex4/. And for more details on System Generator for DSP or other pieces of the Xilinx DSP solution, visit www.xilinx.com/dsp/.


[Figure 2 – Mobile technology roadmap. Generations shown: 2G, 2.5G, 3G, 3.5G, and 4G. Technologies shown: GSM, TDMA, IS95a/b, 1xRTT, 1xEV-DO, 1xEV-DV, 3xRTT, W-CDMA, TD-SCDMA, GPRS, EDGE, HSDPA, wireless LANs (IEEE 802.11, IEEE 802.16), and 4G. Status legend: Current, Being Deployed, Development, Future.]

Table 1 – Xilinx baseband and radio IP offerings

Xilinx Baseband Intellectual Property Offerings

IP Offering    Application
HSDPA          Increases downlink data transmission rate to a peak of 14.4 Mbps
RACH           Receiver path preamble detection (specified by W-CDMA)
Searcher       Multi-path delay estimate for each subscriber
3G TCC         Forward error correction

Xilinx Radio Intellectual Property Offerings

IP Offering    Application
DPD            Signal conditioning to enable use of lower cost RF power amplifiers
CFR            Signal amplitude conditioning to enable increased RF power amplifier efficiency
DUC            Baseband signal modulation for digital-to-analog converter input
DDC            Receiver signal modulation for analog-to-digital converter input


A new benchmark in delivery!

Over 100 Million SERVED.

The supremely popular, low-cost Spartan™ FPGA product line from Xilinx recently shipped its 100 millionth device. And we are in high-volume production of our 90 nm Spartan-3 series, already delivered to customers worldwide. Addressing the demands of consumer-oriented, cost-sensitive applications, Spartan-3 FPGAs offer full-feature capability with the lowest price points ever.

Get started today with the world’s lowest-cost FPGA

The Spartan-3 Starter Kit gives you instant access to the FPGA’s complete platform capabilities, bringing high-volume designs to reality faster. The kit includes a total starter board, JTAG cable, handbook and resource CD, plus free ISE software, all for just US $99. Contact your local distributor, or order your Spartan-3 Starter Kit today at www.xilinx.com/spartan3.

Now there’s a hundred million reasons to get started today!

MAKE IT YOUR ASIC

The Programmable Logic CompanySM

Pb-free devices available now

©2004 Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124. Europe +44-870-7350-600; Japan +81-3-5321-7711; Asia Pacific +852-2-424-5200; Xilinx is a registered trademark, Spartan is a trademark, and The Programmable Logic Company is a service mark of Xilinx, Inc.

by Delfin Rodillas
Strategic Solutions Manager
Xilinx, [email protected]

With the continued proliferation of cable and satellite television and the rapid growth of the Internet, video transmission bandwidth has experienced phenomenal growth. With video streaming now being introduced into mobile handsets, this growth rate is not showing any signs of slowing down.

The technology advances of Xilinx® FPGAs have kept pace with the increasing transmission requirements and have solved many of the critical design issues in these systems. The Virtex-4™ product family incorporates additional enhancements (high-speed DSP, ultra low power, flexible integrated memory, and high-speed serial I/O) that enable these devices to meet the high bandwidth requirements of video applications.

With these features, you can use Virtex-4 devices in a variety of products, such as cable modem termination systems, digital video broadcast systems, flat-panel displays, master control switches, MPEG encoders, non-linear video editors, broadcast routers, image statistical multiplexers, and video servers.

Implementing a Cable Modem Termination System with Virtex-4 FPGAs

Integrated features make the Virtex-4 device an ideal choice.


Cable Modem Termination System

One common application where you can use Virtex-4 devices is in a cable modem termination system (CMTS), shown in Figure 1. The CMTS is used in cable headends; it is a switching system that works in conjunction with Internet service providers to route data between cable modems and the Internet.

In a CMTS, the transmitted data is multiplexed onto a cable channel along with broadcast video transmissions. Bandwidth is shared by all active subscribers (typically 500 to 2,000) in the cable network segment. Downstream transmission rates run at 40 Mbps using quadrature amplitude modulation (QAM), while upstream rates can be as high as 10 Mbps using QAM or quadrature phase shift keying (QPSK). The speed of the upstream link depends on the service level agreement (SLA) that the subscriber has signed with their cable company.
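A quick calculation shows why shared bandwidth makes traffic management matter. The sketch below simply divides the 40 Mbps downstream rate quoted above evenly among the quoted subscriber counts; the even split and the assumption that every subscriber is active at once are simplifications for illustration, and the function name is hypothetical.

```python
def fair_share_kbps(link_mbps, active_subscribers):
    """Even split of a shared link, in kbps (assumes all subscribers active at once)."""
    return link_mbps * 1000 / active_subscribers


share_small_segment = fair_share_kbps(40, 500)    # 500 subscribers on the segment
share_large_segment = fair_share_kbps(40, 2000)   # 2,000 subscribers on the segment
```

Under these assumptions each subscriber would see between 20 and 80 kbps. Real segments rarely have every subscriber active at the same moment, so the schedulers that decide who gets the link, and when, determine the service each subscriber actually receives.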

CMTS Design Challenges

Cable operators can offer a variety of different services by using quality of service (QoS) provisioning to support different subscriber packages, helping to maximize their revenue stream. For QoS in the CMTS, the design needs to support packet classification, packet prioritization, flow control, congestion control, queuing, scheduling, and QoS statistical measurements. All of these functions need to be supported without a reduction in user bandwidth. Given this, QoS processing is generally done in hardware, because software implementations lack the processing power to make real-time routing decisions and can result in delays and excessive queuing.

Maintaining efficient bandwidth utilization while supporting SLAs and multiple traffic types makes traffic management very challenging. Throw in varying protocols, memory management, different sized payloads, and a variety of different system interfaces, and it is easy to see how these designs require high-performance, cost-effective flexibility that ASSPs and ASICs cannot offer. These challenges open up opportunities for Virtex-4 devices that can provide flexible traffic management capability at the required performance levels.

[Figure 1 – Cable modem termination system block diagram. Blocks shown: cable transceiver and mixed signal front end to the HFC network; QAM modulator and QAM demodulators; packet processing, queuing/scheduling; traffic flow management; QoS measuring; memory interface to memory (SRAM, DRAM, flash memory); disk controller and hard disk drive; host CPU; switch fabric on the system backplane; and network transceiver to the MAN/WAN. A legend distinguishes Xilinx from non-Xilinx blocks.]

CMTS Queuing and Scheduling Requirements

QoS provisioning is basically a queuing and scheduling problem. Proper queuing and scheduling entails recognizing service classes along with managing buffer memories and port bandwidth. The design goal is to reduce the amount of congestion in order to offer the maximum amount of bandwidth and packet throughput by optimizing end-to-end delay and minimizing packet loss.

In addition, the implementation needs to support fair bandwidth distribution for each service class; furnish protection between the different class levels; provide fast, flexible access to bandwidth without impacting forwarding performance; and allow other service classes to use underutilized bandwidth.

To surmount these challenges, efficient queuing and scheduling techniques are required to optimize queue memory management, which controls the number of packets in a queue. This function controls service-class access to the packet memory buffer and determines which packets to drop because of congestion.

Multiple queue memory management techniques are in use today, including random early detection (RED), weighted random early detection (WRED), and leaky bucket. Per-flow queuing is commonly performed using one or a combination of the scheduling algorithms shown in Table 1.

Table 1 – Common queuing and scheduling algorithms

Queuing and Scheduling Algorithms
• First-In, First-Out
• Round Robin
• Weighted Round Robin
• Fair Queuing
• Weighted Fair Queuing
• Priority Queuing
• Shortest Remaining Time

Table 1 shows that there are many different queuing and scheduling algorithms. Given the dearth of standards activity in this area, many different algorithms will continue to exist for the foreseeable future. In addition, these algorithms need to handle variable sized packets, which are more complicated than fixed cells.

Virtex-4 devices offer a high-performance solution for these queuing and scheduling requirements, because the devices offer an extremely fast and flexible fabric for implementing designs without impacting forwarding performance. Scheduling decisions are typically performed every clock cycle and require heavily pipelined designs.

Virtex-4 devices also offer a register-rich architecture with ample routing, enabling efficient implementation of these decisions. The high-speed designs also require very wide internal buses, which are easily implemented in the Virtex-4 architecture by using the integrated DLLs and DCMs to help manage multiple clock domains.
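As one concrete instance of the queue memory management techniques named in this section, random early detection can be sketched as follows. This is a simplified form of the commonly published algorithm (a smoothed queue length and a linear drop ramp between two thresholds); the class name, parameter names, and values are assumptions, and nothing here describes a specific Xilinx implementation.

```python
class RedQueue:
    """Simplified RED: drop probability ramps up as the average queue length grows."""

    def __init__(self, min_threshold, max_threshold, max_probability, weight):
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.max_probability = max_probability
        self.weight = weight          # smoothing factor for the moving average
        self.average = 0.0

    def drop_probability(self, queue_length):
        """Update the smoothed queue length and return the drop probability for a new packet."""
        self.average = (1 - self.weight) * self.average + self.weight * queue_length
        if self.average < self.min_threshold:
            return 0.0                # queue is short: never drop
        if self.average >= self.max_threshold:
            return 1.0                # queue is too long: always drop
        ramp = (self.average - self.min_threshold) / (self.max_threshold - self.min_threshold)
        return self.max_probability * ramp


# weight=1.0 disables smoothing so the example values are easy to follow.
red = RedQueue(min_threshold=10, max_threshold=30, max_probability=0.1, weight=1.0)
p_short = red.drop_probability(5)
p_mid = red.drop_probability(20)
p_full = red.drop_probability(30)
```

Each arriving packet needs a multiply-accumulate for the average plus a multiply for the ramp, which hints at why such decisions are pipelined in hardware when they must be made every clock cycle. WRED would keep a separate threshold set per service class.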

Many of the queuing and scheduling buffer management schemes are math-intensive; these schemes must quickly calculate multi-variable equations such as packet transmit scheduling and customer service normalization schemes. For instance, the bandwidth calculation shown in Figure 2 is a multi-variable equation used to calculate the bandwidth (B1, B2) for each user for a given level of total bandwidth. These types of functions can take advantage of the integrated 500 MHz performance, low power, 18 x 18 multipliers, and 48-bit adder/subtractor integrated in the XtremeDSP™ slice.
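The Figure 2 calculation can be worked through in a few lines. The constraints in the figure (N3 P2 B2 = N2 P3 B3, N3 P1 B1 = N1 P3 B3, and B1 + B2 + B3 = B) imply that each class receives bandwidth in proportion to its flow count divided by its drop probability. The sketch below applies that reading; the function name and the example numbers are assumptions, and floating-point Python stands in for what would be fixed-point arithmetic in an FPGA.

```python
def allocate_bandwidth(total_bandwidth, drop_probabilities, flow_counts):
    """Split total bandwidth so that each share is proportional to N_x / P_x."""
    # The figure's normalized drop probability is P_x / N_x; shares go as its inverse.
    weights = [n / p for p, n in zip(drop_probabilities, flow_counts)]
    scale = total_bandwidth / sum(weights)
    return [w * scale for w in weights]


# Example values (assumed): 100 units of bandwidth, three classes of ten flows each.
P = [0.01, 0.02, 0.04]
N = [10, 10, 10]
B1, B2, B3 = allocate_bandwidth(100.0, P, N)
```

With these inputs, the class with the lowest drop probability receives the largest share, and the three shares add up to the total. The divisions and multiplications per class are the kind of work the article suggests mapping onto the XtremeDSP multipliers.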

CMTS Memory Requirements

Most networking applications are built around a load-store type of architecture, with packets being stored in linked lists in external memories. Because of the increasing queuing and scheduling performance requirements of the CMTS, high-speed DDR or QDR SRAM memories prevent memory access from becoming a bottleneck.

To properly interface to these memory devices, all Virtex-4 devices have the ChipSync™ feature in every device I/O. ChipSync lets designers easily align the DQS control signal with memory data in very small increments; this alignment can be easily monitored and altered as temperature and voltage changes alter the very delicate timing.

Converting the high-speed 300 MHz+ memory data to wider, slower, more manageable data is easily accomplished with the built-in ISERDES and OSERDES available in every I/O. Additionally, the Virtex-4 memory-rich architecture, capable of running at 500 MHz, provides much needed on-chip cache capability.
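The serial-to-parallel conversion described here trades clock rate for bus width. The sketch below models only that idea in software: the 1:4 ratio, the function name, and the choice to treat the first received bit as the most significant are assumptions for illustration, not a description of how ISERDES orders its outputs.

```python
def deserialize(bits, width=4):
    """Group a serial bit stream into parallel words (first bit received = MSB, assumed)."""
    words = []
    for start in range(0, len(bits) - width + 1, width):
        word = 0
        for bit in bits[start:start + width]:
            word = (word << 1) | bit
        words.append(word)
    return words


words = deserialize([1, 0, 1, 1, 0, 0, 1, 0], width=4)
```

Each output word is produced once every four input bits, so logic downstream of a 1:4 deserializer runs at a quarter of the incoming bit rate on a bus four times as wide.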

Virtex-4 devices support high-speed memory interfaces and, along with an embedded hierarchy of memory structures comprising distributed and block RAM, can easily facilitate implementation of high-performance queuing and scheduling algorithms. The Virtex-4 devices’ high memory-to-logic ratio helps reduce memory access latency by caching data on-chip, buffering data between two disparate clock domains, and using scratch-pad memory for storing coefficients.

The integrated distributed RAM is good for implementing small FIFOs, DSP coefficients, shallow/wide memories, and CAMs. The block RAM is good for larger FIFOs, packet buffers, video line buffers, cache tag memory, deep/wide memories, and CAMs. Xilinx also has many proven embedded-memory CAM and FIFO reference designs available to help implement these high-speed memory designs.

CMTS Video Transmission Standards

The ITU-T (International Telecommunications Union – Telecommunication Standardization Sector) has created a standard for the transmission of audio, video, and data services over cable networks. The specification for this standard is ITU-T J.83 Digital Multi-Program Systems for Television, Sound, and Data Services for Cable Distribution.

This standard is supported in Virtex-4 devices using the Xilinx J.83 Cable Modulator LogiCORE™ IP to provide either single- or quad-channel support. (See the related article from the Winter 2004 issue of the Xcell Journal, “Using System Generator for DSP to Create the J.83 Cable Modulator.”)

Conclusion

Given the high bandwidth requirements of a CMTS along with the associated queuing and scheduling complexities to provide the appropriate QoS requirements, Virtex-4 devices offer an optimal solution for these designs. The embedded hierarchy of memory structures, along with integrated high-speed serial interfaces and programmable flexibility, make Virtex-4 devices a better choice than implementations using ASICs or ASSPs.

To learn more about Xilinx key markets and end applications, visit www.xilinx.com/esp/. For more details on Virtex-4 FPGAs, visit www.xilinx.com/virtex4/.


Problem: Given the total bandwidth (B), the drop probabilities (P1, P2, and P3), and the number of flows for each PHB class, calculate B1, B2, and B3.

Total bandwidth for the aggregate PHB AF1 group is: B

PHB AF13: [Drop Probability = P3, Bandwidth B3; Flows N3]

\[ N_3 P_2 B_2 = N_2 P_3 B_3, \qquad N_3 P_1 B_1 = N_1 P_3 B_3, \qquad B_1 + B_2 + B_3 = B \]

\[ B_i = \tilde{P}_i B, \qquad i = 1, 2, 3; \quad i \neq j \neq k \]

\[ \tilde{P}_i = \frac{\hat{P}_j \hat{P}_k}{\hat{P}_i \hat{P}_j + \hat{P}_j \hat{P}_k + \hat{P}_i \hat{P}_k}, \qquad \hat{P}_x = \frac{P_x}{N_x} \]

Figure 2 – Bandwidth calculation formula example



We Have What You’ve Been Looking For

The Memec Virtex-4 Development Kits are the ideal solution for designers needing a high-performance Virtex-4 platform with the flexibility to meet your system design challenges.

Flexible Design
Powerful Performance
Greater Programmability
Advanced Technology

Your Search is Over.

Visit www.memec.com/xilinx-v4

Copyright 2004 Memec, LLC. All rights reserved. Logos are owned by their proprietors and used by Memec with permission. All company and product names may be trademarks of their respective companies.

by Amit Dhir
Senior Manager, Strategic Solutions, Wired Networks and Telecom Markets
Xilinx, [email protected]

Although the dot-com bubble may have burst, the Internet has continued its multifold growth, thus placing a strain on telecommunication networks. Both individuals and businesses are demanding more bandwidth to run new communications options, such as desktop video conferencing, IP telephony, remote storage, and mobile communications.

This is the driving force behind the need to transform the multiple, costly, and complex networks in use today into a smarter, multipurpose, global, cost-effective broadband network. This transformation will generate new sources of revenue for service providers, provide greater opportunities and productivity for enterprises, and meet the needs of consumers who value multimedia, the freedom of mobility, and personalized and secure private network services. The boundaries between public and private, wired and wireless, and voice and data networks are vanishing.

Developing Next-Generation Telecommunication Networks


Virtex-4 FPGAs provide the density, features, and performance at low price points to enable the communication revolution.


The key elements of a more intelligent, high-speed, multi-purpose global network include broadband and optical technologies, voice over packet, wireless data, multimedia services and applications, and security, all underpinned by a packet network core. Typical telecom- and datacom-wired equipment can be segmented into line cards, switch cards, control cards, and a backplane. Network convergence requires equipment vendors to support multiple technologies, including SONET/SDH, PDH, Data over SONET (GFP, VCAT, and LCAS), Fibre Channel, Ethernet, DVI, DSL, PON, and MPLS, depending on the system’s location in the access networks, metropolitan area networks, enterprises, and wireless networks.

Because data is transmitted in IP packets, packet processing has become a sophisticated architectural decision depending on the end system. This also influences the switch architecture and backplane topology. Also, with time to market and cost pressures, equipment providers continue to focus on technology and innovation as the cornerstones for creating new revenue opportunities.

Enabling the Communications Revolution

Xilinx® FPGAs offer a high-performance fabric, integrated features, and powerful clock management, thus providing an ideal platform for communications equipment vendors to develop their solutions. Xilinx also provides case studies, IP, and reference designs to help customers with their designs in several key applications.

Telecom and Datacom Line Card Port Interfaces

Digital telecom infrastructure has mostly been based on PDH and SONET/SDH technologies in the metropolitan area and transport networks. The transport of data traffic (Ethernet, Fibre Channel, ESCON, and DVI) onto SONET/SDH networks is giving rise to technologies such as generic framing procedure and virtual concatenation. This flux is creating a need for programmable solutions that allow vendors to have a single SFP or XFP module support multiple technologies at given rates. With the Virtex-4™ FX family supporting Gigabit Ethernet (1 and 10 Gbps), Fibre Channel (1, 2, 4, 8, and 10 Gbps), and SONET (OC-12 and OC-48) on every RocketIO™ serial transceiver, you have extreme flexibility in the I/Os.

The FPGA, coupled with robust IP offerings from Xilinx and our partners for MACs and framers/mappers, presents a flexible solution that can be morphed depending on the service provider’s needs on a per-port basis. This also helps in the lifecycle cost management of the system, as fewer cards need to be maintained and can be programmed with the relevant port interfaces required upon shipping.

Serial Backplanes and Switching

With exploding data rates and source synchronous I/Os unable to keep up with the pace at which packet communication occurs between the line cards, vendors are universally looking at serial technologies to solve the bandwidth problem. RocketIO transceivers, which support a wide performance range of 622 Mbps to 11.1 Gbps, can also be used to drive several tens of inches on FR-4 and more exotic materials, at different rates. With the Virtex-II Pro™ and Virtex-II Pro-X families and the integrated RocketIO transceivers, Xilinx has already enabled several customers to upgrade their backplane to faster rates.

With the Virtex-4 family’s third-generation multi-gigabit transceivers and enhanced features such as AC coupling, programmable pre-emphasis, and receive (linear and decision feedback) equalization, you can ensure signal integrity in a wide variety of applications and give new life to old systems by upgrading legacy backplanes.

Industry standards such as Serial RapidIO™, Gigabit Ethernet, and PCI Express (including out-of-band signaling and spread spectrum clocking) are all supported. Virtex-4 FX FPGAs enable bridging between just about any serial or parallel system interface.

To enable the creation of mesh designs, Xilinx offers the mesh fabric reference design for complete flexible connectivity across a serial backplane based on the standard of your choice. Xilinx also provides signal integrity tools and resources such as the ATCA development board to ease the process of designing SerDes solutions into your next-generation backplane.

Packet Processing

Although several network processor vendors have attempted to solve packet processing (classification, policing, queuing, and scheduling) glitches, achieving performance and power goals continues to be challenging. Virtex-4 FPGAs solve network process-


IntegratedOptics

PMD

Memory

Traffic, Queue,

Policy Mng.

NetworkProcessor,Look-up,

Classification

Framer/Mapper/

MAC

PHYLayer

PCI, PCI-XPCI Express

ASSerial RapidIO

InfinibandProprietary

SPI-3, PL3SPI-4.1/4.2

SPI-5UTOPIA

SystemInterfaces

SFI-4XSBITFI-5

CSIXPCI, PCI-X

PCI ExpressHyperTransport

RapidIOProprietary

QDR, QDR IICAM I/F

RLDRAM, FCRAMDDRRAMNoBL/ZBT

SerDes+

SwitchFabric

Figure 1 - Typical line card


ing challenges with features such as systemand memory interfaces, clock managers,block RAM, DSP slices, PowerPC™, andhigh-speed programmable logic. Xilinx alsooffers solutions such as the queue managerand mesh fabric reference designs to helpwith traffic management needs.

Simplifying System Design Challenges

The fundamentals of unparalleled flexibility and high performance are further extended in the Virtex-4 family. To help simplify your system design challenges, Xilinx also offers:

• Integration. The integration of processors, tri-mode Ethernet MACs, DSP slices, SerDes, memory, and other features in the FPGA helps reduce your bill of materials and saves FPGA resources. The lower component count streamlines logistics, simplifies PCB design and manufacturing, and improves reliability by reducing the number of solder joints.

• SelectIO™ technology and connectivity IP. Virtex-4 FPGAs make it easy to build robust high-speed memory and networking interfaces. All Virtex-4 platforms include configurable, high-performance SelectIO technology to support a wide variety of I/O standards.

Virtex-4 FPGAs provide as many as 960 user I/Os, supporting more than 20 single-ended and differential electrical I/O standards to enable several parallel system interface standards on one device. New ChipSync™ technology built into every I/O block makes source-synchronous interfacing to the latest high-speed components easy. Plus, powered with XCITE technology, each I/O block delivers on-chip active I/O termination, eliminating external termination resistors to increase signal integrity, save board space, and reduce system cost. Xilinx also provides a robust offering of IP (PCI, SPI-3, SPI-4.2, RapidIO) and reference designs (DDR2, DDR, QDR II, RLDRAM II, FCRAM II) for system and memory connectivity.

• Embedded processing. With the embedded PowerPC and the soft MicroBlaze™ and PicoBlaze™ processors, Xilinx offers a range of processing solutions to match the requirements of different tasks, ranging from simple control functions to advanced algorithms and high-speed calculations. Also, in telecom cards the processors assist with simple functions such as alarm handling and performance monitoring.

• Low-cost designs. Xilinx manufactures Virtex-4 FPGAs using 90 nm advanced process technology on 300 mm wafers. This allows us to produce approximately five times as many die per wafer, compared to building an equivalent chip in 130 nm process on 200 mm wafers. This lowers the cost per die significantly.
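The "approximately five times" figure can be sanity-checked with a rough scaling estimate. The sketch below is only an illustration, not Xilinx's calculation: it assumes die area shrinks with the square of the process node and ignores edge loss, scribe lanes, and yield. The function name `relative_die_count` is hypothetical.

```python
def relative_die_count(wafer_mm_new, node_nm_new, wafer_mm_old, node_nm_old):
    """Estimate how many more die fit on the new wafer than on the old one.

    Assumptions (not from the article): usable wafer area scales with the
    square of the diameter, and die area scales with the square of the node.
    """
    wafer_area_ratio = (wafer_mm_new / wafer_mm_old) ** 2
    die_area_ratio = (node_nm_new / node_nm_old) ** 2  # new die area / old die area
    return wafer_area_ratio / die_area_ratio


# 90 nm on 300 mm wafers versus 130 nm on 200 mm wafers
ratio = relative_die_count(300, 90, 200, 130)
print(f"{ratio:.2f}x as many die per wafer")  # prints "4.69x as many die per wafer"
```

Under these simplified assumptions the larger wafer contributes a factor of 2.25 and the smaller geometry a factor of about 2.09, which together land close to the quoted factor of five.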

Additionally, the EasyPath™ program further lowers system cost for customers who are ready to take their finished design to volume production. Xilinx creates customized test programs for EasyPath customers that exercise only the device resources used in the specific design. This approach shortens test time and increases yield to reduce FPGA unit prices by as much as 80%.

Conclusion

To learn more about the key markets and end applications of Xilinx solutions, visit www.xilinx.com/esp/ or e-mail [email protected]. For more details on Virtex-4 FPGAs, visit www.xilinx.com/virtex4/.



by Ken Sienski
President
Red River
[email protected]

Established in 1996, Red River specializes in high-performance signal processing and data communication solutions for the embedded systems market, especially software defined radio applications.

Our main challenge in serving the software defined radio market is to have a hardware platform that meets the demands of multiple configurations. Some customers are looking for a complete, pre-built radio solution; others are looking to add custom features to a radio platform. These disparate requirements place great demands on us to find a common programmable silicon solution that meets both needs.

The Xilinx® Virtex-4™ FPGA family allows us to do exactly that – provide different customer solutions at the lowest cost. Advanced features such as FIFO logic, embedded PowerPC™, RocketIO™ transceivers, and Ethernet MAC, as well as advanced power and packaging technology, make Virtex-4 devices a perfect choice for us.

Model 351 (Pocket Change)

Our next-generation product, the Model 351, or “Pocket Change,” transforms any portable computer into a high-performance multi-channel software defined radio transceiver. The Pocket Change CardBus PC Card accepts two analog input signals through MMCX coaxial connectors on the outside edge of the card. The receiver input is AC-coupled to a 14-bit (80 MSPS) A/D converter. The transmitter output is supplied through a 14-bit (100 MSPS) D/A converter. Most of the digital logic is supplied using a Virtex-4 FPGA device.

When we began developing the Model 351, we investigated various offerings on the market and finally decided to use Virtex-4 FPGAs. The Virtex-4 FPGA family provides the flexibility and features that support both our needs and the requirements of our customers.

The Model 351 design comprises a Virtex-4 FPGA connected to an A/D converter, a D/A converter, and a dedicated PCI bus controller (for the CardBus interface to the host computer) (Figure 1). Although it is targeted at our traditional software defined radio customers, the Model 351 is also suitable for signal acquisition or generation, signal intelligence collection, transceiver modem algorithm prototyping, frequency hop signal generation, or portable signal recorder/playback applications.

Virtex-4 FPGAs for Software Defined Radio


Red River’s new PCMCIA Type II module can transform any notebook computer into a software defined radio using a Virtex-4 FPGA for performance-critical DSP functions.



Customization and Flexibility

Initially we considered using dedicated digital upconverter/downconverter chips to implement the Model 351 transceiver function. However, many of our customers prefer the flexibility of inserting custom functions into their designs. The customization requirement pushed us to use programmable technology.

By selecting a leading programmable logic architecture, we can address the customization needs of a broad set of customers. Xilinx ISE™ development software provides our customers a familiar design environment to embed custom DSP functions in the uncommitted logic of the Virtex-4 FPGA.

Another benefit from using Virtex-4 FPGAs is that we can offer multiple products using one common hardware platform. This has helped reduce hardware development time and simplify inventory management.

Power and Space Efficiency

One of the challenges in CardBus PC Card development is to select a device that meets the PCMCIA functional specification and the tight power restriction of 3.3W. We were impressed with the power efficiency of the Virtex-4 family, as it consumes half the power of comparable logic solutions.

Virtex-4 FPGAs give us significant features and performance while still meeting the tight power budget of our design. In addition, PCMCIA imposes severe height restrictions in order to fit into the Type II module form factor. The Virtex-4 FF668 package offering is one of the few FPGA packages that meet the height requirements.

Advanced Features and Performance

One key requirement for a software defined radio application is high-performance DSP capability. The performance requirement is driven by the need to support multiple signal channels in real time.

Virtex-4 FPGAs are capable of performing multi-channel digital upconversion and downconversion across the entire Model 351 analog bandwidth. The Virtex-4 device can also perform Fast Fourier Transforms (FFTs) for spectral analysis of incoming signal data.

The Virtex-4 FPGA provides the “heavy lifting” to process digital information between the host computer and the A/D or D/A converter. The signal processing power comes directly from the SX platform. Virtex-4 devices can achieve high DSP performance by taking advantage of massive parallelism within each FPGA. For math-intensive algorithms (like DUC/DDC applications in a software defined radio), the high number of DSP slices – multiply/add/accumulate engines – that can run at up to 500 MHz provides the kind of performance previously available only in fixed ASIC technology.

Our designs also make extensive use of the internal block memories in the FPGA to provide multi-queue FIFO capabilities. The FIFOs are used to buffer data between the A/D or D/A converters and the local bus for DMA operations, providing performance-intensive processing without involving the host CPU in memory transfers. This gives our products the ability to flexibly handle digital radio data without completely consuming the CPU performance of the host computer. With the highest-performance internal block RAM and unique integrated FIFO logic, Virtex-4 FPGAs give us the FIFO quantity and performance that we need to keep up with the bandwidth of the analog components and host interface.

Three Platforms Satisfy Multiple Requirements

The three Virtex-4 platforms (LX, SX, and FX) give us unique capabilities for several upcoming products. For customers wanting to add custom logic functionality, we use the LX platform. LX offers the choice of many different gate densities within the same package footprint, allowing us to use the same base design to support many different customer needs.

We have some designs that necessitate tremendous additional DSP capability for math-intensive processing, including signal modulation and demodulation. For these applications, we see the SX platform as a natural fit. SX devices give us by far the largest amount of DSP performance.

For some of our other designs, we are implementing the advanced system-level block functionality of the FX platform – PowerPC running VxWorks, RocketIO transceivers for optical and PCI Express interfacing, and gigabit Ethernet MAC cores. Because Virtex-4 devices give us three platforms to choose from, we can offer different capabilities across our product line.

Conclusion

Software defined radio products must address a broad application space, which presents a challenge when selecting component features. The three Virtex-4 platforms give us the feature choice and performance that we require to field a family of solutions for both fixed and mobile installations.

The upcoming Model 351 demonstrates cutting-edge capabilities in an extremely small, power-efficient module that operates in a standard notebook computer. Visit www.red-river.com for more information about the Model 351 and other Red River products.


Figure 1 – Model 351 block diagram (blocks: analog input and output with AC couplers, A/D and D/A converters, Xilinx Virtex-4 FPGA containing the A/D and D/A interface, user-configurable logic, and local bus interface, CardBus interface to the host processor carrying digital data/control, clock select with external clock and oscillator, and configuration flash)
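To make the digital downconversion mentioned in this article more concrete, here is a minimal software sketch of one DDC channel: mix with a local oscillator, low-pass filter, and decimate. It is not Red River's design. The 80 MSPS sample rate matches the A/D converter described above, but the 10 MHz channel center, the 16-tap boxcar filter, the decimation factor, and the function name `downconvert` are all illustrative assumptions. In an FPGA, the inner multiply-accumulate loop is the kind of work that maps onto DSP slices.

```python
import math


def downconvert(samples, fs_hz, f_center_hz, taps, decimation):
    """Mix a real sample stream to baseband, low-pass filter it, and decimate."""
    # 1. Mix with a complex local oscillator (an NCO in hardware).
    mixed = []
    for n, s in enumerate(samples):
        phase = 2 * math.pi * f_center_hz * n / fs_hz
        mixed.append(s * complex(math.cos(phase), -math.sin(phase)))

    # 2. FIR low-pass filter, evaluated only at the decimated output instants.
    out = []
    for n in range(len(taps) - 1, len(mixed), decimation):
        acc = 0j
        for k, h in enumerate(taps):
            acc += h * mixed[n - k]  # one multiply-accumulate per tap
        out.append(acc)
    return out


fs = 80e6                    # A/D sample rate from the article
f_c = 10e6                   # hypothetical channel center frequency
tone = [math.cos(2 * math.pi * f_c * n / fs) for n in range(256)]
taps = [1 / 16] * 16         # simple boxcar low-pass, for illustration only
baseband = downconvert(tone, fs, f_c, taps, decimation=8)
print(len(baseband), abs(baseband[0]))
```

A tone sitting exactly at the channel center comes out as a constant of magnitude 0.5 at baseband, because the boxcar averages away the image at twice the center frequency. A real design would use a properly designed filter and run many such channels in parallel.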


by Fabrice Mommens
Project Manager, Defense & Security Lab
BarcoView Command & Control
[email protected]

Using in-depth market knowledge, Barco designs and develops solutions for large-screen visualization, display solutions for life-critical applications, and systems for visual inspection. Barco is currently active in the traffic, surveillance, broadcasting, presentation, simulation and virtual reality, edutainment, events, media, digital cinema, air traffic control, defense and security, medical imaging, avionics, and textile industries.

My particular division at Barco, BarcoView Command & Control in Belgium, has been a Xilinx® customer for just over two years. Our division’s choice to standardize on Virtex™ products was based on the availability of the embedded PowerPC™ processor, first introduced by Xilinx in their Virtex-II Pro™ product family.

We like to design with FPGAs in our systems because they can be reprogrammed throughout the life of the product. This critical feature allows us to add features from one generation to the next without having to redesign the whole system.

BarcoView Command & Control is working on a rugged family of LCD monitors. These products are designed for rough environments where commercial display products would not survive. In these designs, FPGAs are mainly tasked to perform video and image processing.

The system is currently designed around a Virtex-II Pro device, in which the PowerPC processor, running a real-time embedded operating system, controls the complete display system. Looking at the new features of the Virtex-4™ FX family, we are planning to migrate these Virtex-II Pro designs that use the PowerPC processor to Virtex-4 FX devices in a future version of the project.

Besides the central control of the display system, we also use FPGAs in the data path for specific processing. The part of the design where we chose to implement the Virtex-4 FPGA is an optional feature of the displays, where it performs real-time image scaling on the video stream.

Virtex-4 FPGAs in Rugged LCD Monitors


Integrated features like ChipSync technology not only reduce cost but improve ease of use and design cycle time.


This scaler module can receive a video stream on its input at a very high rate (160 MHz x 24 bits = 3.84 Gbps), perform scaling on the stream, and send out the scaled stream at the same rate. With the amount of data being processed and because of the way the scaler algorithm works, we must store the incoming video stream into memory before processing it. Thus, we had to look at very fast external memories (DDR2).

Memory Interfaces Made Easy

When searching for the right product for our application, we looked at many alternatives. However, it rapidly became clear that Virtex-4 devices could best perform the required tasks.

The main reason for choosing Virtex-4 FPGAs was the availability of the ChipSync™ feature, with support for DDR-2 400 memories. Having support for DDR-2 400 gives us enough bandwidth to reduce the number of physical RAM chips needed, reduce the board real estate, and in the end reduce system cost.

Looking at the data flow, these video streams are digitized into pixels up to 24-bit RGB (it could be a narrower stream depending on the input source). The incoming stream is stored into an input memory buffer at a frequency reaching up to 160 MHz. The data from this input memory buffer is then fed to the scaler core, also on 24 bits, at a maximum frequency of 100 MHz.

After the core has processed the data, the video stream is written back into an output memory buffer at 100 MHz on 24 bits. The output memory buffer can then be read at a frequency reaching 160 MHz on 24 bits to further process the data. After all that processing and some more, the images are displayed on the LCD monitor.

As shown in Figure 1, which represents the Virtex-4 LX15 ecosystem of our design, the memory bandwidth requirements for the input and output buffers are identical. Focusing on the input memory stream, we can see that the bandwidth required is (160 MHz + 100 MHz) x 24 bits = 6,240 Mbps.

This is where the advantages of 400 Mbps DDR-2 are realized. Because of this memory speed, we can select a 16-bit-wide DDR-2 SDRAM running at 200 MHz and still have enough bandwidth to process the input memory buffer streams (the stream coming from the input source and the stream going to the scaler core).

A simple calculation shows that 200 MHz x 2 (double data rate) x 16 bits = 6,400 Mbps. This is higher than the 6,240 Mbps previously calculated for the input buffer. Of course, we need to take into account a small overhead for the memory controller (during transients), but the margin should be more than enough to guarantee reliable system operation. If for any reason the controller’s overhead becomes such that we cannot guarantee that the system would work properly, we can always lower the 100 MHz core frequency.

ChipSync technology allows us to easily reach 400 Mbps and intuitively design this interface. Without this feature, we would have needed a 32-bit interface to the external memory. Though running at half the clock rate, more physical SDRAM on the board would be required, as there is no such thing as a small SDRAM device. In addition to the larger number of unused memory locations, we would have required a larger package for the scaler device because of the increased number of pins, using more board real estate.

ChipSync technology also allows us to easily use DDR-2 interfaces, enabling us to choose the very latest in SDRAM technology. This helps to avoid obsolescence issues, a common problem in the memory industry.

Block RAM: Not Just Memory

Another critical point when choosing the right FPGA is the amount of block RAM available in the device. Having flexible, fast internal RAM is a critical factor for us because we use block RAM for two things: as video line memory and as FIFOs for the DDR-2 memory controller. Smaller, slower, or less flexible RAM blocks would have produced a more complex DDR-2 memory controller design, resulting in larger logic requirements and therefore a larger device.

In addition to speed, flexibility, and size, the integrated FIFO logic available on each block RAM allows us to save a substantial amount of logic and guarantees fast FIFO operation, simplifying the design of our whole system.

Conclusion

The logic savings obtained through the use of the integrated FIFO, ChipSync technology, and the use of smaller external memories result in a significant cost reduction. Additionally, the ease of use, implementation, and modification brought by the hard IP blocks make the Virtex-4 LX15 device perfect for this application.

After designing with the Virtex-4 LX FPGA, we are looking forward to evaluating the Virtex-4 FX platform to see how we can benefit from all the new features available with the integrated PowerPC processor.

For more information about Barco and our products, visit www.barco.com.


Figure 1 – Video scaler block diagram based on a Virtex-4 FPGA (a Virtex-4 LX15 in a 363-pin package receives 160 MHz @ 24 bits (max) from the input stage and sends 160 MHz @ 24 bits (max) to the mixing stage; it contains the proprietary scaling core with a 100 MHz core clock and two DDR2 memory controllers, one for the DDR2 400, 256 Mb @ 16 bits input buffer and one for the identical output buffer)
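The memory bandwidth budget worked through in this article can be reproduced in a few lines. The sketch below only restates the article's own numbers; the helper name `stream_mbps` is hypothetical, and the "margin" it reports is raw headroom before any memory controller overhead is subtracted.

```python
def stream_mbps(clock_mhz, width_bits, transfers_per_clock=1):
    """Raw data rate of a parallel stream in Mbps."""
    return clock_mhz * transfers_per_clock * width_bits


# Input buffer traffic: written by the input stage, read by the scaler core.
required = stream_mbps(160, 24) + stream_mbps(100, 24)

# One 16-bit DDR2-400 device: 200 MHz clock, two transfers per clock.
available = stream_mbps(200, 16, transfers_per_clock=2)

margin = available - required
print(required, available, margin, f"{margin / available:.1%}")
# prints "6240 6400 160 2.5%"
```

The raw headroom is 160 Mbps, or 2.5% of the peak DDR2 rate, which is why the article keeps the option of lowering the 100 MHz core frequency if controller overhead turns out to be larger than expected.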


by Lee Hansen
Sr. Product Marketing Manager
Xilinx, Inc.
[email protected]

The advanced silicon features introduced with Xilinx® Virtex-4™ FPGAs are readily available through ISE™ (Integrated Software Environment) 6.3i technology. This latest release of Xilinx design software comes ready to deliver maximum design performance, with new features and optional tools that will speed your Virtex-4 project to completion.

Advanced Timing Closure and Performance

ISE software lets you get the most out of Virtex-4 devices and your target project. Benchmark testing on a suite of real-world, customer-based designs demonstrates that Virtex-4 FPGAs, with ISE 6.3i design software, are as much as 43% faster than the nearest competitive FPGA. On average, that’s an extra speed grade advantage.

The performance-driven ISE technology – like our exclusive timing-driven map option – helps you achieve better design packing and better performance, particularly if your target device is already more than 90% utilized. Timing-driven map can yield 30% better overall design performance depending on design utilization.

This additional performance advantage gives you the potential to stay in a lower density target Virtex-4 device, even if utilization is pushing 90% or higher, when competing tools would have already forced the design into a larger, more expensive device.

ISE 6.3i Software – Unleash the Power of Virtex-4 FPGAs


New ISE technology delivers breakthrough performance with greater ease of use.

ENGINEERING SOLUTIONS

High-Density Design

ISE design software also includes a full spectrum of tools for larger density designs, including area and logic group floorplanning, incremental design for faster design recompile cycles, and modular design for team-based project approaches. High-density designers can also separately purchase the new PlanAhead™ hierarchical floorplanner, which wraps all of these methodologies into one separate advanced tool. Together, these tools augment the design flow of high-density projects with methodologies that speed through to project completion, as well as performance-locking strategies to help bring large designs under control.

Area Groups

Using either PACE (Pinout and Area Constraints Editor) or ISE Floorplanner, both included with all configurations of ISE design software, you can quickly floorplan areas of logic from your design onto your target Virtex-4 device. You can create area groups around hierarchical HDL boundaries, or let PACE create default area estimates for target logic, or draw logic areas by hand.

Visualizing the different areas of logic helps you partition out areas for design reuse or IP placement, or section off where the “tough” areas of the design will be concentrated. Most importantly, area planning can help accelerate timing closure by grouping critical logic and paths together and by minimizing the number of interface points between modules.

Modular Design

ISE design software also includes modular design, a capability that implements a “divide and conquer” strategy for large designs – and for the corporate environments that deploy teams of engineers to tackle them. A design team manager first plans the design project, using floorplanning to partition the overall larger project into smaller design “modules.” These modules can then be assigned to individual team members for completion independent of the other modules.

Completion is focused on only that particular module of the overall design, with all teams completing their work in parallel. Once a module is finished, its place and route results are locked while the project manager waits for the remaining modules to be completed.

Modular design delivers full planning control over the larger design, implementing a true bottom-up design approach that completes the larger project much faster.

Incremental Design

Incremental design, also included with ISE design software, combines the quick-and-easy facet of area groups with the performance-locking aspects of modular design to deliver faster runtimes during heavy design iteration cycles.

Using PACE, you can assign area groups along hierarchical HDL boundaries; the overall design is then completed as usual. Should an incremental change become necessary, incremental design guarantees that you only have to re-implement the logic area that needs to change. The remainder of the design stays locked and intact, drastically speeding up overall compile times.

Incremental design also lets you make full use of the verification phase by delivering much faster overall project compile times. You can tweak critical design areas or implement ECO design changes late in the cycle with minimal impact on the larger FPGA project.

PlanAhead

In June 2004, Xilinx announced the acquisition of the leading-edge PlanAhead hierarchical floorplanner, developed originally by Hier Design. The PlanAhead floorplanner is a separately purchased tool option to the ISE design flow that is ideal for Virtex-4 high-density designs.

The PlanAhead tool utilizes an ASIC-style floorplanning methodology using a block-based approach. It enables you to analyze, detect, and correct potential implementation problems earlier in the design cycle, leading to the following benefits:

• Quicker incremental design changes

• Faster place and route

• Greater consistency and predictability in place and route

• Fewer design iterations

• Improved design performance

• Tighter utilization control

• Reuse of intellectual property and teamwork

The majority of low-density FPGA designs are implemented flat, with no hierarchy. Standard PLD place and route algorithms use more compile time to complete a flat design. By breaking the designs into smaller pieces, or blocks, place and route doesn’t need to converge on the entire design timing each time an incremental design change occurs. Hierarchy lets you take maximum advantage of this to reduce place and route time.

You can also lock placement results for individual blocks that already meet timing so that subsequent place and route iterations do not change their performance, further stabilizing the overall design and making the overall results more consistently predictable. The PlanAhead tool wraps area groups, incremental design, and modular design into a single ASIC-strength floorplanner. Figure 1 shows Virtex-4 floorplanning using the PlanAhead hierarchical floorplanner.

Figure 1 – PlanAhead floorplanning with a Virtex-4 LX100 FPGA

Speed the Design Flow – ISE Architecture Wizards

The architecture wizards are a series of menus and dialog boxes built into all ISE configurations. These graphical menus let you quickly set advanced configuration parameters for FPGA silicon features. The wizards then write out editable VHDL or Verilog™ source code that is instantiated directly into your target project.

For example, the clocking wizard lets you easily set clock frequency, phase, multiplier factors, and delay for Virtex-4 devices and other Xilinx FPGAs using DCMs (digital clock managers). With the architecture wizards, you can rapidly set up and program advanced FPGA features, so even novice users can learn the most advanced Virtex-4 capabilities quickly.
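As a rough illustration of the kind of arithmetic behind the clocking wizard's multiplier settings, the sketch below searches for a multiply/divide pair that synthesizes a target frequency as f_out = f_in x M / D. This is not the wizard's algorithm, and the helper name `best_multiply_divide` is hypothetical. The default ranges (M from 2 to 32, D from 1 to 32) are an assumption about typical DCM frequency-synthesis limits and should be checked against the device data sheet.

```python
from fractions import Fraction


def best_multiply_divide(f_in_mhz, f_target_mhz,
                         m_range=range(2, 33), d_range=range(1, 33)):
    """Return (M, D, f_out_mhz) minimizing |f_in * M / D - f_target|.

    The search ranges are assumed limits, not values taken from the article.
    """
    best = None
    for m in m_range:
        for d in d_range:
            f_out = Fraction(f_in_mhz) * m / d
            err = abs(f_out - f_target_mhz)
            if best is None or err < best[0]:
                best = (err, m, d, f_out)
    _, m, d, f_out = best
    return m, d, float(f_out)


# Derive a 160 MHz clock from a hypothetical 100 MHz reference.
print(best_multiply_divide(100, 160))  # prints "(8, 5, 160.0)"
```

Because the search keeps the first exact match it finds, it returns the smallest multiplier that hits the target; a real tool would also check the input and output frequency limits of the clock manager.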

Also new in ISE 6.3i software are two Virtex-4-exclusive architecture wizards, the ChipSync™ and XtremeDSP™ slice wizards. The ChipSync wizard configures groups of I/O blocks into an interface for use in memory, networking, or other types of bus interface design. You can quickly define key parameters such as the width and I/O standard of the data, address, clocks/strobes, clock buffers, and data bus specifications. All information is then presented in a clear and concise table for review.

The XtremeDSP slice wizard, shown in Figure 2, provides easy control of the revolutionary Virtex-4 XtremeDSP slice technology. This new silicon capability lets you build high-performance DSP filters and custom pre- or post-co-processing DSP algorithms. The XtremeDSP slice wizard lets you specify accumulator, adder/subtractor, multiplier, or multiplier and adder/accumulator DSP modes. You can graphically set input and output bus data widths, pipelining options, clock enable, and reset pin setups, and then review parameters and output the results as HDL-ready code.

50% Faster Verification Cycles

Verification is one of the most time-consuming and time-critical phases of the design flow. As with most logic design suites, HDL verification and timing analysis are available. The ISE tools also link to additional verification technologies unique in FPGA design, including formal equivalency verification through Formality from Synopsys™ and Prover eCheck from Prover Technology AB, making quick work of verifying Virtex-4 high-density designs.

The ISE design tools also link directly to our optional, separately purchased ChipScope Pro™ real-time debug environment. ChipScope Pro tools insert low-profile logic analyzer, bus analyzer, and virtual I/O software cores during design capture. These cores are then synthesized and implemented into your silicon, allowing you to view:

• Any internal signal within the FPGA

• Embedded processor signals, including the IBM™ CoreConnect processor local bus or on-chip peripheral bus supporting the PowerPC™ 405 inside Virtex-4 FX family devices

• Embedded processor signals for the MicroBlaze™ soft-processor core

Signals are captured at or near operating system speed and brought out through the programming interface, freeing up pins for your design, not debug. You can then analyze captured signals through the ChipScope Pro software logic analyzer.

The ChipScope Pro environment also links internal FPGA debug to Agilent Technologies™ bench-top logic analyzers using the included ChipScope Pro ATC2 core. This core synchronizes the ChipScope Pro tool with Agilent’s FPGA Dynamic Probe software.

This unique partnership between Xilinx and Agilent delivers deeper trace memory, faster clock speeds, and more trigger options, all using fewer pins on the FPGA, making Virtex-4 design debug as much as 50% faster than other logic verification methodologies.

Conclusion

You can unlock the power of Virtex-4 FPGAs with the ISE 6.3i FPGA environment, the most complete available for programmable systems design. Whether your design includes DSP, embedded processing, or high-speed serial I/O, Xilinx ISE software and our optional System Generator for DSP, ChipScope Pro, and EDK and Platform Studio products will get your Virtex-4 LX, SX, and FX designs running with maximum performance, while shortening design cycles and getting you to market faster.


Figure 2 – XtremeDSP slice architecture wizard


by Peter Alfke
Director of Applications Engineering
Xilinx, [email protected]

A FIFO is a memory subsystem where a data sequence can be written and retrieved in exactly the same order. No explicit addressing is required, and the write and read operations can be completely independent, using unrelated clocks.

“First-In First-Out” has been used in accounting for hundreds of years, as well as in data queues since the early days of computers. In 1970, Fairchild Semiconductor introduced the first integrated FIFO, the 3341.

Today, dedicated and much larger FIFO ICs are available, and mid-sized FIFOs are often implemented in Xilinx® FPGAs using the dual-ported block RAMs supported by soft cores for addressing and control.

A FIFO is an ideal subsystem: simple and user-friendly on the outside but complex and demanding in its implementation details. The design seems trivial: a RAM with two independently clocked ports (one for writing, one for reading) plus two independent address counters to steer write and read data.

It may look easy, but the difficulty appears when you look deeper into the challenge – specifically, the decoding and synchronization of the obligatory status outputs indicating the extreme conditions of EMPTY and FULL. Even experienced designers have had problems decoding these two conditions in a fail-safe way, especially when the FIFO operates with two independent clocks of several hundred megahertz.

Because fast asynchronous design is notoriously difficult, Virtex-4™ FPGAs now have a dedicated FIFO addressing and control circuit right inside each block RAM. Using the Virtex-4 block RAM FIFO option, you can be assured of reliable operation at a clock rate up to 500 MHz, without using any logic slices in the Virtex-4 fabric.

Virtex-4 FIFO

The FIFO shown in Figure 1 behaves like a “black box.” You supply the data (4, 9, 18, or 36 bits wide), a continuously running write clock and its enable signal, and a continuously running read clock and read clock enable. Output data has the same width as the input data, unlike the basic block RAM where the two widths can be different.

FIFOs Made Easy

Virtex-4 FPGAs have a complete FIFO controller in each block RAM.


As the last data entry is being read, EMPTY goes high as a result of the read clock that reads the final data. You are supposed to disable the read operation until the EMPTY output has gone inactive again.

Note that both the rising and falling edge of the EMPTY status signal are made synchronous with the read clock, giving you a totally synchronous interface. If read clock enable stays active after the FIFO is empty, the read error flag is activated, but FIFO content and addressing are not disturbed.

ALMOST EMPTY and ALMOST FULL are programmable status outputs, available as a warning to slow down the read or write process, or as an indication of the data level in the FIFO (“dipstick”).
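To make the flag behavior concrete, here is a minimal behavioral sketch in Python. It is not the Virtex-4 implementation: it ignores the two clock domains entirely, and the class name, the threshold parameters, and the exact comparison used for the ALMOST flags are assumptions chosen for illustration. The write-error flag is also an assumption, mirroring the read-error behavior described above.

```python
from collections import deque


class FifoModel:
    """Single-threaded sketch of FIFO status flags (no clock-domain effects)."""

    def __init__(self, depth, almost_empty_offset, almost_full_offset):
        self.depth = depth
        self.ae_offset = almost_empty_offset  # hypothetical threshold parameter
        self.af_offset = almost_full_offset   # hypothetical threshold parameter
        self.mem = deque()
        self.rderr = False
        self.wrerr = False

    @property
    def empty(self):
        return len(self.mem) == 0

    @property
    def full(self):
        return len(self.mem) == self.depth

    @property
    def almost_empty(self):
        # Assumed definition: at most ae_offset words remain.
        return len(self.mem) <= self.ae_offset

    @property
    def almost_full(self):
        # Assumed definition: at most af_offset free locations remain.
        return len(self.mem) >= self.depth - self.af_offset

    def write(self, word):
        if self.full:
            self.wrerr = True   # flag the error, leave the content untouched
            return
        self.wrerr = False
        self.mem.append(word)

    def read(self):
        if self.empty:
            self.rderr = True   # flag the error, leave the content untouched
            return None
        self.rderr = False
        return self.mem.popleft()
```

In this sketch a reader would stop when `almost_empty` or `empty` is set, and a writer would slow down on `almost_full`, which is the "dipstick" use of the programmable flags.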

Implementation Details

Understanding FIFO design details is not necessary. It is all “under the hood,” and works without user intervention. But for the curious reader, let’s briefly explain.

Detecting FULL and EMPTY requires detecting identity of the write and read


Figure 1 – FIFO block diagram (write pointer, read pointer, block RAM core, and status flag logic; ports DIN, DO, wrclk, wren, rdclk, rden, reset, wrcount, rdcount, full, empty, afull, aempty, rderr, wrerr)

Figure 2 – FIFO test circuit (FIFO A and FIFO B in series, driven by Clock A and Clock B, with a counter, a subtract register, and a compare output)

Verifying the EMPTY Flag Synchronization

The only tricky detail in a FIFO with unrelated read and write clocks is the proper synchronization of the EMPTY and FULL flags that cross clock boundaries. Any design that might thus be exposed to metastability problems deserves special attention and scrutiny.

At Xilinx, we tested the EMPTY logic exhaustively by writing data into the FIFO at 200 MHz and reading it out at 500 MHz, which makes it go EMPTY soon after each write cycle (Figure 2). The detection logic was thus exercised, and the trailing edge of the EMPTY flag was re-synchronized to the write clock 200 million times a second.

More specifically, we wrote an ascending data sequence at 200 MHz and read it out at 500 MHz. We wrote the output data directly into a second FIFO at the same 500 MHz. We then read the second FIFO out at the original 200 MHz rate.

The combined dual FIFO forms a synchronous system, but with asynchronous data transfer between the two halves. When we synchronously subtracted the input data from the output data, the difference was constant, indicating flawless transfer at the 500 MHz read/write rate and no flag synchronization problem – even at this high rate.

When the two clock frequencies are uncorrelated, each read clock cycle has a different phase relationship with respect to the write clock. During any second, the active read clock edge steps across the ~5 ns write clock period in ~200 million different phase orientations, thus creating a timing granularity of 0.025 femtoseconds (one femtosecond is one quadrillionth of a second). This resolution is millions of times better than any conventional deterministic test methodology can possibly achieve.

We ran this design for a whole week, with more than 10^14 operations, without any error.
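The two figures quoted in this sidebar can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text (a 200 MHz write clock, about 200 million phase orientations per second, and a one-week run); the variable names are illustrative.

```python
WRITE_CLOCK_HZ = 200e6          # write clock from the test description
PHASES_PER_SECOND = 200e6       # "~200 million different phase orientations"
SECONDS_PER_WEEK = 7 * 24 * 3600

# Write clock period: 1 / 200 MHz = 5 ns.
write_period_s = 1.0 / WRITE_CLOCK_HZ

# Spreading ~200 million sample points across one 5 ns period.
granularity_s = write_period_s / PHASES_PER_SECOND
granularity_fs = granularity_s * 1e15   # convert seconds to femtoseconds

# One write (and one EMPTY event) per write clock cycle for a week.
writes_per_week = WRITE_CLOCK_HZ * SECONDS_PER_WEEK

print(f"period = {write_period_s * 1e9:.1f} ns")
print(f"granularity = {granularity_fs:.3f} fs")
print(f"writes per week = {writes_per_week:.3e}")
```

The granularity comes out at 0.025 fs, and a week at 200 MHz gives roughly 1.2 x 10^14 write cycles, consistent with the "more than 10^14 operations" stated above.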


address pointers, which generally do not share a common clock. Binary counters would generate unacceptable glitches on the comparator output; using Gray-coded counters is the well-known solution to this problem.

The simplest way to build Gray counters is to start with a binary counter and synchronously convert its content into Gray code. The binary address counter values can then be used to calculate the programmable offset for detecting ALMOST FULL and ALMOST EMPTY.
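The binary-to-Gray conversion mentioned here is the standard one: XOR the binary value with itself shifted right by one bit. The Python sketch below illustrates the property that matters for pointer comparison, namely that consecutive Gray codes differ in exactly one bit. The 4-bit address width is an arbitrary choice for the example, not a Virtex-4 parameter.

```python
def binary_to_gray(value: int) -> int:
    """Standard reflected-binary Gray code of a non-negative integer."""
    return value ^ (value >> 1)


def gray_to_binary(gray: int) -> int:
    """Inverse conversion: XOR together all right-shifts of the Gray value."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


ADDRESS_BITS = 4  # illustrative width only
COUNT = 1 << ADDRESS_BITS
gray_codes = [binary_to_gray(i) for i in range(COUNT)]

# A binary counter stepping from 7 to 8 flips four bits at once,
# while the Gray-coded counter flips only one.
print(bits_changed(7, 8))
print(bits_changed(gray_codes[7], gray_codes[8]))
```

Because only one bit changes per increment, a comparator that samples the pointer from the other clock domain sees either the old or the new value, never a spurious intermediate one.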

Synchronization Issues

Because EMPTY can only be caused by a read operation, the leading edge is naturally synchronous with the read clock. But the trailing edge is caused by a write operation and is thus synchronous with the “wrong” clock. Moving the trailing edge of EMPTY over onto the read clock domain needs some flip-flops and invites the specter of metastability.

Virtex-4 FPGAs use a conservative synchronizer design that has been demonstrated to work reliably at a 500 MHz read clock rate. We ran a week-long test with ~200 and ~500 MHz asynchronous clock rates, generating EMPTY more than 10^14 times without a single failure. The synchronizer delays the trailing edge of EMPTY by a few read clock periods. This latency is acceptable, since it does not affect top performance.

In a similar way, the trailing edge of FULL is synchronized to the write clock. The software default is for FULL to have one write clock latency. We therefore recommend using ALMOST FULL instead.

A well-designed FIFO buffer should never go FULL, and should go EMPTY only when you want to drain the last word from the buffer.

Conclusion

The hard-coded FIFO controller is available in every Virtex-4 block RAM, and uses no additional resources in the fabric. It also saves you from making any complex, time-consuming, and risky design decisions.

For a detailed description of the Virtex-4 FIFO controller, visit the Virtex-4 User Guide on the Xilinx website at www.xilinx.com/bvdocs/userguides/ug070.pdf.


Would you like to write for Xcell Publications?

It’s easier than you think.


We recently launched the Xcell Publishing Alliance to help you publish your technical ideas. We can help you – from concept research and development, through planning and implementation, all the way to publication and marketing.

Submit articles for our Web-based Xcell Online or our printed Xcell Journal and we will assign an editor and a graphics artist to work with you to make your work look as good as possible. Submit your book concepts and we will bring our partnership with Elsevier, the largest English language publisher in the world, and our broad industry resources to assist you in planning, research, writing, editing, and marketing.

For more information on this exciting and highly rewarding program, please contact:

Forrest Couch
Managing Editor, Xcell Publications

[email protected]


by Ralf Kreuger
Sr. Staff Applications Engineer
Xilinx, [email protected]

As FPGAs grow in size, quality on-chip clock distribution becomes increasingly important. Clock skew and clock delay impact device performance; managing clock skews and delays with conventional clock trees becomes more difficult in larger devices.

Xilinx® Virtex-4™ devices solve this challenge by providing as many as 20 fully dedicated on-chip digital clock management (DCM) circuits. DCM provides zero propagation delay and – along with fully differential global clock trees – low clock skew between output clock signals distributed throughout the device.

Each DCM can drive up to 12 of the 32 global clock routing networks within the device. The global clock distribution network minimizes clock skews due to loading differences. By monitoring a sample of the DCM output clock, the delay locked loop (DLL) compensates for the delay on the routing network, effectively eliminating the delay from the external input port to the individual clock loads within the device.

Digital Clock Management in Virtex-4 Devices

The new Virtex-4 FPGA includes improvements and additions to the digital clock module.


In addition to providing zero delay with respect to a user source clock, DCM provides multiple phases of the source clock. The DLL can act as a clock doubler or divide the user source clock by up to 16.

DCM can also act as a clock mirror. By driving DCM output off-chip and back in again, you can use it to de-skew a board-level clock between multiple devices.

Digital Phase Shift (DPS)

Virtex-4 FPGAs provide a digital phase shift (DPS) module that phase shifts the DCM’s output clock in small increments – 1/256th of its period. You can operate the versatile DPS in four different modes for maximum flexibility: fixed, variable-positive, variable-center, and direct.
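As a rough illustration of what a 1/256th-period increment means in time, the following Python sketch converts a clock frequency into a nominal step size. The function names and the 200 MHz example are assumptions for illustration; the sketch computes only the nominal 1/256 figure from the text and ignores any additional limits the silicon may impose.

```python
STEPS_PER_PERIOD = 256  # phase-shift resolution stated in the text


def dps_step_ps(clock_hz: float) -> float:
    """Nominal size of one phase-shift increment, in picoseconds."""
    period_ps = 1e12 / clock_hz
    return period_ps / STEPS_PER_PERIOD


def dps_shift_ps(clock_hz: float, steps: int) -> float:
    """Nominal delay produced by a given number of increments."""
    return steps * dps_step_ps(clock_hz)


def dps_shift_degrees(steps: int) -> float:
    """The same shift expressed as a phase angle."""
    return 360.0 * steps / STEPS_PER_PERIOD


# Example: a 200 MHz clock has a 5 ns period.
print(dps_step_ps(200e6))        # size of one step in ps
print(dps_shift_ps(200e6, 64))   # 64 steps, a quarter of the period
print(dps_shift_degrees(64))     # the same quarter period as an angle
```

For a 200 MHz clock, one nominal step is about 19.5 ps, and 64 steps correspond to a 90-degree shift.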

Digital Frequency Synthesis (DFS)

The DCM digital frequency synthesis (DFS) module provides two outputs, CLKFX and CLKFX180, derived from the input clock by frequency multiplication and division. Through a frequency calculator, you provide the multiply and divide values implemented by the DFS module. For example, an M value of 19 and a D value of 8 yields a 2.375 source clock multiplier.
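The M/D relationship is simple enough to check directly. The Python sketch below reproduces the 19/8 example from the text using exact rational arithmetic; the function name and the 100 MHz input clock are assumptions for illustration, and the legal ranges of M and D are deliberately not modeled because the text does not state them.

```python
from fractions import Fraction


def dfs_output_hz(clkin_hz: int, m: int, d: int) -> Fraction:
    """CLKFX frequency for multiply value m and divide value d."""
    return clkin_hz * Fraction(m, d)


# The example from the text: M = 19, D = 8.
multiplier = Fraction(19, 8)
print(float(multiplier))

# Applied to a hypothetical 100 MHz source clock.
clkfx_hz = dfs_output_hz(100_000_000, 19, 8)
print(float(clkfx_hz) / 1e6, "MHz")
```

CLKFX180 runs at the same frequency as CLKFX, shifted by half a period.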

DCM Features

DCMs are located in the center column of the Virtex-4 architecture. This enables well-matched clock routes to and from every DCM for enhanced symmetry.

The Virtex-4 DCM’s superior performance does not just include a wider operating range. It encompasses lower jitter, improved phase accuracy, finer phase-shift resolution, tolerance of imperfect clocks and board designs, less duty-cycle distortion, and less sensitivity to sporadic voltage changes.

Xilinx also added new features. You now have the choice to trade off a wider phase shift range versus higher frequencies.

In addition, a new function in the Virtex-4 architecture is the dynamic reconfiguration port (DRP). The DRP allows you to directly access some features in DCM through a block RAM-style interface. You can directly phase shift the delay line elements and change M and D values.

The software view of DCM has changed as well. Three Virtex-4 primitives – DCM_BASE, DCM_PS, and DCM_ADV – offer progressive features to enhance your design choices.

Xilinx also added a new DCM companion block, the phase-matched clock divider (PMCD), to the Virtex-4 family. Let’s discuss the clock management features of these new clock resources.

Phase-Matched Divided Clocks

PMCDs create as many as four frequency-divided and phase-matched versions of an input clock, CLKA. The output clocks are a function of the input clock frequency: divided-by-1 (CLKA1), divided-by-2 (CLKA1D2), divided-by-4 (CLKA1D4), and divided-by-8 (CLKA1D8).

CLKA1, CLKA1D2, CLKA1D4, and CLKA1D8 output clocks are rising-edge aligned to each other, but not to the input (CLKA). Figure 1 illustrates the new PMCD primitive.

Phase-Matched Delay Clocks

PMCDs preserve edge alignments, phase relations, or skews between the CLKA input clock and other PMCD input clocks. Three additional inputs (CLKB, CLKC, and CLKD) and three corresponding delayed outputs (CLKB1, CLKC1, and CLKD1) are available. The same delay is inserted to CLKA, CLKB, CLKC, and CLKD; thus, the delayed CLKA1, CLKB1, CLKC1, and CLKD1 outputs maintain edge alignments, phase relationships, or the skews of their respective inputs.

You can use PMCDs alone or with other clock resources, including global buffers and DCMs. Together, these clock resources provide flexibility in managing complex clock networks.

The PMCDs are located in the center column right next to the DCMs. They are grouped as pairs in each tile.

Conclusion

The many features and functions of the clock management subsystem allow you to maximize system performance. By taking advantage of DCM to remove on-chip clock delay, you can greatly simplify and improve system-level designs involving high fan-out, high-performance clocks. Virtex-4 devices have an abundance of clock management resources along with comprehensive software support.

Specialized individual features further improve the ability to optimize design performance. Frequency synthesis is a powerful feature to generate a wide range of frequencies in the FPGA or the entire system. A fine-resolution phase-shift capability allows you to improve margins. And the new PMCD further increases the number of clock derivatives that can be generated without the use of additional DCMs.

For more information, see the user guide at www.xilinx.com/bvdocs/userguides/ug070.pdf.


Figure 1 – Phase-matched clock divider (inputs CLKA, CLKB, CLKC, CLKD, RST, and RELEASE; outputs CLKA1 at Fa, CLKA1D2 at Fa/2, CLKA1D4 at Fa/4, CLKA1D8 at Fa/8, and CLKB1, CLKC1, CLKD1 at Fb, Fc, Fd)
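The divided outputs of the PMCD described in this article can be pictured with a small Python sketch. It is only an idealized model: it assumes all four dividers start together on the same CLKA1 edge (for example after reset is released), and it measures time in whole CLKA1 cycles. The dictionary and function names are illustrative.

```python
# Output name -> division ratio, as listed in the article.
DIVIDERS = {"CLKA1": 1, "CLKA1D2": 2, "CLKA1D4": 4, "CLKA1D8": 8}


def output_frequencies(fa_hz: float) -> dict:
    """Frequency of each PMCD output for an input clock of fa_hz."""
    return {name: fa_hz / ratio for name, ratio in DIVIDERS.items()}


def rising_edges(ratio: int, n_cycles: int) -> list:
    """CLKA1 cycle indices at which a divide-by-ratio output rises.

    Assumes every divider starts aligned at cycle 0.
    """
    return [n for n in range(n_cycles) if n % ratio == 0]


freqs = output_frequencies(400e6)   # hypothetical 400 MHz input
edges = {name: rising_edges(ratio, 32) for name, ratio in DIVIDERS.items()}

# Every rising edge of a slower output coincides with a rising edge
# of each faster output, which is what "rising-edge aligned" means here.
aligned = set(edges["CLKA1D8"]) <= set(edges["CLKA1D4"]) <= set(edges["CLKA1D2"]) <= set(edges["CLKA1"])
print(freqs)
print(aligned)
```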


by Markus Adhiwiyogo
Applications Engineer
Xilinx, [email protected]

Digital designs require good clock signals with a short delay and minimal skew, so that they arrive almost simultaneously at their many on-chip destinations. Clocks must maintain their duty cycle, which is especially important in double-data-rate designs where data is clocked on the rising as well as on the falling clock edge. Those delays and edge rates must therefore always be closely matched, independent of their loading.

Although single-clock operation is desirable, many systems require multiple clocks. Often, input and output signals are clocked very fast and require even better timing precision than the general logic implemented on the chip.

Xilinx® Virtex-4™ FPGAs provide significant advances in all of these areas. Global clocks can reach all flip-flops on the chip, and high-speed I/O clocks provide exceptional performance, especially for source-synchronous interfaces. Additional regional clocks serve specific areas on the chip.

Virtex-4 Clocking Resources

Xesium clocking networks are an innovative feature in Virtex-4 devices.


Clock Regions

For clocking purposes, each Virtex-4 device is divided into regions. The number of regions varies with device size, from 8 regions in the smallest device to 24 regions in the largest one.

Global Clocks

Independent of array size, each Virtex-4 FPGA has 32 low-skew global clock distribution networks that can each clock all sequential resources on the whole chip (CLBs, block RAMs, DCMs, and I/Os) and also drive logic signals. You can use any 8 of these 32 global clock lines in any region.

All global clock inputs have dedicated fast routing to the corresponding global clock buffer, which can also be used as a clock-enable circuit or a glitch-free multiplexer. It can select between two clock sources and can also switch away from a failed clock source – a new feature in the Virtex-4 architecture.

A global clock buffer is often driven by a digital clock manager (DCM) to eliminate the clock distribution delay, or to adjust its delay relative to another clock. There are more global clocks than DCMs, and a DCM often drives more than one global clock.

Virtex-4 clock trees are designed for low skew and low power. Any unused branch is automatically disconnected. All global clock lines and buffers are implemented differentially. This minimizes duty-cycle distortion and improves common-mode noise rejection. The whole global clock network is designed for 500 MHz operation and beyond.

I/O Clocks and Regional Clocks

Virtex-4 devices have two additional clock types: I/O clocks and regional clock networks, two of each per region, used primarily for clocks forwarded into the Virtex-4 FPGA. I/O and regional clock networks are independent from the global clock networks, thus offering a maximum of 12 independent clock domains in any clock region.

Each clock region has two pairs of clock-capable inputs, optimized for incoming high-frequency clocks. Clock-capable I/O pairs, like global clock inputs, are regular I/O pairs where the LVDS output drivers have been removed to reduce the input capacitance.

Each of these input pins or input pin pairs can connect to a BUFIO that drives a high-speed differential I/O clock network, which is dedicated to the I/O circuits and is ideally suited for source-synchronous data capture using the built-in serializer/deserializer (SerDes).

Each BUFIO can drive all I/O logic in its region as well as in the two adjacent regions (Figure 1). This means that one receive clock can control up to 47 differential or 95 single-ended receive data lines, ideal for many networking and memory interface applications.

Regional clocks form a third type of clock networks, each being able to span as many as three adjacent clock regions. Regional clocks drive single-ended nets and are intended for the parallel clock domain of the SerDes.

You can program the regional clock buffer to divide the incoming clock rate by any integer number from one to eight. This feature, in conjunction with the programmable SerDes in the I/O block, allows source-synchronous systems to cross clock domains without using additional logic resources.

Conclusion

Virtex-4 clocking resources have been optimized for high clock rates and multiple clock domains. Thirty-two global clock networks provide high-performance clocking across the whole chip, with short delay, low skew, and stable duty cycles.

Many localized clock networks serve the I/O for high-speed source-synchronous applications. These clock networks are used in conjunction with the built-in SerDes and reduce the burden on global clock resources.

Last but not least, all of these resources are easy to use. They are automatically handled by the Xilinx ISE 6.3i software.

For more information, visit www.xilinx.com/products/virtex4/capabilities/xesium.htm.


Figure 1 – BUFIO and BUFR clocking up to three regions (a clock region with 16 I/O tiles, clock-capable I/Os, BUFIO and BUFR buffers, connections to the adjacent regions and to the center of the die, and columns of CLBs, BRAM, and DSP tiles)
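The counts quoted in this article (12 clock domains per region, and 47 differential or 95 single-ended data lines per receive clock) follow from simple bookkeeping. The Python sketch below reconstructs them. The per-region figures for global, I/O, and regional clocks come from the text; the 16 I/O tiles per region are read off the figure, and the two pads per I/O tile are an assumption made here so that the totals can be derived.

```python
# Clock domains available in one region (all three numbers from the text).
GLOBAL_CLOCKS_PER_REGION = 8
IO_CLOCKS_PER_REGION = 2
REGIONAL_CLOCKS_PER_REGION = 2
domains_per_region = (GLOBAL_CLOCKS_PER_REGION
                      + IO_CLOCKS_PER_REGION
                      + REGIONAL_CLOCKS_PER_REGION)

# Reach of one BUFIO: its own region plus the two adjacent ones.
REGIONS_REACHED = 3
IO_TILES_PER_REGION = 16   # as drawn in the figure
PADS_PER_IO_TILE = 2       # assumption: one differential pair per tile

pairs_reached = REGIONS_REACHED * IO_TILES_PER_REGION
pads_reached = pairs_reached * PADS_PER_IO_TILE

# One pair (or one pad) is used by the forwarded clock itself.
max_differential_data = pairs_reached - 1
max_single_ended_data = pads_reached - 1

print(domains_per_region, max_differential_data, max_single_ended_data)
```

Under these assumptions the sketch reproduces the 12, 47, and 95 stated in the article.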


by Reed Tidwell
Sr. Staff Applications Engineer
Xilinx, [email protected]

The XtremeDSP™ system feature, embodied as the DSP48 slice primitive in the Xilinx® Virtex-4™ architecture, is a high-performance computing element operating at an industry-leading 500 MHz. The design of the Virtex-4 infrastructure supports this rate, with Xesium clock technology, Smart RAM, and LUTs configured as shift registers.

Many applications, however, do not have data rates of 500 MHz. So how can you harness the full computing performance of the DSP48 slice with data streams of lower rates?

The answer is to use a double-data-rate (DDR) technique through the DSP48 slice. The DSP48 slice, operating at 500 MHz, can multiplex between two data streams, each operating at 250 MHz.

One application of this technique is alpha blending of video data. Alpha blending refers to the combination of two streams of video data according to a weighting factor, called alpha. In this article, we’ll explain the techniques and design considerations for applying DDR to two data streams through a single DSP48 slice.

Alpha Blending Two Data Streams Using a DSP48 DDR Technique


Achieve full throughput of the DSP48 slice with a double-data-rate technique.


Virtex-4 DSP48

The DSP system elements of Virtex-4 FPGAs are dedicated, diffused silicon with dedicated, high-speed routing. Each is configurable as an 18 x 18-bit multiplier; a multiplier followed by a 48-bit accumulator (MACC); or a multiplier followed by an adder/subtracter. Built-in pipeline stages provide enhanced performance for 500 MHz throughput – 35% higher than for competing technologies.

All Virtex-4 devices have DSP48 slices, although the SX family contains the largest number (an industry-high 512) and the highest concentration of DSP48 slices to logic elements, making it ideal for math-intensive applications such as image processing.

A triple-oxide 90 nm process makes the DSP48 slice very power-efficient.

Architectural features, including built-in pipeline registers, accumulator, and cascade logic nearly eliminate the use of general-purpose routing and logic resources for DSP functions, and further reduce power. This slashes DSP power consumption to a fraction when compared to Virtex-II Pro™ devices.

DDR with Two Data Streams

DDR, in this context, refers to multiplexing two input data streams into one stream at twice the rate, interleaving (in time) the data from each stream (Figure 1). Figure 1 also shows the reverse operation, creating two parallel resultant streams after processing.

You can drive the DSP48 slice inputs at the fast 500 MHz clock rate from CLB flip-flops; CLB LUTs configured as shift registers (SRL16); or directly from block RAM. Block RAM, configured as a FIFO using the built-in FIFO support, also supports the 500 MHz clock rate.

Design Considerations

Dealing with data at 500 MHz requires great care; you should observe strict pipelining with registers on the outputs of each math or logic stage. The DSP48 slice provides optional pipeline registers on the input ports, on the multiplier output, and on the output port from the adder/subtracter/accumulator. Block RAM also has an optional output register for efficient pipelining when interfaced to the DSP48 slice.

Where you are using CLBs, place only minimal levels of logic between registers to provide maximum speed. For DDR operation, only a 2:1 mux (a single LUT level) is required between pipeline stages. Whether you are interfacing to the DSP48 slice with memory or CLBs, placing connected 500 MHz elements in close proximity minimizes connection lengths in the general routing matrix.

DDR requires the DSP48 slice to operate at double the frequency of the input data streams. You can use a DCM to provide a phase-aligned double-frequency clock using the CLK2X output.

Another aspect of inserting DDR data through a section of pipeline is ensuring that data passes cleanly between clock domains. This may require adding extra registers clocked with the double-frequency clock at the output of the double-pumped section, to synchronize the data with the original clock. The rule of thumb is that in order to insert a double-pumped section cleanly into a single-pumped pipeline, there must be an even number of register delays in the double-pumped section.


Figure 1 – DSP48 DDR (Data Stream 0 and Data Stream 1 multiplexed into one DDR data stream through the DSP48, then separated into Processed Stream 0 and Processed Stream 1)

Figure 2 – Two-stream multiply through DSP48 slice (inputs A0, B0, A1, B1; clocks clk1x and clk2x; out0 = A0 * B0, out1 = A1 * B1)


Implementation

Several configuration options exist for implementing DDR functionality. Figure 2 shows a straightforward implementation.

In Figure 2, stream 0 consists of A0 and B0 inputs. We multiply them together and output as out0. Likewise, stream 1 consists of inputs A1 and B1 multiplied together and output as out1. There are two clock domains: the clk1x domain, at the nominal data stream frequency, and the clk2x domain, at twice the nominal frequency.

Figure 2 shows two registers after the multiplier. The second is the accumulation register, even though we do not use accumulation in this configuration. The register, however, is still required to achieve the full, pipelined performance. We use two sets of registers on the inputs of the DSP to make the total delay through the DSP48 slice an even number (four) for easier alignment of the output data with clk1x. These registers are “free” because they are built into the DSP48 slice, and using them reduces the need for alignment registers external to the DSP48 slice. The extra pipeline register on out0 compensates for taking stream 0 into the DSP one clk2x cycle before stream 1. As seen from the timing diagram in Figure 3, this is required to re-align the stream 0 data back into the clk1x domain.

Note that the input mux select, mux_sel, is essentially the inverse of clk1x. It is important, however, to generate this signal from a register based on clk2x (rather than deriving it from clk1x) to avoid hold-time violations on the receiving registers.

At the transitions between clock domains, the data have only one clk2x period to set up. This is the reason to have no logical operations between registers in the two domains. The placement of the first registers in the clk1x domain is more critical than other registers in the same domain.
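The two-stream multiply described in this section can be mimicked with a cycle-level Python sketch. It is an idealized software model, not HDL: one loop iteration stands for one clk2x cycle, a four-entry queue stands for the four register delays through the DSP48 path, and a single variable stands for the extra alignment register on out0. All names are illustrative.

```python
from collections import deque

PIPELINE_DEPTH = 4  # even number of register delays, as described in the text


def ddr_multiply(a0, b0, a1, b1):
    """Multiply two streams through one shared multiplier running at clk2x.

    Returns (out0, out1), where out0[k] = a0[k] * b0[k] and
    out1[k] = a1[k] * b1[k], re-aligned to the clk1x sample rate.
    """
    n = len(a0)
    pipe = deque([None] * PIPELINE_DEPTH)  # registers in the double-pumped section
    align0 = None                          # extra register on the stream 0 result
    out0, out1 = [], []
    # Enough clk2x cycles to feed every sample and then flush the pipeline.
    for t in range(2 * n + PIPELINE_DEPTH):
        sel = t % 2        # mux select toggles on every clk2x cycle
        k = t // 2         # index of the current clk1x sample
        if k < n:
            product = a0[k] * b0[k] if sel == 0 else a1[k] * b1[k]
        else:
            product = None  # inputs exhausted, only flushing
        pipe.append(product)
        result = pipe.popleft()
        if result is None:
            continue
        # With an even depth, the emerging result belongs to the same
        # stream that the mux is currently selecting.
        if sel == 0:
            align0 = result          # hold stream 0 for one clk2x cycle
        else:
            out0.append(align0)      # both results are now presented together
            out1.append(result)
    return out0, out1


print(ddr_multiply([1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 1, 2]))
```

The even pipeline depth is what lets the output demultiplexer reuse the input select phase; with an odd depth the two streams would emerge swapped relative to the select signal, which is the alignment problem the rule of thumb avoids.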

Alpha Blending

Alpha blending of video streams is a method of blending two images into a single combined image, such as fading between two images, overlaying anti-aliased or semi-transparent graphics over an image, or making a transition band between two images on a split-screen or wipe. Alpha is a weighting factor defining the percentage of each image in the combined output picture. For two input pixels


Figure 3 – Timing of two-stream multiply (signals: clk1x, clk2x, A0 Reg, A1 Reg, A DSP input, B0 Reg, B1 Reg, B DSP input, Mux sel, Mult. Reg, Adder Reg, align 0 reg, out 0, out 1)

Figure 4 – Alpha blend formula in graphical terms (P0 weighted by alpha plus P1 weighted by 1 - alpha gives Pf)

Figure 5 – Alpha blend on three-component video (red, green, and blue components of Video Stream 0 and Video Stream 1 blended in DSP48 slices, with alpha and 1 - alpha supplied by an alpha generator)


P0 and P1, and a blend factor α (where 0 ≤ α ≤ 1.0), the output pixel Pf will be:

Pf = αP0 + (1 – α)P1 (see Figure 4)

This operation is performed separately for each component: red, green, and blue.
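As a quick numerical illustration of the formula, the Python sketch below blends two RGB pixels component by component. It assumes 8-bit components and a floating-point alpha purely for readability; a hardware implementation would use fixed-point values, and the function name is illustrative.

```python
def blend_pixel(p0, p1, alpha):
    """Apply Pf = alpha * P0 + (1 - alpha) * P1 to each color component."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return tuple(round(alpha * c0 + (1.0 - alpha) * c1)
                 for c0, c1 in zip(p0, p1))


red = (255, 0, 0)
blue = (0, 0, 255)

print(blend_pixel(red, blue, 1.0))   # all of pixel 0
print(blend_pixel(red, blue, 0.0))   # all of pixel 1
print(blend_pixel((200, 100, 0), (100, 200, 50), 0.5))  # equal mix
```

With alpha at 1.0 the output equals P0, with alpha at 0 it equals P1, and intermediate values give a weighted mix, which is what makes the same formula usable for fades, overlays, and wipes.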

A pixel rate of 250 MHz or less is sufficient for all standard and high-definition video rates, and common Video Electronics Standards Association (VESA) standards as high as 1600 x 1200 at 85 Hz. Therefore, one DSP48 slice can perform the multiply and add on one component, and a set of three slices can alpha blend the three components from each of two video streams, as shown in Figure 5. The operations must be performed identically and in parallel on each of the three components.

There are several ways to implement alpha blending depending on the nature of the video streams and how alpha is generated. Figure 6 shows a basic implementation with two video streams alternating as one multiplier input. The other multiplier input alternates between alpha and 1 - alpha.

The operating mode of the adder alternates between add zero (pass through) mode and add output (accumulate) mode. The DSP48 slice output register contains the result of the Video0 * alpha multiply during one clock cycle, and the final result (Video1 * (1 – alpha) + Video0 * alpha) on the alternate clock. Figure 7 shows the timing for this configuration.

The align registers on the inputs of the DSP are used to make the total delay through the DSP48 slice an even number (four), as explained in the previous example. The final output register for blend loads new data on every other DSP clock to register the blend results at the original pixel rate.
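The alternating adder modes can be sketched in Python as a two-phase multiply-accumulate per pixel. This is an arithmetic model only: it ignores pipeline latency and the alignment registers, handles one color component, and represents alpha as an integer with eight fractional bits, which is an assumption made here for illustration rather than a format stated in the article.

```python
ALPHA_FRACTION_BITS = 8                  # hypothetical fixed-point format
ALPHA_ONE = 1 << ALPHA_FRACTION_BITS     # the integer that represents 1.0


def blend_stream(video0, video1, alpha_fixed):
    """Blend one component of two streams with a shared multiplier.

    Each pixel takes two clk2x phases: phase 0 loads Video0 * alpha
    (add-zero mode), phase 1 adds Video1 * (1 - alpha) (accumulate mode).
    """
    blended = []
    acc = 0
    for v0, v1, a in zip(video0, video1, alpha_fixed):
        operands = ((v0, a), (v1, ALPHA_ONE - a))
        for phase, (video, weight) in enumerate(operands):
            product = video * weight
            if phase == 0:
                acc = product        # add zero: pass the product through
            else:
                acc = acc + product  # add output: accumulate onto phase 0
        # The output register captures the sum on every other clk2x cycle.
        blended.append(acc >> ALPHA_FRACTION_BITS)
    return blended


# alpha = 128/256 (an equal mix), then alpha = 256/256 (all of Video0).
print(blend_stream([200, 10], [100, 250], [128, 256]))
```

Capturing the accumulator only after the second phase corresponds to the final output register loading on every other DSP clock, so the blended result appears at the original pixel rate.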

Conclusion

You can efficiently use the high performance of Virtex-4 devices with DSP48 slices by processing multiple data streams in a time-multiplexed fashion. With careful design, a single DSP48 can perform multiply operations on two independent data streams, operating at 250 MHz each.

Alpha blending of video streams, as outlined in this article, is one example of processing two data streams through a single DSP48 slice. This capability complements the DSP features of Virtex-4 FPGAs – including built-in pipelining and cascading, integrated 48-bit accumulator, and an abundance of DSP48 slices in the SX family – to make Virtex-4 devices the ideal DSP platform.

For details about the DSP48 slice, refer to the “Virtex-4 FPGA Handbook,” Chapter 10, or the “XtremeDSP Design Considerations User Guide” at www.xilinx.com/bvdocs/userguides/ug073.pdf.


Figure 6 – Alpha blend implementation (one component): blend = (Video0 * alpha) + (Video1 * (1-alpha))

Figure 7 – Alpha blend timing (signals: clk1x, clk2x, Video0 reg, Video1 reg, A input reg, alpha reg, 1 - alpha reg, B input reg, Mux sel, A align reg, B align reg, Mult. Reg, Acc. Reg, blend output)


by Tze Yeoh
Product Applications Engineering
Xilinx, [email protected]

Xilinx® FPGAs provide connectivity in very high speed source-synchronous bus interfaces. Transmission rates of 1 Gbps and higher are not uncommon for these types of interfaces.

In source-synchronous interfaces, the transmitter forwards a dedicated clock along with the data. As data rates skyrocket to 1 Gbps and beyond, you may find that your timing budgets are eaten away by skew and jitter.

Skew is defined as the difference in arrival time between signals sent at the same time. It is caused by variations in board trace lengths, connectors, package flight-time delays, and secondary parasitic effects. Figure 1 illustrates how the improper routing of board traces and the use of connectors contribute to skew at the receiver.

Dynamic Phase Alignment with ChipSync Technology in Virtex-4 FPGAs


ChipSync technology built into every I/O supports dynamic phase alignment solutions for high-speed source-synchronous interfaces.


Another challenge is jitter, the deviation from ideal timing caused mostly by slow transition times, ground bounce, inter-symbol interference, and electromagnetic interference. Figure 2 illustrates the combined effects of skew and jitter on a system designer’s timing budget.

In a real system, many bits of data (16, for example) are received in parallel and must be clocked into the receiver by the common clock sent together with the data. Ideally, the clock edge arrives in the middle of the bit time, thus offering a maximum timing margin.

But in reality, the individual data bits arrive at slightly different times, and each suffers from timing jitter on its rising and falling edges; the clock signal also suffers from timing jitter. All of these effects combine to shrink the data-valid window, and thus might lead to unreliable data transmission.

the delay lines in a region are continuously calibrated by a servo system using a dedicated delay line, a 200 MHz user-provided clock, and a phase-comparator-driven PLL circuit that adjusts the delay line(s) such that the 64-stage delay equals one period of the clock (5 ns / 64 = 78 ps per tap).

All delay lines in one region share a common adjustment, and thus have the same tap delay, as accurately as delay tracking in a small silicon area allows. The reference frequency is specified, tested, and supported by software at 200 MHz. Minor variations can be tolerated, and jitter is filtered out by the control structure. This programmable precision delay will find its way into many innovative applications. Here it is described only as a method to achieve dynamic phase alignment.
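The tap-delay arithmetic can be checked with a short Python sketch. The 200 MHz reference and 64 taps come from the text; the function names and the 1 Gbps data rate used in the example are illustrative assumptions.

```python
REF_CLOCK_HZ = 200e6   # reference clock frequency given in the text
TAPS = 64              # number of delay-line stages given in the text


def tap_delay_ps(ref_clock_hz=REF_CLOCK_HZ, taps=TAPS):
    """Delay of one tap when the full line spans one reference period."""
    period_ps = 1e12 / ref_clock_hz
    return period_ps / taps


def taps_per_bit(bit_rate_bps, ref_clock_hz=REF_CLOCK_HZ, taps=TAPS):
    """Number of taps that fit into one bit period at a given data rate."""
    bit_period_ps = 1e12 / bit_rate_bps
    return bit_period_ps / tap_delay_ps(ref_clock_hz, taps)


if __name__ == "__main__":
    print(f"tap delay: {tap_delay_ps():.3f} ps")
    print(f"taps per bit at 1 Gbps: {taps_per_bit(1e9):.1f}")
```

At an assumed 1 Gbps, a bit period spans roughly 12.8 taps, which gives a feel for how finely the sampling point can be placed inside the data eye.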

The ChipSync technology built into every I/O contains a dedicated serial-to-parallel converter that converts the high-speed serial stream to a sequence of parallel words that can be processed at a much slower rate within the FPGA. This feature decouples the high-speed serial data transfer from the clock rate supported by the FPGA fabric.

The converter supports both single data rate (SDR) and double data rate (DDR) modes. In SDR mode, the serial-to-parallel converter is fully programmable to generate anywhere from 2- to 8-bit parallel words. In DDR mode, the converter can be programmed to de-serialize by a factor of 4, 6, 8, or 10, as specified by the HDL attributes of the ChipSync technology. The maximum width in a single ChipSync module is six. For larger bit widths, you can connect two adjacent ChipSync modules in master-slave mode.

Word alignment can correct for data skew greater than one bit period by comparing the parallel version of the incoming pattern to the pre-specified training pattern. The Bitslip module enables you to

Virtex-4™ data and clock inputs offer ChipSync™ technology, facilitating dynamic phase alignment (DPA). DPA can greatly reduce the skew between different data lines, as well as between the data lines and their associated clock input.

Using a system-generated training pattern, the receiving FPGA can adjust the input delay of each data and clock input, using individual precision delay lines on every input buffer. Gross errors exceeding one bit time pass through the bit-serial interface, but can be corrected after serial-to-parallel conversion using the Bitslip module.

A Generic Networking Interface Example

The generic interface is defined by a 16-channel bus and a forwarded clock. The signaling standard is low-voltage differential signaling (LVDS). The interface protocol specifies a de-skewing method called “training.” During the initialization phase, the transmitter sends a repetitive 20-bit training pattern. The receiver uses it to de-skew the interface by delaying each data bit such that it is optimally centered over the received clock edge. The interface specification calls for the receiver to correct as much as +/- 1 bit time of channel-to-channel skew.

This fine-grained delay adjustment uses a 64-tap delay line with a counter-controlled tap multiplexer available on each input. All of


Figure 1 – Improper board trace routing and use of connectors contribute to skew (an SPI-4.2 source connected to an SPI-4.2 sink through a connector)

Figure 2 – Effects of skew and jitter on timing budgets (labels: bit period; data valid window; clock jitter and data jitter; skew between clock and data plus skew between data channels)


match an incoming data stream to a pre-determined data pattern by shifting the output of the dedicated serial-to-parallel converter. An example of this feature in operation is given in Figure 3.

The IDELAY, SERDES, and Bitslip features are encapsulated in a module called ISERDES, available as part of the ChipSync technology in every I/O.

The Virtex-4 DPA Solution

Let’s use the Virtex-4 ChipSync technology features previously described to create a DPA solution that meets interface requirements. There are three basic steps in the solution:

• Bit alignment – completed during the initialization procedure, its purpose is to correct for skew less than one bit time and position the clock edge at the center of the data eye

• Word alignment – completed during the initialization procedure, its purpose is to align the incoming data stream to the pre-determined training pattern

• Real-time window monitoring – continuously monitors the data eye so that the clock edge is always centered in the data eye

Figure 4 illustrates the implementation of DPA in a Virtex-4 device.

The goal of the bit-alignment procedure is to position the captured clock edge in the center of the data eye to provide maximum margin. The bit-alignment procedure takes advantage of the dedicated 64-tap delay line feature of the ISERDES.
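One plausible way to center the sampling point is to sweep all 64 taps, find the widest run of taps at which the training pattern is captured reliably, and settle on the middle of that run. The Python sketch below illustrates that idea only; the article does not spell out the algorithm, and `sample_is_stable` is a hypothetical callback standing in for whatever stability check the training controller performs.

```python
def find_eye_center(sample_is_stable, taps=64):
    """Return the tap at the middle of the widest run of stable samples.

    sample_is_stable(tap) is a hypothetical callback that reports whether
    the training pattern is captured reliably at that delay tap.
    """
    best_start, best_len = None, 0
    run_start = None
    for tap in range(taps + 1):            # one extra pass closes a final run
        stable = tap < taps and sample_is_stable(tap)
        if stable and run_start is None:
            run_start = tap                # a stable window opens here
        elif not stable and run_start is not None:
            run_len = tap - run_start      # the window closed at the previous tap
            if run_len > best_len:
                best_start, best_len = run_start, run_len
            run_start = None
    if best_start is None:
        raise ValueError("no stable sampling window found")
    return best_start + (best_len - 1) // 2


if __name__ == "__main__":
    # Assumed example: the data eye is open between taps 10 and 22.
    print(find_eye_center(lambda tap: 10 <= tap <= 22))
```

Choosing the widest window rather than the first one found makes the sketch tolerant of narrow, spurious stable regions near a data transition.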

The word alignment procedure aligns the output pattern from the ISERDES to a specific training pattern. This procedure effectively removes word skew and aligns all channels to a specific word boundary. The word alignment unit primarily uses the Bitslip module of the ISERDES. Each channel monitors the pattern streaming in. If the training pattern is not found, the channel activates Bitslip until the pattern is found. Once the pattern is found, the channel is – by definition – de-skewed.
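The word alignment loop can be sketched in Python as follows. This is a simplified model that treats each Bitslip operation as a one-bit rotation of the deserialized word; the shift sequence of the actual ISERDES may differ depending on the mode, and the function names here are illustrative.

```python
def bitslip(word, width):
    """Rotate a deserialized word left by one bit (simplified Bitslip model)."""
    mask = (1 << width) - 1
    return ((word << 1) | (word >> (width - 1))) & mask


def word_align(word, training_pattern, width):
    """Return how many Bitslip operations align word to the training pattern."""
    for slips in range(width):
        if word == training_pattern:
            return slips
        word = bitslip(word, width)
    raise ValueError("training pattern not found in any bit position")


if __name__ == "__main__":
    # Assumed 8-bit example: the received word is the pattern rotated by 3 bits.
    print(word_align(0b01110010, 0b10010011, 8))
```

In this model, applying Bitslip as many times as the word is wide returns the original word, so the search is guaranteed to terminate after at most one full rotation.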

After the initialization stage using the training procedure, the channels are assumed to remain trained throughout normal operation. However, the data valid window might shift because of operating conditions. The window monitoring unit can continuously monitor the data valid window during normal operation and can adjust the sampling point as necessary to provide maximum margin.

Conclusion

Dynamic phase alignment is a critical function in many bus interfaces as data rates explode into the gigabit range. As FPGAs are increasingly used directly in the data path of these very high speed interfaces, dynamic phase alignment in the FPGA is a must.

Virtex-4 ChipSync technology built into every I/O enables you to quickly and easily develop a DPA solution that meets the needs of your application.

An application note describing the implementation of DPA, “Dynamic Phase Alignment for Networking Applications” (XAPP700), is available at www.xilinx.com/bvdocs/appnotes/xapp700.pdf. The reference design enables you to quickly understand how to implement a DPA solution that fits your particular application.


Figure 3 – Operation of Bitslip (the ISERDES output pattern is shown initially and after each of eight successive Bitslip operations; the eighth Bitslip returns the initial pattern)

Figure 4 – Virtex-4 DPA implementation with ChipSync technology (per channel: IBUFDS_DIFF_OUT input buffer, 64-tap delay line in silicon, deserializer, and Bitslip in the Virtex-4 ISERDES; in the FPGA fabric: training controller and real-time window monitoring and adjustment controller)


by Chen Wei Tseng
Configuration Product PAE
Xilinx, [email protected]

Because FPGAs must be configured on each power-up, design security concerns grow as their popularity increases. Without proper protection, attackers could easily clone or reverse-engineer the bitstream during FPGA configuration.

All Xilinx® Virtex-4™ devices have an on-chip decryptor that can be enabled to make the configuration bitstream secure. Virtex-4 devices implement the Advanced Encryption Standard (AES) scheme for securing the bitstream.

Modern Security Design

Xilinx has replaced the Triple DES encryption scheme implemented in the Virtex-II™ architecture with AES. Although both encryption schemes provide a high level of security, AES offers both increased security and throughput over Triple DES by replacing three 56-bit keys with one 256-bit key and allowing configuration clocking frequencies as high as 100 MHz.

Lock Your Designs with the Virtex-4 Security Solution


Virtex-4 FPGAs provide an up-to-date AES encryption scheme to prevent IP or microchip design theft.


Let’s review some key benefits of the Xilinx Secure Chip solution.

1. AES is an official government standard, FIPS-197, supported by the National Institute of Standards and Technology and the U.S. Department of Commerce. The NSA has also certified AES’ ability to protect classified communication to the top secret level.

2. The AES key can only be programmed through the JTAG interface. This allows you to monitor any unwanted activities on the JTAG lines both externally and internally with the BSCAN_Virtex4 primitive.

3. A battery-backed volatile key provides the maximum protection against hostile hacking.

4. This low-cost solution includes widely available standard components such as a Rayovac™ lithium battery.

5. Encryption key storage (Figure 1) has a long life span (20+ years).

Advanced Encryption Standard (AES)

Although the Triple DES algorithm remains effective against attacks, AES is now replacing DES in many applications as the most secure encryption scheme. As specified by FIPS-197, AES is an NSA-approved cryptographic algorithm that can be used to protect electronic data.

AES employs a cipher block that eliminates symmetry in the behavior of the cipher to overcome shortcomings of the DES key. The non-linearity of the AES key expansion practically eliminates the possibility of equivalent keys.

Because of its key strength, AES is suited for applications such as banking, defense, government, and sophisticated technical applications such as ATM, HDTV, broadband ISDN, voice, and satellite.

Data Encryption Support

The Virtex-4 AES system comprises software-based bitstream encryption and on-chip AES (Rijndael) decryption with cipher block chaining (CBC) to decrypt the incoming bitstream. The AES key is stored in dedicated memory, powered by

configuration interface as SelectMAP to access configuration logic internally so that you can partially reconfigure the device for extra design security.

In addition to ICAP, Virtex-4 devices can monitor activities on the external JTAG pins with the internal BSCAN_Virtex4 primitive. The BSCAN_Virtex4 primitive mirrors the activity on the TDI pin and outputs several JTAG TAP controller states, such as Test-Logic-Reset or Update-DR. Tampering with the JTAG during a “side channel” attack can be detected. You can then take countermeasures such as cutting power to the FPGA – including VBATT – or erasing and writing a new encryption key by once again entering the key access mode.

Moreover, you can return any faulty part to Xilinx for returned material analysis without having to provide the encryption key.

Software Integration

Xilinx ISE™ version 7.1i will have full software support for encrypted bitstream and key creation. Generating an encrypted bitstream requires only two additional bitgen options. For example, “bitgen -g encrypt:yes -g key0:AA995566 top.ncd top.bit” will automatically create an encrypted bitstream (top.bit) and the encryption key file (top.nky) with the key “AA995566.” You must then load the top.nky file into the device through the JTAG interface before loading the encrypted bitstream.

either an auxiliary power supply (VCCAUX) or an externally connected battery.

To combat a brute-force software attack such as key search, Virtex-4 devices feature a 256-bit AES key system that enables 1.1 x 10^77 possible key combinations. To program the key, the device must enter “key-access mode” in the IEEE 1532 flow via JTAG. Once in this mode, the previous encryption key will be cleared to prevent readback of the key. (Further flow details are documented in the Virtex-4 1532 BSDL files.) If the encryption keys are compromised, you can update the design with new keys and new encrypted bitstreams.
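The size of the 256-bit key space is easy to verify. The Python sketch below computes it exactly and, as a rough illustration, estimates an average brute-force search time; the attacker speed of 10^12 guesses per second is an arbitrary assumption chosen for the example, not a figure from the article.

```python
def key_space(bits):
    """Number of distinct keys of the given length."""
    return 2 ** bits


def average_search_years(bits, guesses_per_second):
    """Average brute-force search time, assuming half the keys are tried."""
    seconds_per_year = 365 * 24 * 3600
    return key_space(bits) / 2 / guesses_per_second / seconds_per_year


if __name__ == "__main__":
    print(f"256-bit key space: {key_space(256):.3e}")
    # 10**12 guesses per second is an arbitrary, illustrative attacker speed.
    print(f"average search time: {average_search_years(256, 10**12):.1e} years")
```

Evaluating 2^256 gives about 1.158 x 10^77, in line with the figure quoted above.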

Virtex-4 FPGAs also embed the memory holding the key under layers of metal. Because the key is stored in volatile memory, disrupting the power supply for the key memory during hardware attacks will result in key loss.

You can always use a non-encrypted bitstream to configure the device regardless of the presence of the key. When loading a non-encrypted bitstream, you should take care when generating the bitstream: the proper security level should be set if you want readback of the non-encrypted bitstream. Reconfiguring the encrypted bitstream, however, requires you to toggle the PROG pin, cycle power, or issue one of two JTAG instructions: JPROG or JSTART.

Internally, you can use the internal configuration access port (ICAP) to reconfigure the device. ICAP provides the same


Figure 1 – Encrypted bitstream reference circuit for system-level applications (labels: configuration device holding the AES-encrypted bitstream; DIN and DOUT, with DOUT going to the next device’s DIN; 3.6V battery on VBATT, Tadiran TLH-4986, a >20-year solution at 50ºC internal system air temperature)


As for the GUI, Xilinx Project Navigator offers encryption options in the Generate Programming File command. You can set preferences for allowing readback, partial reconfiguration, and encryption.

iMPACT, the Xilinx programming tool, allows you to program just the key or the encrypted bitstream with the key. For independent programming applications, the detailed steps to download the encryption key are documented in the Virtex-4 IEEE 1532 BSDL files, which are installed in the Xilinx/Virtex4/data directory with the ISE installation, or downloadable from www.xilinx.com/support/sw_bsdl.htm.

Battery-Secured Systems

Designing secure systems incorporating batteries for volatile storage is a proven method in multiple markets that is recognized as the highest form of security and is required by the U.S. government for its secured modules (http://csrc.nist.gov/publications/fips/fips140-2/fips1402.pdf).

Several misconceptions exist related to battery use; some believe, for example, that batteries will require additional maintenance cycles. These fears are unfounded: maintenance and lifetime are of no concern for most applications, and the lifetime of the battery will usually far exceed the useful lifetime of the product.

All batteries “self discharge” when sitting idle, even with no load. Modern lithium batteries feature extremely low self-discharge rates. Rayovac lithium batteries self-discharge at a rate of less than 0.3% per year. Even at higher temperatures, the self-discharge rate deteriorates only slightly – in this example, let’s use a conservative 0.6%. The capacity of the BR1225 is 50 mAh.

Assume that the Virtex-4 IBATT current value is 50 nA. The VBATT signal is routed internally to the PCB to eliminate leakage currents. A self-discharge of 0.6% of 50 mAh per year is equivalent to a continuous current of about 34 nA.

34 nA + 50 nA = 84 nA

50 mAh / 0.000084 mA = 595,238 hours = ~68 years

Thus, a 20-year product life is easily achieved using a battery.
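The battery-life estimate can be reproduced with a small Python sketch. The capacity, self-discharge rate, and IBATT figures are the ones used in the text; the function name and the 365-day year are assumptions of the example, and the result differs slightly from the hand calculation because the self-discharge current is not rounded to 34 nA.

```python
HOURS_PER_YEAR = 24 * 365


def battery_life_years(capacity_mah, self_discharge_per_year, load_na):
    """Estimate battery life from capacity, self-discharge, and load current."""
    # Express self-discharge as an equivalent continuous current in nA.
    self_discharge_na = capacity_mah * self_discharge_per_year / HOURS_PER_YEAR * 1e6
    total_na = self_discharge_na + load_na
    hours = capacity_mah / (total_na * 1e-6)   # convert nA to mA
    return hours / HOURS_PER_YEAR


if __name__ == "__main__":
    # 50 mAh capacity, 0.6% self-discharge per year, 50 nA IBATT load
    print(f"{battery_life_years(50, 0.006, 50):.1f} years")
```

Changing the load or self-discharge arguments shows how much margin remains over a 20-year product life under less favorable assumptions.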

For more information about battery life expectancy calculations and design considerations, see Xilinx XAPP766, “Using High Security Features in Virtex-II Series FPGAs,” at http://www.xilinx.com/bvdocs/appnotes/xapp766.pdf.

Conclusion

Virtex-4 devices provide the most up-to-date security option for your designs. With the ease of an integrated software flow, minimal board space requirements, and maximum security through AES, the Virtex-4 Secure Chip AES security solution is ideal for keeping hackers from your designs.

For more information about the Advanced Encryption Standard, please visit:

• http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf

• http://csrc.nist.gov/encryption/aes/rijndael/

• http://csrc.nist.gov/encryption/aes/rijndael/Rijndael.pdf

• http://csrc.nist.gov/encryption/aes/

• http://csrc.nist.gov/encryption/aes/round2/r2report.pdf

• http://csrc.nist.gov/encryption/aes/round2/NSA-AESfinalreport.pdf


by Ralf Krueger
Sr. Staff Applications Engineer
Xilinx, [email protected]

Configuration memory in Xilinx® Virtex™ FPGAs is used primarily to implement user logic, connectivity, and I/Os, but it is also used for other purposes. For example, it specifies a variety of static conditions in two functional blocks: DCMs and RocketIO™ multi-gigabit transceivers (MGTs).

Sometimes an application requires a change in the conditions of the functional blocks while the blocks are operational. You can accomplish this through the global internal configuration access port (ICAP) or through partial dynamic reconfiguration, using JTAG or SelectMAP in the “persist” mode.

Since the late 1990s, all Virtex FPGAs have supported this powerful dynamic partial reconfiguration feature. However, partial dynamic reconfiguration required you to have a detailed knowledge of the configuration logic functions, the configuration registers, and the location of configuration bits.

Dynamic Reconfiguration of Functional Blocks


The Virtex-4 dynamic reconfiguration port offers an innovative way to reprogram functions in the FPGA.


DRP Functionality

The new Virtex-4™ dynamic reconfiguration port (DRP) is an integral part of the two functional blocks and greatly simplifies the dynamic reconfiguration process. These configuration ports exist in the DCMs and RocketIO MGTs.

In this article, we’ll describe the addressable, parallel write/read configuration memory implemented in each functional block that permits reconfiguration. This memory has the following attributes:

• It is directly accessible from the FPGA fabric. Configuration bits can be written to and/or read from, depending on their function.

• Each bit of memory is initialized with the value of the corresponding configuration memory bit in the bitstream. Memory bits can also be changed later using the ICAP.

• The output of each memory bit drives the functional block logic, so the content of this memory determines the configuration of the functional block.

The address space can include status (read-only) and function enables (write-only). Read-only and write-only operations can also share the same address space. Figure 1 shows how the bitstream configuration bits drive the logic in functional blocks and how the reconfiguration logic changes the flow to read or write the configuration bits.

Figure 1 also lists each signal on the FPGA fabric port. Individual functional blocks can implement all or only a subset of these signals. In general, it is a synchronous parallel-memory port, with separate read and write buses similar to the block RAM interface. Bus bits are numbered from least significant to most significant, starting at 0. All signals are active high.

Synchronous timing for the port is provided by the DCLK input, and all the other input signals are registered in the functional block on the rising edge of DCLK. Input (write) data to the functional blocks is presented simultaneously

software tools to show the additional DRP signals. For example, writing 04h to address 50h will change the M value to 5.

In the MGTs, the DRP allows advanced users to manipulate many attributes of the physical media attachment (PMA) and the physical coding sublayer (PCS). The new signals are part of the regular MGT primitive and can be operated by the application. The MGT implementation makes a large number of settings available for you to change dynamically. Various comma detect, channel bonding, and other attributes can be manipulated.

Conclusion

The Virtex-4 dynamic reconfiguration port provides an easy-to-use, block RAM-style interface that empowers you to modify the functionality of your application while the device is operating. This leads to flexible implementations and an application that can adapt to changing conditions – without having to reconfigure an FPGA with a different bitstream from scratch.

For more information, see the configuration guide, www.xilinx.com/bvdocs/userguides/ug071.pdf.

with the write address and DWE and DEN signals before the next positive edge of DCLK.

The port asserts DRDY for one clock cycle when it is ready to accept more data. The timing requirements relative to DCLK for all the other signals are the same. The output data is not registered in the functional blocks. Output (read) data is available some cycles after the cycle in which DEN and DADDR are asserted. The availability of output data is indicated by the assertion of DRDY.
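The handshake described above can be illustrated with a toy behavioral model in Python. This is a sketch under stated assumptions, not a description of the silicon: the class `DrpPort`, the helper `drp_access`, and the one-cycle default latency are hypothetical, while the 7-bit address and 16-bit data widths follow the signal names in Figure 1.

```python
class DrpPort:
    """Toy model of a DRP-style synchronous memory port."""

    def __init__(self, init_words=None, latency=1):
        self.mem = dict(init_words or {})  # address -> word, as set by the bitstream
        self.latency = latency             # assumed cycles from DEN to DRDY
        self._pending = None               # (cycles_left, data) of access in flight
        self.do = 0
        self.drdy = False

    def clock(self, den=False, dwe=False, daddr=0, di=0):
        """Advance one rising edge of DCLK and return (DO, DRDY)."""
        self.drdy = False                  # DRDY is a one-cycle pulse
        if self._pending is not None:
            cycles_left, data = self._pending
            if cycles_left <= 1:
                self.do, self.drdy, self._pending = data, True, None
            else:
                self._pending = (cycles_left - 1, data)
        elif den:
            addr = daddr & 0x7F            # DADDR[6:0]
            if dwe:
                self.mem[addr] = di & 0xFFFF   # DI[15:0]
            self._pending = (self.latency, self.mem.get(addr, 0))
        return self.do, self.drdy


def drp_access(port, addr, data=None, max_cycles=16):
    """Issue one read (data is None) or write, then wait for DRDY."""
    port.clock(den=True, dwe=data is not None, daddr=addr, di=data or 0)
    for _ in range(max_cycles):
        do, drdy = port.clock()
        if drdy:
            return do
    raise TimeoutError("DRDY was never asserted")


if __name__ == "__main__":
    port = DrpPort({0x50: 0x01})
    drp_access(port, 0x50, 0x04)           # write 04h to address 50h
    print(hex(drp_access(port, 0x50)))     # read the word back
```

The helper waits for DRDY before returning, which reflects the rule that a new access should start only once the port signals that it is ready.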

DRP Implementation in DCMs and MGTs

As mentioned earlier, the DRP is available in two major Virtex-4 functional blocks. Writing a specific value to a specific address will manipulate configuration bits and alter functions or attributes on the fly.

The user and configuration guides describe the address space (locations) and allowed values for each function.

In the DCM, the DRP allows you to make dynamic adjustments to the phase shift value of the digital phase shifter (DPS) and to change the multiply (M) and divide (D) values of the digital frequency synthesizer (DFS). A new primitive, DCM_ADV, has been added to the


Figure 1 – Configuration changes (FPGA fabric port signals on the standard reconfiguration port: DCLK, DEN, DWE, DADDR[6:0], DI[15:0], DO[15:0], DRDY; blocks shown: controller in the FPGA fabric, configuration logic, and the functional block (DCMs and MGTs) with reconfigurable bits, bits that are not reconfigurable, block status, and function enables)


by Hamish Fallside
Senior Manager, Systems Engineering, Advanced Product Division
Xilinx, [email protected]

Ethernet is the predominant wired connectivity standard. The range of standard products for Ethernet is large, and it just got bigger with the introduction of the Xilinx® Virtex-4™ FX device family. Combining embedded Ethernet connectivity with the unique flexibility of the Virtex-4 feature set, Xilinx has created a compelling single-chip platform for solutions not possible with existing off-the-shelf products.

The Virtex-4 FX device family contains paired embedded Ethernet media access controllers (MACs) that are independently configurable to meet all common Ethernet system connectivity needs. Each Virtex-4 FX device contains either two or four MACs, implemented using Xilinx IP immersion technology, as shown in Figure 1.

Using standard Xilinx design products, you now have the unprecedented capability to create a huge range of customized packet processing and network end-point products for 10/100/1000 Mbps Ethernet.

An external physical layer device (PHY) is required for the MAC to connect to a network. The Virtex-4 FX device directly supports all standard serial and parallel PHY interfaces for both copper and optical Ethernet connections. In addition, Virtex-4 embedded RocketIO multi-gigabit transceivers can be used to drive Ethernet directly across PCB traces, such as serial backplanes, for in-system connectivity. PHY connections can be routed to any user pin or RocketIO block in the device.

In this article, we'll review the feature set of the embedded Ethernet MAC blocks in Virtex-4 FX devices, and offer some pointers on how you can start using them right away with standard Xilinx design tools, LogiCORE™ IP, and development boards.

Feature Set

The Virtex-4 Ethernet MAC addresses all common configuration requirements for embedded Ethernet connectivity, and is fully compliant with the IEEE 802.3-2002

Designing with the Virtex-4 Embedded Tri-Mode Ethernet MAC


Integrate the versatile Virtex-4 10/100/1000 Ethernet MAC into your next programmable SoC design.


specification. It will allow you to build Ethernet systems that support VLAN, jumbo frames, and end-to-end flow control.

Built-in hardware address filtering reduces the burden on software of processing unneeded frames. You can independently configure each MAC for multiple rates and topologies:

• 10 Mbps or 100 Mbps full- and half-duplex

• 10/100 Mbps full- and half-duplex

• 1000 Mbps full-duplex

• 10/100/1000 Mbps full-duplex

When used in multi-rate modes, auto-negotiation support is provided.

Connecting the MAC to external PHY and optical modules is supported through the PHY interface to the FPGA fabric. This provides flexible use models for the MAC, allowing, for example, attachment to a shared processor bus or to custom packet processing hardware.

Controlling the MAC in your system is performed through the host interface, which provides flexible software access to the internal registers. Each MAC pair shares a common host interface, which can be

directly to a discrete external PHY, and is commonly used to connect to small form-factor pluggable (SFP) modules for both optical and copper connectivity:

• Serial GMII (SGMII) for 10/100/1000 Mbps

• 1000BaseX for 1000 Mbps

These interface options have 9-bit signaling that connects to the RocketIO. Embedded state machines in the MAC provide University of New Hampshire-certified operation for link initialization using these options.

An MII management (MIIM) interface is also included, which allows your software to access external PHY registers through this standard IEEE interface. The registers are accessed via the address map in the host interface.

Host Interface

For your software to control the MAC, a host interface provides access to the internal registers. A dedicated DCR bus connects the embedded PowerPC directly to the host interface, requiring no additional FPGA resources. Alternatively, the host interface can also be accessed directly from the fabric, providing a flexible solution for porting legacy driver software. Each pair of MACs shares a single host interface.

The registers accessed through the host interface are used by driver software to initialize and control the MAC during operation. All register values may be preset at power-on from the FPGA fabric. This allows the MAC to be used by applications that do not include a processor and software. The registers provide access to the following settings:

• Independent receiver settings for reset and enable, pause frame address, jumbo and VLAN frame enables, half/full duplex, and passing frame check sequence (FCS) to the client

• Independent transmitter settings for reset and enable, inter-frame gap (IFG) adjustment, jumbo and VLAN frame enables, half/full duplex, and FCS from client

directly accessed by the embedded PowerPC™ 405 device control register (DCR) bus, or from the FPGA fabric.

Let’s describe each of these interfaces in more detail.

PHY Interfaces

Your application will require connection to a particular medium – copper, fiber optics, or one of your own invention. The PHY interface provides many options to meet your requirements.

All common interfaces to external media are directly supported in the PHY interface. As the PHY interface is routed to the outside world through FPGA fabric, creating “bump-in-the-wire” solutions in FPGA fabric is straightforward.

PHY interfaces fall into two categories: one using SelectIO™ resources and another using RocketIO serial transceivers.

The first category is typically used to connect to a discrete external PHY:

• Media independent interface (MII) for 10/100 Mbps

• Gigabit MII (GMII), and reduced GMII (RGMII) for 10/100/1000 Mbps

The second category will also connect


Figure 1 – Embedded Virtex-4 Ethernet MAC Block, with interfaces to FPGA resources (labels: EMAC0 and EMAC1, each with a client interface, a PHY interface, a statistics block, and RX/TX stats MUX; a shared host interface reached through the generic host bus from the FPGA fabric or through the DCR bridge from the PowerPC 405 block DCR bus)


• Independent flow control enables for receiver and transmitter

• RGMII/SGMII status, and speed for fixed and negotiated settings

• Management interface enable and clock rate

• Receive-side address filter access – unicast and multicast address entries

The address filter provides a single unicast and as many as four multicast Ethernet addresses that are used to match against the destination address of incoming frames. You can set the filter to optionally discard incoming frames that do not match the stored addresses or to simply flag when a match occurs, allowing you to make routing decisions for received frames at hardware speed rather than in software.
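The filtering behavior can be modelled in a few lines of Python. This is a behavioral sketch only: the class name `AddressFilter`, its method names, and the colon-separated address strings are assumptions made for the example, and any special handling of broadcast frames in the hardware is not modelled.

```python
class AddressFilter:
    """Toy model of a receive-side destination address filter."""

    MAX_MULTICAST = 4  # "as many as four" multicast entries, per the text

    def __init__(self, unicast, multicast=(), discard_unmatched=True):
        if len(multicast) > self.MAX_MULTICAST:
            raise ValueError("at most four multicast addresses are supported")
        self.addresses = {unicast.lower(), *(m.lower() for m in multicast)}
        self.discard_unmatched = discard_unmatched

    def check(self, frame):
        """Return (accepted, matched) for a raw Ethernet frame given as bytes."""
        # The first six bytes of an Ethernet frame hold the destination address.
        destination = frame[:6].hex(":")
        matched = destination in self.addresses
        accepted = matched or not self.discard_unmatched
        return accepted, matched


if __name__ == "__main__":
    flt = AddressFilter("00:0a:35:00:00:01", ["01:00:5e:00:00:01"])
    frame = bytes.fromhex("000a35000001") + bytes(8)
    print(flt.check(frame))
```

Setting `discard_unmatched=False` corresponds to the flag-only mode, in which every frame is passed on and the match result is left for downstream logic to act on.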

Client Interface

Ethernet frames are passed between the MAC and your design across the client interface, which is divided into receive and transmit sides.

Receiver Side Client Interface

On the receive interface, frame errors and unmatched frames are signaled to the user logic. When flow control is enabled, any valid pause frames received will be flagged as invalid.

Transmitter Side Client Interface

The transmit interface will indicate collisions on half-duplex connections, and will corrupt a truncated frame in the case of FIFO starvation in the middle of a frame. When flow control is enabled, the transmitter interface will automatically assert back pressure on the client when a pause request frame is received from the remote host.

Flow Control and Statistics Vectors

A separate flow control interface allows the client to make pause requests to the far end, with the pause interval set for each individual request. Separate interfaces provide statistics vectors for the receiver and transmitter portions of the MAC. The IEEE-defined statistics are updated on a per-frame basis, and can be accumulated using circuitry in the FPGA fabric.
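To give a feel for the per-request pause interval, here is a short arithmetic sketch. It relies on the IEEE 802.3 definition of the pause time as a 16-bit count of quanta, each 512 bit times long; the function names are illustrative and do not correspond to any MAC port or register.

```python
import math

PAUSE_QUANTUM_BITS = 512   # IEEE 802.3: one pause quantum = 512 bit times
MAX_PAUSE_QUANTA = 0xFFFF  # the pause time is a 16-bit field


def pause_duration_s(quanta: int, link_bps: float) -> float:
    """Time the far end is asked to stop sending, in seconds."""
    if not 0 <= quanta <= MAX_PAUSE_QUANTA:
        raise ValueError("pause time must fit in 16 bits")
    return quanta * PAUSE_QUANTUM_BITS / link_bps


def quanta_for_duration(seconds: float, link_bps: float) -> int:
    """Smallest quanta count covering `seconds`, clamped to the field size."""
    quanta = math.ceil(seconds * link_bps / PAUSE_QUANTUM_BITS)
    return min(quanta, MAX_PAUSE_QUANTA)


# Longest single pause request on a 1 Gbps link, in milliseconds.
print(pause_duration_s(MAX_PAUSE_QUANTA, 1e9) * 1e3)
```

At 1 Gbps one quantum lasts 512 ns, so a single request can cover at most about 33.5 ms; longer back-off periods need repeated requests.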

Over-Speed Operation

This feature allows you to clock the MAC at higher rates than allowed by the standard. The double-width interface on the client side means that your design can process frames at the same system frequency as normal operation, but at twice the data width, providing up to 2 Gbps in each direction.
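The 2 Gbps figure follows from simple width-times-clock arithmetic. The sketch below assumes the usual gigabit client arrangement of an 8-bit data path at 125 MHz for normal operation; those two numbers are an assumption for illustration and are not stated in this article.

```python
def client_throughput_bps(data_width_bits: int, clock_hz: int) -> int:
    """Raw client-side throughput, assuming one data word per clock cycle."""
    return data_width_bits * clock_hz


CLIENT_CLOCK_HZ = 125_000_000  # assumed client clock for gigabit operation

normal = client_throughput_bps(8, CLIENT_CLOCK_HZ)      # single-width path
overspeed = client_throughput_bps(16, CLIENT_CLOCK_HZ)  # double-width path
print(normal, overspeed)
```

Doubling the width rather than the clock is what lets the fabric-side logic keep its existing timing budget.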

Virtex-4 Ethernet MAC Use Models

The features described previously give the Virtex-4 Ethernet MAC multiple use models. Some examples are given here, but this should not be considered a complete list.

• Attach the MAC to a CoreConnect PLB or OPB peripheral interface in FPGA fabric to serve embedded PowerPC or MicroBlaze™ processors, as in Figure 2.

• Create a custom interface to packet processing hardware implemented in FPGA fabric, such as protocol offload, DMA engines, embedded FIFO, and embedded block RAM. Figure 3 shows an example scheme for a Transmission Control Protocol (TCP) offload engine (TOE), and/or Remote Direct Memory Access (RDMA), as covered by the iWARP protocols from the RDMA Consortium.

• Directly connect multiple MAC blocks to Virtex-4 embedded FIFO and external QDR and DDR memory for classification, policing, and switching applications (see Figure 4).

• Provide independent packet monitoring and statistics collection, using custom hardware in FPGA fabric that connects directly to the statistics interface of the MAC blocks.

Any of these use models may be connected to an external PHY in multiple system topologies:

• Optical gigabit Ethernet connectivity – connect directly to external optical modules through the Virtex-4 RocketIO transceiver for 1000BaseX operation (Figure 4)

• 10/100 Ethernet connected to an external copper PHY through an RMII interface implemented between the MII PHY interface and SelectIO pins

• 10/100/1000 tri-mode Ethernet to an external PHY or SFP module through an SGMII connection to a RocketIO transceiver, utilizing a RocketIO block


Figure 2 – Embedded MAC connected to embedded PowerPC as a PLB peripheral, with the addition of Xilinx CoreConnect LogiCORE IP


Tools, IP, and Development Boards

Xilinx provides support for the MAC with tools, LogiCORE IP, reference designs, and Virtex-4 development boards.

Virtex-4 Embedded EMAC Wrappers

Using the Xilinx CORE Generator™ tool, you can automatically generate HDL wrappers for the MAC instantiations in your design and completely configure the MAC through the GUI. A low-level software driver for the embedded PowerPC to access the MAC across the dedicated DCR interface will also be automatically generated.

Embedded Development Kit (EDK)

The EDK tool enables you to build a complete processor subsystem around the MAC. The tool includes standard Xilinx LogiCORE IP to connect the MAC as a CoreConnect peripheral, and will automatically generate a software driver.

Xilinx Ethernet LogiCORE IP and Reference Designs

Much of the legacy Virtex-II Pro™ Ethernet collateral will be reusable with the Virtex-4 MAC.

Reference designs are available that demonstrate useful techniques for optimizing your Ethernet system designs. The LocalLink LLTEMAC checksum offload peripheral, available with the Gigabit System Reference Design (XAPP536), demonstrates how to accelerate the TCP performance of your network endpoint.

Development Boards

Xilinx provides a family of development boards for immediate prototyping of your system design. These include:

• The ML403, a low-cost development platform featuring the Virtex-4 FX12 device, which includes a tri-speed Ethernet PHY for copper Ethernet connectivity

• The ML405 development board, which provides a superset of the ML403, with additional serial connectivity options enabled by the Virtex-4 FX20 RocketIO transceivers

All Xilinx and partner-developed boards are available from the “Xilinx on Board” section of the Xilinx website.

Conclusion

The embedded tri-mode Ethernet MAC in Virtex-4 FX devices provides unparalleled flexibility for today’s Ethernet systems designers, spanning:

• Hub, switch, and router system topologies

• Tightly coupled network processing functionality utilizing embedded processors and custom logic

• Embedded processing shared bus subsystems

• Direct low-latency connectivity to packet storage

• Cost-effective interoperability with future, current, and legacy physical layer standards

In short, the Virtex-4 FX family enablesyou to customize your solution for the Ethernet topology and feature set thatyour application requires. To find outmore, please follow the Virtex-4 links on the Xilinx website, www.xilinx.com/virtex4/.


Figure 3 – Packet-processing end-point in Virtex-4 FPGAs using embedded MAC with additional logic for checksum offload, TCP segmentation offload (TSO), network address translation, and other standard or custom applications

Figure 4 – Multiple Gigabit Ethernet MAC in a switch/router configuration; Virtex-4 embedded FIFO blocks provide intermediate packet storage in the fabric.


by Darren Zacher
Technical Marketing Engineer
Mentor Graphics Corporation, Design Creation and Synthesis
[email protected]

Customers in today’s demanding communications and consumer applications need to attain unprecedented levels of capacity and performance while reducing power consumption and overall cost. With the introduction of high-end devices into the marketplace, more of these applications are being addressed by FPGA solutions.

As professional programmable logic designers, you are always searching for better ways to create value and differentiate your products. To do so effectively, you need to adopt comprehensive, high-productivity design flows instead of point tools to crack new design challenges and take advantage of the benefits of the latest programmable silicon platforms.

Multiple Platforms, Unprecedented Opportunity

With the release of Xilinx® Virtex-4™ devices, you can enjoy twice the density, twice the performance, and half the power consumption of previous Xilinx FPGA families. If you seek sheer DSP performance, you might prefer Virtex-4 SX FPGAs, which offer 256 GigaMAC/s performance for 18-bit operations. The LX family of FPGAs offers higher performance logic; with FX devices, you can explore embedded processing and high-speed serial connectivity applications. These three platforms, comprising a complete selection of 17 devices, collectively offer a compelling alternative to ASICs and ASSPs.

To fully exploit this immense potential, design teams must consider moving away from serial, iterative, point-tool approaches that involve designing or re-designing from scratch. To manage non-recurring engineering time and costs and create efficient, reliable flows, you must clearly identify which of the various “building blocks” you need to focus on when using a platform approach to successfully implement a high-end design.

Typical building blocks may include:

• Intellectual property such as internal company, Xilinx, or third-party IP

• Lower-level blocks used in the context of a bottom-up design flow

• Algorithms via C or C++ or MATLAB™

• RTL blocks

• Embedded processors

• I/O interfaces

By using a comprehensive, methodical design flow, you can effectively optimize these blocks in a multimillion-gate device.

As high-end FPGAs approach ASIC-level performance, designers are adapting many advanced ASIC techniques for FPGA design. The complex FPGA design flow shares some commonality with ASIC design; for instance, RTL simulation remains basically unchanged. But certain subtle differences exist under the hood, and many steps are fundamentally different. The pre-built nature of FPGAs implies a “use or lose” approach to features or capabilities, so you must match functional requirements with the device architecture. Thus, common steps such as synthesis or place and route all differ subtly in the FPGA domain.

You can use C++ synthesis techniques borrowed from ASIC flows to target FPGAs. C++ specifications are much less tied to any specific hardware than the corresponding RTL code.

Another technique, physical synthesis, illustrates the subtleties involved when the same general approach is used for both ASICs and FPGAs. Physical synthesis requires a detailed understanding of the FPGA’s hardware structure. At the very least, physical synthesis tools must be more specifically targeted to FPGA architectures.

Emerging Design Methodologies Elicit the Power of Virtex-4 FPGAs

Adopt a broader design flow methodology instead of the traditional point-tool approach.


A typical high-end FPGA design flow should encompass such tasks as:

• Early design rule checking

• Higher level design abstraction

• Functional and system-level simulation and verification

• Advanced physical synthesis techniques

Let’s describe each of these in more detail.

Integrated Approach to Design Creation

In terms of design entry, the need to create faster, larger, and more complex designs packed into the latest FPGA devices within the shortest possible time presents significant challenges. The high availability of configurable logic in platform FPGAs that include hard ASIC macros – such as embedded processor blocks and complex I/O standards – has truly enabled programmable SoC, where a serialized design approach would not work. Only a system-level RTL design concept, used in parallel with multiple aspects of managing and optimizing the high-level design creation process, will ensure success.

Large design projects mandate the collaboration of several engineers or engineering teams, often belonging to separate companies and typically distributed in different geographic locations worldwide. This team-based approach raises the importance of a consistent design coding style for teams to share code effectively.

Teams invariably comprise experienced project leaders and designers alongside less experienced junior engineers working on the various building blocks of a design. The resulting skill diversity makes the need for consistency critical. It is imperative that companies carefully scrutinize the planning and creation process to identify poor design styles, incorrect design rules, and syntax/semantic errors at the earliest possible stage before even attempting to tie the building blocks together or simulate/synthesize the design.
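As a rough picture of what early, rule-based checking means in practice, the toy checker below applies two made-up coding-style rules to HDL source text before anything is simulated or synthesized. It is a Python illustration only and is not how any commercial linting product works; the rule set, the function names, and the sample source are all hypothetical.

```python
import re

# Matches simple Verilog net/register declarations such as "wire [7:0] name".
DECL = re.compile(r"^\s*(?:wire|reg)\s+(?:\[[^\]]*\]\s*)?(\w+)")


def check_no_tabs(line):
    """Hypothetical group rule: indent with spaces only."""
    return "tab character found" if "\t" in line else None


def check_lowercase_names(line):
    """Hypothetical corporate rule: declared signal names are lower case."""
    m = DECL.match(line)
    if m and m.group(1) != m.group(1).lower():
        return f"signal '{m.group(1)}' is not lower case"
    return None


RULES = [check_no_tabs, check_lowercase_names]


def lint(source: str):
    """Return (line_number, message) for every rule violation."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for rule in RULES:
            message = rule(line)
            if message:
                findings.append((number, message))
    return findings


sample = "wire [7:0] data_in;\nreg  RxValid;\n\twire ok;"
for finding in lint(sample):
    print(finding)
```

Keeping each rule as a separate function mirrors the idea of layering vendor, corporate, and group policies: a team can add or drop rules without touching the checker itself.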

In bigger designs, it is not unusual for multidisciplinary design teams to focus on and optimize only a portion of the device. As the system is defined in RTL by combining both vendor and internal IP (and for those applications utilizing DSP functionality, RTL generated algorithmically), you will need an integrated system design approach to help synchronize the development of each specific part of a large, high-capacity FPGA.

From the configuration of the embedded processor to logic development and high-speed I/O assignment, the ideal synchronization of these teams and processes is required to deliver an optimized field-programmable SoC. The merging and management of these multiple disciplines to generate the system-level RTL and associated design files is a huge task best handled by a comprehensive and flexible environment.

To reduce development cost and time to market, 80-90% of projects may now include both rework of an existing design and reuse of previously designed components or IP, whether internal or purchased. Because this trend is expected to increase, you need to ensure that your components/subsystems are designed to be reusable and conform to established design reuse rules.

Through cooperative efforts in the design community and internal corporate standardization, the industry has developed a number of reuse methodology guidelines that can be checked using automated tools. Tools such as Mentor Graphics® HDL Designer Series™ (HDS) can help design teams successfully integrate both hard and soft IP (such as PowerPC™ and MicroBlaze™ processors).

Larger designs at higher speeds have prolonged traditional simulation cycles. Similarly, synthesis can become a protracted, iterative process in order to achieve desired performance goals. You need to maximize the productivity of potentially long EDA tool runs by ensuring that as many code errors as possible are found and fixed before the start of simulation and synthesis (Figure 1).

Equally important are integrated connections to advanced tools such as DesignAnalyst™ and Precision® Synthesis from Mentor Graphics to guard against errors and reduce iterations, as well as integration with any third-party EDA tools through a flexible integration mechanism. Through static design checking or “linting” products, you can perform many different forms of checking during the design creation process.

Interactive HDL visualization and creation tools provide automatic documentation features and reporting as well as intelligent debug and analysis to effectively manage FPGA designs. Moreover, tight bi-directional communications with PCB tools from within the design creation process shorten design cycles by integrating and synchronizing HDL design with PCB design, eliminating time-consuming manual steps.

Figure 1 – When used in tandem for concurrent design entry and checking, interactive HDL visualization and creation tools can increase design quality, reduce iterations, shorten simulation and synthesis cycles, and improve testability and reuse in high-end FPGAs. (© Mentor Graphics)

Higher Abstraction Levels Speed Hardware Design

For the first time, professional design engineers are literally struggling to keep pace with Moore’s Law, which makes it difficult to fully utilize the capacity of 90 nm ASICs or efficiently target the complex structures found in domain-specific FPGAs. Algorithmic C synthesis (Figure 2) promises to raise the abstraction of hardware design by providing a new, more abstract entry point, benefiting both ASIC and FPGA hardware designers. But to understand the need for higher abstraction languages, you must first analyze the problems with existing RTL methodologies.

The design complexity of new DSP applications has outpaced traditional RTL capabilities. To create hardware implementations for blocks of computationally intensive algorithms using RTL, design teams must iterate through several steps, including micro-architecture definition, handwritten RTL, and area/speed optimization through RTL synthesis. This manual process is slow and error-prone. In the final result, both the micro-architecture and technology characteristics become hard-coded into the RTL description. This hard coding renders the whole notion of RTL reuse or retargeting impractical in real applications.

An optimized C-to-RTL synthesis flow not only promotes a higher level of abstraction, it also gives the design team the flexibility to transition from one implementation technology to another. You can tune the hardware for high-performance parallel implementations or smaller, more serial implementations.
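The parallel-versus-serial trade-off can be illustrated with a FIR filter, a typical DSP kernel. The sketch below is written in Python rather than C++ purely for brevity, and its cycle-count formula is a deliberately simplified assumption (one multiply per multiplier per clock, no pipelining effects); it does not describe how any particular synthesis tool schedules a design.

```python
import math


def fir(samples, coeffs):
    """Untimed reference algorithm: direct-form FIR, one output per input."""
    taps = [0] * len(coeffs)
    out = []
    for x in samples:
        taps = [x] + taps[:-1]  # shift the delay line
        out.append(sum(c * t for c, t in zip(coeffs, taps)))
    return out


def cycles_per_output(num_taps: int, num_multipliers: int) -> int:
    """Simplified schedule: each multiplier handles one tap per clock."""
    return math.ceil(num_taps / num_multipliers)


# The same algorithm, three candidate micro-architectures for 16 taps.
for multipliers in (16, 4, 1):
    print(multipliers, "multipliers ->", cycles_per_output(16, multipliers),
          "cycles per output")
```

The algorithm description never changes; only the resource constraint does. That separation is what hand-written RTL loses once the micro-architecture is coded in.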

Using this approach to describe functional intent (offered in the Mentor Graphics Catapult™ C Synthesis tool), you can move up to a far more productive abstraction level for designing hardware. As hardware designers, you can reduce implementation efforts by as much as 20X while creating a more repeatable and reliable design flow.

The ability to select fundamentally superior micro-architectural alternatives allows you to create designs of better quality than traditional RTL methods. Finally, this approach closes the conceptual gap between algorithm designers modeling in C/C++ and hardware designers working at the RTL abstraction level.

Simulation and Verification Challenges

Using standard RTL verification methods in high-capacity FPGAs quickly diminishes the benefits of faster hardware creation. The current execution speeds of software validation platforms and RTL verification environments are insufficient to quickly test design functionality. Design verification takes significantly longer than design development because of the limited speed of RTL simulators and the time needed to manually create an RTL test bench.

Additionally, C/C++ simulation (although upwards of 10,000X faster than RTL) may be inadequate to validate the original algorithm given the data-intensive nature of DSP designs. These challenges are in fact opportunities for both algorithm development and system validation through the use of accelerated simulation.

High-level design verification flows are now turning to address rapid algorithm validation and verification, using hardware acceleration by leveraging the benefits of a SystemC verification environment. These flows begin with the algorithm designer validating designs in C++ and end with the hardware designer verifying the algorithm in RTL.

This method of using high-level C/C++ synthesis in combination with a SystemC verification environment provides an automated path from algorithm development to synthesized RTL running in an FPGA prototyping environment. Executing the algorithm directly in hardware gives algorithm designers the ability to validate algorithms and hardware designers the ability to validate the entire system at or near real-time speeds.

The use of SystemC as a verification environment permits both algorithm and hardware designers to use the same test bench and test vectors, eliminating the need for manual test bench creation. The combined approach of hardware acceleration of C/C++ algorithms in a SystemC verification environment provides a push-button solution for accelerated algorithm development and system validation.
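The idea of one test bench and one set of vectors serving both designers can be shown in miniature. The sketch below uses Python as a stand-in for a SystemC environment: a floating-point reference plays the algorithm designer's model and a fixed-point version plays the hardware designer's bit-accurate model. The gain value, word length, vectors, and all function names are hypothetical.

```python
def reference_model(x: float, gain: float = 0.75) -> float:
    """Algorithm designer's view: ideal arithmetic."""
    return gain * x


def fixed_point_model(x: float, gain: float = 0.75, frac_bits: int = 8) -> float:
    """Hardware designer's view: integer multiply with truncation."""
    scale = 1 << frac_bits
    g = round(gain * scale)
    xq = round(x * scale)
    return ((g * xq) >> frac_bits) / scale


def run_test_bench(dut, vectors, tolerance):
    """Drive any model with the shared vectors; return the mismatches."""
    failures = []
    for x in vectors:
        got = dut(x)
        if abs(got - reference_model(x)) > tolerance:
            failures.append((x, got))
    return failures


VECTORS = [0.0, 0.5, 0.1, -0.3, 1.0]  # shared stimulus, illustrative values

print(run_test_bench(fixed_point_model, VECTORS, tolerance=2 / 256))
```

Because `run_test_bench` accepts any callable, the same stimulus and pass criteria can be pointed at a refined model later without rewriting the bench.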

Balancing the Cost/Timing Closure Equation

An essential step in realizing a high-capacity FPGA design is to optimize that design for both timing and cost. Timing closure challenges are well known. Using stand-alone logic synthesis with place and route can be non-deterministic by nature, especially for large devices.

Designers tend to write and rewrite RTL code and constraints to try to coax the place and route tool to do their bidding. Once you go down this path, you then must iterate through place and route – the most time-consuming step in FPGA design – before gaining any visibility as to whether your changes were a step in the right direction or if they only served to further exacerbate the problem.

Similar to optimization for timing, the process of achieving true “cost closure” involves a reduction in area to reduce FPGA part cost, or a reduction in the total cost of the design by increasing levels of abstraction and design reuse. The irony is that once you attain a successful implementation, any change – no matter how small – in the design or architecture threatens to obsolete that success. This unpredictability negates the reduced cost and time-to-market benefits of using programmable logic in the first place.

Figure 2 – An optimized C-to-RTL synthesis flow promotes a higher level of design abstraction and gives you the flexibility to easily transition from one implementation technology to another. (© Mentor Graphics)

Increasing die sizes place additional burdens on the extant methodologies. A large die poses a significant challenge in obtaining repeatable, high-quality placements out of current placement algorithms. The larger die size is now widening the distribution curve of net delays grouped by fanout, the basis behind industry-accepted wire delay models.

This widened distribution has a degrading effect on the accuracy of fanout-based wire delay models. In larger devices, interconnect delay dominates performance for FPGA platforms. Because fanout-based delay estimates in FPGAs struggle to model even a simplified version of physical reality today, you can see why optimization decisions based on a wire-load estimate are often ineffective. Worse, physical proximity cannot always relate directly to delay, so traditional floorplanning falls painfully short. Advanced physical synthesis techniques can solve these issues in several ways.
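A small numerical sketch shows why a wider delay distribution hurts a fanout-based estimate. The model here is the simplest possible one, assumed for illustration: every net of a given fanout is assigned the mean delay observed for that fanout. The delay values are invented to make the point and are not measurements from any device.

```python
from statistics import mean, pstdev


def fanout_model(nets):
    """nets: iterable of (fanout, delay_ps). Return {fanout: (mean, spread)}."""
    groups = {}
    for fanout, delay in nets:
        groups.setdefault(fanout, []).append(delay)
    return {f: (mean(d), pstdev(d)) for f, d in groups.items()}


def worst_estimate_error(nets):
    """Largest gap between a net's real delay and its fanout-based estimate."""
    model = fanout_model(nets)
    return max(abs(delay - model[fanout][0]) for fanout, delay in nets)


# Made-up delays (picoseconds) for fanout-4 nets on a small and a large die.
small_die = [(4, 900), (4, 1000), (4, 1100)]
large_die = [(4, 400), (4, 1000), (4, 1600)]

for name, nets in (("small", small_die), ("large", large_die)):
    print(name, fanout_model(nets)[4], worst_estimate_error(nets))
```

Both data sets produce the same estimate for a fanout-4 net, yet the worst-case error on the wider distribution is several times larger, which is the accuracy loss the text describes.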

First, to improve accuracy and reduce design iterations, you must consider real interconnect delay and physical effects up front (Figure 3); combining logic and physical synthesis is critical for the design of larger, high-performance FPGAs. Some physical synthesis alternatives available today are based solely on technology borrowed from the ASIC implementation space.

In reality, forcing an ASIC methodology – and mentality – on the FPGA world cannot work. Such approaches essentially try to outsmart the vendor placement and may show promise in certain situations, but most cannot match the performance of a tool that leverages the FPGA vendor’s post-layout information to provide accurate physically aware synthesis.

Second, FPGA-oriented physical synthesis solutions need to take into account successful implementation experience that you have previously developed. For instance, when you complete a modular design and have optimized performance for a portion of it using physical synthesis, a good tool must ensure that you can take full advantage of these optimizations and reuse them on subsequent designs.

Physical synthesis in FPGAs is growing beyond the ASIC model to be a valuable part of cost minimization and component reuse strategies. When investing in a synthesis tool with a highly deterministic process for improved results, look for technologies and algorithms that not only optimize designs for cost and timing, but also enable you to translate your professional experience and previous design implementations at the physical level into faster time to market in subsequent designs.

Any tool used in professional FPGA design (including the Precision Synthesis tool from Mentor Graphics) should consider FPGA vendor placement results as soon as possible, and only then begin to manipulate the design using physical synthesis – integrated with logic synthesis in a unified data model – to converge on timing at a lower cost.

From Point Tools to ESL Design Flows

Every designer stands poised to benefit from the new standard set by Virtex-4 high-performance FPGAs. The next-generation challenge faced by mainstream FPGA EDA tool vendors is to leverage point-tool expertise and thus meld apparently contradictory trends – higher levels of abstraction on the one hand and greater dependence on specific physical characteristics on the other – into a coherent design methodology and highly productive flow.

In keeping with these advances, EDA tool companies will continue to extend and improve their comprehensive, integrated design flows spanning all levels of abstraction. Mentor Graphics continues to be a technology leader in this space. Designers must take advantage of EDA tools that now address both physical and electronic system-level (ESL) challenges of high-end FPGAs, and thus realize the unprecedented potential of these devices as ASIC replacements in new SoC designs.

To access the latest product news, application notes, and case studies, evaluate new design flows, or schedule a product demonstration, visit www.mentor.com/fpga/.


Figure 3 – High-end FPGA synthesis tools should ideally consider FPGA vendor placement results up-front, and only then begin to manipulate the design using physical synthesis – integrated with logic synthesis in a unified data model – to converge on timing.


by Andy Norton
President
Comm Logic Design,
[email protected]

The availability of embedded processor subsystems in FPGAs opens the door to a myriad of applications, including embedded network processors, flexible sandbox prototyping, control plane and data path subsystems, and exception handling processors. Today’s FPGAs integrate existing IP cores, interfaces, custom processing engines, and now embedded processor subsystems. You can easily instantiate these subsystems into a top-level HDL design just as you would integrate off-the-shelf IP.

Xilinx® Virtex-4™ FX FPGAs integrate a higher performance IBM™ PowerPC™ core with the new Auxiliary Processor Unit interface. The direct connection to the FPGA fabric facilitates advanced coprocessor designs.

You can use Xilinx Platform Studio/EDK software to design embedded processor subsystems in FPGAs with embedded PowerPC hard processor cores or with Xilinx MicroBlaze™ soft processor cores. Although off-the-shelf peripheral cores and MicroBlaze soft cores are synthesized using XST during EDK platform generation, the overall FPGA project and custom peripheral cores are synthesized with Synplicity® Synplify Pro® 8.0, leveraging new features and superior quality of results.

EDK Subsystem Project Flow

All projects begin by defining an overall FPGA directory structure. The embedded subsystem should reside in its own sub-directory. For example:

fpga_project
    /doc              spec and documentation
    /src              RTL source code files
    /constraints      .ucf, .sdc files
    /sim              simulation files
    /syn              synthesis project files
    /pnr              place and route files
    /ppc_subsystem    embedded processor subsystem
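If you set up many projects, the skeleton above is easy to script. The following is a minimal sketch using Python's standard `pathlib`; the function name `create_project_tree` and the `SUBDIRS` table are hypothetical helpers and are not part of EDK or Synplify.

```python
from pathlib import Path

# Sub-directory names and purposes, taken from the listing above.
SUBDIRS = {
    "doc": "spec and documentation",
    "src": "RTL source code files",
    "constraints": ".ucf, .sdc files",
    "sim": "simulation files",
    "syn": "synthesis project files",
    "pnr": "place and route files",
    "ppc_subsystem": "embedded processor subsystem",
}


def create_project_tree(root):
    """Create the project skeleton under `root`; return the directory names."""
    root = Path(root)
    for name in SUBDIRS:
        # exist_ok keeps the call safe to repeat on an existing project.
        (root / name).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling `create_project_tree("fpga_project")` leaves any existing contents untouched, so it can be rerun after adding a sub-directory to the table.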

Creating a new EDK project in /ppc_subsystem results in a system.xmp project file. Next, EDK Project Options must indicate that it is a subsystem by setting:

1. Design Hierarchy to SubModule, specifying the top-instance name of the embedded subsystem (ppc_subsystem). The indicated top-instance name will be used when instantiating the subsystem in the overall top-level design.

2. Synthesis Tool to None. This indicates that no synthesis tool is used to synthesize the overall design within EDK (the instantiated subsystem will be included later in the Synplify Pro project), although EDK will have used XST (and possibly Synplify Pro) in the platform creation of the subsystem and its peripherals.

3. Implementation Tool Flow to ISE™.

Although Synplify Pro supports mixed languages, you can select Verilog™ or VHDL for EDK output files in Project Options/HDL and Simulation.

Platform Generation

You can create the embedded processor subsystem by using either the Base System Builder wizard, the GUI selection of peripheral cores, or direct text editing of the microprocessor hardware specification (MHS) file.

Once the MHS file has been constructed, Generate Netlist invokes Platform Generation. PlatGen constructs the netlist, builds and interconnects indicated peripherals, runs DRC checking for errors and warnings, and generates output files.

Integrating EDK-Created Embedded Processor Subsystems

Use Synplify Pro as the primary synthesis tool for complex designs containing embedded processors.


The EDK Platform generated directories and files include:

ppc_subsystem                  top-level instance of the subsystem
    /hdl
        system_stub.[vhd|v]    HDL subsystem with Xilinx I/O primitives inserted
        system.[vhd|v]         HDL subsystem without Xilinx I/O primitives
        wrappers.[vhd|v]       implementation netlist peripheral files with instantiated wrappers
    /implementation
        system_stub.bmm        BMM file with top-level subsystem instance in path
        system.bmm             BMM file without the top-level subsystem instance in path
        peripherals.ngc files  XST-generated peripheral files

PlatGen will generate two top-level files in /hdl: system_stub.v and system.v. System_stub.v instantiates system.v and adds I/O insertion as Xilinx primitives for all top-level ports. With the processor as a subsystem, system_stub.v is not used because there are other cores, subsystems, and logic in the design. For example, clock signals could be generated by top-level instantiated DCMs and subsystem signals could go to other modules at the same level of hierarchy instead of off-chip.

Also, using Synplify Pro, the I/O insertion is automatic; you don’t need to explicitly instantiate BUFG, IBUF, or OBUF primitives for most I/O standards.

Choosing to instantiate system_stub.v as our subsystem would then require editing, removing, or modifying the I/O insertion for the ports not directly connected to an external pin. Once modified, rerunning PlatGen would overwrite this file once again. Another choice might be to rename system_stub.v after editing the file; the downside to this approach is that port/subsystem modifications would require you to recreate the modified/edited file.

A better approach is to instantiate system.v directly in the top-level HDL. Synplify will take care of the necessary I/O insertion where required; for I/O standards requiring I/O primitive instantiation (for example, LVDS), this should be done directly in the top-level HDL file. System.v is always correct as generated by EDK PlatGen and never needs to be modified. The one additional step required is at the top level, in the case of tri-state signals.

For example, you can define the project top-level ports as:

    module fpga_top (
        inout [31:0] ddr_dq,
        ..
    );

PlatGen will generate system.v (in /hdl), bringing out the tri-state signals as shown in the instantiated ppc_subsystem:

    system ppc_subsystem (
        ..
        .ddr_dq_I ( ddr_dq ),
        .ddr_dq_O ( ddr_dq_o ),
        .ddr_dq_T ( ddr_dq_t ),
        ..
    );

The EDK-generated system_stub.v – the file we don’t want to use – added the IOBUF insertion, as shown here for each bus signal:

    IOBUF iobuf_28 (
        .I  ( ddr_dq_O[0] ),
        .IO ( ddr_dq[0] ),
        .O  ( ddr_dq_I[0] ),
        .T  ( ddr_dq_T[0] )
    );

Because we want to be able to instantiate system.v directly into our top level, we must also add the required HDL to control the bidirectional signals:

    genvar i;
    generate
        for (i = 0; i <= 31; i = i + 1)
        begin: ddrtri
            // Release the pin (high impedance) when the tri-state enable is set
            assign ddr_dq[i] = ddr_dq_t[i] ? 1'bZ : ddr_dq_o[i];
        end
    endgenerate

Now EDK-generated subsystem Verilog files do not need to be modified – only instantiated. Bi-directional signals are handled correctly and I/O insertion is either handled automatically by Synplify or explicitly instantiated as Xilinx primitives when required.

Memory Generation

PlatGen will also generate the required memory initialization files for the specified block RAMs coupled with DSOCM, ISOCM (PowerPC only), LMB (MicroBlaze soft processor core only), OPB, and PLB block RAM controllers.

PlatGen will produce two BMM (block RAM memory map) files in the /implementation directory: system.bmm and system_stub.bmm. A BMM file will be used in the ISE flow to indicate the logical data space used by the embedded subsystem and organization of the block RAM memory. In the case of our subsystem, system_stub.bmm would be used, as it contains the complete hierarchical path (because we specified the top-level instance of our subsystem in the project options).

During the ISE bitgen phase of the flow, a system_stub_bd.bmm file will be created in the /implementation directory, indicating the physical location of the block RAMs.

Synplify Project Flow

While XPS/EDK generates the embedded processor subsystem (/hdl/system.v), once created the ppc_subsystem is instantiated exactly as any IP block by adding it to the overall Synplify synthesis project. Whether the underlying embedded processor subsystem used XST, Synplify, or both to create the peripherals and generate the subsystem is irrelevant to the overall Synplify synthesis project.



A typical synthesis project flow, as shown in Figure 1, would follow this order:

1. Create a synthesis project

2. Add files to the synthesis project:
   project_top.v
   /ppc_subsystem/hdl/system.v (EDK-generated subsystem)

3. Synthesize and review the synthesized project

4. Use the generated output files in the ISE project:
   fpga_top.edf (top-level source file)
   fpga_top.ncf (sdc-translated constraints file)

System.v contains the actual embedded subsystem with the peripheral wrappers instantiated. At the end of system.v are black_box definitions for each of the wrappers. Although Synplify doesn’t recognize these XST synthesis directives, it does realize that it has to create black boxes and does so without modification.

Synplify will generate the warnings shown in Figure 2 because of the XST-generated synthesis directives and empty black box modules. Once reviewed and accounted for, these warnings can be “hidden” using the Synplify Pro warnings filter, as shown in Figure 3. The filter creates a project.prf file (Figure 4). This file can also be sourced in the Tcl window (source filename).

ProjNav ISE Flow
The /pnr directory is used for the Xilinx ProjNav ISE flow. The fpga_project.npl file is created by ProjNav and records the ISE project options.

The following source files are added to the ISE project:

1. fpga_top.edf (Synplify top-level netlist with ppc_subsystem)
   fpga_top.ncf (not added as an explicit source file; created from the Synplify constraints [.sdc])

2. /constraints/constraints.ucf (Xilinx constraints file)

3. /ppc_subsystem/implementation/system_stub.bmm

   This file requires no modification, assuming that the subsystem instantiated in the top-level module uses the same instance name as generated by system_stub.v (that is, the top instance name indicated in the project options).

4. /ppc_subsystem/ppc405_0/code/executable.elf

   An .elf file (pronounced “elf”) is a binary data file that contains an executable CPU code image ready for running on a CPU. These files are produced by software compiler/linker tools. Data2BRAM uses .elf files as its basic data input form.


Figure 1 – Synthesis project design flow (EDK subsystem, Synplify synthesis, and ISE ProjNav stages)

Figure 2 – Synplify Pro 8.0 compiler warnings

Figure 3 – Synplify Pro 8.0 warnings filter

Figure 4 – Synplify Pro .prf file


In the ISE Translate properties, you must set the Macro Search Path to point to the /ppc_subsystem/implementation directory so that Translate can find the .ngc peripherals that were black-boxed by Synplify and are referenced in fpga_top.edf. These peripherals were created by XST during PlatGen.

Project implementation then follows a normal ProjNav flow, producing translate, map, place and route, and timing reports.

You can easily incorporate embedded processor software changes made by the EDK GNU compiler into the final .bit file without hardware recompiles by running Generate Programming File or, alternatively, the Data2Mem utility. When using Data2Mem, the BMM file specified (-bm) must be the BitGen-generated system_stub_bd.bmm in the /implementation directory.

Custom Peripheral Cores
XPS provides a Create Peripheral Wizard that generates core description files and ensures that custom peripherals comply with the Xilinx implementation of the IBM CoreConnect PLB and OPB bus standard. The PLB and OPB buses connect to an IPIF, allowing user logic to connect to the IPIC side of the interface. Unfortunately, the wizard currently supports only VHDL. Peripheral cores can also be created in Verilog, but cannot take advantage of the templates created by the wizard.

DCR and OCM bus IP cores are not currently supported through a template or wizard. DCR and OCM bus protocols are simple to understand, however, and you can easily create Pcores for these buses in either VHDL or Verilog. The current EDK-provided OCM buses now allow configurable multi-slave capabilities, providing an easy way to create low-latency slave-only peripherals.

You can integrate custom IP cores into the EDK project either as a black box synthesized with Synplify or as an XST netlist. The Synplify-generated IP core requires associated MPD (microprocessor peripheral definition) and BBD (black box definition) files. The XST netlist is synthesized by PlatGen along with the system and requires MPD and PAO files.

Directory Structure
Figure 5 shows the required Pcore directory structure. PlatGen searches for IP according to the following priorities:

1. /pcores directory in the project directory

2. <library_path>/<LibraryName>/pcores if the -lp option is set (project options/peripheral repository)

3. $EDK/hw/XilinxProcessorIPLib/pcores
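The search priority above amounts to a first-match lookup over an ordered list of directories. The following is a minimal Python sketch of that idea only; `find_pcore` and the example directory names are hypothetical and do not reproduce PlatGen's actual implementation.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_pcore(name: str, search_roots: Iterable[Path]) -> Optional[Path]:
    """Return the first <root>/<name> directory that exists.

    search_roots must be given in priority order, for example:
    project /pcores, then the -lp repository /pcores, then the EDK library.
    """
    for root in search_roots:
        candidate = root / name
        if candidate.is_dir():
            return candidate
    return None
```

Because the project-local /pcores directory comes first, a core placed there shadows a same-named core in the peripheral repository or in the EDK installation.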

Pcore Files
The Pcore HDL source files must be located in the /verilog or /vhdl directory if they are to be synthesized by XST with PlatGen. If the Pcore is provided as a Synplify-generated netlist, the EDIF must be located in the /netlist directory and its black-box status must be indicated in a BBD file. Required MPD, PAO, and BBD files for the peripheral must be placed in the /data directory.

The .mpd file specifies PORTs, PARAMETERs, BUS_INTERFACEs, and OPTIONs. For Verilog files, the HDL option specified is OPTION HDL = VERILOG.

If XST is used as the synthesis tool for creation of the peripheral, the netlist option is OPTION IMP_NETLIST = TRUE.

If Synplify is used for the creation of the peripheral, the netlist option is OPTION IMP_NETLIST = FALSE. This tells PlatGen not to run XST synthesis for this peripheral. A peripheral wrapper is still created and instantiated in system.v, and the project synthesis run in Synplify would again create a black box for this peripheral.
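The two packaging choices described above can be summarized as a small lookup. This Python sketch only restates the rules from the text; the `PcorePackaging` type and `packaging_for` function are hypothetical names introduced for illustration and are not part of EDK.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PcorePackaging:
    imp_netlist: str              # value of OPTION IMP_NETLIST in the .mpd file
    data_files: Tuple[str, ...]   # definition files expected in the /data directory
    content_dir: str              # where the HDL source or netlist is placed


def packaging_for(synthesis_tool: str) -> PcorePackaging:
    """Map the peripheral's synthesis tool to the packaging described in the text."""
    tool = synthesis_tool.strip().lower()
    if tool == "xst":
        # PlatGen synthesizes the HDL source with XST
        return PcorePackaging("TRUE", (".mpd", ".pao"), "/verilog or /vhdl")
    if tool == "synplify":
        # Pre-synthesized EDIF netlist, treated as a black box
        return PcorePackaging("FALSE", (".mpd", ".bbd"), "/netlist")
    raise ValueError(f"unsupported synthesis tool: {synthesis_tool!r}")
```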

Conclusion
You can easily integrate Xilinx embedded processor subsystems created using EDK into a Synplicity flow by instantiating the EDK-generated embedded subsystem into the top-level HDL design. You can use Synplicity tools not only as the overall project synthesis tool but also as the peripheral core synthesis tool in the creation of custom peripherals.

For more information, visit www.CommLogicDesign.com. Comm Logic Design is a Xilinx XPERTS partner focused on architecting, building, and delivering system solutions for wired-network, telecom, and storage applications.


Figure 5 – Pcore directory structure (each core directory under /pcores holds /data with the .mpd, .pao, and .bbd files, /hdl/verilog and /hdl/vhdl with the .v or .vhd source, and /netlist with the .edf or .ngc netlist)


Optimizing Virtex-4 High-Performance Designs

Synopsys Design Compiler FPGA can take your high-speed design to the next level of performance.

by Carlos Abraham
FPGA Synthesis CAE
Synopsys

Yanbing Li
Corporate Applications Engineering Manager
Synopsys

Synopsys® Design Compiler® FPGA (DC FPGA) allows you to meet your high-performance design goals by using a powerful set of optimization algorithms and features specifically tuned for the Xilinx® Virtex-4™ architecture. These algorithms use special Virtex-4 resources such as the DSP48 block and block RAM to achieve the lowest overall area utilization and the optimal circuit timing performance.

Design Compiler FPGA Overview
Designs that target complex devices such as Virtex-4 FPGAs require the same power and flexibility in synthesis that only ASIC designers had access to in the past. DC FPGA is built on Design Compiler’s industry-leading ASIC synthesis technology and then customized to include FPGA-specific optimizations to handle even the most challenging designs. FPGA-specific optimizations enable optimal mapping to FPGA basic primitives such as LUTs and complex components like RAM, multipliers, and DSP blocks.

DC FPGA includes innovative Adaptive Optimization™ (AO) technology to dynamically tune the synthesis algorithms based on the design context, as well as timing constraints, to provide faster synthesis runtime and optimal timing. DC FPGA inherits Design Compiler’s reliability – proven through the development of more than 125,000 ASIC designs. DC FPGA brings the powerful ASIC-strength synthesis of Design Compiler to FPGA designs.

In addition to AO technology, DC FPGA deploys a rich set of optimizations to achieve the best timing Quality of Results (QoR) for FPGA devices. These include:

• Constraint-driven synthesis and design space exploration

• Automatic finite state machine (FSM) extraction and optimization

• Automatic inference of special FPGA resources, such as RAM, ROM, multipliers, DSP blocks, shift registers, and global clock buffers

• Advanced datapath optimizations and module generation

• Logic and register duplication

• Register retiming and pipelining

• Critical path re-synthesis

• Across-boundary optimization

• Automatic gated-clock transformation

DC FPGA is part of a family of products from Synopsys that work in conjunction with the Xilinx ISE™ tool to streamline the FPGA design process.

In this article, we’ll show how DC FPGA optimizes for high performance in Xilinx Virtex-4 FPGAs.

Constraint-Driven Synthesis
DC FPGA uses a true timing-driven synthesis engine. You can greatly influence the final implementation choice by specifying appropriate timing and design-specific constraints during synthesis. Therefore, we recommend that you drive DC FPGA synthesis with the same set of constraints as the Xilinx ISE tool.

At a minimum, you should specify appropriate design timing constraints such as clock frequency, I/O offsets, and any timing exceptions applicable to your design (such as multicycle and false paths). Any other design-specific constraints – such as controlling special FPGA resource usage – could also be specified. For best performance, your design should not be over-constrained, which in some cases can lead to unnecessary increases in area.

Without any timing constraints, DC FPGA will perform area-based optimizations with good timing results. With proper timing constraints, DC FPGA applies the AO technology to explore the area-timing tradeoffs of various optimizations, selecting the final implementation that best fits your constraints.

For example, your timing goals enable DC FPGA to decide whether distributed RAM, block RAM, or a LUT with register-based implementation is sufficient for an inferred memory component in your design. Otherwise, DC FPGA optimizes for the lowest area utilization possible.

Table 1 shows two implementations for a small sub-module with two different clock constraints. The module is the critical one for a larger design of about 8,600 slices. The design contains a single clock domain with only one clock period constraint specified in DC FPGA.

In the first case, the module is constrained at 10 ns. DC FPGA exceeds the timing requirement after its area-based implementation and does not invoke the timing optimization phase. The critical path of the design runs through a series of carry logic.

In the second case, when a much tighter constraint (3 ns) is applied, DC FPGA performs aggressive timing optimizations and replaces the carry logic on its critical paths with parallel circuit structures built from LUTs. This results in a design that has a slightly larger area but meets the new timing requirement, which was impossible to achieve with the carry logic structure. At the overall design level, a 29% timing improvement is achieved (post-PAR Fmax rises from 260.1 MHz to 334.8 MHz) with a minor area increase of 11 slices (from 105 to 116).

          Clock        Post-PAR Area    Post-PAR
          Constraint   (# of Slices)    Fmax (MHz)
Case 1    10 ns        105              260.1
Case 2    3 ns         116              334.8

Table 1 - Design example showing area-timing tradeoffs in DC FPGA

Flexible FSM Support
DC FPGA contains sophisticated FSM extraction and optimization algorithms to ensure optimum high-performance state logic implementation. Once the FSM is detected and extracted from the RTL code, DC FPGA’s powerful state machine optimization engine performs various optimization schemes, such as optimizing unreachable states or removing duplicate states, to produce the best logic implementation to meet timing.

At the same time, you have the flexibility to select a different FSM coding style such as one-hot, binary, gray, or zero-one-hot on a state-machine-by-state-machine basis, a design basis, or a global basis. This FSM encoding exploration flexibility allows you to customize the synthesis script to address design bottlenecks.

For an FPGA implementation, one-hot state implementations typically provide the best timing QoR for most designs at the expense of a higher register-to-LUT ratio. For most designs this is not a problem because of the register-rich architecture of FPGA devices.

High-Performance DSP Inference Capability
The availability of special FPGA resources such as block RAM, dedicated DSP slices, and carry logic, combined with your specified design and timing constraints, guides DC FPGA’s specialized optimization algorithms to determine the best circuit implementation.

DC FPGA is highly capable of inferring complex circuit topology from your design’s RTL coding structure, effectively deciding the final implementation that best exploits the resources of the targeted FPGA. DC FPGA minimizes overall resource usage while providing the best circuit performance possible.

This powerful optimization feature allows DC FPGA to effectively infer and map complex logic configurations into special resources such as the Virtex-4 dedicated DSP48 slice. To illustrate this feature, Figure 1 shows a simple multiply accumulate (MAC) logic structure, where the A and B registered input signals are multiplied. The registered multiplier intermediate output is then accumulated in the last adder stage, feeding the registered Q output signal.

The RTL code for this simple MAC function is:

module test ( Q, A, B, clk );
    output [47:0] Q;
    input  [16:0] A, B;
    input         clk;

    reg [47:0] Q;             // accumulator output register
    reg [16:0] A_reg, B_reg;  // registered inputs
    reg [33:0] mult;          // registered 17 x 17 product

    always @( posedge clk )
    begin
        A_reg <= A;
        B_reg <= B;
        mult  <= A_reg * B_reg;
        Q     <= Q + mult;
    end

endmodule
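To make the pipeline timing of this RTL easier to follow, here is a cycle-level Python model of the same four registers. It is a sketch under two assumptions that the Verilog itself does not state: all registers start at zero (the RTL has no reset), and the inputs are unsigned. The names `mac_step` and `run_mac` are illustrative only.

```python
MASK17 = (1 << 17) - 1   # width of A_reg and B_reg
MASK34 = (1 << 34) - 1   # width of mult
MASK48 = (1 << 48) - 1   # width of Q


def mac_step(state, a, b):
    """Advance one clock edge; every next value is computed from the old state,
    which mirrors the non-blocking (<=) assignments in the RTL."""
    a_reg, b_reg, mult, q = state
    return (
        a & MASK17,
        b & MASK17,
        (a_reg * b_reg) & MASK34,
        (q + mult) & MASK48,
    )


def run_mac(samples):
    """Return the value of Q after each clock edge for a list of (A, B) pairs."""
    state = (0, 0, 0, 0)  # assumed power-up state
    q_trace = []
    for a, b in samples:
        state = mac_step(state, a, b)
        q_trace.append(state[3])
    return q_trace


if __name__ == "__main__":
    print(run_mac([(2, 3), (4, 5), (0, 0), (0, 0), (0, 0)]))
```

In this model the product of an input pair reaches Q on the third clock edge counting the edge that samples the pair, because the data passes through the input, multiplier, and accumulator registers in turn.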

DC FPGA is able to effectively implement the logic configuration shown in Figure 1 in a single DSP48 slice, fully recognizing and taking advantage of the DSP48’s embedded 18 x 18 signed multiplier, accumulator adder mode, and integrated pipeline registers to obtain the highest system clock speed.

Figure 2 shows the final DC FPGA single-DSP48 implementation, which uses no other logic resources. The OPMODE control input of the DSP48 element is set to “0100101” to realize the MAC functionality intended by the circuit topology, while the AREG, BREG, MREG, and PREG attributes are each set to “1” to select a single-stage register pipeline.
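The OPMODE strings quoted in this article can be read as three multiplexer select fields. The Python sketch below decodes them; the field layout (Z = OPMODE[6:4], Y = OPMODE[3:2], X = OPMODE[1:0]) and the select encodings are taken from our reading of the Virtex-4 DSP48 documentation rather than from this article, so treat them as assumptions to verify against the Xilinx user guide. `decode_opmode` is a hypothetical helper name.

```python
# Assumed mux encodings (verify against the Virtex-4 DSP48 documentation)
X_MUX = {"00": "0", "01": "M", "10": "P", "11": "A:B"}
Y_MUX = {"00": "0", "01": "M", "11": "C"}
Z_MUX = {"000": "0", "001": "PCIN", "010": "P", "011": "C"}


def decode_opmode(opmode: str) -> dict:
    """Split a 7-character OPMODE string (bit 6 first) into its Z, Y, X selections."""
    if len(opmode) != 7 or any(ch not in "01" for ch in opmode):
        raise ValueError("OPMODE must be a 7-character binary string")
    z_bits, y_bits, x_bits = opmode[0:3], opmode[3:5], opmode[5:7]
    return {
        "Z": Z_MUX.get(z_bits, "other"),
        "Y": Y_MUX.get(y_bits, "other"),
        "X": X_MUX.get(x_bits, "other"),
    }


if __name__ == "__main__":
    for mode in ("0100101", "0010101", "0000101"):
        print(mode, decode_opmode(mode))
```

Under these assumptions, “0100101” selects the multiplier output on X and Y and feeds P back on Z, which is the accumulate behavior described above; the other two values are the multiply-add (Z = PCIN) and multiply (Z = 0) modes labeled in the FIR filter figure.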

Furthermore, the high-performance DSP inference feature in DC FPGA supports very complex design topologies. Such topologies are extensively used in DSP-intensive applications such as a digital FIR filter, commonly found in wireless communication applications.

Figure 1 - Simple multiply accumulate (MAC) logic

Figure 2 - DC FPGA single DSP48 implementation for MAC logic (A[16:0] and B[16:0] zero-extended to the 18-bit A and B ports; OPMODE = "0100101")

Figure 3 - Four-tap systolic FIR digital filter structures (first slice in multiply mode, OPMODE = 0000101; DSP48 slices 2 to 4 in multiply-add mode, OPMODE = 0010101)

Figure 3 shows the schematic of a four-tap systolic FIR digital filter structure. DC FPGA uses advanced DSP inference to implement this design in only four DSP48 slices without the use of external logic resources. The integrated pipeline registers are further exploited for faster clock throughput for this type of filter structure.

The following shows the RTL code for the systolic FIR filter:

module test ( Yn, Xn, h0, h1, h2, h3, clk );
    output [47:0] Yn;
    input  [15:0] Xn, h0, h1, h2, h3;
    input         clk;

    reg  [15:0] X [7:1];      // input sample delay line
    wire [15:0] h [3:0];      // filter coefficients
    reg  [32:0] mult [3:0];   // registered tap products
    reg  [47:0] pcout [3:0];  // registered partial sums
    wire [47:0] Yn;
    integer i;

    assign h[3] = h3, h[2] = h2, h[1] = h1, h[0] = h0;

    always @( posedge clk )
    begin
        X[1]     <= Xn;
        mult[0]  <= h[0] * X[1];
        pcout[0] <= mult[0];

        for (i = 1; i <= 3; i = i + 1)
        begin: my_for_loop_block0
            X[2*i]   <= X[2*i-1];
            X[2*i+1] <= X[2*i];

            mult[i]  <= h[i] * X[2*i+1];
            pcout[i] <= pcout[i-1] + mult[i];
        end //my_for_loop_block0
    end

    assign Yn = pcout[3];

endmodule
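One way to convince yourself that this register structure computes an FIR response is to clock a software model of it and compare the result against a direct-form sum. The Python sketch below does that. It assumes zero-initialized registers and unbounded integers (no bit-width truncation), neither of which is stated in the RTL; the function names are illustrative.

```python
def simulate_systolic_fir(samples, h):
    """Clock the registers of the four-tap RTL once per input sample."""
    X = [0] * 8        # X[1]..X[7]; index 0 is unused to match the RTL indices
    mult = [0] * 4
    pcout = [0] * 4
    yn = []
    for xn in samples:
        # Non-blocking semantics: compute all next values from the current state
        nX, nmult, npc = X[:], mult[:], pcout[:]
        nX[1] = xn
        nmult[0] = h[0] * X[1]
        npc[0] = mult[0]
        for i in range(1, 4):
            nX[2 * i] = X[2 * i - 1]
            nX[2 * i + 1] = X[2 * i]
            nmult[i] = h[i] * X[2 * i + 1]
            npc[i] = pcout[i - 1] + mult[i]
        X, mult, pcout = nX, nmult, npc
        yn.append(pcout[3])
    return yn


def direct_fir(samples, h, latency):
    """Reference: y[n] = sum_k h[k] * x[n - latency - k], with x = 0 before n = 0."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, coeff in enumerate(h):
            idx = n - latency - k
            if 0 <= idx < len(samples):
                acc += coeff * samples[idx]
        out.append(acc)
    return out


if __name__ == "__main__":
    impulse = [1] + [0] * 9
    print(simulate_systolic_fir(impulse, [1, 2, 3, 4]))
```

In this model the output sequence equals the direct-form FIR delayed by five samples, which reflects the input, multiplier, and output pipeline registers in each tap.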

DC FPGA can also implement other complex logic configurations in a DSP48 slice. Table 2 shows a sample of some of these complex logic structures.

The designs shown in Table 2 were synthesized using DC FPGA and placed and routed using Xilinx ISE 6.3i Service Pack 2, while targeting an XC4VFX20-11 Virtex-4 device. The purpose of this exercise is to show the performance and area improvements achieved by DC FPGA’s advanced DSP inference capability. Each design was synthesized with and without DSP inference enabled during synthesis.

Conclusion
Complex devices such as Virtex-4 require a flexible ASIC-strength synthesis solution. The advanced optimization engine in Synopsys Design Compiler FPGA efficiently utilizes the special resources available in Virtex-4 devices to provide the highest performance design possible.

DC FPGA gives you the freedom to modify synthesis scripts to address design bottlenecks, implement different FSM encoding styles, or explore other design optimizations to reach your design goals. Now you have access to the power and flexibility of Design Compiler to implement your complex FPGA designs.

DC FPGA is an integral part of the complete ASIC-strength prototyping solution from Synopsys. Other tools supported in the Xilinx flow are Formality™ for formal verification, DesignWare® Library IP, Leda® for RTL design and code checking, PrimeTime® for static timing analysis, VCS® for simulation, Module Compiler™ for datapath synthesis, and HSPICE™ for analysis of multi-gigabit serial I/Os.

DC FPGA has a rapidly growing base of more than 100 customers. For more information about Design Compiler FPGA, visit www.synopsys.com/products/dcfpga/dcfpga.html.


Design   Max Delay (ns)   Max Delay (ns)   Test Description
         with DSP48       without DSP48

Test1*   3.028            3.062            A_reg[17:0] (FD) <= A
                                           B_reg[17:0] (FD) <= B
                                           Q[35:0] (FD) <= A_reg * B_reg

Test2    1.720            5.444            A_reg1[16:0] (FD) <= A_reg (FD) <= A
                                           B_reg1[16:0] (FD) <= B_reg (FD) <= B
                                           mult[34:0] (FD) <= A_reg1 * B_reg1
                                           Q[47:0] (FD) <= Q + mult

Test3    3.954            7.975            Q[47:0] (FD) <= Q + A[16:0] * B[16:0]

Test4    1.633            8.081            A_reg[16:0] (FD) <= A
                                           B_reg[16:0] (FD) <= B
                                           C_reg[47:0] (FD) <= C
                                           Q[47:0] (FD) <= sel ? C_reg + (A_reg * B_reg) : C_reg - (A_reg * B_reg)

Test5    5.680            8.177            A[16:0], B[16:0], C[47:0]
                                           Q[47:0] = sel ? C + (A * B) : C - (A * B)

Test6    6.151            7.631            A[16:0], B[16:0], C[16:0], D[16:0]
                                           E[16:0], F[16:0], G[16:0], H[16:0]
                                           mult1[33:0] (FD) <= A * B + C * D
                                           mult2[33:0] (FD) <= E * F + G * H
                                           Q[47:0] = mult1 + mult2

* Input and output signals are signed

Table 2 - Design examples showing performance improvement of advanced DC FPGA DSP inference
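The delays in Table 2 are easier to compare as ratios. This short Python sketch computes, for each test, how many times shorter the maximum delay is with DSP48 inference enabled; the numbers are copied from Table 2, while the `speedup` helper and the dictionary name are illustrative only.

```python
# (max delay with DSP48, max delay without DSP48), in ns, from Table 2
TABLE2_MAX_DELAY_NS = {
    "Test1": (3.028, 3.062),
    "Test2": (1.720, 5.444),
    "Test3": (3.954, 7.975),
    "Test4": (1.633, 8.081),
    "Test5": (5.680, 8.177),
    "Test6": (6.151, 7.631),
}


def speedup(with_dsp_ns: float, without_dsp_ns: float) -> float:
    """Ratio of the two maximum delays; values above 1 favor the DSP48 mapping."""
    return without_dsp_ns / with_dsp_ns


if __name__ == "__main__":
    for name, (with_dsp, without_dsp) in TABLE2_MAX_DELAY_NS.items():
        print(f"{name}: {speedup(with_dsp, without_dsp):.2f}x")
```

The fully pipelined designs (Test2 and Test4) show the largest ratios, which is consistent with the article's point that the DSP48's integrated pipeline registers are what the inference exploits; the plain registered multiplier in Test1 gains very little.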


Selecting Connectors for Multi-Gigabit Transceiver Designs

With data transfer rates at 10 Gbps, connector choice is crucial.

by Marc Defossez
Sr. Staff Applications Engineer
Xilinx

In modern high-speed digital designs, connectors require careful attention; you can’t just use any one that’s available. When designing with Xilinx® Virtex-4™ multi-gigabit transceiver (MGT) devices, with data transfer rates increasing to 10 Gbps, connectors are part of the total solution.

It is often said that the silicon, in our case the FPGA, does all the work in a system. Passive components such as connectors get the blame for increasing design cost, complexity, and size, and therefore are often neglected.

Today’s digital designs enter the RF world with transfer speeds of 10 Gbps and more per data pair; thus, you can no longer ignore the overall impact connector choice has on a design.

Connector manufacturers must keep track of high-speed digital design needs while meeting the demand for multiple high-speed low-loss connections in a small connector shape. Connector design, therefore, becomes increasingly difficult.

The two worlds need to be combined; therefore, we advise following these steps when selecting a connector:

• Choose your connector type – backplane, board-to-board, board-to-cable, or mezzanine

• Find manufacturers carrying connectors with the right physical parameters

• Carefully examine the manufacturer’s electrical specifications, test reports, and other published references

Board-to-Backplane or Board-to-Board Connectors
A system in which multiple MGT signals (3.125 Gbps to 10 Gbps) cross directly from board to board or run over a backplane needs special connectors. The Teradyne™ GbX connector is a high-density, optimized differential connector family delivering data rates greater than 5 Gbps (tested up to 12 Gbps) (Figure 1).

In the same range, Tyco™-AMP offers the Z-Pack HM-Zd differential connector system, designed for serial switching applications from 3.125 Gbps to 6.4 Gbps (demonstrated at 12 Gbps) (Figure 2).

Both connector families are made specifically for high-data-transfer-rate designs such as enterprise switching equipment, telecommunications equipment, and mass data storage. They are robust, have a modular setup, and offer routability and optimal system performance.

Teradyne’s GbX advanced performance interconnects provide high-density optimized differential connectors. They are available in three-, four-, and five-pair versions and permit vertical and horizontal routing, making them the ideal solution for star or mesh backplane designs.

Tyco-AMP’s high-speed, differential, board-to-backplane electrical connectors are an extension of the already established IEC 61076-4-101 hard metric connector family; HM-Zd adds a high-speed differential solution to that family. Z-Pack HM-Zd connectors are available in two-, three-, and four-pair versions.

In board-to-board designs where size matters, Samtec™ QSE and QTE connector families support data transfer rates up to 6 Gbps (Figure 3).

For board-to-board links with a point-to-point setup, Samtec offers a reliable cable connection based on the QSE/QTE connector technology. The 50 ohm controlled impedance, 38 AWG mini coax ribbon cable (Figure 4) is available with as many as 240 signal lines, as well as a differential or single-ended flex-strip solution.

You can create custom connector specifications for both the QSE/QTE and ribbon cable on Samtec’s website and download cable specifications and test reports on crosstalk, travel delay, and impedance.

Mezzanine Board-to-Board Connectors
Mezzanine card systems are mostly used to relocate high-pin-count devices onto mezzanine or module cards, simplifying board routing without compromising system performance.

Mezzanine cards need high bandwidth and a large number of parallel connections as well as several serial connections. Teradyne’s version is the NexLev connector family, with performance up to 12 Gbps. This connector enables a large number of connection possibilities at different connector heights.

The NexLev connector is built in a stripline construction, providing a continuous ground plane for each signal contact (Figure 5). The connectors come as ten-row connectors with 100, 200, or 300 positions at possible stacking heights from 10 mm to 30 mm. You can find technical figures at www.teradyne.com/prods/tcs/products/connectors/mezzanine/nexlev/signintegr.html#differential.

Samtec offers a similar solution with its YFS/YFT single-ended and differential-pair-array connector arrays called SamArray (Figure 6). These connectors have a performance up to 10 Gbps and comprise a large number of single-ended connections. Differential signaling is obtained through pin layout (Figure 7).

Connectors are offered as five-, eight-, or ten-row with as many as 50 contacts per row, for stacking heights from 5 to 25 mm. Technical figures are provided in PDF format at www.samtec.com/signal_integrity/technical_specifications/electrical.asp?series=YFS-DP&stack=25&menu=Signal_Integrity.

Mezzanine connectors have a BGA footprint and can be treated by assembly machines as regular BGA components. Experience with these connectors shows that they are best glued to the PCB before soldering. If not glued, there is a great chance that the connector will move during soldering.

Connectors for Cable Connections
For design reasons you may not be able to use the connectors described above. In this case you can still turn to older solutions, such as the well-known SMA connector and the small MMCX connector.

SMA stands for “SubMiniature version A”; the connector was first developed in the 1960s. SMA connectors are 50 ohm, semi-precision subminiature units that provide excellent electrical performance from DC to 18 GHz with a threaded interface. These high-performance connectors are compact in size and have outstanding mechanical specifications.

Besides the standard straight, 90-degree, and edge-launch versions, an SMT-mount version is now also available (Figure 8). This SMT version is preferable to the others because of its performance characteristics.

The MMCX series is sometimes also called MicroMate. It is the smallest RF connector and was developed in the 1990s. MMCX is a micro-miniature connector series with a lock-snap mechanism, allowing for 360 degrees of rotation and thus enabling great flexibility in PCB layouts. MMCX connectors conform to the European CECC 22000 specification.


Figure 1 – Teradyne GbX connector

Figure 2 – Tyco-AMP Z-Pack HM-Zd connector

Figure 3 – Samtec QSE and QTE connector

Figure 4 – Samtec ribbon cable

Figure 5 – Stripline construction of NexLev


MMCX products range to 6 GHz for a 50 Ω interconnect system. A set of connectors includes surface mount, edge card, and cable connectors. Here the SMT version is preferable (Figure 9).

You can purchase ready-made, custom, and length-matched cable interconnect for this type of connection from different sources and choose between flexible or semi-rigid cabling.

Connector Basics
Suppose you’ve selected your IC devices and your board has been laid out with all of the right design rules, such as:

• Controlled impedance traces

• Controlled time delay of stubs

• Stubs with a time delay shorter than about 20% of the fastest signal’s rise time

• Time delay of discontinuities shorter than about 15% of the fastest signal’s rise time

• Adjacent traces spaced far enough apart to keep crosstalk at an acceptable level

• A stack-up with power and ground planes on adjacent layers

• A continuous return path under each signal trace
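The 20% and 15% rules above translate into physical lengths once you know how fast a signal travels on your board. The Python sketch below does that conversion. The default propagation delay of 7 ps/mm is an assumed ballpark for an FR-4 type board and is not a figure from this article; use the value for your own stack-up. The function name is illustrative.

```python
def max_length_mm(rise_time_ps: float, fraction: float,
                  delay_ps_per_mm: float = 7.0) -> float:
    """Longest stub or discontinuity whose delay stays within
    fraction * rise_time, given the board's propagation delay per mm."""
    if rise_time_ps <= 0 or delay_ps_per_mm <= 0 or not 0 < fraction < 1:
        raise ValueError("rise time and delay must be positive, fraction in (0, 1)")
    return fraction * rise_time_ps / delay_ps_per_mm


if __name__ == "__main__":
    rise_ps = 50.0  # assumed fastest rise time, for illustration only
    print(f"stub limit:          {max_length_mm(rise_ps, 0.20):.2f} mm")
    print(f"discontinuity limit: {max_length_mm(rise_ps, 0.15):.2f} mm")
```

With rise times in the tens of picoseconds, the allowed lengths come out at a millimeter or two, which is on the scale of connector pins and their break-out; this is why the connector cannot be treated as electrically invisible.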

You’re not quite done yet. In high-performance systems, every element must be optimized for the entire system to meet performance, schedule constraints, size, and cost. It is like a chain – every link must be strong for the whole to meet the demanding performance specs of today’s high-speed products.

How can components like connectors affect system performance? Usually the potential problems are lumped into two categories: timing and noise, together referred to as signal integrity (SI).

What is important when selecting connectors?

• EMI, translated to series inductance

• Crosstalk, translated to mutual inductance

• Signal propagation, as parasitic capacitance

Series Inductance
The most fundamental effect a connector adds to a circuit is series inductance. The primary factor for the series inductance is the pin length of the connector. Together with the series inductance of each connector pin, the pin layout of the connector determines the radiated EMI (electromagnetic interference).

Signals traveling through a connector need a current return path (ground). If no return path is provided through the connector, the return current has to find another route, and large inductive loops can be created (Figure 10). This will result in substantial EMI emission.

Differential signaling solves the problem of current return paths by eliminating it. Differential signaling uses two identical but opposite signals. The return currents are therefore also opposite to each other (Figure 11) and cancel out. The only return current left from a differential pair is caused by imbalance between the two signals, because the subtraction of both signals will not be exactly zero.
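To get a feel for why pin inductance matters at these edge rates, the textbook relation V = L · dI/dt can be evaluated directly. The Python sketch below assumes a linear current ramp over the edge time; the inductance, current step, and edge time in the example are illustrative assumptions, not measurements from any connector named in this article.

```python
def inductive_voltage_v(inductance_nh: float, delta_i_ma: float,
                        edge_time_ps: float) -> float:
    """Voltage across an inductance for a linear current ramp: V = L * dI/dt.

    Units: nH * mA / ps = (1e-9 * 1e-3) / 1e-12 = volts.
    """
    if edge_time_ps <= 0:
        raise ValueError("edge time must be positive")
    return inductance_nh * delta_i_ma / edge_time_ps


if __name__ == "__main__":
    # Assumed example: 2 nH pin, 20 mA current step, 50 ps edge
    print(f"{inductive_voltage_v(2.0, 20.0, 50.0):.2f} V")
```

The voltage scales inversely with edge time, so halving the rise time doubles the disturbance for the same pin, which is one reason shorter pins and more ground pins matter as data rates rise.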

Figure 6 – Samtec YFS/YFT connector

Figure 7 – Best- and worst-case pin layout for YFS/YFT (pin maps of rows A to J by columns 1 to 8, showing data pairs and ground pins)

Figure 8 – SMA edge launch, SMT

Figure 9 – MMCX edge launch, SMT

Mutual Inductance
Current loops illustrate mutual inductive coupling in Figure 12. Current leaving device A returns through signal return path X. Likewise, currents leaving devices B and C return through paths Y and Z.

Because all of these paths overlap, magnetic fields from one path induce voltages (noise) in the other paths. The induced noise will be larger or smaller depending on the physical location of a path. In our example, Y will receive more noise than Z because it shares more loop area with X.

Crosstalk between differential signals is much less of a concern: because of their nature, the coupled noise largely cancels out.
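The cancellation comes from the receiver taking the difference of the two legs: noise that couples equally onto both legs drops out of the subtraction. A minimal Python sketch, using made-up millivolt values purely for illustration:

```python
def received_differential_mv(p_mv: int, n_mv: int,
                             noise_on_p_mv: int, noise_on_n_mv: int) -> int:
    """Differential receiver output: (P + noise on P) - (N + noise on N)."""
    return (p_mv + noise_on_p_mv) - (n_mv + noise_on_n_mv)


if __name__ == "__main__":
    # Equal coupling onto both legs: the noise cancels completely
    print(received_differential_mv(200, -200, 30, 30))
    # Unequal coupling: only the difference between the two noise terms remains
    print(received_differential_mv(200, -200, 30, 20))
```

The second case shows why the pin layout still matters: cancellation is only as good as the symmetry of the coupling onto the two legs of the pair.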

Parasitic Capacitance
Mutual and shunt (pin-to-pin) capacitance is another effect that comes with a connector; usually you can ignore it. The effect of capacitance is to slow down the system edge rate. In multi-drop backplane applications, parasitic capacitance places more of a burden on connectors than in point-to-point applications.

Transmitted signals pass each tap on the bus; the cumulative effect of the parasitic capacitance, together with the series inductance of the source connector, can distort the signals.
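A rough way to estimate the edge-rate penalty is to treat the parasitic capacitance as a single lumped RC. The Python sketch below uses two common rules of thumb that are not taken from this article: the 10% to 90% rise time of an RC low-pass is about 2.2·R·C, and cascaded rise times combine as the root of the sum of squares. Lumping all taps into one capacitance is a further simplification, and the example values are assumptions.

```python
import math


def rc_rise_time_ps(r_ohm: float, c_pf: float) -> float:
    """10-90% rise time of a single RC low-pass: about 2.2 * R * C (ohm * pF = ps)."""
    return 2.2 * r_ohm * c_pf


def degraded_rise_time_ps(input_rise_ps: float, r_ohm: float,
                          c_pf_per_tap: float, taps: int = 1) -> float:
    """Estimate the edge after the load, lumping every tap into one capacitance."""
    rc_ps = rc_rise_time_ps(r_ohm, c_pf_per_tap * taps)
    return math.sqrt(input_rise_ps ** 2 + rc_ps ** 2)


if __name__ == "__main__":
    # Assumed example: 50 ps edge, 25 ohm driving impedance, 1 pF per connector tap
    for taps in (1, 4, 8):
        print(taps, round(degraded_rise_time_ps(50.0, 25.0, 1.0, taps), 1), "ps")
```

The estimate grows with the number of taps, which illustrates why a multi-drop backplane is harder on connector capacitance than a point-to-point link.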

Connector Selection
To provide excellent high-speed connectors, manufacturers need to control and manage the above parameters and many more. Engineers now have access to an extensive amount of data measured and calculated by connector manufacturers.

On most manufacturers’ websites, electrical, mechanical, and SI information is available, together with PCB drawing and simulation aids:

• Mechanical

– Dimension drawing in PDF format

– 3D models in IGES, STEP, Parasolid, or ACIS format

– Mechanical qualification and stress test reports

– PCB layout tool library components

• Electrical

– Electrical test reports

– Application notes

– SI parameters and results

– Datasheets

• Simulation

– IBIS and SPICE models

An extra service offered by Samtec is the “Final Inch” website, for designing a connector break-out region on a PCB.

The manufacturers mentioned in this article are not the only high-speed connector manufacturers on the market. Other companies, such as ERNI™, Hirose, Molex™, Amphenol™, and Radiall™, manufacture similar connectors under license. Many other companies have their own range of high-speed connectors.

Conclusion
Today’s high-speed digital design engineers can benefit from the RF knowledge of connector suppliers, using the information available in datasheets, application notes, and on the Internet.

You can use this article as a starting point for better PCB and connector design.

For more information, see the books “High-Speed Digital System Design” by Stephen H. Hall, Garrett W. Hall, and James A. McCall; “High-Speed Digital Design” by Howard Johnson; or visit www.johnson-comp.com, www.samtec.com, www.samtec.com/sudden_service/current_literature/q-pairs/index.html, www.samtec.com/sudden_service/current_literature/SamArray/index.html, www.teradyne.com/prods/tcs, and hmzd.tycoelectronics.com.


Figure 10 – EMI generated due to improper current return paths (annotation: the return current splits between loop 1 and loop 2; this effect can be minimized through the use of enough ground pins in the connector)

Figure 11 – Differential eliminated returned signal currents (driver and receiver with a positive and a negative current loop)

Figure 12 – Mutual inductive coupling through a connector (devices A, B, and C with return paths X, Y, and Z)


by Mike Black
Strategic Marketing Manager
Micron Technology, [email protected]

With network line rates steadily increasing, memory density and performance are becoming extremely important in enabling network system optimization. Micron Technology’s RLDRAM™ and DDR2 memories, combined with Xilinx® Virtex-4™ FPGAs, provide a platform designed for performance.

This combination provides the critical features networking and storage applications need: high density and high bandwidth. The ML461 Advanced Memory Development System (Figure 1) demonstrates high-speed memory interfaces with Virtex-4 devices and helps reduce time to market for your design.

Micron Memory
With a DRAM portfolio that’s among the most comprehensive, flexible, and reliable in the industry, Micron has the ideal solution to enable the latest memory platforms. Innovative new RLDRAM and DDR2 architectures are advancing system designs farther than ever, and Micron is at the forefront, enabling customers to take advantage of the new features and functionality of Virtex-4 devices.

RLDRAM II Memory
An advanced DRAM, RLDRAM II memory uses an eight-bank architecture optimized for high-speed operation and a double-data-rate I/O for increased bandwidth. The eight-bank architecture enables RLDRAM II devices to achieve peak bandwidth by decreasing the probability of random access conflicts.

In addition, incorporating eight banks results in a reduced bank size compared to typical DRAM devices, which use four. The smaller bank size enables shorter address and data lines, effectively reducing the parasitics and access time.

Although bank management remains important with RLDRAM II architecture, even at its worst case (burst of two at 400 MHz operation), one bank is always available for use. Increasing the burst length of the device increases the number of banks available.

I/O Options
RLDRAM II architecture offers separate I/O (SIO) and common I/O (CIO) options. SIO devices have separate read and write ports to eliminate bus turnaround cycles and contention. Optimized for near-term read and write balance, RLDRAM II SIO devices are able to achieve full bus utilization.

Alternatively, CIO devices have a shared read/write port that requires one additional cycle to turn the bus around. RLDRAM II CIO architecture is optimized for data streaming, where the near-term bus operation is either 100 percent read or 100 percent write, independent of the long-term balance. You can choose an I/O version that provides an optimal compromise between performance and utilization.

The RLDRAM II I/O interface provides other features and options, including support for both 1.5V and 1.8V I/O levels, as well as programmable output impedance that enables compatibility with both HSTL and SSTL I/O schemes. Micron’s RLDRAM II devices are also equipped with on-die termination (ODT) to enable more stable operation at high speeds in multipoint systems. These features provide simplicity and flexibility for high-speed designs by bringing both end termination and source termination resistors into the memory device. You can take advantage of these features as needed to reach the RLDRAM II operating speed of 400 MHz DDR (800 MHz data transfer).
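The relationship between the 400 MHz clock and the 800 MHz data transfer rate quoted above follows from double-data-rate signaling, which moves data on both clock edges. The sketch below works through that arithmetic; the 36-bit bus width is an assumption chosen purely for illustration.

```python
def ddr_transfer_rate_hz(clock_hz):
    """Double data rate: one transfer on each clock edge."""
    return 2 * clock_hz


def peak_bus_bandwidth_bps(clock_hz, bus_width_bits):
    """Peak bandwidth in bits per second for a DDR data bus."""
    return ddr_transfer_rate_hz(clock_hz) * bus_width_bits


CLOCK_HZ = 400e6      # from the text: 400 MHz DDR operation
BUS_WIDTH_BITS = 36   # assumption: hypothetical data bus width

rate = ddr_transfer_rate_hz(CLOCK_HZ)
bandwidth = peak_bus_bandwidth_bps(CLOCK_HZ, BUS_WIDTH_BITS)
print(f"{rate / 1e6:.0f} M transfers/s, {bandwidth / 1e9:.1f} Gbps peak")
```

This is a peak figure only; sustained bandwidth also depends on bank management and, for CIO devices, on bus turnaround cycles.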

At high-frequency operation, however, it is important that you analyze the signal driver, receiver, printed circuit board network, and terminations to obtain good signal integrity and the best possible voltage and timing margins. Without proper terminations, the system may suffer from excessive reflections and ringing, leading to reduced voltage and timing margins. This, in turn, can lead to marginal designs and cause random soft errors that are very difficult to debug. Micron’s RLDRAM II devices provide simple, effective, and flexible termination options for high-speed memory designs.
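To see why termination matters, the fraction of an incident wave reflected at a load is given by the standard reflection coefficient, (Z_load − Z_line) / (Z_load + Z_line). The sketch below evaluates it for a few cases; the 50 ohm line impedance and the load values are illustrative assumptions, not RLDRAM II specifications.

```python
def reflection_coefficient(load_ohm, line_ohm):
    """Fraction of the incident voltage wave reflected at the load."""
    return (load_ohm - line_ohm) / (load_ohm + line_ohm)


LINE_OHM = 50.0  # assumption: hypothetical trace impedance

for load in (50.0, 100.0, 25.0):
    gamma = reflection_coefficient(load, LINE_OHM)
    print(f"load {load:5.1f} ohm -> reflection coefficient {gamma:+.2f}")
```

A matched load reflects nothing, while a mismatch in either direction sends part of the signal back down the line, which is the source of the ringing described above.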

On-Die Source Termination Resistor
The RLDRAM II DQ pins also have on-die source termination. The DQ output driver impedance can be set in the range of 25 to 60 ohms. The driver impedance is selected by means of a single external resistor to ground that establishes the driver impedance for all of the device DQ drivers.

As was the case with the on-die end termination resistor, using the RLDRAM II on-die source termination resistor eliminates the need to place termination resistors on the board, saving design time, board space, material costs, and assembly costs while increasing product reliability. It also eliminates the cost and complexity of end termination for the controller at that end of the bus. With flexible source termination, you can build a single printed circuit board with various configurations that differ only by load options, and adjust the Micron RLDRAM II memory driver impedance with a single resistor change.

Xilinx/Micron Partner to Provide High-Speed Memory Interfaces

Micron’s RLDRAM II and DDR/DDR2 memory combines performance-critical features to provide both flexibility and simplicity for Virtex-4-supported applications.

DDR/DDR2 SDRAM
DRAM architecture changes enable twice the bandwidth without increasing the demand on the DRAM core, while keeping power low. These evolutionary changes enable DDR2 to operate between 400 MHz and 533 MHz, with the potential of extending to 667 MHz and 800 MHz. A summary of the functionality changes is shown in Table 1.

Modifications to the DRAM architecture include shortened row lengths for reduced activation power, burst lengths of four and eight for improved data bandwidth capability, and the addition of eight banks in 1 Gb densities and above.

New signaling features include on-die termination (ODT) and off-chip driver (OCD) calibration. ODT provides improved signal quality, with better system termination on the data signals. OCD calibration provides the option of tightening the variance of the pull-up and pull-down output driver at 18 ohms nominal.

Modifications were also made to the mode register and extended mode register, including column address strobe (CAS) latency, additive latency, and programmable data strobes.

Conclusion
The built-in silicon features of Virtex-4 devices – including ChipSync™ I/O technology, SmartRAM, and Xesium differential clocking – have helped simplify interfacing FPGAs to very-high-speed memory devices. A 64-tap 80 ps absolute delay element as well as input and output DDR registers are available in each I/O element, providing for the first time a run-time center alignment of data and clock that guarantees reliable data capture at high speeds.
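One way to picture run-time center alignment with a 64-tap, 80 ps delay line is to sweep the taps, record which settings capture data correctly, and then pick the middle of the widest passing window. The sketch below illustrates that idea only; it is not the actual ChipSync calibration procedure, the sweep result is invented, and it assumes that tap 0 adds no delay.

```python
TAP_COUNT = 64      # from the text: 64-tap delay element
TAP_DELAY_PS = 80   # from the text: 80 ps per tap


def center_tap(valid_by_tap):
    """Return the middle tap of the longest run of passing taps, or None."""
    best_start, best_len, run_start = None, 0, None
    # A trailing False closes any run that reaches the last tap.
    for i, ok in enumerate(list(valid_by_tap) + [False]):
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


def tap_delay_ps(tap):
    """Delay of a tap setting, assuming tap 0 adds no delay (assumption)."""
    return tap * TAP_DELAY_PS


# Hypothetical sweep: data is captured correctly for taps 10 through 30.
sweep = [False] * 10 + [True] * 21 + [False] * 33
best = center_tap(sweep)
print(f"use tap {best}, about {tap_delay_ps(best)} ps of delay")
```

With these numbers the full delay line spans roughly 5 ns, which bounds how much clock-to-data skew this kind of adjustment could absorb.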

Xilinx engineered the ML461 Advanced Memory Development System to demonstrate high-speed memory interfaces with Virtex-4 FPGAs. These include interfaces with Micron’s PC3200 and PC2-5300 DIMM modules, DDR400 and DDR2-533 components, and RLDRAM II devices.

In addition to these interfaces, the ML461 also demonstrates high-speed QDR-II and FCRAM-II interfaces to Virtex-4 devices. The ML461 system, which also includes a full suite of reference designs for the various memory devices as well as the memory interface generator, will help you implement flexible, high-bandwidth memory solutions with Virtex-4 devices.

Please refer to the RLDRAM information pages at www.micron.com/products/dram/rldram/ for more information and technical details.


FEATURE/OPTION               DDR                                   DDR2
Data Transfer Rate           266, 333, 400 MHz                     400, 533, 667, 800 MHz
Package                      TSOP and FBGA                         FBGA only
Operating Voltage            2.5V                                  1.8V
I/O Voltage                  2.5V                                  1.8V
I/O Type                     SSTL_2                                SSTL_18
Densities                    64 Mb-1 Gb                            256 Mb-4 Gb
Internal Banks               4                                     4 and 8
Prefetch (MIN Write Burst)   2                                     4
CAS Latency (CL)             2, 2.5, 3 Clocks                      3, 4, 5 Clocks
Additive Latency (AL)        No                                    0, 1, 2, 3, 4 Clocks
READ Latency                 CL                                    AL + CL
WRITE Latency                Fixed                                 READ Latency – 1 Clock
I/O Width                    x4/x8/x16                             x4/x8/x16
Output Calibration           None                                  OCD
Data Strobes                 Bidirectional Strobe (Single-Ended)   Bidirectional Strobe (Single-Ended or Differential) with RDQS
On-Die Termination           None                                  Selectable
Burst Lengths                2, 4, 8                               4, 8

[Figure 1 board labels: DDR2 SDRAM DIMM, DDR SDRAM DIMM, FCRAM II, QDR II SRAM, RLDRAM II, DDR2 SDRAM, DDR SDRAM]

Table 1 – DDR/DDR2 feature overview
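The DDR2 latency rows in Table 1 can be expressed as a small calculation: READ latency is additive latency plus CAS latency, and WRITE latency is one clock less than READ latency. The following sketch encodes just those two rows; the function name and the range checks are illustrative choices rather than part of any Micron or Xilinx tool.

```python
DDR2_CAS_LATENCIES = (3, 4, 5)             # clocks, from Table 1
DDR2_ADDITIVE_LATENCIES = (0, 1, 2, 3, 4)  # clocks, from Table 1


def ddr2_latencies(cas_latency, additive_latency=0):
    """Return (read_latency, write_latency) in clocks for DDR2 per Table 1."""
    if cas_latency not in DDR2_CAS_LATENCIES:
        raise ValueError("CAS latency must be 3, 4, or 5 clocks")
    if additive_latency not in DDR2_ADDITIVE_LATENCIES:
        raise ValueError("additive latency must be 0 to 4 clocks")
    read_latency = additive_latency + cas_latency
    write_latency = read_latency - 1
    return read_latency, write_latency


read_clocks, write_clocks = ddr2_latencies(cas_latency=4, additive_latency=2)
print(f"READ latency {read_clocks} clocks, WRITE latency {write_clocks} clocks")
```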

Figure 1 – ML461 Advanced Memory Development System


by Matt DiPaolo
APD Product Application Engineer
Xilinx, [email protected]

Ryan Carlson
Director of Marketing, High Speed Serial I/O
Xilinx, [email protected]

Xilinx® introduced FPGAs with integrated multi-gigabit serial transceivers (MGTs) more than three years ago. Since then, Virtex-II Pro™ devices have enabled hundreds of applications to move from parallel interfaces to high-speed serial interfaces, as designers took advantage of the integrated RocketIO™ transceivers.

With Virtex-II Pro devices, Xilinx led the industry with a transceiver capable of 622 Mbps to 3.125 Gbps operation. Xilinx continues this trend with its new Virtex-4™ family, in which RocketIO transceivers can operate from 622 Mbps to over 10 Gbps (Figure 1). This broad speed range – coupled with a host of user-friendly, programmable options – creates an extremely flexible multi-gigabit transceiver.

Multiple Interface Standards
One trend occurring in multiple end-market segments is the widespread adoption of high-speed differential signaling schemes to address increased bandwidth demands. As designs move to faster interface speeds, a serial implementation saves power, board space, design complexity, and ultimately cost.

Virtex-4 RocketIO transceivers were designed to enable high-speed data transmission for many different protocols. Table 1 shows all of the serial standards supported in Virtex-4 FPGAs.

Harvesting the Flexibility of Virtex-4 RocketIO Transceivers

New features include support for all major serial I/O standards and multiple encoding schemes.


Flexibility and Programmability
Xilinx brings its approach to FPGAs – making them user-programmable, with maximum flexibility – to its multi-gigabit transceivers. This approach has impacted both of the major functional components of the RocketIO transceiver: the physical media attachment (PMA) block and the physical coding sublayer (PCS) block.

PMA Block
The Virtex-4 RocketIO PMA block supports all major serial I/O standards and is compliant to their physical layer requirements. For example, the RocketIO transceiver meets the OC-48 SONET/SDH specification (2.488 Gbps) for both transmit jitter generation and receive jitter tolerance.

This same transceiver can also meet the requirements of the Fibre Channel physical layer specification, and it can do so at 1.0625 Gbps, 2.125 Gbps, 4.25 Gbps, and 8.5 Gbps.

Other PMA features of the Virtex-4 RocketIO transceiver include:

• Programmable transmit pre-emphasis (3-tap)

• Programmable active receive equalization

• Programmable decision-feedback equalization (DFE)

• Integrated receiver AC-coupling capacitors (user-bypassable)

• PCI Express-compliant electrical idle support

• PCI Express-compliant beaconing support

• PCI Express-compliant spread spectrum clocking support

• Multiple loopback modes, including a PMA Rx to Tx path

PCS Block
The Virtex-4 RocketIO PCS block supports multiple encoding schemes; both 8B10B and 64B66B encoders/decoders are built into the transceiver. You can select a 10-bit based data path (for Ethernet and data communications protocols) or a 16-bit based data path (for SONET/SDH-based protocols).

User-programmable clock correction sequences (CCS) allow synchronization differences between remote transceivers to be tolerated and corrected. Channel bonding sequences (CBS) enable you to connect multiple RocketIO transceivers together to create a logical channel with even more bandwidth. All of these features are compliant to industry standards (making designs easier to complete), while still supporting proprietary designs.

For applications requiring lower latency, a new feature of the Virtex-4 RocketIO transceiver is a reduced latency mode that allows you to bypass the receive and transmit FIFOs (as well as other function blocks), offering a 50% reduction in latency from previous generations of Xilinx transceivers.

Other PCS features of the Virtex-4 RocketIO transceiver include:

• Multiple loopback modes, including a PMA Rx to Tx path

• Comma detection, including A1A1A2A2 for SONET applications

• Clock correction/channel bonding receive elastic buffer

• Autonomous CRC-32 blocks (one for transmitted data and one for received data)

• Dynamic configuration bus to access every PCS attribute dynamically, including CCS and CBS

• 64B66B block sync, gearbox, encoder/decoder, and scrambler/descrambler

• 8B10B encoder/decoder

• Built-in clock dividers to reduce the need for DCMs in clocking use models

Mode                          Channels (Lanes)   I/O Bit Rate (Gbps)
SONET OC-12                   1                  0.622
Fibre Channel (1, 2, 4, 8)    1                  1.0625/2.125/4.25/8.5
Gb Ethernet                   1                  1.25
SONET OC-48                   1                  2.488
Infiniband                    1/4/12             2.5
PCI Express                   1/2/4/8/16         2.5
Serial Rapid IO               1                  1.25/2.5/3.125
Serial ATA                    1                  1.5/3
XAUI (10 Gb Ethernet)         4                  3.125
XAUI (10 Gb Fibre Channel)    4                  3.1875
SONET OC-192                  1                  9.95328
10 Gb Ethernet                1                  10.3125

Table 1 – Example supported standards of the Virtex-4 RocketIO transceiver

Figure 1 – Evolution of the RocketIO transceiver

Figures 2 and 3 show block diagrams of the Virtex-4 PCS (both receiver and transmitter).
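The 8B10B and 64B66B encoding schemes trade some line rate for transitions and framing: 8B10B carries 8 payload bits in every 10 line bits, and 64B66B carries 64 in every 66. The sketch below applies those ratios to line rates from Table 1. The pairing of XAUI with 8B10B and serial 10 Gb Ethernet with 64B66B reflects the usual practice for those standards rather than anything stated in the table, and the function name is illustrative.

```python
from fractions import Fraction

# Payload bits per line bit for each encoding scheme.
ENCODING_EFFICIENCY = {
    "8B10B": Fraction(8, 10),
    "64B66B": Fraction(64, 66),
}


def payload_gbps(line_rate_gbps, encoding, lanes=1):
    """Aggregate payload rate after removing encoding overhead."""
    rate = Fraction(str(line_rate_gbps)) * ENCODING_EFFICIENCY[encoding] * lanes
    return float(rate)


# Line rates taken from Table 1; encodings are the assumed pairings.
print("Gb Ethernet:    ", payload_gbps(1.25, "8B10B"), "Gbps")
print("XAUI, 4 lanes:  ", payload_gbps(3.125, "8B10B", lanes=4), "Gbps")
print("10 Gb Ethernet: ", payload_gbps(10.3125, "64B66B"), "Gbps")
```

The comparison shows why 64B66B is attractive at 10 Gbps: it reaches the same payload on one lane with about 3% overhead, where 8B10B spends 20% of the line rate on encoding.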

Conclusion
The Virtex-4 RocketIO transceiver is the complete solution for today’s high-speed serial designs, with a broad speed range (622 Mbps to over 10 Gbps) and programmable PCS functions (optional encoding schemes, channel bonding, and clock correction).

For more information about the Virtex-4 FPGA family, visit www.xilinx.com/virtex4/. For more details about the functionality and design recommendations with Virtex-4 RocketIO transceivers, see the Virtex-4 RocketIO transceiver user guide at www.xilinx.com/bvdocs/userguides/ug076.pdf.


[Figure 2 block labels: PMA and PMA attributes, comma detect/align, 10G block sync, 8B10B decode, 10G descramble, 10G decode, channel bonding and clock correction with a 16x52-bit ring buffer, sync control logic, dynamic configuration, RXP/RXN inputs, clock and reset inputs, data and status outputs. Callouts: user-selectable alignment and clock correction (enables Aurora, Ethernet, Fibre Channel, and SONET); low-latency bypass modes for custom designs.]

[Figure 3 block labels: 8B10B and 64B66B coding blocks, 6x40-bit ring buffer, 10GbE gearbox, 10GbE scrambler, PMA and PMA attributes, dynamic configuration, TXP/TXN outputs, clock and reset inputs, data and status. Callouts: built-in support for multiple protocols; low-latency bypass modes for custom designs; real-time reconfiguration of RocketIO settings (e.g., Rx EQ).]

Figure 2 – Virtex-4 RocketIO PCS (receiver)

Figure 3 – Virtex-4 RocketIO PCS (transmitter)

Xilinx Events and Tradeshows

Xilinx participates in numerous trade shows and events throughout the year. This is a perfect opportunity to meet our silicon and software experts, ask questions, see demonstrations of new products and technologies, and hear other customers’ success stories with Xilinx products. For more information and the most up-to-date schedule, visit www.xilinx.com/events/.

Worldwide Events Schedule

North America

Jan. 31 - Feb. 3    DesignCon West, Santa Clara, CA
February 15-17      TI Developers Conference, Houston, TX
March 1-3           Intel Developer Forum, San Francisco, CA
March 8-10          Embedded Systems Conference, San Francisco, CA

Europe

Jan. 31 - Feb. 2    Elektronik Systeme im Automobil, Munich, Germany
February 1-3        EP05 Electronic Exhibition, Stockholm, Sweden
February 14-17      3GSM World Congress, Cannes, France
February 22-24      Embedded World, Nuremberg, Germany
March 16-17         Workshop SoC Défense, Brussels, Belgium
March 16-17         Hi-Tech Technologies, Tel Aviv, Israel
March 17-18         AMAA Conference and Exhibition, Berlin, Germany

Japan

January 29-30       EDSF, Yokohama, Japan
February 15         Processor Seminar, Osaka, Japan
February 21         Processor Seminar, Tokyo, Japan


by Scott Beekman
Business Development Manager
Toshiba America Electronic Components, [email protected]

Among the many cost/performance trade-offs system designers face, one of the critical decisions in network systems, communications equipment, and high-performance consumer electronics is the type of memory to use to ensure that performance can keep pace with the processor.

Traditionally, network system designers had to choose between dynamic random access memory (DRAM), available at a lower cost-per-bit because of the high volumes used in personal computers, or higher performance static random access memory (SRAM), available only in low densities and at a much higher cost. A combination of the two is typically used, with DRAM for buffer memory and SRAM for look-up table (LUT) memory.

More recently, high-performance, low-latency DRAM solutions developed specifically for high-bandwidth applications, including Toshiba’s™ Network FCRAM™ (fast cycle random access memory), provide another alternative. Which type of memory is right for your particular system? What additional requirements for memory controllers are associated with each choice?

Optimize Memory Subsystem Performance with Network FCRAM


Toshiba’s Network FCRAM often provides the best cost/performance by combining DRAM densities with random cycle performances that approach SRAM speeds.


Generally, you can choose the option that provides the highest performance within the system’s specified cost constraints, and in the time available to bring the system to market. In many cases, Network FCRAM provides the best cost/performance for networking and communications customers by combining DRAM densities with random cycle performance that approaches SRAM speeds. This allows equipment manufacturers to develop higher performance, lower cost, and lower power communications systems than they could with double-data-rate synchronous dynamic RAM (DDR SDRAM) and high-speed static RAM (HSSRAM).

In this article, we provide an overview of Network FCRAM and the advantages it offers in comparison to standard DDR SDRAM or high-speed SRAM, and discuss the alternatives available for memory controllers supporting Network FCRAM.

Network FCRAM
Toshiba Network FCRAM is a high-performance, low-cost replacement for DDR SDRAM and high-speed SRAM, targeted primarily at buffer memory and LUT memory in networking/telecom applications. Network FCRAM incorporates enhanced DRAM technology optimized for the high-bandwidth, low-latency requirements of network and communication systems. Narrowing the active memory area achieves low power consumption and random cycle time performance almost triple that of standard DRAM.

Network FCRAM devices offer the following advantages:

• Fast random cycle time (tRC) of 20 ns to 25 ns

• Fast data transfer rate of 666 Mbps+ (For purposes of measuring data transfer rate in this context, megabit per second and/or Mbps = 1,000,000 bits per second.)

• Large density up to 512 Mb (When used in relation to memory density, megabit and/or Mb means 1,024 x 1,024 = 1,048,576 bits. Usable capacity may be less. For details, please refer to specifications.)

• Simplified command input

• Low power consumption

• Multiple sources

Network FCRAM technology excels in applications where you need DRAM densities and random cycle performance approaching SRAM-like speeds. Its high bandwidth and low latency make Network FCRAM suitable for network applications, cache applications, and high-performance consumer applications. Typical network equipment applications include packet buffer memory, table look-up memory, and external cache memory in servers. Network FCRAM is also being used in digital consumer and supercomputer applications.

Performance Comparison
Network FCRAM and the specification-compatible, dual-source Samsung™ Network DRAM™ feature some of the shortest cycle times and latencies among existing DRAMs. As a result, Network FCRAM can improve system performance approximately 20 to 25 percent in comparison to DDR SDRAM. This is achieved as a result of higher data transfer rates, as shown in Figure 1, and an approximately threefold faster random cycle time (tRC), as shown in Figure 2.

As an alternative to HSSRAM, Network FCRAM costs approximately 1/10th to 1/16th as much per bit, and offers much higher densities (up to 512 Mb) compared to maximum densities of 36 Mb or 72 Mb for HSSRAM. Network FCRAM offers not only performance improvement alternatives but also lower-cost solutions, as shown in Figure 3.

Customers today are taking advantage of these features to boost performance and bring down their system’s cost, either by replacing DDR SDRAM with Network FCRAM (reducing chip count and board space because of Network FCRAM’s higher performance) or by replacing HSSRAM.


[Figure 1 chart: data transfer rate (Mbps, 0 to 1200) by year (2000 to 2005) for SDR, DDR, DDR-II, and Network FCRAM]

Figure 1 – Faster data transfer rates with Network FCRAM

[Figure 2 chart: random cycle time (ns, 0 to 80) by year (2000 to 2005) for SDR, DDR, DDR-II, and Network FCRAM]

Figure 2 – Network FCRAM typically provides 20 to 25 percent higher system performance than DDR SDRAM offers, in part because of its faster random cycle time (approximately three times faster).


Selecting the Right FCRAM
Network FCRAM is available with a selection of interfaces, speeds, and organizations to meet various requirements:

• 256 Mb (x8/x16) Network FCRAM1 (up to 400 Mbps with tRC = 25 ns)

• 288 Mb (x18) Network FCRAM2 (up to 666 Mbps with tRC = 20 ns)

• 288 Mb (x36) Network FCRAM2 (up to 666 Mbps with tRC = 20 ns)

• 512 Mb (x8/x16) Network FCRAM1 (up to 533 Mbps with tRC = 22.5 ns)

Network FCRAM1 supports non-ECC bit densities (such as 256 Mb and 512 Mb as a single component), while Network FCRAM2 supports ECC bit densities (such as 288 Mb with roadmaps to higher densities).
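The figures in the list above can be turned into comparable numbers with a little arithmetic. The sketch below uses the article's own unit definitions (Mbps as 1,000,000 bits per second, Mb of density as 1,048,576 bits) and treats the quoted Mbps values as per-pin rates, which is an assumption; the helper names are illustrative.

```python
BITS_PER_MBPS = 1_000_000         # data rate definition used in this article
BITS_PER_MB_DENSITY = 1024 * 1024  # density definition used in this article


def random_cycles_per_second(trc_ns):
    """Back-to-back random accesses to one bank are limited by tRC."""
    return 1e9 / trc_ns


def fcram_peak_bandwidth_bps(per_pin_mbps, width_bits):
    """Peak bandwidth, assuming the quoted rate applies to each data pin."""
    return per_pin_mbps * BITS_PER_MBPS * width_bits


def density_bits(density_mb):
    """Convert a density in Mb to bits."""
    return density_mb * BITS_PER_MB_DENSITY


# 288 Mb (x36) Network FCRAM2: up to 666 Mbps with tRC = 20 ns
print(f"{random_cycles_per_second(20) / 1e6:.0f} M random cycles/s per bank")
print(f"{fcram_peak_bandwidth_bps(666, 36) / 1e9:.3f} Gbps peak")
print(f"{density_bits(288):,} bits")
```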

Memory Controllers
Once you have selected Network FCRAM as the memory of choice for a design, the next step is to determine the best source of a memory controller for your system. For large-volume applications, some customers develop custom ASICs that include the memory controller; in addition, many network processors (NPUs) now support Network FCRAM. However, for many smaller volume applications, FPGAs offer lower cost and faster time to market.

Xilinx® Virtex-II™, Virtex-II Pro™, and Virtex-4™ FPGAs interface to Network FCRAM.

When evaluating memory alternatives for network systems, consider the performance advantages of Network FCRAM and the time-to-market advantages of an FPGA-based memory controller.

Development Tools
Toshiba offers several design guides to help customers and systems architects identify the key advantages of incorporating Network FCRAM technology into their high-performance applications. Network FCRAM devices are also supported by advanced simulation models to facilitate and accelerate design-in activity. Models supported include Verilog™, HSPICE™, and IBIS models, and SOMA models jointly developed by Toshiba and Denali™ Software Inc. For more information, visit www.fcram.toshiba.com.

Conclusion
As a result of Network FCRAM’s cost-performance advantages, today it is designed into more than 100 network solutions at more than 70 companies. Toshiba first introduced Network FCRAM working samples in 1999 and has continued to expand its product offering and build momentum in the network/telecom market.

Today, Network FCRAM is in production with data transfer rates as high as 666 Mbps and random cycle time performance as low as 20 ns. Toshiba now supports three densities in mass production, with higher density, higher bandwidth, and faster devices planned for 2005.

The official Network FCRAM/DRAM website can be found at www.networkfcram.com.



Virtex-II and Virtex-II Pro are trademarks of Xilinx, Inc.

[Figure 3 chart: random cycle time (ns, 0 to 60) against bit cost (low to high) for 18 Mb NtRAM, 256 Mb DDR1 SDRAM, and 288 Mb Network FCRAM. Annotations: higher performance (tRC is 3 times faster); lower cost per bit (10 to 16 times or less).]

FCRAM (Fast Cycle RAM ) is a trademark or a registered trademark of Fujitsu Limited, Japan. Memory Modeler AV is a trademark of Denali Software Inc. Network DRAM is a trademark or a registered trademark of Samsung Electronics Co., Ltd. Korea.

Figure 3- Network FCRAM can also be a lower cost alternative to HSSRAM, as it costs approximately 1/10th to 1/16th as much per bit.


by Suhel Dhanani
Sr. Marketing Manager, Spartan Solutions
Xilinx, [email protected]

All low-cost FPGAs provide basic logic capability at attractive prices and serve a broad range of general-purpose design requirements. When you consider embedding DSP functions in an FPGA fabric, however, you may believe that you must choose high-end FPGAs to get platform features such as embedded multipliers and distributed memory.

With Spartan-3™ FPGAs, the landscape for embedded DSP has changed. Spartan-3 devices may be low cost, but they also have the platform features required for DSP designs. These platform features allow area-efficient implementation of signal processing functions – allowing you to realize significantly lower price points.

Spartan-3 devices are ideal as coprocessors or pre-/post-processors, offloading highly computational functions from a programmable DSP to enhance system performance.

Using Spartan-3 FPGAs to Implement High-Performance DSP

Spartan-3 FPGAs provide breakthrough cost points for embedded DSP.

Optimized for DSP
The Spartan-3 family from Xilinx uses 90 nm process technology in conjunction with 300 mm wafers to dramatically lower the cost of FPGAs. At the same time, the devices incorporate key DSP resources such as embedded 18 x 18-bit multipliers and large blocks (18 kb) of memory, distributed RAM, and shift-register logic. This advanced feature set means that you can use Spartan-3 FPGAs to implement DSP algorithms at a significantly lower cost than competing FPGAs. The specific features that help in efficiently implementing DSP are shown in Figure 1.

In addition to increasing the basic performance of systems, these embedded features enhance device utilization. For instance, the embedded Spartan-3 multiplier would take 300-400 logic elements (LEs) if implemented in the logic fabric. And because the embedded multiplier is adjacent to logic fabric, augmenting the functionality (such as creating accumulators or concatenating the multipliers to create complex arithmetic functions) is fairly straightforward.

Many DSP functions are best implemented in pipelines with time multiplexing for efficiency. This allows you to create faster systems with higher bandwidth, but it comes at the expense of requiring more interim storage elements. For example, a time-multiplexed filter would store the results of individual multiply-accumulate cells in shift registers. Such designs can run out of registers or memory before they run out of logic resources. The Spartan-3 FPGA family is unique in providing a mode where a single look-up table (LUT) is capable of implementing logic functions or acting as a 16-bit shift register.

As shown in Figure 2, this architecture enhancement allows you to use a single LUT in place of 16 registers – maximizing area efficiency when implementing time-multiplexed DSP functions.

Many DSP functions are also extremely memory-intensive – requiring scratch-pad memory for storing coefficients and for implementing FIFOs and large buffers. As shown in Figure 3, Spartan-3 devices provide more memory bits than other low-cost FPGAs available today.

For many DSP designs, the critical resource is the embedded memory within the FPGA – not logic or multipliers. Because of insufficient memory, designers using competing low-cost devices may have to migrate to a larger device or use external memory for systems that would fit into a single, small Spartan-3 FPGA.


[Figure 2 diagram labels: one LUT with inputs D, CE, and A3 to A0 and output Q, shown in place of a chain of 16 registers (D to Q); coefficients k0 to k3]

[Figure 3 chart: embedded memory (kb, 0 to 700) against logic elements (LEs, 0 to 35,000) for Spartan-3 devices XC3S50, XC3S200, XC3S400, XC3S1000, and XC3S1500 and for a competing low-cost FPGA family. Data labels: 72K, 216K, 288K, 432K, 576K and 59K, 78K, 92K, 239K, 294K.]

Figure 1 – Spartan-3 architecture optimized for lower DSP costs

Figure 2 – You can implement 16 registers in one LUT.

Figure 3 – Spartan-3 fabric provides significantly more memory resources than other competing low-cost FPGAs.

Common DSP Functions
Let’s see how these features impact device utilization by looking at two implementation examples of a finite impulse response (FIR) filter. One is a MAC-based implementation, while the other is a multi-channel distributed arithmetic (DA) implementation.

FIR filters are commonly used in base stations, digital video, wireless LANs, xDSL, and cable modems. Our benchmark is the implementation of a 64-tap MAC FIR filter with 16-bit data and coefficients running at 130 MHz in a Spartan-3 XC3S400 FPGA. The first implementation uses a single MAC; the second implementation uses four MACs. Figure 4 shows the device utilization section of the report file for both implementations.

Going from a one-MAC to a four-MAC implementation dramatically increases the performance of the FIR filter. The number of LUTs only doubles and remains at just 4% of the total available logic. A four-MAC implementation uses four block RAMs and four multipliers to efficiently implement the FIR filter using minimum device logic resources.
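The performance gain from adding MACs can be sketched with a simple behavioral model. In a time-multiplexed MAC FIR, each MAC processes its share of the taps sequentially, so the sample rate is roughly the clock rate divided by the taps per MAC. The model below assumes an idealized filter with no pipeline or control overhead cycles, and its function names are illustrative rather than taken from any Xilinx core.

```python
import math


def mac_fir_sample_rate_hz(clock_hz, taps, num_macs):
    """Idealized throughput: one tap per MAC per clock, no overhead cycles."""
    cycles_per_sample = math.ceil(taps / num_macs)
    return clock_hz / cycles_per_sample


def mac_fir(samples, coeffs, num_macs=1):
    """Behavioral FIR in which each MAC accumulates a contiguous slice of taps."""
    taps = len(coeffs)
    taps_per_mac = math.ceil(taps / num_macs)
    history = [0] * taps
    outputs = []
    for x in samples:
        history = [x] + history[:-1]  # newest sample first
        partial_sums = []
        for m in range(num_macs):
            acc = 0
            for k in range(m * taps_per_mac, min((m + 1) * taps_per_mac, taps)):
                acc += coeffs[k] * history[k]
            partial_sums.append(acc)
        outputs.append(sum(partial_sums))  # adder combining the MAC results
    return outputs


# Benchmark parameters from the text: 64 taps at 130 MHz.
for macs in (1, 4):
    rate = mac_fir_sample_rate_hz(130e6, 64, macs)
    print(f"{macs} MAC(s): about {rate / 1e6:.3f} Msamples/s")
```

Both configurations compute the same filter output; the four-MAC version simply finishes each sample in a quarter of the clock cycles.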

Another interesting implementation is that of a multi-channel FIR function. In this case we can look at how the device utilization changes when we go from a one-channel FIR to an eight-channel FIR filter.

As shown in Figure 5, a single channel distributed arithmetic FIR filter uses 29% of the logic resources and 39% of the registers of an XC3S1000 Spartan-3 device. When implementing an eight-channel version of the same filter, we would normally time multiplex the different channels to conserve logic. But this would use a lot of registers, or a significant amount of on-chip memory, to store the intermediate results.

With Spartan-3 FPGAs, the intermediate results are stored in LUTs configured as 16-bit shift registers (SRL16). This allows the eight-channel version of the same filter to be implemented using only 10% more of the available logic and only 7% more of the available registers – 8x more channels for only 25% more device resources (see Figure 6).

This dramatic savings is directly relatedto the use of the SRL-16s available in theSpartan-3 device. In the report file, youcan see that an additional 1,343 LUTs areused in the SRL-16 mode for the eight-channel implementation.

Implementing this design in an FPGAwithout SRL16 capability would requirean additional 10,744 (1343 x 8) flip-flopsused as storage elements, demanding amassive device for the register count andlikely squandering the associated combina-torial logic resources.
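To make the storage argument concrete, here is a small behavioral model in Python of a LUT used as a 16-bit addressable shift register, together with the flip-flop estimate quoted above. The class and function names are hypothetical, and the model is only a sketch of the idea: with eight channels interleaved, the previous bit of the same channel sits eight positions down the shift register.

```python
class SRL16:
    """Behavioral model of one LUT configured as a 16-bit shift register."""

    DEPTH = 16

    def __init__(self):
        self.bits = [0] * self.DEPTH

    def shift(self, bit_in):
        """Shift in one bit; the oldest bit falls off the end."""
        self.bits = [bit_in & 1] + self.bits[:-1]

    def read(self, addr):
        """Read the bit that was shifted in addr shifts before the latest one."""
        return self.bits[addr]


def flip_flops_replaced(srl_luts, channels):
    """Flip-flop estimate using the same arithmetic as the text (LUTs x channels)."""
    return srl_luts * channels


# Interleave eight channels: one bit per channel per time slot.
srl = SRL16()
for channel_bit in [1, 0, 0, 0, 0, 0, 0, 0]:
    srl.shift(channel_bit)

channel0_previous = srl.read(7)                 # channel 0's bit, seven shifts old
extra_flip_flops = flip_flops_replaced(1343, 8)
```

A single LUT therefore holds the interleaved history that would otherwise occupy a chain of discrete flip-flops, which is where the 1,343 x 8 figure comes from.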

Conclusion

The Spartan-3 architecture is optimized to give you very high area efficiency when implementing signal processing functions. By combining these DSP-friendly system features with low unit costs, Spartan-3 FPGAs enable the industry’s lowest price points for high-performance DSP functions. This allows a Spartan-3 device to act as a low-cost but highly efficient and high-performance co-processor to a programmable DSP processor.


Figure 4 – Using the embedded multipliers and block RAM features of the Spartan-3 fabric for higher performance DSP functions (panels: excerpt from the one-MAC implementation report file; excerpt from the four-MAC implementation report file)

Figure 5 – This single channel DA FIR filter uses 29% of the logic and 39% of the registers in a Spartan-3 XC3S1000 device.

Figure 6 – The eight channel version of the same DA FIR filter only uses 10% more logic and 7% more registers.

1 800 332 8638
www.em.avnet.com

© Avnet, Inc. 2004. All rights reserved. AVNET is a registered trademark of Avnet, Inc.

Enabling success from the center of technology TM

Xilinx® Virtex-4™ SpeedWay Seminar™

This seminar will explore the following topics: integrated PowerPC™ processors, the world’s most popular embedded processor architecture, next generation Xtreme™ DSP technology, Advanced Silicon Modular Block (ASMBL) architecture and RocketIO™ serial transceivers.

ADS-SPDWY-V4-INTRO: Xilinx Virtex-4 SpeedWay Seminar, FREE

ADS-XLX-V4LX-EVL25: Xilinx Virtex-4 LX25 Evaluation Kit, $299.00 USD*

ADS-BASEX-BUNDLE: Xilinx Virtex-4 LX25 Evaluation Kit bundled with ISE BaseX (only available with purchase of Virtex-4 LX25 Evaluation Kit), $550.00 USD*

ADS-FOUNDATION-BUNDLE: Xilinx Virtex-4 LX25 Evaluation Kit bundled with ISE Foundation (only available with purchase of Virtex-4 LX25 Evaluation Kit), $2,400.00 USD*

Xilinx is revolutionizing the fundamentals of FPGA economics with the Virtex-4™ family. To help you get a jumpstart on your next design, Avnet Electronics Marketing has created the Virtex-4 LX Evaluation Kit and a SpeedWay Seminar™.

The Virtex-4 SpeedWay Seminar will allow you to:

• Learn about the Virtex-4 product family features

• Learn how to use Virtex-4 in your specific application

• Learn about the key features of the new Xilinx ISE 6.3i integrated software environment

For your convenience, the seminar can take place at your location at a time of your choosing.

SpeedWay Seminar Registration - www.em.avnet.com/v4speedway

Kit Information and Purchases - www.em.avnet.com/virtex4lx

Ready.

Set.

Go to market.™

Get Started Now with Xilinx®

Virtex-4™ FPGAs

*Pricing valid only within 60 days of attending a seminar.

Part Number Description

• Multi-Platform FPGA family

• Support for (3) application domains

• 90 nm process technology

• Reduced power consumption

• Dramatic reduction in cost per function

Virtex-4 FPGAs Virtex-4LX25 Evaluation Kit

Support Across The Board.™

• Virtex-4LX25 FPGA

• 8 MB Flash and 32 MB DDR SDRAM

• Cypress CY7C68013 USB 2.0 controller

• National Semiconductor DP83847 10/100 Ethernet PHY

• 128x64 OSRAM graphical display

Special Pricing for Seminar Attendees*

Shrinking budgets and design cycles make evaluating, designing, and testing complex systems more challenging than ever before. Xilinx® provides the answer with the Virtex-4™ ML401 evaluation platform. Powered by the XC4VLX25 device and incorporating industry-standard peripherals, connectors, and interfaces, the Virtex-4 ML401 evaluation platform provides a rich feature set that spans a wide range of applications.

Xilinx also provides expert guidance to designers with hardware-verified reference designs, application notes, and user-friendly tools.

The Virtex-4 ML401 evaluation platform specifications include:

• Xilinx devices

– XC4VLX25-FF668-10C, XC95144XL, XCCACE (SystemACE CF solution), XCF32P (Platform Flash)

• Clocks

– 100 MHz oscillator, extra clock socket

• Memory

– 64 MB DDR SDRAM, 1 MB ZBT SRAM, 32 MB CompactFlash, 8 MB Flash, 4 kb IIC EEPROM, 32 Mb Platform Flash

• Display

– 16 x 2-character LCD

• Connectors and Interfaces

– Four SMA connectors (differential clocks), two PS/2 connectors (keyboard/mouse), LVDS personality module, audio (line in, line out, microphone, headphone), RS-232 serial port, USB (one host and two peripheral), Parallel Cable-IV header, DB15 VGA display, RJ-45 Ethernet port



The Virtex-4 ML401 evaluation platform is a low-cost, full-featured development system.

Virtex-4 ML401 Evaluation Platform

Features

• Support for multiple clock sources and differential clock inputs

• Memory interfaces for DDR SDRAM, ZBT SRAM, and Linear Flash

• Multiple FPGA configuration modes: Platform Flash, System ACE™ CF solution, Linear Flash, and Parallel Cable-IV

• Audio and video interfaces

• Multiple user interfaces: dual PS/2, IIC Bus, RS-232, USB, and tri-mode Ethernet

• High-speed expansion module interface supporting single-ended and LVDS I/O standards

• Reference designs and IP cores for numerous applications speed up your design cycle

• A comprehensive suite of application notes guides you every step of the way

• Demonstrations ship in Platform Flash, Linear Flash, and System ACE CF solution

Order your Virtex-4 ML401 evaluation platform today to get a head start on your design. For more information about the Virtex-4 FPGA family, visit www.xilinx.com/virtex4/.

T H E B O A R D R O O M

Today’s telecom and networking systems use high-bandwidth interfaces based on LVDS, HyperTransport™, and other differential I/O standards. These standards simplify system design by lowering pin count and power consumption and improving signal integrity.

Protocols based on these standards, such as SPI-4.2, RapidIO™, and HyperTransport, are central to leading-edge system design.

Xilinx® Virtex-4™ FPGAs offer up to 1 Gbps SelectIO™ parallel I/O, with the flexibility to use any I/O pair as differential I/O. Additional benefits for higher level protocol implementation include:

• ChipSync™ source-synchronous I/O technology for dynamic precision phase alignment and data centering with per-bit de-skew

• Bitslip module supports training patterns

• Internal SerDes modules and regional clocks enable 1 Gbps DDR bandwidth
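The word-alignment idea behind a bitslip function can be sketched in a few lines of Python: the receiver rotates the deserialized word one bit at a time until it matches a known training pattern. This is a simplified model with hypothetical function names; it assumes a plain one-bit rotation per slip, and the slip sequence of the actual Virtex-4 hardware may differ.

```python
def bitslip(word, width):
    """Rotate a width-bit word left by one bit position."""
    mask = (1 << width) - 1
    return ((word << 1) | (word >> (width - 1))) & mask


def align_to_training(received, pattern, width):
    """Return how many slips align received to pattern, or None if none do."""
    word = received
    for slips in range(width):
        if word == pattern:
            return slips
        word = bitslip(word, width)
    return None


# Training pattern 0x2C arrives rotated right by three bit positions (0x85).
slips_needed = align_to_training(0x85, 0x2C, 8)
```

Once the slip count is found during training, the same alignment is simply held for normal data transfer.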

The Virtex-4 FPGA source-synchronous interfaces tool kit comes with the following Xilinx Productivity Advantage (XPA) options:

• ML450 platform, including Compact Flash, clock modules, documentation, reference designs, cables, and evaluation software

• ISE™ Foundation™ software

• IP cores: SPI-4.2, RapidIO, and GFP

• Training, Premium, and Titanium Services

• Check with your Xilinx sales representative for availability



Achieve faster, easier implementation with source-synchronous interfaces.

Virtex-4 FPGA Source-Synchronous Interfaces Tool Kit

Features

• Design with major differential I/O standards in networking, computing, storage, and wireless

• Pre-engineered IP and reference designs

• A unique built-in silicon feature enables 1 Gbps performance

Buy the source-synchronous interfaces tool kit today to get started on your design. For more information about the kit, the Virtex-4 FPGA family, ChipSync technology, and available optional IP, visit www.xilinx.com/virtex4/.


Building interfaces to high-performance memory devices presents challenges such as high-speed synchronous data capturing, along with implementing complex physical-layer interfaces and control logic.

Virtex-4 FPGAs solve these challenges with advanced silicon capabilities, including ChipSync™ source-synchronous technology, Xesium clocking, and Smart RAM.

• ChipSync technology provides 80 ps resolution for clock-to-data alignment, ensuring reliable data capture

• 500 MHz Xesium differential global clocks minimize skew and jitter, providing increased design margins

• 500 MHz Smart RAM blocks have built-in FIFO functionality, minimizing design size

• Column-based I/O eliminates memory interface placement restrictions, alleviating board congestion
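One common way to use a fine-grained delay line for clock-to-data alignment is to sweep the delay settings, note which ones capture a known pattern correctly, and park in the middle of the widest passing window. The Python sketch below illustrates that idea using the 80 ps step quoted above. The sweep results, the function names, and the procedure itself are illustrative assumptions, not a description of the ChipSync hardware or of any Xilinx reference design.

```python
TAP_PS = 80  # delay resolution quoted in the text, in picoseconds


def center_tap(passing):
    """Return the middle tap of the longest run of passing delay settings."""
    best_start, best_len = 0, 0
    run_start = None
    # A trailing False closes any run that reaches the last tap.
    for i, ok in enumerate(list(passing) + [False]):
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    if best_len == 0:
        return None
    return best_start + (best_len - 1) // 2


def delay_ps(tap):
    """Delay added by a given tap setting under the 80 ps-per-step assumption."""
    return tap * TAP_PS


# Hypothetical sweep: taps 2 through 6 captured the training pattern correctly.
sweep = [False, False, True, True, True, True, True, False]
chosen = center_tap(sweep)
chosen_delay = delay_ps(chosen)
```

Picking the center of the window leaves roughly equal margin on both sides of the sampling point.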

To shorten design time, Xilinx provides expert guidance in the form of free hardware-verified reference designs, application notes, user-friendly tools, and advanced development systems. This combination of unique silicon capabilities and comprehensive support enables you to build and verify robust memory interfaces quickly and easily.

The advanced memory development system, ML461, offers an excellent platform to develop and verify high-performance memory interfaces.

Xilinx also offers a menu-based tool, the memory interface generator, to further customize reference designs (Figure 2). The tool generates the pin placement file and a complete modular set of HDL files.


Virtex-4 FPGAs make complete memory interface solutions possible.

ML461 – Advanced Memory Development System

Features

• Memory interfaces: DDR2 SDRAM, DDR SDRAM, QDR II SRAM, RLDRAM II, FCRAM II (Figure 1)

• Four Xilinx® Virtex-4™ LX-25 devices

• JTAG interface

• System ACE™ Compact Flash card

• CD-ROM with complete documentation

• 5V power supply

You can download the reference design, application notes, memory interface generator, and other resources for memory interface designs by visiting www.xilinx.com/virtex4/. If you are interested in purchasing the ML461, please contact your local sales representative, or e-mail [email protected].


Parameter DDR2 SDRAM DDR SDRAM QDR II RLDRAM II FCRAM II

Data Rate 534 Mbps 400 Mbps 1.2 Gbps 600 Mbps 600 Mbps

CLK Rate 267 MHz 200 MHz 300 MHz 300 MHz 300 MHz

Data Width 144-bit (DIMM) 144-bit (DIMM) (72+72)-bit 36-bit 36-bit28-bit 28-bit

I/O Standard SSTL 18 SSTL 2 HSTL HSTL SSTL 18

Figure 1 – Memory architectures supported by ML461
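The relationship between the data rate and clock rate rows in Figure 1 can be checked with a short Python calculation. The dictionary below copies those two rows, and the helper name is hypothetical. Reading the QDR II ratio of four as two double-data-rate ports (separate read and write) is our interpretation, not something the table states.

```python
# (data rate in Mbps, clock rate in MHz), copied from Figure 1
INTERFACES = {
    "DDR2 SDRAM": (534, 267),
    "DDR SDRAM": (400, 200),
    "QDR II": (1200, 300),
    "RLDRAM II": (600, 300),
    "FCRAM II": (600, 300),
}


def transfers_per_clock(data_rate_mbps, clock_mhz):
    """Number of data transfers per clock cycle implied by the two rates."""
    return data_rate_mbps / clock_mhz


ratios = {name: transfers_per_clock(rate, clk)
          for name, (rate, clk) in INTERFACES.items()}
```

Every interface in the table works out to two transfers per clock except QDR II, which works out to four.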

Figure 2 – Memory interface generator


The Memec™ LC development kit for Xilinx® Virtex-4™ devices creates an easy-to-use yet effective Virtex-4 prototyping environment. The LC board provides prototype features common to most designers’ needs, with a focus on usability in real-world applications.

The kit bundles a full-featured, expandable Virtex-4-based system board with a power supply, user guide, and reference designs. Optional Xilinx ISE™ software, JTAG cable, and application-specific P160 expansion modules are also available.


The Virtex-4 LC development kit accelerates design time.

The Memec MB development kits for Xilinx Virtex-4 devices provide advanced functions and interface features for your most demanding Virtex-4 prototype needs.

The MB board is available in both LX25 and LX60 densities, and for DSP applications, the SX35.

The kit bundles an expandable Virtex-4-based system board with a power supply, user guide, reference designs, and optional ISE software and JTAG cable. The new P240 expansion module standard included on the board provides both LVDS and single-ended signals to support more challenging expansion requirements.

The Virtex-4 MB development kits give you maximum flexibility to target high-end applications.

Memec Virtex-4 Board Solutions

Virtex-4 LC Development Kit

Features

• XC4VLX25-10SF363 FPGA
• 10/100 Ethernet PHY
• 32M x 16 DDR memory
• P160 interface
• 2 x 16-character LCD
• RS232
• System ACE™ interface
• Low cost

Virtex-4 MB Development Kit

Features

• XC4VLX25, LX60, or SX35-10FF668 FPGA
• 10/100 Ethernet PHY
• 32M x 16 DDR memory
• 2M x 16 Flash memory
• P240 high-performance interface
• High-speed LVDS interface
• 2 x 16-character LCD
• RS232 and USB interface
• System ACE interface
• High performance

For more information or to order your Virtex-4 development kit from Memec, visit www.memec.com/xilinx-v4/ or call (888) 488-4133 (in the U.S.) or (858) 314-8910 (outside the U.S.).


The Virtex-4 family of FPGAs delivers powerful new capabilities for designs in the programmable logic, DSP, embedded processing, and high-speed serial I/O application domains. As a Xilinx distributor, Avnet plays a critical role in helping customers rapidly adopt the Virtex-4 solution into innovative, feature-rich end products.

Avnet is now shipping three new evaluation kits: the Virtex-4 LX25 and LX60 Evaluation Kits and the Virtex-4 SX35 Evaluation Kit (Figure 1). The LX Evaluation Kits feature an XC4VLX25 or XC4VLX60 device. These two kits are optimized for general logic integration applications.

The SX35 Evaluation Kit, which is optimized for high-performance DSP applications, uses the same board populated with a Virtex-4 XC4VSX35 device.

All three kits offer a choice of affordable, easy-to-use platforms for evaluating and experimenting with a Virtex-4 LX or SX design. And by tying in expansion cards available from Avnet, such as add-on memory, audio/video, and adapters for data conversion, these kits can serve as powerful prototyping platforms.

Purchasing any Avnet Design Kit gets you into an Avnet SpeedWay Design Workshop™ for free, where you’ll learn how to leverage Xilinx solutions using real-world design examples. SpeedWay Workshops are hardware-based and lab-oriented. You’ll work with real hardware and development tools to build actual designs and leave with an in-depth knowledge of the FPGA architecture and design methods used in the lab. For more information or to register for a SpeedWay Workshop, visit www.em.avnet.com/xlxspeedwayindex/.


Virtex-4 LX25, LX60, and SX35 Evaluation Kits are now available.

Avnet Virtex-4 Evaluation Kits Features

• Xilinx® XC4VLX25 FF668, XC4VLX60 FF668, or XC4VSX35 FF668 FPGA

• Cypress™ CY7C68013 USB 2.0 controller

• National Semiconductor™ DP83847 10/100 Ethernet PHY

• Intel™ 8 MB Flash

• OSRAM 128 x 64 graphical display

• Micron™ 32 MB DDR SDRAM

• Texas Instruments™ CDC5801 clock multiplier/divider

Avnet’s design kits and technical workshops are powerful tools that you can leverage to increase your design advantage when implementing Virtex-4-based solutions. For more information, visit www.em.avnet.com/xlxv4kits/.

Virtex-4 LX Platform

Featured Device   Avnet Part Number    Price
XC4VLX25          ADS-XLX-V4LX-EVL25   $349.00 USD
XC4VLX60          ADS-XLX-V4LX-EVL60   $599.00 USD

Virtex-4 SX Platform

Featured Device   Avnet Part Number    Price
XC4VSX35          ADS-XLX-V4SX-EVL35   $449.00 USD

Virtex-4 FX Platform

...coming soon


Support for Multiple Clock Sources and Differential Clock Inputs
• Memory interfaces for DDR2 SDRAM at 533 MHz, ZBT SRAM, and Linear Flash
• Multiple FPGA configuration modes: Platform Flash, System ACE™ CF, Linear Flash, and Parallel Cable-IV
• Audio and video interfaces
• Multiple user interfaces: dual PS/2, IIC Bus, RS-232, USB, and tri-mode Ethernet
• High-speed data acquisition expansion module interface supporting single-ended and LVDS I/O standards

Optimize Your Design with Unique Built-In Silicon Features
• ChipSync™ source-synchronous technology embedded in every I/O ensures reliable data capture
• Xesium differential global clocks minimize skew and jitter for increased design margins

Finish Faster Using Proven Reference Designs
• Reference designs and IP cores for numerous applications speed up your design cycle
• A comprehensive suite of application notes guides you every step of the way
• Demonstrations ship in Platform Flash, Linear Flash, and System ACE CF formats


Evaluate and implement your design by leveraging the ML401 board’s rich feature set.

All of the designs and related documentation for the Virtex-4 board are available on the Nu Horizons website at www.nuhorizons.com/v4/.

Nu Horizons Virtex-4 Development Platform

The NH401 from Nu Horizons Electronics Corp. is designed as a low-cost, high-value development platform to provide a demonstration of the Xilinx® Virtex-4™ LX/SX/FX family. The NH401 platform showcases the enormous power and flexibility of Virtex-4 FPGAs, including new and improved clock technology, system monitors, DSP blocks, Smart RAM blocks, advanced I/Os, embedded MACs, 10/100/1000 Ethernet MAC, RocketIO™ MGTs, and embedded processors (PowerPC™ 405 hard-core and MicroBlaze™ soft-core processors).

The NH401 is built around a Virtex-4 FPGA and is designed to offer a user-friendly and highly useful set of features at an extremely low price point. The board is envisioned to function as an easy-to-use demonstration platform, as well as a high-performance DSP development or embedded processing platform. Included with the NH401 are simple tutorials, reference designs, and interesting demos, including a full embedded computer that you can easily expand or adapt for your own applications.

Feature Summary
• XC4VLX25/40/60, FX12, SX35-FF668
• Memory
  – 64 MB DDR2 SDRAM - 533 MHz
  – 1 MB ZBT SRAM
  – 32 MB CompactFlash™
  – 8 MB Flash
  – 128 kb IIC EEPROM
  – 32 Mb Platform Flash
• VGA controller (resolutions as high as 1024 x 768 at 60 Hz)
• Audio in/out CODEC (microphone in, line-in/out, and headphone output jacks)
• LCD display (16 x 2 character)
• RS232 serial port
• 2 x PS/2 (PC keyboard and mouse)
• GPIO: 5 Buttons + 13 LEDs + 8 DIP switches
• 4 SMAs (differential clock in/out) + CLK oscillator socket
• ADC system monitor (-3V or 0-6V swing can be sampled)
• 64-bit expansion I/O connector routed for LVDS, Agilent Soft Touch connector
• 10/100/1000 Ethernet PHY
• PC4 connector (allows for JTAG debug/download via the Parallel-IV cable)
• USB host/peripheral interface
• CPLD for Flash configuration of FPGA
• High-speed frequency synthesizer - 622 MHz

Additional plug-in evaluation modules are available:
• Linear Technology high-speed A/D converters
  – 10/12/14-bit 10 to 135 Msps ADCs
• Intersil high-speed D/A interface
  – 8/10/12/14-bit 130 to 260 Msps DACs


With recent, rapid progress in memory-related technology, the SDRAM standard is shifting from SDR to DDR, further enabling the rise of DDR2 SDRAM. DDR2 is becoming the de facto industry standard thanks to its low power consumption, high speed, and reduced EMI.

The TED DDR2 memory evaluation board from HiTech Global Distribution allows you to evaluate DDR2 SDRAM with the Virtex-4 LX series (LX25/40/60). The board's DDR2 SDRAM comprises two board-mounted component chips and two DIMM modules, allowing use in various memory evaluation applications.

Additionally, a 533 Mbps reference design is planned for the board so that you can use it immediately after purchase.

We also offer a Gerber file as well as a board schematic file, which can assist you in developing high-speed interfaces for DDR2 SDRAM and FPGAs.


High-performance, easy-to-use, and low-cost platforms for the rapid evaluation of DDR-II memory devices.

TED DDR2 Memory Evaluation Board

Features

• Xilinx® Virtex-4™ LX25/LX40/LX60 in FF668 package
• 2X DDR DIMM (533 Mbps)
• 2X DDR mounted memory (533 Mbps)
• 533 Mbps DDR2 memory controller reference design
• Board schematic/Gerber/BOM files
• Various option boards (HDL reference design)
  – DVI Tx/Rx option board
  – HDMI Tx/Rx option board
  – CameraLink I/F board
  – Optical I/F board

The designs and related documentation for this board are available on the HiTech Global Distribution, LLC website at www.hitechglobal.com/ted/virtex4ddr.htm.


Xilinx Virtex-4™ FPGAs
http://www.xilinx.com/devices/

Product Selection Matrix

Families: Virtex-4 LX (Logic), Virtex-4 SX (Signal Processing), Virtex-4 FX (Embedded Processing & Serial Connectivity)

Columns: Logic Cells; Total Block RAM (kbits); Digital Clock Managers (DCM); Phase-matched Clock Dividers (PMCD); Max SelectIO™; Max Differential I/O Pairs; XtremeDSP™ Slices; Analog-to-Digital Converters (ADC); PowerPC™ Processor Blocks; 10/100/1000 Ethernet MAC Blocks (EMAC); RocketIO™ Serial Transceivers (MGT); Configuration Memory Bits; EasyPath™ Solutions. A hyphen marks an entry that does not apply.

Device    Logic Cells  Block RAM  DCM  PMCD  SelectIO  Diff. Pairs  DSP Slices  ADC  PowerPC  EMAC  MGT  Config. Bits  EasyPath
4VLX15    13,824       864        4    0     320       160          32          0    -        -     -    4,875,392     -
4VLX25    24,192       1,296      8    4     448       224          48          0    -        -     -    8,037,312     -
4VLX40    41,472       1,728      8    4     640       320          64          0    -        -     -    12,647,680    XCE4VLX40
4VLX60    59,904       2,880      8    4     640       320          64          0    -        -     -    18,315,520    XCE4VLX60
4VLX80    80,640       3,600      12   8     768       384          80          1    -        -     -    24,101,440    XCE4VLX80
4VLX100   110,592      4,320      12   8     960       480          96          1    -        -     -    31,818,624    XCE4VLX100
4VLX160   152,064      5,184      12   8     960       480          96          1    -        -     -    41,863,296    XCE4VLX160
4VLX200   200,448      6,048      12   8     960       480          96          1    -        -     -    50,648,448    XCE4VLX200
4VSX25    23,040       2,304      4    0     320       160          128         0    -        -     -    9,651,072     -
4VSX35    34,560       3,456      8    4     448       224          192         0    -        -     -    14,476,608    -
4VSX55    55,296       5,760      8    4     640       320          512         0    -        -     -    24,088,320    XCE4VSX55
4VFX12    12,312       648        4    0     320       160          32          0    1        2     0    5,017,088     -
4VFX20    19,224       1,224      4    0     320       160          32          0    1        2     8    7,641,088     -
4VFX40    41,904       2,592      8    4     448       224          48          0    2        4     12   15,838,464    XCE4VFX40
4VFX60    56,880       4,176      12   8     576       288          128         1    2        4     16   22,262,016    XCE4VFX60
4VFX100   94,896       6,768      12   8     768       384          160         1    2        4     20   35,122,240    XCE4VFX100
4VFX140   142,128      9,936      20   8     896       448          192         1    2        4     24   50,900,352    XCE4VFX140

Packages

Package   Area             MGT  Pins
SF363     17 x 17 mm       -    240
FF668     27 x 27 mm       -    448
FF676     27 x 27 mm       -    448
FF1148    35 x 35 mm       -    768
FF1513    40 x 40 mm       -    960
FF672     27 x 27 mm       12   352
FF1152    35 x 35 mm       20   576
FF1517    40 x 40 mm       24   768
FF1760    42.5 x 42.5 mm   24   896

1. Number of available RocketIO Multi-Gigabit Transceivers

Pb-free solutions are available. For more information about Pb-free solutions, visit www.xilinx.com/pbfree/.

Important: Verify all data in this document with the device data sheets found at http://www.xilinx.com/partinfo/databook.htm

Xilinx Spartan™-3 FPGAs
http://www.xilinx.com/devices/

Product Selection Matrix

Spartan-3 Family – 1.2 Volt (see note 3)

Device     System Gates   CLB Array    Number of  Logic Cells   CLB         Max. Distributed  # Block  Block RAM  Dedicated    # DCMs  Differential  Maximum  Configuration
           (see note 1)   (Row x Col)  Slices     (see note 2)  Flip-Flops  RAM Bits          RAM      (bits)     Multipliers          I/O Pairs     I/O      Memory (Bits)
XC3S50     50K            16 x 12      768        1,728         1,536       12K               4        72K        4            2       56            124      .4M
XC3S200    200K           24 x 20      1,920      4,320         3,840       30K               12       216K       12           4       76            173      1.0M
XC3S400    400K           32 x 28      3,584      8,064         7,168       56K               16       288K       16           4       116           264      1.7M
XC3S1000   1000K          48 x 40      7,680      17,280        15,360      120K              24       432K       24           4       175           391      3.2M
XC3S1500   1500K          64 x 52      13,312     29,952        26,624      208K              32       576K       32           4       221           487      5.2M
XC3S2000   2000K          80 x 64      20,480     46,080        40,960      320K              40       720K       40           4       270           565      7.7M
XC3S4000   4000K          96 x 72      27,648     62,208        55,296      432K              96       1,728K     96           4       312           712      11.3M
XC3S5000   5000K          104 x 80     33,280     74,880        66,560      520K              104      1,872K     104          4       344           784      13.3M

Common to all devices:
DCM Frequency (min/max): 24/330
Frequency Synthesis: YES
Phase Shift: YES
Digitally Controlled Impedance: YES
Commercial Speed Grades (slowest to fastest): -4 -5
Industrial Speed Grades (slowest to fastest): -4

I/O Standards
Single-ended: LVTTL, LVCMOS 3.3/2.5/1.8/1.5/1.2, PCI 3.3V – 32/64-bit 33 MHz, SSTL2 Class I & II, SSTL18 Class I, HSTL Class I, III, HSTL1.8 Class I, II & III, GTL, GTL+
Differential: LVDS2.5, Bus LVDS2.5, Ultra LVDS2.5, LVDS_ext2.5, RSDS, LDT2.5, LVPECL

Note:
1. System Gates include 20-30% of CLBs used as RAMs
2. For Spartan-3, a Logic Cell is defined as a 4-input LUT + flip-flop
3. Automotive Q-Grade Solutions for Spartan-3 will be available 2H 2004.

Package Options and User I/O (see note 1 below)

Spartan-3 (1.2V)

Package (Area, see note 2 below)  Pins   XC3S50  XC3S200  XC3S400  XC3S1000  XC3S1500  XC3S2000  XC3S4000  XC3S5000
Maximum I/Os                             124     173      264      391       487       565       712       784

VQFP Packages (VQ) – very thin TQFP (0.5 mm lead spacing)
16.0 x 16.0 mm                    100    63      63       -        -         -         -         -         -

TQFP Packages (TQ) – thin QFP (0.5 mm lead spacing)
22.0 x 22.0 mm                    144    97      97       97       -         -         -         -         -

PQFP Packages (PQ) – wire-bond plastic QFP (0.5 mm lead spacing)
30.6 x 30.6 mm                    208    124     141      141      -         -         -         -         -

FGA Packages (FT) – wire-bond fine-pitch thin BGA (1.0 mm ball spacing)
17 x 17 mm                        256    -       173      173      173       -         -         -         -

FGA Packages (FG) – wire-bond fine-pitch BGA (1.0 mm ball spacing)
19 x 19 mm                        320    -       -        221      221       221       -         -         -
23 x 23 mm                        456    -       -        264      333       333       -         -         -
27 x 27 mm                        676    -       -        -        391       487       489       -         -
31 x 31 mm                        900    -       -        -        -         -         565       633       633
35 x 35 mm                        1156   -       -        -        -         -         -         712       784

Note 1: Numbers in table indicate maximum number of user I/Os
Note 2: Area dimensions for lead-frame products are inclusive of the leads.

Pb-free solutions are available. For more information about Pb-free solutions visit www.xilinx.com/pbfree/.

Important: Verify all data in this document with the device data sheets found at http://www.xilinx.com/partinfo/databook.htm

For the latest information and product specifications on all Xilinx products, please visit the following links:

FPGA and CPLD Devices: http://www.xilinx.com/devices/

Configuration and Storage Systems: http://www.xilinx.com/configsolns/

Packaging: http://www.xilinx.com/packaging/

Software: http://www.xilinx.com/ise/

Development Reference Boards: http://www.xilinx.com/board_search/

IP Reference: http://www.xilinx.com/ipcenter/

Global Services: http://www.xilinx.com/support/gsd/

[Figure: three power modules, each with a Track pin, supplied from VIN = 3.3 V, 5 V, or 12 V. Output labels: Vo 1 = 3.3 V, Vo 2 = 2.5 V, Vo 3 = 1.8 V. Current labels: 20 A, 30 A, 15 A.]

The new PTHxx family of plug-in power modules from Texas Instruments provides industry-leading features that allow designers to take charge of point-of-load (POL) power problems and designs. New Auto-Track sequencing via single-pin control simplifies multimodule power up/down. In addition to those listed below, other key features include wide adjustable output voltage, on/off inhibit, overcurrent protection and remote sense.

Samples shipped in 24 hours.

The Industry’s Most Advanced Plug-In Power Modules Featuring TI’s New Auto-Track™ Sequencing

Applications

– Networking

– Servers

– Data communications

– Workstations

– Industrial electronics

Features

– Auto-Track sequencing simplifies power up/down sequencing of multiple modules

– Pre-bias startup capability allows use with all ASICs and FPGAs

– Margin up/down provides for additional test capability during manufacturing

– A 96% efficiency rating means more power in a smaller package

– Point-of-Load Alliance (POLA) compatibility assures interoperable second sources

High-Performance Power Management

FREE! Plug-in Power and Power Management Selection Guides

www.ti.com/xcell
1-800-477-8924, ext. 1202

Datasheets, Samples, Plug-in Power and Power Management Selection Guides

Series           Input Bus (V)   IOUT (A)
PTH03050/5050    3.3/5           6
PTH12050         12              6
PTH03060/5060    3.3/5           10
PTH12060         12              8
PTH03010/5010    3.3/5           15
PTH12010         12              10
PTH03020/5020    3.3/5           20
PTH12020         12              16
PTH03030/5030    3.3/5           30
PTH12030         12              20

Feature columns: Auto-Track Sequencing, Pre-bias Startup, Margin Up/Down, Thermal Shutdown

Auto-Track, Technology for Innovators and the red/black banner are trademarks of Texas Instruments. M6496 © 2004 TI

PN 0010842
