Xcell Journal: The Authoritative Journal for Programmable Logic Users
Issue 52, First Quarter 2005, Xilinx, Inc.
Special Edition: Achieving Breakthrough Performance at the Lowest Cost
In case you haven’t heard, Xilinx recently announced its new Virtex-4™ family of FPGAs. In this special edition of Xcell Journal, you’ll find articles devoted exclusively to Virtex-4 business viewpoints, system design challenges, engineering solutions, and engineering references.
Our “View from the Top” article is by Erich Goetting, Xilinx Vice President and General Manager of the Advanced Products Division. Erich presents an overview of the new Virtex-4 family, and gives you a guided tour of some of the new Virtex-4 technologies, as well as the inspiration and rationale behind them. Other articles in the Business Viewpoint section discuss how these new Virtex-4 devices, based on 90 nm technology, have greatly expanded high-performance processing and system integration.
You’ll also find technical articles written by Xilinx marketing, applications, and development staff, as well as our partners and customers, including:
• System Design Challenges articles emphasize the Virtex-4 family advantages and leadership themes. These technical articles outline design challenges and demonstrate how the Virtex-4 solution addresses these challenges.
• Engineering Solutions articles demonstrate some of the key capabilities of Virtex-4 FPGAs and how they are used in a design. These articles provide in-depth descriptions of Virtex-4 features, IP, and tools.
• The Engineering Reference section describes some of the Virtex-4 hardware development platforms and other design solutions, to help you determine which platform is best for your application and design task.
It’s Time to Re-Subscribe
This issue marks the 16th anniversary of our Xcell Journal. From its humble beginnings in the fourth quarter of 1988 as an eight-page, two-color newsletter, the journal has grown into an award-winning publication printed in five languages and distributed in 144 countries with a circulation of more than 60,000 readers.
Periodically, we must clean our mailing database. Beginning January 1, 2005, you must re-subscribe to continue receiving the Xcell Journal FREE. If you subscribed after January 1, 2005, you do not have to re-subscribe. If you subscribed before this date, please visit our site at www.xilinx.com/xcell/subscribe and take a minute to renew your FREE subscription and ensure its uninterrupted delivery.
I want to thank all of you, our readers, for your continued interest and support of the Xcell Journal. Please feel free to drop me a note at [email protected] about your suggestions on how we may improve. I’d like to hear from you.
LETTER FROM THE EDITOR
Xilinx, Inc.
2100 Logic Drive
San Jose, CA 95124-3400
Phone: 408-559-7778
FAX: 408-879-4780
The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.
BUSINESS VIEWPOINTS
This section discusses how the new Virtex-4 devices, based on 90 nm technology, have greatly expanded high-performance processing and system integration.
SYSTEM DESIGN CHALLENGES
This section emphasizes the Virtex-4 family advantages and leadership themes. These articles outline design challenges and demonstrate how the Virtex-4 solution addresses these challenges.
ENGINEERING SOLUTIONS
This section demonstrates some of the key capabilities of Virtex-4 FPGAs and how they are used in a design. These articles provide in-depth descriptions of Virtex-4 features, IP, and tools.
ENGINEERING REFERENCE
This section describes some of the Virtex-4 hardware development platforms and other design solutions, to help you determine which platform is best for your application and design task.
VIEW FROM THE TOP
Introducing the New Virtex-4 FPGA Family
The latest FPGAs from Xilinx set new records in capacity, capability, performance, power efficiency, and value.
BUSINESS VIEWPOINTS
Will the Evolution of Platform FPGAs Mean the End for ASICs and ASSPs?
EasyPath FPGAs Beat ASIC Prices
Introducing the New Virtex-4 FPGA Family
by Erich Goetting
Vice President & General Manager, Advanced Products Division
Xilinx, [email protected]
Welcome to the Xilinx® Virtex-4™ edition of the Xcell Journal. We’ve created this special issue to show you the new Virtex-4 FPGA family, and how its innovations enable the creation of next-generation systems that do more than ever thought possible only a few years ago.
In this article, I’ll take you behind the scenes for a guided tour of some of the new technologies, as well as a bit of the inspiration and rationale behind them.
With more than 100 innovations, the Virtex-4 family represents a new milestone in the evolution of FPGA technology. After conducting extensive interviews with leading design engineers worldwide, we knew that they wanted the following things in an advanced next-generation FPGA family:
• Higher performance
• Higher logic density
• Lower power
• Lower cost
• More advanced capabilities
It’s relatively easy to deliver on one or two of these items – our challenge was to deliver all of them at the same time. We did this through a combination of innovative process and circuit design, process development, the ASMBL architectural approach, and the use of advanced embedded functions.
Development work on the Virtex-4 family (code-named “Whitney” after the highest mountain in the continental United States) began more than two years ago. It represents the creativity and dedication of hundreds of engineers, spanning integrated circuit design and layout, software and IP development, process development, testing and characterization, systems and applications engineering, technical documentation, and product marketing.
One of the most remarkable developments embodied in the new Virtex-4 FPGA family is the ASMBL architecture, which represents a fundamentally new way of constructing the FPGA floor plan and its interconnect to the package. First of all, ASMBL enables I/O pins, clock pins, and power and ground pins to be located anywhere on the silicon chip, not just along the periphery as with previous approaches. This in turn allows power and ground pins to be brought directly into the center of the silicon die, thereby significantly reducing on-chip IR drops that can occur with the largest FPGAs running at the highest frequencies.
Clock input pins are also located in the center of the die, which reduces clock latency. This is because clock networks need to have equal delay to all endpoints (that is, minimum skew), and thus the clock must emanate from the center. In periphery-connected clock input pins, the signal first traverses from the edge of the die to the center, and is then distributed to all regions. The Virtex-4 ASMBL design eliminates this traversal completely, and thus directly reduces the clock network propagation delay.
In addition to its electrical advantages, ASMBL provides another significant benefit in that it allows a more flexible – and thus more precise – allocation of on-chip resources. That in turn has enabled us to offer Virtex-4 devices in three unique platforms, each with a different mix of on-chip resources:
• The LX platform, optimized for logic applications
• The SX platform, optimized for high-end DSP applications
• The FX platform, optimized for embedded processing and high-speed serial applications
A Look Inside the Virtex-4 FPGA
At the heart of the Virtex-4 FPGA is our next-generation 90 nm triple-oxide 10-layer copper CMOS process technology. While that’s quite a lot of adjectives, every one of them is incredibly important. The first, 90 nm, refers to the “drawn” gate length of the smallest transistors. As transistors get smaller, they get faster, use less dynamic power, and enable higher complexity at lower price points. Chip designers think in terms of “transistor budgets,” which are now in the billion transistor range.
Triple-Oxide 90 nm CMOS Technology
Triple-oxide technology refers to the number of transistor oxide thicknesses available in the process. More oxide thicknesses allow more tuning of performance and power in the device circuitry, and enable Virtex-4 devices to deliver industry-leading performance while dramatically lowering power consumption.
One of our key inputs from many engineers was that performance and power were very important constraints in their systems designs, and that they needed both high performance and low power. With a dual-oxide 90 nm process, we would have had to choose performance or power. This wasn’t good enough. By employing a triple-oxide 90 nm process, we achieved high performance and low power.
The 10-layer copper refers to the number of metal interconnect layers and their material, which is copper rather than aluminum (the traditional material). More layers provide more routing in less space and shorter connection distances. Copper reduces resistance compared to aluminum, and thus speeds signal interconnect and reduces on-chip power-distribution IR drop. As clock rates go up and voltages go down, these considerations have become increasingly important, and have driven the industry-wide shift to copper interconnect.
The Virtex-4 logic fabric was completely re-engineered to fully take advantage of the 90 nm triple-oxide CMOS process, resulting in the highest performance fabric ever, with system clock rates in excess of 500 MHz (at three LUT levels). At the same time, static power was cut in half compared to 130 nm Virtex-II Pro™ devices, as was dynamic power.
Thus, while some industry pundits were proclaiming that the future of deep submicron CMOS devices was getting hotter and hotter, with chip temperatures destined to reach that of rocket nozzles and the surface of the sun, the Virtex-4 design’s creative approach has turned that conventional wisdom on its head, resulting in overall power reductions of 50% compared to our previous 130 nm generation. In many applications, such as DSP functions, power levels are reduced even more – as much as 90%. No wonder design engineers say that Virtex-4 FPGAs are cool – they literally are.
High-Performance Clocking
Clocks were rated as one of the most important and critical FPGA resources in our surveys of design engineers. Quantity, quality, connectivity, frequency, duty cycle, jitter, and skew all made a big difference.
To take clocking to the next level in Virtex-4 devices, all global clock resources were made fully differential, thereby reducing skew, jitter, and duty-cycle distortion. This marks the first implementation of differential clocking in a programmable logic device. Not only that, but the number of global clocks was increased to 32, for every device, and internal connectivity options enhanced to allow any region to use any 8 clocks simultaneously.
500 MHz Synchronous Memories and FIFOs
On-chip synchronous block RAM was enhanced to run at 500 MHz. Built-in support for first-in first-out (FIFO) memories was included directly in the block RAM unit, enabling the same 500 MHz operation for FIFOs (approximately a 2X speedup over fabric-based FIFOs), while eliminating the need for any additional logic cells or complex FIFO designs.
If you’re designing systems requiring ECC (error checking and correcting) memory, Virtex-4 devices have built-in ECC support, with single-bit correct and double-bit detect. ECC is common in infrastructure equipment in networking, telecom, storage, servers, instrumentation, and aerospace applications, and provides the highest levels of data integrity. Like the integrated FIFO support, the integrated ECC eliminates the cost and delay of fabric-based solutions.
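For readers who have not worked with ECC before, the "single-bit correct, double-bit detect" behavior can be illustrated with a toy extended Hamming code. The sketch below protects only 4 data bits with 4 check bits; it is an illustration of the general technique, not the Virtex-4 circuit, and the function names are hypothetical.

```python
# Toy SECDED (single-error-correct, double-error-detect) code.
# Assumption: a 4-bit data word, chosen only to keep the example small.

def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # code[1..7]: Hamming positions, code[0]: overall parity
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the total number of ones even
    return code

def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'double_error'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    parity_ok = sum(code) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        # Odd number of flipped bits: treat as a single error and repair it.
        # A syndrome of 0 here means the overall parity bit (index 0) flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        return None, "double_error"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status

word = encode(0b1011)
word[5] ^= 1                            # inject a single-bit upset
print(decode(word))
```

The same principle scales to wider memory words by adding check bits; the overall parity bit is what separates a correctable single error from an uncorrectable double error.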
Speaking of on-chip memory, Virtex-4 devices continue to offer SelectRAM™ memory, whereby each LUT is transformed into a 16 x 1 RAM, ideally suited for building high-speed register files and local buffers.
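The reason a LUT can double as a small RAM is that a 4-input LUT is already a 16-entry, 1-bit table addressed by its inputs. The following behavioral sketch shows the idea in software; the class name and its methods are hypothetical and say nothing about the actual silicon or the Xilinx primitives.

```python
class Lut16x1:
    """Behavioral model of a 4-input LUT used as a 16 x 1 memory (illustrative only)."""

    def __init__(self, init=0):
        self.bits = init & 0xFFFF       # 16 one-bit cells packed into an integer

    def read(self, addr):
        return (self.bits >> (addr & 0xF)) & 1

    def write(self, addr, value):
        mask = 1 << (addr & 0xF)
        if value & 1:
            self.bits |= mask
        else:
            self.bits &= ~mask

# Used as logic: an INIT pattern of 0x8000 implements a 4-input AND.
and4 = Lut16x1(0x8000)
print(and4.read(0b1111), and4.read(0b0111))

# Used as memory: eight such LUTs side by side form a 16-entry, 8-bit register file.
regfile = [Lut16x1() for _ in range(8)]

def reg_write(addr, byte):
    for bit, lut in enumerate(regfile):
        lut.write(addr, (byte >> bit) & 1)

def reg_read(addr):
    return sum(lut.read(addr) << bit for bit, lut in enumerate(regfile))

reg_write(3, 0xA5)
print(hex(reg_read(3)))
```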
At the other end of the spectrum, interfaces to external memory devices such as DDR, DDR2, QDR-II, and RLDRAM-II are dramatically enhanced through our new ChipSync™ technology, which offers memory interface speeds at rates limited only by the speed of the external memory devices.
The new Virtex-4 ML461 Advanced Memory Development System contains fully functional and hardware-proven reference designs for all of today’s most popular memory technologies. If you plan to use external memory, I highly recommend that you check this out.
DSP Performance of 256 GigaMAC/s
In the DSP domain, we incorporated some of the world’s fastest multiply accumulate (MAC) technology. The XtremeDSP™ slice can perform an 18 x 18 signed multiply and 48-bit accumulate every 2 ns.
The Virtex-4 LX, FX, and SX platforms include the breakthrough XtremeDSP technology. With the new SX platform we did something completely new – we dramatically increased the ratio of DSP units to logic cells. Given the highly integrated nature of XtremeDSP slices, they need only small amounts of logic fabric to implement most common DSP functions, and thus increasing the ratio provides a significant increase in DSP compute power per unit silicon area. In fact, SX devices provide a 10X performance increase per unit cost over previous solutions.
Power is dramatically reduced as well, with more than a 10X reduction for multiply/add functions from previous FPGA solutions. The Virtex-4 SX55 contains 512 XtremeDSP slices, providing an aggregate DSP compute performance of 256 GigaMAC/s, making it one of the most powerful DSP devices ever manufactured.
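The 256 GigaMAC/s figure follows directly from the numbers above: one MAC every 2 ns per slice, times 512 slices. The short sketch below reproduces that arithmetic and adds a behavioral model of an 18 x 18 signed multiply with a 48-bit wraparound accumulator. The function name mac48 is hypothetical, and the model ignores pipelining and every other detail of the real slice.

```python
SLICES = 512                 # XtremeDSP slices in the SX55, from the text
CYCLE_NS = 2.0               # one multiply-accumulate every 2 ns, from the text

clock_ghz = 1.0 / CYCLE_NS                   # 0.5 GHz, i.e. 500 MHz
aggregate_gmacs = SLICES * clock_ghz         # giga multiply-accumulates per second
print(aggregate_gmacs)

def mac48(acc, a, b):
    """acc + a*b with 18-bit signed operands and a 48-bit two's-complement accumulator."""
    for operand in (a, b):
        if not -(1 << 17) <= operand < (1 << 17):
            raise ValueError("operand does not fit in 18 signed bits")
    total = (acc + a * b) & ((1 << 48) - 1)  # keep the low 48 bits
    return total - (1 << 48) if total >= (1 << 47) else total

# A three-tap dot product, the core operation of an FIR filter.
acc = 0
for sample, coeff in zip([1, 2, 3], [4, 5, 6]):
    acc = mac48(acc, sample, coeff)
print(acc)
```

The wide accumulator is what lets long sums of 36-bit products build up without overflow; the wraparound branch only matters once the running total exceeds the 48-bit range.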
The state-of-the-art XtremeDSP slice employs new “silicon algorithms” developed by a company called Arithmatica™. Many different architectures exist for implementing multiplication, and the Arithmatica architecture is truly a breakthrough. We are excited to see it available for the first time to FPGA users. For more information, visit Arithmatica’s website at www.arithmatica.com.
The Evolution of Advanced I/O Technology
I/O continues to be a critical success factor for today’s systems designers. During the last decade, we have seen four major changes in I/O. First was the shift away from 5V, the result of the need to scale voltages as we scaled the transistor. This in turn led to the plethora of I/O standards that we are all familiar with today: SSTL, HSTL, LVDS, and LVCMOS 1.5. The Virtex-4 SelectIO™ resource continues to lead the industry, supporting virtually every I/O standard in use today on every pin.
XCITE On-Chip Termination
The second major change was the transition from lumped loads to transmission line loads – again the direct result of Moore’s Law. As transistors got faster and clock rates increased, I/O edge rates increased as well. But because the propagation speed of signals is a constant, dictated by the speed of light, we entered the realm in which a signal on one end of a wire was no longer the same as the signal on the other end of the same wire. This is what transmission lines are all about, and their appearance during the last few years has driven a sea change in all aspects of signal interconnect and I/O design.
To make sure that these signal “waves” don’t start “splashing” uncontrollably, transmission lines need to be driven, built, and received using proper signal integrity approaches, the most critical of which is termination. Traditionally implemented with discrete resistors on the PCB, termination layouts can become exceedingly difficult around high-density pinouts like those used in FPGAs. This often dictates more PCB layers and thus more system cost.
Virtex-4 FPGAs include our third generation of XCITE™ integrated digitally controlled termination technology. Offering a precisely controlled source impedance at the output drive pin, it is designed to enable the driving of transmission lines without external components, with maximum speed and signal integrity, and with straightforward PCB layout and layer stack-ups.
Likewise, on inputs, XCITE offers parallel termination for single-ended inputs and true differential termination for differential inputs. Termination occurs on the end of the transmission line at the die, not on the way there on the PCB, offering maximum signal integrity. Many customers report that the XCITE technology has saved them many PCB layers, increased PCB packing density, and saved them substantial dollars in their bill of materials.
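Why termination matters can be seen from the standard reflection-coefficient formula for a transmission line, (Z_load - Z0) / (Z_load + Z0). The sketch below evaluates it for an unterminated input, a matched parallel termination, and a matched source impedance. The 50 ohm line impedance and the function names are assumptions made for illustration; they are not XCITE specifications.

```python
import math

def reflection_coefficient(z_load, z0=50.0):
    """Fraction of an incident wave reflected where a line of impedance z0 meets z_load."""
    if math.isinf(z_load):
        return 1.0                      # open circuit: the whole wave bounces back
    return (z_load - z0) / (z_load + z0)

def launched_fraction(r_source, z0=50.0):
    """Fraction of the driver swing initially launched into the line (voltage divider)."""
    return z0 / (r_source + z0)

# Unterminated, high-impedance input: full reflection back toward the driver.
print(reflection_coefficient(math.inf))

# Parallel termination equal to the line impedance: nothing reflects at the receiver.
print(reflection_coefficient(50.0))

# Matched source impedance: half the swing is launched, the open far end doubles it
# back to full swing, and the returning wave is absorbed at the source.
print(launched_fraction(50.0), reflection_coefficient(50.0))
```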
Source-Synchronous Interfaces
The third major change was the shift from system-synchronous to source-synchronous interfaces. Traditional system-synchronous interfaces work by distributing a single clock to all transmitters and receivers in the system, and transmitting data between source and destination within a single clock cycle. This makes the data rate inversely proportional to the sum of clock-to-out, transmission line delay, and input setup time.
Typically, system-synchronous interfaces top out at speeds in the range of 100 MHz. To go faster, source-synchronous interfaces transmit a clock along with the data, and the receiver uses this clock to capture the data. Using this technique, along with double-data-rate transmissions, enables parallel I/O data rates in excess of 1 Gbps.
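The system-synchronous limit described above is simple arithmetic: the clock period cannot be shorter than the sum of the three delays. The sketch below works one example; the delay values (5 ns, 3 ns, 2 ns) and the 500 MHz forwarded clock are assumptions picked to illustrate the orders of magnitude mentioned in the text, not measured figures for any device.

```python
def max_clock_mhz(t_clock_to_out_ns, t_flight_ns, t_setup_ns):
    """Highest system-synchronous clock rate for a given timing budget, in MHz."""
    period_ns = t_clock_to_out_ns + t_flight_ns + t_setup_ns
    return 1000.0 / period_ns

def ddr_rate_mbps(clock_mhz):
    """Per-pin data rate when data is transferred on both clock edges."""
    return 2 * clock_mhz

# Assumed budget: 5 ns clock-to-out, 3 ns board flight time, 2 ns input setup.
print(max_clock_mhz(5.0, 3.0, 2.0))

# Source-synchronous with a forwarded clock: the flight time largely cancels because
# clock and data travel together, so the clock can be much faster.
print(ddr_rate_mbps(500.0))
```

With these assumed numbers the system-synchronous bus lands at 100 MHz, while a 500 MHz forwarded clock with double-data-rate signaling reaches 1000 Mbps per pin.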
The challenge of source-synchronous interfaces is that each interface generates a new clock domain at the receiver. On top of this, to operate at high speeds, the precise alignment of clock and data at the receiver is paramount. To address this new world of source-synchronous interfaces, Virtex-4 devices include the breakthrough ChipSync technology. ChipSync units lie between the SelectIO technology and the core FPGA fabric, are available on every I/O pin on the device, and serve to transmit and receive high-speed source-synchronous data and clocks, achieving speeds of 1 Gbps per pin pair.
On the receiver, precise digital delay lines work internally to align data signals to each other, and then to align these to the received clock. The captured data is synchronized and transferred to the selected FPGA core clock domain.
To operate at maximum data rates, the transmit and receive units include parallel-to-serial and serial-to-parallel conversion units, respectively. Using ChipSync technology is virtually automatic for most designs, as it is utilized automatically in the various Xilinx IP cores and reference designs.
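The role of the parallel-to-serial and serial-to-parallel units can be sketched in a few lines of software: the fabric handles wide words at a modest rate, while the pin carries one bit at a time at a much higher rate. The word width of 8 and the LSB-first bit order below are assumptions for illustration only; the function names are hypothetical.

```python
def serialize(words, width):
    """Parallel-to-serial: emit the bits of each word, least significant bit first."""
    for word in words:
        for i in range(width):
            yield (word >> i) & 1

def deserialize(bits, width):
    """Serial-to-parallel: regroup a bit stream into words of the given width."""
    word, count = 0, 0
    for bit in bits:
        word |= bit << count
        count += 1
        if count == width:
            yield word
            word, count = 0, 0

tx_words = [0xA5, 0x3C]
line_bits = list(serialize(tx_words, 8))
rx_words = list(deserialize(line_bits, 8))
print(rx_words == tx_words)

# With an assumed 8:1 ratio, a 1 Gbps pin corresponds to a 125 MHz word rate in the fabric.
print(1000 / 8)
```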
Networking interfaces such as SPI-4.2 and HyperTransport™, and memory interfaces such as DDR, DDR2 SDRAM, and QDR II SRAM, all employ the Virtex-4 ChipSync technology. And if you’re designing your own source-synchronous interface, the ChipSync wizard gives you complete control and an easy-to-use GUI that lets you dial in exactly what you want to build.
Multi-Gigabit Serial Interfaces
The fourth major change in I/O has been the rapid adoption of high-speed serial interfaces. For years, serial interfaces were limited to long-distance communications, such as those used in fiber-optic links in the SONET/SDH world and the Ethernet links like 100BASE-T.
A key breakthrough occurred in the late 1990s, in which high-speed serial transceivers (which traditionally had been designed using complex process technology such as GaAs [Gallium-Arsenide]) were for the first time created using advanced design techniques using standard CMOS. Once implemented in CMOS, these transceivers had lower cost and much lower power, and could even be integrated into complex CMOS chips.
Virtually overnight, gigabit serial technology changed from a rare, expensive, and power-hungry technology to a common, low-cost, and very power-efficient technology. This has been the economic and technical impetus behind the industry’s “Serial Tsunami,” in which interface after interface has shifted from parallel to gigabit serial links. Two common examples are visible in today’s computer architectures, with the shift from parallel PCI to 2.5 Gbps serial PCI-Express™, and the shift from the parallel ATA drive interface to the Serial ATA interface.
There are more than a dozen multi-gigabit serial interfaces in widespread use today, with more being introduced every year. The Virtex-4 FX family provides our third-generation RocketIO™ multi-gigabit serial transceiver technology. Spanning speeds from 622 Mbps to more than 10 Gbps, each Virtex-4 RocketIO transceiver is programmable and can implement a myriad of speeds and serial standards. Link-layer IP is available for such standards as PCI Express, Serial ATA, Fibre Channel, Gigabit Ethernet, and Aurora, to name a few.
In addition, Virtex-4 FX devices each include multiple embedded tri-mode (or 10/100/1000) Ethernet MACs, making implementation of compliant Ethernet devices simpler and faster than ever.
Application-Specific Embedded Processing
Virtex-4 embedded processing solutions include full support for both MicroBlaze™ 32-bit soft CPUs on all devices, and embedded PowerPC™ 32-bit RISC CPUs on all Virtex-4 FX devices. The versatile MicroBlaze soft CPU runs at clock rates over 165 MHz on Virtex-4 devices, and delivers more than 140 DMIPS.
The number of CPUs in one device is limited only by your imagination, and of course by the available logic cells. The powerful PowerPC CPU runs at clock rates up to 450 MHz and delivers up to 702 DMIPS each. The first PowerPC processor available from any manufacturer on 90 nm, the PowerPC processor is incredibly power-efficient, using only 29 mW/DMIPS. This makes it among the lowest power microprocessors available from any manufacturer worldwide.
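A convenient way to compare the two processor options is DMIPS per MHz, which follows from the clock rates and DMIPS figures quoted above. The sketch below does that division. Because the MicroBlaze numbers are stated as "over" and "more than", its ratio should be read as a rough indication rather than a specification; the function name is hypothetical.

```python
def dmips_per_mhz(dmips, clock_mhz):
    """Architectural efficiency: Dhrystone MIPS delivered per MHz of clock."""
    return dmips / clock_mhz

powerpc = dmips_per_mhz(702, 450)       # embedded PowerPC figures from the text
microblaze = dmips_per_mhz(140, 165)    # MicroBlaze figures from the text (approximate)

print(round(powerpc, 2))
print(round(microblaze, 2))
```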
New Auxiliary Processing Unit (APU) technology connects the CPU to the FPGA fabric, enabling implementation of acceleration hardware for virtually any application. Once only the domain of high-budget ASIC and ASSP design teams, the Virtex-4 FPGA’s architectural ability to combine application-specific hardware acceleration with high-performance RISC CPUs shatters traditional barriers of cost, time-to-market, and risk.
During the next few years, I expect to see more and more instances of application-specific acceleration, as it truly offers the ability to deliver very high performance at low cost and low power. A recent research program completed within Xilinx Research Labs, led by Dr. Kees Vissers, demonstrated a 20-fold speedup for an encryption/decryption (IPSEC) application over the base PowerPC processor. Using only 135 mW, it outperforms a 3.2 GHz Pentium™-4, while at the same time reducing power by 99%. That, in my opinion, is what state-of-the-art embedded processing is all about.
Conclusion
I hope that you’ve enjoyed reading a bit about the Virtex-4 Platform FPGA and the factors that drove its design. From the breakthrough ASMBL architecture and the triple-oxide 90 nm CMOS process technology, to the world’s most capable embedded processing and multi-gigabit serial solutions, Virtex-4 devices offer an unparalleled set of enabling technologies for your next-generation systems designs. I look forward to seeing the creativity of the world’s designers in tomorrow’s products.
Will the Evolution of Platform FPGAs Mean the End for ASICs and ASSPs?
Next-generation platform FPGA devices based on 90 nm have greatly expanded high-performance processing and system integration options. They continue to push ASIC design starts lower as additional application solutions are defined.
by Richard Sevcik
Executive Vice President, Programmable Logic Systems and Intellectual Property/Cores and Software Solutions Groups
Xilinx, [email protected]
The debate over FPGAs as a viable alternative to ASICs and ASSPs has been ongoing for nearly a decade. Industry analysts iSuppli, Gartner Dataquest™, and others have documented the trend in decreasing ASIC design starts and the increase in FPGA design starts.
With the beginning of the new millennium, the debate continued with the introduction of Xilinx® Virtex-II™ and Virtex-II Pro™ devices – the industry’s first platform FPGAs. These high-performance devices, with their flexible device integration capability, programmable I/O, and significantly lower overall design cost, helped to usher in and establish SoC design methodology and quickly assumed innumerable ASIC SoC designs.
The addition of high-performance RISC CPUs, block RAM, multi-gigabit high-speed serial I/Os, dedicated DSP functions, and other system enhancements introduced technological advances that further solidified the rise of platform FPGAs over their ASIC SoC counterparts. However, to get high-performance DSP, processing, or connectivity features for a specific application domain, designers were typically forced to purchase the largest, costliest devices. The larger parts had the biggest helpings of advanced features, while the smaller parts had reduced portions of the same.
Today, a new breed of domain-optimized, multi-platform FPGAs from Xilinx – the Virtex-4™ family – promises multi-dimensional application scaling based on required features and cost goals. By combining the economic benefits of an innovative columnar architectural approach with advances in process technology (90 nm/300 mm), Xilinx is poised to move beyond the $5.1 billion programmable logic market to capture additional share in the $84 billion ASIC and ASSP markets (Source: Gartner Dataquest 2007).
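To get a feel for why the 90 nm/300 mm combination matters economically, a back-of-the-envelope estimate of die per wafer is useful. The sketch below assumes the prior generation was built at 130 nm on 200 mm wafers (the 200 mm figure is an assumption, not stated in this article), that die area scales with the square of the feature size, and that edge loss and yield can be ignored. The function name is hypothetical.

```python
def die_per_wafer_gain(old_wafer_mm, new_wafer_mm, old_node_nm, new_node_nm):
    """Rough ratio of die per wafer, ignoring edge loss, scribe lines, and yield."""
    wafer_gain = (new_wafer_mm / old_wafer_mm) ** 2    # usable area grows with diameter squared
    shrink_gain = (old_node_nm / new_node_nm) ** 2     # ideal area shrink of the same design
    return wafer_gain * shrink_gain

# Assumed baseline: 130 nm on 200 mm wafers. New process: 90 nm on 300 mm wafers.
gain = die_per_wafer_gain(200, 300, 130, 90)
print(round(gain, 1))
```

Under these idealized assumptions the larger wafer contributes a factor of 2.25 and the shrink a factor of about 2.1, for a combined gain of a little under five.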
Just the Right Mix
Based on the revolutionary Advanced Silicon Modular Block (ASMBL) columnar architectural approach, Xilinx can now cost-effectively develop multiple FPGA platforms, each with different combinations of feature sets. Thus, a specific platform can be optimized specifically for a certain domain of applications – such as logic, DSP, connectivity, and embedded processing – to meet application requirements previously delivered only by ASICs, ASSPs, and similar devices, while remaining programmable at heart.
Not only does the designer or design team have a choice in selecting the ideal platform, they also have a choice in choosing the device size with just the right feature mix to best achieve needed capability and performance at the lowest possible cost.
This unique flexibility and ability to create optimal application domain subsystems sets even higher standards for FPGAs. Devices that are both hardware- and software-programmable enable more flexible implementation options than either ASIC or ASSP devices. Reinvestigating, changing, or enhancing system architecture at any time in the development process provides the ultimate tool kit to meet application requirements.
Designers can use this same capability to evolve hardware in the field to meet new requirements or avoid expensive hardware upgrades. This flexibility becomes paramount given today’s many emerging and competing standards.
The “Total Cost” Advantage
FPGAs have demonstrated a clear and consistent trend in reducing cost and making FPGA technology more suitable for a wider range of applications. The combination of 90 nm silicon fabrication technology with 300 mm wafers results in a cumulative effect: increasing the number of die per wafer five times over previous devices. Increasing the die per wafer together with architectural integration enables substantially lower system costs.
A key and often overlooked component in favor of programmable logic’s economic advantage is clearly demonstrated in how technology is used throughout the world. No two people use the same technology, systems, or software, nor do they subscribe to or want the same content.
Higher costs and longer design times for ASICs and ASSPs relegate their primary uses to proven lower-risk, very-high-volume applications. The rapid and significant increase in ASIC development costs clearly gives the advantage to platform FPGAs in today’s leading-edge applications. The overall cost benefit of zero NRE pushes the high-volume ASIC or ASSP crossover point upwards, locking in FPGAs like never before.
Conclusion
Domain-optimized multi-platform FPGAs are revolutionary in their ability to accelerate the deployment of FPGA technology into many more application areas. The combined leverages of reduced risk, dramatically shorter design cycles, and zero NRE will soon move all but the highest volume applications away from cell-based ASIC implementation toward more flexible, forgiving architectures like today’s domain-optimized FPGAs. For more information, visit www.xilinx.com/virtex4/.
by Gokul Krishnan Sr. Marketing Manager, Market Specific Products GroupXilinx, [email protected]
The risk of deploying ASIC solutions hasworsened in magnitude with the move tosmaller process geometries. As design com-plexity increases, customers are looking for aviable solution that offers low design, unit,and total costs, high-level system integration,design flexibility, easy-to-use design tools, arich selection of IP, and fastest time to market.
Customers are increasingly turning toother alternatives to avoid the pitfalls ofASICs – high NRE and re-spin expenses,slow turnaround times, complex design envi-ronments, and hidden conversion, verifica-tion, and development costs. In this article,we’ll analyze two such alternatives: Xilinx®
EasyPath™ FPGAs and structured ASICs.
Structured ASIC product offerings tendto be similar to FPGAs in that they havepredefined combinations of gates, memo-ry, and I/Os. However, their architecturestend to trade off flexibility in favor ofreduced area to achieve their cost targets.The reality remains that a vast majority ofdesigns intended for ASICs are originallyprototyped in an FPGA, yet there are stillproblems with FPGA-to-structured-ASICconversions. EasyPath FPGAs offer thebest migration path to high-volume pro-duction at the lowest cost possible.
EasyPath FPGAsEasyPath FPGAs are the industry’s onlycustomer-specific and flexible solution forvolume production priced lower thanstructured ASICs.
EasyPath FPGAs are identical to ourstandard FPGA offerings but use patentedtesting techniques and customer-specifictest patterns to significantly improve FPGAyields. You can reap the benefits of theseimproved yields in the form of lower costs,because Xilinx only tests those parts of an
FPGA that are actually used in your design.With EasyPath FPGAs, you can realize a
30-80% reduction in prices when you moveto high volume, as compared to standardFPGAs. EasyPath FPGAs are availableacross six platforms, four different productfamilies, and 28 different devices over arange of gate and memory counts.
Because EasyPath FPGAs are identical to their standard FPGA counterparts, no conversion work is needed. Once you have frozen your design, Xilinx can deliver EasyPath parts in high volume in eight weeks. This compares favorably with structured ASIC companies, which typically take 12-14 weeks from prototype signoff to production.
Structured ASICs
Structured ASICs are a variant of the gate arrays of yesteryear, but they use a “sea of modules” approach as opposed to a “sea of gates” approach. The architecture of each module varies depending on the vendor, but in general is some combination of NAND gates, inverters, flip-flops, and muxes.
EasyPath FPGAs Beat ASIC Prices
EasyPath is the most comprehensive volume production solution in the industry.
B U S I N E S S V I E W P O I N T S
Structured ASICs promise cost savings primarily as a result of customizing fewer mask layers per design, unlike standard cell ASICs, which use all-custom metal layers. Structured ASICs customize only the top few (typically two to four) metal layers; the base modules are all buried in the lower layers, with their ports coming up to the programmable layers. During the fabrication phase, the connections between various ports are made to realize the requisite logic.
The Lowest Total Cost Solution
Figure 1 shows the comparative economics of standard cell ASICs, structured ASICs, FPGAs, and EasyPath FPGAs. FPGAs have traditionally offered a zero-NRE solution, which has led to their broad adoption. Standard cell ASICs have a high NRE and a relatively low unit cost, but with the overhead discussed earlier. Structured ASICs promise to lower the NRE at a unit cost that is higher than that of standard cell ASICs, but lower than that of standard FPGAs.
With next-generation EasyPath FPGAs, you can now enjoy unit prices as well as NREs that are lower than those of structured ASICs. The combination of the industry’s lowest NRE charges (starting at $75K); low-cost design tools and IP; prices below structured ASICs; the fastest times to production; and no hidden conversion charges shows how EasyPath FPGAs are the industry’s lowest total cost solution for volume production.
Figure 1 – EasyPath FPGAs offer the lowest total cost solution.
Unmatched Choice of Platforms
Structured ASIC vendors can roughly be grouped into two camps based on their ability to address IP-centric designs. On the one hand are those that have a wide portfolio of IP; on the other are companies that typically can only address generic designs. With the recent announcement of next-generation EasyPath FPGAs from Xilinx, both of these segments can be addressed economically and efficiently.
Xilinx now offers four families and six platforms, with 28 devices from which to choose. This comes with all the benefits of the FPGA ecosystem that Xilinx customers are already used to: hard IP such as the IBM™ PowerPC™, MGTs, and XtremeDSP™ blocks, as well as 600+ proven soft IP cores and low-cost design tools.
Some structured ASIC vendors focus exclusively on generic designs or logic-heavy designs. This class of design tends to be very price competitive. Xilinx is now able to offer a more compelling solution than any structured ASIC vendor with its Spartan-3™ EasyPath FPGAs, which are priced below these structured ASICs.
For designs that require a lot of IP and system integration, such as PowerPC processors, DSP, high-speed I/O, or Ethernet MACs, translation to a structured ASIC vendor often requires a re-validation of the IP on the vendor’s silicon platform of choice. With Xilinx Virtex-4™ EasyPath solutions, you get the same wide range of validated IP as with standard FPGAs. There is no additional fee required to migrate the IP to a volume solution.
The bottom line is that whether it is a generic design or an IP-centric design, EasyPath FPGAs offer very competitive and cost-effective solutions for high-volume migration when compared to structured ASICs, all from a single trusted supplier. Migration to structured ASICs, on the other hand, can pose a number of challenges.
Conversion-Free Methodology
The vast majority of IC design starts begin with FPGA prototyping, followed by a conversion to a volume solution. This carries the inherent risk of redesigning and re-verifying the design in the target architecture, along with the related costs of re-spins, conversions, and a host of other design issues. The conversion from FPGA to structured ASIC is not seamless; rather, it is fraught with risks.
One issue faced by structured ASIC companies revolves around the mapping of memories from an FPGA to a structured ASIC. FPGAs generally tend to have columnar memory architectures and offer an efficient means to form larger memory structures when required. On the other hand, the use of distributed memory blocks in some structured ASIC architectures can pose problems when large contiguous blocks are required by the design.
The need to join together blocks that are physically separated to form a larger block that is logically monolithic can increase routing congestion. This can not only deteriorate the access times of those memory structures but also leave fewer routing resources available for logic, thus impacting design performance.
With EasyPath FPGAs, there is no conversion. EasyPath FPGAs are exactly the same as the standard FPGAs on which a design is prototyped; the only difference is that the latter are completely programmable, while the former are not. As a result, the memory mapping and performance achieved in an EasyPath FPGA are identical to those achieved in a standard FPGA.
Another problem that some structured ASIC companies face has to do with pad limitations. It is fairly well known that as process nodes shrink, more and more designs become pad-limited in ASICs. To get an adequate number of pads, structured ASIC vendors sometimes have to grow their die size and increase the effective cost to end customers. This problem is compounded by the fact that structured ASIC I/Os tend not to be as flexible as FPGA I/Os.
To keep I/O structures small and less area-intensive, structured ASIC vendors have to make some difficult choices about what standards they want to address and how. In cases where designs require large buses of input and output I/Os (for example, SSTL2 buses for SDRAM, or HSTL buses for certain telecom protocols), the limitations in the design of I/O structures can make it difficult to achieve pin compatibility in the FPGA-to-ASIC conversion. The end result is that customers have to either re-spin their board or migrate to a larger device, both unpalatable options. None of these are issues with EasyPath FPGAs because of the one-to-one mapping between them and standard FPGAs.
Apart from memory and I/Os, there is a whole host of other issues, including difficulties with IP translation and testing, when moving from FPGAs to structured ASICs. FPGA cost reduction plans that involve converting to structured ASICs in order to get a smaller die are likely to trigger design changes and schedule risks.
The EasyPath solution, on the other hand, is neither an ASIC conversion nor a mask-programmed FPGA. No conversion or silicon differences are involved, so there are no long lead times, no timing or pinout changes, no need for product qualification, no lost feature support, and no risk of a design failure. In addition to eliminating any hidden design or qualification expenses and the risks of ASIC conversions, EasyPath FPGAs are delivered in eight weeks in production volume, allowing you the benefits of faster time to market or more time to perfect your designs.
Unprecedented Flexibility
One of the major advantages of FPGAs over ASICs is the flexibility to make design changes in case of a specification change or a design error. Traditionally, customers have had to forgo this advantage as they move from FPGAs to an inflexible custom solution like standard cell or structured ASICs. Now, with EasyPath FPGAs, Xilinx offers two flexibility features that allow you to enjoy some of the FPGA advantages when you go to volume production at prices below structured ASICs.
Spartan-3 and Virtex-4 EasyPath FPGAs enable you to buy a custom device that supports two applications: one for diagnostic testing and one for the actual application. EasyPath FPGAs can now be tested for two designs, or two variations of the same design. This means that you can now enjoy greater flexibility while also saving on BOM and inventory costs. For example, you can use one bitstream to perform system diagnostics on the entire system and then load the second application-specific bitstream. This reduces associated manufacturing system costs.
Xilinx offers EasyPath FPGA devices with LUTs and I/Os tested for drive strengths and slew rates, allowing revisions like engineering change orders at the LUT or I/O level. In many instances, even after the customer design is fully functional and certified, flexibility with I/O drive strengths and slew rates is critical.
For instance, a line card in a router might need to have the drive strength (and slew rate) adjusted a notch or two depending on what load it sees. EasyPath customers can choose to have a range of drive strengths available to them for certain I/Os. This flexibility is implemented on an as-needed basis. It eliminates any re-spin and conversion-related engineering effort, delay, and expenses associated with ASICs and structured ASICs.
Conclusion
EasyPath FPGAs from Xilinx offer a seamless one-for-one, no-conversion volume reduction solution across an industry-leading portfolio of product families. The comparison between EasyPath FPGAs and structured ASICs shown in Table 1 illustrates why EasyPath is a much superior solution. Unlike structured ASIC customers, EasyPath customers can get to production volumes much faster and now can do so at lower prices as well.
For more information about the next-generation EasyPath FPGAs, please visit www.xilinx.com/easypath/, where you can get information on the platform support and flexibility features, and use an online cost calculator to find out why EasyPath FPGAs are the lowest total cost solution in the industry.
Selection Criteria                               Structured ASICs*        EasyPath FPGAs
Time to Prototype Samples                        4-8 weeks                0 weeks
Total Time to Volume Production                  12-15 weeks              8 weeks
Vendor NRE/Mask Costs                            $100K-$200K              $75K
Design Costs for Conversion                      $250K-$300K              $0
Additional Cost of Tools for Conversion          $100-$200K               $0
Unit Costs                                       Low                      Low
Risk                                             High                     Low
Flexibility to Make Changes In-System            Inflexible               Flexible
Design Conversion from Prototype to Production   Additional Engineering   Conversion Free
*Xilinx market analysis
Table 1 – EasyPath FPGAs versus structured ASICs
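To make the up-front comparison in Table 1 concrete, the short Python sketch below sums the one-time cost rows for each option. It assumes that the “$100-$200K” tools entry means $100K to $200K, and it leaves out unit costs because the table gives them only qualitatively. The dictionary and function names are illustrative, not part of any Xilinx tool.

```python
# One-time (non-unit) costs from Table 1, in thousands of US dollars, as (low, high) ranges.
# Assumption: the "$100-$200K" tools entry is read as $100K-$200K.
UPFRONT_COSTS_K = {
    "Structured ASIC": {
        "vendor NRE/mask": (100, 200),
        "design conversion": (250, 300),
        "conversion tools": (100, 200),
    },
    "EasyPath FPGA": {
        "vendor NRE/mask": (75, 75),
        "design conversion": (0, 0),
        "conversion tools": (0, 0),
    },
}


def upfront_range(option):
    """Sum the low and high ends of every one-time cost row for one option."""
    rows = UPFRONT_COSTS_K[option].values()
    return sum(low for low, _ in rows), sum(high for _, high in rows)


for name in UPFRONT_COSTS_K:
    low, high = upfront_range(name)
    print(f"{name}: ${low}K to ${high}K up front")
```

Under these assumptions the structured ASIC path carries $450K to $700K of one-time cost against $75K for EasyPath, before any unit-price difference is considered.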
by Matt Klein
Sr. Staff Engineer, Applications Engineering, Advanced Products Division
Xilinx, [email protected]
Device power consumption is a primary issue in the semiconductor industry. As process technologies get smaller and faster, they normally consume more power, putting power concerns and performance at odds. The new Virtex-4™ FPGA family from Xilinx® employs innovative architectural features and clever IC design techniques that dramatically reduce power consumption without compromising performance. This bucks expected trends normally associated with the reduced feature sizes of 90 nm process technology.
In this article, we’ll explore how Xilinx IC designers achieved remarkable power efficiency in the high-performance Virtex-4 FPGA.
Components of Power Consumption
There are two main components to power consumption: static and dynamic. Static or quiescent power is dominated by transistor leakage current. When this current is listed in data sheets, it is listed as ICCINTQ and is the current drawn through the VCCINT supply powering the FPGA core.
Dynamic or active power has components from both the switching power of the core of the FPGA and the I/O being switched. The dynamic power consumption is determined by the node capacitance, supply voltage, and switching frequency, and is governed by the basic formula $P = CV^2 f$.
Both static and dynamic power have been significantly reduced in Virtex-4 devices, even when compared to Virtex-II Pro™ devices.
Dramatic Power Reduction
The Virtex-4 product family has reduced power consumption in several key areas. The power-per-CLB has been cut in half, with static power reduced by 40% and dynamic power reduced by 50% when compared to the 130 nm Virtex-II Pro FPGA and other 90 nm FPGAs. Furthermore, certain hard-logic silicon functions in the Virtex-4 FPGA reduce power consumption by 80-95%, a whopping factor when compared to the same functions implemented in configurable logic blocks and programmable interconnect routing.
Additionally, comprehensive power planning tools are available to help you get an idea, up front, of power consumption for your Xilinx FPGA under its operating conditions.
Reduced Power Consumption Benefits
The benefits of reduced power consumption cut across several areas of product design, including reduced thermal concerns and easier power supply design (see Figure 1).
• Reduced thermal concerns – When you reduce power consumption in a device or system, you can use smaller heat sinks, or no heat sinks at all in some cases. Thermal system design also becomes simpler because airflow and fan size requirements are reduced.
• Easier power supply design – You can also use smaller supply circuitry and reduce the number of components in the power supply. Using less PCB space allows you to reduce the cost of the power system. Plus, because your device consumes less power, you can achieve higher reliability by lowering the temperature of the FPGA die.
The Virtex-4 Power Play
The latest Xilinx FPGA offers revolutionary power innovations.
SYSTEM DESIGN CHALLENGES
Static Power Trends in 90 nm Technology
The reduction in transistor size in 90 nm technology has several effects on power consumption. The biggest potential problem is in the area of static power.
Scaling Trends for Static Power
As we mentioned earlier, static power is dominated by transistor leakage current. Unfortunately, channel leakage increases as transistor size decreases. This is especially true for low VT transistors, where VT refers to the threshold voltage, the gate-to-source voltage at which the transistor begins to conduct.
Low VT transistors are the fastest transistors, the ones with the shortest turn-on and propagation delay, that IC designers use inside the FPGA when the highest speed performance is needed. Regular VT transistors are also used when less performance is acceptable, but this only helps so much with leakage.
Figure 2 shows that leakage goes up dramatically when moving from 130 nm to 90 nm technology. The Virtex-II Pro device uses 130 nm process technology, whereas the new Virtex-4 device uses 90 nm process technology.
Triple-Oxide – The Savior of Static Power
Triple-oxide simply means that we use a third thickness of oxide in making some of the transistors in the FPGA (two oxide thicknesses are used in devices like the Virtex-II Pro FPGA). Most transistors in the past had a thin oxide layer; these could be low VT, regular VT, NMOS, or PMOS transistors. Thick-oxide transistors are mostly used for I/O drivers and a few other functions.
Oxide deposition thickness is a very stable and controllable process in the semiconductor industry because it depends on temperature, concentration, and exposure time. Figures 3a and 3b show the Virtex-4 transistor with the middle oxide thickness used in the triple-oxide process. You may notice that the oxide is still very, very thin, but this thicker-oxide transistor has much lower leakage than the standard thin-oxide low VT and regular VT transistors used in Virtex-II Pro FPGAs and in various parts of Virtex-4 FPGAs.
Why Doesn’t Everyone Use Triple-Oxide?
If triple-oxide is such a great process, why don’t other companies like Intel™ or IBM™ use it in their own ASICs?
They probably would if it benefited them. The reason they don’t is that all of their transistors need to run at speed; hence, they must use the leakier low VT transistors for everything. FPGAs can have many different transistor types, which can be selected for function, power, or performance.
FPGAs can use different transistor types for different functions, and Xilinx designers have accomplished this balance.
Optimizing Performance and Leakage
Our IC designers have many things that they can do to adjust the mix to optimize for certain factors. The Virtex-4 FPGA is the first Platform FPGA designed for high speed and low power.
Low VT transistors are used only where necessary for maximum speed, while the middle thickness of oxide from the triple-oxide process may be used for less aggressive performance with very low leakage. Designers may use different sizes and types of transistors for performance and function. Combinations are also possible, such as small and medium-sized low VT fast transistors and small and medium-sized middle-oxide-thickness transistors. It is not a one-size-fits-all procedure.
Xilinx IC designers were given a directive to reduce power, among other things, in the Virtex-4 platform while maintaining the highest system performance. These transistors are used across the various FPGA functions of LUTs, I/O, interconnect, and configuration memory cells. Even within a given FPGA function, all transistors don’t need to be the same, and the choice is up to the Xilinx IC designers (see Figure 4).
The surprising result of this balancing is that the overall static current in Virtex-4 devices with the 90 nm process is reduced by 40% when compared to Virtex-II Pro devices with the 130 nm process. Table 1 shows the weighted average changes to the transistors in the Virtex-4 die compared to the Virtex-II Pro die, from which you can arrive at the reduced transistor leakage in the Virtex-4 FPGA.
[Figure 2 plot: “Transistor IOFF Trend.” Vertical axis: IOFF (nA/µm), logarithmic, 0.1 to 1000. Horizontal axis: technology node (220, 180, 150, 130, 90, 75, 65). Curves: low VT and regular VT.]
Figure 1 – Virtex-4 devices reduce thermal concerns and simplify power supply design.
Figure 2 – Transistor leakage trends due to process scaling
Figure 3a, 3b – Middle oxide thickness Virtex-4 transistor used in triple-oxide process and with highlighted portions of the transistors
Dynamic Power Reduction
Static power reduction, while dramatic, is not the only power win that you can take advantage of. Dynamic power is also reduced by 50% when compared to Virtex-II Pro FPGAs.
The dynamic power in the FPGA is governed by the following equation:

$$P_{\text{Dynamic}} = \text{FPGA}_{\text{Core}}\,(CV^2 f) + \text{FPGA}_{\text{I/O}}\,(CV^2 f)$$
The Virtex-4 family of FPGAs has the following:
• Reduced FPGA core dynamic power
– Internal operating voltage is the dominant factor
– Secondary scaling by frequency (f) and node capacitance (C)
• Constant FPGA I/O dynamic power
– Unchanged voltage swing (VI/O), toggle rate (f), and pin/pad capacitance (C) for a given I/O standard
So you can see that we can reduce dynamic power inside the device, but the dynamic power consumed by I/O switching remains unchanged.
When we go from the 130 nm process of the Virtex-II Pro FPGA to the 90 nm process of the Virtex-4 FPGA, the internal supply voltage changes from 1.5V to 1.2V. This reduces the dynamic power consumption for every internal transistor by 36% of that in the Virtex-II Pro FPGA:

$$1 - \left(\frac{1.2}{1.5}\right)^2 = 0.36$$

Additionally, the FPGA internal composite capacitance is reduced in the Virtex-4 FPGA. This internal capacitance comprises transistor parasitic capacitances and trace-to-metal and trace-to-trace capacitances for the interconnecting metal traces. Figure 5 shows the capacitances involved relative to their structures.
Does low-K reduce power? Low-K refers to the dielectric insulating material between the metal traces in the FPGA. Lower-K dielectric insulating layers do reduce internal capacitances per unit trace length, but “low-K” is a relative term. Xilinx uses reduced-K insulating materials and has used low-K itself in the past; we may do so again in the future.
As mentioned earlier, dynamic power is related to the bulk capacitance and internal voltage levels being switched: $P = CV^2 f$. All things being equal, a lower internal capacitance for the interconnects would benefit dynamic power and reduce resistor-capacitor delay, but other factors contribute to interconnect capacitance, such as distance above the metal plane, interconnect width, and interconnect length.
Additionally, other parasitic capacitances such as gate-to-drain and gate-to-source are also part of the equation. Total capacitance for a path is based on a complex combination of parasitic capacitance in the transistors; the architecture of the interconnect paths and actual path lengths; and the number of hops through interconnect switches. Xilinx has reduced the overall capacitance for those components in the Virtex-4 FPGA.
The overall effect, which is mostly due to reduced gate capacitance, is a 20% lower capacitance for Virtex-4 FPGAs when compared to Virtex-II Pro FPGAs. Table 2 shows a dynamic power reduction of 50% for the Virtex-4 FPGA when compared to the Virtex-II Pro FPGA. Even when running at a 50% higher frequency, the Virtex-4 FPGA still shows a 23% reduction in dynamic power.
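The figures quoted above follow directly from the $CV^2 f$ relationship. The Python sketch below is a rough check rather than a power model: it treats the whole core as a single lumped capacitance, and the function and constant names are illustrative.

```python
def dynamic_power_ratio(cap_ratio, volt_ratio, freq_ratio=1.0):
    """Relative dynamic power from P = C * V^2 * f, each argument given as new/old."""
    return cap_ratio * volt_ratio ** 2 * freq_ratio


V_RATIO = 1.2 / 1.5   # core supply, Virtex-4 versus Virtex-II Pro
C_RATIO = 0.80        # 20% lower capacitance, as stated in the text

same_clock = dynamic_power_ratio(C_RATIO, V_RATIO)
faster_clock = dynamic_power_ratio(C_RATIO, V_RATIO, freq_ratio=1.5)

print(f"voltage alone: {1 - V_RATIO ** 2:.0%} lower")
print(f"same frequency: {1 - same_clock:.0%} lower")
print(f"50% higher frequency: {1 - faster_clock:.0%} lower")
```

The voltage change alone accounts for the 36% figure; adding the capacitance reduction gives roughly 49%, which the article rounds to 50%; and running 50% faster still leaves about a 23% saving.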
Because the Virtex-4 FPGA is a much higher performance device than the Virtex-II Pro FPGA, you may need to operate it at higher clock speeds to meet newer, more demanding performance targets that could never be achieved in previous systems.
Parameter                                                     Virtex-4 (Virtex-II Pro = 1.00)   Change
Channel Width Ratio                                           0.64                              -36%
Channel Length Ratio                                          0.71                              -29%
Leakage Current per Unit Width Ratio                          1.14                              +14%
Leakage Current per Transistor                                0.74                              -26%
VCCINT Ratio                                                  0.80                              -20%
Static Power per Transistor Ratio (ILEAKAGE * VCCINT)         0.59                              -41%
Table 1 – Overall weighted average transistor leakage and parameter comparisons for 90 nm Virtex-4 transistors relative to 130 nm Virtex-II Pro transistors
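As a quick consistency check on Table 1, the Python sketch below multiplies the tabulated ratios together. It rests on one reading of the table that the article does not spell out: that leakage per transistor is approximately channel width times leakage per unit width. The variable names are illustrative.

```python
# Virtex-4 values relative to Virtex-II Pro (= 1.00), copied from Table 1.
width_ratio = 0.64
leak_per_unit_width = 1.14
leak_per_transistor = 0.74
vccint_ratio = 0.80
static_power_per_transistor = 0.59

# Assumed relationship: leakage per transistor ~ width * leakage per unit width.
# The table entries are weighted averages, so the product need not match exactly.
est_leak = width_ratio * leak_per_unit_width

# Static power per transistor = leakage current * VCCINT (stated in the table).
est_static = leak_per_transistor * vccint_ratio

print(f"estimated leakage per transistor: {est_leak:.2f} (table: {leak_per_transistor})")
print(f"estimated static power per transistor: {est_static:.2f} (table: {static_power_per_transistor})")
```

The static power product reproduces the tabulated 0.59, and the leakage estimate lands within 0.01 of the tabulated 0.74.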
Figure 4 – Optimal transistor mix for minimizing leakage and maximizing performance
Embedded Blocks
Another major area of improvement in power consumption is embedded functions. This has always been a strength of Xilinx FPGAs, but it is more so in the Virtex-4 FPGA, even when compared to the feature-rich Virtex-II Pro FPGA.
In Virtex-4 FPGAs you can take further advantage of both static and dynamic power reduction by using the embedded functions, which are built as hard-logic functions.
When embedded functions are implemented as hard-logic functions instead of in configurable logic blocks and programmable interconnects, far less static and dynamic power is consumed. This is because far fewer transistors are used for hard, fixed logic than for programmable logic. Additionally, no transistors are needed to make interconnect connections in the embedded functions, because there are no programmable interconnects.
Xilinx has carefully studied some of the functions that engineers like you have struggled with and that we have also found tedious to implement within the FPGA programmable logic. The new embedded functions lower power by 80-95% compared to their counterparts built from configurable logic blocks and routing in programmable silicon.
Comprehensive Power Planning Tools
Another useful aid in planning power is that Xilinx data sheets show you both typical and maximum power consumption numbers. Maximum numbers are for worst-case process, temperature, and voltage, but many designers are very happy to work with typical numbers, depending on their application and the number of parts being used in one system.
You can also take advantage of power planning tools when planning for power consumption in Xilinx FPGAs. Xilinx web power tools are available for estimating power early in the design cycle. Also, as part of the Xilinx design flow, XPower looks in more detail at a mapped or routed design. Both can be found, along with power application notes, by searching the Xilinx website for the phrase “Xilinx Power Tools.”
Conclusion
Xilinx has made profound improvements in both static and dynamic power in the Virtex-4 90 nm family of FPGAs when compared to Virtex-II Pro FPGAs and (we believe) in comparison to our competitors. We have done this through a multi-pronged, purposeful approach in the areas of reduced leakage current, reduced dynamic power consumption, and embedded functions, without compromising performance. These, along with comprehensive power planning tools, make the Virtex-4 device an excellent choice for a high-performance FPGA system.
For more information about power consumption in Virtex-4 and other Xilinx FPGAs, visit www.xilinx.com/products/design_resources/design_tool/grouping/power_tools.htm.
Note: The factor of 0.5 above comes from the fact that Virtex-4 power per slice is 1/2 of the Virtex-II Pro power per slice because of the 50% dynamic power reduction in Virtex-4 devices compared to Virtex-II Pro devices.
Table 2 – Chart showing changes in internal FPGA in Virtex-4 devices compared to Virtex-II Pro devices and the effect on dynamic power
Table 3 – QDR II SRAM and SPI-4.2 cores benefit in reduced power consumption from significant logic cell reduction due to the new Virtex-4 ChipSync block
Virtex-4 Embedded Functions and Reduction of Dynamic Power
• PowerPC – 50% power reduction compared to Virtex-II Pro PowerPC
– 10:1 power reduction over FPGA fabric-built version
• DSP – XtremeDSP™ slice greatly reduces the logic cells previously needed for many filtering functions
– 20:1 power reduction over Virtex-II Pro separated multiply/accumulate functions
• SSIO – New ChipSync™ block reduces logic cell count for SSIO (source-synchronous I/O) designs
– Significant logic cell savings for various memory and networking interface designs lead to a reduction in overall power of up to 9:1 for selected designs (see Table 3)
• Embedded Ethernet MAC(s) – No need to use logic and interconnect for the MAC function, which saves >3,000 logic cells for the Xilinx implementation
• FIFO – SmartRAM™ memory includes built-in FIFO controllers, which can save hundreds of logic cells per FIFO and greatly simplify design as well
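The N:1 ratios in this list can be related to the 80-95% savings quoted in the article with a one-line conversion. The Python sketch below is plain arithmetic on the ratios given above; the labels are shorthand, not Xilinx terminology.

```python
def reduction_percent(ratio):
    """Convert an N:1 power reduction into a percentage saving."""
    return (1 - 1 / ratio) * 100


cases = [
    ("PowerPC vs. fabric-built version", 10),
    ("XtremeDSP vs. separated multiply/accumulate", 20),
    ("ChipSync, best selected design", 9),
]
for label, ratio in cases:
    print(f"{label}: {ratio}:1 is about {reduction_percent(ratio):.0f}% less power")
```

A 10:1 reduction corresponds to about 90% and 20:1 to about 95%, which sits inside the quoted range.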
by Krista Marks
Sr. Manager, IP Solutions Division
Xilinx, [email protected]
SPI-4.2 (System Packet Interface Level 4 Phase 2) is the Optical Internetworking Forum’s recommended interface for the interconnection of devices for aggregate bandwidths of OC-192 (ATM and POS) and 10 Gbps (Ethernet), as illustrated in Figure 1.
In the last few years, this interface has become the de facto standard on all leading 10 Gbps framer ASSPs and has been implemented directly on many next-generation network processors. SPI-4.2 has been broadly adopted because of its efficient interface, which offers high bandwidth with a low pin count and seamless handling of typical system requirements such as flow control, error insertion/detection, synchronization, and bus re-alignment.
The Xilinx® Virtex-4™ architecture provides an ideal platform for implementing SPI-4.2. The Xilinx SPI-4.2 LogiCORE™ IP targeting Virtex-4 devices provides a solution with one-third fewer resources, dramatic power savings, 1+ Gbps LVDS double-data-rate (DDR) I/O, and complete pin assignment flexibility.
SPI-4.2 LogiCORE IP
Xilinx has improved on its Virtex-II™ and Virtex-II Pro™ SPI-4.2 solution, already one of the smallest in the industry, and made it 30% smaller by leveraging new ChipSync™ technology in the Virtex-4 FPGA. ChipSync technology is supported on every pin of the Virtex-4 device family; thus the new SPI-4.2 LogiCORE IP can be targeted to any device pin-out. This allows you to select I/O pins that best fit your system and PCB requirements.
In addition, for those applications requiring multiple SPI-4.2 interfaces, the Virtex-4 FPGA’s logic density, high pin count, and extensive clocking resources will support four or more full-duplex cores in a single device. Regardless of the performance your application requires, Virtex-4 devices fully support the entire SPI-4.2 operating range, with high-speed LVDS support of data rates greater than 1 Gbps per pin.
ChipSync Technology
Xilinx introduced ChipSync technology in Virtex-4 FPGAs to enhance I/O capability when used for source-synchronous applications like SPI-4.2. ChipSync features are supported in every Virtex-4 I/O pin and include:
• New serializer and deserializer (OSERDES and ISERDES) features. These enable logic built in the fabric to interface to the I/O at a fraction of the source-synchronous clock rate. The ISERDES also includes a Bitslip function. Bitslip allows you to shift the starting bit of deserialized data to achieve proper word alignment when linking multiple pins together (bus deskew).
• A new input delay (IDELAY) feature. This allows you to precisely adjust the input delay of each bit of a bus independently, in 78 ps increments. This provides a mechanism for tuning the interface timing to the system environment.
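The word-alignment idea behind Bitslip can be illustrated with a small behavioral model. The Python sketch below is not the ISERDES hardware or its actual slip sequence; the 4-bit word width, the training word, and the function names are all hypothetical, chosen only to show how moving the starting bit recovers word alignment.

```python
def deserialize(bits, width, slip):
    """Group a serial bit list into width-bit words, starting `slip` bits in."""
    usable = bits[slip:]
    count = len(usable) // width
    return [tuple(usable[i * width:(i + 1) * width]) for i in range(count)]


def align_with_bitslip(bits, width, training_word):
    """Slip one bit at a time until every recovered word equals the training word."""
    for slip in range(width):
        words = deserialize(bits, width, slip)
        if words and all(word == training_word for word in words):
            return slip
    return None  # no alignment found within one word length


TRAINING = (1, 1, 0, 0)                # hypothetical 4-bit training word
stream = [0, 0] + list(TRAINING) * 4   # link starts two bits out of alignment
print(align_with_bitslip(stream, 4, TRAINING))
```

In this model the receiver needs two slips before the deserialized words line up with the training word.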
Virtex-4 devices offer an ideal platform for source-synchronous designs like the widely adopted SPI-4.2 interface.
SYSTEM DESIGN CHALLENGES
Additional DDR registers are now fully integrated into the input (ILOGIC) and output (OLOGIC) pins, simplifying the interface between the FPGA fabric and I/O blocks and supporting data transfer to and from the I/O logic on a single clock edge.
SPI-4.2 and ChipSync Technology
The SPI-4.2 interface has a DDR source-synchronous data bus that comprises 18 LVDS pairs (16 data bits, 1 control bit, and 1 clock). The SPI-4.2 source-synchronous clock varies from 311 MHz to 500 MHz.
As the frequency of the source-synchronous clock increases, data recovery at the receiving (sink) device becomes more challenging. The SPI-4.2 protocol provides calibration data, or a training pattern, that permits a receiving device to adjust its data sampling to the system interface timing. The process of tuning the interface to its particular timing is referred to as dynamic phase alignment (DPA).
Before Virtex-4 devices, Xilinx DPA solutions worked by over-sampling the input data and choosing the best sample from the group. This required valuable FPGA resources and careful control of the input data path in the FPGA fabric, restricting the SPI-4.2 interface pin placement. In Virtex-4 FPGAs, the IDELAY feature present in every I/O is ideally suited to perform this function, as shown in Figure 2. (See “Dynamic Phase Alignment with ChipSync Technology in Virtex-4 FPGAs,” also in this issue of the Xcell Journal.)
The IDELAY features have two primary benefits for the SPI-4.2 core in Virtex-4 FPGAs:
• Integrating the IDELAY feature into the input pin (ILOGIC) reduces the FPGA resources required for DPA to fewer than 350 slices.
• The IDELAY function’s ability to adjust the data sampling point enables DPA to be implemented in the I/O, except for a small control state machine, which is implemented in the fabric. The state machine portion is fully synchronous and does not require a complex macro. Thus, there are no restrictions on SPI-4.2 pin assignments.
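One common way to use a per-bit delay line for phase alignment is to sweep the delay taps while the training pattern is being received and then park the delay in the middle of the widest error-free window. The Python sketch below illustrates that idea only; it is not the algorithm used by the Xilinx core, and the sweep data and function name are hypothetical. The only figure taken from the article is the 78 ps tap increment.

```python
TAP_PS = 78  # IDELAY increment quoted in the article


def center_of_widest_window(tap_ok):
    """Return the middle tap of the longest run of taps that sampled the training pattern correctly."""
    best_start, best_len = None, 0
    start = None
    for tap, ok in enumerate(list(tap_ok) + [False]):  # sentinel closes a trailing run
        if ok and start is None:
            start = tap
        elif not ok and start is not None:
            if tap - start > best_len:
                best_start, best_len = start, tap - start
            start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


# Hypothetical sweep for one data bit: True where the training pattern was received correctly.
sweep = [False, False, True, True, True, True, True, False, False, True, False]
tap = center_of_widest_window(sweep)
print(tap, tap * TAP_PS, "ps")
```

Because each bit has its own delay element, a small state machine could run the same search independently per pin, which matches the article’s point that only the control logic needs to live in the fabric.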
Clocking Resources
Virtex-4 FPGAs provide an unprecedented number of clock resources for implementing multiple SPI-4.2 interfaces in a single device. With the Virtex-II and Virtex-II Pro architectures, implementing more than two SPI-4.2 interfaces posed a clock management challenge. The abundance and flexibility of clock distribution in the Virtex-4 family solves this challenge, supporting as many SPI-4.2 interfaces as the device logic and I/O will allow.
For example, a typical OC-192 framer will require an aggregate bandwidth of 10 Gbps, which for a 16-bit double-data-rate bus would require a data clock of at least 311 MHz, with 350 MHz a typical clock rate. The Xilinx SPI-4.2 LogiCORE IP easily meets your application requirements, regardless of performance, and with Virtex-4 ChipSync technology delivers a solution that is smaller and more flexible than prior FPGA implementations.
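The clock figures in this example follow from simple bus arithmetic. The Python sketch below computes the raw throughput of the 16-bit DDR data bus; it ignores the control bit and any protocol overhead, and the function name is illustrative.

```python
def ddr_bus_gbps(width_bits, clock_mhz):
    """Raw throughput of a DDR bus: two transfers per clock on every data line."""
    return width_bits * 2 * clock_mhz / 1000


print(ddr_bus_gbps(16, 311))  # minimum clock quoted in the article
print(ddr_bus_gbps(16, 350))  # typical clock quoted in the article
```

At 311 MHz the bus carries about 9.95 Gbps, just enough for an OC-192 payload, while 350 MHz gives 11.2 Gbps and some headroom.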
The SPI-4.2 core usesChipSync technology to serializeegress data and de-serialize ingressdata to a four-word (bus cycle)SPI-4.2 data stream at a lowerclock rate. Operation of the corelogic at a lower internal clock rate
allows you to implement high-frequencySPI-4.2 interfaces in the slowest speedgrade Virtex-4 device.
The ISERDES and OSERDES functionsallow the core logic to time multiplex andde-multiplex these four words to and fromthe I/O logic without using any CLB logicresources. The core logic need only operate athalf the source-synchronous DDR clockrate. For example, a SPI-4.2 interface with a500 MHz DDR reference clock would onlyrequire an FPGA fabric clock of 250 MHz –easily achievable in the Virtex-4 architecture.
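The clock figures quoted above follow from simple arithmetic, sketched here. The helper names are illustrative only. A flat 10 Gbps gives 312.5 MHz; the "at least 311 MHz" in the text presumably reflects the slightly lower OC-192 line rate, which is an assumption on our part.

```python
def ddr_clock_mhz(aggregate_gbps, bus_bits=16):
    """Clock needed by a DDR bus: two transfers of bus_bits per clock cycle."""
    return aggregate_gbps * 1000.0 / (bus_bits * 2)


def fabric_clock_mhz(ddr_clock, words_per_fabric_cycle=4):
    """Fabric clock when the I/O logic hands over several bus words per cycle."""
    words_per_microsecond = ddr_clock * 2  # DDR: one word on each clock edge
    return words_per_microsecond / words_per_fabric_cycle


print(ddr_clock_mhz(10.0))      # 312.5
print(fabric_clock_mhz(500.0))  # 250.0
```

With four words transferred per fabric cycle, the fabric clock is half the DDR clock, matching the 500 MHz to 250 MHz example.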
[Block diagram: a Virtex-4 device containing the SPI-4.2 Sink Core (SPI-4.2 sink interface, user sink interface) and the SPI-4.2 Source Core (SPI-4.2 source interface, user source interface). The SPI-4.2 interface side connects to a PHY layer device or MPU through the Rx data, Rx status, Tx data, and Tx status paths; the user interface side connects to the user's logic.]
[Figure 2 panel labels, Virtex-II or Virtex-II Pro FPGA SPI-4.2 dynamic phase alignment (DPA): receive LVDS DDR I/O; time-sliced (delay chain) oversampling (8 times/bit); per-bit sample selection state machine.]
Figure 2 – DPA implementation in I/O logic for Virtex-II devices versus Virtex-4 devices
All Virtex-4 devices have 32 global clock resources. No restrictions exist on global clock distribution other than a maximum of eight global clocks per clock region. Each clock region has access to any 8 of the 32 global buffers, regardless of the requirements of other clock regions.
In addition to the eight global clocks, each region in the device has two regional clock buffers. The regional clock resources are ideal for interface clocking, like the source-synchronous clock scheme used by SPI-4.2. Note that even the smallest Virtex-4 device has a total of 48 available clock resources, each designed for low-skew clock distribution and clock power management. The SPI-4.2 LogiCORE IP can be configured to use either global or regional clock resources.
In Virtex-4 FPGAs, the global clock trees and associated buffers are implemented differentially, for best duty-cycle fidelity and greater common-mode noise rejection. With Virtex-II and Virtex-II Pro devices, if the SPI-4.2 interface operates above 350 MHz, you must route the high-speed reference clock using two clock buffers to minimize duty-cycle distortion at the DDR registers. Because each global clock tree in Virtex-4 FPGAs is implemented differentially, only one clock buffer is required.
Not only does the Virtex-4 architecture have considerably more clock resources, but because they are distributed differentially, the SPI-4.2 LogiCORE IP requires fewer of them. These high-performance clock resources support as many as four SPI-4.2 interfaces in a mid-range device (LX40/LX60) and more than four SPI-4.2 interfaces in the larger devices (Figure 3). The Virtex-4 clocking capability opens up a whole new class of SPI-4.2 applications, and provides an ideal platform for applications such as multiplexing and de-multiplexing, bridges, and switches.
Higher Performance at Lower Power
Virtex-4 silicon is manufactured with a triple-oxide process that reduces static power consumption by 40%. This benefits all designs, including the SPI-4.2 interface, where the power savings are dramatic, as summarized in Table 1.
With Virtex-4 devices, SPI-4.2 uses significantly less power than its Virtex-II and Virtex-II Pro predecessors, both because of the enhanced 90 nm semiconductor process and because the LogiCORE IP uses 30% fewer fabric resources. At the same time, Virtex-4 FPGAs support 30% higher internal performance for SPI-4.2, with a maximum frequency of 250 MHz in the lowest speed grade (compared to 175 MHz in the lowest speed grade of Virtex-II and Virtex-II Pro devices). In addition, Virtex-4 FPGAs support 1+ Gbps LVDS for every I/O on the device.
This means that not only can you place multiple SPI-4.2 interfaces anywhere on the device, but for each implemented interface you get an aggregate bandwidth as high as 16+ Gbps. Designs that do not require this level of performance (such as more typical framer interfaces running at 10-12 Gbps) automatically get additional performance overhead that ensures ease of design integration and timing closure.
Conclusion
The Xilinx SPI-4.2 LogiCORE IP, coupled with Virtex-4 features, provides a highly efficient SPI-4.2 solution. We developed ChipSync technology, which is supported on every I/O pin, specifically for source-synchronous interfaces like SPI-4.2.
This technology enables you to design the most efficient SPI-4.2 solution, which uses significantly fewer resources (35% fewer), allows fully flexible device pin assignments (you choose the pinout), and supports extremely high interface speeds (1+ Gbps LVDS DDR I/O).
The higher performance is even more compelling because Virtex-4 FPGAs deliver it with lower power and significantly higher internal operating rates. The wealth of Virtex-4 clocking resources, combined with full pin assignment flexibility, opens up the possibility for new applications with multiple SPI-4.2 interfaces.
For more information about the SPI-4.2 LogiCORE IP targeting Virtex-4 devices, please refer to the Xilinx IP Center at www.xilinx.com/xlnx/xebiz/designResources/ip_product_details.jsp?key=DO-DI-POSL4MC. A hardware demonstration is also available; for more information, contact your Xilinx representative.
Figure 3 – Illustration of four SPI-4.2 LogiCORE IP cores implemented on a Virtex-4 XC4VLX60 device
Table 1 – SPI-4.2 power estimates for Virtex-II, Virtex-II Pro, and Virtex-4 FPGAs
by Maria George
Senior Product Applications Engineer
Xilinx, [email protected]
Xilinx® Virtex-4™ devices have a 64-tap absolute delay element built into each I/O, making high-speed memory interface read data capture very easy. This feature also provides the flexibility to adopt different read data capture schemes in which either the clock/strobe or the data is delayed.
During a write to the external memory device, the clock/strobe must be transmitted center-aligned with respect to data. A memory write is easy to implement with Virtex-4 devices by means of the quadrature phase outputs of the DCM (CLK0, CLK90, CLK180, CLK270), ensuring that the clock/strobe is center-aligned with data. Figure 1 illustrates the clock/strobe and data phase relationship during read and write transactions.
For most memory interfaces, such as DDR2 SDRAM, RLDRAM II, FCRAM II, and QDR II SRAM, the data rate is twice the clock rate because data is received and transmitted on both the rising and falling edges of the forwarded clock/strobe. Virtex-4 devices have both input and output DDR flip-flops, making DDR operation extremely simple.
Write Data and Clock/Strobe Transmission
During a write operation, the clock/strobe is generated using the output DDR registers clocked by a DCM clock output (CLK0) on the global clock network. The write data is transmitted using the output DDR registers clocked by a DCM clock output that is 90 degrees phase ahead (CLK270) of the clock used to generate clock/strobe. This meets the memory vendor specification of centering the clock/strobe in the data window.
Another innovative feature of the output DDR registers is the SAME_EDGE mode of operation. In this mode, a third register clocked by a rising edge is placed on the input of the falling edge register (Figure 2). Using this mode, both rising edge and falling edge data can be presented to the output DDR registers on the same clock edge (CLK270), thereby allowing higher DDR performance with minimal register-to-register delay.
Read Data Capture
Most memory interfaces are source-synchronous interfaces, where the clock/strobe is received edge-aligned with data during a read from the external memory device. This makes read data capture challenging because the read clock/strobe must be delayed to capture read data.
Virtex-4 devices make challenging memory interface requirements simple.
[Figure 1 timing diagram: received read data at the FPGA (word 0 to word 3) shown with the read clock/strobe; transmitted write data from the FPGA (word 0 to word 3) shown with the transmitted clock/strobe.]
Figure 1 – Clock/strobe and data during read and write
The traditional technique to capture read data is to register it in the delayed memory clock/strobe domain. This entails:
• Ensuring that the memory clock/strobe and the associated data have matched PCB trace delays between the memory device and the FPGA
• Delaying the clock/strobe signals such that the edges of the clock/strobe center in the valid data window, as shown in Figure 3
• Registering the read data with the delayed memory clock/strobe
• Synchronizing registered read data to the system (FPGA) clock domain
An alternate and simpler technique, currently used in Xilinx reference designs, is to capture read data directly in the system (FPGA) clock domain. This entails:
• Ensuring that the memory clock/strobe and the associated data have matched PCB trace delays between the memory device and the FPGA
• Determining the phase difference between the memory clock/strobe and the system (FPGA) clock by detecting two memory clock/strobe transitions in the system clock domain
• Detecting transitions of the memory clock/strobe after the memory initialization sequence by delaying the memory clock/strobe with respect to the system (FPGA) clock in unit increments
• Delaying read data based on the memory clock/strobe to system (FPGA) clock phase information such that the system (FPGA) clock is centered in the valid data window
Both techniques require delay elements to delay the clock/strobe or data.
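As a rough illustration of how much delay centering requires, the sketch below converts a quarter clock period (the offset from an edge-aligned strobe to the middle of a DDR data bit) into a tap count, using the 64-tap, 80 ps delay element figures quoted in this article. The function name and the example clock rates are assumptions chosen for illustration; a real design calibrates at run time rather than computing a fixed value.

```python
TAP_PS = 80      # per-tap delay quoted in the article
NUM_TAPS = 64    # taps available in each I/O delay element


def taps_for_quarter_period(clock_mhz):
    """Tap count that approximates a quarter clock period of delay."""
    period_ps = 1e6 / clock_mhz
    taps = round(period_ps / 4 / TAP_PS)
    if taps > NUM_TAPS - 1:
        raise ValueError("quarter period exceeds the delay line range")
    return taps


print(taps_for_quarter_period(200.0))  # 16  (1250 ps / 80 ps = 15.6)
print(taps_for_quarter_period(100.0))  # 31  (2500 ps / 80 ps = 31.25)
```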
[Figure 2 diagram: output DDR register with inputs D1, D2, CE, C, S, and R feeding a DDR MUX that drives OQ. Timing: D1 carries D1A, D1B, D1C, D1D; D2 carries D2A, D2B, D2C, D2D; OQ carries D1A, D2A, D1B, D2B, D1C, D2C, D1D.]
Figure 2 – Output DDR in SAME_EDGE mode
[Figure 3 timing diagram: clock/strobe from memory, delayed clock/strobe in FPGA, and read data (word 0 to word 7).]
Figure 3 – Clock/strobe delayed in FPGA to center in read data window
The 64-tap, 80 ps absolute delay element available in each Virtex-4 I/O allows center alignment of the memory clock/strobe in the data window, or centering of the data with respect to the system (FPGA) clock. Each Virtex-4 I/O also has input DDR flip-flops that are required for read data capture, either in the delayed memory strobe domain or the system (FPGA) clock domain.
You can use the input DDR flip-flops in the SAME_EDGE or SAME_EDGE_PIPELINED modes. In the SAME_EDGE mode, the falling edge data is output on the following rising edge of the clock (Figure 4). In the SAME_EDGE_PIPELINED mode, both the rising edge and falling edge data are output together on the same rising edge of the clock (Figure 5). With these modes you can achieve higher design performance by avoiding half-clock cycle data paths in the FPGA fabric.
In the first technique, read data is captured in the delayed memory clock/strobe domain and must be re-captured in the system (FPGA) clock domain. The transfer of captured read data from the delayed memory clock/strobe domain to the internal system (FPGA) clock domain is defined as read data re-capture. Read data is re-captured within the I/O block.
Using the second technique, implemented in the Xilinx reference designs, you can directly capture read data in the system (FPGA) clock domain by delaying read data to meet the setup/hold time of the flip-flops in the system (FPGA) clock domain. A simple state machine is sufficient to implement the center alignment of the delayed read data with respect to the system (FPGA) clock after the initialization period.
This “run time” adjustment after the memory initialization sequence has significant advantages over other methods that set the required delay or phase shift during “compile time.” The 64-tap absolute delay element compensates for variations in process, temperature, or voltage, and hence increases the timing margins – resulting in a more reliable system.
The read data is re-captured and stored directly into the block RAM FIFO, a Virtex-4 feature that saves additional logic resources.
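The difference between the two input DDR modes described earlier can be made concrete with a small behavioral sketch. It models only which captured values appear together on each rising clock edge, following the prose description in this article; it is not a timing-accurate model of the I/O primitive, and the function and argument names are hypothetical.

```python
def iddr_outputs(rise, fall, mode):
    """List the (Q1, Q2) pair visible at each rising clock edge.

    rise[k] is the data captured on rising edge k and fall[k] the data
    captured on the falling edge that follows it. None marks a value
    that is not yet valid.
    """
    pairs = []
    if mode == "SAME_EDGE":
        # Falling-edge data appears on the following rising edge, so it
        # sits next to the NEXT rising-edge sample.
        for k in range(len(rise)):
            pairs.append((rise[k], fall[k - 1] if k > 0 else None))
    elif mode == "SAME_EDGE_PIPELINED":
        # Both halves of a pair are held and presented together, one
        # rising edge after the pair started arriving.
        pairs.append((None, None))
        for k in range(len(rise)):
            pairs.append((rise[k], fall[k]))
    else:
        raise ValueError("unknown mode: " + mode)
    return pairs


rise = ["r0", "r1", "r2"]
fall = ["f0", "f1", "f2"]
same_edge = iddr_outputs(rise, fall, "SAME_EDGE")
pipelined = iddr_outputs(rise, fall, "SAME_EDGE_PIPELINED")
```

In this model the pipelined mode costs one extra cycle of latency but delivers matching rising and falling samples side by side.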
Conclusion
Virtex-4 architectural features enable you to easily and reliably implement high-speed memory interfaces. You can use the 64-tap, 80 ps absolute delay elements to capture read data by delaying either the memory clock/strobe or the data. Built into each I/O, the 64-tap absolute delay elements give you the flexibility to select any I/O for memory interfaces. The “run time” adjustment after memory initialization improves design margins.
The input and output DDR registers enable you to receive and transmit clock/strobe and data at high frequencies; the differential clocking resource provides higher performance with better duty cycle and lower global clock buffer utilization; and the block RAM FIFO feature enables you to store transmitted or received data without additional logic resources.
For more information about the implementation and design details of different memory interfaces in Virtex-4 devices, visit the following websites:
• DDR2 SDRAM (XAPP701 and XAPP702) and DDR SDRAM (XAPP709): www.xilinx.com/products/design_resources/mem_corner/resource/xaw_dram_ddr.htm
by Niall Battson DSP Applications Engineer Xilinx, [email protected]
With the introduction of Xilinx® Virtex-4™ FPGAs in September 2004, the world of DSP design witnessed a dramatic leap in programmable logic DSP: higher performance, lower cost, lower power, and maximum flexibility.
At the same time, this leap asks DSP hardware engineers to change their traditional way of designing and embrace a different approach. These great improvements have been made possible by the XtremeDSP™ slice.
The XtremeDSP Slice
The XtremeDSP slice (also referred to as the DSP48) is a high-performance multiplier and arithmetic unit with great flexibility that can form the building block of many DSP algorithms implemented in FPGAs. A detailed diagram of the DSP48 structure is shown in Figure 1.
The XtremeDSP slice comprises four mainsections:
• I/O registers
• 18 x 18 signed multiplier
• Three-input adder/subtractor
• Op-mode multiplexers
The I/O registers ensure a maximum clock performance of 500 MHz in the fastest speed grade device (400 MHz in the slowest speed grade), which in turn supports higher sample rates. The dynamic op-mode multiplexers are key to the functionality of the structure; they are responsible for the DSP48’s great flexibility. For example, in a simple MACC engine, you set the X and Y MUX to multiply and select the feedback path from the registered output P as the Z MUX input to the arithmetic unit.
In the Virtex-4 architecture, XtremeDSP slices are arranged in columns. The most important aspect of the column is the cascade logic and routing between each block, which exists on both the input and output stages of each slice. This dedicated routing enables a number of filters and other functions to be built entirely within the XtremeDSP slice, thus removing the need for signals to be routed through the FPGA interconnect or logic fabric.
Designing with the Virtex-4 XtremeDSP Slice
Harness the full capabilities of the XtremeDSP slice in filter design.
However, you must take this adder-chain configuration into account when designing functions that exploit the XtremeDSP slice. Herein lies the fundamental change in the approach to filter design. The simple, traditional adder-tree approach limited the performance and extensibility of a given filter implementation. By using adder-chain-style implementations, these limitations are lifted and the huge benefits Virtex-4 FPGAs offer are possible.
The embedded nature of the XtremeDSP slice has also had a radical impact on reducing the power consumed by high-speed multiply and add functions. Figure 2 illustrates this dramatic reduction, showing that the dynamic power consumption, at approximately 2.3 mW/100 MHz, is 1/17 that of Virtex-II Pro™ devices. As a designer, you should migrate as much functionality into these embedded functions as possible.
Filter Techniques
During the last ten years, hardware and FPGA designers have created a wide variety of filter architectures to efficiently exploit the building blocks that the current generation of technology offers. With the introduction of Virtex-4 FPGAs and the XtremeDSP slice, filter implementations must change to most efficiently exploit this latest FPGA offering. Filters are prolific in DSP designs and nearly always form the starting point for analyzing an architecture.
The Semi-Parallel FIR Filter
Even within the filter world, you can implement a wide variety of filters. The key parameters that determine which FIR filter implementation to construct are:
• Number of coefficients (N)
• Sample rate (Fs)
Let’s examine a particular filter structureto demonstrate the key design techniquesthat can help you maximize the benefits ofVirtex-4 devices. Our filter has 20 coeffi-cients and a sample rate of 74.25 MHz.
As noted earlier, the maximum clock speed of the XtremeDSP slice is 400 MHz in the slowest speed grade (-10). Because 400 MHz / 74.25 MHz is approximately 5.4, we have a total of five whole clock cycles per input sample to perform the required 20 multiply and adds to form the result.
This equation determines how many multipliers to use for a particular semi-parallel architecture:
Number of Multipliers = (Maximum Input Sample Rate x Number of Coefficients) / Clock Speed
For our example, (74.25 MHz x 20) / 400 MHz = 3.7, so the required number of multipliers, rounded up, will be four. Once we have determined the required number of multipliers, there is an extendable architecture using the XtremeDSP slices that can serve as the basis for the filter.
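The sizing arithmetic above is easy to check. The short sketch below applies the multiplier equation and the cycles-per-sample calculation to the example values; the function names are illustrative only.

```python
import math


def multipliers_needed(sample_rate_mhz, num_coefficients, clock_mhz):
    """Multiplier count from the sizing equation, rounded up to a whole number."""
    return math.ceil(sample_rate_mhz * num_coefficients / clock_mhz)


def cycles_per_sample(sample_rate_mhz, clock_mhz):
    """Whole clock cycles available between consecutive input samples."""
    return math.floor(clock_mhz / sample_rate_mhz)


print(multipliers_needed(74.25, 20, 400.0))  # 4
print(cycles_per_sample(74.25, 400.0))       # 5
```

Four multipliers each handling five coefficients cover all 20 taps within the five available cycles.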
The general FIR filter equation is a summation of products (also known as an inner product) defined in the equation:
$$y(n) = \sum_{i=0}^{N-1} h_i \, x(n-i)$$
In this equation, a set of N coefficients ismultiplied by N respective data samples,and the results are summed to form anindividual result. The values of the coeffi-cients determine the characteristics of thefilter: low-pass, band-pass, or high-pass.
Figure 1 – Simplified diagram of the XtremeDSP slice
[Figure 2 chart values: Virtex-4, ~2.3 mW/100 MHz; Virtex-II Pro*, ~39 mW/100 MHz; Virtex-II*, ~47 mW/100 MHz. * Based on power estimator spreadsheet, uses slice logic.]
Figure 2 – Dynamic power consumption of the XtremeDSP slice
XtremeDSP arithmetic units are designed to be chained together easily and efficiently thanks to dedicated routing between slices. Figure 3 illustrates how the four XtremeDSP multiply and add elements are cascaded together to form the main part of the filter.
It is critical to highlight the usage of the adder chain here rather than the more traditional adder tree. The adder chain has a profound impact on the control logic required for the filter, as well as its efficiency, because of the mapping to the XtremeDSP slice.
Continuing to analyze the filter structure, an extra XtremeDSP slice is required to perform the accumulation of the partial results, thus creating the final result. A new result is created every five clock cycles. This means that for every five cycles the accumulation must be reset to the first inner product of the next result. This reset (or load) is achieved by changing the op-mode value of the XtremeDSP slice for a single cycle, from 0010010 to 0010000 (this is just a single bit change). At the same time, the capture register is enabled and the final result stored on the output.
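The arithmetic schedule described above can be sketched as follows: four multiply-add elements each take one coefficient per cycle, the adder chain sums them into a partial result, and the accumulator loads the first partial result of each five-cycle group instead of adding to the old total. This is a functional model only. It ignores pipeline latency, op-mode encoding, and bit widths, and the assignment of coefficients h0-h4, h5-h9, and so on to the four elements is taken from the labels in Figure 3.

```python
def fir_direct(h, x):
    """Reference FIR: y(n) = sum over i of h[i] * x(n - i)."""
    return [sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)
            for n in range(len(x))]


def fir_semi_parallel(h, x, num_mults=4):
    """Same filter computed as num_mults products per cycle plus an accumulator."""
    assert len(h) % num_mults == 0
    cycles = len(h) // num_mults          # 5 cycles per output for 20 taps
    y = []
    for n in range(len(x)):
        acc = 0
        for c in range(cycles):
            partial = 0                   # adder chain across the multipliers
            for m in range(num_mults):
                i = m * cycles + c        # element m holds h[5m] .. h[5m+4]
                if n - i >= 0:
                    partial += h[i] * x[n - i]
            # Load on the first cycle of each group, accumulate afterwards.
            acc = partial if c == 0 else acc + partial
        y.append(acc)
    return y


h = list(range(1, 21))                        # 20 illustrative coefficients
x = [((7 * i) % 11) - 5 for i in range(40)]   # arbitrary integer test samples
```

Both functions add the same 20 products for each output, so their results match exactly for integer inputs.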
The Control Logic
The control is the most important and complicated aspect of semi-parallel FIR filters; getting it right is crucial to filter operation. Because the XtremeDSP slice is most efficiently used in adder chains, the memory addressing must provide the delay that the adder chain introduces for each multiply-add element. Figure 4 illustrates the control logic required to create memory addressing.
The counter creates the fundamental zero through four count. This is then delayed by one cycle by the use of a register in the control path. Each successive delay is used to address both the coefficient memory and the data buffer – and their respective multiply-add elements. Hence, a single delay is required for the second multiply-add element, two delays for the third multiply-add element, and so on. Note that this is extensible control logic for M number of multipliers.
Figure 4 also shows write enable sequencing. A relational operator is required to determine when the count-limited counter resets its count. This signal is high for one clock cycle every five cycles, reflecting the input and output data rates. The clock enable signal is delayed by a single register just like the coefficient address; each delayed version of the signal is tied to the respective section of the filter.
The filter and control logic are extremely cascadable. The address for each SRL16E data buffer and coefficient memory pair is identical, and is a delayed version of the previous element’s address.
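The delayed addressing can be tabulated with a short sketch. It assumes, as described above, that the first multiply-add element uses the counter value directly and that each later element sees it one more cycle late. The function name and the use of None for cycles before the delay chain has filled are illustrative choices.

```python
def stage_addresses(num_stages=4, depth=5, num_cycles=10):
    """Address seen by each multiply-add element on each clock cycle."""
    counter = [t % depth for t in range(num_cycles)]   # counts 0..4 repeatedly
    table = []
    for k in range(num_stages):
        # Element k is fed through k registers, so it lags the counter by k cycles.
        table.append([counter[t - k] if t >= k else None
                      for t in range(num_cycles)])
    return table


for k, row in enumerate(stage_addresses()):
    print("element", k, row)
```

Each row is the previous row shifted by one cycle, which is why the same address logic can be extended to any number of multipliers.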
The performance and resource utilization for our filter is specified in Table 1. In the table, you can see how logic slice utilization dramatically drops when using the XtremeDSP slice. Clock frequency performance approximately doubles over Virtex-II Pro FPGAs.
[Figure 3 diagram: input x(n) (18 bits) feeds four coefficient and data buffers holding h0-h4, h5-h9, h10-h14, and h15-h19; cascaded DSP48 slices are labeled opmode = 0000101, opmode = 0010101, and opmode = 0010010; the output is y(n) (40 bits). Control signals WE1, WE2, WE3, and LOAD are also labeled.]
Figure 3 – The four-multiplier semi-parallel systolic FIR filter
[Figure 4 diagram: a counter counting 0 -> (N/M-1) and a comparator (= N/M-2) drive chains of registers (D, CE, Q) that produce the Address and WE signals for coefficient and data buffers 0 through 3.]
Figure 4 – Control logic for the four-multiplier semi-parallel FIR filter
Four-Multiplier 20-Tap Semi-Parallel FIR Filter
(18-Bit Data, 18-Bit Coefficients)    Virtex-4 (-11)    Virtex-II Pro (-7)
Logic Slices                          108               309
XtremeDSP Slices                      5                 n/a
Embedded Multipliers                  n/a               7
Performance (Sample Rate)             90 MHz            77 MHz
Performance (Clock Frequency)         450 MHz           231 MHz
Table 1 – Resource utilization and performance of four-multiplier 20-tap semi-parallel FIR filter
Three Important Design Points
This new filter architecture, along with Virtex-4 devices and the XtremeDSP slice, addresses the demanding needs of current and future DSP designs. However, it is only one filter in an extremely large array of possible implementations, not to mention other DSP functions such as IIRs, FFTs, and DCTs.
Knowing this, you can take away three very important design questions that will enable you to exploit the XtremeDSP slice and Virtex-4 device as designed.
1. Is the design running as fast as possible?
The fastest speed grade (-12) should run at 500 MHz. If your design is running at 50 MHz, you have room to reduce your resource utilization (and cost) by raising the clock rate and making more efficient use of the FPGA resources. The faster a particular function operates, the smaller it becomes. Our semi-parallel FIR filter, for example, used five XtremeDSP slices running at 375 MHz instead of 20 XtremeDSP slices running at 74.25 MHz.
2. Are there any XtremeDSP slices left?
If you are not using them all, you can probably move more functionality into them. This can lead to logic slice reduction and lower power consumption.
3. Are you using adder chains instead of adder trees?
DSP algorithms must aim to exploit adder chain-based implementations wherever possible, as this will lead to the best utilization of the XtremeDSP slice. Such implementations will result in performance gains, power reduction, and logic slice reduction.
Conclusion
For more information, see the XtremeDSP Slice Design Considerations User Guide, which provides in-depth details on other filter implementations and DSP functions, at www.xilinx.com/bvdocs/userguides/ug073.pdf. There are also other HDL and System Generator for DSP reference designs to get you started.
by Suresh Sivasubramaniam Senior Design Engineer Xilinx, Inc. [email protected]
The Xilinx® Virtex-4™ FX family of devices contains up to 24 RocketIO™ multi-gigabit transceivers, each capable of operating anywhere from 622 Mbps to 11 Gbps. This seamless scalability, coupled with support for various emerging standards (Figure 1), allows you tremendous flexibility to upgrade today’s designs to meet increasing bandwidth requirements.
To realize the full potential of this upgradeability to high-bandwidth processing applications, you must carefully design the serial interconnect channels on the PCB, whether on line cards or backplanes.
Once the transfer characteristics of the physical channel are well understood, you can effectively employ features such as transmit pre-emphasis/voltage swing and receive equalization (Figure 2) to overcome losses and attenuation in the channel, thus ensuring high signal integrity at the receiver.
MK322 Evaluation Board Case Study
The MK322 platform is the primary board used for the electrical evaluation and characterization of the RocketIO X high-speed serial multi-gigabit transceivers in Virtex-II Pro™ X FPGAs. This board was specifically designed to evaluate and test the RocketIO X transceiver and is available for sale.
The SMA connectors on the board allow you to interface the board to a scope or to other boards, or to run loopback tests. The physical channel for each transceiver is carefully optimized to ensure the highest signal quality at the SMAs (on the transmit path) or at the FPGA (on the receive path).
The data can significantly degrade after it has passed through the transmission path. Degradation includes loss of signal amplitude, slower signal rise time, and a spreading at the zero crossings. It is critical to model the transmission path when designing a high-performance, high-speed serial interconnect system. The transmission path may include long transmission lines, connectors, vias, and crosstalk from adjacent interconnect.
MK322 Board Stackup
The MK322 is a 12-layer board. The stackup and trace geometries are designed for 100 Ohm differential and 50 Ohm single-ended signaling. The board material is standard FR4 (Er = 4.2 and tanδ = 0.02). All trace and plane layers are 0.5 oz. copper (0.65 mil thick). The electrical channel of interest for our case study is routed as follows: microstrip on the top layer, transitioning to layer 10 stripline through a GSSG differential via.
Designing For Signal Integrity
You can use the Xilinx/Ansoft 10 Gbps Backplane Design Kit to predict interconnect performance.
[Figure 1 chart: serial standards by application area plotted against line rate (axis: 0.622, 1.0, 2.0, 3.0, 5.0, 6.0, 10.0, 11.0 Gbps), with coverage shown for Virtex-II Pro, Virtex-II Pro X, Rocket PHY, and Virtex-4.
Storage: 1GFC (1.06), 2GFC (2.12), 4GFC (4.25), 8GFC (8.5), 10GFC (10.519); SATA (1.5), SATA 2 (3.0), SATA 3 (6.0)
Networking: GbE (1.25), XAUI (3.125), CEI (OIF) (6.25), 10GbE (10.313), CEI (OIF) (11G)
Telecom: OC-12 (0.622), OC-48 (2.488), OC-192 (9.952)
Computing: GbE (1.25), SATA (1.5), PCIE (2.5), SATA 2 (3.0)
Video: HD-SDI (1.45)]
Figure 1 – Seamless scaling from 622 Mbps to 10 Gbps
[Figure 2 table, Feature and Benefit:
• Programmable Termination (Yes): Reduces reflections
• Programmable Voltage Swing (Yes): Reduces power
• Transmit Pre-Emphasis (Yes): Equalizes simple channels
• Integrated AC Coupling (Yes): Direct interface to other devices, reduces component count
• Receive Equalization (Linear and DFE): Equalizes stringent channel; allows legacy backplanes to be upgraded
• Automatic EQ Settings Algorithm (Yes): Automatically finds optimum EQ setting for a given channel; eases design and ensures signal integrity]
Figure 2 – Programmable pre-emphasis and equalization features in the Virtex-4 FX family
Differential Signal Topology
The differential signals are routed into and out of the board using Rosenberger™ high-performance coax-to-board SMA connectors. The signals are routed from the top-mounted connector to the FPGA using stripline transmission lines (layer 10), which transition to microstrip before interfacing with the FPGA BGA package. The actual trace layout for one Tx and Rx pair is shown in Figure 3.
Modeling and Simulation
The electrical channel comprises five main sections (Figure 4):
• The BGA package
• Microstrip transmission line
• Differential via (GSSG configuration; G = ground, S = signal)
• Stripline transmission line
• Connector
Let’s look at each piece in turn.
BGA Package
The package model and the specific Tx pair of interest were extracted from the Cadence™ APD database and simulated using Ansoft HFSS. Figure 5 is a plot of the differential insertion loss (red) and return loss (blue) as computed by Ansoft HFSS.
For this particular differential pair, return loss is better than 15 dB, up to 22 GHz. Ansoft HFSS can output the differential S-parameters as Touchstone files. Typically, companies are reluctant to give out their package databases except under an NDA, because they contain sensitive design information. However, you can use S-parameters derived from the model for channel simulations.
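To put the 15 dB return loss figure in perspective, the sketch below converts a return loss in dB into a reflection coefficient magnitude and a reflected power fraction using the standard definitions. The function names are illustrative; the conversion itself is general and is not taken from the design kit.

```python
def reflection_magnitude(return_loss_db):
    """Reflection coefficient magnitude for a return loss given as a positive dB value."""
    return 10 ** (-return_loss_db / 20.0)


def reflected_power_fraction(return_loss_db):
    """Fraction of incident power that is reflected."""
    return 10 ** (-return_loss_db / 10.0)


gamma_15db = reflection_magnitude(15.0)       # about 0.18
power_15db = reflected_power_fraction(15.0)   # about 0.03, roughly 3% of the power
```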
Figure 3 – Physical structure of a Tx and Rx differential pair on the MK322 board (layout labels: 1, 2, Microstrip, Stripline)
[Figure 4 diagram: the channel drawn as a chain of BGA package, microstrip transmission line, differential via, stripline transmission line, and SMA connector, each block annotated with its model type (S-parameters, circuit model, or HFSS model).]
Figure 4 – The individual pieces comprising the full channel
Figure 5 – Package model insertion loss (red) and return loss (blue) as computed by Ansoft HFSS
Microstrip and Stripline Interconnect
We performed simulations for the stripline and microstrip structures using the two-dimensional quasistatic finite element simulator within Ansoft SI 2D Extractor. The stripline geometries were designed to provide nominally 100 Ohms differential impedance. Simulations confirmed that the impedance was within 7% of the nominal value (see Figure 6).
You can model PCB interconnects using various methods within Ansoft Designer™. The simplest is to use a coupled-line circuit model (like those found in popular high-frequency circuit simulators such as Ansoft Designer). In this instance, the interconnect is modeled with a uniform differential coupled transmission line without any discontinuities. On the other end of the modeling spectrum is the utilization of a full-wave planar EM field simulator based on the method of moments (MoM). Although accurate, MoM simulations are also the most computationally expensive method to predict interconnect performance.
A compromise that offers the accuracy of planar EM simulations with some of the speed of circuit simulation is offered by using a combination of the two. Figure 7 provides a comparison of the simulation results using the three different methods. As you can see in the figure, all methods predict similar performance. For an extended discussion of the trade-offs of the different approaches, please refer to the white paper accompanying the kit, available on the Xilinx SI Central website.
In addition, we parameterized each of the interconnect models. For example, in the microstrip interconnect model, the width, spacing, metal thickness, and physical length are parameters that can vary. For the initial simulations, these values were set to geometries specific to the MK322 board.
Differential Via
In keeping with good design practices that minimize unterminated stubs, layer 10 was used to transition from the microstrip to stripline using the through-hole differential via. The actual geometries for the ground-signal-signal-ground configuration were taken from Appendix D of the XFP specification (see pages 160-163 of the specification).
Several key variables for the via are parameterized, including spacing between signal vias, via radius, and antipad radius. Simulation results for the differential via structure are shown in Figure 8. The via structure shows excellent broadband insertion and return loss (> -10 dB) well beyond 20 GHz.
SMA Connector
The SMA connector used on the MK322 board is manufactured by Rosenberger (Part # 32K153-400). Rosenberger was gracious enough to provide us with the HFSS model for the connector, along with the optimized PCB footprint. The critical parameters for optimization involve the pad and antipad radii, as well as placement and spacing of several ground return vias around the center conductor. The ground vias around the center conductor allow the signal to transition from a radial coaxial field to a transverse electromagnetic mode (TEM) transmission line field in such a way that it minimizes any impedance mismatches. Figure 9 shows the insertion and return loss (> -10 dB up to 12 GHz) for the optimized SMA launch.
Full Channel Simulation
It is possible to cascade results generated from EM and circuit simulations on each of the individual components to get a full system simulation. Figure 10 is a snapshot of the schematic of the full channel, from the SMA connector, through the board to the Xilinx Virtex-II Pro X BGA package, set up for frequency domain analysis.
Figure 11 is a plot of the system simulation results displaying the insertion and return loss up to 40 GHz. As expected, the channel has a response similar to a low-pass filter. The majority of the energy for a baseband digital binary signal is contained within the first null of its power spectrum. For the rise time and signaling rate of this channel (30 ps, 10 Gbps), we are most concerned with the response up to 17 GHz. As seen in the plot, the insertion loss is roughly -10 dB and the return loss is below -10 dB up to 17 GHz.
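The 17 GHz figure is consistent with a common rule of thumb that places most of the energy of a digital edge below roughly 0.5 divided by the rise time. The article does not say how the authors arrived at 17 GHz, so treat the sketch below as one plausible reading; the function names are illustrative only.

```c
#include <assert.h>

/* Rule-of-thumb bandwidth of a digital edge: f_knee = 0.5 / rise_time. */
double knee_frequency_hz(double rise_time_s)
{
    assert(rise_time_s > 0.0);
    return 0.5 / rise_time_s;
}

/* The first spectral null of an ideal NRZ bitstream sits at the bit rate. */
double nrz_first_null_hz(double bit_rate_bps)
{
    assert(bit_rate_bps > 0.0);
    return bit_rate_bps;
}
```

For a 30 ps rise time the first function returns about 16.7 GHz, which rounds to the 17 GHz quoted above; the first null of a 10 Gbps NRZ stream falls at 10 GHz.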
You can also perform time domain simulations (see Figure 12) using the system simulator in Ansoft Designer. This simulator uses a convolution algorithm to process the frequency domain channel data with user-defined input bitstreams. Insertion and return loss are included in the simulation.
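The idea behind a convolution-based time-domain simulation can be sketched with a direct discrete convolution. This is not Ansoft Designer's algorithm, whose details the article does not give; it assumes the frequency-domain channel data has already been converted to a sampled impulse response by some other step, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Discrete convolution of an input waveform with a channel impulse response.
 * out must have room for n_in + n_h - 1 samples.
 */
void convolve(const double *in, size_t n_in,
              const double *h, size_t n_h,
              double *out)
{
    assert(n_in > 0 && n_h > 0);
    size_t n_out = n_in + n_h - 1;
    for (size_t n = 0; n < n_out; n++) {
        double acc = 0.0;
        for (size_t k = 0; k < n_h; k++) {
            /* Only accumulate where the shifted input sample exists. */
            if (n >= k && n - k < n_in) {
                acc += h[k] * in[n - k];
            }
        }
        out[n] = acc;
    }
}
```

Feeding the function a sampled bitstream and a lossy impulse response produces the smeared output waveform from which an eye diagram can be folded.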
An ideal 10 Gbps pseudo-random bit source with a 0.5V p-p amplitude and 30 ps rise time was applied to the channel.
Substrate: εr = 4.2, δ = 0.02 (dimension labels in the drawing: S, B, W)
Layer   B        W       S       Zse (Ohms)   Zd (Ohms)   Zcom (Ohms)
S10     20.650   6.750   7.500   54.73        93.64       31.32
All dimensions are in mils.
Figure 6 – Impedance for the stripline traces as extracted using Ansoft SI 2D Extractor
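As a quick check of the "within 7%" statement, the deviation of the extracted differential impedance listed with Figure 6 from the 100 Ohm target can be computed directly. The helper below is illustrative and is not part of any Ansoft tool.

```c
#include <assert.h>

/* Percent deviation of a measured value from its nominal target. */
double percent_deviation(double measured, double nominal)
{
    assert(nominal != 0.0);
    double diff = measured - nominal;
    if (diff < 0.0) {
        diff = -diff;
    }
    return 100.0 * diff / nominal;
}
```

With the extracted Zd of 93.64 Ohms and the 100 Ohm target, the deviation comes to about 6.4%, inside the 7% bound.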
Figure 7 – A comparison of the three methods to simulate interconnects
Figure 8 – Differential S-parameters for the via as computed by Ansoft HFSS
Figure 9 – Differential S-parameters for the SMA connector
Figure 10 – Schematic of the full channel setup for frequency domain analysis within Ansoft Designer
Figure 11 – Insertion and return loss for the full channel
The channel was terminated in single-ended 50 Ohm impedances. The resulting eye diagram is shown in Figure 13, along with a measured eye diagram. There is excellent correlation between the measurement and simulation results. A very clear and open eye is achieved, as is expected from the frequency domain results.
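The article does not say which pseudo-random pattern drove the simulation. As one common choice, the sketch below generates a PRBS-7 sequence (polynomial x^7 + x^6 + 1, period 127) and maps each bit to an NRZ level for a given peak-to-peak swing; the pattern choice and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill bits[] with a PRBS-7 sequence (polynomial x^7 + x^6 + 1).
 * seed must have at least one of its low 7 bits set; each output is 0 or 1.
 */
void prbs7_generate(uint8_t seed, uint8_t *bits, size_t n_bits)
{
    uint8_t state = seed & 0x7F;
    assert(state != 0);
    for (size_t i = 0; i < n_bits; i++) {
        uint8_t new_bit = ((state >> 6) ^ (state >> 5)) & 1u;
        bits[i] = new_bit;
        state = (uint8_t)(((state << 1) | new_bit) & 0x7F);
    }
}

/* Map a bit to an NRZ voltage centered on zero for a peak-to-peak swing. */
double nrz_level(uint8_t bit, double vpp)
{
    return bit ? vpp / 2.0 : -vpp / 2.0;
}
```

A 0.5 V p-p source corresponds to levels of +0.25 V and -0.25 V in this model; the finite 30 ps rise time would be applied separately when the bits are turned into a sampled waveform.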
For comparison to the measured eye, the driver capacitance was added to the channels. These capacitors are not part of the package model, because the passive channel will eventually be used with actual driver/receiver models that already include the capacitance. No pre-emphasis was used in the simulation. It should be anticipated that some pre-emphasis would sharpen up the time-domain response.
Extension of the Methodology
In creating the models, we emphasized that the critical variables that make up the physical structure are parameterized. Why parameterize? Although there are many reasons for doing so, let’s show through some examples the power and utility of models that allow manipulation of critical variables.
A Longer Stripline Segment
In the original model, the nominal length for the stripline segment of the channel is 2.5 in. For whatever reason (board routing congestion is an obvious one), suppose that the stripline segment now needed to be 5 in. You can easily investigate the channel performance for this new scenario by changing the physical length variable (SL_L) in the model. Examples of such an analysis, for various trace lengths, are shown in Figure 14.
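Before re-running the full model with a new SL_L, a first-order estimate is possible if you assume a uniform, matched line whose loss in dB grows linearly with length. That assumption and the helper name below are not from the design kit; the sketch ignores reflections and any change in the other channel components.

```c
#include <assert.h>

/*
 * Scale the insertion loss (in dB) of a uniform, matched transmission line
 * from a reference length to a new length, assuming dB loss is proportional
 * to length.
 */
double scale_line_loss_db(double ref_loss_db, double ref_length_in,
                          double new_length_in)
{
    assert(ref_length_in > 0.0 && new_length_in >= 0.0);
    return ref_loss_db * (new_length_in / ref_length_in);
}
```

For example, a hypothetical -3 dB stripline loss at 2.5 in. would become -6 dB at 5 in. under this assumption; the parameterized model remains the way to get the actual eye diagram.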
Increasing the length of the stripline segments results in significant eye degradation. Because every component of the channel is parameterized, you can explore the performance impact of different variables in each section of the channel when investigating design trade-offs. In fact, with exactly this intent in mind, we have made these models available as a Xilinx/Ansoft 10 Gbps Backplane Design Kit at www.gigabitbackplanedesign.com. Complete details on each of the models and the parameterized variables are available at this site.
Conclusion
Modern platform FPGA devices provide wide bandwidth processing and high-speed I/O. Serial I/O with speeds in the gigabit realm creates new challenges for PCB designers.
Models associated with this effort have been assembled into a 10 Gbps backplane design kit that you can use to predict performance of circuit board designs.
The design kit is available on the Xilinx “SI Central” website, enabling you to rapidly evaluate your own board designs. Visit www.gigabitbackplanedesign.com for more information.
Figure 12 – Schematic showing setup for time-domain simulations
Figure 13 – Simulated (left) and measured (right) eye diagram for the full channel; the simulated eye is in excellent agreement with measurements
Figure 14 – Channel performance degrades due to losses in the transmission line as the trace length increases
Accelerated System Performance with APU-Enhanced Processing
The Auxiliary Processor Unit (APU) controller is a key embedded processing feature in the Virtex-4 FX family.
by Ahmad Ansari, Senior Staff Systems Architect, Xilinx, Inc.
The APU controller provides a flexible high-bandwidth interface between the reconfigurable logic in the FPGA fabric and the pipeline of the integrated IBM™ PowerPC™ 405 CPU. Fabric co-processor modules (FCM) implemented in the FPGA fabric are connected to the embedded PowerPC processor through the APU controller interface to enable user-defined configurable hardware accelerators. These hardware accelerator functions operate as extensions to the PowerPC 405, thereby offloading the CPU from demanding computational tasks.
APU Instructions
The APU controller allows you to extend the native PowerPC 405 instruction set with custom instructions that are executed by the soft FCM; the primary capabilities are shown in Figure 1. This provides a more efficient integration between an application-specific function and the processor pipeline than is possible using a memory-mapped coprocessor and shared bus implementation.
The instructions supported by the APU are classified into three main categories:
• User-defined instructions (UDI)
• PowerPC floating-point instructions
• APU load/store instructions
The UDIs are programmed into the controller either dynamically through the PowerPC 405 device control register (DCR) or statically when the FPGA is configured through its bitstream. The APU controller allows you to optimize your system architecture by decoding instructions either internally or in the FCM.
The floating-point unit (FPU) is an example of an FCM. The PowerPC floating-point instruction set is decoded in the APU controller, whereas the computational functionality is implemented in the FPGA fabric. To support FPUs with different complexities, the APU controller allows you to select subgroups of the PowerPC floating-point instructions. These instructions are executed in the FCM while other subgroups of instructions are either computed through software FPU emulation or ignored completely. This fine-tuning optimizes FPGA resources while accelerating the most critical calculations with dedicated logic.
The APU controller also decodes high-performance load and store instructions between the processor data cache or system memory and the FPGA fabric. A single instruction transfers up to 16 bytes of data – four times greater than a load or store instruction for one of the general purpose registers (GPR) in the processor itself. Thus, this capability creates a low-latency and high-bandwidth data path to and from the FCM.
APU Controller Operation
Figure 2 identifies the key modules of the APU controller and the 405 CPU in relation to the FCM soft coprocessor module implemented in FPGA logic. To explain the operation of the APU controller and the processor interactions related to the execution units in soft logic, we can trace the step-by-step sequence of events that occur when an instruction is fetched from cache or memory.
Once the instruction reaches the decode stage, it is simultaneously presented to both the CPU and APU decode blocks. If the instruction is detected as a CPU instruction, the CPU will continue to execute the instruction as it would normally. Otherwise, within the same cycle, the CPU will look for a response from the APU controller. If the APU controller recognizes the instruction, it will provide the necessary information back to the CPU.
If the APU controller does not respond within that same cycle, an invalid instruction exception will be generated by the CPU. If the instruction is a valid and recognized instruction, the necessary operands are fetched from the processor and passed to the FCM for processing.
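The decode-stage behavior described above can be summarized as a small decision function. This is a behavioral sketch of the prose only, not a model of the actual hardware signals; the type and function names are hypothetical.

```c
#include <stdbool.h>

/* Possible outcomes when an instruction reaches the decode stage. */
typedef enum {
    DECODE_CPU_EXECUTES,      /* native PowerPC 405 instruction */
    DECODE_FCM_EXECUTES,      /* claimed by the APU controller; operands go to the FCM */
    DECODE_INVALID_EXCEPTION  /* no decode block claimed it within the cycle */
} decode_outcome;

/*
 * cpu_recognizes: the CPU decode block identified a native instruction.
 * apu_responds:   the APU controller answered within the same cycle.
 */
decode_outcome decode_instruction(bool cpu_recognizes, bool apu_responds)
{
    if (cpu_recognizes) {
        return DECODE_CPU_EXECUTES;
    }
    if (apu_responds) {
        return DECODE_FCM_EXECUTES;
    }
    return DECODE_INVALID_EXCEPTION;
}
```

The ordering of the two checks mirrors the text: the APU controller's response only matters for instructions the CPU itself does not recognize.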
Because the PowerPC processor and the FCM reside in two separate clock domains, synchronization modules of the APU controller manage the clock frequency difference. This allows the FCM to operate at a slower frequency than the processor. In this instance, the APU controller would receive the resultant data from the coprocessor and at the proper execution time send the data back to the processor. The APU controller knows in advance, based on instruction type, if or when it will get the result.
Autonomous and Non-Autonomous Instructions
Two major categories of instructions exist: autonomous and non-autonomous. For autonomous instructions, the CPU continues issuing instructions and does not stall while the FCM is operating on an instruction. This overlap of execution allows you to achieve high performance through techniques such as software pipelining.
On the other hand, during synchronized (non-autonomous) execution, the CPU pipeline stalls while the FCM is operating on an instruction. This feature allows you to implement synchronization semantics to pace the software execution with the hardware FCM latency.
Non-autonomous instruction types are further divided into blocking and non-blocking. If blocking, asynchronous exceptions or interrupts are blocked until the FCM instruction completes. Otherwise, if non-blocking, the exception or interrupt is taken and the FCM is flushed.
[Figure: block diagram of the processor block (PowerPC, APU controller, APU I/F, FPGA I/F, PLB, OCM) connected to a soft auxiliary processor in the FPGA fabric. Callouts:]
• Extends PPC 405 Instruction Set – Floating Point Support (with soft auxiliary processor) – User-Defined Instructions
• Offloads CPU-Intensive Operations – Matrix Calculations – Video Processing – Floating-Point Mathematics – 3D Data Processing
• Direct Interface to HW Accelerators – High Bandwidth – Low Latency
Software Description
Software engineers can access the FCM from within assembler or C code. On one side, Xilinx has enabled the GCC compiler (which is contained in the Embedded Development Kit) to generate code that uses an FCM floating-point unit to calculate floating-point operations. Furthermore, assembler mnemonics are available for UDIs and the pre-defined load/store instructions, enabling you to place hardware-accelerated functions into the regular program flow. For the ultimate level of flexibility, you can define your own instructions designed specifically for the hardware functionality of the FCM.
You can easily use the pre-defined load/store instructions through high-level C macros. For example, in an application where the FCM is used to convert pixel data into the frequency domain, 8 pixels of 16 bits are transferred from main memory to an FCM register with a simple program:
    unsigned short pixel_row[8]; // 8 pixels, each pixel has a size of 16 bits
    lqfcm(0, pixel_row);         // transfer a row of pixels to FCM register zero
The quadword load operation maintains cache coherency as the data is moved through the cache, if caching is enabled for the corresponding address space.
The FCM operation on the pixel data can start on an explicit command; for example, a UDI. However, for many applications the operation starts immediately after the FCM hardware detects the completion of the load instruction.
The latter approach has many advantages:
• Simple software – A load operation moves the data from the memory to the FCM and starts the operation. A subsequent store instruction retrieves the result of the operation and stores it back to main memory.
• High data transfer rates – Quadword load and store operations take just a few cycles to complete. A single operation moves 16 bytes within that timeframe.
• Low latency – FCM load operations are simple to use. The processor completes the operation in a single cycle.
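To try out the load-then-store flow on a development host without the FPGA, you could stand in for the FCM registers with a small software model. Everything in the sketch below is hypothetical: the register count, the function names, and the use of memcpy are modeling choices, and none of it reproduces the real lqfcm macro or its store counterpart.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FCM_MODEL_REGS 4      /* hypothetical register count for this model */
#define FCM_QUADWORD_BYTES 16 /* one quadword transfer moves 16 bytes */

/* Host-side stand-in for FCM registers; not the real hardware. */
static uint8_t fcm_model_regs[FCM_MODEL_REGS][FCM_QUADWORD_BYTES];

/* Model of a quadword load: copy 16 bytes from memory into a model register. */
void model_quadword_load(unsigned reg, const void *src)
{
    assert(reg < FCM_MODEL_REGS);
    memcpy(fcm_model_regs[reg], src, FCM_QUADWORD_BYTES);
}

/* Model of a quadword store: copy 16 bytes from a model register to memory. */
void model_quadword_store(void *dst, unsigned reg)
{
    assert(reg < FCM_MODEL_REGS);
    memcpy(dst, fcm_model_regs[reg], FCM_QUADWORD_BYTES);
}
```

In this model a row of eight 16-bit pixels fills exactly one register, which matches the 16-byte transfer size described above; a real FCM would transform the data between the load and the store.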
The principle of the RISC architecture uses a number of simple instructions on data stored in general-purpose registers (GPR) to compute complex operations. User-defined instructions fall into this category but take the concept a step further in that the system architect defines the complexity of the operation on data stored in GPRs and FCM registers (FCR). Again, from a software point of view, the engineer codes user-defined instructions through C macros. GCC recognizes mnemonics such as udi0fcm as a user-defined operation of the general form:
The target of the operation is either a GPR or an FCR. The operands are either GPRs, FCRs, immediate values, or a combination. As you can see, the semantics are not defined by the instruction and depend on your intentions and the implementation in the FCM.
This code sequence demonstrates the use of a user-defined instruction as an example of a complex add operation:
    struct complex {
        int r, i;         // 32-bit integers for the real and imaginary parts
    };
    struct complex a, b, r;
    ldfcm(0, &a);         // load complex number a into FCM register 0
    ldfcm(1, &b);         // load complex number b into FCM register 1
    udi0fcm(2, 1, 0);     // udi0fcm computes r = a + b, where r is stored in FCM register 2
    stdfcm(&r, 2);        // store complex result from FCM register 2 to variable r
To increase the readability of the code, you can redefine the user-defined instruction with regular C preprocessor constructs. Instead of using the udi0fcm() macro, you can redefine it to a more comprehensible complex_add() macro with #define complex_add(r, a, b) udi0fcm(r, a, b) and change the listing to call complex_add(2, 1, 0) instead of udi0fcm(2, 1, 0).
Therefore, system architects can partition their tasks into hardware- and software-executed pieces that are efficiently and precisely interfaced to one another through the use of the APU controller. This partitioning can be done statically during the initial system configuration or dynamically during the program execution. Using the direct processor/FPGA coupling presented by the APU controller and its high throughput interfaces, hardware/software synchronization is greatly simplified and performance significantly improved.
Accelerating System Performance
The following examples showcase key advantages the APU provides based on two different scenarios. The first scenario is essentially a benchmarking comparison of a finite impulse response (FIR) filter using a soft FPU core, implemented as an FCM attached directly to the APU controller (as compared to software emulation used to calculate the filter function). The second scenario implements a two-dimensional inverse discrete cosine transform (2D-IDCT) typically used as one of the processing blocks in MPEG-2 video decompression, again compared to emulating the 2D-IDCT function in software.
The two use cases are different in that the FPU implements a set of registers in the FPGA fabric upon which the FPU instructions operate. The 2D-IDCT only requires load and store operations, while the functionality of the operation on the data stream is fixed. In either case the operations are complex enough to justify offloading into the FPGA fabric.
Thus, the combination of using the APU and FPGA hardware acceleration clearly provides a significant performance advantage over software emulation – or the conventional method involving the processor and processor local bus architecture with a soft co-processing function.
FIR Filter
The implementation of floating-point calculations in hardware yields an improvement by a factor of 20 over software emulation. Connecting the FPU as an FCM to the APU controller improves performance because the latency to access the floating-point registers is reduced and dedicated load and store instructions move the operands and results between the FPU registers and the system memory.
[Figure 3 diagram: MPEG decode flow (RGB, YUV, blocks). Spatial redundancy: an 8 x 8 block of pixel DCT values is decoded using the IDCT into pixel amplitude values (each row: 223 191 159 128 98 72 39 16). The processor block (PowerPC, APU controller, APU I/F, FPGA interface, OCM) drives a processor in soft logic built from XtremeDSP slices in the FPGA fabric. APU function: decompresses encoded pixel data for output display; utilizes FPGA resources with less overhead logic and fast data transfer.]
Figure 3 – Utilizing APU to decode pixel data for display output
2D-IDCT
The 2D-IDCT transforms a block of 8 x 8 data points from the frequency domain into pixel information. A high-level diagram depicting the pixel decode by the APU controller, along with advantages, is shown in Figure 3. In this example, each data point has a resolution of 12 bits and is represented as a 16-bit integer value. The data structure is defined where each row of 8 pixels consumes 16 bytes. This is an ideal size that allows optimal use of the FCM load and store instructions described earlier. In other words, eight FCM quadword load instructions are needed to load a data block into the 2D-IDCT hardware. Eight FCM quadword store instructions are sufficient to copy the pixel data back into the system memory.
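The sizing argument above can be written down as a data layout with compile-time checks. The type and macro names are illustrative and do not come from the Xilinx tools; the sketch only confirms that one row equals one quadword and that a block therefore needs eight transfers in each direction.

```c
#include <stdint.h>

#define IDCT_BLOCK_DIM 8   /* 8 x 8 data points per block */
#define QUADWORD_BYTES 16  /* bytes moved by one FCM quadword load or store */

/* One row of eight 16-bit data points; 12 significant bits per point. */
typedef struct {
    int16_t point[IDCT_BLOCK_DIM];
} idct_row;

/* A full block: eight rows. */
typedef struct {
    idct_row row[IDCT_BLOCK_DIM];
} idct_block;

/* Compile-time checks that the layout matches the quadword transfer size. */
_Static_assert(sizeof(idct_row) == QUADWORD_BYTES, "row must be one quadword");
_Static_assert(sizeof(idct_block) == 8 * QUADWORD_BYTES, "block must be eight quadwords");

/* Number of quadword transfers needed to move n_bytes (rounded up). */
unsigned quadword_transfers(unsigned n_bytes)
{
    return (n_bytes + QUADWORD_BYTES - 1) / QUADWORD_BYTES;
}
```

Keeping the row type at exactly 16 bytes means no transfer carries padding or partial rows.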
The calculation of the 2D-IDCT in the FCM starts immediately after the first load, and the pixel data is available shortly after the last load operation. As shown in Figure 4, the 2D-IDCT makes use of the new XtremeDSP™ slices in the Virtex-4 architecture that offer multiply-and-accumulate functionality.
A software-only implementation of a 2D-IDCT takes 11 multiplies and 29 additions together with a number of 32-bit load and store operations, while the hardware-accelerated version takes 8 load and 8 store operations. The reduced number of operations results in a speed-up of 20X in favor of a 2D-IDCT FCM attached through the APU controller.
By comparison, if you connect the 2D-IDCT hardware block to the processor local bus, as is done conventionally, the system performance will be reduced. The increased latency is mainly caused by the bus arbitration overhead and the large number of 32-bit load and store instructions. This is illustrated schematically in Figure 5.
Conclusion
The low-latency and high-bandwidth fabric coprocessor module interface of the APU controller enables you to accelerate algorithms through the use of dedicated hardware. Where operations are complex enough to justify the offloading into the FPGA fabric, or when acceleration of a specific algorithm is desired to achieve optimal performance, the combination of the APU controller and FPGA hardware acceleration provides a definitive performance advantage over software emulation or the conventional method of attaching coprocessors to the processor memory bus.
Generating the accelerated functions called by user-defined instructions is easily performed through GUI-based wizards. This functionality will be included in subsequent releases of the powerful Embedded Development Kit or Platform Studio.
If you are more comfortable working at the source code or assembly level, the APU controller allows you to define your own instructions written specifically for the hardware functionality of the FCM, or you can easily use the pre-defined load/store instructions through high-level C macros.
The APU controller provides a close coupling between the PowerPC processor and the FPGA fabric. This opens up an entire range of applications that can immediately benefit customers by achieving increases in system performance that were previously unattainable.
For additional details on the APU controller in Virtex-4-FX devices, including detailed descriptions and timing waveforms, refer to the Virtex-4 PowerPC 405 Processor Block Reference Guide at www.xilinx.com/bvdocs/userguides/ug018.pdf.
[Diagram labels for Figures 4 and 5: the processor block (PowerPC, APU controller, APU I/F, FPGA interface, PLB, OCM) connected to an auxiliary processor in soft logic built from XtremeDSP slices in the FPGA fabric. Callouts: leverages integrated features (PowerPC, APU, XtremeDSP blocks); example video application (MPEG decompression algorithm); hardware acceleration over software (lower latency and high bandwidth); APU with XtremeDSP slices, single instruction execution, leverages APU and soft logic; inverse two-dimensional IDCT algorithm.]
Figure 4 – Accelerated system performance with APU
Figure 5 – Comparison of implementation models for 2D-IDCT
Solving the Signal Integrity Challenge
Virtex-4 RocketIO transceivers bring blazing speed, and the ability to use it.
by Ryan Carlson, Director of Marketing, High Speed Serial I/O, Xilinx, Inc.
The industry is moving away from parallel buses and relatively slow differential signals toward higher speed differential signaling schemes. These high-speed signals solve many design challenges: they offer new levels of bandwidth, they lower overall system cost, and they make designs easier by addressing the skew issues of large parallel buses.
However, with these improvements comes a new challenge: maintaining signal integrity. As signals push the limits of the media across which they are transmitted, the challenge of dealing with signal impairments becomes non-trivial, to say the least. The new Xilinx® Virtex-4™ RocketIO™ transceivers have incorporated multiple new features designed to solve this challenge.
Frequency-Dependent Loss
Several factors contribute to the frequency-dependent loss of a typical channel. Figure 1 shows the frequency response of 1 m of FR-4 trace. Dielectric loss and skin effect combine to create a significant loss above 1 GHz. With today’s serial I/O standards approaching 10 Gbps, this loss becomes a critical design issue.
As a signal travels across a channel (like the one with a transfer function shown in Figure 1), a bit is degraded to the point where it interferes with neighboring bits; this is known as inter-symbol interference (ISI). Figure 2 shows the effect of ISI on a signal transmitted across a typical backplane channel. The high-frequency components are subject to losses that are greater than the low-frequency components. The edges that contain the high-frequency components are degraded, resulting in added jitter and eye closure. Additional techniques are needed to compensate for these losses.
Signal Integrity Features
The Virtex-4 RocketIO transceivers contain several features aimed at solving this problem. The first is transmit pre-emphasis. By modifying the signal before it is transmitted through a channel, transmit pre-emphasis can proactively compensate for some of the frequency-dependent loss of the channel.
Although most existing solutions use two-tap transmit pre-emphasis (addressing only the post-cursor ISI shown in Figure 2), the Virtex-4 RocketIO transceivers employ three-tap transmit pre-emphasis to address both pre- and post-cursor ISI. For signal rates above 3 Gbps, pre-cursor ISI becomes a non-negligible effect, and three taps of transmit pre-emphasis are needed to solve the problem.
In addition to transmit pre-emphasis, Virtex-4 RocketIO transceivers provide two different types of receive equalization. These options can be used in conjunction with transmit pre-emphasis to further improve signals degraded by lossy channels.
The first type of receive equalization works by amplifying the high-frequency components of the signal that have been attenuated by the channel (Figure 1). The transfer functions of this equalizer are programmable, and are shown in Figure 3.
The second type of receive equalization is called decision feedback equalization (DFE). This technique removes ISI effects by looking at consecutive bits and choosing the amount of equalization needed.
Both forms of receive equalization described above seek to amplify the high-frequency components of the desired signal. An advantage of DFE is that it does not amplify any crosstalk that may be associated with the signal. This technique can therefore be useful for increasing the speed of legacy backplanes, where extensive crosstalk may exist.
All of these signal integrity features are fully programmable; they can be used independently or together, and each has multiple settings to equalize any channel. To fully take advantage of these hardware-based features, Xilinx also provides software-based reference designs that use bit error rate tests (BERT) to find the optimal settings for each unique application.
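The behavior of three-tap transmit pre-emphasis and of decision feedback equalization can be illustrated with a short behavioral model. The RocketIO transceivers implement these functions in hardware, so the sketch below is only a conceptual stand-in: the tap weights, the single-tap DFE, and all names are assumptions made for illustration.

```c
#include <stddef.h>

/* Weights of a three-tap transmit FIR; the values are chosen by the user. */
typedef struct {
    double pre;     /* weight on the next symbol (pre-cursor tap) */
    double cursor;  /* weight on the current symbol */
    double post;    /* weight on the previous symbol (post-cursor tap) */
} tx_taps;

/*
 * Three-tap transmit pre-emphasis on NRZ symbols (+1.0 / -1.0).
 * Symbols beyond either end of the array are treated as 0.
 */
void tx_preemphasis(const double *sym, size_t n, tx_taps taps, double *out)
{
    for (size_t i = 0; i < n; i++) {
        double prev = (i > 0) ? sym[i - 1] : 0.0;
        double next = (i + 1 < n) ? sym[i + 1] : 0.0;
        out[i] = taps.pre * next + taps.cursor * sym[i] + taps.post * prev;
    }
}

/*
 * One-tap decision feedback equalizer. Each sample has the estimated
 * post-cursor ISI of the previous decision subtracted before slicing.
 * decisions[] receives +1.0 or -1.0 per sample.
 */
void dfe_one_tap(const double *rx, size_t n, double post_isi, double *decisions)
{
    double prev_decision = 0.0; /* no history before the first sample */
    for (size_t i = 0; i < n; i++) {
        double corrected = rx[i] - post_isi * prev_decision;
        decisions[i] = (corrected >= 0.0) ? 1.0 : -1.0;
        prev_decision = decisions[i];
    }
}
```

With negative pre and post weights, the transmit filter drives symbols next to a transition harder than repeated symbols, which is the pre-emphasis effect. The DFE works from past decisions rather than amplifying the incoming signal, which is why it does not boost crosstalk.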
Integrated Receive Side AC-Coupling Capacitors
Many applications require AC-coupling capacitors to ensure compatibility between different Tx and Rx blocks. These capacitors require their own vias; at high speeds vias present yet another discontinuity to impair signal quality.
The Virtex-4 RocketIO transceivers integrate the AC-coupling capacitors on chip. This not only reduces external component count and design effort, but more importantly improves signal integrity by removing the need for extra vias in the board. These integrated AC-coupling capacitors can be optionally bypassed.
Conclusion
Signal integrity is an engineering challenge that accompanies the move to high-speed serial signaling. Once the system design has been optimized to minimize the physical effects of connectors, board materials, traces, vias, coupling capacitors, and cables, the remaining losses and channel effects need to be addressed by advanced silicon features.
Virtex-4 RocketIO transceivers are the industry’s fastest integrated transceivers. Along with these leading-edge speeds, the RocketIO transceivers deliver multiple features designed to simultaneously address the signal integrity challenge that comes with them.
Xilinx has detailed information about high-speed design challenges, and the solutions available to solve them, at www.xilinx.com/signalintegrity. Instructional DVDs that describe various aspects of the signal integrity challenge can be purchased from the Xilinx online store by visiting www.xilinx.com/store/.
Figure 1 – Frequency-dependent loss
[Plot: amplitude (0 to 1.0) versus frequency (1 MHz to 10 GHz), with -3 dB, -6 dB, -12 dB, and -20 dB levels marked; curves for dielectric loss, conductor loss (skin effect), and total loss.]
Figure 2 – A transmitted bit (left) and the result of inter-symbol interference (right)
[Plot: amplitude versus time for the transmitted pulse and, after an example backplane, the received pulse, attenuated and dispersed. Around the cursor, the pre-cursor causes ISI (secondary effect) and the post-cursor causes ISI (primary effect).]
Figure 3 – Virtex-4 RocketIO receive equalization transfer functions
[Plot: amplification (0 to 16 dB) versus frequency (10 MHz to 10 GHz).]
Using FPGAs in Wireless Base Station Designs
Wireless base station design trends benefit from Virtex-4 device features.
by David Gamba, Senior Manager, Strategic Solutions Marketing, Xilinx, Inc.
Wireless infrastructure revenue continues to experience phenomenal growth, increasing from approximately $27 billion in 2003 to an estimated $35 billion in 2004. Industry analysts are predicting that 2004 will be the peak revenue year, as forecasts show the revenue figure dropping back to $27 billion in 2005, eventually settling into the $10-$15 billion range by the end of the decade. This revenue decline is driven both by lower prices and by a drop in base station deployments, from nearly 500,000 stations in 2004 to less than 200,000 in 2010.
As the industry transitions from a high-growth phase to a more mature state, cost pressures will increasingly mount in all facets of the infrastructure, including the wireless base station. Next-generation base station deployments must conquer the challenge of continually reducing cost (as measured by cost per channel) while adding functionality to support new services, protocols, and changing subscriber usage patterns.
To begin addressing this challenge, wireless base station designs are shifting from ASIC technology to more readily available off-the-shelf components such as FPGAs. This shift is driven both by declining annual base station unit volumes and by FPGA technology improvements that increase processing power and enable a much lower cost per channel.
The migration to FPGAs is not just an attempt to reduce costs and create a common platform to achieve commoditization – it is also being driven by time-to-market pressures, along with the need to make in-field upgrades of base station deployments. This shift away from ASICs has enabled significant new design opportunities for Xilinx® Virtex-4™ devices to fill the void.
Wireless Base Station Module Building Blocks
Inside a wireless base station are fairly distinct module blocks performing different functions, such as radio, baseband processing, transport network interfacing, and control (Figure 1). Traditional base station designs used ASICs – along with DSPs and other discrete components – to implement these various architectural features and functions.
This design approach is rapidly giving way to more cost-effective and flexible designs that use FPGAs. With lower costs and increased flexibility, product delivery is accelerated and inventory control is much more manageable, avoiding some of the multi-million dollar inventory obsolescence issues that base station manufacturers have faced with ASIC solutions fabricated to support the 3G launch.

Figure 1 – Wireless base station module block diagram (sections: Amplifiers, TX/RX, Baseband Processing, Network Interface, Main Processor)

Standardizing the Wireless Base Station
Another significant step taken by the wireless industry is the launch of industry organizations focused on standardizing the non-differentiated features inside a base station. The most notable development for Xilinx is the migration to a standardized high-speed serial interconnect solution between the different base station module blocks, such as the Open Base Station Architecture Initiative (OBSAI) Reference Point 3 (RP3) and Common Public Radio Interface (CPRI) interconnects for baseband and radio module connectivity.

Many leading base station manufacturers are members of these organizations and are rapidly preparing to adopt one of these two standard interconnect solutions in their upcoming design implementations. Xilinx is fully prepared to support these standards, and has both OBSAI and CPRI IP solutions and reference designs available for implementation in Virtex-II Pro™, Virtex-II Pro X, and Virtex-4 FX FPGA devices, using the integrated RocketIO™ multi-gigabit transceivers (MGTs) in association with the logic building blocks.

Extending Current Design Lifecycles
Standardization is the first step towards the commoditization of base station design and will eventually lead to a phasing out of ASICs from wireless base stations. In the interim, companies are inserting discrete devices next to their current ASICs to support new functionality that cannot be added in a timely or cost-effective manner to the current design.

For instance, the Third Generation Partnership Project (3GPP), which is a collaboration agreement between several telecommunications bodies, is actively creating additional standards for the wireless industry. 3GPP has added a high-speed downlink packet access (HSDPA) feature as a new Universal Mobile Telecommunications System (UMTS) requirement in its latest baseband processing specification, Release 5, for Wideband Code Division Multiple Access (W-CDMA).

ASICs in current base stations do not support this new variant for UMTS. This creates a hole in the service offerings for UMTS, which forecasters are predicting will represent approximately 80% of the wireless traffic in the next few years. This deficiency must be addressed before future field deployments. It can be addressed, without exceeding the system power budget, by placing a Virtex-4 LX device next to the ASIC and implementing HSDPA with the available Xilinx HSDPA IP offering.

Next-Generation Base Station Designs
But adding external devices to patch design holes created by the limitations of existing ASIC designs is purely a stopgap solution. Future base station designs must be able to quickly adapt to changes in subscriber traffic patterns, as well as support the upcoming convergence of new services and emerging cellular technologies such as W-CDMA, TD-SCDMA, EDGE, 1xEV-DO, and WiMAX.

As shown in Figure 2, the number of cellular technologies is expected to continue to proliferate, leading base stations down the path of having to support many more technologies. Current issues such as multi-user detection and antenna selection will be augmented by new technical challenges, such as channel provisioning and base station tuning, that will need to be resolved appropriately to reduce a service provider’s customer turnover. The fundamental expectation to receive the same high-quality wireless service wherever a customer roams must be completely addressed.
Meeting these customer expectations requires substantial flexibility in the base station. Fortunately, many of the baseband processing and radio module functions are well suited for implementation in Virtex-4 devices, taking advantage of the integrated XtremeDSP™ slices in the product architecture.
For instance, quite a few baseband processing tasks, such as call initiation and set-up and multi-path signal detection and monitoring, are heavily based on mathematical algorithms. You can very efficiently implement these algorithms by using the integrated multiplier capabilities available in Virtex-4 devices, along with readily available intellectual property components such as the Random Access Channel (RACH), Searcher, and 3G Turbo Convolutional Codecs (3GTCC) that Xilinx has implemented as reference designs to demonstrate these capabilities.
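To make the multiplier-heavy nature of these tasks concrete, here is a minimal Python sketch of the correlation at the heart of a multi-path searcher. It is an illustration only and not the Xilinx Searcher IP: the function names, the 7-chip code, and the received samples are all hypothetical. The sketch slides a known code across the received samples and reports the offset with the largest correlation magnitude; every candidate offset costs one multiply-accumulate per chip, which is the kind of arithmetic that maps onto hardware multipliers.

```python
def correlate_at(received, code, offset):
    """Multiply-accumulate the code against the received samples at one offset."""
    return sum(received[offset + i] * c for i, c in enumerate(code))


def estimate_delay(received, code):
    """Return the offset whose correlation magnitude is largest."""
    max_offset = len(received) - len(code)
    return max(range(max_offset + 1),
               key=lambda k: abs(correlate_at(received, code, k)))


# Hypothetical 7-chip +/-1 code and a received stream delayed by 3 samples.
CODE = [1, 1, 1, -1, -1, 1, -1]
RECEIVED = [0, 0, 0] + CODE + [0, 0]

DELAY = estimate_delay(RECEIVED, CODE)  # offset of the correlation peak
```

A real searcher would work on complex, noisy samples and track several paths per subscriber; the sketch keeps only the multiply-accumulate structure.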
The integrated DSP capability in the Virtex-4 SX device enables a very low power implementation of these functions. Radio functions can be expanded by using a Virtex-4 SX device to enable more channel support.
Several enabling pieces of intellectual property targeted at radio functions, such as digital pre-distortion (DPD), crest factor reduction (CFR), and digital up/down conversion (DUC/DDC), are supported by the Virtex-4 SX device. Not only does this help increase the number of channels supported in a base station, but it also helps reduce the cost per channel. Table 1 gives an overview of the different capabilities offered by Xilinx baseband and radio module IP offerings.
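As a rough illustration of what digital down conversion involves, the following Python sketch mixes a real sampled signal to baseband with a complex exponential, averages each block of samples as a crude low-pass filter, and keeps one output per block. It is a sketch under stated assumptions, not the Xilinx DDC IP: the function name `ddc`, the 80 MHz sample rate, the 20 MHz carrier, and the boxcar filter are hypothetical choices made to keep the example short.

```python
import cmath
import math


def ddc(samples, fs, f_carrier, decimation):
    """Mix real samples to baseband, then average and keep one value per block."""
    mixed = [s * cmath.exp(-2j * math.pi * f_carrier * n / fs)
             for n, s in enumerate(samples)]
    out = []
    for start in range(0, len(mixed) - decimation + 1, decimation):
        block = mixed[start:start + decimation]
        out.append(sum(block) / decimation)  # boxcar low-pass plus decimation
    return out


FS = 80e6         # hypothetical sample rate in Hz
F_CARRIER = 20e6  # hypothetical carrier frequency in Hz

# A pure carrier tone should land at DC with half its original amplitude.
tone = [math.cos(2 * math.pi * F_CARRIER * n / FS) for n in range(64)]
baseband = ddc(tone, FS, F_CARRIER, decimation=8)
```

Each input sample costs one complex multiply in the mixer, and a production filter adds many more multiplies per output, which is why the multiplier count of the target device bounds the number of channels.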
System Generator for DSP Development Tool
Xilinx complements its Virtex-4 product offerings with the System Generator for DSP tool. This is a complete integrated DSP design environment that simplifies the development, debug, and verification of high-performance DSP designs targeting wireless base stations. This tool also helps designers interface with complementary general-purpose and DSP processors used in wireless base station designs.
System Generator for DSP provides high-level abstractions that are automatically compiled into Virtex-4 devices at the push of a button, with no loss in performance over designs implemented in lower-level languages such as VHDL. System Generator is part of the XtremeDSP solution, which combines state-of-the-art FPGAs, design tools, intellectual property cores, and design and education services.
Conclusion
To learn more about the key markets and end applications of Xilinx wireless solutions, visit www.xilinx.com/esp/, or e-mail [email protected]. For more details about Virtex-4 FPGAs, visit www.xilinx.com/virtex4/. And for more details on System Generator for DSP or other pieces of the Xilinx DSP solution, visit www.xilinx.com/dsp/.
Figure 2 – Mobile technology roadmap (generations 2G, 2.5G, 3G, 3.5G, and 4G; stages: Current, Being Deployed, Development, Future; technologies shown: GSM, TDMA, IS95a/b, 1xRTT, 1xEV-DO, 1xEV-DV, 3xRTT, W-CDMA, TD-SCDMA, GPRS, EDGE, HSDPA, wireless LANs, IEEE 802.11, IEEE 802.16, and 4G)

Xilinx Baseband Intellectual Property Offerings
IP Offering    Application
HSDPA          Increases downlink data transmission rate to a peak of 14.4 Mbps
RACH           Receiver path preamble detection (specified by W-CDMA)
Searcher       Multi-path delay estimate for each subscriber
3G TCC         Forward error correction

Xilinx Radio Intellectual Property Offerings
IP Offering    Application
DPD            Signal conditioning to enable use of lower cost RF power amplifiers
CFR            Signal amplitude conditioning to enable increased RF power amplifier efficiency
DUC            Baseband signal modulation for digital-to-analog converter input
DDC            Receiver signal modulation for analog-to-digital converter input

Table 1 – Xilinx baseband and radio IP offerings
by Delfin Rodillas
Strategic Solutions Manager
Xilinx, Inc.
[email protected]
With the continued proliferation of cable and satellite television and the rapid growth of the Internet, video transmission bandwidth has experienced phenomenal growth. With video streaming now being introduced into mobile handsets, this growth rate is not showing any signs of slowing down.
The technology advances of Xilinx® FPGAs have kept pace with the increasing transmission requirements and have solved many of the critical design issues in these systems. The Virtex-4™ product family incorporates additional enhancements (high-speed DSP, ultra low power, flexible integrated memory, and high-speed serial I/O) that enable these devices to meet the high bandwidth requirements of video applications.
With these features, you can use Virtex-4 devices in a variety of products, such as cable modem termination systems, digital video broadcast systems, flat-panel displays, master control switches, MPEG encoders, non-linear video editors, broadcast routers, image statistical multiplexers, and video servers.
Implementing a Cable Modem Termination System with Virtex-4 FPGAs
Integrated features make the Virtex-4 device an ideal choice.
Cable Modem Termination System
One common application where you can use Virtex-4 devices is a cable modem termination system (CMTS), shown in Figure 1. Used in cable headends, the CMTS is a switching system that works in conjunction with Internet service providers to route data between cable modems and the Internet.
In a CMTS, the transmitted data is multiplexed onto a cable channel along with broadcast video transmissions. Bandwidth is shared by all active subscribers (typically 500 to 2,000) in the cable network segment. Downstream transmission rates run at 40 Mbps using quadrature amplitude modulation (QAM), while upstream rates can be as high as 10 Mbps using QAM or quadrature phase shift keying (QPSK). The speed of the upstream link depends on the service level agreement (SLA) that the subscriber has signed with their cable company.
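A quick calculation shows why sharing matters. The sketch below divides the 40 Mbps downstream channel evenly among the quoted range of active subscribers. It assumes, purely for illustration, that every subscriber is active at once and that protocol overhead is ignored; the function name is hypothetical.

```python
DOWNSTREAM_BPS = 40e6      # downstream channel rate quoted in the text
SUBSCRIBERS = (500, 2000)  # typical range of active subscribers quoted in the text


def fair_share_kbps(channel_bps, active_subscribers):
    """Average rate per subscriber if the channel is split evenly."""
    return channel_bps / active_subscribers / 1e3


# Even-split share in kbps for each end of the subscriber range.
shares = {n: fair_share_kbps(DOWNSTREAM_BPS, n) for n in SUBSCRIBERS}
```

Under these assumptions the even split works out to 80 kbps per subscriber at 500 subscribers and 20 kbps at 2,000, which is why per-subscriber provisioning and scheduling decide what each customer actually experiences.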
Figure 1 – Cable modem termination system block diagram (blocks include the cable transceiver, QAM modulator and demodulators, packet processing with queuing/scheduling, traffic flow management, QoS measuring, memory interface, disk controller, switch fabric, network transceiver, and host CPU, connecting the HFC network to the MAN/WAN)

CMTS Design Challenges
Cable operators can offer a variety of different services by using quality of service (QoS) provisioning to support different subscriber packages, helping to maximize their revenue stream. For QoS in the CMTS, the design needs to support packet classification, packet prioritization, flow control, congestion control, queuing, scheduling, and QoS statistical measurements. All of these functions need to be supported without a reduction in user bandwidth. Given this, QoS processing is generally done in hardware, because software implementations lack the processing power to make real-time routing decisions and can result in delays and excessive queuing.

Maintaining efficient bandwidth utilization while supporting SLAs and multiple traffic types makes traffic management very challenging. Throw in varying protocols, memory management, different sized payloads, and a variety of different system interfaces, and it is easy to see how these designs require high-performance, cost-effective flexibility that ASSPs and ASICs cannot offer. These challenges open up opportunities for Virtex-4 devices that can provide flexible traffic management capability at the required performance levels.

CMTS Queuing and Scheduling Requirements
QoS provisioning is basically a queuing and scheduling problem. Proper queuing and scheduling entails recognizing service classes along with managing buffer memories and port bandwidth. The design goal is to reduce the amount of congestion in order to offer the maximum amount of bandwidth and packet throughput by optimizing end-to-end delay and minimizing packet loss.

In addition, the implementation needs to support fair bandwidth distribution for each service class; furnish protection between the different class levels; provide fast, flexible access to bandwidth without impacting forwarding performance; and allow other service classes to use underutilized bandwidth.

To surmount these challenges, efficient queuing and scheduling techniques are required to optimize queue memory management, which controls the number of packets in a queue. This function controls service-class access to the packet memory buffer and determines which packets to drop because of congestion.

Multiple queue memory management techniques are in use today, including random early detection (RED), weighted random early detection (WRED), and leaky bucket. Per-flow queuing is commonly performed using one or a combination of the scheduling algorithms shown in Table 1.

Queuing and Scheduling Algorithms
First-In, First-Out
Round Robin
Weighted Round Robin
Fair Queuing
Weighted Fair Queuing
Priority Queuing
Shortest Remaining Time

Table 1 – Common queuing and scheduling algorithms

Table 1 shows that there are many different queuing and scheduling algorithms. Given the dearth of standards activity in this area, many different algorithms will continue to exist for the foreseeable future. In addition, these algorithms need to handle variable sized packets, which are more complicated than fixed cells.

Virtex-4 devices offer a high-performance solution for these queuing and scheduling requirements, because the devices offer an extremely fast and flexible fabric for implementing designs without impacting forwarding performance. Scheduling decisions are typically performed every clock cycle and require heavily pipelined designs.

Virtex-4 devices also offer a register-rich architecture with ample routing, enabling efficient implementation of these decisions. The high-speed designs also require very wide internal buses, which are easily implemented in the Virtex-4 architecture by using the integrated DLLs and DCMs to help manage multiple clock domains.
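To give a feel for one of the scheduling algorithms named in Table 1, here is a minimal Python sketch of weighted round robin. It is a software illustration of the decision logic only, not a hardware design: the queue names, weights, and packet labels are hypothetical, and the scheduler counts packets rather than bytes, so it sidesteps the variable-sized-packet complication mentioned above.

```python
from collections import deque


def weighted_round_robin(queues, weights):
    """Yield packets, taking up to `weight` packets from each queue per round."""
    pending = {name: deque(pkts) for name, pkts in queues.items()}
    while any(pending.values()):
        for name, weight in weights.items():
            for _ in range(weight):
                if not pending[name]:
                    break  # this class has nothing left to send this round
                yield pending[name].popleft()


# Hypothetical service classes: "gold" gets twice the share of "bronze".
QUEUES = {"gold": ["g1", "g2", "g3", "g4"], "bronze": ["b1", "b2"]}
WEIGHTS = {"gold": 2, "bronze": 1}

ORDER = list(weighted_round_robin(QUEUES, WEIGHTS))
```

In hardware the same decision would be pipelined so that a new packet can be selected every clock cycle, which is where the register-rich fabric matters.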
Many of the queuing and scheduling buffer management schemes are math-intensive; these schemes must quickly calculate multi-variable equations such as packet transmit scheduling and customer service normalization schemes. For instance, the bandwidth calculation shown in Figure 2 is a multi-variable equation used to calculate the bandwidth (B1, B2) for each user for a given level of total bandwidth. These types of functions can take advantage of the integrated 500 MHz performance, low power, 18 x 18 multipliers, and 48-bit adder/subtractor integrated in the XtremeDSP™ slice.
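The Figure 2 relations (N3 P2 B2 = N2 P3 B3, N3 P1 B1 = N1 P3 B3, and B1 + B2 + B3 = B) imply that each class's bandwidth is inversely proportional to its drop probability divided by its flow count. The following Python sketch works through that calculation; the function name and the sample numbers are hypothetical, and floating-point arithmetic stands in for the fixed-point multiplies a hardware implementation would use.

```python
def allocate_bandwidth(total_bw, drop_probs, flows):
    """Split total_bw so that each B_i is inversely proportional to P_i / N_i."""
    p_hat = [p / n for p, n in zip(drop_probs, flows)]  # normalized drop probability
    inverse = [1.0 / x for x in p_hat]
    scale = total_bw / sum(inverse)
    return [scale * v for v in inverse]


# Hypothetical example: 100 units of bandwidth, three classes with 10 flows each.
P = (0.01, 0.02, 0.04)
N = (10, 10, 10)
B1, B2, B3 = allocate_bandwidth(100.0, P, N)
```

With these sample values the class with the lowest drop probability receives the largest share (roughly 57, 29, and 14 units), and the three shares add up to the total.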
CMTS Memory Requirements
Most networking applications are built around a load-store type of architecture, with packets being stored in linked lists in external memories. Because of the increasing queuing and scheduling performance requirements of the CMTS, high-speed DDR or QDR SRAM memories are used to prevent memory access from becoming a bottleneck.
To properly interface to these memory devices, all Virtex-4 devices have the ChipSync™ feature in every device I/O. ChipSync lets designers easily align the DQS control signal with memory data in very small increments; this alignment can be easily monitored and altered as temperature and voltage changes alter the very delicate timing.
Converting the high-speed 300 MHz+ memory data to wider, slower, more manageable data is easily accomplished with the built-in ISERDES and OSERDES available in every I/O. Additionally, the Virtex-4 memory-rich architecture, capable of running at 500 MHz, provides much needed on-chip cache capability.
Virtex-4 devices support high-speed memory interfaces and, along with an embedded hierarchy of memory structures comprising distributed and block RAM, can easily facilitate implementation of high-performance queuing and scheduling algorithms. The Virtex-4 devices’ high memory-to-logic ratio helps reduce memory access latency by caching data on-chip, buffering data between two disparate clock domains, and using scratch-pad memory for storing coefficients.
The integrated distributed RAM is good for implementing small FIFOs, DSP coefficients, shallow/wide memories, and CAMs. The block RAM is good for larger FIFOs, packet buffers, video line buffers, cache tag memory, deep/wide memories, and CAMs. Xilinx also has many proven embedded-memory CAM and FIFO reference designs available to help implement these high-speed memory designs.
CMTS Video Transmission Standards
The ITU-T (International Telecommunication Union – Telecommunication Standardization Sector) has created a standard for the transmission of audio, video, and data services over cable networks. The specification for this standard is ITU-T J.83, Digital Multi-Program Systems for Television, Sound, and Data Services for Cable Distribution.
This standard is supported in Virtex-4 devices using the Xilinx J.83 Cable Modulator LogiCORE™ IP to provide either single- or quad-channel support. (See the related article from the Winter 2004 issue of the Xcell Journal, “Using System Generator for DSP to Create the J.83 Cable Modulator.”)
Conclusion
Given the high bandwidth requirements of a CMTS, along with the queuing and scheduling complexity needed to meet QoS requirements, Virtex-4 devices offer an optimal solution for these designs. The embedded hierarchy of memory structures, along with integrated high-speed serial interfaces and programmable flexibility, make Virtex-4 devices a better choice than implementations using ASICs or ASSPs.
To learn more about Xilinx key markets and end applications, visit www.xilinx.com/esp/. For more details on Virtex-4 FPGAs, visit www.xilinx.com/virtex4/.
Problem: Given the total bandwidth (B), the drop probabilities (P1, P2, and P3), and the number of flows for each PHB class, calculate B1, B2, and B3.

Total bandwidth for the aggregate PHB AF1 group is: B
PHB AF13: [Drop Probability = P3; Bandwidth = B3; Flows = N3]
PHB AF12: [Drop Probability = P2; Bandwidth = B2; Flows = N2]
PHB AF11: [Drop Probability = P1; Bandwidth = B1; Flows = N1]

Constraints:
\[ N_3 P_2 B_2 = N_2 P_3 B_3, \qquad N_3 P_1 B_1 = N_1 P_3 B_3, \qquad B_1 + B_2 + B_3 = B \]

Solution:
\[ B_i = \tilde{P}_i B, \qquad i = 1, 2, 3; \; i \neq j, k \]
\[ \tilde{P}_i = \frac{\hat{P}_j \hat{P}_k}{\hat{P}_i \hat{P}_j + \hat{P}_j \hat{P}_k + \hat{P}_i \hat{P}_k}, \qquad \hat{P}_x = \frac{P_x}{N_x} \]

Figure 2 – Bandwidth calculation formula example
by Amit Dhir
Senior Manager, Strategic Solutions, Wired Networks and Telecom Markets
Xilinx, Inc.
[email protected]
Although the dot-com bubble may have burst, the Internet has continued its multi-fold growth, thus placing a strain on telecommunication networks. Both individuals and businesses are demanding more bandwidth to run new communications options, such as desktop video conferencing, IP telephony, remote storage, and mobile communications.
This is the driving force behind the need to transform the multiple, costly, and complex networks in use today into a smarter, multipurpose, global, cost-effective broadband network. This transformation will generate new sources of revenue for service providers, provide greater opportunities and productivity for enterprises, and meet the needs of consumers who value multimedia, the freedom of mobility, and personalized and secure private network services. The boundaries between public and private, wired and wireless, and voice and data networks are vanishing.
Virtex-4 FPGAs provide the density, features, and performance at low price points to enable the communication revolution.
The key elements of a more intelligent, high-speed, multi-purpose global network include broadband and optical technologies, voice over packet, wireless data, multimedia services and applications, and security, all underpinned by a packet network core. Typical wired telecom and datacom equipment can be segmented into line cards, switch cards, control cards, and a backplane. Network convergence requires equipment vendors to support multiple technologies, including SONET/SDH, PDH, Data over SONET (GFP, VCAT, and LCAS), Fibre Channel, Ethernet, DVI, DSL, PON, and MPLS, depending on the system’s location in the access networks, metropolitan area networks, enterprises, and wireless networks.
Because data is transmitted in IP packets, packet processing has become a sophisticated architectural decision depending on the end system. This also influences the switch architecture and backplane topology. Also, with time to market and cost pressures, equipment providers continue to focus on technology and innovation as the cornerstones for creating new revenue opportunities.
Enabling the Communications Revolution
Xilinx® FPGAs offer a high-performance fabric, integrated features, and powerful clock management, thus providing an ideal platform for communications equipment vendors to develop their solutions. Xilinx also provides case studies, IP, and reference designs to help customers with their designs in several key applications.
Telecom and Datacom Line Card Port Interfaces
Digital telecom infrastructure has mostly been based on PDH and SONET/SDH technologies in the metropolitan area and transport networks. The transport of data traffic (Ethernet, Fibre Channel, ESCON, and DVI) onto SONET/SDH networks is giving rise to technologies such as generic framing procedure and virtual concatenation. This flux creates a need for programmable solutions that allow vendors to use a single SFP or XFP module to support multiple technologies at given rates. With the Virtex-4™ FX family supporting Gigabit Ethernet (1 and 10 Gbps), Fibre Channel (1, 2, 4, 8 and 10 Gbps), and SONET (OC-12 and OC-48) on every RocketIO™ serial transceiver, you have extreme flexibility in the I/Os.

The FPGA, coupled with robust IP offerings from Xilinx and our partners for MACs and framers/mappers, presents a flexible solution that can be morphed depending on the service provider’s needs on a per-port basis. This also helps in the lifecycle cost management of the system, as fewer cards need to be maintained and can be programmed with the relevant port interfaces required upon shipping.

Figure 1 – Typical line card (blocks: integrated optics and PMD, PHY layer, framer/mapper/MAC, network processor with look-up and classification, traffic, queue, and policy management, memory, SerDes, and switch fabric, connected through standard system, memory, and backplane interfaces)

Serial Backplanes and Switching
With exploding data rates and source synchronous I/Os unable to keep up with the pace at which packet communication occurs between the line cards, vendors are universally looking at serial technologies to solve the bandwidth problem. RocketIO transceivers, which support a wide performance range of 622 Mbps to 11.1 Gbps, can also be used to drive several tens of inches on FR-4 and other exotic materials at different rates. With the Virtex-II Pro™ and Virtex-II Pro X families and the integrated RocketIO transceivers, Xilinx has already enabled several customers to upgrade their backplane to faster rates.

With the Virtex-4 family’s third-generation multi-gigabit transceivers and enhanced features such as AC coupling, programmable preemphasis, and receive (linear and decision feedback) equalization, you can ensure signal integrity in a wide variety of applications and give new life to old systems by upgrading legacy backplanes.

Industry standards such as Serial RapidIO™, Gigabit Ethernet, and PCI Express (including out-of-band signaling and spread spectrum clocking) are all supported. Virtex-4 FX FPGAs enable bridging between just about any serial or parallel system interface.

To enable the creation of mesh designs, Xilinx offers the mesh fabric reference design for complete flexible connectivity across a serial backplane based on the standard of your choice. Xilinx also provides signal integrity tools and resources such as the ATCA development board to ease the process of designing SerDes solutions into your next-generation backplane.

Packet Processing
Although several network processor vendors have attempted to solve packet processing (classification, policing, queuing, and scheduling) problems, achieving performance and power goals continues to be challenging. Virtex-4 FPGAs solve network processing challenges with features such as system and memory interfaces, clock managers, block RAM, DSP slices, PowerPC™, and high-speed programmable logic. Xilinx also offers solutions such as the queue manager and mesh fabric reference designs to help with traffic management needs.
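Of the packet processing steps listed above, policing is the easiest to show in a few lines. The sketch below uses a single-rate token bucket, which is one common way to police traffic and is chosen here only for illustration; it is not taken from the Xilinx queue manager reference design, and the class name, rate, and burst size are hypothetical.

```python
class TokenBucket:
    """Single-rate policer: a packet conforms if enough byte tokens are available."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = 0.0

    def conforms(self, now, packet_bytes):
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False


# Hypothetical flow policed to 1,000 bytes/s with a 1,500-byte burst allowance.
bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
ARRIVALS = [(0.0, 1500), (0.1, 500), (1.0, 900)]  # (time in seconds, size in bytes)
RESULTS = [bucket.conforms(t, size) for t, size in ARRIVALS]
```

The first packet uses the whole burst allowance, the second arrives before enough tokens have accumulated and is marked non-conforming, and the third conforms once the bucket has refilled. A line-rate implementation would hold one such counter per flow in on-chip memory.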
Simplifying System Design Challenges
The fundamentals of unparalleled flexibility and high performance are further extended in the Virtex-4 family. To help simplify your system design challenges, Xilinx also offers:
• Integration. The integration of processors, tri-mode Ethernet MACs, DSP slices, SerDes, memory, and other features in the FPGA helps reduce your bill of materials and saves FPGA resources. This reduction in component count streamlines logistics, simplifies PCB design and manufacturing, and improves reliability by reducing the number of solder joints.
• SelectIO™ technology and connectivity IP. Virtex-4 FPGAs make it easy to build robust high-speed memory and networking interfaces. All Virtex-4 platforms include configurable, high-performance SelectIO technology to support a wide variety of I/O standards. Virtex-4 FPGAs provide as many as 960 user I/Os, supporting more than 20 single-ended and differential electrical I/O standards to enable several parallel system interface standards on one device. New ChipSync™ technology built into every I/O block makes source-synchronous interfacing to the latest high-speed components easy. Plus, powered with XCITE technology, each I/O block delivers on-chip active I/O termination, eliminating external termination resistors to increase signal integrity, save board space, and reduce system cost. Xilinx also provides a robust offering of IP (PCI, SPI-3, SPI-4.2, RapidIO) and reference designs (DDR2, DDR, QDR II, RLDRAM II, FCRAM II) for system and memory connectivity.
• Embedded processing. With the embedded PowerPC and the soft MicroBlaze™ and PicoBlaze™ processors, Xilinx offers a range of processing solutions to match the requirements of different tasks, ranging from simple control functions to advanced algorithms and high-speed calculations. Also, in telecom cards the processors assist with simple functions such as alarm handling and performance monitoring.
• Low-cost designs. Xilinx manufactures Virtex-4 FPGAs using 90 nm advanced process technology on 300 mm wafers. This allows us to produce approximately five times as many die per wafer, compared to building an equivalent chip in 130 nm process on 200 mm wafers. This lowers the cost per die significantly.
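The "approximately five times" figure in the last bullet can be sanity-checked with first-order arithmetic. The sketch below assumes, for illustration only, that die area shrinks with the square of the process feature size and that usable wafer area grows with the square of the wafer diameter; it ignores edge loss, scribe lines, and yield. The function name is hypothetical.

```python
def relative_die_count(wafer_mm_new, wafer_mm_old, node_nm_new, node_nm_old):
    """First-order ratio of die per wafer: wafer-area gain times die-area shrink."""
    wafer_gain = (wafer_mm_new / wafer_mm_old) ** 2  # 300 mm vs 200 mm wafer area
    shrink_gain = (node_nm_old / node_nm_new) ** 2   # 130 nm vs 90 nm die area
    return wafer_gain * shrink_gain


RATIO = relative_die_count(300, 200, 90, 130)
```

Under these assumptions the wafer contributes a factor of 2.25 and the shrink a factor of about 2.09, for a combined ratio of roughly 4.7, which is consistent with the approximate figure quoted above.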
Additionally, the EasyPath™ program further lowers system cost for customers who are ready to take their finished design to volume production. Xilinx creates customized test programs for EasyPath customers that exercise only the device resources used in the specific design. This approach shortens test time and increases yield to reduce FPGA unit prices as much as 80%.
Conclusion
To learn more about the key markets and end applications of Xilinx solutions, visit www.xilinx.com/esp/ or e-mail [email protected]. For more details on Virtex-4 FPGAs, visit www.xilinx.com/virtex4/.
Established in 1996, Red River specializes in high-performance signal processing and data communication solutions for the embedded systems market, especially software defined radio applications.
Our main challenge in serving the software defined radio market is to have a hardware platform that meets the demands of multiple configurations. Some customers are looking for a complete, pre-built radio solution; others are looking to add custom features to a radio platform. These disparate requirements place great demands on us to find a common programmable silicon solution that meets both needs.
The Xilinx® Virtex-4™ FPGA family allows us to do exactly that: provide different customer solutions at the lowest cost. Advanced features such as FIFO logic, embedded PowerPC™, RocketIO™ transceivers, and Ethernet MAC, as well as advanced power and packaging technology, make Virtex-4 devices a perfect choice for us.
Model 351 (Pocket Change)
Our next-generation product, the Model 351, or “Pocket Change,” transforms any portable computer into a high-performance multi-channel software defined radio transceiver. The Pocket Change CardBus PC Card accepts two analog input signals through MMCX coaxial connectors on the outside edge of the card. The receiver input is AC-coupled to a 14-bit (80 MSPS) A/D converter. The transmitter output is supplied through a 14-bit (100 MSPS) D/A converter. Most of the digital logic is supplied using a Virtex-4 FPGA device.
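The converter figures above translate directly into the raw data rates the FPGA has to handle. The sketch below multiplies resolution by sample rate; it counts sample bits only and assumes, for illustration, that samples are not padded to a wider word. The function name is hypothetical.

```python
def raw_rate_mbps(bits_per_sample, msps):
    """Raw converter data rate in Mbps: bits per sample times mega-samples per second."""
    return bits_per_sample * msps


ADC_MBPS = raw_rate_mbps(14, 80)   # receive path: 14-bit samples at 80 MSPS
DAC_MBPS = raw_rate_mbps(14, 100)  # transmit path: 14-bit samples at 100 MSPS
```

This works out to 1,120 Mbps on the receive side and 1,400 Mbps on the transmit side before any decimation or interpolation, which helps explain why the performance-critical DSP functions sit in the FPGA rather than on the host.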
When we began developing the Model 351, we investigated various offerings on the market and finally decided to use Virtex-4 FPGAs. The Virtex-4 FPGA family provides the flexibility and features that support both our needs and the requirements of our customers.
The Model 351 design comprises a Virtex-4 FPGA connected to an A/D converter, a D/A converter, and a dedicated PCI bus controller for the CardBus interface to the host computer (Figure 1). Although it is targeted at our traditional software defined radio customers, the Model 351 is also suitable for signal acquisition or generation, signal intelligence collection, transceiver modem algorithm prototyping, frequency hop signal generation, or portable signal recorder/playback applications.
Virtex-4 FPGAs for Software Defined Radio
54 Xcell Journal First Quarter 2005
Red River’s new PCMCIA Type II module can transform any notebook computer into a software defined radio using a Virtex-4 FPGA for performance-critical DSP functions.
SYSTEM DESIGN CHALLENGES
Customization and Flexibility
Initially we considered using dedicated digital upconverter/downconverter chips to implement the Model 351 transceiver function. However, many of our customers prefer the flexibility of inserting custom functions into their designs. The customization requirement pushed us to use programmable technology.
By selecting a leading programmable logic architecture, we can address the customization needs of a broad set of customers. Xilinx ISE™ development software provides our customers a familiar design environment to embed custom DSP functions in the uncommitted logic of the Virtex-4 FPGA.
Another benefit from using Virtex-4 FPGAs is that we can offer multiple products using one common hardware platform. This has helped reduce hardware development time and simplify inventory management.
Power and Space Efficiency
One of the challenges in CardBus PC Card development is to select a device that meets the PCMCIA functional specification and the tight power restriction of 3.3 W. We were impressed with the power efficiency of the Virtex-4 family, as it consumes half the power of comparable logic solutions.
Virtex-4 FPGAs give us significant features and performance while still meeting the tight power budget of our design. In addition, PCMCIA imposes severe height restrictions in order to fit into the Type II module form factor. The Virtex-4 FF668 package offering is one of the few FPGA packages that meet the height requirements.
With the highest-performance internal block RAM and unique integrated FIFO logic, Virtex-4 FPGAs give us the FIFO quantity and performance that we need to keep up with the bandwidth of the analog components and host interface.
Three Platforms Satisfy Multiple Requirements
The three Virtex-4 platforms (LX, SX, and FX) give us unique capabilities for several upcoming products. For customers wanting to add custom logic functionality, we use the LX platform. LX offers the choice of many different gate densities within the same package footprint, allowing us to use the same base design to support many different customer needs.
We have some designs that necessitate tremendous additional DSP capability for math-intensive processing, including signal modulation and demodulation. For these applications, we see the SX platform as a natural fit. SX devices give us by far the largest amount of DSP performance.
For some of our other designs, we are implementing the advanced system-level block functionality of the FX platform – PowerPC running VxWorks, RocketIO transceivers for optical and PCI Express interfacing, and gigabit Ethernet MAC cores. Because Virtex-4 devices give us three platforms to choose from, we can offer different capabilities across our product line.
Conclusion
Software defined radio products must address a broad application space, which presents a challenge when selecting component features. The three Virtex-4 platforms give us the feature choice and performance that we require to field a family of solutions for both fixed and mobile installations.
The upcoming Model 351 demonstrates cutting-edge capabilities in an extremely small, power-efficient module that operates in a standard notebook computer. Visit www.red-river.com for more information about the Model 351 and other Red River products.
Advanced Features and Performance
One key requirement for a software defined radio application is high-performance DSP capability. The performance requirement is driven by the need to support multiple signal channels in real time.
Virtex-4 FPGAs are capable of performing multi-channel digital upconversion and downconversion across the entire Model 351 analog bandwidth. The Virtex-4 device can also perform Fast Fourier Transforms (FFTs) for spectral analysis of incoming signal data.
The Virtex-4 FPGA provides the “heavy lifting” to process digital information between the host computer and the A/D or D/A converter. The signal processing power comes directly from the SX platform. Virtex-4 devices can achieve high DSP performance by taking advantage of massive parallelism within each FPGA. For math-intensive algorithms (like DUC/DDC applications in a software defined radio), the high number of DSP slices – multiply/add/accumulate engines – that can run at up to 500 MHz provides the kind of performance previously available only in fixed ASIC technology.
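To make the multiply/accumulate workload of a digital downconverter concrete, here is a minimal software sketch. It is not the Model 351 design: the function name, the boxcar (moving-average) filter, the 10 MHz test tone, and the decimation factor are all illustrative assumptions. Only the 80 MSPS sample rate comes from the A/D converter described earlier.

```python
import math

def ddc(samples, fs_hz, f_center_hz, decimation):
    """Hypothetical digital downconverter: mix to baseband, then average and decimate."""
    baseband = []
    for n, x in enumerate(samples):
        phase = 2.0 * math.pi * f_center_hz * n / fs_hz
        # One complex multiply per input sample; in hardware this is the kind
        # of work a multiply/add/accumulate engine performs every clock.
        baseband.append(x * complex(math.cos(phase), -math.sin(phase)))
    out = []
    for start in range(0, len(baseband) - decimation + 1, decimation):
        block = baseband[start:start + decimation]
        # Boxcar low-pass filter (an assumption; real designs use better filters)
        out.append(sum(block) / decimation)
    return out

fs = 80e6      # 80 MSPS, matching the receiver A/D converter rate
tone = 10e6    # hypothetical test tone at the channel center frequency
samples = [math.cos(2 * math.pi * tone * n / fs) for n in range(64)]
iq = ddc(samples, fs, tone, decimation=8)
```

In this sketch every input sample costs one complex multiply and one accumulate per channel, so the arithmetic load grows with both sample rate and channel count. That is the reason parallel hardware multipliers matter for multi-channel radios.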
Our designs also make extensive use of the internal block memories in the FPGA to provide multi-queue FIFO capabilities. The FIFOs are used to buffer data between the A/D or D/A converters and the local bus for DMA operations, providing performance-intensive processing without involving the host CPU in memory transfers. This gives our products the ability to flexibly handle digital radio data without completely consuming the CPU performance of the host computer.
Using in-depth market knowledge, Barco designs and develops solutions for large-screen visualization, display solutions for life-critical applications, and systems for visual inspection. Barco is currently active in the traffic, surveillance, broadcasting, presentation, simulation and virtual reality, edutainment, events, media, digital cinema, air traffic control, defense and security, medical imaging, avionics, and textile industries.
My particular division at Barco, BarcoView Command & Control in Belgium, has been a Xilinx® customer for just over two years. Our division’s choice to standardize on Virtex™ products was based on the availability of the embedded PowerPC™ processor, first introduced by Xilinx in their Virtex-II Pro™ product family.
We like to design with FPGAs in our systems because they can be reprogrammed throughout the life of the product. This critical feature allows us to add features from one generation to the next without having to redesign the whole system.
BarcoView Command & Control is working on a rugged family of LCD monitors. These products are designed for rough environments where commercial display products would not survive. In these designs, FPGAs are mainly tasked to perform video and image processing.
The system is currently designed around a Virtex-II Pro device, in which the PowerPC processor, running a real-time embedded operating system, controls the complete display system. Looking at the new features of the Virtex-4™ FX family, we are planning to migrate these Virtex-II Pro designs that use the PowerPC processor to Virtex-4 FX devices in a future version of the project.
Besides the central control of the display system, we also use FPGAs in the data path for specific processing. The part of the design where we chose to implement the Virtex-4 FPGA is an optional feature of the displays, where it performs real-time image scaling on the video stream.
Virtex-4 FPGAs in Rugged LCD Monitors
Integrated features like ChipSync technology not only reduce cost but improve ease of use and design cycle time.
This scaler module can receive a video stream on its input at a very high rate (160 MHz x 24 bits = 3.84 Gbps), perform scaling on the stream, and send out the scaled stream at the same rate. With the amount of data being processed and because of the way the scaler algorithm works, we must store the incoming video stream into memory before processing it. Thus, we had to look at very fast external memories (DDR2).
Memory Interfaces Made Easy
When searching for the right product for our application, we looked at many alternatives. However, it rapidly became clear that Virtex-4 devices could best perform the required tasks.
The main reason for choosing Virtex-4 FPGAs was the availability of the ChipSync™ feature, with support for DDR-2 400 memories. Having support for DDR-2 400 gives us enough bandwidth to reduce the number of physical RAM chips needed, reduce the board real estate, and in the end reduce system cost.
Looking at the data flow, these video streams are digitized into pixels up to 24-bit RGB (it could be a narrower stream depending on the input source). The incoming stream is stored into an input memory buffer at a frequency reaching up to 160 MHz. The data from this input memory buffer is then fed to the scaler core, also on 24 bits, at a maximum frequency of 100 MHz.
After the core has processed the data, the
ChipSync technology allows us to easily reach 400 Mbps and intuitively design this interface. Without this feature, we would have needed a 32-bit interface to the external memory. Though running at half the clock rate, it would have required more physical SDRAM on the board, as there is no such thing as a small SDRAM device. In addition to leaving more memory locations unused, we would have required a larger package for the scaler device because of the increased number of pins, using more board real estate.
ChipSync technology also allows us to easily use DDR-2 interfaces, enabling us to choose the very latest in SDRAM technology. This helps to avoid obsolescence issues, a common problem in the memory industry.
Block RAM: Not Just Memory
Another critical point when choosing the right FPGA is the amount of block RAM available in the device. Having flexible, fast internal RAM is a critical factor for us because we use block RAM for two things: as video line memory and as FIFOs for the DDR-2 memory controller. Smaller, slower, or less flexible RAM blocks would have produced a more complex DDR-2 memory controller design, resulting in larger logic requirements and therefore a larger device.
In addition to speed, flexibility, and size, the integrated FIFO logic available on each block RAM allows us to save a substantial amount of logic and guarantees fast FIFO operation, simplifying the design of our whole system.
Conclusion
The logic savings obtained through the use of the integrated FIFO, ChipSync technology, and smaller external memories result in a significant cost reduction. Additionally, the ease of use, implementation, and modification brought by the hard IP blocks make the Virtex-4 LX15 device perfect for this application.
After designing with the Virtex-4 LX FPGA, we are looking forward to evaluating the Virtex-4 FX platform to see how we can benefit from all the new features available with the integrated PowerPC processor.
For more information about Barco and our products, visit www.barco.com.
video stream is written back into an output memory buffer at 100 MHz on 24 bits. The output memory buffer can then be read at a frequency reaching 160 MHz on 24 bits to further process the data. After all that processing and some more, the images are displayed on the LCD monitor.
As shown in Figure 1, which represents the Virtex-4 LX15 ecosystem of our design, the memory bandwidth requirements for the input and output buffers are identical. Focusing on the input memory stream, we can see that the bandwidth required is (160 MHz + 100 MHz) x 24 bits = 6,240 Mbps.
This is where the advantages of 400 Mbps DDR-2 are realized. Because of this memory speed, we can select a 16-bit-wide DDR-2 SDRAM running at 200 MHz and still have enough bandwidth to process the input memory buffer streams (the stream coming from the input source and the stream going to the scaler core).
A simple calculation shows that 200 MHz x 2 (double data rate) x 16 bits = 6,400 Mbps. This is higher than the 6,240 Mbps previously calculated for the input buffer. Of course, we need to take into account a small overhead for the memory controller (during transients), but the margin should be more than enough to guarantee reliable system operation. If for any reason the controller’s overhead becomes such that we cannot guarantee that the system would work properly, we can always lower the 100 MHz core frequency.
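The bandwidth budget above can be checked with a few lines of arithmetic. This is only a sketch of the calculation in the text; the function and variable names are illustrative, and controller overhead is not modeled.

```python
def stream_mbps(clock_mhz, width_bits):
    """Bandwidth of a single-data-rate stream in Mbps."""
    return clock_mhz * width_bits

def ddr_peak_mbps(clock_mhz, width_bits):
    """Peak bandwidth of a double-data-rate interface (two transfers per clock)."""
    return clock_mhz * 2 * width_bits

# Input buffer traffic: the 160 MHz write from the input stage plus
# the 100 MHz read by the scaler core, both 24 bits wide.
required = stream_mbps(160, 24) + stream_mbps(100, 24)

# One 16-bit DDR-2 device clocked at 200 MHz.
available = ddr_peak_mbps(200, 16)

margin = available - required
margin_pct = 100 * margin / available
```

The result is 6,240 Mbps required against 6,400 Mbps peak, which leaves 160 Mbps (2.5% of peak) for controller overhead. Lowering the hypothetical core clock argument from 100 MHz is the fallback the text describes if that margin proves too thin.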
Figure 1 – Video scaler block diagram based on a Virtex-4 FPGA (a Virtex-4 LX15 with 363 pins contains the proprietary scaling core, running on a 100 MHz core clock, and two DDR2 memory controllers; video arrives from the input stage and leaves to the mixing stage at 160 MHz @ 24 bits (max); the input buffer and output buffer are each DDR2 400, 256 Mb @ 16 bits)
The advanced silicon features introduced with Xilinx® Virtex-4™ FPGAs are readily available through ISE™ (Integrated Software Environment) 6.3i technology. This latest release of Xilinx design software comes ready to deliver maximum design performance, with new features and optional tools that will speed your Virtex-4 project to completion.
Advanced Timing Closure and Performance
ISE software lets you get the most out of Virtex-4 devices and your target project. Benchmark testing on a suite of real-world, customer-based designs demonstrates that Virtex-4 FPGAs, with ISE 6.3i design software, are as much as 43% faster than the nearest competitive FPGA. On average, that’s an extra speed grade advantage.
The performance-driven ISE technology – like our exclusive timing-driven map option – helps you achieve better design packing and better performance, particularly if your target device is already more than 90% utilized. Timing-driven map can yield 30% better overall design performance depending on design utilization.
This additional performance advantage gives you the potential to stay in a lower density target Virtex-4 device, even if utilization is pushing 90% or higher, when competing tools would have already forced the design into a larger, more expensive device.
ISE 6.3i Software – Unleash the Power of Virtex-4 FPGAs
New ISE technology delivers breakthrough performance with greater ease of use.
ENGINEERING SOLUTIONS
High-Density Design
ISE design software also includes a full spectrum of tools for larger density designs, including area and logic group floorplanning, incremental design for faster design recompile cycles, and modular design for team-based project approaches. High-density designers can also separately purchase the new PlanAhead™ hierarchical floorplanner, which wraps all of these methodologies into one separate advanced tool. Together, these tools augment the design flow of high-density projects with methodologies that speed through to project completion, as well as performance-locking strategies to help bring large designs under control.
Area Groups
Using either PACE (Pinout and Area Constraints Editor) or ISE Floorplanner, both included with all configurations of ISE design software, you can quickly floorplan areas of logic from your design onto your target Virtex-4 device. You can create area groups around hierarchical HDL boundaries, or let PACE create default area estimates for target logic, or draw logic areas by hand.
Visualizing the different areas of logic helps you partition out areas for design reuse or IP placement, or section off where the “tough” areas of the design will be concentrated. Most importantly, area planning can help accelerate timing closure by grouping critical logic and paths together and minimizing the number of interface points between modules.
Modular Design
ISE design software also includes modular design, a capability that implements a “divide and conquer” strategy for large designs – and for the corporate environments that deploy teams of engineers to tackle them. A design team manager first plans the design project, using floorplanning to partition the overall larger project into smaller design “modules.” These modules can then be assigned to individual team members for completion independent of the other modules.
Completion is focused on only that particular module of the overall design, with all teams completing their work in parallel.
PlanAhead
In June 2004, Xilinx announced the acquisition of the leading-edge PlanAhead hierarchical floorplanner, developed originally by Hier Design. The PlanAhead floorplanner is a separately purchased tool option to the ISE design flow that is ideal for Virtex-4 high-density designs.
The PlanAhead tool utilizes an ASIC-style floorplanning methodology using a block-based approach. It enables you to analyze, detect, and correct potential implementation problems earlier in the design cycle, leading to the following benefits:
• Quicker incremental design changes
• Faster place and route
• Greater consistency and predictability in place and route
• Fewer design iterations
• Improved design performance
• Tighter utilization control
• Reuse of intellectual property and teamwork
The majority of low-density FPGA designs are implemented flat, with no hierarchy. Standard PLD place and route algorithms use more compile time to complete a flat design. By breaking the designs into
Once a module is finished, its place and route results are locked while the project manager waits for the remaining modules to be completed.
Modular design delivers full planning control over the larger design, implementing a true bottom-up design approach that completes the larger project much faster.
Incremental Design
Incremental design, also included with ISE design software, combines the quick-and-easy facet of area groups with the performance-locking aspects of modular design to deliver faster runtimes during heavy design iteration cycles.
Using PACE, you can assign area groups along hierarchical HDL boundaries; the overall design is then completed as usual. Should an incremental change become necessary, incremental design guarantees that you only have to re-implement the logic area that needs to change. The remainder of the design stays locked and intact, drastically speeding up overall compile times.
Incremental design also lets you make full use of the verification phase by delivering much faster overall project compile times. You can tweak critical design areas or implement ECO design changes late in the cycle with minimal impact on the larger FPGA project.
Figure 1 – PlanAhead floorplanning with a Virtex-4 LX100 FPGA
smaller pieces, or blocks, place and route doesn’t need to converge on the entire design timing each time an incremental design change occurs. Taking full advantage of hierarchy in this way reduces place and route time.
You can also lock placement results for individual blocks that already meet timing so that subsequent place and route iterations do not change their performance, further stabilizing the overall design and making the overall results more consistently predictable. The PlanAhead tool wraps area groups, incremental design, and modular design into a single ASIC-strength floorplanner. Figure 1 shows Virtex-4 floorplanning using the PlanAhead hierarchical floorplanner.
Speed the Design Flow – ISE Architecture Wizards
The architecture wizards are a series of menus and dialog boxes built into all ISE configurations. These graphical menus let you quickly set advanced configuration parameters for FPGA silicon features. The wizards then write out editable VHDL or Verilog™ source code that is instantiated directly into your target project.
For example, the clocking wizard lets you easily set clock frequency, phase, multiplier factors, and delay for Virtex-4 devices and other Xilinx FPGAs using DCMs (digital clock managers). With the architecture wizards, you can rapidly set up and program advanced FPGA features, so even novice users can learn the most advanced Virtex-4 capabilities quickly.
Also new in ISE 6.3i software are two Virtex-4-exclusive architecture wizards, the ChipSync™ and XtremeDSP™ slice wizards. The ChipSync wizard configures groups of I/O blocks into an interface for use in memory, networking, or other types of bus interface design. You can quickly define key parameters such as the width and I/O standard of the data, address, clocks/strobes, clock buffers, and data bus specifications. All information is then presented in a clear and concise table for review.
The XtremeDSP slice wizard, shown in Figure 2, provides easy control of the revolutionary Virtex-4 XtremeDSP slice technology. This new silicon capability lets you build high-performance DSP filters and custom pre- or post-co-processing DSP algorithms. The XtremeDSP slice wizard lets you specify accumulator, adder/subtractor, multiplier, or multiplier and adder/accumulator DSP modes. You can graphically set input and output bus data widths, pipelining options, clock enable, and reset pin setups, and then review parameters and output the results as HDL-ready code.
50% Faster Verification Cycles
Verification is one of the most time-consuming and time-critical phases of the design flow. As with most logic design suites, HDL verification and timing analysis are available. The ISE tools also link to additional verification technologies unique in FPGA design, including formal equivalency verification through Formality from Synopsys™ and Prover eCheck from Prover Technology AB, making quick work of verifying Virtex-4 high-density designs.
The ISE design tools also link directly to our optional, separately purchased ChipScope Pro™ real-time debug environment. ChipScope Pro tools insert low-profile logic analyzer, bus analyzer, and virtual I/O software cores during design capture. These cores are then synthesized and implemented into your silicon, allowing you to view:
• Any internal signal within the FPGA
• Embedded processor signals, including the IBM™ CoreConnect processor local bus or on-chip peripheral bus supporting the PowerPC™ 405 inside Virtex-4 FX family devices
• Embedded processor signals for the MicroBlaze™ soft-processor core
Signals are captured at or near operating system speed and brought out through the programming interface, freeing up pins for your design, not debug. You can then analyze captured signals through the ChipScope Pro software logic analyzer.
The ChipScope Pro environment also links internal FPGA debug to Agilent Technologies™ bench-top logic analyzers using the included ChipScope Pro ATC2 core. This core synchronizes the ChipScope Pro tool with Agilent’s FPGA Dynamic Probe software.
This unique partnership between Xilinx and Agilent delivers deeper trace memory, faster clock speeds, and more trigger options, all using fewer pins on the FPGA, making Virtex-4 design debug as much as 50% faster than other logic verification methodologies.
Conclusion
You can unlock the power of Virtex-4 FPGAs with the ISE 6.3i FPGA environment, the most complete available for programmable systems design. Whether your design includes DSP, embedded, or high-speed serial I/O design, Xilinx ISE software and our optional System Generator for DSP, ChipScope Pro, and EDK and Platform Studio products will get your Virtex-4 LX, SX, and FX designs running with the maximum performance, while shortening design cycles and getting you to market faster.
Figure 2 – XtremeDSP slice architecture wizard
by Peter Alfke
Director of Applications Engineering
Xilinx, [email protected]
A FIFO is a memory subsystem where a data sequence can be written and retrieved in exactly the same order. No explicit addressing is required, and the write and read operations can be completely independent, using unrelated clocks.
“First-In First-Out” has been used in accounting for hundreds of years, as well as in data queues since the early days of computers. In 1970, Fairchild Semiconductor introduced the first integrated FIFO, the 3341.
Today, dedicated and much larger FIFO ICs are available, and mid-sized FIFOs are often implemented in Xilinx® FPGAs using the dual-ported block RAMs supported by soft cores for addressing and control.
A FIFO is an ideal subsystem: simple and user-friendly on the outside but complex and demanding in its implementation details. The design seems to be trivial: use a RAM with two independently clocked ports (one for writing, one for reading) plus two independent address counters to steer write and read data.
It may look easy, but the difficulty is found when you look deeper into the challenge – specifically, the decoding and synchronization of the obligatory status outputs indicating the extreme conditions of EMPTY and FULL. Even experienced designers have had problems decoding these two conditions in a fail-safe way, especially when the FIFO operates with two independent clocks of several hundred megahertz.
Because fast asynchronous design is notoriously difficult, Virtex-4™ FPGAs now have a dedicated FIFO addressing and control circuit right inside each block RAM. Using the Virtex-4 block RAM FIFO option, you can be assured of reliable operation at a clock rate up to 500 MHz, without using any logic slices in the Virtex-4 fabric.
Virtex-4 FIFO
The FIFO shown in Figure 1 behaves like a “black box.” You supply the data (4, 9, 18, or 36 bits wide), a continuously running write clock and its enable signal, and a continuously running read clock and read clock enable. Output data has the same width as the input data, unlike the basic block RAM where the two widths can be different.
FIFOs Made Easy
Virtex-4 FPGAs have a complete FIFO controller in each block RAM.
As the last data entry is being read, EMPTY goes high as a result of the read clock that reads the final data. You should disable the read operation until the EMPTY output has gone inactive again.
Note that both the rising and falling edge of the EMPTY status signal are made synchronous with the read clock, giving you a totally synchronous interface. If read clock enable stays active after the FIFO is empty, the read error flag is activated, but FIFO content and addressing are not disturbed.
ALMOST EMPTY and ALMOST FULL are programmable status outputs, available as a warning to slow down the read or write process, or as an indication of the data level in the FIFO (“dipstick”).
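The flag behavior described above can be illustrated with a small behavioral model. This is a single-threaded software sketch, not the Virtex-4 circuit: the class name, the exact threshold semantics of the ALMOST flags, and the write error flag (modeled here by symmetry with the read error flag) are assumptions, and neither clock domains nor flag latency are modeled.

```python
from collections import deque

class FifoModel:
    """Hypothetical behavioral FIFO with EMPTY/FULL, ALMOST flags, and error flags."""

    def __init__(self, depth, almost_empty_offset, almost_full_offset):
        self.depth = depth
        self.almost_empty_offset = almost_empty_offset
        self.almost_full_offset = almost_full_offset
        self.data = deque()
        self.rderr = False   # set when a read is attempted while EMPTY
        self.wrerr = False   # set when a write is attempted while FULL (assumed)

    @property
    def empty(self):
        return len(self.data) == 0

    @property
    def full(self):
        return len(self.data) == self.depth

    @property
    def almost_empty(self):
        # "Dipstick" warning: few words left to read (assumed threshold rule)
        return len(self.data) <= self.almost_empty_offset

    @property
    def almost_full(self):
        # "Dipstick" warning: few free locations left (assumed threshold rule)
        return len(self.data) >= self.depth - self.almost_full_offset

    def write(self, word):
        self.wrerr = self.full
        if not self.wrerr:
            self.data.append(word)

    def read(self):
        self.rderr = self.empty
        if self.rderr:
            return None      # content and addressing are left undisturbed
        return self.data.popleft()

fifo = FifoModel(depth=4, almost_empty_offset=1, almost_full_offset=1)
for word in (10, 11, 12):
    fifo.write(word)
first = fifo.read()
```

A writer that throttles on `almost_full` rather than `full`, and a reader that stops on `empty`, would follow the usage the article recommends.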
Implementation Details
Understanding FIFO design details is not necessary. It is all “under the hood,” and works without user intervention. But for the curious reader, let’s briefly explain.
Detecting FULL and EMPTY requires detecting identity of the write and read
Figure 1 – FIFO block diagram (write pointer, read pointer, block RAM core, and status flag logic; inputs DIN, wrclk, wren, rdclk, rden, reset; internal signals waddr, raddr, mem_wen, mem_ren, oe; outputs DO, wrcount, rdcount, full, empty, afull, aempty, rderr, wrerr)
Figure 2 – FIFO test circuit (counter, FIFOA, FIFOB, subtract, register, and compare stages, driven by Clock A and Clock B)
Verifying the EMPTY Flag Synchronization
The only tricky detail in a FIFO with unrelated read and write clocks is the proper synchronization of the EMPTY and FULL flags that cross clock boundaries. Any design that might thus be exposed to metastability problems deserves special attention and scrutiny.
At Xilinx, we tested the EMPTY logic exhaustively by writing data into the FIFO at 200 MHz and reading it out at 500 MHz, which makes it go EMPTY soon after each write cycle (Figure 2). The detection logic was thus exercised, and the trailing edge of the EMPTY flag was re-synchronized to the read clock 200 million times a second.
More specifically, we wrote an ascending data sequence at 200 MHz and read it out at 500 MHz. We wrote the output data directly into a second FIFO at the same 500 MHz. We then read the second FIFO out at the original 200 MHz rate.
The combined dual FIFO forms a synchronous system, but with asynchronous data transfer between the two halves. When we synchronously subtracted the input data from the output data, the difference was constant, indicating flawless transfer at the 500 MHz read/write rate and no flag synchronization problem – even at this high rate.
When the two clock frequencies are uncorrelated, each read clock cycle has a different phase relationship with respect to the write clock. During any second, the active read clock edge steps across the ~5 ns write clock period in ~200 million different phase orientations, thus creating a timing granularity of 0.025 femtoseconds (a femtosecond is one quadrillionth of a second). This resolution is millions of times better than any conventional deterministic test methodology can possibly achieve.
We ran this design for a whole week, with more than 10^14 operations, without any error.
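The figures quoted in this sidebar follow from simple arithmetic, sketched below. The variable names are illustrative, and the approximate clock rates are treated as exact values of 200 MHz for the purpose of the calculation.

```python
write_clock_hz = 200_000_000           # ~200 MHz write clock
write_period_s = 1 / write_clock_hz    # ~5 ns write clock period

# ~200 million distinct phase positions are sampled within one write period
# during any one second of operation.
phase_positions_per_second = 200_000_000
granularity_s = write_period_s / phase_positions_per_second
granularity_fs = granularity_s / 1e-15  # expressed in femtoseconds

# EMPTY is generated roughly once per write, so a week-long run exercises
# the flag logic about this many times.
seconds_per_week = 7 * 24 * 3600
operations_per_week = write_clock_hz * seconds_per_week
```

The granularity works out to about 0.025 femtoseconds, and a week of operation yields roughly 1.2 x 10^14 EMPTY events, consistent with the "more than 10^14" figure.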
address pointers, which generally do not share a common clock. Binary counters would generate unacceptable glitches on the comparator output; using Gray-coded counters is the well-known solution to this problem.
The simplest way to build Gray counters is to start with a binary counter and synchronously convert its content into Gray code. The binary address counter values can then be used to calculate the programmable offset for detecting ALMOST FULL and ALMOST EMPTY.
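The binary-to-Gray conversion mentioned above is a one-line operation, and a short sketch shows why it removes comparator glitches: consecutive Gray codes differ in exactly one bit, while a binary counter can change several bits at once. The function names and the 4-bit address width are illustrative assumptions, and the fill-level helper is only one plausible way to use the binary counts for the ALMOST flags.

```python
def binary_to_gray(value):
    """Standard reflected binary (Gray) code of a non-negative integer."""
    return value ^ (value >> 1)

def gray_to_binary(gray):
    """Inverse conversion: XOR together all right-shifted copies of the code."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value

def bits_changed(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")

def fill_level(write_count, read_count, depth):
    """Words stored, computed from binary pointers that wrap at depth."""
    return (write_count - read_count) % depth

ADDR_BITS = 4                      # hypothetical address width
DEPTH = 1 << ADDR_BITS
codes = [binary_to_gray(i) for i in range(DEPTH)]

# Bits that toggle on each increment, including the wrap from 15 back to 0
gray_steps = [bits_changed(codes[i], codes[(i + 1) % DEPTH]) for i in range(DEPTH)]
binary_steps = [bits_changed(i, (i + 1) % DEPTH) for i in range(DEPTH)]
```

Every entry of `gray_steps` is 1, whereas `binary_steps` reaches 4 at the 7-to-8 and 15-to-0 transitions. A pointer sampled mid-transition in another clock domain is therefore off by at most one count in Gray code, rather than wildly wrong.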
Synchronization Issues
Because EMPTY can only be caused by a read operation, the leading edge is naturally synchronous with the read clock. But the trailing edge is caused by a write operation and is thus synchronous with the “wrong” clock. Moving the trailing edge of EMPTY over onto the read clock domain needs some flip-flops and invites the specter of metastability.
Virtex-4 FPGAs use a conservative synchronizer design that has been demonstrated to work reliably at a 500 MHz read clock rate. We ran a week-long test with ~200 and ~500 MHz asynchronous clock rates, generating EMPTY more than 10^14 times without a single failure. The synchronizer delays the trailing edge of EMPTY by a few read clock periods. This latency is acceptable, since it does not affect top performance.
In a similar way, the trailing edge of FULL is synchronized to the write clock. The software default is for FULL to have one write clock latency. We therefore recommend using ALMOST FULL instead.
A well-designed FIFO buffer should never go FULL, and should go EMPTY only when you want to drain the last word from the buffer.
Conclusion
The hard-coded FIFO controller is available in every Virtex-4 block RAM, and uses no additional resources in the fabric. It also saves you from making any complex, time-consuming, and risky design decisions.
For a detailed description of the Virtex-4 FIFO controller, visit the Virtex-4 User Guide on the Xilinx website at www.xilinx.com/bvdocs/userguides/ug070.pdf.
Would you like to write for Xcell Publications?
It’s easier than you think.
We recently launched the Xcell Publishing Alliance to help you publish your technical ideas. We can help you –
from concept research and development, through planning and implementation, all the way to publication and marketing.
Submit articles for our Web-based Xcell Online or our printed Xcell Journal and we will assign an editor and a graphics artist to work
with you to make your work look as good as possible. Submit your book concepts and we will bring our partnership with Elsevier,
the largest English language publisher in the world, and our broad industry resources to assist you in planning, research,
writing, editing, and marketing.
For more information on this exciting and highly rewarding program, please contact:
by Ralf Kreuger
Sr. Staff Applications Engineer
Xilinx, [email protected]
As FPGAs grow in size, quality on-chip clockdistribution becomes increasingly important.Clock skew and clock delay impact deviceperformance; managing clock skews anddelays with conventional clock trees becomesmore difficult in larger devices.
Xilinx® Virtex-4™ devices solve thischallenge by providing as many as 20 fullydedicated on-chip digital clock manage-ment (DCM) circuits. DCM provides zeropropagation delay and – along with fullydifferential global clock trees – low clockskew between output clock signals distrib-uted throughout the device.
Each DCM can drive up to 12 of the 32global clock routing networks within thedevice. The global clock distribution net-work minimizes clock skews due to loadingdifferences. By monitoring a sample of theDCM output clock, the delay locked loop(DLL) compensates for the delay on therouting network, effectively eliminating thedelay from the external input port to theindividual clock loads within the device.
Digital Clock Management in Virtex-4 DevicesDigital Clock Management in Virtex-4 Devices
64 Xcell Journal First Quarter 2005
The new Virtex-4 FPGAincludes improvements and additions to the digital clock module.
The new Virtex-4 FPGAincludes improvements and additions to the digital clock module.
ENGINEER ING SOLUT IONS
In addition to providing zero delay withrespect to a user source clock, DCM pro-vides multiple phases of the source clock.The DLL can act as a clock doubler ordivide the user source clock by up to 16.
DCM can also act as a clock mirror. Bydriving DCM output off-chip and back inagain, you can use it to de-skew a board-level clock between multiple devices.
Digital Phase Shift (DPS)
Virtex-4 FPGAs provide a digital phase shift (DPS) module that phase shifts the DCM’s output clock in small increments – 1/256th of its period. You can operate the versatile DPS in four different modes for maximum flexibility: fixed, variable-positive, variable-center, and direct.
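To get a feel for the 1/256-period resolution, the short Python sketch below converts a number of phase-shift steps into a time offset. The function name and the 200 MHz example clock are assumptions chosen for illustration; they are not taken from the Virtex-4 documentation, and the sketch does not model the four DPS modes or their legal ranges.

```python
def phase_shift_ps(clock_mhz: float, steps: int) -> float:
    """Time offset in picoseconds for `steps` increments of 1/256 of the clock period."""
    period_ps = 1e6 / clock_mhz      # a period of 1/MHz microseconds, expressed in ps
    return period_ps * steps / 256


# Hypothetical example: a 200 MHz clock has a 5000 ps period.
one_step = phase_shift_ps(200.0, 1)      # one increment
quarter = phase_shift_ps(200.0, 64)      # 64 increments, a quarter of the period
print(f"one step = {one_step:.2f} ps, 64 steps = {quarter:.1f} ps")
```

At this assumed clock rate one step is just under 20 ps, which is the kind of granularity you have available when tuning margins.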
Digital Frequency Synthesis (DFS)
The DCM digital frequency synthesis (DFS) module provides two outputs, CLKFX and CLKFX180, derived from the input clock by frequency multiplication and division. Through a frequency calculator, you provide the multiply and divide values implemented by the DFS module. For example, an M value of 19 and a D value of 8 yields a 2.375 source clock multiplier.
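The M/D arithmetic can be checked with a few lines of Python. This is only a sketch of the ratio calculation: the function name and the 100 MHz input clock are assumptions for illustration, and the legal ranges of M, D, and the output frequency are not modeled here (consult the user guide for those).

```python
from fractions import Fraction


def clkfx_mhz(clkin_mhz: int, m: int, d: int) -> Fraction:
    """CLKFX frequency as CLKIN * M / D, kept exact with Fraction."""
    if m <= 0 or d <= 0:
        raise ValueError("M and D must be positive integers")
    return Fraction(clkin_mhz) * Fraction(m, d)


multiplier = Fraction(19, 8)                  # the M = 19, D = 8 example from the text
print(float(multiplier))                      # 2.375
print(float(clkfx_mhz(100, 19, 8)))           # 237.5 for an assumed 100 MHz input
```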
DCM Features
DCMs are located in the center column of the Virtex-4 architecture. This enables well-matched clock routes to and from every DCM for enhanced symmetry.
The Virtex-4 DCM’s superior performance does not just include a wider operating range. It encompasses lower jitter, improved phase accuracy, finer phase-shift resolution, tolerance of imperfect clocks and board designs, less duty-cycle distortion, and less sensitivity to sporadic voltage changes.
Xilinx also added new features. You now have the choice to trade off a wider phase shift range versus higher frequencies.
In addition, a new function in the Virtex-4 architecture is the dynamic reconfiguration port (DRP). The DRP allows you to directly access some features in DCM through a block RAM-style interface. You can directly phase shift the delay line elements and change M and D values.
The software view of DCM has changed as well. Three Virtex-4 primitives – DCM_BASE, DCM_PS, and DCM_ADV – offer progressive features to enhance your design choices.
Xilinx also added a new DCM companion block, the phase-matched clock divider (PMCD), to the Virtex-4 family. Let’s discuss the clock management features of these new clock resources.
Phase-Matched Divided Clocks
PMCDs create as many as four frequency-divided and phase-matched versions of an input clock, CLKA. The output clocks are a function of the input clock frequency: divided-by-1 (CLKA1), divided-by-2 (CLKA1D2), divided-by-4 (CLKA1D4), and divided-by-8 (CLKA1D8).
CLKA1, CLKA1D2, CLKA1D4, and CLKA1D8 output clocks are rising-edge aligned to each other, but not to the input (CLKA). Figure 1 illustrates the new PMCD primitive.
Phase-Matched Delay Clocks
PMCDs preserve edge alignments, phase relations, or skews between the CLKA input clock and other PMCD input clocks. Three additional inputs (CLKB, CLKC, and CLKD) and three corresponding delayed outputs (CLKB1, CLKC1, and CLKD1) are available. The same delay is inserted to CLKA, CLKB, CLKC, and CLKD; thus, the delayed CLKA1, CLKB1, CLKC1, and CLKD1 outputs maintain edge alignments, phase relationships, or the skews of their respective inputs.
You can use PMCDs alone or with other clock resources, including global buffers and DCMs. Together, these clock resources provide flexibility in managing complex clock networks.
The PMCDs are located in the center column right next to the DCMs. They are grouped as pairs in each tile.
Conclusion
The many features and functions of the clock management subsystem allow you to maximize system performance. By taking advantage of DCM to remove on-chip clock delay, you can greatly simplify and improve system-level designs involving high fan-out, high-performance clocks. Virtex-4 devices have an abundance of clock management resources along with comprehensive software support.
Specialized individual features further improve the ability to optimize design performance. Frequency synthesis is a powerful feature to generate a wide range of frequencies in the FPGA or the entire system. A fine-resolution phase-shift capability allows you to improve margins. And the new PMCD further increases the number of clock derivatives that can be generated without the use of additional DCMs.
For more information, see the user guide at www.xilinx.com/bvdocs/userguides/ug070.pdf.
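The relationship between the PMCD divided outputs can be sketched in Python. This is an idealized model, not a description of the silicon: it assumes all dividers start together at cycle 0 and ignores the reset and release controls. The function names and the 400 MHz example input are hypothetical.

```python
def pmcd_divided_outputs(fa_mhz: float) -> dict:
    """Frequencies of the four CLKA-derived outputs for an input of fa_mhz."""
    return {
        "CLKA1": fa_mhz,
        "CLKA1D2": fa_mhz / 2,
        "CLKA1D4": fa_mhz / 4,
        "CLKA1D8": fa_mhz / 8,
    }


def rising_edges(divide: int, n_cycles: int) -> list:
    """CLKA1 cycle indices at which a divide-by-`divide` output rises,
    assuming (for this sketch) that every divider starts at cycle 0."""
    return [n for n in range(n_cycles) if n % divide == 0]


outputs = pmcd_divided_outputs(400.0)
print(outputs)

# Rising-edge alignment: every rising edge of the slowest output
# coincides with a rising edge of each faster output.
slow = set(rising_edges(8, 32))
print(all(slow <= set(rising_edges(d, 32)) for d in (1, 2, 4)))
```

The second check is the property that matters in practice: logic clocked by a divided output can exchange data with logic clocked by CLKA1 on a common rising edge.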
Figure 1 – PMCD primitive (inputs CLKA, CLKB, CLKC, CLKD, RST, RELEASE; outputs CLKA1, CLKA1_D2, CLKA1_D4, CLKA1_D8 at Fa, Fa/2, Fa/4, Fa/8, and CLKB1, CLKC1, CLKD1 at Fb, Fc, Fd)
Xesium clocking networks are an innovative feature in Virtex-4 devices.
Digital designs require good clock signals with a short delay and minimal skew, so that they arrive almost simultaneously at their many on-chip destinations. Clocks must maintain their duty cycle, which is especially important in double-data-rate designs where data is clocked on the rising as well as on the falling clock edge. Those delays and edge rates must therefore always be closely matched, independent of their loading.
Although single-clock operation is desirable, many systems require multiple clocks. Often, input and output signals are clocked very fast and require even better timing precision than the general logic implemented on the chip.
Xilinx® Virtex-4™ FPGAs provide significant advances in all of these areas. Global clocks can reach all flip-flops on the chip, and high-speed I/O clocks provide exceptional performance, especially for source-synchronous interfaces. Additional regional clocks serve specific areas on the chip.
Clock Regions
For clocking purposes, each Virtex-4 device is divided into regions. The number of regions varies with device size, from 8 regions in the smallest device to 24 regions in the largest one.
Global Clocks
Independent of array size, each Virtex-4 FPGA has 32 low-skew global clock distribution networks that can each clock all sequential resources on the whole chip (CLBs, block RAMs, DCMs, and I/Os) and also drive logic signals. You can use any 8 of these 32 global clock lines in any region.
All global clock inputs have dedicated fast routing to the corresponding global clock buffer, which can also be used as a clock-enable circuit or a glitch-free multiplexer. It can select between two clock sources and can also switch away from a failed clock source – a new feature in the Virtex-4 architecture.
A global clock buffer is often driven by a digital clock manager (DCM) to eliminate the clock distribution delay, or to adjust its delay relative to another clock. There are more global clocks than DCMs, and a DCM often drives more than one global clock.
Virtex-4 clock trees are designed for low skew and low power. Any unused branch is automatically disconnected. All global clock lines and buffers are implemented differentially. This minimizes duty-cycle distortion and improves common-mode noise rejection. The whole global clock network is designed for 500 MHz operation and beyond.
I/O Clocks and Regional Clocks
Virtex-4 devices have two additional clock types: I/O clocks and regional clock networks, two of each per region, used primarily for clocks forwarded into the Virtex-4 FPGA. I/O and regional clock networks are independent from the global clock networks, thus offering a maximum of 12 independent clock domains in any clock region.
Each clock region has two pairs of clock-capable inputs, optimized for incoming high-frequency clocks. Clock-capable I/O pairs, like global clock inputs, are regular I/O pairs where the LVDS output drivers have been removed to reduce the input capacitance.
Each of these input pins or input pin pairs can connect to a BUFIO that drives a high-speed differential I/O clock network, which is dedicated to the I/O circuits and is ideally suited for source-synchronous data capture using the built-in serializer/deserializer (SerDes).
Each BUFIO can drive all I/O logic in its region as well as in the two adjacent regions (Figure 1). This means that one receive clock can control up to 47 differential or 95 single-ended receive data lines, ideal for many networking and memory interface applications.
Regional clocks form a third type of clock network, each being able to span as many as three adjacent clock regions. Regional clocks drive single-ended nets and are intended for the parallel clock domain of the SerDes.
You can program the regional clock buffer to divide the incoming clock rate by any integer number from one to eight. This feature, in conjunction with the programmable SerDes in the I/O block, allows source-synchronous systems to cross clock domains without using additional logic resources.
Conclusion
Virtex-4 clocking resources have been optimized for high clock rates and multiple clock domains. Thirty-two global clock networks provide high-performance clocking across the whole chip, with short delay, low skew, and stable duty cycles.
Many localized clock networks serve the I/O for high-speed source-synchronous applications. These clock networks are used in conjunction with the built-in SerDes and reduce the burden on global clock resources.
Last but not least, all of these resources are easy to use. They are automatically handled by the Xilinx ISE 6.3i software.
For more information, visit www.xilinx.com/products/virtex4/capabilities/xesium.htm.
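The per-region numbers quoted in this article (12 clock domains per region, and 47 differential or 95 single-ended data lines per receive clock) can be reproduced with simple arithmetic. The sketch below is one plausible way to arrive at them. The figure of 32 I/O pins per region on one side of the device is an assumption inferred from the 16 two-pin I/O tiles drawn in Figure 1, not a number stated in the text, and the function names are hypothetical.

```python
GLOBAL_CLOCKS_PER_REGION = 8      # "any 8 of these 32 global clock lines in any region"
IO_CLOCKS_PER_REGION = 2          # two I/O clock networks per region
REGIONAL_CLOCKS_PER_REGION = 2    # two regional clock networks per region


def max_clock_domains_per_region() -> int:
    """Independent clock domains available inside one clock region."""
    return (GLOBAL_CLOCKS_PER_REGION
            + IO_CLOCKS_PER_REGION
            + REGIONAL_CLOCKS_PER_REGION)


def bufio_data_lines(ios_per_region: int = 32, regions_spanned: int = 3) -> tuple:
    """(differential, single_ended) data lines one BUFIO clock can serve.

    Assumes one pin (or one pin pair) of the spanned I/O is used by the
    forwarded clock itself; ios_per_region = 32 is an assumed value.
    """
    total_pins = ios_per_region * regions_spanned
    single_ended = total_pins - 1          # one pin carries the clock
    differential = total_pins // 2 - 1     # one pair carries the clock
    return differential, single_ended


print(max_clock_domains_per_region())   # 12
print(bufio_data_lines())               # (47, 95)
```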
Figure 1 – BUFIO and BUFR clocking up to three regions (diagram labels: clock-capable I/O, BUFIO, BUFR, I/O tiles, CLBs, BRAM, DSP tiles; clock routes to the center of the die and to the adjacent regions)
Alpha Blending Two Data Streams Using a DSP48 DDR Technique
Achieve full throughput of the DSP48 slice with a double-data-rate technique.
by Reed Tidwell
Sr. Staff Applications Engineer
Xilinx, [email protected]
The XtremeDSP™ system feature, embodied as the DSP48 slice primitive in the Xilinx® Virtex-4™ architecture, is a high-performance computing element operating at an industry-leading 500 MHz. The design of the Virtex-4 infrastructure supports this rate, with Xesium clock technology, Smart RAM, and LUTs configured as shift registers.
Many applications, however, do not have data rates of 500 MHz. So how can you harness the full computing performance of the DSP48 slice with data streams of lower rates?
The answer is to use a double-data-rate (DDR) technique through the DSP48 slice. The DSP48 slice, operating at 500 MHz, can multiplex between two data streams, each operating at 250 MHz.
One application of this technique is alpha blending of video data. Alpha blending refers to the combination of two streams of video data according to a weighting factor, called alpha. In this article, we’ll explain the techniques and design considerations for applying DDR to two data streams through a single DSP48 slice.
Virtex-4 DSP48
The DSP system elements of Virtex-4 FPGAs are dedicated, diffused silicon with dedicated, high-speed routing. Each is configurable as an 18 x 18-bit multiplier; a multiplier followed by a 48-bit accumulator (MACC); or a multiplier followed by an adder/subtracter. Built-in pipeline stages provide enhanced performance for 500 MHz throughput – 35% higher than for competing technologies.
All Virtex-4 devices have DSP48 slices, although the SX family contains the largest number (an industry-high 512) and the highest concentration of DSP48 slices to logic elements, making it ideal for math-intensive applications such as image processing.
A triple-oxide 90 nm process makes the DSP48 slice very power-efficient.
Architectural features, including built-in pipeline registers, accumulator, and cascade logic nearly eliminate the use of general-purpose routing and logic resources for DSP functions, and further reduce power. This slashes DSP power consumption to a fraction when compared to Virtex-II Pro™ devices.
DDR with Two Data Streams
DDR, in this context, refers to multiplexing two input data streams into one stream at twice the rate, interleaving (in time) the data from each stream (Figure 1). Figure 1 also shows the reverse operation, creating two parallel resultant streams after processing.
You can drive the DSP48 slice inputs at the fast 500 MHz clock rate from CLB flip-flops; CLB LUTs configured as shift registers (SRL16); or directly from block RAM. Block RAM, configured as a FIFO using the built-in FIFO support, also supports the 500 MHz clock rate.
Design Considerations
Dealing with data at 500 MHz requires great care; you should observe strict pipelining with registers on the outputs of each math or logic stage. The DSP48 slice provides optional pipeline registers on the input ports, on the multiplier output, and on the output port from the adder/subtracter/accumulator. Block RAM also has an optional output register for efficient pipelining when interfaced to the DSP48 slice.
Where you are using CLBs, place only minimal levels of logic between registers to provide maximum speed. For DDR operation, only a 2:1 mux (a single LUT level) is required between pipeline stages. Whether you are interfacing to the DSP48 slice with memory or CLBs, placing connected 500 MHz elements in close proximity minimizes connection lengths in the general routing matrix.
DDR requires the DSP48 slice to operate at double the frequency of the input data streams. You can use a DCM to provide a phase-aligned double-frequency clock using the CLK2X output.
Another aspect of inserting DDR data through a section of pipeline is ensuring that data passes cleanly between clock domains. This may require adding extra registers clocked with the double-frequency clock at the output of the double-pumped section, to synchronize the data with the original clock. The rule of thumb is that in order to insert a double-pumped section cleanly into a single-pumped pipeline, there must be an even number of register delays in the double-pumped section.
Figure 1 – DSP48 DDR (Data Stream 0 and Data Stream 1 are multiplexed into one DDR data stream through the DSP48, then separated into Processed Stream 0 and Processed Stream 1; clocks clk1x and clk2x)
Figure 2 – Two-stream multiply through DSP48 slice (inputs A0, A1, B0, B1; outputs out0 = A0 * B0 and out1 = A1 * B1, registered on clk1x)
In Figure 2, stream 0 consists of A0 and B0 inputs. We multiply them together and output as out0. Likewise, stream 1 consists of inputs A1 and B1 multiplied together and output as out1. There are two clock domains: the clk1x domain, at the nominal data stream frequency, and the clk2x domain, at twice the nominal frequency.
Figure 2 shows two registers after the multiplier. The second is the accumulation register, even though we do not use accumulation in this configuration. The register, however, is still required to achieve the full, pipelined performance. We use two sets of registers on the inputs of the DSP to make the total delay through the DSP48 slice an even number (four) for easier alignment of the output data with clk1x. These registers are “free” because they are built into the DSP48 slice, and using them reduces the need for alignment registers external to the DSP48 slice. The extra pipeline register on out0 compensates for taking stream 0 into the DSP one clk2x cycle before stream 1. As seen from the timing diagram in Figure 3, this is required to re-align the stream 0 data back into the clk1x domain.
Note that the input mux select, mux_sel, is essentially the inverse of clk1x. It is important, however, to generate this signal from a register based on clk2x (rather than deriving it from clk1x) to avoid hold-time violations on the receiving registers.
At the transitions between clock domains, the data have only one clk2x period to set up. This is the reason to have no logical operations between registers in the two domains. The placement of the first registers in the clk1x domain is more critical than other registers in the same domain.
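The two-stream multiply described above can be modeled behaviorally in Python. This is a functional sketch only: it shows the interleave, multiply, and de-interleave steps, and it deliberately does not model the pipeline registers, the four-cycle latency, or the mux_sel timing. The function name is hypothetical.

```python
def ddr_multiply(a0, b0, a1, b1):
    """Model two clk1x streams sharing one multiplier that runs at clk2x."""
    if not (len(a0) == len(b0) == len(a1) == len(b1)):
        raise ValueError("all four input streams must have the same length")

    # Multiplex: within each clk1x period the multiplier sees
    # stream 0 first and stream 1 one clk2x cycle later.
    interleaved = []
    for n in range(len(a0)):
        interleaved.append((a0[n], b0[n]))
        interleaved.append((a1[n], b1[n]))

    # The single multiplier handles one operand pair per clk2x cycle.
    products = [a * b for a, b in interleaved]

    # Demultiplex: even clk2x slots belong to stream 0, odd slots to stream 1.
    out0 = products[0::2]
    out1 = products[1::2]
    return out0, out1


out0, out1 = ddr_multiply([1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 2])
print(out0, out1)   # [4, 10, 18] [7, 0, 18]
```

In hardware, the de-interleave step is where the extra out0 register and the even-latency rule come in; here the list slicing stands in for that alignment.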
Alpha Blending
Alpha blending of video streams is a method of blending two images into a single combined image, such as fading between two images, overlaying anti-aliased or semi-transparent graphics over an image, or making a transition band between two images on a split-screen or wipe. Alpha is a weighting factor defining the percentage of each image in the combined output picture. For two input pixels P0 and P1 and a blend factor α, where 0 ≤ α ≤ 1.0, the output pixel Pf will be:
Pf = αP0 + (1 − α)P1 (see Figure 4)
This operation is performed separately for each component: red, green, and blue.
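As a software reference for the blend equation, the following Python sketch applies Pf = αP0 + (1 − α)P1 to each color component. The 8-bit component range, the round-half-up step, and the function names are assumptions for illustration; the article does not specify a pixel format.

```python
def blend_component(p0: int, p1: int, alpha: float) -> int:
    """Blend one color component: alpha * p0 + (1 - alpha) * p1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1.0")
    value = alpha * p0 + (1.0 - alpha) * p1
    return int(value + 0.5)          # round half up (components are non-negative)


def blend_pixel(rgb0, rgb1, alpha: float) -> tuple:
    """Apply the same blend separately to red, green, and blue."""
    return tuple(blend_component(c0, c1, alpha) for c0, c1 in zip(rgb0, rgb1))


red = (255, 0, 0)
blue = (0, 0, 255)
print(blend_pixel(red, blue, 1.0))    # (255, 0, 0): only the first image
print(blend_pixel(red, blue, 0.0))    # (0, 0, 255): only the second image
print(blend_pixel(red, blue, 0.25))   # (64, 0, 191)
```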
A pixel rate of 250 MHz or less is sufficient for all standard and high-definition video rates, and common Video Electronics Standards Association (VESA) standards as high as 1600 x 1200 at 85 Hz. Therefore, one DSP48 slice can perform the multiply and add on one component, and a set of three slices can alpha blend the three components from each of two video streams, as shown in Figure 5. The operations must be performed identically and in parallel on each of the three components.
There are several ways to implement alpha blending depending on the nature of the video streams and how alpha is generated. Figure 6 shows a basic implementation with two video streams alternating as one multiplier input. The other multiplier input alternates between alpha and 1 – alpha.
The operating mode of the adder alternates between add zero (pass through) mode and add output (accumulate) mode. The DSP48 slice output register contains the result of the Video0 * alpha multiply during one clock cycle, and the final result (Video1 * (1 – alpha) + Video0 * alpha) on the alternate clock. Figure 7 shows the timing for this configuration.
The align registers on the inputs of the DSP are used to make the total delay through the DSP48 slice an even number (four), as explained in the previous example. The final output register for blend loads new data on every other DSP clock to register the blend results at the original pixel rate.
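The alternating pass-through and accumulate modes can be illustrated with a small cycle-level model. This is a sketch under stated assumptions, not the reference design: alpha is represented as an 8-bit fixed-point fraction (256 means 1.0), rounding is done by adding half the scale before shifting, and pipeline latency is ignored. The names ALPHA_ONE and dsp48_blend_stream are hypothetical.

```python
ALPHA_ONE = 256   # assumed fixed-point scale: alpha_fp = 256 represents alpha = 1.0


def dsp48_blend_stream(video0, video1, alpha_fp: int) -> list:
    """Blend two component streams with one shared multiply-add unit."""
    if not 0 <= alpha_fp <= ALPHA_ONE:
        raise ValueError("alpha_fp must be in the range 0..256")

    blended = []
    for v0, v1 in zip(video0, video1):
        # First clk2x cycle: adder in add-zero (pass-through) mode.
        p = v0 * alpha_fp
        # Second clk2x cycle: adder in accumulate mode.
        p = p + v1 * (ALPHA_ONE - alpha_fp)
        # The blend register loads on every other clk2x cycle, so it sees
        # only the completed sum. Scale back to the component range.
        blended.append((p + ALPHA_ONE // 2) >> 8)
    return blended


# alpha = 64/256 = 0.25
print(dsp48_blend_stream([255, 100], [0, 200], 64))   # [64, 175]
```

Three such units, one per color component, correspond to the three-slice arrangement the article describes.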
Conclusion
You can efficiently use the high performance of Virtex-4 devices with DSP48 slices by processing multiple data streams in a time-multiplexed fashion. With careful design, a single DSP48 can perform multiply operations on two independent data streams, operating at 250 MHz each.
Alpha blending of video streams, as outlined in this article, is one example of processing two data streams through a single DSP48 slice. This capability complements the DSP features of Virtex-4 FPGAs – including built-in pipelining and cascading, integrated 48-bit accumulator, and an abundance of DSP48 slices in the SX family – to make Virtex-4 devices the ideal DSP platform.
For details about the DSP48 slice, refer to the “Virtex-4 FPGA Handbook,” Chapter 10, or the “XtremeDSP Design Considerations User Guide” at www.xilinx.com/bvdocs/userguides/ug073.pdf.
Dynamic Phase Alignment with ChipSync Technology in Virtex-4 FPGAs
ChipSync technology built into every I/O supports dynamic phase alignment solutions for high-speed source-synchronous interfaces.
by Tze Yeoh
Product Applications Engineering
Xilinx, [email protected]
Xilinx® FPGAs provide connectivity in very high speed source-synchronous bus interfaces. Transmission rates of 1 Gbps and higher are not uncommon for these types of interfaces.
In source-synchronous interfaces, the transmitter forwards a dedicated clock along with the data. As data rates skyrocket to 1 Gbps and beyond, you may find that your timing budgets are eaten away by skew and jitter.
Skew is defined as the difference in arrival time between signals sent at the same time. It is caused by variations in board trace lengths, connectors, package flight-time delays, and secondary parasitic effects. Figure 1 illustrates how the improper routing of board traces and the use of connectors contribute to skew at the receiver.
Another challenge is jitter, the deviation from ideal timing caused mostly by slow transition times, ground bounce, inter-symbol interference, and electromagnetic interference. Figure 2 illustrates the combined effects of skew and jitter on a system designer’s timing budget.
In a real system, many bits of data (16, for example) are received in parallel and must be clocked into the receiver by the common clock sent together with the data. Ideally, the clock edge arrives in the middle of the bit time, thus offering a maximum timing margin.
But in reality, the individual data bits arrive at slightly different times, and each suffers from timing jitter on its rising and falling edges; the clock signal also suffers from timing jitter. All of these effects combine to limit the data-valid window, and thus might lead to unreliable data transmission.
Figure 1 – Improper board trace routing and use of connectors contribute to skew (diagram labels: SPI-4.2 source, connector, SPI-4.2 sink)
Figure 2 – Effects of skew and jitter on timing budgets (diagram labels: bit period; data valid window; clock jitter and data jitter; skew between clock and data plus skew between data channels)
Virtex-4™ data and clock inputs offer ChipSync™ technology, facilitating dynamic phase alignment (DPA). DPA can greatly reduce the skew between different data lines, as well as between the data lines and their associated clock input.
Using a system-generated training pattern, the receiving FPGA can adjust the input delay of each data and clock input, using individual precision delay lines on every input buffer. Gross errors exceeding one bit time pass through the bit-serial interface, but can be corrected after serial-to-parallel conversion using the Bitslip module.
A Generic Networking Interface Example
The generic interface is defined by a 16-channel bus and a forwarded clock. The signaling standard is low-voltage differential signaling (LVDS). The interface protocol specifies a de-skewing method called “training.” During the initialization phase, the transmitter sends a repetitive 20-bit training pattern. The receiver uses it to de-skew the interface by delaying each data bit such that it is optimally centered over the received clock edge. The interface specification calls for the receiver to correct data skew as much as +/- 1 bit time of channel-to-channel skew.
This fine-grained delay adjustment uses a 64-tap delay line with a counter-controlled tap multiplexer available on each input. All of the delay lines in a region are continuously being calibrated by a servo system using a dedicated delay line, a 200 MHz user-provided clock, and a phase-comparator-driven PLL circuit that adjusts the delay line(s) such that the 64-stage delay equals one period of the clock (5 ns / 64 = 78 ps per tap).
All delay lines in one region share a common adjustment, and thus have the same tap delay, as accurately as delay tracking in a small silicon area allows. The reference frequency is specified, tested, and supported by software at 200 MHz. Minor variations can be tolerated, and jitter is filtered out by the control structure. This programmable precision delay will find its way into many innovative applications. Here it is described only as a method to achieve dynamic phase alignment.
The ChipSync technology built into every I/O contains a dedicated serial-to-parallel converter that converts the high-speed serial stream to a sequence of parallel words that can be processed at a much slower rate within the FPGA. This feature decouples the high-speed serial data transfer from the clock rate supported by the FPGA fabric.
The converter supports both single data rate (SDR) and double data rate (DDR) modes. In SDR mode, the serial-to-parallel converter is fully programmable to generate anywhere from 2- to 8-bit parallel words. In DDR mode, the converter can be programmed to de-serialize by a factor of 4, 6, 8, or 10, as specified by the HDL attributes of the ChipSync technology. The maximum width in a single ChipSync module is six. For larger bit widths, you can connect two adjacent ChipSync modules in master-slave mode.
Word alignment can correct for data skew greater than one bit period by comparing the parallel version of the incoming pattern to the pre-specified training pattern. The Bitslip module enables you to match an incoming data stream to a pre-determined data pattern by shifting the output of the dedicated serial-to-parallel converter. An example of this feature in operation is given in Figure 3.
The IDELAY, SERDES, and Bitslip features are encapsulated in a module called ISERDES, available as part of the ChipSync technology in every single I/O.
The Virtex-4 DPA Solution
Let’s use the Virtex-4 ChipSync technology features previously described to create a DPA solution that meets interface requirements. There are three basic steps in the solution:
• Bit alignment – completed during the initialization procedure, its purpose is to correct for skew less than one bit time and position the clock edge at the center of the data eye
• Word alignment – completed during the initialization procedure, its purpose is to align the incoming data stream to the pre-determined training pattern
• Real-time window monitoring – continuously monitors the data eye so that the clock edge is always centered to the data eye
Figure 4 illustrates the implementation of DPA in a Virtex-4 device.
The goal of the bit-alignment procedure is to position the captured clock edge in the center of the data eye to provide maximum margin. The bit-alignment procedure takes advantage of the dedicated 64-tap delay line feature of the ISERDES.
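One way to picture bit alignment is as a sweep over the 64 taps followed by a search for the middle of the widest stable window. The Python sketch below shows that idea only; it is not the algorithm used in the Xilinx reference design. The input list of per-tap results, the function name, and the example window are assumptions, and wrap-around of the window past tap 63 is not handled.

```python
TAP_PS = 5000 / 64    # 5 ns reference period / 64 taps = 78.125 ps per tap


def center_of_eye(stable):
    """Return the tap index in the middle of the longest run of stable taps.

    `stable` holds one boolean per tap: True where sampling at that tap
    delay returned consistent data. Returns None if no tap was stable.
    """
    best_start, best_len = None, 0
    run_start = None
    # A trailing False acts as a sentinel that closes a run ending at the last tap.
    for tap, ok in enumerate(list(stable) + [False]):
        if ok and run_start is None:
            run_start = tap
        elif not ok and run_start is not None:
            run_len = tap - run_start
            if run_len > best_len:
                best_start, best_len = run_start, run_len
            run_start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


# Hypothetical sweep result: taps 10 through 29 sample cleanly.
sweep = [10 <= tap < 30 for tap in range(64)]
tap = center_of_eye(sweep)
print(tap, tap * TAP_PS)    # 19 1484.375
```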
The word alignment procedure aligns the output pattern from the ISERDES to a specific training pattern. This procedure effectively removes word skew and aligns all channels to a specific word boundary. The word alignment unit primarily uses the Bitslip module of the ISERDES. Each channel monitors the pattern streaming in. If the training pattern is not found, activate Bitslip until the pattern is found. Once found, the channel is – by definition – de-skewed.
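The "slip until the pattern matches" loop can be sketched as follows. For simplicity, each Bitslip is modeled as a one-bit left rotation of the parallel word; this is an assumption made for the sketch, and the actual shift sequence of the ISERDES Bitslip (which depends on the SDR or DDR mode) should be taken from the user guide. The function names and the 8-bit example pattern are hypothetical.

```python
def rotate_left(word: int, width: int) -> int:
    """Rotate a `width`-bit word left by one bit position."""
    mask = (1 << width) - 1
    return ((word << 1) | (word >> (width - 1))) & mask


def align_word(received: int, training: int, width: int):
    """Count the Bitslip operations needed until `received` matches `training`.

    Returns None if no rotation of the received word matches, which would
    indicate that the channel is not carrying the training pattern.
    """
    word = received
    for slips in range(width):
        if word == training:
            return slips
        word = rotate_left(word, width)   # model of one Bitslip operation
    return None


TRAINING = 0b10010011                           # assumed 8-bit training word
print(align_word(0b01001110, TRAINING, 8))      # 6
print(align_word(0b10010011, TRAINING, 8))      # 0: already aligned
print(align_word(0b11111111, TRAINING, 8))      # None: pattern never appears
```

Because every channel runs the same loop against the same training word, all channels end up on the same word boundary once each one reports a match.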
After the initialization stage using the training procedure, the channels are assumed to remain trained throughout normal operation. However, the data valid window might shift because of operating conditions. The window monitoring unit can continuously monitor the data valid window during normal operation and can adjust the sampling point as necessary to provide maximum margin.
Conclusion
Dynamic phase alignment is a critical function in many bus interfaces as data rates explode into the gigabit range. As FPGAs are increasingly being used directly in the data path of these very high speed interfaces, dynamic phase alignment in the FPGA is a must.
Virtex-4 ChipSync technology built into every I/O enables you to quickly and easily develop a DPA solution that meets your application.
An application note describing the implementation of DPA is available at www.xilinx.com/bvdocs/appnotes/xapp700.pdf. The application note, “Dynamic Phase Alignment for Networking Applications,” is published as XAPP 700. The reference design enables you to quickly understand how to implement a DPA solution that fits your particular application.
Figure 3 – Operation of Bitslip (ISERDES output words shown for the initial pattern and after the 1st through 8th Bitslip; the 8th Bitslip restores the initial pattern)
Figure 4 – Virtex-4 DPA implementation with ChipSync technology (diagram labels: input serial data Dip/Din and clkp/clkn through IBUFDS_DIFF_OUT; Virtex-4 ISERDES with 64-tap delay line in silicon, deserializer, and Bitslip; training controller and real-time window monitoring and adjustment controller in the Virtex-4 FPGA fabric)
Lock Your Designs with the Virtex-4 Security Solution
Virtex-4 FPGAs provide an up-to-date AES encryption scheme to prevent IP or microchip design theft.
Because FPGAs must be configured on each power-up, design security concerns grow as their popularity increases. Without proper protections, attackers could easily clone or reverse-engineer the bitstream during FPGA configuration.
All Xilinx® Virtex-4™ devices have an on-chip decryptor that can be enabled to make the configuration bitstream secure. Virtex-4 has implemented the Advanced Encryption Standard (AES) scheme for securing the bitstream.
Modern Security Design
Xilinx has replaced the Triple DES encryption scheme implemented in the Virtex-II™ architecture with AES. Although both encryption schemes provide a high level of security, AES offers both increased security and throughput over Triple DES by replacing three 56-bit keys with one 256-bit key and allowing configuration clocking frequencies as high as 100 MHz.
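To put the key sizes in perspective, the Python sketch below compares the number of possible keys for three 56-bit keys and for one 256-bit key. It counts nominal key combinations only; it says nothing about the effective strength of either cipher against known attacks. The constant and function names are hypothetical.

```python
TRIPLE_DES_KEY_BITS = 3 * 56    # three 56-bit keys: 168 bits in total
AES_KEY_BITS = 256              # one 256-bit key


def key_space(bits: int) -> int:
    """Number of distinct keys of the given length."""
    return 2 ** bits


aes_keys = key_space(AES_KEY_BITS)
tdes_keys = key_space(TRIPLE_DES_KEY_BITS)

print(f"256-bit key space: {aes_keys:.2e}")                 # about 1.16e+77
print(f"168-bit key space: {tdes_keys:.2e}")
print(f"ratio: 2**{AES_KEY_BITS - TRIPLE_DES_KEY_BITS}")    # 2**88 times more keys
```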
Let’s review some key benefits of the Xilinx Secure Chip solution.
1. AES is an official government standard, FIPS-197, supported by the National Institute of Standards and Technology and the U.S. Department of Commerce. The NSA has also certified AES’ ability to protect classified communication to the top secret level.
2. The AES key can only be programmed through the JTAG interface. This allows you to monitor any unwanted activities on the JTAG lines both externally and internally with the BSCAN_Virtex4 primitive.
3. A battery-backed volatile key provides the maximum protection against hostile hacking.
4. This low-cost solution includes widely available standard components such as a Rayovac™ lithium battery.
5. Encryption key storage (Figure 1) has a long life span (20+ years).
Advanced Encryption Standard (AES)
Although the Triple DES algorithm remains effective against attacks, AES is now replacing DES in many applications as the most secure encryption scheme. As specified by FIPS-197, AES is a FIPS-approved cryptographic algorithm that can be used to protect electronic data.
AES employs a block cipher design that eliminates symmetry in the behavior of the cipher, overcoming shortcomings of the DES key schedule. The non-linearity of the AES key expansion practically eliminates the possibility of equivalent keys.
Because of its key strength, AES is suited for applications such as banking, defense, and government, and for sophisticated technical applications such as ATM, HDTV, broadband ISDN, voice, and satellite.
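To put these key lengths in perspective, the short Python sketch below compares the two key spaces. It is an illustration only: the 168-bit figure assumes three independent 56-bit DES keys, and the search rate is a hypothetical number chosen for the example, not a measured one.

```python
# Key-space arithmetic for the key lengths discussed above.
TRIPLE_DES_KEY_BITS = 3 * 56   # assumption: three independent 56-bit keys
AES_KEY_BITS = 256

SECONDS_PER_YEAR = 365 * 24 * 3600


def key_space(bits: int) -> int:
    """Number of distinct keys for a key of the given length."""
    return 2 ** bits


def years_to_exhaust(bits: int, keys_per_second: float) -> float:
    """Years needed to try every key at a (hypothetical) constant rate."""
    return key_space(bits) / keys_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    rate = 1e12  # hypothetical: one trillion key trials per second
    for name, bits in (("Triple DES", TRIPLE_DES_KEY_BITS),
                       ("AES-256", AES_KEY_BITS)):
        print(f"{name}: {key_space(bits):.2e} keys, "
              f"{years_to_exhaust(bits, rate):.2e} years at {rate:.0e} keys/s")
```

Even at this generous trial rate, an exhaustive search of the 256-bit key space remains far out of reach, which is the practical meaning of "key strength" here.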
Data Encryption Support
The Virtex-4 AES system comprises software-based bitstream encryption and on-chip AES (Rijndael) decryption with cipher block chaining (CBC) to decrypt the incoming bitstream. The AES key is stored in dedicated memory, powered by either an auxiliary power supply (VCCAUX) or an externally connected battery.
To combat a brute-force software attack such as key search, Virtex-4 devices feature a 256-bit AES key system that enables approximately 1.16 x 10^77 possible key combinations. To program the key, the device must enter “key-access mode” in the IEEE 1532 flow via JTAG. Once in this mode, the previous encryption key will be cleared to prevent readback of the key. (Further flow details are documented in the Virtex-4 1532 BSDL files.) If the encryption keys are compromised, you can update the design with new keys and new encrypted bitstreams.
Virtex-4 FPGAs also embed the memory holding the key under layers of metal. Because the key is stored in volatile memory, disrupting the power supply for the key memory during hardware attacks will result in key loss.
You can always use a non-encrypted bitstream to configure the device regardless of the presence of the key. Take care when generating a non-encrypted bitstream, though: the proper security level should be set if you want readback of the non-encrypted bitstream. Reconfiguring the encrypted bitstream, however, would require you to toggle the PROG pin, cycle power, or issue one of two JTAG instructions: JPROG or JSTART.
Internally, you can use the internal configuration access port (ICAP) to reconfigure the device. ICAP provides the same configuration interface as SelectMAP to access configuration logic internally so that you can partially reconfigure the device for extra design security.
In addition to ICAP, Virtex-4 devices can monitor activities on the external JTAG pins with the internal BSCAN_Virtex4 primitive. The BSCAN_Virtex4 primitive mirrors the activity on the TDI pin and outputs several JTAG tap controller states, such as Test-Logic-Reset or Update-DR. Tampering with the JTAG during a “side channel” attack can be detected. You can then take countermeasures such as cutting power to the FPGA – including VBATT – or erasing and writing a new encryption key by once again entering the key access mode.
Moreover, you can return any faulty part to Xilinx for testing without having to provide the encryption key for returned material analysis.
Software Integration
Xilinx ISE™ version 7.1i will have full software support for encrypted bitstream and key creation. Generating an encrypted bitstream requires only two additional bitgen options. For example, “bitgen -g encrypt:yes -g key0:AA995566 top.ncd top.bit” will automatically create an encrypted bitstream (top.bit) and the encryption key (top.nky) with the key of “AA995566.” You must then load the top.nky file into the device through the JTAG interface before loading the encrypted bitstream.
Figure 1 – Encrypted bitstream reference circuit for system-level applications
As for the GUI, Xilinx Project Navigator offers encryption options in the Generate Programming File command. You can set preferences for allowing readback, partial reconfiguration, and encryption.
iMPACT, the Xilinx programming tool, allows you to program just the key or the encrypted bitstream with the key. For independent programming applications, the detailed steps to download the encryption key are documented in the Virtex-4 IEEE 1532 BSDL files, which are installed in the Xilinx/Virtex4/data directory with ISE installation, or downloadable from www.xilinx.com/support/sw_bsdl.htm.
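For scripted builds, the bitgen invocation quoted in this section could be assembled programmatically. The Python sketch below is only an illustration: the helper name bitgen_encrypt_command is hypothetical, and it uses nothing beyond the two options and file names given in the example command line. It builds the argument list without running the tool.

```python
def bitgen_encrypt_command(ncd_file: str, bit_file: str, key0_hex: str) -> list:
    """Build the argument list for the encrypted-bitstream example.

    Hypothetical helper: only the two -g options quoted in the article
    (encrypt and key0) are used; no other bitgen options are assumed.
    """
    int(key0_hex, 16)  # raises ValueError if the key is not hexadecimal
    return ["bitgen",
            "-g", "encrypt:yes",
            "-g", f"key0:{key0_hex}",
            ncd_file, bit_file]


if __name__ == "__main__":
    cmd = bitgen_encrypt_command("top.ncd", "top.bit", "AA995566")
    print(" ".join(cmd))
```

Keeping the command as a list makes it straightforward to hand to a process launcher later, and the hexadecimal check catches a mistyped key before a long tool run starts.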
Battery-Secured Systems
Designing secure systems incorporating batteries for volatile storage is a proven method in multiple markets that is recognized as the highest form of security and is required by the U.S. government for its secured modules (http://csrc.nist.gov/publications/fips/fips140-2/fips1402.pdf).
Several misconceptions exist related to battery use – some believe that batteries will require additional maintenance cycles. These fears are unfounded: maintenance and lifetime are of no concern for most applications, and the lifetime of the battery will usually far exceed the useful lifetime of the product.
All batteries “self discharge” when sitting idle, even with no load. Modern lithium batteries feature extremely low self-discharge rates. Rayovac lithium batteries self-discharge at a rate of less than 0.3% per year. Even at higher temperatures, the self-discharge experiences only very minor deterioration – in this example, let’s use a conservative 0.6% per year. The capacity of the BR1225 is 50 mAh.
Assume that the Virtex-4 IBATT current value is 50 nA. The VBATT signal is routed internally to the PCB to eliminate leakage currents. A self-discharge of 0.6% per year on a 50 mAh cell is 0.3 mAh per year, which over 8,760 hours is equivalent to a continuous current of about 34 nA.
34 nA + 50 nA = 84 nA
50 mAh / 0.000084 mA = 595,238 hours = ~68 years
Thus, a 20-year product life is easily achieved using a battery.
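The same estimate can be reproduced for other batteries or load currents. The following Python sketch follows the article's simplified model, in which self-discharge is treated as a constant extra current; the function names are illustrative, and the inputs are the figures quoted above (50 mAh, 0.6% per year, 50 nA).

```python
HOURS_PER_YEAR = 24 * 365


def self_discharge_current_na(capacity_mah: float, rate_per_year: float) -> float:
    """Average current (nA) equivalent to losing rate_per_year of capacity annually."""
    lost_mah_per_year = capacity_mah * rate_per_year
    return lost_mah_per_year / HOURS_PER_YEAR * 1e6  # mA -> nA


def battery_life_years(capacity_mah: float, load_na: float,
                       rate_per_year: float) -> float:
    """Battery life in years for a constant load plus constant self-discharge."""
    total_na = load_na + self_discharge_current_na(capacity_mah, rate_per_year)
    hours = capacity_mah / (total_na * 1e-6)  # nA -> mA
    return hours / HOURS_PER_YEAR


if __name__ == "__main__":
    leak = self_discharge_current_na(50.0, 0.006)
    life = battery_life_years(50.0, 50.0, 0.006)
    print(f"self-discharge ~{leak:.1f} nA, estimated life ~{life:.0f} years")
```

Because the self-discharge current is not rounded to 34 nA here, the result differs from the hand calculation in the last digit or so, but the conclusion is the same: the estimate comfortably exceeds a 20-year product life.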
For more information about battery life expectancy calculations and design considerations, see Xilinx XAPP766, “Using High Security Features in Virtex-II Series FPGAs,” at http://www.xilinx.com/bvdocs/appnotes/xapp766.pdf.
Conclusion
Virtex-4 devices provide the most up-to-date security option for your designs. With the ease of integrated software flow, minimal board space requirements, and maximum security through AES, the Virtex-4 Secure Chip AES security solution is ideal for keeping hackers from your designs.
For more information about theAdvanced Encryption Standard, please visit:
by Ralf Krueger
Sr. Staff Applications Engineer
Xilinx, [email protected]
Configuration memory in Xilinx® Virtex™ FPGAs is used primarily to implement user logic, connectivity, and I/Os, but it is also used for other purposes. For example, it specifies a variety of static conditions in the two functional blocks, DCMs and RocketIO™ multi-gigabit transceivers (MGTs).
Sometimes an application requires a change in the conditions of the functional blocks while the blocks are operational. You can accomplish this through the global internal configuration access port (ICAP) or through partial dynamic reconfiguration, using JTAG or SelectMAP in the “persist” mode.
Since the late 1990s, all Virtex FPGAs have supported this powerful dynamic partial reconfiguration feature. However, partial dynamic reconfiguration required you to have a detailed knowledge of the configuration logic functions, the configuration registers, and the location of configuration bits.
Dynamic Reconfiguration of Functional Blocks
The Virtex-4 dynamic reconfiguration port offers an innovative way to reprogram functions in the FPGA.
DRP Functionality
The new Virtex-4™ dynamic reconfiguration port (DRP) is an integral part of the two functional blocks, as it simplifies the dynamic reconfiguration process greatly. These configuration ports exist in the DCMs and RocketIO MGTs.
In this article, we’ll describe the addressable, parallel write/read configuration memory implemented in each functional block that permits reconfiguration. This memory has the following attributes:
• It is directly accessible from the FPGA fabric. Configuration bits can be written to and/or read from depending on their function.
• Each bit of memory is initialized with the value of the corresponding configuration memory bit in the bitstream. Memory bits can also be changed later using the ICAP.
• The output of each memory bit drives the functional block logic, so the content of this memory determines the configuration of the functional block.
The address space can include status (read-only) and function enables (write-only). Read-only and write-only operations can also share the same address space. Figure 1 shows how the bitstream configuration bits drive the logic in functional blocks and how the reconfiguration logic changes the flow to read or write the configuration bits.
Figure 1 also lists each signal on the FPGA fabric port. Individual functional blocks can implement all or only a subset of these signals. In general, it is a synchronous parallel-memory port, with separate read and write buses similar to the block RAM interface. Bus bits are numbered from least significant to most significant, starting at 0. All signals are active high.
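To make the port behavior concrete, here is a small cycle-based Python model of a DRP-style port. It is a sketch, not a description of the silicon: the class and helper names are hypothetical, the fixed two-cycle DRDY latency is an assumption (the article only says data is available "after some cycles"), and only the signal names and widths shown in Figure 1 plus the DRDY handshake described in this article are taken from the source.

```python
class DrpPortModel:
    """Toy model of a DRP-style port: 7-bit address, 16-bit data, DRDY pulse."""

    def __init__(self, initial_bits=None, latency=2):
        self.mem = dict(initial_bits or {})  # seeded like bits from the bitstream
        self.latency = latency               # assumed DRDY latency in DCLK cycles
        self.do = 0                          # read data bus (DO)
        self.drdy = 0                        # high for one cycle when done
        self._pending = None                 # [cycles_left, is_write, addr, data]

    def clock(self, den=0, dwe=0, daddr=0, di=0):
        """Advance one rising edge of DCLK with the given input values."""
        self.drdy = 0
        if self._pending is not None:
            self._pending[0] -= 1
            if self._pending[0] == 0:
                _, is_write, addr, data = self._pending
                if is_write:
                    self.mem[addr] = data
                else:
                    self.do = self.mem.get(addr, 0)
                self.drdy = 1
                self._pending = None
        elif den:
            # Inputs are captured on the edge where DEN is high.
            self._pending = [self.latency, bool(dwe), daddr & 0x7F, di & 0xFFFF]


def drp_write(port, addr, data):
    port.clock(den=1, dwe=1, daddr=addr, di=data)
    while not port.drdy:          # wait for the one-cycle DRDY pulse
        port.clock()


def drp_read(port, addr):
    port.clock(den=1, dwe=0, daddr=addr)
    while not port.drdy:
        port.clock()
    return port.do


if __name__ == "__main__":
    port = DrpPortModel({0x50: 0x0001})
    drp_write(port, 0x50, 0x04)   # the article's DCM example: 04h to address 50h
    print(hex(drp_read(port, 0x50)))
```

A controller in the fabric would follow the same pattern: present address, data, DEN, and DWE together for one DCLK edge, then wait for DRDY before starting the next access.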
Synchronous timing for the port is provided by the DCLK input, and all the other input signals are registered in the functional block on the rising edge of DCLK. Input (write) data to the functional blocks is presented simultaneously with the write address and DWE and DEN signals before the next positive edge of DCLK.
The port asserts DRDY for one clock cycle when it is ready to accept more data. The timing requirements relative to DCLK for all the other signals are the same. The output data is not registered in the functional blocks. Output (read) data is available after some cycles following the cycle that DEN and DADDR are asserted. The availability of output data is indicated by the assertion of DRDY.
DRP Implementation in DCMs and MGTs
As mentioned earlier, the DRP is available in two major Virtex-4 functional blocks. Writing a specific value to a specific address will manipulate configuration bits and alter functions or attributes on the fly.
The user and configuration guides describe the address space (locations) and allowed values for each function.
In the DCM, the DRP allows you to make dynamic adjustments to the phase shift value of the digital phase shifter (DPS) and to change the multiply (M) and divide (D) values of the digital frequency synthesizer (DFS). A new primitive, DCM_ADV, has been added to the software tools to show the additional DRP signals. For example, writing a 04h to address 50h will change the M value to 5.
In the MGTs, the DRP allows advanced users to manipulate many attributes of the physical media attachment (PMA) and the physical coding sublayer (PCS). The new signals are part of the regular MGT primitive and can be operated by the application. The MGT implementation makes a large number of settings available for you to change dynamically. Various comma detect, channel bonding, and other attributes can be manipulated.
Conclusion
The Virtex-4 dynamic reconfiguration port provides an easy-to-use, block RAM-style interface that empowers you to modify the functionality of your application while the device is operating. This leads to flexible implementations and an application that can adapt to changing conditions – without having to reconfigure an FPGA with a different bitstream from scratch.
For more information, see the configuration guide, www.xilinx.com/bvdocs/userguides/ug071.pdf.
[Figure 1 labels: FPGA Fabric; Controller; port signals DCLK_B, DEN_B, DWE_B, DADDR_B[6:0], DI_B[15:0], DO_B[15:0], DRDY_B; Configuration Logic; Standard Reconfiguration Port (To Fabric); Reconfigurable Bits (To Block Logic); Bits That Are Not Reconfigurable (To Block Logic); Functional Block (DCMs and MGTs); Block Status; Function Enables]
Figure 1 – Configuration changes
Designing with the Virtex-4 Embedded Tri-Mode Ethernet MAC
Integrate the versatile Virtex-4 10/100/1000 Ethernet MAC into your next programmable SoC design.
by Hamish Fallside
Senior Manager, Systems Engineering, Advanced Product Division
Xilinx, [email protected]
Ethernet is the predominant wired connectivity standard. The range of standard products for Ethernet is large, and it just got bigger with the introduction of the Xilinx® Virtex-4™ FX device family. Combining embedded Ethernet connectivity with the unique flexibility of the Virtex-4 feature set, Xilinx has created a compelling single-chip platform for solutions not possible with existing off-the-shelf products.
The Virtex-4 FX device family contains paired embedded Ethernet media access controllers (MACs) that are independently configurable to meet all common Ethernet system connectivity needs. Each Virtex-4 FX device contains either two or four MACs, implemented using Xilinx IP immersion technology, as shown in Figure 1.
Using standard Xilinx design products, you now have the unprecedented capability to create a huge range of customized packet processing and network end-point products for 10/100/1000 Mbps Ethernet.
An external physical layer device (PHY) is required for the MAC to connect to a network. The Virtex-4 FX device directly supports all standard serial and parallel PHY interfaces for both copper and optical Ethernet connections. In addition, Virtex-4 embedded RocketIO multi-gigabit transceivers can be used to drive Ethernet directly across PCB traces, such as serial backplanes, for in-system connectivity. PHY connections can be routed to any user pin or RocketIO block in the device.
In this article, we'll review the feature set of the embedded Ethernet MAC blocks in Virtex-4 FX devices, and offer some pointers on how you can start right away using them with standard Xilinx design tools, LogiCORE™ IP, and development boards.
Feature Set
The Virtex-4 Ethernet MAC addresses all common configuration requirements for embedded Ethernet connectivity, and is fully compliant with the IEEE 802.3-2002 specification. It will allow you to build Ethernet systems that support VLAN, jumbo frames, and end-to-end flow control.
Built-in hardware address filtering reduces the burden on software of processing unneeded frames. You can independently configure each MAC for multiple rates and topologies:
• 10 Mbps or 100 Mbps full- and half-duplex
• 10/100 Mbps full- and half-duplex
• 1000 Mbps full-duplex
• 10/100/1000 Mbps full-duplex
When used in multi-rate modes, auto-negotiation support is provided.
Connecting the MAC to external PHY and optical modules is supported through the PHY interface to the FPGA fabric. This provides flexible use models for the MAC, allowing, for example, attachment to a shared processor bus or to custom packet processing hardware.
Controlling the MAC in your system is performed through the host interface, which provides flexible software access to the internal registers. Each MAC pair shares a common host interface, which can be directly accessed by the embedded PowerPC™ 405 device control register (DCR) bus, or from the FPGA fabric.
Let’s describe each of these interfaces in more detail.
[Figure 1 labels: Ethernet MAC Block; EMAC0; EMAC1; PHY Interface; Client Interface; Statistics Block; RX Stats MUX0, Tx Stats MUX0; RX Stats MUX1, Tx Stats MUX1; Host Interface; Generic Host Bus; DCR Bridge; DCR Bus (To PowerPC 405 Block); FPGA Fabric]
Figure 1 – Embedded Virtex-4 Ethernet MAC Block, with interfaces to FPGA resources
PHY Interfaces
Your application will require connection to a particular medium – copper, fiber optics, or one of your own invention. The PHY interface provides many options to meet your requirements.
All common interfaces to external media are directly supported in the PHY interface. As the PHY interface is routed to the outside world through FPGA fabric, creating “bump-in-the-wire” solutions in FPGA fabric is straightforward.
PHY interfaces fall into two categories: one using SelectIO™ resources and another using RocketIO serial transceivers.
The first category is typically used to connect to a discrete external PHY:
• Media independent interface (MII) for 10/100 Mbps
• Gigabit MII (GMII), and reduced GMII (RGMII) for 10/100/1000 Mbps
The second category will also connect directly to a discrete external PHY, and is commonly used to connect to small form-factor pluggable (SFP) modules for both optical and copper connectivity:
• Serial GMII (SGMII) for 10/100/1000 Mbps
• 1000BaseX for 1000 Mbps
These interface options have 9-bit signaling that connects to the RocketIO. Embedded state machines in the MAC provide University of New Hampshire-certified operation for link initialization using these options.
An MII management (MIIM) interface is also included, which allows your software to access external PHY registers through this standard IEEE interface. The registers are accessed via the address map in the host interface.
Host Interface
For your software to control the MAC, a host interface provides access to the internal registers. A dedicated DCR bus connects the embedded PowerPC directly to the host interface, requiring no additional FPGA resources. Alternatively, the host interface can also be accessed directly from the fabric, providing a flexible solution for porting legacy driver software. Each pair of MACs shares a single host interface.
The registers accessed through the host interface are used by driver software to initialize and control the MAC during operation. All register values may be preset at power-on from the FPGA fabric. This allows the MAC to be used by applications that do not include a processor and software. The registers provide access to the following settings:
• Independent receiver settings for reset and enable, pause frame address, jumbo and VLAN frame enables, half/full duplex, and passing frame check sequence (FCS) to the client
• Independent transmitter settings for reset and enable, inter-frame gap (IFG) adjustment, jumbo and VLAN frame enables, half/full duplex, and FCS from client
• Independent flow control enables for receiver and transmitter
• RGMII/SGMII status, and speed for fixed and negotiated settings
The address filter provides a single unicast and as many as four multicast Ethernet addresses that are used to match against the destination address of incoming frames. You can set the filter to optionally discard incoming frames that do not match the stored addresses or to simply flag when a match occurs, allowing you to make routing decisions for received frames at hardware speed rather than in software.
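The filtering rule just described can be expressed in a few lines of software. The Python sketch below is a behavioral illustration only: the class and method names are hypothetical, the addresses are made-up examples, and details the article does not cover (such as broadcast handling) are deliberately left out. In the device this comparison is done in hardware.

```python
from dataclasses import dataclass, field


def parse_mac(text: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to six raw bytes."""
    return bytes(int(part, 16) for part in text.split(":"))


@dataclass
class AddressFilter:
    unicast: bytes                                  # the single unicast address
    multicast: list = field(default_factory=list)   # as many as four entries
    discard_unmatched: bool = True                  # False = flag-only mode

    def add_multicast(self, addr: bytes) -> None:
        if len(self.multicast) >= 4:
            raise ValueError("the filter holds at most four multicast addresses")
        self.multicast.append(addr)

    def check(self, frame: bytes):
        """Return (accept, matched) for a raw Ethernet frame."""
        dest = frame[:6]                            # destination address field
        matched = dest == self.unicast or dest in self.multicast
        accept = matched or not self.discard_unmatched
        return accept, matched


if __name__ == "__main__":
    flt = AddressFilter(unicast=parse_mac("00:0a:35:00:00:01"))
    flt.add_multicast(parse_mac("01:00:5e:00:00:fb"))
    frame = parse_mac("01:00:5e:00:00:fb") + bytes(54)
    print(flt.check(frame))
```

The two return values mirror the two modes in the text: in discard mode unmatched frames are dropped, while in flag-only mode every frame is accepted and the match flag is passed along for downstream routing decisions.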
Client Interface
Ethernet frames are passed between the MAC and your design across the client interface, which is divided into receive and transmit sides.
Receiver Side Client Interface
On the receive interface, frame errors and unmatched frames are signaled to the user logic. When flow control is enabled, any valid pause frames received will be flagged as invalid.
Transmitter Side Client Interface
The transmit interface will indicate collisions on half-duplex connections, and will corrupt a truncated frame in the case of FIFO starvation in the middle of a frame. When flow control is enabled, the transmitter interface will automatically assert back pressure on the client when a pause request frame is received from the remote host.
Flow Control and Statistics Vectors
A separate flow control interface allows the client to make pause requests to the far end, allowing the pause interval to be set for each individual request. Separate interfaces provide separate statistics vectors for the receiver and transmitter portions of the MAC. The IEEE-defined statistics are updated on a per-frame basis, and can be accumulated using circuitry in the FPGA fabric.
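To illustrate what a pause request carries, the Python sketch below builds a pause frame and converts a pause interval to time. The MAC generates these frames itself; this is only a software illustration. The frame layout (reserved multicast destination 01-80-C2-00-00-01, MAC control type 0x8808, opcode 0x0001, and a 16-bit interval counted in quanta of 512 bit times) comes from the IEEE 802.3 flow control definition rather than from this article, and the function names and source address are hypothetical.

```python
import struct

PAUSE_DEST = bytes.fromhex("0180c2000001")  # reserved multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
QUANTUM_BITS = 512                          # one pause quantum = 512 bit times


def build_pause_frame(src_mac: bytes, pause_quanta: int) -> bytes:
    """Return a pause frame without the trailing FCS."""
    if len(src_mac) != 6:
        raise ValueError("source address must be six bytes")
    if not 0 <= pause_quanta <= 0xFFFF:
        raise ValueError("pause interval is a 16-bit value")
    header = PAUSE_DEST + src_mac + struct.pack(
        "!HHH", MAC_CONTROL_ETHERTYPE, PAUSE_OPCODE, pause_quanta)
    return header.ljust(60, b"\x00")        # pad to minimum size (64 with FCS)


def pause_time_seconds(pause_quanta: int, link_bps: float) -> float:
    """Length of the requested pause at the given link speed."""
    return pause_quanta * QUANTUM_BITS / link_bps


if __name__ == "__main__":
    frame = build_pause_frame(bytes.fromhex("000a35000001"), 0xFFFF)
    print(frame.hex())
    print(pause_time_seconds(0xFFFF, 1e9), "seconds at 1000 Mbps")
```

Because the interval is expressed in bit times, the same request pauses the far end for ten times longer at 100 Mbps than at 1000 Mbps, which is worth keeping in mind when choosing the value for each request.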
Over-Speed Operation
This feature allows you to clock the MAC at higher rates than allowed by the standard. The double-width interface on the client side means that your design can process frames at the same system frequency as normal operation, but at twice the data width, providing up to 2 Gbps in each direction.
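The arithmetic behind the 2 Gbps figure is simple. The Python sketch below assumes an 8-bit client data path clocked at 125 MHz for normal 1000 Mbps operation; these are typical gigabit Ethernet values used for illustration, not numbers stated in this article.

```python
def client_throughput_bps(width_bits: int, clock_hz: int) -> int:
    """Raw data rate of a parallel interface: one word per clock."""
    return width_bits * clock_hz


CLIENT_CLOCK_HZ = 125_000_000   # assumed client clock for gigabit operation
NORMAL_WIDTH_BITS = 8           # assumed normal client data width
DOUBLE_WIDTH_BITS = 2 * NORMAL_WIDTH_BITS

if __name__ == "__main__":
    for width in (NORMAL_WIDTH_BITS, DOUBLE_WIDTH_BITS):
        rate = client_throughput_bps(width, CLIENT_CLOCK_HZ)
        print(f"{width}-bit client interface: {rate / 1e9:.0f} Gbps")
```

Doubling the width rather than the clock keeps timing closure in the fabric no harder than in normal operation, which is the point of the double-width interface.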
Virtex-4 Ethernet MAC Use Models
The features described previously provide the Virtex-4 Ethernet MAC with multiple use models. Some examples of these are given here, but this should not be considered a complete list.
• Attach the MAC to a CoreConnect PLB or OPB peripheral interface in FPGA fabric to connect to embedded PowerPC or MicroBlaze™ processors, as in Figure 2.
• Create a custom interface to packet processing hardware implemented in FPGA fabric, such as protocol offload, DMA engines, embedded FIFO, and embedded block RAM. Figure 3 shows an example scheme for a Transmission Control Protocol (TCP) offload engine (TOE), and/or Remote Direct Memory Access (RDMA), as covered by the iWARP protocols from the RDMA Consortium.
• Directly connect multiple MAC blocks to Virtex-4 embedded FIFO and external QDR and DDR memory for classification, policing, and switching applications; see Figure 3.
• Provide independent packet monitoring and statistics collection, using custom hardware in FPGA fabric that connects directly to the statistics interface of the MAC blocks.
Any of these use models may be connected to external PHY in multiple system topologies:
• Optical gigabit Ethernet connectivity – connect directly to external optical modules through the Virtex-4 RocketIO transceiver for 1000BaseX operation (Figure 4)
• 10/100 Ethernet connected to external copper PHY through RMII interface implemented between the MII PHY interface and SelectIO pins
• 10/100/1000 tri-mode Ethernet to external PHY or SFP module through SGMII connection to RocketIO transceiver, utilizing a RocketIO block
[Figure 2 labels: External PHY; SelectIO or RocketIO; Virtex-4 MAC (Tx, Rx, Host Interface); Client Transmit; Client Receive; Read Packet FIFO; Write Packet FIFO; Simple DMA or SGDMA; Master Attachment; Slave Attachment; Register, SRAM, and Interrupt Interfaces; PLB Arbiter; CoreConnect Peripheral LogiCORE; FPGA Fabric]
Figure 2 – Embedded MAC connected to embedded PowerPC as a PLB peripheral, with the addition of Xilinx CoreConnect LogiCORE IP
Tools, IP, and Development Boards
Xilinx provides support for the MAC with tools, LogiCORE IP, reference designs, and Virtex-4 development boards.
Virtex-4 Embedded EMAC Wrappers
Using the Xilinx CORE Generator™ tool, you can automatically generate HDL wrappers for the MAC instantiations in your design and completely configure the MAC through the GUI. A low-level software driver for the embedded PowerPC to access the MAC across the dedicated DCR interface will also be automatically generated.
Embedded Development Kit (EDK)
The EDK tool enables you to build a complete processor subsystem around the MAC. The tool includes standard Xilinx LogiCORE IP to connect the MAC as a CoreConnect peripheral, and will automatically generate a software driver.
Xilinx Ethernet LogiCORE IP and Reference Designs
Much of the legacy Virtex-II Pro™ Ethernet collateral will be reusable with the Virtex-4 MAC.
Reference designs are available that demonstrate useful techniques for optimizing your Ethernet system designs. The LocalLink LLTEMAC checksum offload peripheral, available with the Gigabit System Reference Design (XAPP536), demonstrates how to accelerate the TCP performance of your network endpoint.
Development Boards
Xilinx provides a family of development boards for immediate prototyping of your system design. These include:
• The ML403, a low-cost development platform featuring the Virtex-4 FX12 device, which includes a tri-speed Ethernet PHY for Ethernet copper connectivity
• The ML405 development board, which provides a superset of the ML403, with additional serial connectivity options enabled by the Virtex-4 FX20 RocketIO transceivers
All Xilinx and partner-developed boards are available from the “Xilinx on Board” section of the Xilinx website.
Conclusion
The embedded tri-mode Ethernet MAC in Virtex-4 FX devices provides unparalleled flexibility for today’s Ethernet systems designers, spanning:
• Hub, switch, and router systems topologies
• Tightly coupled network processing functionality utilizing embedded processors and custom logic
• Embedded processing shared bus subsystems
• Direct low latency connectivity to packet storage
• Cost effective interoperability with future, current, and legacy physical layer standards
In short, the Virtex-4 FX family enables you to customize your solution for the Ethernet topology and feature set that your application requires. To find out more, please follow the Virtex-4 links on the Xilinx website, www.xilinx.com/virtex4/.
[Figure 3 labels: Virtex-4 FX Embedded Ethernet MAC (RX1, TX1); LocalLink; TX_GMAC_IF; RX_GMAC_IF; Header Strip; TX FIFO; RX FIFO; Header FIFO; RX Status FIFO; Protocol Control Engine; TX_DMAC_IF; RX_DMAC_IF; Direct Memory Access Controller Interface (connects to memory subsystem); Protocol Offload Engine; FPGA Fabric]
Figure 3 – Packet-processing end-point in Virtex-4 FPGAs using embedded MAC with additional logic for checksum offload, TCP segmentation offload (TSO), network address translation, and other standard or custom applications
[Figure 4 labels: External PHY or SFP Module; SelectIO or optional Virtex-4 RocketIO Block for serial connectivity; GMII/SGMII/RGMII/MII Interface Block; Virtex-4 FX Embedded MAC (PHY Interface, Client Interface); Virtex-4 Embedded FIFO Block; Core; Virtex-4 FPGA Fabric]
Figure 4 – Multiple Gigabit Ethernet MAC in a switch/router configuration; Virtex-4 embedded FIFO blocks provide intermediate packet storage in the fabric.
by Darren Zacher
Technical Marketing Engineer
Mentor Graphics Corporation, Design Creation and Synthesis [email protected]
Customers in today’s demanding communications and consumer applications need to attain unprecedented levels of capacity and performance while reducing power consumption and overall cost. With the introduction of high-end devices into the marketplace, more of these applications are being addressed by FPGA solutions.
As professional programmable logic designers, you are always searching for better ways to create value and differentiate your products. To do so effectively, you need to adopt comprehensive, high-productivity design flows instead of point tools to crack new design challenges and take advantage of the benefits of the latest programmable silicon platforms.
Multiple Platforms, Unprecedented Opportunity
With the release of Xilinx® Virtex-4™ devices, you can enjoy twice the density, twice the performance, and half the power consumption of previous Xilinx FPGA families. If you seek sheer DSP performance, you might prefer Virtex-4 SX FPGAs, which offer 256 GigaMAC/s performance for 18-bit operations. The LX family of FPGAs offers higher performance logic; with FX devices, you can explore embedded processing and high-speed serial connectivity applications. These three platforms, comprising a complete selection of 17 devices, collectively offer a compelling alternative to ASICs and ASSPs.
To fully exploit this immense potential, design teams must consider moving away from serial, iterative, point-tool approaches that involve designing or re-designing from scratch. To manage non-recurring engineering time and costs and create efficient, reliable flows, you must clearly identify which of the various “building blocks” you need to focus on when using a platform approach to successfully implement a high-end design.
Typical building blocks may include:
• Intellectual property such as internal company, Xilinx, or third-party IP
• Lower-level blocks used in the context of a bottom-up design flow
• Algorithms via C or C++ or MATLAB™
• RTL blocks
• Embedded processors
• I/O interfaces
By using a comprehensive, methodical design flow, you can effectively optimize these blocks in a multimillion-gate device.
As high-end FPGAs approach ASIC-level performance, designers are adapting many advanced ASIC techniques for FPGA design. The complex FPGA design flow shares some commonality with ASIC design; for instance, RTL simulation remains basically unchanged. But certain subtle differences exist under the hood, and many steps are fundamentally different. The pre-built nature of FPGAs implies a “use or lose” approach to features or capabilities, so you must match functional requirements with the device architecture. Thus, common steps such as synthesis or place and route all differ subtly in the FPGA domain.
You can use C++ synthesis techniques borrowed from ASIC flows to target FPGAs. C++ specifications are much less tied to any specific hardware than the corresponding RTL code.
Another technique, physical synthesis, illustrates the subtleties involved when the same general approach is used for both ASICs and FPGAs. Physical synthesis requires a detailed understanding of the FPGA’s hardware structure. At the very least, physical synthesis tools must be more specifically targeted to FPGA architectures.
Emerging Design Methodologies Elicit the Power of Virtex-4 FPGAs
Adopt a broader design flow methodology instead of the traditional point-tool approach.
A typical high-end FPGA design flow should encompass such tasks as:
• Early design rule checking
• Higher level design abstraction
• Functional and system-level simulation and verification
• Advanced physical synthesis techniques
Let’s describe each of these in more detail.
Integrated Approach to Design Creation
In terms of design entry, the need to create faster, larger, and more complex designs packed into the latest FPGA devices within the shortest possible time presents significant challenges. The high availability of configurable logic in platform FPGAs that include hard ASIC macros – such as embedded processor blocks and complex I/O standards – has truly enabled programmable SoC, where a serialized design approach would not work. Only a system-level RTL design concept, used in parallel with multiple aspects of managing and optimizing the high-level design creation process, will ensure success.
Large design projects mandate the collaboration of several engineers or engineering teams, often belonging to separate companies and typically distributed in different geographic locations worldwide. This team-based approach raises the importance of a consistent design coding style for teams to share code effectively.
Teams invariably comprise experienced project leaders and designers alongside less experienced junior engineers working on the various building blocks of a design. The resulting skill diversity makes the need for consistency critical. It is imperative that companies carefully scrutinize the planning and creation process to identify poor design styles, incorrect design rules, and syntax/semantic errors at the earliest possible stage before even attempting to tie the building blocks together or simulate/synthesize the design.
In bigger designs, it is not unusual for multidisciplinary design teams to focus on and optimize only a portion of the device. As the system is defined in RTL by combining both vendor and internal IP (and for those applications utilizing DSP functionality, RTL generated algorithmically), you will need an integrated system design approach to help synchronize the development of each specific part of a large, high-capacity FPGA.
From the configuration of the embedded processor to logic development and high-speed I/O assignment, the ideal synchronization of these teams and processes is required to deliver an optimized field-programmable SoC. The merging and management of these multiple disciplines to generate the system-level RTL and associated design files is a huge task best handled by a comprehensive and flexible environment.
To reduce development cost and time to market, 80-90% of projects may now include both re-work of an existing design as well as reuse of previously designed components or IP, whether internal or purchased. Because this trend is expected to increase, you need to ensure that your components/subsystems are designed to be reusable and conform to established design reuse rules.
Through cooperative efforts in the design community and internal corporate standardization, the industry has developed a number of reuse methodology guidelines that can be checked using automated tools. Tools such as Mentor Graphics® HDL Designer Series™ (HDS) can help design teams successfully integrate both hard and soft IP (such as PowerPC™ and MicroBlaze™ processors).
Larger designs at higher speeds have prolonged traditional simulation cycles. Similarly, synthesis can become a protracted, iterative process in order to achieve desired performance goals. You need to maximize the productivity of potentially long EDA tool runs by ensuring that as many code errors as possible are found and fixed before the start of simulation and synthesis (Figure 1).
Equally important are integrated connections to advanced tools such as DesignAnalyst™ and Precision® Synthesis from Mentor Graphics to ensure against errors and reduce iterations, as well as integration with any third-party EDA tools through a flexible integration mechanism. Through static design checking or “linting” products, you can perform many different forms of checking during the design creation process.
Interactive HDL visualization and creation tools provide automatic documentation features and reporting as well as intelligent debug and analysis to effectively manage FPGA designs. Moreover, tight bi-directional communications with PCB tools from within the design creation process shorten design cycles by integrating and synchronizing HDL design with PCB design, eliminating time-consuming manual steps.
Figure 1 – When used in tandem for concurrent design entry and checking, interactive HDL visualization and creation tools can increase design quality, reduce iterations, shorten simulation and synthesis cycles, and improve testability and reuse in high-end FPGAs.
Higher Abstraction Levels Speed Hardware Design
For the first time, professional design engineers are literally struggling to keep pace with Moore’s Law, which makes it difficult to fully utilize the capacity of 90 nm ASICs or efficiently target the complex structures found in domain-specific FPGAs. Algorithmic C synthesis (Figure 2) promises to raise the abstraction of hardware design by providing a new, more abstract entry point, benefiting both ASIC and FPGA hardware designers. But to understand the need for higher abstraction languages, you must first analyze the problems with existing RTL methodologies.
The design complexity of new DSP applications has outpaced traditional RTL capabilities. To create hardware implementations for blocks of computationally intensive algorithms using RTL, design teams must iterate through several steps, including micro-architecture definition, handwritten RTL, and area/speed optimization through RTL synthesis. This manual process is slow and error-prone. In the final result, both the micro-architecture and technology characteristics become hard-coded into the RTL description. This hard coding renders the whole notion of RTL reuse or retargeting impractical in real applications.
An optimized C-to-RTL synthesis flow not only promotes a higher level of abstraction, it also gives the design team the flexibility to transition from one implementation technology to another. You can tune the hardware for high-performance parallel implementations or smaller, more serial implementations.
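To make the parallel-versus-serial trade-off concrete, the following Python sketch estimates how many clock cycles a block of independent multiplies needs for a given number of hardware multipliers. The function name and the 64-multiply workload are illustrative assumptions, and the model deliberately ignores pipelining, adder trees, and memory bandwidth, all of which a real C synthesis tool would take into account.

```python
def cycles_needed(num_multiplies, num_multipliers):
    """Cycles to issue all multiplies when num_multipliers operate in parallel."""
    if num_multiplies < 0 or num_multipliers < 1:
        raise ValueError("need a non-negative workload and at least one multiplier")
    # Ceiling division: a partially filled final cycle still costs a full cycle.
    return -(-num_multiplies // num_multipliers)


if __name__ == "__main__":
    workload = 64  # hypothetical example: a 64-tap dot product
    for multipliers in (1, 8, 64):
        print(multipliers, "multiplier(s):",
              cycles_needed(workload, multipliers), "cycle(s)")
```

A fully serial design uses one multiplier for many cycles, while a fully parallel one spends area to finish in a single cycle; intermediate points trade one for the other.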
Using this approach to describe functional intent (offered in the Mentor Graphics Catapult™ C Synthesis tool), you can move up to a far more productive abstraction level for designing hardware. As hardware designers, you can reduce implementation efforts by as much as 20X while creating a more repeatable and reliable design flow.
The ability to select fundamentally superior micro-architectural alternatives allows you to create designs of better quality than traditional RTL methods. Finally, this approach closes the conceptual gap between algorithm designers modeling in C/C++ and hardware designers working at the RTL abstraction level.
Simulation and Verification Challenges
Using standard RTL verification methods in high-capacity FPGAs quickly diminishes the benefits of faster hardware creation. The current execution speeds of software validation platforms and RTL verification environments are insufficient to quickly test design functionality. Design verification takes significantly longer than design development because of the limited speed of RTL simulators and the time needed to manually create an RTL test bench.
Additionally, C/C++ simulation (although upwards of 10,000X faster than RTL) may be inadequate to validate the original algorithm given the data-intensive nature of DSP designs. These challenges are in fact opportunities for both algorithm development and system validation through the use of accelerated simulation.
High-level design verification flows are now turning to address rapid algorithm validation and verification, using hardware acceleration by leveraging the benefits of a SystemC verification environment. These flows begin with the algorithm designer validating designs in C++ and end with the hardware designer verifying the algorithm in RTL.
This method of using high-level C/C++ synthesis in combination with a SystemC verification environment provides an automated path from algorithm development to synthesized RTL running in an FPGA prototyping environment. Executing the algorithm directly in hardware gives algorithm designers the ability to validate algorithms and hardware designers the ability to validate the entire system at or near real-time speeds.
The use of SystemC as a verification environment permits both algorithm and hardware designers to use the same test bench and test vectors, eliminating the need for manual test bench creation. The combined approach of hardware acceleration of C/C++ algorithms in a SystemC verification environment provides a push-button solution for accelerated algorithm development and system validation.
Balancing the Cost/Timing Closure Equation
An essential step in realizing a high-capacity FPGA design is to optimize that design for both timing and cost. Timing closure challenges are well known. Using stand-alone logic synthesis with place and route can be non-deterministic by nature, especially for large devices.
Designers tend to write and rewrite RTL code and constraints to try and coax the place and route tool to do their bidding. Once you go down this path, you then must iterate through place and route – the most time-consuming step in FPGA design – before gaining any visibility as to whether your changes were a step in the right direction or if they only served to further exacerbate the problem.
Similar to optimization for timing, the process of achieving true “cost closure” involves a reduction in area to reduce FPGA part cost, or a reduction in the total cost of the design by increasing levels of abstraction and design reuse. The irony is that once you attain a successful implementation, any change – no matter how small – in the design or architecture threatens to obsolete that success. This unpredictability negates the reduced cost and time-to-market benefits of using programmable logic in the first place.
Figure 2 – An optimized C-to-RTL synthesis flow promotes a higher level of design abstraction and gives you the flexibility to easily transition from one implementation technology to another.
Increasing die sizes place additional burdens on the extant methodologies. A large die poses a significant challenge in obtaining repeatable, high-quality placements out of current placement algorithms. The larger die size is now widening the distribution curve of net delays grouped by fanout, the basis behind industry-accepted wire delay models.
This widened distribution has a degrading effect on the accuracy of fanout-based wire delay models. In larger devices, interconnect delay dominates performance for FPGA platforms. Because fanout-based delay estimates in FPGAs struggle to model even a simplified version of physical reality today, you can see why optimization decisions based on a wire-load estimate are often ineffective. Worse, physical proximity cannot always relate directly to delay, so traditional floorplanning falls painfully short. Advanced physical synthesis techniques can solve these issues in several ways.
First, to improve accuracy and reduce design iterations, you must consider real interconnect delay and physical effects up front (Figure 3); combining logic and physical synthesis is critical for the design of larger, high-performance FPGAs. Some physical synthesis alternatives available today are based solely on technology borrowed from the ASIC implementation space.
In reality, forcing an ASIC methodology – and mentality – on the FPGA world cannot work. Such approaches essentially try to outsmart the vendor placement and may show promise in certain situations, but most cannot match the performance of a tool that leverages the FPGA vendor’s post-layout information to provide accurate physically aware synthesis.
Second, FPGA-oriented physical synthesis solutions need to take into account successful implementation experience that you have previously developed. For instance, when you complete a modular design and have optimized performance for a portion of it using physical synthesis, a good tool must ensure that you can take full advantage of these optimizations and reuse them on subsequent designs.
Physical synthesis in FPGAs is growing beyond the ASIC model to be a valuable part of cost minimization and component reuse strategies. When investing in a synthesis tool with a highly deterministic process for improved results, look for technologies and algorithms that not only optimize designs for cost and timing, but also enable you to translate your professional experience and previous design implementations at the physical level into faster time to market in subsequent designs.
Any tool used in professional FPGA design (including the Precision Synthesis tool from Mentor Graphics) should consider FPGA vendor placement results as soon as possible, and only then begin to manipulate the design using physical synthesis – integrated with logic synthesis in a unified data model – to converge on timing at a lower cost.
From Point Tools to ESL Design Flows
Every designer stands poised to benefit from the new standard set by Virtex-4 high-performance FPGAs. The next-generation challenge faced by mainstream FPGA EDA tool vendors is to leverage point-tool expertise and thus meld apparently contradictory trends – higher levels of abstraction on the one hand and greater dependence on specific physical characteristics on the other – into a coherent design methodology and highly productive flow.
In keeping with these advances, EDA tool companies will continue to extend and improve their comprehensive, integrated design flows spanning all levels of abstraction. Mentor Graphics continues to be a technology leader in this space. Designers must take advantage of EDA tools that now address both physical and electronic system-level (ESL) challenges of high-end FPGAs, and thus realize the unprecedented potential of these devices as ASIC replacements in new SoC designs.
To access the latest product news, application notes, and case studies, evaluate new design flows, or schedule a product demonstration, visit www.mentor.com/fpga/.
Figure 3 – High-end FPGA synthesis tools should ideally consider FPGA vendor placement results up-front, and only then begin to manipulate the design using physical synthesis – integrated with logic synthesis in a unified data model – to converge on timing.
The availability of embedded processor subsystems in FPGAs opens the door to a myriad of applications, including embedded network processors, flexible sandbox prototyping, control plane and data path subsystems, and exception handling processors. Today’s FPGAs integrate existing IP cores, interfaces, custom processing engines, and now embedded processor subsystems. You can easily instantiate these subsystems into a top-level HDL design just as you would integrate off-the-shelf IP.
Xilinx® Virtex-4™ FX FPGAs integrate a higher performance IBM™ PowerPC™ core with the new Auxiliary Processor Unit interface. The direct connection to the FPGA fabric facilitates advanced coprocessor designs.
You can use Xilinx Platform Studio/EDK software to design embedded processor subsystems in FPGAs with embedded PowerPC hard processor cores or with Xilinx MicroBlaze™ soft processor cores. Although off-the-shelf peripheral cores and MicroBlaze soft cores are synthesized using XST during EDK platform generation, the overall FPGA project and custom peripheral cores are synthesized with Synplicity® Synplify Pro® 8.0, leveraging new features and superior quality of results.
EDK Subsystem Project Flow
All projects begin by defining an overall FPGA directory structure. The embedded subsystem should reside in its own subdirectory. For example:
fpga_project
    /doc              spec and documentation
    /src              RTL source code files
    /constraints      .ucf, .sdc files
    /sim              simulation files
    /syn              synthesis project files
    /pnr              place and route files
    /ppc_subsystem    embedded processor subsystem
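The layout above can be created in one step with a short script. The Python sketch below is only a convenience under stated assumptions: the `scaffold` function and `LAYOUT` table are hypothetical names, and the directory names are copied from the listing above.

```python
from pathlib import Path

# Subdirectory names and purposes, copied from the listing above.
LAYOUT = {
    "doc": "spec and documentation",
    "src": "RTL source code files",
    "constraints": ".ucf, .sdc files",
    "sim": "simulation files",
    "syn": "synthesis project files",
    "pnr": "place and route files",
    "ppc_subsystem": "embedded processor subsystem",
}


def scaffold(root):
    """Create the project tree under root and return the created paths."""
    root = Path(root)
    created = []
    for name in LAYOUT:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # safe to rerun on an existing tree
        created.append(path)
    return created


if __name__ == "__main__":
    for path in scaffold("fpga_project"):
        print(path, "-", LAYOUT[path.name])
```

Keeping the embedded subsystem in its own subdirectory means the EDK project files never mix with the top-level synthesis and place-and-route files.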
Creating a new EDK project in /ppc_subsystem results in a system.xmp project file. Next, EDK Project Options must indicate that it is a subsystem by setting:
1. Design Hierarchy to SubModule. Specify the top-instance name of the embedded subsystem (ppc_subsystem). The indicated top-instance name will be used when instantiating the subsystem in the overall top-level design.
2. Synthesis Tool to None. This indicates that no synthesis tool is used to synthesize the overall design within EDK (the instantiated subsystem will be included later in the Synplify Pro project), although EDK will have used XST (and possibly Synplify Pro) in the platform creation of the subsystem and its peripherals.
3. Implementation Tool Flow to ISE™.
Although Synplify Pro supports mixed languages, you can select Verilog™ or VHDL for EDK output files in Project Options/HDL and Simulation.
Platform Generation
You can create the embedded processor subsystem by using either the Base System Builder wizard, the GUI selection of peripheral cores, or direct text editing of the microprocessor hardware specification (MHS) file.
Once the MHS file has been constructed, Generate Netlist invokes Platform Generation. PlatGen constructs the netlist, builds and interconnects indicated peripherals, runs DRC checking for errors and warnings, and generates output files.
PlatGen will generate two top-level files in /hdl: system_stub.v and system.v. The system_stub.v file instantiates system.v and adds I/O insertion as Xilinx primitives for all top-level ports. With the processor as a subsystem, system_stub.v is not used because there are other cores, subsystems, and logic in the design. For example, clock signals could be generated by top-level instantiated DCMs and subsystem signals could go to other modules at the same level of hierarchy instead of off-chip.
Also, using Synplify Pro, the I/O insertion is automatic; you don’t need to explicitly instantiate BUFG, IBUF, or OBUF primitives for most I/O standards.
Choosing to instantiate system_stub.v as our subsystem would then require editing, removing, or modifying the I/O insertion for the ports not directly connected to an external pin. Once modified, rerunning PlatGen would overwrite this file once again. Another choice might be to rename system_stub.v after editing the file; the downside to this approach is that port/subsystem modifications would require you to recreate the modified/edited file.
A better approach is to instantiate system.v directly in the top-level HDL. Synplify will take care of the necessary I/O insertion where required or, for I/O standards requiring I/O primitive instantiation (for example, LVDS), this should be done directly in the top-level HDL file. System.v is always correct as generated by EDK PlatGen and never needs to be modified. The one additional step required is at the top level, in the case of tri-state signals.
For example, you can define the project top-level ports as:
module fpga_top(
    inout [31:0] ddr_dq,
);
PlatGen will generate system.v (in /implementation), bringing out the tri-state signals as shown in the instantiated ppc_subsystem:
Because we want to be able to instantiate system.v directly into our top level, we must also add the required HDL to control the bidirectional signals:
Now EDK-generated subsystem Verilog files do not need to be modified – only instantiated. Bi-directional signals are handled correctly and I/O insertion is either handled automatically by Synplify or explicitly instantiated as Xilinx primitives when required.
Memory Generation
PlatGen will also generate the required memory initialization files for the specified block RAMs coupled with DSOCM, ISOCM (PowerPC only), LMB (MicroBlaze soft processor core only), OPB, and PLB block RAM controllers.
PlatGen will produce two BMM (block RAM memory map) files in the /implementation directory: system.bmm and system_stub.bmm. A BMM file will be used in the ISE flow to indicate the logical data space used by the embedded subsystem and organization of the block RAM memory. In the case of our subsystem, system_stub.bmm would be used, as it contains the complete hierarchical path (because we specified the top-level instance of our subsystem in the project options).
During the ISE bitgen phase of the flow, a system_stub_bd.bmm file will be created in the /implementation directory, indicating the physical location of the block RAMs.
Synplify Project Flow
While XPS/EDK generates the embedded processor subsystem (/implementation/system.v), once created the ppc_subsystem is instantiated exactly as any IP block by adding it to the overall Synplify synthesis project. Whether the underlying embedded processor subsystem used XST, Synplify, or both to create the peripherals and generate the subsystem is irrelevant to the overall Synplify synthesis project.
A typical synthesis project flow, as shown in Figure 1, would follow this order:
1. Create a synthesis project
2. Add files to the synthesis project:
   project_top.v
   /ppc_subsystem/hdl/system.v (EDK-generated subsystem)
3. Synthesize and review the synthesized project
4. Use the generated output files in the ISE project:
   fpga_top.edf (top-level source file)
   fpga_top.ncf (sdc-translated constraints file)
System.v contains the actual embedded subsystem with the peripheral wrappers instantiated. At the end of system.v are black_box definitions for each of the wrappers. Although Synplify doesn’t recognize these XST synthesis directives, it does realize that it has to create black boxes and does so without modification.
Synplify will generate the warnings shown in Figure 2 because of the XST-generated synthesis directives and empty black box modules. Once reviewed and accounted for, these warnings can now be “hidden” using the Synplify Pro warnings filter, as shown in Figure 3. The filter creates a project.prf file (Figure 4). This file can also be sourced in the Tcl window (source filename).
ProjNav ISE Flow
The /pnr directory is used for the Xilinx ProjNav ISE flow. The fpga_project.npl file is created by ProjNav indicating ISE project options.
The following source files are added to the ISE project:
1. fpga_top.edf (Synplify top-level netlist with ppc_subsystem)
   fpga_top.ncf (not added as an explicit source file; created from the Synplify constraints [.sdc])
This file requires no modification, assuming that the subsystem instantiated in the top-level module uses the same instance name as generated by system_stub.v (that is, the top instance name indicated in the project options).
4. /ppc_subsystem/ppc405_0/code/executable.elf
   An .elf file (pronounced “elf”) is a binary data file that contains an executable CPU code image ready for running on a CPU. These files are produced by software compiler/linker tools. Data2BRAM uses .elf files as its basic data input form.
Figure 1 – Synthesis project design flow (stages shown: EDK Subsystem, Synplify Synthesis, and ISE ProjNav, with the Translate macro search path pointing to /implementation)
Figure 2 – Synplify Pro 8.0 compiler warnings
Figure 3 – Synplify Pro 8.0 warnings filter
Figure 4 – Synplify Pro .prf file
ISE Translate Properties must set the Macro Search Path to point to the /ppc_subsystem/implementation directory for it to find the .ngc peripherals that were black-boxed by Synplify, referenced in fpga_top.edf. These peripherals were created by XST during PlatGen.
Project implementation then follows a normal ProjNav flow producing translate, map, place and route, and timing reports.
You can easily incorporate embedded processor software changes made by the EDK GNU compiler into the final .bit file without hardware recompiles by running Generate Programming File, or alternatively, the Data2Mem utility. When using Data2Mem, the BMM file specified (-bm) must use the BitGen-generated system_stub_bd.bmm in the /implementation directory.
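As a rough illustration of the Data2Mem step, the Python sketch below assembles a command line without running it. Only the -bm option is named in the text above; the -bd (ELF input), -bt (input bitstream), and -o b (output bitstream) options are given as I understand them from the Data2MEM documentation and should be checked against your ISE release. The bitstream file names are hypothetical.

```python
def data2mem_command(bmm, elf, bit_in, bit_out):
    """Build a Data2MEM argument list that merges new software into a bitstream."""
    return [
        "data2mem",
        "-bm", bmm,          # BitGen-annotated BMM with physical block RAM locations
        "-bd", elf,          # software image produced by the EDK GNU tools (assumed flag)
        "-bt", bit_in,       # existing bitstream from the ISE flow (assumed flag)
        "-o", "b", bit_out,  # write an updated bitstream (assumed flag)
    ]


if __name__ == "__main__":
    cmd = data2mem_command(
        "ppc_subsystem/implementation/system_stub_bd.bmm",
        "ppc_subsystem/ppc405_0/code/executable.elf",
        "pnr/fpga_top.bit",           # hypothetical input bitstream name
        "pnr/fpga_top_download.bit",  # hypothetical output bitstream name
    )
    print(" ".join(cmd))
```

Building the argument list separately from executing it makes the command easy to review or log before handing it to a build script.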
Custom Peripheral Cores
XPS provides a Create Peripheral Wizard that generates core description files and ensures that custom peripherals comply with the Xilinx implementation of the IBM CoreConnect PLB and OPB bus standard. The PLB and OPB buses will connect to an IPIF, allowing user logic to connect to the IPIC side of the interface. Unfortunately, the wizard currently supports only VHDL. Peripheral cores can also be created in Verilog, but cannot take advantage of the templates created by the wizard.
DCR and OCM bus IP cores are not currently supported through a template or wizard. DCR and OCM bus protocols are simple to understand, however, and you can easily create Pcores for these buses either in VHDL or Verilog. The current EDK-provided OCM buses now allow configurable multi-slave capabilities, providing an easy way to create low-latency slave-only peripherals.
You can integrate custom IP cores into the EDK project either as a black box synthesized with Synplify or as an XST netlist. The Synplify-generated IP core requires associated MPD (microprocessor peripheral definition) and BBD (black box definition) files. The XST netlist is synthesized by PlatGen along with the system and requires MPD and PAO files.
Directory Structure
Figure 5 shows the required Pcore directory structure. PlatGen searches for IP according to the following priorities:
1. /pcores directory in the project directory
2. <library_path>/<LibraryName>/pcores if -lp option set (project options/peripheral repository)
3. $EDK/hw/XilinxProcessorIPLib/pcores
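The priority order above amounts to a first-match search over an ordered list of directories. The Python sketch below models that behavior; it is not PlatGen's implementation, the function names are hypothetical, and each entry in `repositories` is assumed to stand for one <library_path>/<LibraryName> location.

```python
from pathlib import Path


def pcore_search_dirs(project_dir, repositories=(), edk_root=None):
    """Return the pcores directories in the priority order listed above."""
    dirs = [Path(project_dir) / "pcores"]                       # 1. project directory
    dirs += [Path(repo) / "pcores" for repo in repositories]    # 2. -lp repositories
    if edk_root is not None:
        # 3. the library shipped with EDK ($EDK in the text above)
        dirs.append(Path(edk_root) / "hw" / "XilinxProcessorIPLib" / "pcores")
    return dirs


def find_pcore(name, search_dirs):
    """Return the first directory named `name` found in search order, else None."""
    for directory in search_dirs:
        candidate = directory / name
        if candidate.is_dir():
            return candidate
    return None
```

One practical consequence of first-match ordering is that a core placed in the project's own /pcores directory shadows a core of the same name in a repository or in the EDK installation.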
Pcore Files
The Pcore HDL source files must be located in the /verilog or /vhdl directory if they are to be synthesized by XST with PlatGen. If the Pcore is provided as a Synplify-generated netlist, the EDIF must be located in the /netlist directory and indicate its black-box status in a BBD file. Required MPD, PAO, and BBD files for the peripheral must be placed in the /data directory.
The .mpd file specifies PORTs, PARAMETERs, BUS_INTERFACEs, and OPTIONs. For Verilog files, the HDL option specified is OPTION HDL = VERILOG.
If XST is used as the synthesis tool for creation of the peripheral, the netlist option is OPTION IMP_NETLIST = TRUE.
If Synplify is used for the creation of the peripheral, the netlist option is OPTION IMP_NETLIST = FALSE. This would tell PlatGen to not run XST synthesis for this peripheral. A peripheral wrapper is still created and instantiated in system.v and the project synthesis run in Synplify would again create a black box for this peripheral.
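The file and option requirements for the two peripheral flows can be summarized in a small lookup. The Python sketch below only restates the rules given in this section; the function names and the dictionary layout are hypothetical, and it makes no claim about other MPD options.

```python
def pcore_requirements(synthesis_tool, hdl="verilog"):
    """Summarize where a custom peripheral's files go for each synthesis flow."""
    tool = synthesis_tool.lower()
    hdl = hdl.lower()
    if hdl not in ("verilog", "vhdl"):
        raise ValueError("hdl must be 'verilog' or 'vhdl'")
    if tool == "xst":
        return {
            "data_files": ("MPD", "PAO"),  # placed in /data
            "design_dir": "/" + hdl,       # HDL sources synthesized by PlatGen
            "imp_netlist": "TRUE",
        }
    if tool == "synplify":
        return {
            "data_files": ("MPD", "BBD"),  # placed in /data
            "design_dir": "/netlist",      # pre-synthesized EDIF black box
            "imp_netlist": "FALSE",
        }
    raise ValueError("synthesis_tool must be 'xst' or 'synplify'")


def imp_netlist_option(synthesis_tool):
    """Return the MPD netlist option line quoted in the text above."""
    return "OPTION IMP_NETLIST = " + pcore_requirements(synthesis_tool)["imp_netlist"]
```

For example, `imp_netlist_option("synplify")` yields the OPTION IMP_NETLIST = FALSE line that tells PlatGen to skip XST synthesis for that peripheral.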
Conclusion
You can easily integrate Xilinx embedded processor subsystems created using EDK into a Synplicity flow by instantiating the EDK-generated embedded subsystem into the top-level HDL design. You can use Synplicity tools not only as the overall project synthesis tool but also as the peripheral core synthesis tool in the creation of custom peripherals.
For more information, visit www.CommLogicDesign.com. Comm Logic Design is a Xilinx XPERTS partner focused on architecting, building, and delivering system solutions for wired-network, telecom, and storage applications.
Synopsys® Design Compiler® FPGA (DC FPGA) allows you to meet your high-performance design goals by using a powerful set of optimization algorithms and features specifically tuned for the Xilinx® Virtex-4™ architecture. These algorithms use special Virtex-4 resources such as the DSP48 block and block RAM to achieve the lowest overall area utilization and the optimal circuit timing performance.
Design Compiler FPGA Overview
Designs that target complex devices such as Virtex-4 FPGAs require the same power and flexibility in synthesis that only ASIC designers had access to in the past. DC FPGA is built on Design Compiler’s industry-leading ASIC synthesis technology and then customized to include FPGA-specific optimizations to handle even the most challenging designs. FPGA-specific optimizations enable optimal mapping to FPGA basic primitives such as LUTs and complex components like RAM, multipliers, and DSP blocks.
Synopsys Design Compiler FPGA can take your high-speed design to the next level of performance.
DC FPGA includes innovative Adaptive Optimization™ (AO) technology to dynamically tune the synthesis algorithms based on the design context, as well as timing constraints to provide faster synthesis runtime and optimal timing. DC FPGA inherits Design Compiler’s reliability – proven through the development of more than 125,000 ASIC designs. DC FPGA brings the powerful ASIC-strength synthesis of Design Compiler to FPGA designs.
In addition to AO technology, DC FPGA deploys a rich set of optimizations to achieve the best timing Quality of Results (QoR) for FPGA devices. These include:
• Constraint-driven synthesis and design space exploration
• Automatic finite state machine (FSM) extraction and optimization
• Automatic inference of special FPGA resources, such as RAM, ROM, multipliers, DSP blocks, shift registers, and global clock buffers
DC FPGA is part of a family of products from Synopsys that work in conjunction with the Xilinx ISE™ tool to streamline the FPGA design process.
In this article, we’ll show how DC FPGA optimizes for high performance in Xilinx Virtex-4 FPGAs.
Constraint-Driven Synthesis
DC FPGA uses a true timing-driven synthesis engine. You can greatly influence the final implementation choice by specifying appropriate timing and design-specific constraints during synthesis. Therefore, we recommend that you drive DC FPGA synthesis with the same set of constraints as the Xilinx ISE tool.
At a minimum, you should specify appropriate design timing constraints such as clock frequency, I/O offsets, and any timing exceptions applicable to your design (such as multicycle and false paths). Any other design-specific constraints – such as controlling special FPGA resource usage – could also be specified. For best performance, your design should not be over-constrained, which in some cases can lead to unnecessary increases in area.
Without any timing constraints, DC FPGA will perform area-based optimizations with good timing results. With proper timing constraints, DC FPGA applies the AO technology to explore the area-timing tradeoffs of various optimizations, selecting the final implementation that best fits your constraints.
For example, your timing goals enable DC FPGA to decide whether distributed RAM, block RAM, or a LUT with register-based implementation is sufficient for an inferred memory component in your design. Otherwise, DC FPGA optimizes for the lowest area utilization possible.
Table 1 shows two implementations for a small sub-module with two different clock constraints. The module is the critical one for a larger design of about 8,600 slices. The design contains a single clock domain with only one clock period constraint specified in DC FPGA.
In the first case, the module is constrained at 10 ns. DC FPGA exceeds the timing requirement after its area-based implementation and does not invoke the timing optimization phase. The critical path of the design runs through a series of carry logic.
In the second case, when a much tighter constraint (3 ns) is applied, DC FPGA performs aggressive timing optimizations and replaces the carry logic on its critical paths with parallel circuit structures built by LUTs. This results in a design with a slightly larger area but meets the new timing requirement, which was impossible to achieve with the carry logic structure. At the overall design level, a 29% timing improvement is achieved with a minor area increase of 11 slices.
Case      Clock Constraint    Post-PAR Area (# of Slices)    Post-PAR Fmax (MHz)
Case 1    10 ns               105                            260.1
Case 2    3 ns                116                            334.8
Table 1 - Design example showing area-timing tradeoffs in DC FPGA
Flexible FSM Support
DC FPGA contains sophisticated FSM extraction and optimization algorithms to ensure optimum high-performance state logic implementation. Once the FSM is detected and extracted from the RTL code, DC FPGA’s powerful state machine optimization engine performs various optimization schemes, such as optimizing unreachable states or removing duplicate states to produce the best logic implementation to meet timing.
At the same time, you have the flexibility to select a different FSM coding style such as one-hot, binary, gray, and zero-one-hot on a state-machine-by-state-machine basis, design basis, and global basis. This FSM encoding exploration flexibility allows you to customize the synthesis script to address design bottlenecks.
For an FPGA implementation, one-hot state implementations typically provide the best timing QoR for most designs at the expense of a higher register-to-LUT ratio. For most designs this is not a problem because of the register-rich architecture of FPGA devices.
High-Performance DSP Inference Capability
The availability of special FPGA resources such as block RAM, dedicated DSP slice, and carry logic combined with your specified design and timing constraints guides DC FPGA’s specialized optimization algorithms to determine the best optimum circuit implementation.
DC FPGA is highly capable of inferring complex circuit topology from your design’s RTL coding structure, effectively deciding the final implementation that best exploits the resources of the targeted FPGA. DC FPGA minimizes overall resource usage while providing the best circuit performance possible.
This powerful optimization feature allows DC FPGA to effectively infer and map complex logic configurations into special resources such as the Virtex-4 dedicated DSP48 slice. To illustrate this powerful feature, Figure 1 shows a simple multiply accumulate (MAC) logic structure, where A- and B-registered input signals are multiplied. The registered multiplier intermediate output is then accumulated in the last adder stage, feeding the registered Q output signal.
The RTL code for this simple MAC function is:
module test ( Q, A, B, clk );
    output [47:0] Q;
    input  [16:0] A, B;
    input         clk;
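As a behavioral cross-check of the MAC structure described above (registered A and B inputs, a registered product, and a registered accumulator output), here is a minimal cycle-by-cycle model in Python. It is a sketch for reasoning about latency, not the article's RTL: the class and attribute names are hypothetical, the 17-bit inputs and 48-bit output follow the port declarations above, and the absence of a reset is an assumption based on that port list.

```python
MASK48 = (1 << 48) - 1  # Q is declared 48 bits wide


class MacModel:
    """Registered inputs -> registered product -> registered accumulator."""

    def __init__(self):
        self.a_reg = 0
        self.b_reg = 0
        self.m_reg = 0  # registered multiplier output
        self.q = 0      # registered accumulator output

    def clock(self, a, b):
        """Apply one rising clock edge with inputs a and b; return Q after the edge."""
        if not (0 <= a < (1 << 17) and 0 <= b < (1 << 17)):
            raise ValueError("A and B are 17-bit unsigned values")
        # Compute every next-state value from the current state, then commit,
        # mimicking registers that all update on the same clock edge.
        next_q = (self.q + self.m_reg) & MASK48
        next_m = self.a_reg * self.b_reg
        self.a_reg, self.b_reg, self.m_reg, self.q = a, b, next_m, next_q
        return self.q
```

In this model a product first appears in Q three clock edges after its operands are applied, which reflects the three register stages between the inputs and the output.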
DC FPGA is able to effectively implement the logic configuration shown in Figure 1 in a single DSP48 slice, fully recognizing and taking advantage of the DSP48’s embedded 18 x 18 signed multipliers, accumulated adder mode, and integrated pipeline registers to obtain the highest performance system clock speed.
Figure 2 shows the final DC FPGA single DSP48 implementation without the use of other logic resources. The OPMODE control input pin of the DSP48 element is set to “0100101” to realize the overall MAC functionality mode intended by circuit topology, while the AREG, BREG, MREG, and PREG attributes are each set to “1” to signify a single-stage register pipeline.
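To see why “0100101” corresponds to multiply-accumulate, it helps to split OPMODE into its three multiplexer fields. The Python sketch below does that; the field boundaries and selection tables are given as I recall them from the Virtex-4 DSP48 documentation and are an assumption to verify there, and the function name is hypothetical.

```python
# Multiplexer selections for the DSP48 adder inputs (assumed from the
# Virtex-4 DSP48 documentation; verify before relying on them).
X_MUX = {0b00: "0", 0b01: "M", 0b10: "P", 0b11: "A:B"}       # OPMODE[1:0]
Y_MUX = {0b00: "0", 0b01: "M", 0b11: "C"}                    # OPMODE[3:2]
Z_MUX = {0b000: "0", 0b001: "PCIN", 0b010: "P", 0b011: "C",
         0b101: "PCIN shifted", 0b110: "P shifted"}          # OPMODE[6:4]


def decode_opmode(bits):
    """Split a 7-character OPMODE string (bit 6 first) into Z, Y, and X selections."""
    if len(bits) != 7 or set(bits) - {"0", "1"}:
        raise ValueError("OPMODE must be a 7-character binary string")
    z, y, x = int(bits[0:3], 2), int(bits[3:5], 2), int(bits[5:7], 2)
    return {
        "Z": Z_MUX.get(z, "reserved"),
        "Y": Y_MUX.get(y, "reserved"),
        "X": X_MUX.get(x, "reserved"),
    }
```

With X and Y both selecting M, the adder receives the multiplier product; choosing P on the Z input feeds the previous result back, which gives the accumulate behavior. Under the same assumptions, the values that appear in Figure 3 decode to a plain multiply (Z = 0 for 0000101) and a multiply-add of the cascaded PCIN input (0010101).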
Figure 1 - Simple multiply accumulate (MAC) logic
Figure 2 - DC FPGA single DSP48 implementation for MAC logic (A and B inputs driven by 0, A[16:0] and 0, B[16:0]; OPMODE = "0100101"; Q[47:0] taken from the P output)
Furthermore, the high-performance DSP inference feature in DC FPGA supports very complex design topologies. Such topologies are extensively used in DSP-intensive applications such as a digital FIR filter, commonly found in wireless communication applications.
Figure 3 - Four-tap systolic FIR digital filter structures (h0 slice: OPMODE = 0000101, Multiply; DSP48 Slices 2 through 4 for h1, h2, and h3: OPMODE = 0010101, Multiply-Add)
Figure 3 shows the schematic of a four-tap systolic FIR digital filter structure. DC FPGA uses advanced DSP inference to implement this design in only four DSP48 slices without the use of external logic resources. The integrated pipeline registers are further exploited for faster clock throughput in this type of filter structure.
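To make the systolic structure concrete, here is a behavioral Python sketch. It is not the article's RTL and not DC FPGA output. It assumes one plausible register arrangement per tap (two input-sample registers, one product register, one sum register); the function names are hypothetical. Each tap multiplies its delayed sample by its coefficient and adds the registered sum cascaded from the previous tap, which is what the Multiply and Multiply-Add slices in Figure 3 do.

```python
# Behavioral model of an N-tap systolic FIR filter.
# ASSUMPTION: each tap has two sample registers (b1, b2), one product
# register (m) and one sum register (p); the real DSP48 register settings
# may differ, which would change only the latency.

def systolic_fir(samples, coeffs):
    """Clock the samples through the tap chain; return the last tap's sum each cycle."""
    n = len(coeffs)
    b1, b2, m, p = [0] * n, [0] * n, [0] * n, [0] * n
    outputs = []
    for x in samples:
        # Compute every register's next value from the current state,
        # as if all registers were clocked on the same edge.
        next_b1 = [x if k == 0 else b2[k - 1] for k in range(n)]
        next_b2 = list(b1)
        next_m = [coeffs[k] * b2[k] for k in range(n)]
        next_p = [m[k] + (p[k - 1] if k > 0 else 0) for k in range(n)]
        b1, b2, m, p = next_b1, next_b2, next_m, next_p
        outputs.append(p[-1])
    return outputs


def direct_fir(samples, coeffs):
    """Reference: y[t] = sum over k of coeffs[k] * samples[t - k]."""
    return [sum(c * samples[t - k] for k, c in enumerate(coeffs) if t - k >= 0)
            for t in range(len(samples))]


if __name__ == "__main__":
    h = [1, 2, 3, 4]
    x = [5, -1, 2, 7, 0, 3] + [0] * 10   # zero padding flushes the pipeline
    latency = len(h) + 2                 # follows from the assumed registers
    print(systolic_fir(x, h)[latency:])
    print(direct_fir(x, h)[:len(x) - latency])
```

With this register arrangement the pipelined output matches the direct-form result delayed by N + 2 clock cycles, so the pipelining costs latency but not correctness.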
The following shows the RTL code for the systolic FIR filter:
DC FPGA can also implement other complex logic configurations in a DSP48 slice. Table 2 shows a sample of some of these complex logic structures.
The designs shown in Table 2 were synthesized using DC FPGA and placed and routed using Xilinx ISE 6.3i Service Pack 2, targeting an XC4VFX20-11 Virtex-4 device. The purpose of this exercise is to show the performance and area improvements delivered by DC FPGA’s advanced DSP inference capability. Each design was synthesized with and without DSP inference enabled.
Conclusion
Complex devices such as Virtex-4 require a flexible ASIC-strength synthesis solution. The advanced optimization engine in Synopsys Design Compiler FPGA efficiently utilizes the special resources available in Virtex-4 devices to provide the highest performance design possible.
DC FPGA gives you the freedom to modify synthesis scripts to address design bottlenecks, implement different FSM encoding styles, or explore other design optimizations to reach your design goals. Now you have access to the power and flexibility of Design Compiler to implement your complex FPGA designs.
DC FPGA is an integral part of the complete ASIC-strength prototyping solution from Synopsys. Other tools supported in the Xilinx flow are Formality™ for formal verification, DesignWare® Library IP, Leda® for RTL design and code checking, PrimeTime® for static timing analysis, VCS® for simulation, Module Compiler™ for datapath synthesis, and HSPICE™ for analysis of multi-gigabit serial I/Os.
DC FPGA has a rapidly growing base of more than 100 customers. For more information about Design Compiler FPGA, visit www.synopsys.com/products/dcfpga/dcfpga.html.
| Design | Test Description | Max Delay with DSP48 (ns) | Max Delay without DSP48 (ns) |
|---|---|---|---|
| Test5 | Inputs A[16:0], B[16:0], C[47:0]; Q[47:0] = sel ? C + (A * B) : C - (A * B) | 5.680 | 8.177 |
| Test6 | Inputs A[16:0], B[16:0], C[16:0], D[16:0], E[16:0], F[16:0], G[16:0], H[16:0]; mult1[33:0] (FD) <= A * B + C * D; mult2[33:0] (FD) <= E * F + G * H; Q[47:0] = mult1 + mult2 | 6.151 | 7.631 |
* Input and output signals are signed
Table 2 - Design examples showing performance improvement of advanced DC FPGA DSP inference
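The maximum delays in Table 2 are easier to compare as clock rates. The short Python sketch below assumes that each maximum delay equals the minimum clock period of the design, which the article does not state explicitly; the helper names are hypothetical.

```python
# Convert the Table 2 maximum delays into approximate clock rates.
# ASSUMPTION: max delay (ns) is the minimum clock period of the design.

def fmax_mhz(max_delay_ns: float) -> float:
    """Clock frequency in MHz for a given critical-path delay in ns."""
    return 1000.0 / max_delay_ns


def speedup(delay_without_ns: float, delay_with_ns: float) -> float:
    """Ratio of the delay without DSP48 inference to the delay with it."""
    return delay_without_ns / delay_with_ns


if __name__ == "__main__":
    table2 = {"Test5": (5.680, 8.177), "Test6": (6.151, 7.631)}
    for name, (with_dsp, without_dsp) in table2.items():
        print(name,
              round(fmax_mhz(with_dsp), 1), "MHz with DSP48,",
              round(fmax_mhz(without_dsp), 1), "MHz without,",
              "ratio", round(speedup(without_dsp, with_dsp), 2))
```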
Selecting Connectors for Multi-Gigabit Transceiver Designs
With data transfer rates at 10 Gbps, connector choice is crucial.
by Marc Defossez
Sr. Staff Applications Engineer
Xilinx, [email protected]
In modern high-speed digital designs, connectors require careful attention; you can’t just use any one that’s available. When designing with Xilinx® Virtex-4™ multi-gigabit transceiver (MGT) devices, with data transfer rates increasing to 10 Gbps, connectors are part of the total solution.
It is often said that the silicon, in our case the FPGA, does all the work in a system. Passive components such as connectors get the blame for increasing design cost, complexity, and size, and therefore are often neglected.
Today’s digital designs enter the RF world with transfer speeds of 10 Gbps and more per data pair; thus, you can no longer ignore the overall impact connector choice has on a design.
Connector manufacturers must keep track of high-speed digital design needs while meeting the demand for multiple high-speed low-loss connections in a small connector shape. Connector design, therefore, becomes increasingly difficult.
The two worlds need to be combined; therefore, we advise following these steps when selecting a connector:
• Choose your connector type – backplane, board-to-board, board-to-cable, or mezzanine
• Find manufacturers carrying connectors with the right physical parameters
• Carefully examine the manufacturer’s electrical specifications, test reports, and other published references
Board-to-Backplane or Board-to-Board Connectors
A system in which multiple MGT signals (3.125 Gbps to 10 Gbps) cross directly from board to board or run over a backplane needs special connectors. The Teradyne™ GbX connector is a high-density, optimized differential connector family delivering data rates greater than 5 Gbps (tested up to 12 Gbps) (Figure 1).
In this same range, Tyco™-AMP offers the Z-Pack HM-Zd differential connector system, designed for serial switching applications from 3.125 Gbps to 6.4 Gbps (demonstrated at 12 Gbps) (Figure 2).
Both connector families are made specifically for high-data-transfer-rate designs such as enterprise switching equipment, telecommunications equipment, and mass data storage. They are robust, have a modular setup, and offer routability and optimal system performance.
Teradyne’s GbX advanced performance interconnects provide high-density optimized differential connectors. They are available in three-, four-, and five-pair versions and permit vertical and horizontal routing, making them the ideal solution for star or mesh backplane designs.
Tyco-AMP’s high-speed, differential, board-to-backplane electrical connectors are an extension of the already established IEC 61076-4-101 hard metric connector family. However, HM-Zd also provides a high-speed differential solution. Z-Pack HM-Zd connectors are available in two-, three-, and four-pair versions.
In board-to-board designs where size matters, Samtec’s™ QSE and QTE connector families support data transfer rates up to 6 Gbps (Figure 3).
For board to board, with a point-to-point setup, Samtec offers a reliable cable connection based on the QSE/QTE connector technology. The 50 ohm controlled impedance, 38 AWG mini coax ribbon cable (Figure 4) is available with as many as 240 signal lines, as well as a differential or single-ended flex-strip solution.
You can create custom connector specifications for both the QSE/QTE and ribbon cable on Samtec’s website and download cable specifications and test reports on crosstalk, travel delay, and impedance.
Mezzanine Board-to-Board Connectors
Mezzanine card systems are mostly used to relocate high-pin-count devices onto mezzanine or module cards, simplifying board routing without compromising system performance.
Mezzanine cards need high bandwidth and a large number of parallel connections as well as several serial connections. Teradyne’s version is the NexLev connector family, with performance up to 12 Gbps. This connector enables a vast number of connection possibilities at different connector heights.
The NexLev connector is built in a stripline construction, providing a continuous ground plane for each signal contact (Figure 5). The connectors come as ten-row connectors with 100, 200, or 300 positions at possible stacking heights from 10 mm to 30 mm. You can find technical figures at www.teradyne.com/prods/tcs/products/connectors/mezzanine/nexlev/signintegr.html#differential.
Samtec offers a similar solution with its YFS/YFT single-ended and differential-pair connector arrays called SamArray (Figure 6). These connectors have a performance up to 10 Gbps and comprise a vast number of single-ended connections. Differential signaling is obtained through pin layout (Figure 7).
Connectors are offered as five-, eight-, or ten-row with as many as 50 contacts per row, for stacking heights from 5 to 25 mm. Technical figures are provided in PDF format at www.samtec.com/signal_integrity/technical_specifications/electrical.asp?series=YFS-DP&stack=25&menu=Signal_Integrity.
Mezzanine connectors have a BGA footprint and can be treated by assembly machines as regular BGA components. Experience with these connectors showed that they are best glued to the PCB before soldering. If not glued, there is a great chance that the connector will move during soldering.
Connectors for Cable Connections
For design reasons you may not be able to use the connectors described above. In this case you can still turn to older solutions, such as the well-known SMA connector and the small MMCX connector.
SMA is an acronym for “SubMiniature version A,” first developed in the 1960s. SMA connectors are 50 ohm, semi-precision subminiature units that provide excellent electrical performance from DC to 18 GHz with a threaded interface. These high-performance connectors are compact in size and have outstanding mechanical specifications.
Besides the standard straight, 90-degree, and edge-launch versions, an SMT-mount version is now also available (Figure 8). This SMT version is preferable to the others because of its performance characteristics.
The MMCX series is sometimes also called MicroMate. It is the smallest RF connector and was developed in the 1990s. MMCX is a micro-miniature connector series with a lock-snap mechanism, allowing for 360 degrees rotation and thus enabling great flexibility in PCB layouts. MMCX connectors conform to the European CECC 22000 specification.
Figure 1 – Teradyne GbX connector
Figure 2 – Tyco-AMP Z-Pack HM-Zd connector
Figure 3 – Samtec QSE and QTE connector
Figure 4 – Samtec ribbon cable
Figure 5 – Stripline construction of NexLev [pin grid, rows A to J by columns 1 to 6; labels: Data Pair, Grounded Shielded Plates]
MMCX products range to 6 GHz for a 50 Ω interconnect system. A set of connectors includes surface mount, edge card, and cable connectors. Here the SMT version is preferable (Figure 9).
You can purchase ready-made, custom, and length-matched cable interconnects for this type of connection from different sources and choose between flexible or semi-rigid cabling.
Connector Basics
Suppose you’ve selected your IC devices and your board has been laid out with all of the right design rules, such as:
• Controlled impedance traces
• Controlled time delay of stubs
• Stubs shorter than about 20% of the fastest signal’s rise time
• Time delay of discontinuities shorter than about 15% of the fastest signal’s rise time
• Adjacent traces spaced far enough apart to keep crosstalk at an acceptable level
• A stack-up with power and ground planes on adjacent layers of silicon
• A continuous return path under each signal trace
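The two percentage rules in the list above translate directly into delay and length budgets. The Python sketch below is only an illustration; the propagation delay of about 6.7 ps/mm (roughly 170 ps per inch, a typical FR-4 figure) and the 50 ps rise time are assumptions, not values from this article, and the function names are hypothetical.

```python
# Turn the "20% of rise time" stub rule and the "15% of rise time"
# discontinuity rule into delay and length budgets.
# ASSUMPTION: propagation delay of ~6.7 ps/mm; use your own stack-up data.

def max_delay_ps(rise_time_ps: float, fraction: float) -> float:
    """Largest allowed electrical delay for a given fraction of the rise time."""
    return fraction * rise_time_ps


def max_length_mm(rise_time_ps: float, fraction: float,
                  delay_ps_per_mm: float = 6.7) -> float:
    """Physical length that corresponds to the allowed delay."""
    return max_delay_ps(rise_time_ps, fraction) / delay_ps_per_mm


if __name__ == "__main__":
    rise_time_ps = 50.0  # hypothetical edge rate for a multi-gigabit link
    print("stub budget:", max_delay_ps(rise_time_ps, 0.20), "ps,",
          round(max_length_mm(rise_time_ps, 0.20), 2), "mm")
    print("discontinuity budget:", max_delay_ps(rise_time_ps, 0.15), "ps,",
          round(max_length_mm(rise_time_ps, 0.15), 2), "mm")
```

At edge rates like this, the budgets come out at a millimeter or two, which is why connector pin length and break-out routing matter so much.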
You’re not quite done yet. In high-performance systems, every element must be optimized for the entire system to meet performance, schedule constraints, size, and cost. It is like a chain – every link must be strong for the whole to meet the demanding performance specs of today’s high-speed products.
How can components like connectors affect system performance? Usually the potential problems are lumped into two categories: timing and noise, together referred to as signal integrity (SI).
What is important when selecting connectors?
• EMI, translated to series inductance
• Crosstalk, translated to mutual inductance
• Signal propagation, as parasitic capacitance
Series Inductance
The most fundamental effect a connector adds to a circuit is series inductance. The primary factor in the series inductance is the pin length of the connector. Together with the series inductance of each connector pin, the pin layout of the connector determines the radiated EMI (electromagnetic interference).
Signals traveling through a connector need a current return path (ground). If no return path is provided through the connector, large inductive loops can be created (Figure 10). This will result in substantial EMI emission.
Differential signaling solves the problem of current return paths by eliminating it. Differential signaling uses two identical but opposite signals. The return currents are therefore also opposite to each other (Figure 11), and they cancel out. The only return current left from a differential pair is caused by an imbalance between the two signals, because in practice they never cancel exactly.
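The cost of series inductance can be estimated with the basic relation V = L · di/dt. The Python sketch below applies it to a single connector pin; the 2 nH inductance, 10 mA current step, and 50 ps edge are hypothetical numbers chosen for illustration, not data for any connector named in this article.

```python
# Voltage developed across a pin's series inductance during a current edge,
# using V = L * di/dt with a linear edge.
# ASSUMPTION: all example values are hypothetical.

def inductive_voltage(l_nh: float, di_ma: float, dt_ps: float) -> float:
    """Return the induced voltage in volts for L in nH, di in mA, dt in ps."""
    inductance_h = l_nh * 1e-9
    current_step_a = di_ma * 1e-3
    edge_time_s = dt_ps * 1e-12
    return inductance_h * current_step_a / edge_time_s


if __name__ == "__main__":
    volts = inductive_voltage(l_nh=2.0, di_ma=10.0, dt_ps=50.0)
    print(f"induced voltage: {volts:.2f} V")
```

Halving the pin length (and with it, roughly, the inductance) or slowing the edge reduces the induced voltage in proportion, which is why short pins and well-placed ground returns matter.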
Figure 6 – Samtec YFS/YFT connector
Figure 7 – Best- and worst-case pin layout for YFS/YFT [two pin grids, rows A to J by columns 1 to 8, titled Best Case Pin Setup and Worst Case Pin Setup; legend: Data Pair, Ground]
Figure 8 – SMA edge launch, SMT
Figure 9 – MMCX edge launch, SMT
Mutual Inductance
Current loops illustrate mutual inductive coupling in Figure 12. Current leaving device A returns through signal return path X. Likewise, currents leaving devices B and C return through signal return paths Y and Z.
Because all of these paths overlap, magnetic fields from one path induce electric voltages (noise) in other paths. The induced noise will be larger or smaller depending on the physical location of a path. In our example, Y will receive more noise than Z because it shares more area.
Crosstalk between differential signals is far less of a concern. Because the two signals of a pair are equal and opposite, the coupled noise largely cancels out.
Parasitic Capacitance
Mutual and shunt (pin-to-pin) capacitance is another effect that comes with a connector; usually you can ignore it. The effect of capacitance is to slow down the system edge rate. In multi-drop backplane applications, parasitic capacitance places more burdens on connectors than in point-to-point applications.
Transmitted signals pass each tap on the bus; the cumulative effect of the parasitic capacitance and the series inductance of the source connector can distort the signals.
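A rough feel for how shunt capacitance slows an edge comes from a common rule of thumb: a capacitance C hanging in the middle of a line of impedance Z0 sees Z0/2, the 10 to 90 percent rise time of that RC network is about 2.2·R·C, and rise times combine as a root sum of squares. The Python sketch below uses these approximations; they and the example values are assumptions for illustration, not figures from this article.

```python
import math

# Edge-rate degradation from a shunt capacitance on a transmission line.
# ASSUMPTIONS: the load sees Z0/2, t_rc = 2.2 * R * C, and rise times add
# as a root sum of squares. Ohms times picofarads gives picoseconds.

def rc_rise_time_ps(z0_ohm: float, c_pf: float) -> float:
    """10-90% rise time contributed by a shunt capacitance mid-line."""
    return 2.2 * (z0_ohm / 2.0) * c_pf


def degraded_rise_time_ps(incoming_ps: float, z0_ohm: float, c_pf: float) -> float:
    """Approximate rise time after the edge passes the capacitive load."""
    return math.sqrt(incoming_ps ** 2 + rc_rise_time_ps(z0_ohm, c_pf) ** 2)


if __name__ == "__main__":
    # Hypothetical case: 50 ps edge, 50 ohm line, 1 pF connector tap.
    print(round(rc_rise_time_ps(50.0, 1.0), 1), "ps from the tap alone")
    print(round(degraded_rise_time_ps(50.0, 50.0, 1.0), 1), "ps after the tap")
```

On a multi-drop bus each tap adds its own contribution, which is why the cumulative effect is worse than in a point-to-point link.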
Connector Selection
To provide excellent high-speed connectors, manufacturers need to control and manage the above parameters as well as a lot more. Engineers now have access to an extensive amount of data measured and calculated by connector manufacturers.
On most manufacturers’ websites, electrical, mechanical, and SI information is available, together with PCB drawing and simulation aids:
• Mechanical
– Dimension drawing in PDF format
– 3D models in IGES, STEP, or Parasolid ACIS format
– Mechanical qualification and stress test reports
– PCB layout tool library components
• Electrical
– Electrical test reports
– Application notes
– SI parameters and results
– Datasheets
• Simulation
– IBIS and SPICE models
An extra service offered by Samtec is the “Final Inch” website, for designing a connector break-out region on a PCB.
The manufacturers mentioned in this article are not the only high-speed connector manufacturers on the market. There are other companies such as ERNI™, Hirose, Molex™, Amphenol™, and Radiall™ manufacturing (under license) similar connectors. Many other companies have their own range of high-speed connectors.
Conclusion
Today’s high-speed digital design engineers can benefit from the RF knowledge of connector suppliers, using the information available in datasheets, application notes, and on the Internet.
You can use this article as a starting point for better PCB and connector design.
For more information, see the books “High-Speed Digital System Design” by Stephen H. Hall, Garrett W. Hall, and James A. McCall; “High-Speed Digital Design” by Howard Johnson; or visit www.johnson-comp.com, www.samtec.com, www.samtec.com/sudden_service/current_literature/q-pairs/index.html, www.samtec.com/sudden_service/current_literature/SamArray/index.html, www.teradyne.com/prods/tcs, and hmzd.tycoelectronics.com.
Figure 10 – EMI generated due to improper current return paths [labels: devices A and B, Connector, Loop 1, Loop 2, Return Current Splits; note: This effect can be minimized through the use of enough ground pins in the connector.]
Figure 11 – Differential eliminated returned signal currents [labels: Driver, Receiver, Positive Current Loop, Negative Current Loop]
Figure 12 – Mutual inductive coupling through a connector [labels: devices A, B, and C; Path X, Path Y, Path Z]
Xilinx/Micron Partner to Provide High-Speed Memory Interfaces
Micron’s RLDRAM II and DDR/DDR2 memory combines performance-critical features to provide both flexibility and simplicity for Virtex-4-supported applications.
by Mike Black
Strategic Marketing Manager
Micron Technology, [email protected]
With network line rates steadily increasing, memory density and performance are becoming extremely important in enabling network system optimization. Micron Technology’s RLDRAM™ and DDR2 memories, combined with Xilinx® Virtex-4™ FPGAs, provide a platform designed for performance.
This combination provides the critical features networking and storage applications need: high density and high bandwidth. The ML461 Advanced Memory Development System (Figure 1) demonstrates high-speed memory interfaces with Virtex-4 devices and helps reduce time to market for your design.
Micron Memory
With a DRAM portfolio that’s among the most comprehensive, flexible, and reliable in the industry, Micron has the ideal solution to enable the latest memory platforms. Innovative new RLDRAM and DDR2 architectures are advancing system designs farther than ever, and Micron is at the forefront, enabling customers to take advantage of the new features and functionality of Virtex-4 devices.
RLDRAM II Memory
An advanced DRAM, RLDRAM II memory uses an eight-bank architecture optimized for high-speed operation and a double-data-rate I/O for increased bandwidth. The eight-bank architecture enables RLDRAM II devices to achieve peak bandwidth by decreasing the probability of random access conflicts.
In addition, incorporating eight banks results in a reduced bank size compared to typical DRAM devices, which use four. The smaller bank size enables shorter address and data lines, effectively reducing the parasitics and access time.
Although bank management remains important with RLDRAM II architecture, even at its worst case (burst of two at 400 MHz operation), one bank is always available for use. Increasing the burst length of the device increases the number of banks available.
I/O Options
RLDRAM II architecture offers separate I/O (SIO) and common I/O (CIO) options. SIO devices have separate read and write ports to eliminate bus turnaround cycles and contention. Optimized for near-term read and write balance, RLDRAM II SIO devices are able to achieve full bus utilization.
Alternatively, CIO devices have a shared read/write port that requires one additional cycle to turn the bus around. RLDRAM II CIO architecture is optimized for data streaming, where the near-term bus operation is either 100 percent read or 100 percent write, independent of the long-term balance. You can choose an I/O version that provides an optimal compromise between performance and utilization.
The RLDRAM II I/O interface provides other features and options, including support for both 1.5V and 1.8V I/O levels, as well as programmable output impedance that enables compatibility with both HSTL and SSTL I/O schemes. Micron’s RLDRAM II devices are also equipped with on-die termination (ODT) to enable more stable operation at high speeds in multipoint systems. These features provide simplicity and flexibility for high-speed designs by bringing both end termination and source termination resistors into the memory device. You can take advantage of these features as needed to reach the RLDRAM II operating speed of 400 MHz DDR (800 MHz data transfer).
At high-frequency operation, however, it is important that you analyze the signal driver, receiver, printed circuit board network, and terminations to obtain good signal integrity and the best possible voltage and timing margins. Without proper terminations, the system may suffer from excessive reflections and ringing, leading to reduced voltage and timing margins. This, in turn, can lead to marginal designs and cause random soft errors that are very difficult to debug. Micron’s RLDRAM II devices provide simple, effective, and flexible termination options for high-speed memory designs.
On-Die Source Termination Resistor
The RLDRAM II DQ pins also have on-die source termination. The DQ output driver impedance can be set in the range of 25 to 60 ohms. The driver impedance is selected by means of a single external resistor to ground that establishes the driver impedance for all of the device DQ drivers.
As was the case with the on-die end termination resistor, using the RLDRAM II on-die source termination resistor eliminates the need to place termination resistors on the board, saving design time, board space, material costs, and assembly costs, while increasing product reliability. It also eliminates the cost and complexity of end termination for the controller at that end of the bus. With flexible source termination, you can build a single printed circuit board with various configurations that differ only by load options, and adjust the Micron RLDRAM II memory driver impedance with a single resistor change.
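The RLDRAM II speed quoted above (400 MHz DDR, 800 MHz data transfer) follows from the double-data-rate interface moving data on both clock edges. The Python sketch below works through that arithmetic; the 36-bit data bus width is a hypothetical example, not a figure from this article, and the function names are illustrative.

```python
# Peak transfer rate and bandwidth of a double-data-rate interface.
# ASSUMPTION: the 36-bit bus width below is a hypothetical configuration.

def ddr_transfers_per_second(clock_hz: int) -> int:
    """A DDR interface transfers data on both clock edges."""
    return 2 * clock_hz


def peak_bandwidth_bits_per_second(clock_hz: int, bus_width_bits: int) -> int:
    """Peak bandwidth = transfers per second times bus width."""
    return ddr_transfers_per_second(clock_hz) * bus_width_bits


if __name__ == "__main__":
    clock_hz = 400_000_000
    print(ddr_transfers_per_second(clock_hz) // 1_000_000, "million transfers/s")
    print(peak_bandwidth_bits_per_second(clock_hz, 36) / 1e9, "Gbps peak (x36 bus)")
```

This is a peak figure; sustained bandwidth also depends on bank conflicts and, for CIO devices, on bus turnaround cycles.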
DDR/DDR2 SDRAM
DRAM architecture changes enable twice the bandwidth without increasing the demand on the DRAM core, and keep the power low. These evolutionary changes enable DDR2 to operate between 400 MHz and 533 MHz, with the potential of extending to 667 MHz and 800 MHz. A summary of the functionality changes is shown in Table 1.
Modifications to the DRAM architecture include shortened row lengths for reduced activation power, burst lengths of four and eight for improved data bandwidth capability, and the addition of eight banks in 1 Gb densities and above.
New signaling features include on-die termination (ODT) and off-chip driver (OCD) calibration. ODT provides improved signal quality, with better system termination on the data signals. OCD calibration provides the option of tightening the variance of the pull-up and pull-down output driver at 18 ohms nominal.
Modifications were also made to the mode register and extended mode register, including column address strobe (CAS) latency, additive latency, and programmable data strobes.
Conclusion
The built-in silicon features of Virtex-4 devices, including ChipSync™ I/O technology, SmartRAM, and Xesium differential clocking, have helped simplify interfacing FPGAs to very-high-speed memory devices. A 64-tap 80 ps absolute delay element as well as input and output DDR registers are available in each I/O element, providing for the first time a run-time center alignment of data and clock that guarantees reliable data capture at high speeds.
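The arithmetic behind center alignment with a tapped delay line is simple. The Python sketch below picks the tap count that comes closest to a half-bit delay; it uses the 64-tap, 80 ps figures from the paragraph above, while the 0 to 63 setting range, the 800 Mbps example, and the function names are assumptions. It says nothing about how the actual run-time calibration is performed.

```python
# Choose a delay-line tap count that shifts a signal by about half a bit
# period, the ideal sampling point in the center of the data eye.
# ASSUMPTION: taps are numbered 0..63 and each adds 80 ps.

TAP_PS = 80
NUM_TAPS = 64


def half_bit_period_ps(data_rate_mbps: float) -> float:
    """Half of one bit time, in picoseconds."""
    return 1e6 / data_rate_mbps / 2


def taps_for_delay(target_ps: float) -> int:
    """Nearest tap setting for a target delay, clamped to the tap range."""
    taps = round(target_ps / TAP_PS)
    return max(0, min(NUM_TAPS - 1, taps))


if __name__ == "__main__":
    target = half_bit_period_ps(800)      # hypothetical 800 Mbps data rate
    taps = taps_for_delay(target)
    print(target, "ps target,", taps, "taps,", taps * TAP_PS, "ps actual")
```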
Xilinx engineered the ML461 Advanced Memory Development System to demonstrate high-speed memory interfaces with Virtex-4 FPGAs. These include interfaces with Micron’s PC3200 and PC2-5300 DIMM modules, DDR400 and DDR2-533 components, and RLDRAM II devices.
In addition to these interfaces, the ML461 also demonstrates high-speed QDR-II and FCRAM-II interfaces to Virtex-4 devices. The ML461 system, which also includes a whole suite of reference designs for the various memory devices and the memory interface generator, will help you implement flexible, high-bandwidth memory solutions with Virtex-4 devices.
Please refer to the RLDRAM information pages at www.micron.com/products/dram/rldram/ for more information and technical details.
Figure 1 – ML461 Advanced Memory Development System
Harvesting the Flexibility of Virtex-4 RocketIO Transceivers
New features include support for all major serial I/O standards and multiple encoding schemes.
by Matt DiPaolo
APD Product Application Engineer
Xilinx, [email protected]
Ryan Carlson
Director of Marketing, High Speed Serial I/O
Xilinx, [email protected]
Xilinx® introduced FPGAs with integrated multi-gigabit serial transceivers (MGTs) more than three years ago. Since then, Virtex-II Pro™ devices have enabled hundreds of applications to move from parallel interfaces to high-speed serial interfaces, as designers took advantage of the integrated RocketIO™ transceivers.
With Virtex-II Pro devices, Xilinx led the industry with a transceiver capable of operating from 622 Mbps to 3.125 Gbps. Xilinx continues this trend with its new Virtex-4™ family, in which RocketIO transceivers can operate from 622 Mbps to over 10 Gbps (Figure 1). This broad speed range, coupled with a host of user-friendly, programmable options, creates an extremely flexible multi-gigabit transceiver.
Multiple Interface Standards
One trend occurring in multiple end-market segments is the widespread adoption of high-speed differential signaling schemes to address increased bandwidth demands. As designs move to faster interface speeds, a serial implementation reduces power, board space, design complexity, and ultimately cost.
Virtex-4 RocketIO transceivers were designed to enable high-speed data transmission for many different protocols. Table 1 shows all of the serial standards supported in Virtex-4 FPGAs.
Flexibility and Programmability
Xilinx brings its approach to FPGAs, making them user-programmable with maximum flexibility, to its multi-gigabit transceivers. This approach has impacted both of the major functional components of the RocketIO transceiver: the physical media attachment (PMA) block and the physical coding sublayer (PCS) block.
PMA Block
The Virtex-4 RocketIO PMA block supports all major serial I/O standards and complies with their physical layer requirements. For example, the RocketIO transceiver meets the OC-48 SONET/SDH specification (2.488 Gbps) for both transmit jitter generation and receive jitter tolerance.
This same transceiver can also meet the requirements of the Fibre Channel physical layer specification, and it can do so at 1.0625 Gbps, 2.125 Gbps, 4.25 Gbps, and 8.5 Gbps.
Other PMA features of the Virtex-4 RocketIO transceiver include:
• PCI Express-compliant electrical idle support
• PCI Express-compliant beaconing support
• PCI Express-compliant spread spectrum clocking support
• Multiple loopback modes, including a PMA Rx to Tx path
• Built-in clock dividers to reduce the need for DCMs in clocking use models
PCS Block
The Virtex-4 RocketIO PCS block supports multiple encoding schemes; both 8B10B and 64B66B encoders/decoders are built into the transceiver. You can select a 10-bit based data path (for Ethernet and data communications protocols) or a 16-bit based data path (for SONET/SDH-based protocols).
User-programmable clock correction sequences (CCS) allow synchronization differences between remote transceivers to be tolerated and corrected. Channel bonding sequences (CBS) enable you to connect multiple RocketIO transceivers together to create a logical channel with even more bandwidth. All of these features comply with industry standards (making designs easier to complete), while still supporting proprietary designs.
For applications requiring lower latency, a new feature of the Virtex-4 RocketIO transceiver is a reduced latency mode that allows you to bypass the receive and transmit FIFOs (as well as other function blocks), offering a 50% reduction in latency from previous generations of Xilinx transceivers.
Other PCS features of the Virtex-4 RocketIO transceiver include:
• Multiple loopback modes, including a PMA Rx to Tx path
• Comma detection, including A1A1A2A2 for SONET applications
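The two encoding schemes mentioned in this article, 8B10B and 64B66B, trade line rate for payload differently: 8B10B carries 8 data bits in every 10 line bits, and 64B66B carries 64 in every 66. The Python sketch below computes the payload rate for a given line rate. The 3.125 Gbps figure appears in this article; the 10.3125 Gbps line rate is the well-known 10 Gigabit Ethernet serial rate and is included here as an outside example. The function name is hypothetical.

```python
from fractions import Fraction

# Payload rate after line-coding overhead. Exact fractions avoid rounding.
EFFICIENCY = {
    "8B10B": Fraction(8, 10),    # 8 data bits per 10 line bits
    "64B66B": Fraction(64, 66),  # 64 data bits per 66 line bits
}


def payload_rate_gbps(line_rate_gbps: str, encoding: str) -> Fraction:
    """Payload rate in Gbps for a line rate given as a decimal string."""
    return Fraction(line_rate_gbps) * EFFICIENCY[encoding]


if __name__ == "__main__":
    print(float(payload_rate_gbps("3.125", "8B10B")), "Gbps payload at 3.125 Gbps")
    print(float(payload_rate_gbps("10.3125", "64B66B")), "Gbps payload at 10.3125 Gbps")
```

The lower overhead of 64B66B (about 3 percent against 20 percent) is why it is favored at the highest line rates.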
Figures 2 and 3 show block diagrams of the Virtex-4 PCS (both receiver and transmitter).
Conclusion
The Virtex-4 RocketIO transceiver is the complete solution for today’s high-speed serial designs, with a broad speed range (622 Mbps to over 10 Gbps) and programmable PCS functions (optional encoding schemes, channel bonding, and clock correction).
For more information about the Virtex-4 FPGA family, visit www.xilinx.com/virtex4/. For more details about the functionality and design recommendations with Virtex-4 RocketIO transceivers, see the Virtex-4 RocketIO transceiver user guide at www.xilinx.com/bvdocs/userguides/ug076.pdf.
Figure 2 – Virtex-4 RocketIO PCS (receiver) [blocks: PMA, comma detect/align, 10G block sync, 8B10B decode, 10G descrambler, 10G decode, channel bonding and clock correction, 16x52-bit ring buffer, sync control logic, dynamic configuration; callouts: User-Selectable Alignment and Clock Correction, Enables Aurora, Ethernet, Fibre Channel, and SONET; Low-Latency Bypass Modes for Custom Designs]
Figure 3 – Virtex-4 RocketIO PCS (transmitter) [blocks: 8B10B, 64B66B encode, 6x40-bit ring buffer, 10GbE gearbox, 10GbE scrambler, PMA, dynamic configuration; callouts: Built-In Support for Multiple Protocols; Low-Latency Bypass Modes for Custom Designs; Real-Time Reconfiguration of RocketIO Settings (e.g., Rx EQ)]
Xilinx Events and Tradeshows
Xilinx participates in numerous trade shows and events throughout the year. This is a perfect opportunity to meet our silicon and software experts, ask questions, see demonstrations of new products and technologies, and hear other customers’ success stories with Xilinx products. For more information and the most up-to-date schedule, visit: www.xilinx.com/events/.
Worldwide Events Schedule
North America
Jan. 31 - Feb. 3: DesignCon West, Santa Clara, CA
February 15-17: TI Developers Conference, Houston, TX
March 1-3: Intel Developer Forum, San Francisco, CA
March 8-10: Embedded Systems Conference, San Francisco, CA
Europe
Jan. 31 - Feb. 2: Elektronik Systeme im Automobil, Munich, Germany
February 1-3: EP05 Electronic Exhibition, Stockholm, Sweden
February 14-17: 3GSM World Congress, Cannes, France
February 22-24: Embedded World, Nuremberg, Germany
March 16-17: Workshop SoC Défense, Brussels, Belgium
March 16-17: Hi-Tech Technologies, Tel Aviv, Israel
March 17-18: AMAA Conference and Exhibition, Berlin, Germany
Japan
January 29-30: EDSF, Yokohama, Japan
February 15: Processor Seminar, Osaka, Japan
February 21: Processor Seminar, Tokyo, Japan
Optimize Memory Subsystem Performance with Network FCRAM
Toshiba’s Network FCRAM often provides the best cost/performance by combining DRAM densities with random cycle performances that approach SRAM speeds.
by Scott Beekman
Business Development Manager
Toshiba America Electronic Components, [email protected]
Among the many cost/performance tradeoffs system designers face, one of the critical decisions in network systems, communications equipment, and high-performance consumer electronics is the type of memory to use to ensure that performance can keep pace with the processor.
Traditionally, network system designers had to choose between dynamic random access memory (DRAM), available at a lower cost-per-bit because of the high volumes used in personal computers, or higher performance static random access memory (SRAM), available only in low densities and at a much higher cost. A combination of the two is typically used, with DRAM for buffer memory and SRAM for look-up table (LUT) memory.
More recently, high-performance, low-latency DRAM solutions developed specifically for high-bandwidth applications, including Toshiba’s™ Network FCRAM™ (fast cycle random access memory), provide another alternative. Which type of memory is right for your particular system? What additional requirements for memory controllers are associated with each choice?
Generally, you can choose the option that provides the highest performance within the system’s specified cost constraints, and in the time available to bring the system to market. In many cases, Network FCRAM provides the best cost/performance for networking and communications customers by combining DRAM densities with random cycle performance that approaches SRAM speeds. This allows equipment manufacturers to develop higher performance, lower cost, and lower power communications systems than they could with double-data-rate synchronous dynamic RAM (DDR SDRAM) and high-speed static RAM (HSSRAM).
In this article, we provide an overview of Network FCRAM and the advantages it offers in comparison to standard DDR SDRAM or high-speed SRAM, and discuss the alternatives available for memory controllers supporting Network FCRAM.
Network FCRAM
Toshiba Network FCRAM is a high-performance, low-cost replacement for DDR SDRAM and high-speed SRAM, targeted primarily at buffer memory and LUT memory in networking/telecom applications. Network FCRAM incorporates enhanced DRAM technology optimized for the high-bandwidth, low-latency requirements of network and communication systems. Narrowing the active memory area achieves low power consumption and random cycle performance almost triple that of standard DRAM.
Network FCRAM devices offer the following advantages:
• Fast random cycle time (tRC) of 20 ns to 25 ns
• Fast data transfer rate of 666 Mbps+ (For purposes of measuring data transfer rate in this context, megabit per second and/or Mbps = 1,000,000 bits per second.)
• Large density up to 512 Mb (When used in relation to memory density, megabit and/or Mb means 1,024 x 1,024 = 1,048,576 bits. Usable capacity may be less. For details, please refer to specifications.)
• Simplified command input
• Low power consumption
• Multiple sources
Network FCRAM technology excels in applications where you need DRAM densities and random cycle performance approaching SRAM-like speeds. Its high bandwidth and low latency make Network FCRAM suitable for network applications, cache applications, and high-performance consumer applications. Typical network equipment applications include packet buffer memory, table look-up memory, and external cache memory in servers. Network FCRAM is also being used in digital consumer and supercomputer applications.
Performance Comparison
Network FCRAM and the specification-compatible, dual-source Samsung™ Network DRAM™ feature some of the shortest cycle times and latencies among existing DRAM. As a result, Network FCRAM can improve system performance approximately 20 to 25 percent in comparison to DDR SDRAM. This is achieved as a result of higher data transfer rates, as shown in Figure 1, and an approximately threefold faster random cycle time (tRC), as shown in Figure 2.
As an alternative to HSSRAM, Network FCRAM costs approximately 1/16th as much per bit, and offers much higher densities (up to 512 Mb) compared to maximum densities of 36 Mb or 72 Mb for HSSRAM. Network FCRAM offers not only performance improvement alternatives but also lower-cost solutions, as shown in Figure 3.
Customers today are taking advantage of these features to boost performance and bring down their system’s cost by replacing DDR SDRAM with Network FCRAM, thus reducing chip count and board space because of Network FCRAM’s higher performance, and/or by replacing HSSRAM.
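To put the cycle-time and density figures quoted in this article in perspective, the short Python sketch below converts a random cycle time into a random access rate and counts how many HSSRAM devices it would take to match one 512 Mb Network FCRAM. The function names are hypothetical, and the arithmetic assumes that every random access has to wait a full tRC and that capacity is the only sizing constraint. It is back-of-the-envelope arithmetic, not a datasheet calculation.

```python
import math


def random_accesses_per_second(trc_ns):
    """Upper bound on random accesses per second if each access waits one full tRC."""
    return 1e9 / trc_ns


def devices_needed(target_mb, device_mb):
    """Number of devices of device_mb megabits needed to hold target_mb megabits."""
    return math.ceil(target_mb / device_mb)


if __name__ == "__main__":
    # tRC values quoted in the article: 20 ns to 25 ns
    for trc in (20.0, 25.0):
        rate = random_accesses_per_second(trc) / 1e6
        print(f"tRC = {trc} ns -> {rate:.0f} million random accesses per second")
    # HSSRAM maximum densities quoted in the article: 36 Mb or 72 Mb
    for sram_mb in (36, 72):
        count = devices_needed(512, sram_mb)
        print(f"{sram_mb} Mb HSSRAM devices to match 512 Mb: {count}")
```

Under these assumptions, the shorter the tRC, the more independent table look-ups or packet-buffer accesses a single device can serve per second, which is the property the article contrasts with DDR SDRAM.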
[Figure 1: chart of data transfer rate (Mbps, 0 to 1,200) by year (2000 to 2005) for SDR, DDR, DDR-II, and Network FCRAM]
Figure 1 – Faster data transfer rates with Network FCRAM
[Figure 2: chart of random cycle time (ns, 0 to 80) by year (2000 to 2005) for SDR, DDR, DDR-II, and Network FCRAM]
Figure 2 – Network FCRAM typically provides 20 to 25 percent higher system performance than DDR SDRAM offers, in part because of its faster random cycle time (approximately three times faster).
Selecting the Right FCRAM
Network FCRAM is available with a selection of interfaces, speeds, and organizations to meet various requirements:
• 256 Mb (x8/x16) Network FCRAM1 (up to 400 Mbps with tRC = 25 ns)
• 288 Mb (x18) Network FCRAM2 (up to 666 Mbps with tRC = 20 ns)
• 512 Mb (x8/x16) Network FCRAM1 (up to 533 Mbps with tRC = 22.5 ns)
Network FCRAM1 supports non-ECC bit densities (such as 256 Mb and 512 Mb as a single component), while Network FCRAM2 supports ECC bit densities (such as 288 Mb with roadmaps to higher densities).
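When comparing the devices listed above, it can help to turn the quoted data rates into a peak bandwidth per device. The Python sketch below does that under one stated assumption: that the quoted Mbps figure applies to each data pin, so peak bandwidth is data width times rate. The article does not spell this out, so treat the results as an illustration. The device table and the function name are hypothetical, and the widest listed organization is used for each part.

```python
MBPS = 1_000_000  # the article defines 1 Mbps as 1,000,000 bits per second


def peak_bandwidth_gbps(width_bits, rate_mbps_per_pin):
    """Peak bandwidth in Gbps (10^9 bits/s), assuming every data pin runs at the full rate."""
    return width_bits * rate_mbps_per_pin * MBPS / 1e9


# (name, data width in bits, data rate in Mbps), taken from the list in the article
DEVICES = [
    ("256 Mb Network FCRAM1 (x16)", 16, 400),
    ("288 Mb Network FCRAM2 (x18)", 18, 666),
    ("512 Mb Network FCRAM1 (x16)", 16, 533),
]

if __name__ == "__main__":
    for name, width, rate in DEVICES:
        print(f"{name}: {peak_bandwidth_gbps(width, rate):.3f} Gbps peak")
```

Peak figures like these ignore refresh, bus turnaround, and command overhead, so sustained bandwidth in a real controller will be lower.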
Memory Controllers
Once you have selected Network FCRAM as the memory of choice for a design, the next step is to determine the best source of a memory controller for your system. For large-volume applications, some customers develop custom ASICs that include the memory controller; in addition, many network processors (NPUs) now support Network FCRAM. However, for many smaller volume applications, FPGAs offer lower cost and faster time to market.
When evaluating memory alternatives for network systems, consider the performance advantages of Network FCRAM and the time-to-market advantages of an FPGA-based memory controller.
Development Tools
Toshiba offers several design guides to help customers and systems architects identify the key advantages of incorporating Network FCRAM technology into their high-performance applications. Network FCRAM devices are also supported by advanced simulation models to facilitate and accelerate design-in activity. Models supported include Verilog™, HSPICE™ and IBIS models, and SOMA models jointly developed by Toshiba and Denali™ Software Inc. For more information, visit www.fcram.toshiba.com.
Conclusion
As a result of Network FCRAM’s cost-performance advantages, today it is designed into more than 100 network solutions at more than 70 companies. Toshiba first introduced Network FCRAM working samples in 1999 and has continued to expand its product offering and build momentum in the network/telecom market.
Today, Network FCRAM is in production with data transfer rates as high as 666 Mbps and random cycle time performance as low as 20 ns. Toshiba now supports three densities in mass production, with higher density, higher bandwidth, and faster devices planned for 2005.
The official Network FCRAM/DRAMwebsite can be found at www.networkfcram.com.
Virtex-II and Virtex-II Pro are trademarks of Xilinx, Inc.
[Figure 3: chart of random cycle time (ns, 0 to 60) against bit cost (low to high) for 18 Mb NtRAM, 256 Mb DDR1 SDRAM, and 288 Mb Network FCRAM. Annotations: "Higher Performance: tRC is 3 Times Faster" and "Lower Cost/Bit: 10 to 16 Times or Less"]
FCRAM (Fast Cycle RAM ) is a trademark or a registered trademark of Fujitsu Limited, Japan. Memory Modeler AV is a trademark of Denali Software Inc. Network DRAM is a trademark or a registered trademark of Samsung Electronics Co., Ltd. Korea.
Figure 3 – Network FCRAM can also be a lower cost alternative to HSSRAM, as it costs approximately 1/10th to 1/16th as much per bit.
by Suhel Dhanani
Sr. Marketing Manager, Spartan Solutions
Xilinx, [email protected]
All low-cost FPGAs provide basic logic capability at attractive prices and serve a broad range of general-purpose design requirements. When you consider embedding DSP functions in an FPGA fabric, however, you may believe that you must choose high-end FPGAs to get platform features such as embedded multipliers and distributed memory.
With Spartan-3™ FPGAs, the landscape for embedded DSP has changed. Spartan-3 devices may be low cost, but they also have the platform features required for DSP designs. These platform features allow area-efficient implementation of signal processing functions, allowing you to realize significantly lower price points.
Spartan-3 devices are ideal as coprocessors or pre-/post-processors, offloading highly computational functions from a programmable DSP to enhance system performance.
Using Spartan-3 FPGAs to Implement High-Performance DSP
Spartan-3 FPGAs provide breakthrough cost points for embedded DSP.
Optimized for DSP
The Spartan-3 family from Xilinx uses 90 nm process technology in conjunction with 300 mm wafers to dramatically lower the cost of FPGAs. At the same time, the devices incorporate key DSP resources such as embedded 18 x 18-bit multipliers and large blocks (18 kb) of memory, distributed RAM, and shift-register logic. This advanced feature set means that you can use Spartan-3 FPGAs to implement DSP algorithms at a significantly lower cost than competing FPGAs. The specific features that help in efficiently implementing DSP are shown in Figure 1.
In addition to increasing the basic performance of systems, these embedded features enhance device utilization. For instance, the embedded Spartan-3 multiplier would take 300-400 logic elements (LEs) if implemented in the logic fabric. And because the embedded multiplier is adjacent to logic fabric, augmenting the functionality (such as creating accumulators or concatenating the multipliers to create complex arithmetic functions) is fairly straightforward.
Many DSP functions are best implemented in pipelines with time multiplexing for efficiency. This allows you to create faster systems with higher bandwidth, but it comes at the expense of requiring more interim storage elements. For example, a time-multiplexed filter would store the results of individual multiply-accumulate cells in shift registers. Such designs can run out of registers or memory before they run out of logic resources. The Spartan-3 FPGA family is unique in providing a mode where a single look-up table (LUT) is capable of implementing logic functions or acting as a 16-bit shift register.
As shown in Figure 2, this architecture enhancement allows you to use a single LUT in place of 16 registers, maximizing area efficiency when implementing time-multiplexed DSP functions.
Many DSP functions are also extremely memory-intensive, requiring scratch-pad memory for storing coefficients, implementing FIFOs, and large buffers. As shown in Figure 3, Spartan-3 devices provide more memory bits than other low-cost FPGAs available today.
For many DSP designs, the critical resource is the embedded memory within the FPGA, not logic or multipliers. Because of insufficient memory, designers using competing low-cost devices may have to migrate to a larger device or use external memory for systems that would fit into a single, small Spartan-3 FPGA.
[Figure 1: architecture diagram; only the labels 16, 40, 16x, and k0 to k3 survive in the extracted text]
Figure 1 – Spartan-3 architecture optimized for lower DSP costs
[Figure 2: diagram of one LUT with inputs D, CE, and A3 to A0 and output Q, shown alongside a chain of D/Q registers]
Figure 2 – You can implement 16 registers in one LUT.
[Figure 3: chart of embedded memory (kb, 0 to 700) against logic elements (LEs, 0 to 35,000). Spartan-3 points: XC3S50 72K, XC3S200 216K, XC3S400 288K, XC3S1000 432K, XC3S1500 576K. Competing low-cost FPGA family points: 59K, 78K, 92K, 239K, 294K]
Figure 3 – Spartan-3 fabric provides significantly more memory resources than other competing low-cost FPGAs.
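The shift-register LUT mode described in this section can be pictured with a small behavioral model. The Python class below is a simplified sketch written for this purpose, not the Xilinx simulation model of the primitive: the class name and method names are hypothetical, and only the pin names (D, CE, A3 to A0, Q) come from Figure 2. It assumes that address 0 selects the most recently stored bit.

```python
class SRL16:
    """Behavioral sketch of a LUT used as a 16-bit addressable shift register."""

    def __init__(self):
        # bits[0] holds the most recently shifted-in value, bits[15] the oldest
        self.bits = [0] * 16

    def clock(self, d, ce=1):
        """On a clock edge with CE high, shift D in and drop the oldest bit."""
        if ce:
            self.bits = [d & 1] + self.bits[:-1]

    def q(self, addr):
        """Return the bit selected by the 4-bit address A3..A0 (0 to 15)."""
        return self.bits[addr & 0xF]


if __name__ == "__main__":
    srl = SRL16()
    for bit in (1, 0, 1, 1):
        srl.clock(bit)
    # Address 0 reads the newest bit; larger addresses read older bits.
    print([srl.q(a) for a in range(4)])
```

Because the read address selects the tap, one such element can stand in for a delay line of any length from 1 to 16, which is why time-multiplexed filters map onto it so compactly.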
Common DSP Functions
Let’s see how these features impact device utilization by looking at two implementation examples of a finite impulse response (FIR) filter. One is a MAC-based implementation, while the other is a multi-channel distributed arithmetic (DA) implementation.
FIR filters are commonly used in base stations, digital video, wireless LANs, xDSL, and cable modems. Our benchmark is the implementation of a 64-tap, MAC FIR filter with 16-bit data and coefficients running at 130 MHz in a Spartan-3 XC3S400 FPGA. The first implementation uses a single MAC; the second implementation uses four MACs. Figure 4 shows the device utilization section of the report file for both implementations.
Going from a one-MAC to a four-MAC implementation dramatically increases the performance of the FIR filter. The number of LUTs only doubles and remains at just 4% of the total available logic. A four-MAC implementation uses four block RAMs and four multipliers to efficiently implement the FIR filter using minimum device logic resources.
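The trade-off between one and four MACs can be sketched in a few lines of Python. The first function below computes a FIR output the way a single multiply-accumulate unit would, one tap at a time; the second estimates the achievable sample rate. Both function names are hypothetical, and the rate estimate assumes one multiply per MAC per clock, taps divided evenly among the MACs, and no pipeline or control overhead. The article does not quote sample rates, so the numbers this produces are illustrative only.

```python
import math


def mac_fir(samples, coeffs):
    """Direct-form FIR computed as a single MAC would: one multiply-accumulate per tap."""
    history = [0] * len(coeffs)  # delay line, newest sample first
    out = []
    for x in samples:
        history = [x] + history[:-1]
        acc = 0
        for c, h in zip(coeffs, history):
            acc += c * h  # one MAC operation
        out.append(acc)
    return out


def max_sample_rate_hz(taps, macs, clock_hz):
    """Sample rate if each MAC performs one multiply per clock and taps are split evenly."""
    cycles_per_output = math.ceil(taps / macs)
    return clock_hz / cycles_per_output


if __name__ == "__main__":
    print(mac_fir([1, 0, 0, 0], [1, 2, 3]))  # impulse response echoes the coefficients
    for macs in (1, 4):
        rate = max_sample_rate_hz(64, macs, 130e6)
        print(f"{macs} MAC(s): about {rate / 1e6:.3f} Msamples/s")
```

Under these assumptions, quadrupling the number of MACs quadruples the sample rate of the 64-tap benchmark, at the cost of three more multipliers and block RAMs.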
Another interesting implementation is that of a multi-channel FIR function. In this case we can look at how the device utilization changes when we go from a one-channel FIR to an eight-channel FIR filter.
As shown in Figure 5, a single channel distributed arithmetic FIR filter uses 29% of the logic resources and 39% of the registers of a XC3S1000 Spartan-3 device. When implementing an eight-channel version of the same filter, we would normally time multiplex the different channels to conserve logic. But this would use a lot of registers, or a significant amount of on-chip memory to store the intermediate results.
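For readers unfamiliar with distributed arithmetic, the Python sketch below shows the textbook form of the technique: the multipliers are replaced by a precomputed table of coefficient sums, and the inputs are processed one bit plane at a time. This illustrates the general idea only and is not a description of how the Xilinx filter in Figure 5 is built. The function names are hypothetical, and the samples are assumed to be two's-complement integers that fit in the stated number of bits.

```python
def build_da_lut(coeffs):
    """Precompute the sum of coefficients for every combination of one bit per tap."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_dot(samples, coeffs, bits=16):
    """Dot product of coeffs with two's-complement samples, one bit plane at a time."""
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Gather bit b of every sample into a table address.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b
        # The most significant bit of a two's-complement value has negative weight.
        acc += -term if b == bits - 1 else term
    return acc


if __name__ == "__main__":
    taps = [5, 7, -3, 2]
    window = [3, -2, 10, -1]
    print(da_dot(window, taps), sum(c * x for c, x in zip(taps, window)))
```

In hardware, each bit plane typically costs one clock cycle and the table for a four-tap group has 16 entries, so the approach trades multipliers for look-up tables and shift registers.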
With Spartan-3 FPGAs, the intermediate results are stored in LUTs configured as 16-bit shift registers (SRL16). This allows the eight-channel version of the same filter to be implemented using only 10% more of the available logic and only 7% more of the available registers: 8x more channels for only 25% more device resources (see Figure 6).
These dramatic savings are directly related to the use of the SRL16s available in the Spartan-3 device. In the report file, you can see that an additional 1,343 LUTs are used in SRL16 mode for the eight-channel implementation.
Implementing this design in an FPGA without SRL16 capability would require an additional 10,744 (1,343 x 8) flip-flops used as storage elements, demanding a massive device for the register count and likely squandering the associated combinatorial logic resources.
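The register count quoted above is simple to reproduce. The Python sketch below multiplies the number of shift-register LUTs by the number of bits each one holds. The function name is hypothetical, and reading the factor of 8 as one stored bit per channel in each LUT is an interpretation of the article's arithmetic, not something the report file excerpt confirms.

```python
def flip_flops_replaced(srl_luts, depth_used):
    """Flip-flops needed to replace shift-register LUTs that each hold depth_used bits."""
    if not 1 <= depth_used <= 16:
        raise ValueError("a 16-bit shift-register LUT holds between 1 and 16 bits")
    return srl_luts * depth_used


if __name__ == "__main__":
    # Figures from the article: 1,343 LUTs in SRL16 mode, eight channels
    print(flip_flops_replaced(1343, 8))
```

If the full 16-bit depth of each LUT were used, the same count of LUTs would replace twice as many flip-flops, so the article's figure can be seen as a conservative case.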
Conclusion
The Spartan-3 architecture is optimized to give you very high area efficiency when implementing signal processing functions. By combining these DSP-friendly system features with low unit costs, Spartan-3 FPGAs enable the industry’s lowest price points for high-performance DSP functions. This allows a Spartan-3 device to act as a low cost but highly efficient and high-performance co-processor to a programmable DSP processor.
Excerpt from the One-MAC Implementation Report File
Excerpt from the Four-MAC Implementation Report File
Figure 4 – Using the embedded multipliers and block RAM features of the Spartan-3 fabric for higher performance DSP functions
Figure 5 – This single channel DA FIR filter uses 29% of the logic and 39% of the registers in a Spartan-3 XC3S1000 device.
Figure 6 – The eight channel version of the same DA FIR filter only uses 10% more logic and 7% more registers.
Get Started Now with Xilinx® Virtex-4™ FPGAs
Xilinx is revolutionizing the fundamentals of FPGA economics with the Virtex-4™ family. To help you get a jump start on your next design, Avnet Electronics Marketing has created the Virtex-4 LX Evaluation Kit and a SpeedWay Seminar.™
The Virtex-4 SpeedWay Seminar will allow you to:
• Learn about the Virtex-4 product family features
• Learn how to use Virtex-4 in your specific application
• Learn about the key features of the new Xilinx ISE 6.3i integrated software environment
This seminar will explore the following topics: integrated PowerPC™ processors, the world’s most popular embedded processor architecture, next generation Xtreme™ DSP technology, Advanced Silicon Modular Block (ASMBL) architecture and RocketIO™ serial transceivers.
For your convenience, the seminar can take place at your location at a time of your choosing.
Virtex-4 FPGAs
• Multi-Platform FPGA family
• Support for (3) application domains
• 90 nm process technology
• Reduced power consumption
• Dramatic reduction in cost per function
Virtex-4 LX25 Evaluation Kit
• Virtex-4 LX25 FPGA
• 8 MB Flash and 32 MB DDR SDRAM
• Cypress CY7C68013 USB 2.0 controller
• National Semiconductor DP83847 10/100 Ethernet PHY
• 128x64 OSRAM graphical display
Special Pricing for Seminar Attendees*
Part Number              Description                                                        Price
ADS-BASEX-BUNDLE         Xilinx Virtex-4 LX25 Evaluation Kit bundled with ISE BaseX          $550.00 USD*
                         (only available with purchase of Virtex-4 LX25 Evaluation Kit)
ADS-FOUNDATION-BUNDLE    Xilinx Virtex-4 LX25 Evaluation Kit bundled with ISE Foundation     $2,400.00 USD*
                         (only available with purchase of Virtex-4 LX25 Evaluation Kit)
*Pricing valid only within 60 days of attending a seminar.
Kit Information and Purchases - www.em.avnet.com/virtex4lx
Ready. Set. Go to market.™
Support Across The Board.™
Shrinking budgets and design cycles make evaluating, designing, and testing complex systems more challenging than ever before. Xilinx® provides the answer with the Virtex-4™ ML401 evaluation platform. Powered by the XC4VLX25 device and incorporating industry-standard peripherals, connectors, and interfaces, the Virtex-4 ML401 evaluation platform provides a rich feature set that spans a wide range of applications.
Xilinx also provides expert guidance to designers with hardware-verified reference designs, application notes, and user-friendly tools.
The Virtex-4 ML401 evaluation platform specifications include:
– Four SMA connectors (differential clocks), two PS/2 connectors (keyboard/mouse), LVDS personality module, audio (line in, line out, microphone, headphone), RS-232 serial port, USB (one host and two peripheral), Parallel Cable-IV header, DB15 VGA display, RJ-45 Ethernet port
The Virtex-4 ML401 evaluation platform is a low-cost, full-featured development system.
Virtex-4 ML401 Evaluation Platform Features
• Support for multiple clock sources and differential clock inputs
• Memory interfaces for DDR SDRAM, ZBT SRAM, and Linear Flash
• Multiple FPGA configuration modes: Platform Flash, System ACE™ CF solution, Linear Flash, and Parallel Cable-IV
• Audio and video interfaces
• Multiple user interfaces: dual PS/2, IIC Bus, RS-232, USB, and tri-mode Ethernet
• Reference designs and IP cores for numerous applications speed up your design cycle
• A comprehensive suite of application notes guides you every step of the way
• Demonstrations ship in Platform Flash, Linear Flash, and System ACE CF solution
Order your Virtex-4 ML401 evaluation platform today to get a head start on your design. For more information about the Virtex-4 FPGA family, visit www.xilinx.com/virtex4/.
T H E B O A R D R O O M
Today’s telecom and networking systems use high-bandwidth interfaces based on LVDS, HyperTransport™, and other differential I/O standards. These standards simplify system design by lowering pin count and power consumption and improving signal integrity.
Protocols based on these standards, such as SPI-4.2, RapidIO™, and HyperTransport, are central to leading-edge system design.
Xilinx® Virtex-4™ FPGAs offer up to 1 Gbps SelectIO™ parallel I/O, with the flexibility to use any I/O pair as differential I/O. Additional benefits for higher level protocol implementation include:
• ChipSync™ source-synchronous I/O technology for dynamic precision phase alignment and data centering with per-bit de-skew
• Design with major differential I/O standards in networking, computing, storage, and wireless
• Pre-engineered IP and reference designs
• A unique built-in silicon feature enables 1 Gbps performance
Buy the source-synchronous interfaces tool kit today to get started on your design. For more information about the kit, the Virtex-4 FPGA family, ChipSync technology, and available optional IP, visit www.xilinx.com/virtex4/.
Building interfaces to high-performance memory devices presents challenges such as high-speed synchronous data capturing, along with implementing complex physical-layer interfaces and control logic.
Virtex-4 FPGAs solve these challenges with advanced silicon capabilities, including ChipSync™ source-synchronous technology, Xesium clocking, and Smart RAM.
• ChipSync technology provides 80 ps resolution for clock-to-data alignment, ensuring reliable data capture
To shorten design time, Xilinx provides expert guidance in the form of free hardware-verified reference designs, application notes, user-friendly tools, and advanced development systems. This combination of unique silicon capabilities and comprehensive support enables you to build and verify robust memory interfaces quickly and easily.
The advanced memory development system, ML461, offers an excellent platform to develop and verify high-performance memory interfaces.
Xilinx also offers a menu-based tool, the memory interface generator, to further customize reference designs (Figure 2). The tool generates the pin placement file and a complete modular set of HDL files.
Virtex-4 FPGAs make complete memory interface solutions possible.
ML461 – Advanced Memory Development System Features
• Memory interfaces: DDR2 SDRAM, DDR SDRAM, QDR II SRAM, RLDRAM II, FCRAM II (Figure 1)
• Four Xilinx® Virtex-4™ LX-25 devices
• JTAG interface
• System ACE™ Compact Flash card
• CD-ROM with complete documentation
• 5V power supply
You can download the reference design, application notes, memory interface generator, and other resources for memory interface designs by visiting www.xilinx.com/virtex4/. If you are interested in purchasing the ML461, please contact your local sales representative, or e-mail [email protected].
Parameter DDR2 SDRAM DDR SDRAM QDR II RLDRAM II FCRAM II
Data Width 144-bit (DIMM) 144-bit (DIMM) (72+72)-bit 36-bit 36-bit28-bit 28-bit
I/O Standard SSTL 18 SSTL 2 HSTL HSTL SSTL 18
Figure 1 – Memory architectures supported by ML461
Figure 2 –Memory interface generator
The Memec™ LC development kit for Xilinx® Virtex-4™ devices creates an easy-to-use yet effective Virtex-4 prototyping environment. The LC board provides prototype features common to most designers’ needs, with a focus on usability in real-world applications.
The kit bundles a full-featured, expandable Virtex-4-based system board with a power supply, user guide, and reference designs. Optional Xilinx ISE™ software, JTAG cable, and application-specific P160 expansion modules are also available.
The Virtex-4 LC development kit accelerates design time.
The Memec MB development kits for Xilinx Virtex-4 devices provide advanced functions and interface features for your most demanding Virtex-4 prototype needs.
The MB board is available in both LX25 and LX60 densities, and for DSP applications, the SX35.
The kit bundles an expandable Virtex-4-based system board with a power supply, user guide, reference designs, and optional ISE software and JTAG cable. The new P240 expansion module standard included on the board provides both LVDS and single-ended signals to support more challenging expansion requirements.
The Virtex-4 MB development kits give you maximum flexibility to target high-end applications.
Memec Virtex-4 Board Solutions
Virtex-4 LC Development Kit Features
• XC4VLX25-10SF363 FPGA
• 10/100 Ethernet PHY
• 32M x 16 DDR memory
• P160 interface
• 2 x 16-character LCD
• RS232
• System ACE™ interface
• Low cost
Virtex-4 MB Development Kit Features
• XC4VLX25, LX60, or SX35-10FF668 FPGA
• 10/100 Ethernet PHY
• 32M x 16 DDR memory
• 2M x 16 Flash memory
• P240 high-performance interface
• High-speed LVDS interface
• 2 x 16-character LCD
• RS232 and USB interface
• System ACE interface
• High performance
For more information or to order your Virtex-4 development kit from Memec, visit www.memec.com/xilinx-v4/ or call (888) 488-4133 (in the U.S.) or (858) 314-8910 (outside the U.S.).
The Virtex-4 family of FPGAs delivers powerful new capabilities for designs in the programmable logic, DSP, embedded processing, and high-speed serial I/O applications domains. As a Xilinx distributor, Avnet plays a critical role in helping customers rapidly adopt the Virtex-4 solution into innovative, feature-rich end products.
Avnet is now shipping three new evaluation kits: the Virtex-4 LX25 and LX60 Evaluation Kits and the Virtex-4 SX35 Evaluation Kit (Figure 1). The LX Evaluation Kits feature an XC4VLX25 or XC4VLX60 device. These two kits are optimized for general logic integration applications.
The SX35 Evaluation Kit, which is optimized for high-performance DSP applications, uses the same board populated with a Virtex-4 XC4VSX35 device.
All three kits offer a choice of affordable, easy-to-use platforms for evaluating and experimenting with a Virtex-4 LX or SX design. And by tying in expansion cards available from Avnet, such as add-on memory, audio/video, and adapters for data conversion, these kits can serve as powerful prototyping platforms.
Purchasing any Avnet Design Kit gets you into an Avnet SpeedWay Design Workshop™ for free, where you’ll learn how to leverage Xilinx solutions using real-world design examples. SpeedWay Workshops are hardware-based and lab-oriented. You’ll work with real hardware and development tools to build actual designs and leave with an in-depth knowledge of the FPGA architecture and design methods used in the lab. For more information or to register for a SpeedWay Workshop, visit www.em.avnet.com/xlxspeedwayindex/.
Virtex-4 LX25, LX60, and SX35 Evaluation Kits are now available.
Avnet Virtex-4 Evaluation Kits Features
• Xilinx® XC4VLX25 FF668, XC4VLX60 FF668, or XC4VSX35 FF668 FPGA
• Cypress™ CY7C68013 USB 2.0 controller
• National Semiconductor™ DP83847 10/100 Ethernet PHY
Avnet’s design kits and technical workshops are powerful tools that you can leverage to increase your design advantage when implementing Virtex-4-based solutions. For more information, visit www.em.avnet.com/xlxv4kits/.
Virtex-4 LX Platform
Featured Device    Avnet Part Number      Price
XC4VLX25           ADS-XLX-V4LX-EVL25     $349.00 USD
XC4VLX60           ADS-XLX-V4LX-EVL60     $599.00 USD
Virtex-4 SX Platform
Featured Device    Avnet Part Number      Price
XC4VSX35           ADS-XLX-V4SX-EVL35     $449.00 USD
Virtex-4 FX Platform
...coming soon
Support for Multiple Clock Sources and Differential Clock Inputs
• Memory interfaces for DDR2 SDRAM at 533 MHz, ZBT SRAM, and Linear Flash
• Multiple FPGA configuration modes: Platform Flash, System ACE™ CF, Linear Flash, and Parallel Cable-IV
• Audio and video interfaces
• Multiple user interfaces: dual PS/2, IIC Bus, RS-232, USB, and tri-mode Ethernet
• High-speed data acquisition expansion module interface supporting single-ended and LVDS I/O standards
Optimize Your Design with Unique Built-In Silicon Features
• ChipSync™ source-synchronous technology embedded in every I/O ensures reliable data capture
• Xesium differential global clocks minimize skew and jitter for increased design margins
Finish Faster Using Proven Reference Designs
• Reference designs and IP cores for numerous applications speed up your design cycle
• A comprehensive suite of application notes guides you every step of the way
* Demonstrations ship in Platform Flash, Linear Flash, and System ACE CF formats
Evaluate and implement your design by leveraging the ML401 board’s rich feature set.
All of the designs and related documentation for the Virtex-4 board are available on the Nu Horizons website at www.nuhorizons.com/v4/.
Nu Horizons Virtex-4 Development Platform
The NH401 from Nu Horizons Electronics Corp. is designed as a low-cost, high-value development platform to provide a demonstration of the Xilinx® Virtex-4™ LX/SX/FX family. The NH401 platform showcases the enormous power and flexibility of Virtex-4 FPGAs, including new and improved clock technology, system monitors, DSP blocks, Smart RAM blocks, advanced I/Os, embedded MACs, 10/100/1000 Ethernet MAC, RocketIO™ MGTs, and embedded processors (PowerPC™ 405 hard-core and MicroBlaze™ soft-core processors).
The NH401 is built around a Virtex-4 FPGA and is designed to offer a user-friendly and highly useful set of features at an extremely low price point. The board is envisioned to function as an easy-to-use demonstration platform, as well as a high-performance DSP development or embedded processing platform. Included with the NH401 are simple tutorials, reference designs, and interesting demos, including a full embedded computer that you can easily expand or adapt for your own applications.
• VGA controller (resolutions as high as 1024 x 768 at 60 Hz)
• Audio in/out CODEC (microphone in, line-in/out, and headphone output jacks)
• LCD display (16 x 2 character)
• RS232 serial port
• 2 x PS/2 (P/C keyboard and mouse)
• GPIO: 5 Buttons + 13 LEDs + 8 DIP switches
• 4 SMAs (differential clock in/out) + CLK oscillator socket
• ADC system monitor (-3V or 0-6V swing can be sampled)
• 64-bit expansion I/O connector routed for LVDS, Agilent Soft Touch connector
• 10/100/1000 Ethernet PHY
• PC4 connector (allows for JTAG debug/download via the Parallel-IV cable)
• USB host/peripheral interface
• CPLD for Flash configuration of FPGA
• High-speed frequency synthesizer - 622 MHz
Additional plug-in evaluation modules are available:
• Linear Technology high-speed A/D converters
With recent, rapid progress in memory-related technology, the SDRAM standard is shifting from SDR to DDR, further enabling the rise of DDR2 SDRAM. DDR2 is becoming the de facto industry standard thanks to its low power consumption, high speed, and reduced EMI.
The TED DDR2 memory evaluation board from HiTech Global Distribution allows you to evaluate DDR2 SDRAM with the Virtex-4 LX series (LX25/40/60). The DDR2 SDRAM comprises two embedded component chips and two DIMM modules, thus allowing use in various memory evaluation applications.
Additionally, so that you can use the board immediately after purchase, a 533 Mbps reference design is planned for the board.
We also offer a Gerber file as well as a board schematic file, which can assist you in developing high-speed interfaces for DDR2 SDRAM and FPGAs.
High-performance, easy-to-use, and low-cost platforms for the rapid evaluation of DDR-II memory devices.
The designs and related documentation for this board are available on the HiTech Global Distribution, LLC website at www.hitechglobal.com/ted/virtex4ddr.htm.
Xilinx Virtex-4™ FPGAs
http://www.xilinx.com/devices/
Product Selection Matrix

Virtex-4 LX (Logic)
Device    Logic Cells  Total Block RAM (kbits)  DCMs  Phase-matched Clock Dividers  Max SelectIO™  Max Differential I/O Pairs  XtremeDSP™ Slices  ADC  PowerPC™ Processor Blocks  10/100/1000 Ethernet MAC Blocks  RocketIO™ Serial Transceivers  Configuration Memory Bits  EasyPath™ Solutions
4VLX15    13,824       864                      4     0                             320            160                         32                 0    -                          -                                -                              4,875,392                  -
4VLX25    24,192       1,296                    8     4                             448            224                         48                 0    -                          -                                -                              8,037,312                  -
4VLX40    41,472       1,728                    8     4                             640            320                         64                 0    -                          -                                -                              12,647,680                 XCE4VLX40
4VLX60    59,904       2,880                    8     4                             640            320                         64                 0    -                          -                                -                              18,315,520                 XCE4VLX60
4VLX80    80,640       3,600                    12    8                             768            384                         80                 1    -                          -                                -                              24,101,440                 XCE4VLX80
4VLX100   110,592      4,320                    12    8                             960            480                         96                 1    -                          -                                -                              31,818,624                 XCE4VLX100
4VLX160   152,064      5,184                    12    8                             960            480                         96                 1    -                          -                                -                              41,863,296                 XCE4VLX160
4VLX200   200,448      6,048                    12    8                             960            480                         96                 1    -                          -                                -                              50,648,448                 XCE4VLX200

Virtex-4 SX (Signal Processing)
Device    Logic Cells  Total Block RAM (kbits)  DCMs  Phase-matched Clock Dividers  Max SelectIO™  Max Differential I/O Pairs  XtremeDSP™ Slices  ADC  PowerPC™ Processor Blocks  10/100/1000 Ethernet MAC Blocks  RocketIO™ Serial Transceivers  Configuration Memory Bits  EasyPath™ Solutions
4VSX25    23,040       2,304                    4     0                             320            160                         128                0    -                          -                                -                              9,651,072                  -
4VSX35    34,560       3,456                    8     4                             448            224                         192                0    -                          -                                -                              14,476,608                 -
4VSX55    55,296       5,760                    8     4                             640            320                         512                0    -                          -                                -                              24,088,320                 XCE4VSX55

Virtex-4 FX (Embedded Processing & Serial Connectivity)
Device    Logic Cells  Total Block RAM (kbits)  DCMs  Phase-matched Clock Dividers  Max SelectIO™  Max Differential I/O Pairs  XtremeDSP™ Slices  ADC  PowerPC™ Processor Blocks  10/100/1000 Ethernet MAC Blocks  RocketIO™ Serial Transceivers  Configuration Memory Bits  EasyPath™ Solutions
4VFX12    12,312       648                      4     0                             320            160                         32                 0    1                          2                                0                              5,017,088                  -
4VFX20    19,224       1,224                    4     0                             320            160                         32                 0    1                          2                                8                              7,641,088                  -
4VFX40    41,904       2,592                    8     4                             448            224                         48                 0    2                          4                                12                             15,838,464                 XCE4VFX40
4VFX60    56,880       4,176                    12    8                             576            288                         128                1    2                          4                                16                             22,262,016                 XCE4VFX60
4VFX100   94,896       6,768                    12    8                             768            384                         160                1    2                          4                                20                             35,122,240                 XCE4VFX100
4VFX140   142,128      9,936                    20    8                             896            448                         192                1    2                          4                                24                             50,900,352                 XCE4VFX140

ADC = Analog-to-Digital Converters (ADC)

Package   Area              MGT   Pins
SF363     17 x 17 mm        -     240
FF668     27 x 27 mm        -     448
FF676     27 x 27 mm        -     448
FF1148    35 x 35 mm        -     768
FF1513    40 x 40 mm        -     960
FF672     27 x 27 mm        12    352
FF1152    35 x 35 mm        20    576
FF1517    40 x 40 mm        24    768
FF1760    42.5 x 42.5 mm    24    896
[Per-device user I/O counts for each package are not recoverable from the extracted text.]

1. Number of available RocketIO Multi-Gigabit Transceivers
Pb-free solutions are available. For more information about Pb-free solutions, visit www.xilinx.com/pbfree/.
Important: Verify all data in this document with the device data sheets found at http://www.xilinx.com/partinfo/databook.htm
Xilinx Spartan™-3 FPGAs
http://www.xilinx.com/devices/
Product Selection Matrix

Spartan-3 Family – 1.2 Volt (see note 3)

| Group | Feature | XC3S50 | XC3S200 | XC3S400 | XC3S1000 | XC3S1500 | XC3S2000 | XC3S4000 | XC3S5000 |
|---|---|---|---|---|---|---|---|---|---|
| CLB Resources | System Gates (see note 1) | 50K | 200K | 400K | 1000K | 1500K | 2000K | 4000K | 5000K |
| CLB Resources | CLB Array (Row x Col) | 16 x 12 | 24 x 20 | 32 x 28 | 48 x 40 | 64 x 52 | 80 x 64 | 96 x 72 | 104 x 80 |
| CLB Resources | Number of Slices | 768 | 1,920 | 3,584 | 7,680 | 13,312 | 20,480 | 27,648 | 33,280 |
| CLB Resources | Logic Cells (see note 2) | 1,728 | 4,320 | 8,064 | 17,280 | 29,952 | 46,080 | 62,208 | 74,880 |
| CLB Resources | CLB Flip-Flops | 1,536 | 3,840 | 7,168 | 15,360 | 26,624 | 40,960 | 55,296 | 66,560 |
| Memory Resources | Max. Distributed RAM Bits | 12K | 30K | 56K | 120K | 208K | 320K | 432K | 520K |
| Memory Resources | # Block RAM | 4 | 12 | 16 | 24 | 32 | 40 | 96 | 104 |
| Memory Resources | Block RAM (bits) | 72K | 216K | 288K | 432K | 576K | 720K | 1,728K | 1,872K |
| DSP | Dedicated Multipliers | 4 | 12 | 16 | 24 | 32 | 40 | 96 | 104 |
| CLK Resources | DCM Frequency (min/max) | 24/330 | 24/330 | 24/330 | 24/330 | 24/330 | 24/330 | 24/330 | 24/330 |
| CLK Resources | # DCMs | 2 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| CLK Resources | Frequency Synthesis | YES | YES | YES | YES | YES | YES | YES | YES |
| CLK Resources | Phase Shift | YES | YES | YES | YES | YES | YES | YES | YES |
| I/O Features | Digitally Controlled Impedance | YES | YES | YES | YES | YES | YES | YES | YES |
| I/O Features | Number of Differential I/O Pairs | 56 | 76 | 116 | 175 | 221 | 270 | 312 | 344 |
| I/O Features | Maximum I/O | 124 | 173 | 264 | 391 | 487 | 565 | 712 | 784 |
| Speed | Commercial Speed Grades (slowest to fastest) | -4 -5 | -4 -5 | -4 -5 | -4 -5 | -4 -5 | -4 -5 | -4 -5 | -4 -5 |
| Speed | Industrial Speed Grades (slowest to fastest) | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 |
| PROM | Configuration Memory (Bits) | 0.4M | 1.0M | 1.7M | 3.2M | 5.2M | 7.7M | 11.3M | 13.3M |

I/O Standards (all devices):
Single-ended: LVTTL, LVCMOS 3.3/2.5/1.8/1.5/1.2, PCI 3.3V – 32/64-bit 33 MHz, SSTL2 Class I & II, SSTL18 Class I, HSTL Class I, III, HSTL1.8 Class I, II & III, GTL, GTL+
Differential: LVDS2.5, Bus LVDS2.5, Ultra LVDS2.5, LVDS_ext2.5, RSDS, LDT2.5, LVPECL

Notes:
1. System Gates include 20-30% of CLBs used as RAMs.
2. For Spartan-3, a Logic Cell is defined as a 4-input LUT + flip-flop.
3. Automotive Q-Grade Solutions for Spartan-3 will be available 2H 2004.
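The Spartan-3 figures in the matrix can be cross-checked against one another with a little arithmetic. The sketch below is a minimal consistency check in Python, not anything taken from the matrix itself. The per-CLB and per-slice ratios it uses (4 slices per CLB, 2 flip-flops per slice, 2.25 logic cells per slice, an average of 16 distributed RAM bits per slice, and 18 Kbits per block RAM) are assumptions that happen to fit every row. The names `SPARTAN3`, `derive`, and `mismatches` are hypothetical.

```python
# Spartan-3 figures transcribed from the matrix:
# device: (CLB rows, CLB cols, slices, logic cells, flip-flops,
#          distributed RAM Kbits, block RAMs, block RAM Kbits)
SPARTAN3 = {
    "XC3S50":   (16, 12, 768, 1728, 1536, 12, 4, 72),
    "XC3S200":  (24, 20, 1920, 4320, 3840, 30, 12, 216),
    "XC3S400":  (32, 28, 3584, 8064, 7168, 56, 16, 288),
    "XC3S1000": (48, 40, 7680, 17280, 15360, 120, 24, 432),
    "XC3S1500": (64, 52, 13312, 29952, 26624, 208, 32, 576),
    "XC3S2000": (80, 64, 20480, 46080, 40960, 320, 40, 720),
    "XC3S4000": (96, 72, 27648, 62208, 55296, 432, 96, 1728),
    "XC3S5000": (104, 80, 33280, 74880, 66560, 520, 104, 1872),
}

# Assumed ratios; none of these is stated in the matrix itself.
SLICES_PER_CLB = 4
FLIP_FLOPS_PER_SLICE = 2
LOGIC_CELLS_PER_4_SLICES = 9   # i.e. 2.25 logic cells per slice
DIST_RAM_BITS_PER_SLICE = 16   # average over all slices
BLOCK_RAM_KBITS = 18


def derive(rows, cols, block_rams):
    """Derive the dependent table entries from the CLB array and block RAM count."""
    slices = rows * cols * SLICES_PER_CLB
    return (
        slices,
        slices * LOGIC_CELLS_PER_4_SLICES // 4,    # logic cells
        slices * FLIP_FLOPS_PER_SLICE,             # CLB flip-flops
        slices * DIST_RAM_BITS_PER_SLICE // 1024,  # distributed RAM, Kbits
        block_rams * BLOCK_RAM_KBITS,              # block RAM, Kbits
    )


def mismatches():
    """Return the devices whose listed figures differ from the derived ones."""
    bad = []
    for device, row in SPARTAN3.items():
        rows, cols, slices, cells, ffs, dist_k, brams, bram_k = row
        if derive(rows, cols, brams) != (slices, cells, ffs, dist_k, bram_k):
            bad.append(device)
    return bad


if __name__ == "__main__":
    print(mismatches())  # an empty list means every row is self-consistent
```

A check like this is mainly useful when transcribing the matrix by hand: a mistyped slice or flip-flop count shows up as a device name in the returned list. It does not replace verification against the device data sheets.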
Package Options and User I/O (see Note 1): Spartan-3 (1.2V)

Maximum user I/Os per device:

| XC3S50 | XC3S200 | XC3S400 | XC3S1000 | XC3S1500 | XC3S2000 | XC3S4000 | XC3S5000 |
|---|---|---|---|---|---|---|---|
| 124 | 173 | 264 | 391 | 487 | 565 | 712 | 784 |

| Package family | Area (see Note 2) | Pins | User I/O counts listed |
|---|---|---|---|
| PQFP Packages (PQ) – wire-bond plastic QFP (0.5 mm lead spacing) | 30.6 x 30.6 mm | 208 | 124, 141, 141 |
| VQFP Packages (VQ) – very thin TQFP (0.5 mm lead spacing) | 16.0 x 16.0 mm | 100 | 63, 63 |
| TQFP Packages (TQ) – thin QFP (0.5 mm lead spacing) | 22.0 x 22.0 mm | 144 | 97, 97, 97 |
| FGA Packages (FT) – wire-bond fine-pitch thin BGA (1.0 mm ball spacing) | 17 x 17 mm | 256 | 173, 173, 173 |
| FGA Packages (FG) – wire-bond fine-pitch BGA (1.0 mm ball spacing) | 19 x 19 mm | 320 | 221, 221, 221 |
| FGA Packages (FG) | 23 x 23 mm | 456 | 264, 333, 333 |
| FGA Packages (FG) | 27 x 27 mm | 676 | 391, 487, 489 |
| FGA Packages (FG) | 31 x 31 mm | 900 | 565, 633, 633 |
| FGA Packages (FG) | 35 x 35 mm | 1156 | 712, 784 |

[The assignment of individual user I/O counts to device columns was lost in extraction; confirm against the device data sheets.]

Note 1: Numbers in table indicate maximum number of user I/Os.
Note 2: Area dimensions for lead-frame products are inclusive of the leads.

Pb-free solutions are available. For more information about Pb-free solutions, visit www.xilinx.com/pbfree/.
Important: Verify all data in this document with the device data sheets found at http://www.xilinx.com/partinfo/databook.htm
For the latest information and product specifications on all Xilinx products, please visit the following links:

FPGA and CPLD Devices: http://www.xilinx.com/devices/
Configuration and Storage Systems: http://www.xilinx.com/configsolns/
Packaging: http://www.xilinx.com/packaging/
Software: http://www.xilinx.com/ise/
Development Reference Boards: http://www.xilinx.com/board_search/
IP Reference: http://www.xilinx.com/ipcenter/
Global Services: http://www.xilinx.com/support/gsd/
[Figure: input VIN = 3.3 V, 5 V, or 12 V; three outputs, each labeled Track: Vo 1 = 3.3 V, Vo 2 = 2.5 V, Vo 3 = 1.8 V; current labels 20 A, 30 A, 15 A.]
The new PTHxx family of plug-in power modules from Texas Instruments provides industry-leading features that allow designers to take charge of point-of-load (POL) power problems and designs. New Auto-Track sequencing via single-pin control simplifies multimodule power up/down. In addition to those listed below, other key features include wide adjustable output voltage, on/off inhibit, overcurrent protection and remote sense.
Samples shipped in 24 hours.
The Industry’s Most Advanced Plug-In Power Modules Featuring TI’s New Auto-Track™ Sequencing
Applications
– Networking
– Servers
– Data communications
– Workstations
– Industrial electronics
Features
– Auto-Track sequencing simplifies power up/down sequencing of multiple modules
– Pre-bias startup capability allows use with all ASICs and FPGAs
– Margin up/down provides for additional test capability during manufacturing
– A 96% efficiency rating means more power in a smaller package
– Point-of-Load Alliance (POLA) compatibility assures interoperable second sources