www.xilinx.com/xcell/
SOLUTIONS FOR A PROGRAMMABLE WORLD
Xcell Journal, Issue 79, Second Quarter 2012
Using Formal Verification for HW/SW Co-verification of an FPGA IP Core
How to Use the CORDIC Algorithm in Your FPGA Design
Smart, Fast Financial Trading Platforms Start with FPGAs
Xilinx Unveils Vivado Design Suite for the Next Decade of All Programmable Devices
FPGAs Provide Flexible Platform for High School Robotics
Xcell journal
PUBLISHER Mike Santarini [email protected]
EDITOR Jacqueline Damian
ART DIRECTOR Scott Blair
DESIGN/PRODUCTION Teie, Gelwicks & Associates, 1-800-493-5551
ADVERTISING SALES Dan [email protected]
INTERNATIONAL Melissa Zhang, Asia [email protected]
Christelle Moraga, Europe/Middle East/[email protected]
Miyuki Takegoshi, [email protected]
REPRINT ORDERS 1-800-493-5551
www.xilinx.com/xcell/
Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124-3400
Phone: 408-559-7778
FAX: 408-879-4780
© 2012 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.
The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.
LETTER FROM THE PUBLISHER
Welcome to the Programmable Renaissance
Time flies. I recently celebrated my fourth anniversary here at Xilinx and have had a great time participating in the Herculean task of bringing to market two new generations of silicon: the 40-nanometer 6 series FPGAs and the 28-nm 7 series devices. I'm proud to be a part of what is most likely the single most inspirational and innovative moment in the history of programmable logic since Xilinx introduced the very first FPGA, the XC2064, in 1985.
Along with being the first to market with 28-nm silicon in 2011, Xilinx introduced two revolutionary technologies: the Zynq-7000 Extensible Processing Platform and the Virtex-7 2000T FPGA.
If you've been a faithful reader of Xcell Journal over the last couple of years, you are familiar with these two great devices. The Zynq-7000 EPP marries a dual ARM Cortex-A9 MPCore processor with programmable logic on the same device, and boots from the processor core rather than from programmable logic. It enables new vistas in system-level integration for traditional FPGA designers and opens up the world of programmable logic to a huge user base of software engineers. The possibilities are endless.
I'm not alone in my enthusiasm for this device: The editors and readers of EE Times and EDN recently voted the Zynq-7000 the "Ultimate SoC of 2011" in the UBM Electronics ACE Awards competition (see http://www.eetimes.com/electronics-news/4370156/Xilinx-Zynq-7000-receives-product-of-the-year-ACE-award).
Not to be outdone, the Virtex-7 2000T, an ACE Awards finalist in the Ultimate Digital IC category, is in my opinion an equally if not even more technologically impressive accomplishment. It is the first commercially available FPGA with Xilinx's 3D stacked-silicon interconnect (SSI) technology, in which four 28-nm programmable logic dice (what we call slices) reside side-by-side on a passive silicon interposer. By stacking the dice, Xilinx was able to make the Virtex-7 2000T the world's single largest device in terms of transistor counts and by far the highest-capacity programmable logic device that has ever existed.
The SSI technology not only allows customers to speed past Moore's Law but also opens up new integration possibilities in which Xilinx can integrate different types of dice on a single device, speeding up the pace of user innovation. For example, Xilinx has announced the Virtex-7 HT family of devices, enabled by SSI technology. Each member of this family will include transceiver slices alongside programmable logic slices. The Virtex-7 HT family will allow wired communications companies to create equipment to conform to new bandwidth standards for 100 Gbps and beyond. The biggest device in the family, the Virtex-7 H870T, will allow companies to create equipment that can run at up to 400 Gbps, developing equipment at the leading edge of advanced communications standards.
And now, to put the icing on the cake so to speak, Xilinx is launching its new Vivado Design Suite (cover story). Vivado, which the company started developing four years ago, not only blows away the runtimes of the ISE Design Suite but is built from the ground up using open standards and modern EDA technologies, even high-level synthesis, that should dramatically speed up productivity for the 7 series devices and many generations of FPGAs to come.
I highly recommend you check out the new 7 series devices and the Vivado Design Suite. If you happen to be available for a trip to San Francisco in early June, Xilinx will be exhibiting at the Design Automation Conference (www.dac.com) from June 3 to 7 at Booth 730. You'll find me there, or at three of the Pavilion Panels I'm organizing on DAC's show floor (Booth 310): "Gary Smith on EDA: Trends and What's Hot at DAC," on Monday, June 4, 9:15 to 10:15 a.m.; "Town Hall: Dark Side of Moore's Law" on Wednesday, June 6, 9:15 to 10:15 a.m.; and "Hardware-Assisted Prototyping and Verification: Make vs. Buy?" on Wednesday, June 6, 4:30 to 5:15 p.m. I hope to see you there.
Mike Santarini
Publisher
CONTENTS
SECOND QUARTER 2012, ISSUE 79
VIEWPOINTS
Letter From the Publisher: Welcome to the Programmable Renaissance
XCELLENCE BY DESIGN APPLICATION FEATURES
Cover Story: Xilinx Unveils Vivado Design Suite for the Next Decade of All Programmable Devices
Xcellence in Communications: High-Level Synthesis Tool Delivers Optimized Packet Engine Design
Xcellence in Distributed Computing: Accelerating Distributed Computing with FPGAs
Xcellence in Education: FPGAs Enable Flexible Platform for High School Robotics
Xcellence in Financial: Smart, Fast Trading Systems Start with FPGAs
THE XILINX XPERIENCE FEATURES
Xperts Corner: Accelerate Partial Reconfiguration with a 100% Hardware Solution
Xplanation: FPGA 101: How to Use the CORDIC Algorithm in Your FPGA Design
Xplanation: FPGA 101: Using Formal Verification for HW/SW Co-verification of an FPGA IP Core
XTRA READING
Tools of Xcellence: New tools take the pain out of FPGA synthesis
Xamples: A mix of new and popular application notes
Xtra, Xtra: The latest Xilinx tool updates and patches, as of May 2012
Xclamations!: Share your wit and wisdom by supplying a caption for our techy cartoon, for a chance to win an Avnet Spartan-6 LX9 MicroBoard
Excellence in Magazine & Journal Writing: 2010, 2011
Excellence in Magazine & Journal Design and Layout: 2010, 2011
8 Xcell Journal Second Quarter 2012
COVER STORY
Xilinx Unveils Vivado Design Suite for the Next Decade of All Programmable Devices
State-of-the-art EDA technologies and methods underlie a new tool suite that will radically improve design productivity and quality of results, allowing designers to create better systems faster and with fewer chips.
After four years of development and a year of beta testing, Xilinx is making its Vivado Design Suite available to customers via its early-access program, ahead of public access this summer. Vivado provides a highly integrated design environment with a completely new generation of system- to IC-level tools, all built on the backbone of a shared scalable data model and a common debug environment. It is also an open environment based on industry standards such as the AMBA AXI4 interconnect, IP-XACT IP packaging metadata, the Tool Command Language (Tcl), Synopsys Design Constraints (SDC) and others that facilitate design flows tailored to the user's needs. Xilinx architected the Vivado Design Suite to enable the combination of all types of programmable technologies and to scale up to 100 million ASIC equivalent-gate designs.
by Mike Santarini, Publisher, Xcell Journal, Xilinx,
[email protected]
"Over the last four years, Xilinx has pushed semiconductor innovation to new heights and unleashed the full system-level capabilities of programmable devices," said Steve Glaser, senior vice president of corporate strategy and marketing. "Over this time, Xilinx has evolved into a company that develops All Programmable Devices, extending programmability beyond programmable logic and I/O to software-programmable ARM subsystems, 3D ICs and analog mixed signal. We are enabling new levels of programmable system integration with devices such as the award-winning Zynq-7000 Extensible Processing Platform, the 3D Virtex-7 stacked-silicon interconnect (SSI) technology devices and the world's most advanced FPGAs. Now, with Vivado, we are offering a state-of-the-art tool suite that will accelerate the productivity of customers using these All Programmable Devices for the next decade."
Glaser said Xilinx developed All Programmable Devices to enable customers to achieve new levels of programmable systems integration, increased system performance, lower BOM cost and total system power reduction, and ultimately to accelerate design productivity so they can get their innovations to market quickly. To accomplish this, Xilinx needed to create a tool suite as innovative as its new silicon, a suite that would address nagging integration and implementation design-productivity bottlenecks.
"Customers face a number of integration bottlenecks, including integrating algorithmic C and register-transfer level (RTL) IP; mixing the DSP, embedded, connectivity and logic domains; verifying blocks and systems; and reusing designs and IP," said Glaser. "They also face several implementation bottlenecks, including hierarchical chip planning and partitioning; multidomain and multidie physical optimization; multivariant design vs. timing closure; and late ECOs and the rippling effects of design changes. The new Vivado Design Suite addresses these bottlenecks and empowers users to take full advantage of the system integration capabilities of our All Programmable Devices."
In developing the Vivado Design Suite, Xilinx leveraged industry standards and employed state-of-the-art EDA technologies and techniques. The result is that all designers, from those who require a highly automated, push-button flow to those who are extremely hands-on, will be able to design even the largest Xilinx devices far faster and more effectively than before, while working in a state-of-the-art EDA environment that retains a familiar, intuitive look and feel.
"The Vivado Design Suite gives customers a modern set of tools with full-system programmability features that far surpass the capabilities of the longtime flagship ISE Design Suite. To help customers transition smoothly, Xilinx will continue to develop and support ISE indefinitely for those targeting 7 series and older Xilinx FPGA technologies. Going forward, the Vivado Design Suite will be the company's flagship design environment, supporting all 7 series and future devices from Xilinx."
Tom Feist, senior director of design methodology marketing at Xilinx, expects that when customers launch the Vivado Design Suite, the benefits over ISE will become immediately evident.
"The Vivado Design Suite improves user productivity by offering up to 4X runtime improvements over competing tools, while heavily leveraging industry standards such as SystemVerilog, SDC, C/C++/SystemC, ARM's AMBA AXI version 4 interconnect and interactive Tcl scripting," said Feist. "Other highlights include comprehensive cross-probing of Vivado's many reports and design views, state-of-the-art graphics-based IP integration and, last but not least, the first fully supported commercial deployment of high-level synthesis (C++ to HDL) by an FPGA vendor."
TOOLS FOR THE NEXT ERA OF PROGRAMMABLE DESIGN
Xilinx originally introduced its ISE Design Suite back in 1997. The suite featured a then very innovative timing-driven place-and-route engine that Xilinx had gained in its April 1995 acquisition of NeoCAD. Over a decade and a half, Xilinx added numerous new technologies (including multilanguage synthesis and simulation, IP integration and a host of editing and test utilities) to the suite, striving to constantly improve its design tools on all fronts as FPGAs became capable of performing increasingly complex functions. In creating the new Vivado Design Suite, Feist said that Xilinx drew upon all the lessons learned with ISE, appropriating its key technologies while also leveraging modern EDA algorithms, tools and techniques.
"The Vivado Design Suite will greatly improve design productivity for today's designs and will easily scale for the capacity and design-complexity challenges of 20-nanometer silicon and beyond," said Feist. "EDA technology has evolved greatly over the last 15 years. In building this tool from scratch, we were able to create a suite that employs the latest EDA technologies and standards and will scale nicely into the foreseeable future."
DETERMINISTIC DESIGN CLOSURE
At the heart of any FPGA vendor's integrated design suite is the physical-implementation flow: synthesis, floorplanning, placement, routing, power and timing analysis, optimization and ECO. With Vivado, Xilinx has built a state-of-the-art implementation flow to help customers quickly achieve design closure.
SCALABLE DATA MODEL ARCHITECTURE
To cut down on iterations and overall design time and to improve overall productivity, Xilinx built its implementation flow using a single, shared, scalable data model, a framework also found in today's most advanced ASIC design environments. "This shared scalable data model allows all the steps in the flow (synthesis, simulation, floorplanning, place and route, etc.) to operate on an in-memory data model that enables debug and analysis at every step in the process, so that users have visibility into key design metrics such as timing, power, resource utilization and routing congestion much earlier in the design processes," said Feist. "These estimates become progressively more accurate as the design progresses through the steps in the implementation processes."
Specifically, the unified data model allowed Xilinx to tightly link its new multidimensional, analytical place-and-route engine with the suite's RTL synthesis engine, new multiple-language simulation engines as well as individual tools such as the IP Integrator, Pin Editor, Floor Planner and Device Editor. Customers can use the tool suite's comprehensive cross-probing function to track and cross-probe a given problem from schematics, timing reports or logic cells to any other view and all the way back to HDL code.
"You now have analysis at every step of the design process and every step is connected," said Feist. "We also provide analysis for timing, power, noise and resource utilization at every stage of the flow after synthesis. So if I learn early that my timing or power is way off, I can do short iterations to address the issue proactively rather than run long iterations, perhaps several of them, after it's been placed and routed."
Feist said that the tight integration afforded by the scalable data model enhanced the effectiveness of push-button flows for users who want maximum automation, relying on their tools to do the vast majority of the work. At the same time, he said, it also gives those users who require more-advanced controls better analysis and command of their every design move.
HIERARCHICAL CHIP PLANNING, FAST SYNTHESIS
Feist said that Vivado provides users with the ability to partition the design for processing by synthesis, implementation and verification, facilitating a divide-and-conquer team approach to big projects. A new design-preservation feature enables repeatable timing results and the ability to perform partial reconfiguration of the design.
Vivado also includes an entirely new synthesis engine that is designed to handle millions of logic cells. Key to the new synthesis engine is superior support for SystemVerilog. "Vivado's synthesis engine supports the synthesizable subset of the SystemVerilog language better than any other tool in the market," said Feist. It is three times faster than XST, the Xilinx Synthesis Technology in the ISE Design Suite, and supports a "quick" option that lets designers rapidly get a feeling for the area and size of the design, allowing them to debug issues 15 times faster than before with an RTL or gate-level schematic. With more and more ASIC designers moving to programmable platforms, Xilinx is also leveraging Synopsys Design Constraints throughout the Vivado flow. The use of standards opens up new levels of automation where customers can now access state-of-the-industry EDA tools for things like constraint generation, cross-domain clock checking, formal verification and even static timing analysis with tools like PrimeTime from Synopsys.
[Figure 1 plot: runtime (hours, 0 to 25) versus design size (logic cells, 0 to 2.0E+06) for Vivado, ISE and competitor tools; curve annotations of 12h/MLC and 4.6h/MLC.]
Figure 1: The Vivado Design Suite implements large and small designs more quickly and with better-quality results than other FPGA tools.
MULTIDIMENSIONAL ANALYTICAL PLACER
Feist explained that the older-generation FPGA vendor design suites use one-dimensional timing-driven place-and-route engines powered by simulated annealing algorithms that determine randomly where the tool should place logic cells. With these routers, users enter timing; then the simulated annealing algorithm pseudorandomly places features to match the timing requirements as best it can. "In those days it made sense, because designs were much smaller and logic cells were the main cause of delays," said Feist. "But today, with complex designs and advances in silicon processes, interconnect and design congestion contribute to the delay far more."
"Place-and-route engines with simulated annealing algorithms do an adequate job for FPGAs below 1 million gates, but they really start to underperform as designs grow," said Feist. "Not only do they struggle with congestion, but the results become increasingly unpredictable as designs grow further beyond 1 million gates."
With an eye toward the multimillion-gate future, Xilinx developed a modern multidimensional analytic placement engine for the Vivado Design Suite that is on par with those found in million-dollar ASIC place-and-route tools. This engine analytically finds a solution that primarily minimizes three dimensions of a design: timing, congestion and wire length. "The Vivado Design Suite's algorithm globally optimizes for best timing, congestion and wire length simultaneously, taking into account the entire design instead of the local-move approach done with simulated annealing," said Feist. "As a result, the tool can place and route 10 million gates quickly, deterministically and with consistently strong quality of results (see Figure 1). Because it is solving for all three factors simultaneously, it means you run fewer iterations in your flow."
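
To make the contrast with random local moves concrete, here is a minimal Python sketch of one classic analytical technique, quadratic wire-length placement in one dimension. It is not Vivado's algorithm, which the article does not describe in detail; the function name `quadratic_place`, the net format and the pad names are all assumptions made for illustration, and the sketch optimizes wire length only, whereas the engine described above also weighs timing and congestion.

```python
def quadratic_place(num_cells, nets, fixed, sweeps=200):
    """Place movable cells on a line by minimizing sum(w * (xa - xb)**2).

    nets  : list of (a, b, weight); a and b are either a movable cell
            index (int) or the name of a fixed pad (str).
    fixed : dict mapping pad name -> fixed coordinate.
    The minimum of this quadratic cost puts every cell at the weighted
    average of its neighbours, so we iterate that condition
    (Gauss-Seidel) until the whole design settles. Every update uses
    global connectivity; nothing is moved at random.
    """
    x = [0.0] * num_cells
    for _ in range(sweeps):
        for i in range(num_cells):
            weighted_sum = 0.0
            total_weight = 0.0
            for a, b, w in nets:
                if a == i:
                    other = b
                elif b == i:
                    other = a
                else:
                    continue
                pos = fixed[other] if isinstance(other, str) else x[other]
                weighted_sum += w * pos
                total_weight += w
            if total_weight > 0.0:
                x[i] = weighted_sum / total_weight
    return x


if __name__ == "__main__":
    # Hypothetical chain: pad L -- c0 -- c1 -- c2 -- pad R
    chain = [("L", 0, 1.0), (0, 1, 1.0), (1, 2, 1.0), (2, "R", 1.0)]
    print(quadratic_place(3, chain, {"L": 0.0, "R": 4.0}))
```

Because the cost is convex, the same input always converges to the same placement, which is one way to read the "deterministic" claim; a simulated annealing placer would instead depend on its random seed and cooling schedule.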
To illustrate this advantage, Xilinx ran the raw RTL for the Zynq-7000 EPP emulation platform, a very large and complex design, in both the ISE Design Suite and Vivado Design Suite in a push-button mode. Each tool was instructed to target Xilinx's largest FPGA device, the SSI-enabled Virtex-7 2000T FPGA. The Vivado Design Suite's place-and-route engine took five hours to place the 1.2 million logic cells, while the ISE Design Suite version 13.4 took 13 hours (Figure 2). The Vivado Design Suite also implemented the design with much less congestion (as seen in the gray and yellow portions of the design) and in a smaller area, reflecting the total wire-length reduction. In addition, the Vivado Design Suite implementation had better memory compilation efficiency, taking only 9 Gbytes to implement the design's required memory versus the ISE Design Suite's 16 Gbytes.
"Essentially what you're seeing is that the Vivado Design Suite met all constraints and only needed three-quarters of the device to implement the entire design," said Feist. "That means users could add even more logic functionality and on-chip memory to their designs [in the extra space] or, alternatively, even move to a smaller device."
POWER OPTIMIZATION AND ANALYSIS
Today, power is one of the most critical aspects of FPGA design. As such, the Vivado Design Suite focuses on advanced power-optimization techniques to provide greater power reductions for users' designs. "The technology uses advanced clock-gating techniques found in today's advanced ASIC tool suites and is capable of analyzing design logic and removing unnecessary switching activity by applying clock gating," said Feist. "Specifically, the new technology focuses on the switching-activity factor 'alpha.' It is able to achieve up to a 30 percent reduction in dynamic power." Feist said Xilinx introduced the technology in the ISE Design Suite last year but is carrying it forward and will continue to enhance it in Vivado.
"In addition, with the new shared scalable data model, users can get power estimates at every stage of the design flow, enabling up-front analysis so that problem areas can be addressed early in the design flow," said Feist.
SIMPLIFYING ENGINEERING CHANGE ORDERS
Incremental flows make it possible to quickly process small design changes by simply reimplementing a small part of the design, making iterations faster after each change. They also enable performance preservation after each incremental change, thus reducing the need for multiple design iterations. Toward this end, the Vivado Design Suite includes a new extension to the popular ISE FPGA Editor tool called the Vivado Device Editor. Feist said that using the Vivado Device Editor on a placed-and-routed design, designers now have the power to make engineering change orders (ECOs) late in the design cycle, for example to move instances, reroute nets, tap a register to a primary output for debug with a scope, or change the parameters on a digital clock manager (DCM) or a lookup table (LUT), without needing to go back through synthesis and implementation. No other FPGA design environment offers this level of flexibility, he said.
[Figure 2 graphic: Zynq emulation platform layouts comparing wire length and congestion. ISE: 13 hrs. P&R runtime, 16 GB memory usage. Vivado: 5 hrs. P&R runtime, 9 GB memory usage.]
Figure 2: The Vivado Design Suite's multidimensional analytic algorithm optimizes layouts for best timing, congestion and wire length, not just best timing.
FLOW AUTOMATION, NOT FLOW DICTATION
In building the Vivado Design Suite, the Xilinx tool team's mantra was to automate, not dictate, the way people design. "Whether they start in C, C++, SystemC, VHDL, Verilog or SystemVerilog, MATLAB or Simulink, and whether they use our IP or third-party IP, we offer a way to automate all those flows and help customers be more productive," said Feist. "We also accounted for the broad range of skill sets and preferences of our users, from folks who want an entirely pushbutton flow to folks who do analysis at each phase of the design, and even for those who think GUIs are for wimps and want to do everything in command-line or batch mode via Tcl. Users are able to suit the suite's features to their specific needs."
THE IP PACKAGER, INTEGRATOR AND CATALOG
Xilinx's tool architecture team placed top priority on giving the new suite specialized IP features to facilitate the creation, integration and archiving of intellectual property. To this end, Xilinx has created three new IP capabilities in Vivado, called IP Packager, IP Integrator and the Extensible IP Catalog.
"Today, it is hard to find an IC design that doesn't incorporate some amount of IP," said Feist. "By adopting industry standards and offering tools to specifically facilitate the creation, integration and archiving/upkeep of IP, we are helping IP vendors in our ecosystem and customers to quickly build IP and improve design productivity. More than 20 vendors are already offering IP supporting the new suite."
IP Packager allows Xilinx customers, IP developers and ecosystem partners to turn any part of their design (or indeed, the entire design) into a reusable core at any level of the design flow: RTL, netlist, placed netlist and even placed-and-routed netlist. The tool creates an IP-XACT description of the IP that users can easily integrate into future designs. For its part, the IP Packager specifies the data for each piece of IP in an XML file. Feist said that once you have the IP packaged, you can use the new IP Integrator to stitch it into the rest of your design.
"IP Integrator allows customers to integrate IP into their designs at the interconnect level rather than at the pin level," said Feist. "You can drag and drop the pieces of IP onto your design and it will check up front that the respective interfaces are compatible. If they are, you draw one line between the cores and it will automatically write the detailed RTL that connects all the pins."
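
The idea of connecting at the interface level and letting the tool expand that into pin-level wiring can be sketched in a few lines of Python. This is a conceptual model only, not how IP Integrator is implemented; the `Interface` class, the `connect` function, the core names and the compatibility rules are all assumptions. The signal names are borrowed from AXI4-Stream purely as a familiar example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    core: str        # instance name of the IP core
    protocol: str    # interface type, e.g. a streaming bus
    role: str        # "master" or "slave"
    signals: tuple   # pin names that make up the interface


def connect(a, b):
    """Check two interfaces up front, then expand one link into pin pairs."""
    if a.protocol != b.protocol:
        raise ValueError(f"protocol mismatch: {a.protocol} vs {b.protocol}")
    if {a.role, b.role} != {"master", "slave"}:
        raise ValueError("need exactly one master and one slave")
    if set(a.signals) != set(b.signals):
        raise ValueError("interfaces expose different signals")
    return [(f"{a.core}.{s}", f"{b.core}.{s}") for s in a.signals]


if __name__ == "__main__":
    stream = ("tdata", "tvalid", "tready")
    dma_out = Interface("dma", "axi4-stream", "master", stream)
    fir_in = Interface("fir", "axi4-stream", "slave", stream)
    for src, dst in connect(dma_out, fir_in):
        print(src, "<->", dst)
```

The user draws one link; the checks run before any wiring is generated, so an incompatible pairing is rejected early instead of surfacing as a simulation failure.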
Once you've merged, say, four or five blocks into your design with IP Integrator, he said, "you can take the output of that [process] and run it back through the IP Packager. The result then becomes a piece of IP that other people can reuse," said Feist. "And this IP isn't just RTL, it can be a placed netlist or even a placed-and-routed IP netlist block, which further saves integration and verification time."
A third feature, the Extensible IP Catalog, allows users to build their own standard repositories from IP they've created or licensed from Xilinx and third-party vendors. The catalog, which Xilinx built to conform to the requirements of the IP-XACT standard, allows design teams and even enterprises to better organize their IP and share it across their organization. Feist said that the Xilinx System Generator and IP Integrator are part of the Vivado Extensible IP Catalog so that users can easily access catalogued IP and integrate it into their design projects.
"Instead of having third-party IP vendors deliver their IP in a zip file and with various deliverables, they can now deliver it to you in a unified format that is instantly accessible and compatible with the Vivado suite," said Ramine Roane, director of product marketing for Vivado.
VIVADO HLS TAKES ESL MAINSTREAM
Perhaps the most forward-looking of the many new technologies in the Vivado Design Suite release is Vivado HLS (high-level synthesis), which Xilinx gained in its acquisition of AutoESL in 2010. Xilinx conducted an extensive evaluation of commercial electronic system-level (ESL) design offerings before acquiring the best in the industry. A study by research firm BDTI helped Xilinx's acquisition choice (see Xcell Journal issue 71, "BDTI Study Certifies High-Level Synthesis Flows for DSP-Centric FPGA Design," http://www.xilinx.com/publications/archives/xcell/Xcell71.pdf).
"Vivado HLS provides comprehensive coverage of C, C++ and SystemC, and does floating-point as well as arbitrary precision floating-point [calculations]," said Feist. "This means that you can work with the tool in an algorithm-development environment rather than a typical hardware environment, if you wish. A key advantage of doing this is that the algorithms you developed at that level can be verified orders of magnitude faster than at the RTL. That means you get simulation acceleration but also the ability to explore the feasibility of algorithms and make, at an architectural level, trade-offs in terms of throughput, latency and power."
Designers can use the Vivado HLS tool in many ways to perform a wide range of functions. But for demonstration purposes, Feist outlined a common flow users can employ for developing IP and integrating it into their designs.
In this flow, users create a C, C++ or SystemC representation of their design and a C testbench that describes its desired behavior. They then verify the system behavior of their design using a GNU Compiler Collection/G++ or Visual C++ simulator. Once the behavioral design is functioning satisfactorily and the accompanying testbench is ironed out, they run the design through Vivado HLS synthesis, which will generate an RTL design: Verilog or VHDL. With the RTL they can then perform Verilog or VHDL simulation of the design or have the tool create a SystemC version using the C-wrapper technology. Users can then perform SystemC architectural-level simulation and further verify the architectural behavior and functionality of the design against the previously created C testbench.
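
The key habit in this flow is that one testbench checks every representation of the design. The Python sketch below models that idea only; it does not use or imitate Vivado HLS. A floating-point moving average stands in for the behavioral model, a fixed-point version stands in for the generated hardware, and the function names, stimulus and tolerance are all assumptions.

```python
import math


def moving_average_ref(samples, taps=4):
    """Behavioral reference: floating-point moving average.

    History before the first sample is treated as zero.
    """
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - taps + 1): i + 1]
        out.append(sum(window) / taps)
    return out


def moving_average_fixed(samples, taps=4, frac_bits=8):
    """Stand-in for hardware: same filter in truncating fixed point."""
    scale = 1 << frac_bits
    quantized = [math.floor(s * scale) for s in samples]
    out = []
    for i in range(len(quantized)):
        window = quantized[max(0, i - taps + 1): i + 1]
        out.append((sum(window) // taps) / scale)
    return out


def run_testbench(implementation, tolerance):
    """Apply one fixed stimulus and compare against the reference."""
    stimulus = [0.0, 1.0, 0.5, -0.25, 0.75, 0.125]
    expected = moving_average_ref(stimulus)
    actual = implementation(stimulus)
    return len(actual) == len(expected) and all(
        abs(a - e) <= tolerance for a, e in zip(actual, expected))


if __name__ == "__main__":
    print("behavioral model passes:", run_testbench(moving_average_ref, 0.0))
    # Allow two least-significant bits of error for the fixed-point model.
    print("fixed-point model passes:",
          run_testbench(moving_average_fixed, 2.0 / 256))
```

Reusing the testbench means that any mismatch found after synthesis points at the transformation rather than at a change in the test itself.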
Once the design has been solidified, users can put it through the Vivado Design Suite's physical-implementation flow to program their design into a device and run it in hardware. Alternatively, they can use the IP Packager to turn the design into a reusable piece of IP, stitch the IP into a design using IP Integrator or run it in System Generator.
This is merely one way to use the tool. In fact, in this issue of
Xcell Journal, Agilent's Nathan Jachimiec and Xilinx's Fernando
Martinez Vallina describe how they used the Vivado HLS technology
(called AutoESL technology in the ISE Design Suite flow) to develop a
UDP packet engine for Agilent.
VIVADO SIMULATOR
In addition to Vivado HLS, Xilinx also created a new mixed-language
simulator for the suite that supports Verilog and VHDL. With a single
click of the mouse, Feist said, users can launch behavioral
simulations and view results in an integrated waveform viewer.
Simulations are accelerated at the behavioral level using a new
performance-optimized simulation kernel that executes up to three
times faster than the ISE simulator. Gate-level simulations can also
run up to 100 times faster using hardware co-simulation.
AVAILABILITY IN 2012
Where Xilinx offered the ISE Design Suite in four editions aimed at
different types of designers (Logic, Embedded, DSP and System), the
company will offer the Vivado Design Suite in two editions. The base
Design Edition includes the new IP tools in addition to Vivado's
synthesis-to-bitstream flow. Meanwhile, the System Edition includes
all the tools of the Design Edition plus System Generator and
Xilinx's new Vivado HLS.
The Vivado Design Suite version 2012.1 is available now as part of an
early-access program. Customers should contact their local Xilinx
representative for more information. Public access will commence with
version 2012.2 in the middle of the second quarter, followed by
WebPACK availability later in the year. ISE Design Suite Edition
customers with current support will receive the new Vivado Design
Suite Editions in addition to ISE at no additional cost.
Xilinx will continue to support and develop the ISE Design Suite for
customers targeting devices prior to the 28-nm generation. To learn
more about Vivado, please visit www.xilinx.com/design-tools or come
see the suite in action at the Design Automation Conference (DAC),
June 3-7 in San Francisco, Booth 730.
Second Quarter 2012 Xcell Journal 13
Figure 3 - Vivado HLS allows design teams to begin their designs at a
system level. The flow starts at C, C++ or SystemC, produces RTL
(Verilog, VHDL or SystemC) and automates the verification and
implementation flow.
-
Gigabit Ethernet is one of the most ubiquitous interconnect options
available to link a workstation or laptop to an FPGA-based embedded
platform due to the availability of the hardened tri-mode Ethernet
MAC (TEMAC) primitive. The primary impediment in developing
Ethernet-based FPGA designs is the perceived processor requirement
necessary to handle the Internet Protocol (IP) stack. We approached
the problem using the AutoESL high-level synthesis tool to develop a
high-performance IPv4 User Datagram Protocol (UDP) packet transfer
engine.
Our team at Agilent's Measurement Research Lab wrote original C
source code based on Internet Engineering Task Force requests for
comments (RFCs) detailing packet exchanges among several protocols,
namely UDP, the Address Resolution Protocol (ARP) and the Dynamic
Host Configuration Protocol (DHCP). This design implements a hardware
packet-processing engine without any need for a CPU. The architecture
is capable of handling traffic at line rate with minimum latency and
is compact in logic-resource area. The usage of AutoESL makes it easy
to modify the user interface with minimum effort to adapt to one or
more FIFO streams or to multiple RAM interface ports. AutoESL is a
new addition to the Xilinx ISE Design Suite and is called Vivado HLS
in the new Vivado Design Suite (see cover story).
XCELLENCE IN COMMUNICATIONS
High-Level Synthesis Tool Delivers Optimized Packet Engine Design
AutoESL enabled the creation of an in-fabric, processor-free UDP
network packet engine.
by Nathan Jachimiec, PhD
R&D Engineer
Agilent Technologies
Technology Leadership
Fernando Martinez Vallina, PhD
Software Applications Engineer
Xilinx, Inc.
IPV4 USER DATAGRAM PROTOCOL
Internet Protocol version 4 (IPv4) is the dominant protocol of the
Internet, with version 6 (IPv6) growing steadily in popularity. When
most developers discuss IP, they commonly refer to the Transmission
Control Protocol, or TCP, a connection-based protocol that provides
reliability and congestion management. But for many applications such
as video streaming, telephony, gaming or distributed sensor networks,
increased bandwidth and minimal latency trump reliability. Hence,
these applications typically use UDP instead.
UDP is connectionless and provides no inherent reliability. If
packets are lost, duplicated or sent out of order, the sender has no
way of knowing and it is the responsibility of the user's application
to perform some packet inspection to handle these errors. In this
regard, UDP has been nicknamed the "unreliable" protocol, but in
comparison to TCP, it offers higher performance. UDP support is
available in nearly every major operating system that supports IP.
High-level software programming languages refer to network streams as
"sockets" and UDP as a datagram socket.
SENSOR NETWORK ARCHITECTURE
At Agilent, we developed a LAN-based sensor network that interfaces
an analog-to-digital converter (ADC) with a Xilinx Virtex-5 FPGA. The
FPGA performs data aggregation and then streams a requested number of
samples to a predetermined IP address, that is, a host PC. Because
the block RAM of our FPGA was almost completely devoted to signal
processing, we did not have enough memory to contain the firmware for
a soft processor. Instead, we opted to implement a minimal set of
networking functions to transfer sensor data via UDP back to a host.
Due to the need for high bandwidth and low latency, UDP packet
streaming was the preferred network mode.
Because of the time-sensitive nature of the data, a new set of sample
data is more pertinent than any retransmission of lost samples. One
of the two challenging issues we faced was to avoid overloading the
host device. That meant we had to find a way of efficiently handling
the large number of inbound samples. The second major challenge was
quickly formatting the UDP packet and calculating the required IP
header fields and the optional, but necessary, UDP payload checksum,
before the next set of samples overflowed internal buffers.
INITIAL HDL DESIGN
An HDL implementation of the packet engine was straightforward given
preexisting pseudocode, but not optimal for our FPGA hardware. C and
pseudocode provided from various sources simplified verification. In
addition, tools such as Wireshark, the open-source packet analyzer,
and high-level languages such as Java simplified the process of
simulation and in-lab verification.
Using provided pseudocode, the task of developing Verilog to generate
the packet headers involved coding a state machine, reading the
sample FIFO and assembling the packet into a RAM-based buffer. We
broke the design into three main modules, RX Flow, TX Flow and LAN
MCU, as shown in Figure 1. As packets arrive from the LAN, the RX
Flow inspects them and passes them either to the instrument core or
to the LAN MCU for processing, such as when handling ARP or DHCP
packets.
The TX Flow packet engine reads N ADC samples from a TX FIFO and
computes a running payload checksum for calculating the UDP checksum.
The TX FIFO buffers new samples as they arrive, while the LAN MCU
prepares the payload of a yet-to-be-transmitted packet. After
fetching the last requested sample, the LAN MCU computes the
remaining header fields of the IP/UDP packet. In network terminology,
this procedure is a TX checksum offload.
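The running-checksum idea can be sketched in a few lines of Python.
The fragment below is an illustration based on the Internet checksum
of RFC 1071 and the UDP pseudo header of RFC 768, not the authors' C
source; the function names and argument layout are assumptions. The
payload sum is accumulated first, as it would be while samples stream
in, and the pseudo header and UDP header are folded in afterward.

```python
import struct


def ones_complement_sum(data: bytes, start: int = 0) -> int:
    """16-bit one's-complement running sum, before the final inversion."""
    if len(data) % 2:
        data += b"\x00"                    # pad odd-length data with a zero
    total = start
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return total


def udp_checksum(src_ip: bytes, dst_ip: bytes,
                 src_port: int, dst_port: int, payload: bytes) -> int:
    """UDP checksum over pseudo header, UDP header and payload."""
    udp_len = 8 + len(payload)
    # Pseudo header: addresses, a zero byte, protocol 17 (UDP), UDP length.
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, udp_len)
    header = struct.pack("!HHHH", src_port, dst_port, udp_len, 0)
    total = ones_complement_sum(payload)             # running payload sum
    total = ones_complement_sum(pseudo + header, total)  # header added last
    checksum = ~total & 0xFFFF
    return checksum or 0xFFFF   # an all-zero result is sent as 0xFFFF
```

Because one's-complement addition is commutative and associative, the
order in which header and payload words are summed does not change
the result, which is what allows the header contribution to be added
only after the last sample has been fetched.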
Once the packet fields are generated, the LAN MCU sends the packet to
the TEMAC for transmission but retains it until the TEMAC
acknowledges successful transmission, not reception by the
destination device. As this first packet is awaiting transmission by
the TEMAC, new sensor samples are arriving into the TX FIFO. When the
first packet is finished, our packet engine releases the buffer to
prepare for the next packet. The process continues in a
double-buffered fashion. If the TEMAC signals an error and the next
transmit buffer overflow is imminent, then the packet is lost to
allow the next sample set to continue, but an exception is noted. Due
to time-stamping of the sample set incorporated into our packet
format, the host will realize a discontinuity in the set and
accommodate it.
Figure 1 - Our UDP packet engine design consisted of three main
modules: RX Flow, TX Flow and LAN MCU.
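The double-buffered hand-off described above can be modeled roughly
as follows. This is a simplified behavioral sketch in Python, not the
hardware design: the class and method names are hypothetical, and the
drop rule is reduced to "discard the unacknowledged packet when the
next buffer is already full."

```python
class PingPongTx:
    """Two packet buffers: one fills while the other awaits the MAC."""

    def __init__(self, samples_per_packet: int) -> None:
        self.n = samples_per_packet
        self.filling = []       # buffer collecting new samples
        self.in_flight = None   # buffer retained until the MAC acknowledges
        self.dropped = 0        # exception counter for discarded packets

    def push_sample(self, sample):
        """Store a sample; return a full packet when one is handed off."""
        self.filling.append(sample)
        if len(self.filling) < self.n:
            return None
        if self.in_flight is not None:
            # Old packet still unacknowledged and the next one is ready:
            # give it up so that fresh samples keep flowing, and note it.
            self.dropped += 1
        self.in_flight, self.filling = self.filling, []
        return self.in_flight

    def mac_done(self) -> None:
        """The MAC acknowledged transmission: release the retained buffer."""
        self.in_flight = None
```

A host that sees a gap in the time stamps of consecutive packets can
infer that a packet was dropped, which is the role the time-stamping
plays in the packet format.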
The latency to transmit a packet is the number of cycles it takes to
read in N ADC samples plus the cycles to generate the packet header
fields, including the IPv4 flags, source and destination address
fields, UDP pseudo header and both the IP and UDP checksums. The
checksum computations are rather problematic since they require
reading the entire packet, yet they lie before the payload bytes.
CODING HDL IN THE DARK
To support the high-bandwidth and low-latency requirements of the
sensor network, we needed an optimal hardware design to keep up with
the required sample rate. The straightforward approach we implemented
first in Verilog failed to meet a 125-MHz clock rate without
floorplanning, and took 17 clock cycles to generate the IP/UDP packet
header fields. As we developed the initial HDL design, ChipScope was
vital to understanding the nuances of the TEMAC interface, but it
also impeded the goal of achieving a 125-MHz clock. The additional
logic-capture circuits altered the critical path and would require
manual floorplanning for timing closure.
The critical path was calculating the IP and UDP header checksums,
because our straightforward design used a four-operand adder to sum
multiple header fields together in various states of our design. Our
HDL design attempted a "greedy" scheduling algorithm that tried to do
as much work as possible per cycle of the state machine. By removing
ChipScope on these operations and by floorplanning, we closed timing.
The HDL design also used only one port of a 32-bit-wide block RAM
that acted as our transmit packet buffer. We chose a 32-bit-wide
memory because that's the native width of the BRAM primitive and it
allowed for byte-enable write accesses that would avoid the need for
read-modify-write access to the transmit buffer.
Using byte enables, the finite state machine (FSM) writes directly to
the header field bytes needing modification at a RAM address.
However, what seemed like good design choices based on knowledge of
the underlying Xilinx fabric and algorithm yielded a nonoptimal
design that failed to meet timing without manual placement of the
four-input adders.
Because the UDP algorithms were already available in various forms in
C code or written as pseudocode in IP-related RFC documentation,
recoding the UDP packet engine in C was not a major task and yielded
better insight into the packet header processing. Just taking the
pseudocode and starting to write Verilog may have made for quicker
coding, but this methodology would have sacrificed performance
without fully studying the data and control flows involved.
ADVANTAGE AUTOESL
The ability for AutoESL to abstract the FIFO and RAM interfaces
proved to be one of the most beneficial optimizations for
performance. With the ability to code directly in C, we could now
easily include both ARP and DHCP routines into our packet engine.
Figure 2 shows a flowchart of our design. Our HDL design utilized a
byte-wide FIFO interface that connected to the aggregation and sensor
interface of our design, which remained in Verilog. Also, our Verilog
design utilized a 32-bit memory interface that collected 4 bytes of
sample data and then saved it in the transmit buffer RAM as a 32-bit
word.
By means of its "array reshape" directive, AutoESL optimized the
memory interface so that the transmit buffers, while written in C
code as an 8-bit memory, became a 32-bit memory. This meant the C
code could avoid having to do many bit manipulations of the header
fields, as they would require bit shifting to place into a 32-bit
word. It also alleviated little-endian vs. big-endian byte-ordering
issues. This optimization reduced the latency of the TX offload
function that computes the packet checksums and generates header
fields from 17 clocks, as originally written in Verilog, to just
seven clock cycles while easily meeting timing. AutoESL could do
better in the future, since this current version does not have the
ability to manipulate byte enables on RAM writes. Byte-enabled memory
support is on the long-term road map for the tool.
Another optimization that AutoESL performed, which we found by
serendipity, was to access both ports of our memory, since Xilinx
block RAM is inherently dual-port. Our Verilog design reserved the
second port of the transmit buffer so that its interface to the TEMAC
would be able to access the buffer without any need for arbitration.
By allowing AutoESL to optimize for our true dual-port RAM, it was
capable of performing reads or writes from two different locations of
the buffer. In effect, this wound up halving the number of cycles
necessary to generate the header. The reduction in latency was well
worth the effort in creating a simple arbiter in Verilog for the
second port of the memory so that the TEMAC interface could access
the memory port that AutoESL usurped.
We controlled the bit widths of the transmit buffer and the sample
FIFO interfaces via directives. Unfortunately, AutoESL does not
automatically optimize your design. Instead, you have to experiment
with a variety of directives and determine through trial and error
which of them is delivering an improvement. For our design, reducing
the number of clock cycles to process the packet fields while
operating at 125 MHz was the goal.
The "array reshape" and "loop pipeline" directives were important for
optimizing the design. The reshape directive alters the bit width of
the RAM and FIFO interfaces, which ultimately led to processing
multiple header fields in parallel per clock cycle and writeback to
memory. The optimal combination that yielded the fewest cycles was a
transmit buffer bit width of 32. The width of the FIFO feeding ADC
samples was not a factor in reducing the overall latency because it's
impossible to force samples to arrive any faster.
The loop-pipelining directive is extremely important too, because it
indicates to the compiler that our loops that push and pop from our
FIFO interfaces can operate back-to-back. Otherwise, without the
pipeline directive, AutoESL spent three to 20 clock cycles between
pops of the FIFO due to scheduling reasons. It is therefore vital to
utilize pipelining as much as possible to attain low latency when
streaming data between memories.
Xilinx block RAM also has a programmable data output latency of one
to three clock cycles. Allowing three cycles of read latency enables
the minimum clock-to-Q timing. To experiment with different read
latencies was only a matter of changing the latency directive for the
RAM primitive or "core" resource. Because of the scheduling
algorithms that AutoESL performed, adding a read latency of three
cycles to access the RAM only tacked on one additional cycle of
latency to the overall packet header generation. The extra cycle of
memory latency allowed for more slack in the design, and that aided
the place-and-route effort.
We also implemented ARP and DHCP routines in our AutoESL design that
we had avoided doing before because of the level of effort required
to code them in Verilog. While not difficult, both ARP and DHCP are
extremely cumbersome to write in Verilog and would require a great
number of states to perform. For instance, the ARP request/response
exchange required more than 70 states. One coding error in the
Verilog FSM would likely require multiple days to undo. For this
reason alone, many designers would prefer just to use a CPU to run
these network routines.
Overall, AutoESL excelled at generating a synthesizable netlist for
the UDP packet engine. The module it generated fit between our two
preexisting ADC and TEMAC interface modules and performed the
necessary packet header generation and additional tasks. We were able
to integrate the design it created into our core design and simulate
it with Mentor Graphics' ModelSim to perform functional verification.
With the streamlined design, we were able to reach timing closure
with less synthesis, map and place-and-route effort than with our
original HDL design. Yet we have significantly more functionality
now, such as ARP and DHCP support.
Comparing our original design in Verilog with our hybrid design that
utilized AutoESL to craft our LAN MCU and TX Flow modules yielded
impressive results. Table 1 shows a comparison of lookup table (LUT)
usage. Our HDL version of TX Flow was smaller by more than 37
percent, but our AutoESL design incorporated more functionality. Most
impressive is that AutoESL reduced the number of cycles to perform
our packet header generation by 59 percent. Table 2 shows the latency
of the TX Offload algorithm.
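The percentages quoted above can be reproduced from the LUT and cycle
counts given in Tables 1 and 2. The short Python check below is only
arithmetic on those published numbers; the helper names are
illustrative, and the conversion to nanoseconds assumes the 125-MHz
clock mentioned in the text.

```python
def percent_reduction(before: float, after: float) -> float:
    """How much smaller `after` is than `before`, in percent of `before`."""
    return 100.0 * (before - after) / before


def cycles_to_ns(cycles: int, clock_mhz: float) -> float:
    """Duration of `cycles` clock periods in nanoseconds."""
    return cycles * 1000.0 / clock_mhz


HDL_LUTS, AUTOESL_LUTS = 858, 1372
HDL_CYCLES, AUTOESL_CYCLES = 17, 7

# HDL TX Flow size relative to the AutoESL version (about 37.5 percent smaller).
hdl_smaller_pct = percent_reduction(AUTOESL_LUTS, HDL_LUTS)
# The same gap measured from the HDL side (about 59.9 percent more LUTs).
autoesl_larger_pct = 100.0 * (AUTOESL_LUTS - HDL_LUTS) / HDL_LUTS
# Header-generation latency improvement (about 58.8 percent).
latency_gain_pct = percent_reduction(HDL_CYCLES, AUTOESL_CYCLES)
# Header-generation time at 125 MHz, in nanoseconds.
hdl_ns = cycles_to_ns(HDL_CYCLES, 125.0)
autoesl_ns = cycles_to_ns(AUTOESL_CYCLES, 125.0)
```

Note that the 37.5 figure is the LUT difference expressed relative to
the larger AutoESL design, which matches the statement that the HDL
version was smaller by more than 37 percent.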
The critical path of the HDL design was computing the UDP checksum.
Comparing this with the AutoESL design shows that the HDL design
suffered from 10 levels of logic and a total path delay of 6.4
nanoseconds, whereas AutoESL optimized this to only three levels of
logic and a path delay of 3.5 ns. Our development time for the HDL
design was about a month of effort. We took about the same amount of
time with AutoESL, but incorporated more functionality while gaining
familiarity with the nuances of the tool.
Figure 2 - Packet engine flowchart shows inclusion of ARP and DHCP.
LATENCY AND THROUGHPUT
AutoESL has a significant advantage over HDL design in that it
performs control and data-flow analyses and can use this information
to reorder operations to minimize latency and increase throughput. In
our particular case, we used a greedy algorithm that tried to do too
many arithmetic operations per clock cycle. This tool rescheduled our
checksum calculations so as to use only two-input adders, but
scheduled them in such a way to avoid increasing overall execution
latency.
Software compilers intrinsically perform these types of exercises. As
state machines become more complex, the HDL designer is at a
disadvantage compared to the omniscience of the compiler. An HDL
designer would typically not have the opportunity to explore the
effect of more than just two architectural choices because of time
constraints to deliver a design, but this may be a vital task to
deliver a low-power design.
The most important benefit of this tool was its ability to try a
variety of scenarios, which would be tedious in Verilog, such as
changing bit widths of FIFOs and RAMs, partitioning a large RAM into
smaller memories, reordering arithmetic operations and utilizing
dual-port instead of single-port RAM. In an HDL design, each scenario
would likely cost an additional day of writing code and then
modifying the testbench to verify correct functionality. With AutoESL
these changes took minutes, were seamless and did not entail any
major modification of the source code.
Modifying large state machines is extremely cumbersome in Verilog.
The advent of tools like AutoESL is reminiscent of the days when
processor designers began to employ microprogramming instead of
hand-constructing the microcoded state machines of early
microprocessors such as the 8086 and 68000. With the arrival of RISC
architectures and hardware description languages, microprogramming is
now mostly a lost art form, but its lesson is well learned in that
abstraction is necessary to manage complexity. As microprogramming
offered a higher layer of abstraction of state machine design, so too
does AutoESL, or high-level synthesis tools in general. Tools of this
caliber allow a designer to focus more on the algorithms themselves
rather than the low-level implementation, which is error prone,
difficult to modify and inflexible with future requirements.
TX Flow Resource Usage
        HDL TX    AutoESL TX    % Increase
LUTs    858       1,372         37.5
Table 1 - The AutoESL design used more lookup tables but
incorporated more functionality.
Latency
                HDL    AutoESL    % Improved
Clock Cycles    17     7          58.8%
Table 2 - AutoESL improved the latency of the TX Offload algorithm.
-
XCELLENCE IN DISTRIBUTED COMPUTING
Accelerating Distributed Computing with FPGAs
An SoC network that uses Xilinx partial-reconfiguration technology
offers cloud computing for algorithms under test with large stimulus
data sets.
by Frank Opitz, MSc
Hamburg University of Applied Sciences
Faculty of Engineering and Computer Science
Department of Computer Science
Edris Sahak, BSc
Hamburg University of Applied Sciences
Faculty of Engineering and Computer Science
Department of Computer Science
Bernd Schwarz, Prof. Dr.-Ing.
Hamburg University of Applied Sciences
Faculty of Engineering and Computer Science
Department of Computer Science
Rather than install faster, more power-hungry supercomputers to
tackle increasingly complex scientific algorithms, universities and
private companies are applying distributed platforms upon which
projects like SETI@home compute their data using thousands of
personal computers. [1, 2] Current distributed computing networks
typically use CPUs or GPUs to compute the project data.
FPGAs, too, are being harnessed in projects like COPACOBANA, which
employs 120 Xilinx FPGAs to crack DES-encrypted files using
brute-force processing. [3] But in this case, the FPGAs are all
collected in one place, an expensive proposition not appropriate for
small university or company budgets. Currently FPGAs are not noted as
a distributed computing utility because their use demands the
involvement of a PC to continually reconfigure the whole FPGA with a
new bitstream. But now, with the application of the Xilinx
partial-reconfiguration technology, it's feasible to design
FPGA-based clients for a distributed computing network.
Our team at the Hamburg University of Applied Sciences created a
prototype for such a client and implemented it in a single FPGA. We
structured the design to consist of two sections: a static and a
dynamic part. The static part loads at startup of the FPGA, while its
implemented processor downloads the dynamic part from a network
server. The dynamic part is the partial-reconfiguration region, which
offers shared FPGA resources. [4] With this configuration, the FPGAs
may be situated anywhere in the world, offering computing projects
access to a high amount of computing power with a lower budget.
DISTRIBUTED SOC NETWORK
With their parallel signal-processing resources, FPGAs provide four
times the data throughput of a microprocessor by using a clock that
is eight times slower and with eight times lower power consumption.
[5] To leverage this computational power for high-data-input rates,
designers typically implement algorithms as a pipeline, like DES
encryption. [3] We developed the distributed SoC network (DSN)
prototype to increase the speed of such algorithms and to process
large data sets using distributed FPGA resources. Our network design
applies a client-broker-server architecture so that we can assign all
registered system-on-chip (SoC) clients to every network
participant's computational project (Figure 1). This would be
impossible in a client-server architecture, which connects every SoC
client to only one project.
Furthermore, we chose this broker-server architecture to reduce the
number of TCP/IP connections of each FPGA to just one. The DSN FPGAs
compute the algorithms with dedicated data sets while the
broker-server manages the SoC clients and the project clients. The
broker schedules the connected SoC clients so that each project has
nearly the same computing power at the same time, or uses time slices
if there are fewer SoCs than projects with computational requests
available.
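One way to picture the broker's scheduling policy is the Python
sketch below. It is an assumption-laden illustration, not the
broker-server's actual code: the function name, the round-robin rule
and the explicit time_slice counter are all hypothetical.

```python
def assign_socs(socs: list, projects: list, time_slice: int = 0) -> dict:
    """Map each project to the SoC clients serving it in one interval."""
    assignment = {project: [] for project in projects}
    if not projects:
        return assignment
    if len(socs) >= len(projects):
        # Enough SoCs: deal them out round-robin for a near-equal share.
        for i, soc in enumerate(socs):
            assignment[projects[i % len(projects)]].append(soc)
    else:
        # Fewer SoCs than projects: rotate the served projects per slice.
        for i, soc in enumerate(socs):
            index = (time_slice * len(socs) + i) % len(projects)
            assignment[projects[index]].append(soc)
    return assignment
```

With two SoCs and three projects, for example, calling the function
for time slices 0, 1 and 2 serves every project in exactly two of the
three slices, so the computing power evens out over time.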
The project client delivers the partial-reconfiguration module (PRM)
and a set of stimulus input data. After connecting to the
broker-server, the project client sends the PRM bit files to the
server, which distributes them to SoC clients with a free partially
reconfigurable region (PRR). The SoC client's static part, a
MicroBlaze-based microcontroller, reconfigures the PRR dynamically
with the received PRM. In the next step, the project client starts
sending data sets and receives the computed response from the SoC
client via the broker-server. Depending on the project client's
intentions, it compares different computed sets or evaluates them for
its computational aims, for example.
Figure 1 - Distributed SoC network with SoC clients provided by FPGAs
and managed by a central broker-server. Project clients distribute
the partial-reconfiguration modules and data sets. The dynamic part
of an SoC client supplies resources via the PRR, and a
microcontroller contained in the static part processes the
reconfiguration.
THE SOC CLIENT
We developed the SoC client for a Xilinx Virtex-6 FPGA, the
XC6VLX240T, which comes with the ML605 evaluation board. A MicroBlaze
processor runs the client's software, which manages partial
reconfigurations along with bitstream and data exchanges (Figure 2).
A Processor Local Bus (PLB) peripheral that encapsulates the PRR in
its user logic is the interface between the static and the dynamic
parts. In the dynamic part reside the shared FPGA resources for
accelerator IP cores supplied by the received PRM. To store received
and computed data, we chose DDR3 memory instead of CompactFlash
because of its higher data throughput and the unlimited amount of
write accesses. The PRM is stored in a dedicated data section to
control its size and to avoid conflicts with other data sets. The
section is set to 10 Mbytes, which is big enough to store a complete
FPGA configuration. Thus, every PRM should fit in this section.
We also created data sections for the received and the computed data
sets. These are 50 Mbytes in size so as to ensure enough address
space for images or encrypted text files, for example. Managing these
data sections relies on an array of 10 administration structures; the
latter contain the start and end addresses of each data set pair and
a flag that indicates computed sets.
To connect the static part to the PRR, we evaluated IP connections
given by the Xilinx EDK, such as the Fast Simplex Link (FSL), PLB
slave and PLB master. We chose a PLB master/slave combination to get
an easy-to-configure IP that sends and receives data requests without
the MicroBlaze's support, significantly reducing the number of clock
cycles per word transfer.
For the client-server communication, the FPGA's internal hard
Ethernet IP is an essential peripheral of the processor system's
static part. With the soft-direct-memory access (SDMA) of the
local-link TEMAC to the memory controller, the data and bit file
transfers produce less PLB load. After receiving a frame of 1,518
bytes, the SDMA generates an interrupt request, so that the
lwip_read() function unblocks and can handle this piece of data. The
lwip_write() function tells the SDMA to perform a DMA transfer over
the TX channel to the TEMAC.
Figure 2 - The SoC client is a processor system with a static part
and a bus master peripheral, which contains the partially
reconfigurable region (PRR). Implemented with Virtex-6 FPGA
XC6VLX240T on an ML605 board.
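The bookkeeping for the received and computed data sections, an array
of 10 administration structures holding start and end addresses plus
a computed flag, might be organized along the lines of the Python
sketch below. The field names, the base addresses passed in by the
caller, the simple bump allocation and the reading of "Mbyte" as
1,024 x 1,024 bytes are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

MBYTE = 1024 * 1024          # assumption: binary megabytes
SECTION_SIZE = 50 * MBYTE    # size of each data section
MAX_ENTRIES = 10             # number of administration structures


@dataclass
class DataSetEntry:
    in_start: int            # first byte of the received data set
    in_end: int              # one past its last byte
    out_start: int           # first byte reserved for the result
    out_end: int             # one past the last reserved byte
    computed: bool = False   # set once the result is available


class DataSetTable:
    def __init__(self, in_base: int, out_base: int) -> None:
        self.in_base, self.out_base = in_base, out_base
        self.entries: List[DataSetEntry] = []

    def allocate(self, in_len: int, out_len: int) -> Optional[DataSetEntry]:
        """Reserve address ranges for one data set pair, or return None."""
        in_start = self.entries[-1].in_end if self.entries else self.in_base
        out_start = self.entries[-1].out_end if self.entries else self.out_base
        fits = (in_start + in_len <= self.in_base + SECTION_SIZE and
                out_start + out_len <= self.out_base + SECTION_SIZE)
        if len(self.entries) >= MAX_ENTRIES or not fits:
            return None
        entry = DataSetEntry(in_start, in_start + in_len,
                             out_start, out_start + out_len)
        self.entries.append(entry)
        return entry
```

A caller would allocate an entry when a data set arrives, hand its
addresses to the PRR peripheral and flip the computed flag once the
result range has been filled.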
We implemented the Xilkernel, a kernel for Xilinx embedded
processors, as an underlying real-time operating system of the SoC
client's software in order to utilize the lightweight TCP/IP stack
(LwIP) library with the socket mode for the TCP/IP server connection.
Figure 3 provides an overview of the client's thread initialization,
creation, transmission and processing sequences. The SoC client
thread initiates a connection to the server and receives a PRM
bitstream ("pr"), which it stores in DDR3 memory, applying the XILMFS
file system. Thereafter the Xps_hwicap (hardware internal
configuration access point) reconfigures the PRR with the PRM.
Finally, the bus master peripheral sets a status bit that instructs
the SoC client to send a request to the server. The server responds
with a data set ("dr"), which the SoC client stores in the onboard
memory as well. These data files contain a content sequence such as
"output_length+ol+data_to_compute." The output_length is the byte
length, which reserves the memory range for the result data, followed
by the character pair "ol." With the first received "dr" message, a
compute and a send thread get created.
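A parser for the data-set layout described above could look like the
following Python sketch. The encoding of output_length as ASCII
decimal digits is an assumption (the article does not state it), and
the function name is hypothetical.

```python
def parse_data_record(message: bytes):
    """Split output_length + 'ol' + data into (result_length, data)."""
    # partition() cuts at the first "ol"; decimal digits cannot contain
    # the marker, so payload bytes that happen to include it stay intact.
    length_text, marker, data = message.partition(b"ol")
    if not marker:
        raise ValueError("missing 'ol' marker")
    return int(length_text.decode("ascii")), data


result_length, data_to_compute = parse_data_record(b"16olhello world")
```

The returned length tells the client how much memory to reserve for
the result before it starts the computation on the data that follows
the marker.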
The compute thread transfers the addresses of the input-and-result
data sets to the slave interface of the PRR peripheral and starts the
PRM's autonomous data set processing. An administration structure
provides these addresses for each data set and contains a done flag,
which is set after the result data is completely available. In the
current version of the client's software concept, the compute and
send threads communicate via this structure, with the send thread
checking the done bit repeatedly and applying the lwip_write() calls
on results stored in memory.
When testing the SoC client, we determined that with all interrupts
enabled while the reconfiguration of the PRR is in progress, this
process gets stuck randomly after the Xilkernel's timer generates a
scheduling call to the MicroBlaze. This didn't happen with all
interrupts disabled or while using a standalone software module for
the SoC client's MicroBlaze processor without the Xilkernel's
support.
Figure 3 - SoC client's software initialization and processing cycles
include reconfiguration of the PRR with a PRM, data set retrieved
from the server, start of processing and the data set's return to the
server threads. Black bars indicate thread creation by
sys_thread_new() calls from the Xilkernel library.
BUS MASTER PERIPHERAL WITH PRM INSTANTIATION
To achieve a self-controlled stimulus data and result exchange
between the PRM and the external memory, we structured the bus master
peripheral as a processor element with a data and a control path
(Figure 4). Within the data path, we embedded the PRM interface
between two FIFO blocks with a depth of 16 words each in order to
compensate for communication and data transfer delays. Both FIFOs of
the data path are connected directly to the PLB's bus master
interface. In this way, we obtain a significant timing advantage from
a straightforward data transfer operated by a finite state machine
(FSM). No software is involved, so no intermediate data storage takes
place in the MicroBlaze's register file. This RISC processor's
load-store architecture always requires two bus transfer cycles for
loading a CPU register from an address location and storing the
register's content to another PLB participant. With the DXCL data
cache link of the MicroBlaze to the memory controller as a bypass to
the PLB, the timing of these load-store cycles would not improve.
That's because the received data and the transmitted computing
results are all handled once, word by word, without utilizing caching
benefits. As a consequence, the PRR peripheral's activities are
decoupled from the MicroBlaze's master software processing. Thus, the
PRR data transfer causes no additional Xilkernel context switches.
But there is still the competition of two masters for a bus access,
which can't be avoided.
The peripheral's slave interface contains four software-driven
registers that provide the control path with start and end addresses
of the input and output data sets. Another software register
introduces a start bit to the FSM, which initiates the master data
transfer cycles. The status of a completed cycle of data processing
is available with the address of the fifth software register to the
client's software.
With the state diagram of the control path's FSM, the strategy to
prioritize the write cycles to the PLB becomes clear (Figure 5).
Pulling out the data from the OUT_FIFO dominates over filling the
IN_FIFO, to prevent a full OUT_FIFO from stopping the PRM from
processing the algorithm. Reading from or writing to the external
memory occurs in alternate sequences, because only one kind of bus
access at a time is available. When a software reset from the
client's compute thread starts the FSM (Figure 3), the first thing
that happens is a read from the external memory (state READ_REQ).
From then on, the bus master follows the decision logic given by the
transition conditions from state STARTED (Table 1).
IN_FIFO      OUT_FIFO    Memory access   Next state
don't care   not empty   writing         WRITE_REQ
not full     empty       reading         READ_REQ
full         empty       -               STARTED
Table 1: FSM control decisions in state STARTED with write priority
[Figure 4 block diagram labels: IPIF and User_Logic; data path with
IN_FIFO, PRM-Interface and OUT_FIFO, connected by 32-bit Data_in and
Data_out buses with FIFO_full_n, FIFO_empty_n, FIFO_read, FIFO_write,
Data_in_en, Data_in_ready, Data_out_en, Data_out_free and Enable
signals; control path with 5 SW registers, FSM and address generator;
FSM signals Start, Start_read, End_read, Start_write and End_write;
bus signals Bus2IP_MstRd_d, Bus2IP_Data, Bus2IP_BE, Bus2IP_WRCE,
Bus2IP_RDCE, Bus2IP_Mst_CmdAck, Bus2IP_Mst_Cmplt, IP2Bus_MstRD_Req,
IP2Bus_MstWR_Req, IP2Bus_MstWR_d, IP2Bus_Mst_addr and IP2Bus_Mst_BE]
Figure 4: Bus master peripheral operates as a processor element. The
PRM interface includes the dynamic part with a component
instantiation of the PRM.
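The decision logic of Table 1 can be written down as a small Python function. This is a behavioral sketch only, not the authors' VHDL: the function name is hypothetical, and the address end conditions that also appear in the state diagram are left out.

```python
def next_state_from_started(in_fifo_full: bool, out_fifo_empty: bool) -> str:
    """Next state when the FSM is in STARTED, following Table 1."""
    if not out_fifo_empty:
        # Write priority: drain the OUT_FIFO first, whatever the IN_FIFO state.
        return "WRITE_REQ"
    if not in_fifo_full:
        # Nothing to write back, and the IN_FIFO still has room.
        return "READ_REQ"
    # IN_FIFO full and OUT_FIFO empty: wait for the PRM to produce results.
    return "STARTED"
```

Testing the OUT_FIFO before the IN_FIFO is what gives write cycles priority over read cycles.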
The FSM Mealy outputs (label Exit/) prepare the address counters to
increment when a bus transfer is completed. Here, the two counters
are introduced directly into the FSM code. Usually we prefer timers
and address counters as separate clocked processes enabled simply by
FSM outputs, in order to keep the counter's transition logic small
and free from unnecessary multiplexer inputs for counter state
feedback. At this point, the XST synthesis compiler results present
RTL schematics with a clear FSM extraction parallel to loadable
counters, with clock-enable inputs driven by an expected state
decoding logic. Despite a more readable behavioral VHDL coding style,
the FPGA resources and simple primitives get utilized without a loss
of features.
DEFINING THE DYNAMIC PART WITH PLANAHEAD
The design flow for configuration of a static and a dynamic part
within the FPGA is a complex development process that involves
several steps with the physical-design constraints tool PlanAhead.
Our first effort was a script-based design flow for a
PetaLinux-driven dynamic-reconfiguration platform implemented on an
ML505 board [6]. With the current iteration, the design steps for
integrating a PRR directly into a peripheral's user logic are much
more practical than the former method of adding bus macros and a
device control register (DCR) as a PLB interface for the PRM and an
extra PLB-DCR bridge for enabling the bus macros.
Here is how we fixed the dynamic part's size and position with the
AREA_GROUP constraints, which are included in the UCF file of the
PlanAhead project, as shown in the code below.
INST "dyn_interface_0/dyn_interface_0/USER_LOGIC_I/PRR" AREA_GROUP = "pblock_dyn_interface_0_USER_LOGIC_I_PRR";
AREA_GROUP "pblock_dyn_interface_0_USER_LOGIC_I_PRR" RANGE=SLICE_X0Y0:SLICE_X57Y239;
AREA_GROUP "pblock_dyn_interface_0_USER_LOGIC_I_PRR" RANGE=RAMB18_X0Y0:RAMB18_X3Y95;
AREA_GROUP "pblock_dyn_interface_0_USER_LOGIC_I_PRR" RANGE=RAMB36_X0Y0:RAMB36_X3Y47;
An instance name concatenation specifies the inner
partial-reconfiguration region (prm_interface.vhd) with instance name
PRR. For all FPGA resources we want to include in the desired PRR, we
specify a rectangular region with its lower-left and upper-right
coordinates.
This special choice covers slices and BRAM only, because the
available DSP elements belong to dedicated clock regions and are
utilized for the Multiport Memory Controller (MPMC) implementation
(Table 2).
Resource        Amount
LUT             55,680
FD_LD           111,360
SLICEL          7,440
SLICEM          6,480
RAMBFIFO36E1    192
Table 2: Allocated resources for the dynamic part of the SoC client
[Figure 5 state diagram labels: IDLE; STARTED; READ_REQ;
WAIT_FOR_CMP; WRITE_REQ; WAIT_FOR_WCMP; transition condition
IN_FIFO_full_n == 1 and Read_address != end_address_read_data;
output Exit / Read_address]
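The amounts in Table 2 can be cross-checked against the rectangular RANGE coordinates of the AREA_GROUP constraints. The Python sketch below counts the sites in each rectangle; it assumes four LUTs and eight flip-flops per slice, which matches the ratios in Table 2 but is not stated in the article, and the helper name sites_in_range is hypothetical.

```python
import re

_RANGE = re.compile(r"([A-Z0-9]+)_X(\d+)Y(\d+):([A-Z0-9]+)_X(\d+)Y(\d+)")

# ASSUMPTION: per-slice resources, inferred from the ratios in Table 2.
LUTS_PER_SLICE = 4
FFS_PER_SLICE = 8


def sites_in_range(rng: str) -> int:
    """Count the sites inside a rectangular RANGE such as SLICE_X0Y0:SLICE_X57Y239."""
    match = _RANGE.fullmatch(rng.strip())
    if match is None or match.group(1) != match.group(4):
        raise ValueError(f"not a rectangular site range: {rng!r}")
    x0, y0, x1, y1 = (int(match.group(i)) for i in (2, 3, 5, 6))
    return (x1 - x0 + 1) * (y1 - y0 + 1)


slices = sites_in_range("SLICE_X0Y0:SLICE_X57Y239")
ramb18 = sites_in_range("RAMB18_X0Y0:RAMB18_X3Y95")
ramb36 = sites_in_range("RAMB36_X0Y0:RAMB36_X3Y47")
luts = slices * LUTS_PER_SLICE
flip_flops = slices * FFS_PER_SLICE
```

Under these assumptions the slice rectangle covers 58 x 240 sites, which equals the sum of the two slice rows in Table 2.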
To prevent the PRM netlists that ISE generates from using excluded
resources, we set the synthesis options to dsp_utilization_ratio = 0;
use_dsp48 = false; iobuf = false. Finally, the FPGA Editor offers an
insight: that the static part's placement is located in an area
separated completely from the PRR, which in this special case uses
very few resources (Figure 6).
AN SOC CLIENT WITH IMAGE-PROCESSING PRM
We proved the SoC client's operation and its TCP/IP server
communication with a Sobel/median filter combination implemented in a
PRM (Figure 7). We developed the image-processing neighborhood
operations with the Xilinx System Generator, which gave us the
advantage of Simulink simulation and automatic RTL code generation. A
deserializer converted the input pixel stream to a 3 x 3-pixel array,
which sequences like a mask over the whole image and provides the
input to the filter's parallel sum of products or to the successive
comparisons of the median filter [7]. Input and output pixel vectors
of the filters have a width of 4 bits, so we inserted a PRM wrapper
that multiplexes the eight nibbles of the 32-bit input vector from
the synchronization FIFO. With a MATLAB script, we convert an
800 x 600 PNG image to 4-bit gray-scale pixels for the PRM input
stimulus. At the filter's output, eight 4-bit registers are
successively filled and concatenated for the word transfer to the
OUT_FIFO (Figure 4).
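The wrapper's nibble handling can be modeled in a few lines of Python. The article does not state the nibble order within a word, so this sketch assumes that the first pixel occupies the most significant nibble; the function names are hypothetical.

```python
from typing import List


def unpack_nibbles(word: int) -> List[int]:
    """Split a 32-bit word into eight 4-bit pixels.

    ASSUMPTION: the first pixel sits in the most significant nibble.
    """
    return [(word >> shift) & 0xF for shift in range(28, -1, -4)]


def pack_nibbles(pixels: List[int]) -> int:
    """Concatenate eight 4-bit pixels into one 32-bit word (same order)."""
    if len(pixels) != 8:
        raise ValueError("exactly eight pixels are needed per word")
    word = 0
    for pixel in pixels:
        if not 0 <= pixel <= 0xF:
            raise ValueError("pixel does not fit in 4 bits")
        word = (word << 4) | pixel
    return word


pixels = unpack_nibbles(0x12345678)
word = pack_nibbles(pixels)
```

With eight pixels per word, an 800 x 600 image occupies 60,000 words in each direction.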
Table 3 summarizes the results of timing measurements performed with
three operational steps of the SoC client: receiving a PRM bit file,
reconfiguration of the PRR and image-processing sequences. We
captured the receiving and image-processing cycles, from the first to
the last data transfer, with a digital oscilloscope measurement at a
GPIO output toggled by XGpio_WriteReg() calls.
                          Interval duration
Filter module   Slices   PRM receive   Reconfiguration,            Image processing
                         (seconds)     3.5-Mbyte bit file (sec)    (ms)
Binarize        3        77            31.25                       25.25
Erosion 3x3     237      73            31.25                       85.93
Median 3x3      531      73            31.25                       77.09
Sobel 3x3       479      73            31.25                       86.45
Table 3: Timing measurement results; reconfiguration with disabled
interrupts. Processor and peripheral clock rate fclk = 100 MHz.
Figure 6: Resource placement of the static part (right side) and
dynamic part (left side, with white oval) according to the area
specification for the PRR
The reconfiguration intervals all have the same duration, because no
Xilkernel scheduling events disrupted the software-driven HWICAP
operation. An FSM-controlled HWICAP operation without MicroBlaze
interaction will yield a shorter duration with a reconfiguration
speed of more than 112 kbytes/second, even with enabled interrupts.
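The 112 kbytes/second baseline appears to follow from the software-driven measurement in Table 3, as this short Python calculation suggests. It assumes that 1 Mbyte means 10^6 bytes; the function name is hypothetical.

```python
def reconfig_rate_kbytes_per_s(bitfile_mbytes: float, seconds: float) -> float:
    """Average reconfiguration speed, assuming 1 Mbyte = 10**6 bytes."""
    return bitfile_mbytes * 1e6 / seconds / 1e3


# Values from Table 3: a 3.5-Mbyte bit file written in 31.25 seconds.
software_driven_rate = reconfig_rate_kbytes_per_s(3.5, 31.25)
```

An FSM-controlled HWICAP would have to beat this figure to deliver the shorter duration the text expects.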
During PRM transmission from the broker to the SoC client, the
connection soon aborted. With a 1-millisecond delay between each
transmitted 100 bytes, the SoC client performed a nondisturbed
communication. Parallel to the image-processing cycles, normal
Xilkernel threading caused PLB access competition and therefore, the
SoC client operated under typical conditions. The binarize sequence
has a duration value of 600 x 800/100 MHz = 4.8 ms, because only a
single comparison is active. This sequence is nested in two image
transfers via the PLB, which take a minimum of five clocks per word,
as extracted from a functional bus simulation:
2 x 5 x 600 x 800/(8 x 100 MHz) = 6 ms. Because all measurement
numbers for the data transfers are larger than rough estimates led us
to expect, we are in the midst of a detailed analysis of the full
timing chain buildup by bus reading, FIFO filling and emptying,
image-processing pipeline and bus writing.
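The two rough estimates can be reproduced with a short Python calculation. The constants come from the text and from Table 3; the function and variable names are hypothetical, and adding the two estimates assumes that processing and transfers do not overlap at all.

```python
F_CLK_HZ = 100e6        # processor and peripheral clock rate
PIXELS = 600 * 800      # pixels in one image
PIXELS_PER_WORD = 8     # eight 4-bit pixels per 32-bit word
CLOCKS_PER_WORD = 5     # minimum taken from the functional bus simulation
TRANSFERS = 2           # one image read plus one image write


def processing_ms(pixels: int = PIXELS) -> float:
    """One pixel per clock through the filter pipeline."""
    return pixels / F_CLK_HZ * 1e3


def transfer_ms(pixels: int = PIXELS) -> float:
    """Both image transfers via the PLB at the minimum clocks per word."""
    words = pixels / PIXELS_PER_WORD
    return TRANSFERS * CLOCKS_PER_WORD * words / F_CLK_HZ * 1e3


estimate_ms = processing_ms() + transfer_ms()  # upper bound without overlap
measured_binarize_ms = 25.25                   # from Table 3
gap_ms = measured_binarize_ms - estimate_ms
```

Even the non-overlapping sum stays well below the measured binarize interval, which is the gap the detailed timing analysis is meant to explain.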
POWER OF PARTIAL RECONFIGURATION
To compute complex algorithms, it is profitable to employ the power
of distributed computing networks. State-of-the-art implementations
of these networks operate with CPUs and GPUs only. Our prototype of
an FPGA-based distributed SoC network architecture utilizes the
parallel signal-processing features of FPGAs to compute complex
algorithms.
The Xilinx partial-reconfiguration technology holds the key to
utilizing shared FPGA resources all over the world. In our
architecture, the static part of the SoC client reconfigures the
dynamic part of the FPGA with updated accelerators in a
self-controlled way. We have to improve the SoC client to run the
HWICAP with enabled interrupts, so that it stays fully reactive. A
step in that direction is an FSM-controlled reconfiguration, which
puts no load on the processor. But we need to analyze the influence
of PLB transfers and the MPMC bottleneck as well.
To manage the SoC client, a Xilkernel linked with the LwIP supplies
concurrency with threads for the reconfiguration drivers, the dynamic
part's bus interface and other applications. We further concentrate
on timing analysis of the client-server system and the dynamic part's
processing cycles in order to identify the software/RTL-model
configuration with an improved data throughput and a reliable
communication.
For the next stage of our SoC client design, we have to take the AXI4
bus features into account. In general, PRM exchanges can be treated
as additional hardware tasks operating in conjunction with a set of
software tasks. Last but not least, we are still refining the
server's software design to achieve improved user satisfaction.
References
1. Unofficial BOINC Wiki, "BoincFAQ: Introduction to boinc,"
http://www.boinc-wiki.info/
2. Markus Tervooren, allprojectstats.com,
http://www.allprojectstats.com/
3. S. Kumar, J. Pelzl, G. Pfeiffer, M. Schimmler and C. Paar,
"Breaking ciphers with COPACOBANA, a cost-optimized parallel code
breaker, or how to break DES for 8,980 eur,"
http://www.copacobana.org/paper/CHES2006_copacobana_slides.pdf
4. Frank Opitz, "Development of an FPGA-based distributed computing
platform." Master's thesis, HAW Hamburg, 2011,
http://opus.haw-hamburg.de/volltexte/2012/1450/pdf/Masterarbeit_Frank_Opitz.pdf
5. Ivo Bolsens, "Programming Modern FPGAs,"
http://www.xilinx.com/univ/mpsoc2006keynote.pdf
6. Armin Jeyrani Mamegani, "Implementation and evaluation of methods
for partial and dynamic reconfiguration of SoC-FPGAs." Master's
thesis, HAW Hamburg, 2010,
http://opus.haw-hamburg.de/volltexte/2010/1083/pdf/MA_A_Jeyrani.pdf
7. Edris Sahak, "Partial reconfiguration of an SoC-based
image-processing pipeline." Bachelor's thesis, HAW Hamburg, 2011,
http://opus.haw-hamburg.de/volltexte/2011/1420/pdf/BA_E_Sahak.pdf
Figure 7: PRM processing results for edge detection. Gray-scale input
stimulus image to the PRM is shown at left, while the response from a
PRM with a Sobel/median filter combination is seen at right.
XCELLENCE IN EDUCATION
FPGAs Enable Flexible Platform for High School Robotics
A Xilinx Spartan FPGA forms the basis for a powerful teaching tool
that's able to evolve, reinventing itself according to student needs.
by Giulio Vitale
Professor and FPGA Design Consultant
ITCS Erasmo da Rotterdam, Bollate, [email protected]
There's probably no better vehicle than a robot to get high school
and middle school students hooked on science and technology. For
pupils in grades 8-12, tinkering with a robot is a hands-on way to
grasp new ideas and to see technology in action. Building a robot can
provide a powerful motivation for students to overcome intellectual
challenges and achieve levels of excellence in scientific and
technical subjects.
To that end, I would like to present an educational platform that
teachers can use to support educational robotics activities for
technical high schools. The idea for this platform was born at the
high school where I teach (ITCS Erasmo da Rotterdam, near Milan,
Italy) in the context of building an entry for the RoboCup Junior
competition. Our group participated in the category "rescue robot,"
that is, a machine able to identify victims within a re-created
disaster scenario. These robots must accomplish tasks varying in
complexity from walking a line on a flat surface up to negotiating
paths through obstacles on uneven terrain and saving some
well-defined victims.
The fundamental characteristics of our FPGA-based design, compared
with the commonly used platforms based on different kinds of
microcomputers, are openness, flexibility, capability to evolve and
reusability.
We built the 2011 version of the robot on a Xilinx Spartan-3E device,
using the Digilent Nexys2 educational board. As of this writing, we
are porting this version (which will compete in the next RoboCup
Junior Italy in April 2012) to a Spartan-6 FPGA. The next, 2013
version is scheduled to run on a Zynq-7000 Extensible Processing
Platform device.
Working within this paradigm, we were able to design a rescue robot
that evolved, year after year, from the first prototype to the
current version, named Nessie 2011 (a pun on both Nexys and the Loch
Ness monster, whose long neck is reminiscent of our robot's). The
flexibility of the FPGA allowed a complete remodeling of the robot's
architecture, following the progression of the students' knowledge,
while leaving its basic physical structure substantially unaltered
and maintaining the same design infrastructure.
THE CHALLENGE
The ITCS Erasmo da Rotterdam is a technical high school located in a
suburb of Milan in northern Italy, with a student population that is
strongly heterogeneous and, often, not so easy to coax into deepening
their scientific and technological knowledge.
Starting four years ago, I decided to activate an open space, which I
named the Permanent Laboratory for Didactic Robotics, where students
can experiment in a different way from the standard classroom. Here,
they approach the technical disciplines in an interactive environment
in which pupils can negotiate some unexpected aspects of the subject
matter, make choices regarding the subjects they will pursue,
organize their own jobs and receive direct feedback from the results
of their actions. In other words, they get to experiment in an active
learning space based on the old and well-known model of situated
cognition [1], where students collaborate with one another and with
their instructor while pursuing a common goal with a shared
understanding.
In this learning space, pupils can practice a cognitive
apprenticeship that uses problem-solving methodologies. The teacher
assumes the role of a professional expert who proposes specific
processes involving authentic tasks and strategies, and allows
students to try them independently, coaching only as needed.
Robotics was the natural choice to provide a fertile breeding ground
for the convergence of different disciplines and the exchange of
knowledge. We decided that the fun of taking part in the RoboCup
Junior competition would provide a strong stimulus to incentivize
student participation.
THE SOLUTION
I understood that to be effective, I would have to propose topics
normally covered in regular classes on digital electronics and
informatics, but aimed at more complex applications than those the
students would be able to solve alone. Instead, they would need to
work in groups or with the support of an experienced teacher who
could propose appropriate models.
I knew what to build but I did not know how to build it. Everything
had to be born and developed in the laboratory, with students
discussing the design and seeking solutions together.
After some deliberation, I came to the conclusion that the most
likely solution was one based on a flexible platform, such as an
FPGA, rather than on standard microcomputers. That's because an FPGA
was the only device able to provide the required characteristics, and
to keep pace with the dynamic and evolutionary scope of the
laboratory activities.
I chose, initially, to use an educational card based on the
Spartan-3E because it could provide the necessary characteristics we
were seeking: namely, openness, flexibility, ability to evolve,
reusability of the hardware and richness of performance.
• Openness, because students must actively participate in the entire
design flow, from the sensor interface to the CPU and from this to
the actuators.
• Flexibility, because the complete architecture of the system and
the nature and type of devices should not be fixed in advance, but
must emerge from the research process activated within a creative
context for learning.
• Ability to evolve, because after each RoboCup competition, students
must learn the shortcomings of their work and know how to make the
appropriate modifications to try to reach more advanced solutions.
The system must grow in parallel with the students' expertise.
• Reusability, in order to avoid unnecessary waste of the hardware
and the school budget.
• High performance at an affordable cost. We had to control a large
number of devices and peripherals that were not fully defined, but
needed to operate with a high degree of parallelism. The CPU should
be very powerful but relatively simple in its architecture and easy
to interface.
Figure 1: Thanks to FPGA flexibility, the same platform can evolve in
parallel with student understanding, reusing the same hardware and
redesigning as needed. Photos show Nessie's evolution from 2008
through 2011.
NESSIE 2012: BLOCK DIAGRAM AND DESCRIPTION
We achieved our goal using a Digilent card carrying a Spartan 3E-1200
onboard, which was the common thread of the four-year project
development. As you can see in Figure 1, the rescue robot the
students designed showed a clear evolution from a machine that barely
moved in 2008 to one that, in 2011, made us one of just 15 teams, out
of 65 participants, to reach the finals.
The level of student expertise has grown from year to year, laying a
foundation for further improvements that we are planning for this
year's RoboCup Junior competition in April.
First, we have moved from the Spartan-3E to the Spartan-6 family,
converting the bus infrastructure of the standard Processor Local Bus
to the AXI4 interface. Second, we have modified some critical sensors
for tracking of the reference line and redesigned the motor
interface, porting a PID algorithm for automatic speed control
directly into the FPGA fabric.
Figure 2 illustrates the complete block diagram of the system, as it
appears in the current design. Looking at it, you can clearly see the
richness of didactic topics that it enables a teacher to cover in a
course on digital control systems. Equally clear is the high level of
parallelism that the system can obtain in terms of performance,
compared with a robotic platform built on a standard microcomputer.
Moreover, the activity in the robotics laboratory has had an
interesting impact on the ordinary educational courses at our school.
The FPGA has become a tool for rapid and effective implementation of
the theoretical aspects of technology, sparking students' keen
interest in the topics covered.
MANAGING NESSIE'S WALKING PROBLEM
Armed with the rich supply of resources available in the Spartan-6
[Figure 2 block diagram labels: MicroBlaze; instruction and data
local memory (BRAM); MDM debug module; external memory controller;
LED GPIO; GPIO IP (two instances); UART Lite IP; lightness sensor
interface IP; accelerometer interface IP; graphic display interface;
grip control IP; ultrasonic sensor interface IP; infrared sensor
interface IP; line follower IP; PID motor control]