magazine - Xilinx · 2019-10-17


INSIDE

FPGAs for DSP: A Fast-Growing Market

The Role of FPGAs in Digital Radio Subsystems

Rapid Prototyping and Verification of MIMO Systems

Accelerate Video DSP Co-Processing Designs

Implementing Optimal Filters Quickly

Floating- to Fixed-Point Conversion of MATLAB Algorithms Targeting FPGAs


Issue 2, May 2006

Optimizing DSP System Designs

DSP magazine

SOLUTIONS FOR HIGH-PERFORMANCE SIGNAL PROCESSING DESIGNS

Enabling success from the center of technology™

1 800 332 8638

em.avnet.com

© Avnet, Inc. 2006. All rights reserved. AVNET is a registered trademark of Avnet, Inc.

Avnet Electronics Marketing designs, manufactures, sells and supports a wide variety of hardware evaluation, development and reference design kits for developers looking to get a quick start on a new project.

With a focus on embedded processing, communications and networking applications, this growing set of modular hardware kits allows users to evaluate, experiment, benchmark, prototype, test and even deploy complete designs for field trial.

Gain hands-on experience with these design kits and other development tools by participating in a SpeedWay Design Workshop™ this spring.

For a complete listing of available boards, visit www.avnetavenue.com

For more information about upcoming SpeedWay workshops, visit www.em.avnet.com/speedway

Support Across The Board.™

Design Kits Fuel Feature-Rich Applications

Build your own system by mixing and matching:

• Processors

• FPGAs

• Memory

• Networking

• Audio

• Video

• Mass storage

• Bus interface

• High-speed serial interface

Available add-ons:

• Software

• Firmware

• Drivers

• Third-party development tools


High-Performance DSP – Executing to Plan

Welcome to the second edition of Xilinx® DSP Magazine. In the last issue we outlined five pillars that underline our vision in DSP: market focus, design methodology, tailored solutions, ecosystem, and awareness. Since publishing that issue, we have delivered many exciting tailored solutions, such as starter and co-processing kits for video and imaging and JTRS development platforms for software-defined radio.

With the acquisition of AccelChip and its MATLAB-to-RTL synthesis tools, we have also increased our investment in providing you with the most capable and easiest to use design methodology solutions. We are also continuing to work with our core DSP partners like Texas Instruments and The MathWorks to deliver complementary solutions, as unveiled in our recent Serial RapidIO interoperability announcement at TI’s Developer Conference in February.

For this second edition of DSP Magazine we welcome DSP industry icon Will Strauss of Forward Concepts with snippets from his latest DSP industry research report. His insights regarding shifts in the DSP industry are provided in his article, “FPGAs for DSP: A Fast-Growing Market.” In addition, our partners Avnet, Lyrtech, The MathWorks, and Nuvation highlight their latest innovations for our XtremeDSP™ platforms. Our own experts provide tutorials on implementing floating-point DSP, optimizing filter design, and achieving high-bandwidth simulations, among others.

I’m sure you’ll find our second edition of DSP Magazine informative and inspiring as we endeavor to help you unlock the full capabilities of Xilinx reconfigurable signal processing. Enjoy the read!

Omid Tahernia

Vice President and General Manager

Xilinx DSP Division

Xilinx, Inc.

2100 Logic Drive

San Jose, CA 95124-3400

Phone: 408-559-7778

FAX: 408-879-4780

www.xilinx.com/xcell/

© 2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. PowerPC is a trademark of IBM, Inc. All other trademarks are the property of their respective owners.

The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.

PUBLISHER Forrest [email protected]

EDITOR Charmaine Cooper Hussain

ART DIRECTOR Scott Blair

ADVERTISING SALES Dan Teie1-800-493-5551

TECHNICAL COORDINATOR Narinder Lall

INTERNATIONAL Dickson Seow, Asia [email protected]

Andrea Barnard, Europe/Middle East/[email protected]

Yumi Homura, [email protected]

www.xilinx.com/xcell/


DaVinci™ Technology makes astounding creativity possible in digital video devices for the hand, home and car. The DaVinci platform includes digital signal processor (DSP) based SoCs, multimedia codecs, application programming interfaces, application frameworks and development tools, all of which are optimized to enable innovation for digital video systems. DaVinci products will save OEMs months of development time and will lower overall system costs to inspire digital video innovation. So what are you waiting for? You bring the possibilities. DaVinci will help make them real.

Portable Media Player IP Set-Top Box Automotive Infotainment Digital Still Camera Digital Video Innovations Video Surveillance Video Phone & Conferencing

What is DaVinci?

Now that DaVinci products are here, your digital video innovations are everywhere. That’s the DaVinci Effect.

Processors: Digital Video SoCs:

- TMS320DM6446 – Video encode/decode

- TMS320DM6443 – Video decode

Tools: Validated Software and Hardware Development

- DVEVM (Digital Video Evaluation Module)

- MontaVista Development Tools

- Code Composer Studio IDE

IP SET-TOP BOX: Stream and record any format video from anywhere onto your TV.

SPEED VIDEO DESIGN: TI’s digital video framework simplifies development. Digital video evaluation module allows for rapid prototyping of new designs. Program the SoC via industry recognized APIs.

PORTABLE MEDIA PLAYER: Video on the go – playing on the TV, in the car or in your hands.

DIGITAL STILL CAMERA: Crops photographs, cleans up pictures and records memories.

VIDEO SURVEILLANCE: Intelligent system notifies you when someone approaches and instantly emails you a photo.

DaVinci, Code Composer Studio IDE, Technology for Innovators and the red/black banner are trademarks of Texas Instruments. 1321A0 © 2006 TI

Software: Open, Optimized and Production Tested

- Platform Support Package

- MontaVista Linux Support Package

- Industry-recognized APIs

- Multimedia frameworks

- Platform-optimized, multimedia codecs: H.264, MPEG4, H.263, MPEG2, JPEG, AAC+, AAC, WMA9, MP3, G.711, G.728, G.723.1, G.729ab, WMV9/VC1

>>> For complete technical documentation or to get started with our Digital Video Evaluation Module, please visit www.thedavincieffect.com

Performance Benchmarks:

STANDALONE CODECS               DM6446                          DM6443
MPEG-2 MP ML Decode             1080i+ (60 fields/30 frames)    720p+
MPEG-2 MP ML Encode             D1+                             n/a
MPEG-4 SP Decode                720p+                           720p+
MPEG-4 SP Encode                720p+                           n/a
VC1/WMV 9 Decode                720p+                           720p+
VC1/WMV 9 Encode                D1+                             n/a
H.264 (Baseline) Decode         D1+                             D1+
H.264 (Baseline) Encode         D1+                             n/a
H.264 (Main Profile) Decode     D1+                             D1+

+ denotes available processor headroom for analytics and/or other features

CONTENTS

DSP MAGAZINE ISSUE 2, MAY 2006

Welcome

BUSINESS VIEWPOINT

FPGAs for DSP: A Fast-Growing Market

MULTIMEDIA, VIDEO, and IMAGING

Implementing Bluetooth CVSD Codec on an FPGA

Developing Video IP in a Fully Integrated Design Environment

Accelerate Video DSP Co-Processing Designs

WIRELESS

The Role of FPGAs in Digital Radio Subsystems

Rapid Prototyping and Verification of MIMO Systems

AEROSPACE AND DEFENSE

Making the Adaptivity of SDR and Cognitive Radio Affordable

The Design of an FPGA-Based MIMO Transceiver for Wi-Fi

Floating- to Fixed-Point Conversion of MATLAB Algorithms Targeting FPGAs

CUSTOMER SUCCESS

BAE Systems Proves the Advantages of Model-Based Design

GENERAL PURPOSE

Achieving High-Bandwidth DSP Simulations Using Ethernet Hardware-in-the-Loop

Hardware DSP Analysis Techniques Using the Z-Transform

Implementing Optimal Filters Quickly

Model-Based Design

Accelerating FFTs in Hardware Using a MicroBlaze Processor

PRODUCTS

Virtex-4 SX 35 XtremeDSP Development Kit for Digital Communication Applications

EDUCATION

DSP Design Flow

DSP Implementation Techniques for Xilinx FPGAs

Designing with Multi-Gigabit Serial I/O

by Will Strauss
President
Forward [email protected]

FPGAs employed for DSP have passed the half-billion dollar mark; in fact, that market segment is growing faster than the larger and more mature DSP chip market. The reasons are varied, but performance is the prime driver, as FPGAs easily outdistance conventional DSP chips in maximum bandwidth and the number of communication channels or video streams that can be processed simultaneously.

As FPGAs have become more powerful and cheaper through advanced CMOS processing, stand-alone FPGA DSP solutions are becoming practical. In a recent survey of more than 300 DSP professionals from 30 countries, Forward Concepts asked, “Which chip types are employed for DSP algorithm execution (rather than data processing) in your applications”? The results comparing DSPs and FPGAs in Figure 1 clearly show that FPGAs have an increasing and varied role in DSP.

As expected, the general-purpose (GP) fixed-point DSP garnered the most mentions, followed by GP floating-point DSPs. But significantly, stand-alone FPGAs for DSP showed strength in the number of responses garnered, equaling the number of responses for FPGAs paired with DSPs as an accelerator. Surprisingly, FPGAs paired with RISCs also showed significant strength. This provides an appropriate segue to our next chart.

We also asked our survey participants, “If a RISC core is employed (for any purpose) in your DSP application, which brand(s) is (are) used or being strongly considered for your next design”? As expected, of those respondents employing a RISC core, ARM, Ltd. topped the responses. Unexpected, though, was the strong response for soft RISC cores, with Xilinx® MicroBlaze™ and PicoBlaze™ processors ranking a clear second to ARM in popularity, as indicated in Figure 2.

Although we did not explicitly ask if a soft RISC was employed in the “stand-alone FPGAs for DSP” in Figure 1, we can conclude that is probably the case for such implementations. Considering that the soft RISC is more concerned with control code and that the FPGA array can be significantly devoted to DSP functions, the pairing makes a lot of sense. Moreover, our informal interviews after the survey was performed (Q4/05) revealed that there are a number of FPGA implementations employing multiple soft RISC cores, with MicroBlaze processors mentioned most often.

It is clear that FPGAs have a strong play in the world of DSP and Xilinx is aggressively providing new products and application support to meet increasing market demand for ever-higher performance DSP products in communications and professional multimedia.

The charts in this article are excerpts from Forward Concepts’ new 322-page market study, “DSP Strategies: Embedded Chip Trend Continues” (www.fwdconcepts.com).


6 DSP magazine May 2006

A survey by Forward Concepts validates the increasing role of FPGAs in DSP applications.

Figure 1 – FPGAs exhibit growing roles in DSP. (Bar chart titled “DSPs and FPGAs Employed for DSP”: number of responses, multiple allowed, on an axis from 0 to 160, for GP Fixed-Point DSP, GP Floating-Point DSP, Stand-Alone FPGA for DSP, DSP w/FPGA Accelerator, and RISC w/FPGA DSP Engine. Source: Forward Concepts.)

Figure 2 – FPGA soft RISCs prove popular in DSP applications. (Bar chart titled “RISC Core Used in DSP Application (For Any Purpose)”: responses, multiple allowed, on an axis with ticks at 1%, 10%, and 100%, for ARM, Xilinx (Micro/PicoBlaze), PowerPC-Freescale, Altera (Nios/Nios II), Proprietary RISC Core, PowerPC-IBM, Intel XScale (IP), MIPS, Freescale (ColdFire), Tensilica, ARC, and Other. Source: Forward Concepts.)

BUSINESS VIEWPOINTS

Implementing Bluetooth CVSD Codec on an FPGA

Using a Xilinx MAC FIR filter core reduces the design time.

by Chirag Vishwas Vichare
Senior Engineer (VLSI)
MindTree Consulting Pvt. [email protected]

Over the last few years, FPGAs have evolved to include significant DSP-centric enhancements in their architectures. These enhancements have enabled FPGAs to support many complex DSP applications in domains such as telecommunications (base station signal processing, radar signal processing), multimedia processing (video processing, audio signal processing), and other application areas. However, implementing such complex systems in FPGAs can be quite time-consuming.

In most of these DSP systems, the algorithms used were developed for processor-based systems. Translating these algorithms into hardware to achieve equal or better performance can take months, compared to a few weeks on DSP processors. This is largely attributed to the mature development tools available to DSP engineers working with DSP processors such as those from TI or Analog Devices. Usually these tools provide optimized macros (in high-level language or assembly language) for many DSP algorithms, and these macros are a major contributor in reducing system development time.

Including More System-Level IP Cores

To achieve similar results in terms of development cost, time, and performance when implementing such complex DSP systems in hardware, it is not enough to have only primitive macros such as adders and multipliers available in a hardware designer’s library. The hardware DSP design flow should incorporate more and more system-level macros for sub-blocks such as encoders, decoders, FFTs, DCTs, trigonometric functions, sample rate converters, and FIR filters, which are optimal implementations in terms of area, memory, or speed and already proven in hardware. This can reduce development time drastically, as the only task is to integrate these IP cores into the system architecture and verify the functionality of the overall system. Even if your goal is to eventually develop your own macros and you are using the FPGA only as an ASIC prototype or demonstration platform, using these macros to validate your design can give you the opportunity and time to explore the design space in terms of area and performance (speed and power).

To illustrate the methodology, let’s use as an example a MindTree Bluetooth CVSD codec implemented on a Xilinx® Virtex™-II FPGA, which uses a CORE Generator™ software MAC FIR filter as a sub-block for interpolation and decimation filtering.

The aim here was to validate the continuous variable slope delta modulation (CVSD) codec on a Virtex-II-based ASIC prototype platform for MindTree’s Bluetooth baseband.
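As a rough illustration of the interpolation filtering that a FIR core performs in a design like this, the Python sketch below up-samples by 8 (8 kHz to 64 kHz) by zero-stuffing and then low-pass filtering. The windowed-sinc design, the 65-tap length, the Hamming window, and all function names are assumptions chosen for illustration; they are not the coefficients or the architecture of the MindTree design or of the Xilinx MAC FIR core.

```python
import math


def lowpass_taps(num_taps, cutoff):
    """Windowed-sinc low-pass FIR taps; cutoff is a fraction of the sample rate."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2.0 * cutoff
        else:
            ideal = math.sin(2.0 * math.pi * cutoff * x) / (math.pi * x)
        # Hamming window (an assumed choice) to limit stopband ripple.
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC


def interpolate(samples, factor, taps):
    """Zero-stuff by `factor`, then apply the FIR filter (direct form)."""
    stuffed = []
    for s in samples:
        stuffed.append(s * factor)  # gain restores the level lost to the zeros
        stuffed.extend([0.0] * (factor - 1))
    out = []
    for i in range(len(stuffed)):
        acc = 0.0
        for k, h in enumerate(taps):
            if i - k >= 0:
                acc += h * stuffed[i - k]
        out.append(acc)
    return out


FACTOR = 8  # 8 kHz -> 64 kHz
TAPS = lowpass_taps(num_taps=65, cutoff=0.5 / FACTOR)  # 4 kHz cutoff at 64 kHz
upsampled = interpolate([1000.0] * 32, FACTOR, TAPS)
```

A decimation filter for the decoder path would use the same kind of low-pass taps, filtering at 64 kHz first and then keeping every eighth output sample.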

CVSD Codec Validation

Bluetooth technology has reached a stage of maturity and is being extensively integrated in many devices such as mobile phones, PDAs, laptops, and GPS receivers. One of the major applications for Bluetooth in these devices is as a carrier of voice data over the synchronous connection-oriented (SCO) logical transport. The Bluetooth standard specifies a 64 kbps log PCM (pulse code modulation) format (A-law or µ-law) or a 64 kbps CVSD format for the on-the-air interface. CVSD encoding is considered the more robust format for voice-over-the-air interfaces. However, CVSD is a complex technique, which involves adaptive delta pulse-code modulation, where only two levels are used to represent the differential in amplitude (delta). The CVSD encoder/decoder processes 16-bit samples (as an encoder) and single-bit symbols at 64 kHz (as a decoder).

Figure 1 shows the block diagram of the Bluetooth CVSD codec.

The CVSD codec requires an interpolation filter in the encoder path to up-sample the 8 kHz samples from the voice codec to 64 kHz, which are given as an input to the CVSD encoder. Similarly, the 64 kHz output of the decoder must be down-sampled to 8 kHz using a low-pass filter with negligible spectral power density (above 4 kHz) before being played back with the 8 kHz voice codec.

After verifying the CVSD encoder/decoder block independently with the golden reference model, I took the approach of validating the CVSD codec on the Virtex-II FPGA platform, with the Xilinx FIR filter core used as an interpolation and decimation filter.

Integrating these filters into the MindTree Bluetooth baseband core was a fairly easy task, as the MAC FIR I/O interface is clearly defined in Xilinx CORE Generator software.

The only task was then to generate filter coefficients for the interpolation and decimation filters from SCILAB (a free scientific software package for numerical computations), and pass them to Xilinx CORE Generator software to produce an .edn file for the filters, which was included during place and route.

Using a Xilinx MAC FIR filter core helped in exploring the optimal set of parameters (filter length, coefficients, word-length precision) for these filters without compromising CVSD voice quality. It also helped in terms of reduced design time by avoiding the design iterations to achieve the desired voice quality. This is largely because the design had already been proven in hardware much earlier in the design cycle. It also helped in hardware optimization during actual implementation of the interpolation and decimation filters.

An Alternative Approach

CORE Generator software also provides a behavioral model for all of its IP cores. You can integrate these models into the system to be developed during the verification stage, which can aid in analyzing the performance and fine-tuning the design parameters.

Conclusion

Today’s FPGAs are capable of implementing most high-performance complex DSP algorithms efficiently. However, achieving your desired performance goals and meeting tight time-to-market constraints requires a change from conventional DSP design flows.

Incorporating pre-verified and, more often than not, optimized system-level IP cores at various stages in the DSP design flow can significantly help in reducing development time and achieving your desired performance objectives through design space exploration.

For more information on CORE Generator software, visit www.xilinx.com/xlnx/xebiz/designResources/ip_product_details.jsp?key=dr_dt_coregenerator or www.xilinx.com/ise/products/coregen_overview.pdf.
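To make the CVSD scheme described in this article more concrete, here is a minimal Python sketch of a one-bit adaptive delta encoder and its matching decoder. It is an assumption-laden illustration, not the MindTree implementation: the step limits, decay factors, run length of four bits, bit polarity, and all names are chosen for the example and should be checked against the Bluetooth specification before any real use.

```python
class CvsdState:
    """Predictor shared by the encoder and the decoder (illustrative only)."""

    def __init__(self, step_min=10.0, step_max=1280.0, run_length=4,
                 step_decay=1.0 - 1.0 / 1024, acc_decay=1.0 - 1.0 / 32):
        # All parameter values here are assumptions made for this sketch.
        self.step_min = step_min
        self.step_max = step_max
        self.run_length = run_length
        self.step_decay = step_decay
        self.acc_decay = acc_decay
        self.step = step_min
        self.acc = 0.0
        self.history = []

    def update(self, bit):
        """Consume one bit and return the updated signal estimate."""
        self.history = (self.history + [bit])[-self.run_length:]
        is_run = (len(self.history) == self.run_length
                  and len(set(self.history)) == 1)
        if is_run:
            # A run of equal bits means the estimate is lagging: grow the step.
            self.step = min(self.step + self.step_min, self.step_max)
        else:
            # Otherwise let the step decay slowly toward its minimum.
            self.step = max(self.step * self.step_decay, self.step_min)
        delta = self.step if bit == 1 else -self.step
        acc = self.acc * self.acc_decay + delta
        self.acc = max(-32768.0, min(32767.0, acc))  # saturate to 16-bit range
        return self.acc


def cvsd_encode(samples):
    """Map 16-bit samples (already at 64 kHz) to one bit per sample."""
    state = CvsdState()
    bits = []
    for sample in samples:
        bit = 1 if sample >= state.acc else 0  # two levels: up or down
        bits.append(bit)
        state.update(bit)
    return bits


def cvsd_decode(bits):
    """Rebuild the signal estimate from the bit stream."""
    state = CvsdState()
    return [state.update(bit) for bit in bits]
```

Because the decoder runs the same predictor on the same bits, it reproduces the estimate the encoder compared against, which is why only one bit per sample has to cross the air interface.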


Figure 1 – Bluetooth CVSD codec block diagram. (Encoder path: 16-bit linear data from the PCM codec every 8 kHz, up sampler (Xilinx MAC FIR core), CVSD encoder, Bluetooth baseband TX FIFO. Decoder path: Bluetooth baseband RX FIFO, CVSD decoder, down sampler (Xilinx MAC FIR core), 16-bit linear data to the PCM codec every 8 kHz.)

by Sabine Lam DSP Technical Marketing Engineer Xilinx, [email protected]

Often the implementation of video processing systems requires support for various video and audio standards and involves converting signals from one standard to another. Multimedia applications require processing signals at video rates, which means that simulation should run in real time during the development process.

Typical video processing systems use a microprocessor to control a video pipeline comprising a video source and sink, a large memory for storage of video data, and a video processing system (Figure 1).

As you implement and debug the various video algorithms, you will need to verify the functionality through software and hardware simulation. Simulation of video-processing applications creates special challenges given the real-time nature of video streams and the enormous amount of video data required per frame.

Developing Video IP in a Fully Integrated Design Environment

The Video Starter Kit is an ideal prototyping platform for multimedia, video, and imaging.

Design Environment

The Video Starter Kit (VSK) enables rapid development and debugging of high-performance video processing systems for a wide range of video applications. The VSK is powered by the Xilinx® Virtex™-4 XC4VSX35 device, which is optimized for DSP processing thanks to the high ratio of multiply accumulate blocks (also known as DSP48) in the fabric and supported by a rich feature set of video interfaces such as DVI, VGA, component (HD), composite, S-Video, and SDI.

Typically, developing video algorithms requires hardware to prove the video operation on real-time data streams, as well as a simulation environment to develop and test the video processing components. The VSK provides both software simulation and real-time operation for each of the components in a video system, allowing you to develop video IP (including filters, video blocksets, accelerators, and video interface conversion) or end applications such as codecs, image enhancement, dynamic gamma correction, and motion estimation. The integration with the tool kit and the I/O diversity makes it fast and easy to get video onto the board and optimize algorithms to operate on it.

Also provided with the VSK are reference designs, some in HDL and others modeled in the Xilinx System Generator for DSP design environment. To abstract away the complexity of bringing data in through the various video interfaces and sending them down to the Virtex-4 device, a library of video interface blocksets is included, all controlled through a MicroBlaze™ controller.

To highlight some of the VSK capabilities, I’ll describe the MPEG-4 Part 2 decoder demonstration design.

MPEG-4 Part 2

The MPEG-4 decoder demonstration system comprises an FPGA hardware evaluation platform, Xilinx IP cores, and embedded software operating together to perform video decompression on industry-standard encoded video bitstreams.

For this design, the FPGA is programmed to perform the decompression and drive the video display. A Compact Flash card holds several compressed video streams and the FPGA configuration bitstreams. An embedded processor within the FPGA reads the bitstream out of the Compact Flash card, writes it to an external DDR memory, and sends it to the MPEG-4 Part 2 decoder. The output from the decoder is then reformatted to the video standard of your choice for display on an external monitor through the video I/O daughtercard.

An overview of the system is shown in Figure 2. The MPEG-4 decoder core, DDR memory controller, color space converter, VGA interface, macroblock format converter, and MicroBlaze soft-core processor and associated peripherals are implemented in the XC4VSX35 FPGA. The ZBT memory, DDR memories, System ACE™ technology, Compact Flash connector, two-line LCD display, and a digital-to-analog converter are located on the hardware platform.

Figure 1 – Video system diagram. (Blocks: video source, video function, video sink, video memory, microprocessor, and host interface, with video and control paths.)

Figure 2 – MPEG-4 design overview. (Board: Xilinx ML402. Blocks: display, system monitor port (to PC), video DAC, buffered VGA interface, memory controller with DDR interface, DDR memory, ZBT memory, System ACE, LCD, LCD driver, MicroBlaze soft core processor, UART, FIFOs, System ACE interface with DMA, display controller, and MPEG4 decoder.)

Embedded Processor

Video systems often require a control processor. The processor is typically used to communicate with a host system, set up video processing operations, compute coefficients, and generally operate as a low-rate data processor.

In the MPEG-4 demonstration design, the embedded MicroBlaze processor operates as the overall system-level controller, handling such functions as the user interface, reading compressed bitstreams from Compact Flash, transmitting the bitstream into the MPEG-4 decoder core, and monitoring all system status flags.

With Xilinx System Generator for DSP, the design flow for incorporating a MicroBlaze processor into the framework is greatly simplified. You can use Xilinx System Generator and the Embedded Development Kit (EDK) software tools together to implement and simulate a system with a processor and FPGA video processor functions operating on live video streams. System Generator automatically generates software drivers to read and write data to the System Generator design.

Two methodologies are currently supported to integrate a MicroBlaze controller:

• A System Generator design exported into an EDK system. When used in pcore (processor core) export mode, the memory map block and all other blocks are packaged into a pcore peripheral. Software drivers and documentation for the memory-map interface are also generated and delivered with the peripheral.

• An EDK project imported into a System Generator design for hardware co-simulation. When used in EDK import mode, an EDK project file is imported into System Generator by running the EDK import wizard. When the import wizard is completed, the EDK system is pulled into the System Generator design as a black box. During the import process, the EDK system is augmented with Fast Simplex Link (FSL) interfaces that communicate with the memory map.

Hardware Co-Simulation

Viewing the resulting output video is an important quality measurement metric for all video systems. The video standard input and output sources featured on the VSK, coupled with System Generator hardware co-simulation capability, allow you to quickly test and debug your system with real-time video streams.

System Generator provides hardware co-simulation interfaces that make it possible to compile a System Generator diagram into an FPGA bitstream and associate this bitstream with a new run-time hardware co-simulation block. When the design is simulated in Simulink, results for the compiled portion are calculated in hardware instead of software.

System Generator provides high-speed hardware co-simulation interfaces that allow the full contents of a Simulink vector or matrix signal to be read from or written to FPGA hardware in a single transaction. By using these interfaces, you can significantly reduce the number of PC/hardware transactions during simulation and further accelerate simulation speeds beyond what is traditionally possible with hardware co-simulation.
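The effect of vector and matrix transfers on the transaction count can be seen with simple arithmetic. The Python sketch below counts PC/hardware transactions per frame for three transfer granularities; the 720 x 480 frame size and the function name are assumptions made for illustration, not figures taken from System Generator.

```python
def transactions_per_frame(width, height, samples_per_transaction):
    """Number of PC/hardware transactions needed to move one frame."""
    total_samples = width * height
    # Ceiling division without importing math.
    return -(-total_samples // samples_per_transaction)


WIDTH, HEIGHT = 720, 480  # hypothetical frame size

per_sample = transactions_per_frame(WIDTH, HEIGHT, 1)               # scalar signal
per_line = transactions_per_frame(WIDTH, HEIGHT, WIDTH)             # vector signal
per_frame = transactions_per_frame(WIDTH, HEIGHT, WIDTH * HEIGHT)   # matrix signal
```

Under these assumptions, moving a whole frame as one matrix signal replaces several hundred thousand per-sample transactions with a single one, which is where the simulation speed-up comes from.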

By taking advantage of the ubiquity and advancement of Ethernet technologies, the interface facilitates a convenient and high-bandwidth co-simulation to an external FPGA device.

The VSK supports two Ethernet co-simulation modes:

• The network-based Ethernet hardware co-simulation interface provides co-simulation access to an FPGA platform over an IPv4 network infrastructure. Because IPv4 networks are widespread, the interface provides a straightforward way to communicate with remote hardware connected to either a wired or wireless network. This interface is ideal in situations where the FPGA platform is remote (such as across the office or across the country) or when multiple designers must share a single development board. The network-based Ethernet interface supports operations in 10/100 Mbps half/full duplex modes.

• The point-to-point Ethernet hardware co-simulation provides a co-simulation interface using a raw Ethernet connection. The raw Ethernet connection refers to a Layer 2 (data link layer) Ethernet connection, between a supported FPGA development board and a host PC, with no routing network equipment along the path. The point-to-point Ethernet interface supports operations in 10/100/1000 Mbps half/full duplex modes. Jumbo frames are also supported on a Gigabit Ethernet connection, as long as it is enabled by the underlying connection.
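A back-of-the-envelope calculation shows why link speed and jumbo frames matter for the two modes above. The Python sketch below estimates the raw transfer time for one video frame and the number of Ethernet frames needed to carry it. The 720 x 480 frame at 2 bytes per pixel, the 1500-byte and 9000-byte payload sizes, and the function names are assumptions for illustration, and protocol overhead is ignored.

```python
def frame_transfer_ms(width, height, bytes_per_pixel, link_mbps):
    """Raw time in milliseconds to move one video frame, ignoring overhead."""
    bits = width * height * bytes_per_pixel * 8
    return bits / (link_mbps * 1e6) * 1e3


def ethernet_frames_needed(total_bytes, payload_bytes):
    """Number of Ethernet frames for a given payload size (ceiling division)."""
    return -(-total_bytes // payload_bytes)


FRAME_BYTES = 720 * 480 * 2  # hypothetical 16-bit-per-pixel frame

fast_ethernet_ms = frame_transfer_ms(720, 480, 2, 100)    # 100 Mbps link
gigabit_ms = frame_transfer_ms(720, 480, 2, 1000)         # 1000 Mbps link
standard_frames = ethernet_frames_needed(FRAME_BYTES, 1500)  # assumed payload
jumbo_frames = ethernet_frames_needed(FRAME_BYTES, 9000)     # assumed payload
```

With these assumed numbers, a Gigabit link moves the frame roughly ten times faster than a 100 Mbps link, and jumbo frames cut the number of Ethernet frames per video frame by about a factor of six.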

Conclusion

With this complete and easy-to-use solution, the Video Starter Kit is the ideal hardware platform to evaluate Xilinx FPGAs in a wide range of video and imaging applications. Fully integrated and supported by the Xilinx System Generator for DSP software, the VSK takes advantage of the new high-speed Ethernet hardware co-simulation capability and enables system integration, development and verification of codecs, IP, and video algorithms in real time.

The VSK comprises software, hardware, a camera, cables, a detailed user guide, and reference designs. It includes a limited edition of System Generator for DSP, ISE™ software, and Embedded Development Kit (EDK) FPGA design tools, as well as a Xilinx ML402-SX35 development board, video I/O daughtercard (VIODC), CMOS image sensor camera, power supply, and cables.

For more information, see the VSK User Guide at www.xilinx.com/bvdocs/userguides/ug217.pdf, or, for the MPEG-4 demonstration design, www.xilinx.com/bvdocs/userguides/ug234.pdf.



by Chris Hallahan VP Sales & Marketing [email protected]

Did you know that you can evaluate, test, develop, and benchmark custom video processing applications utilizing an appropriate mixture of FPGA and DSP processing? The new Xilinx® Video Virtual Socket Adapter (VSA) is designed to accelerate FPGA/DSP video co-processing development.

The VSA is a plug-and-play system comprising a Spectrum Digital DM642 EVM DSP evaluation board; a Spectrum Digital XEVM642 daughtercard; a VHDL “virtual socket” framework for the Virtex™-4 SX FPGA; DSP firmware; a System Generator for DSP demo module featuring a two-dimensional 5 x 5 Video FIR filter; user guide; application notes; and a PC-based network streaming MJPEG video player. With the bundled demo system, you instantly get the infrastructure to rapidly prototype your video applications to accelerate FPGA/DSP co-processing product development. The system was developed by Nuvation, a Xilinx Alliance Program design services firm, and is being distributed by Xilinx.
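For readers unfamiliar with a two-dimensional 5 x 5 FIR filter such as the one in the demo module, the Python sketch below shows the arithmetic on a small grayscale image. It is a software illustration only: the box-blur kernel, the edge-clamping policy, and the function name are assumptions, and nothing here reflects how the System Generator demo module is actually built.

```python
def fir2d_5x5(image, kernel):
    """Apply a 5 x 5 FIR kernel to a 2-D list of pixel values.

    Pixels outside the image are replaced by the nearest edge pixel
    (an assumed boundary policy). The kernel is applied without flipping,
    which is equivalent to convolution for symmetric kernels.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(5):
                for kx in range(5):
                    sy = min(max(y + ky - 2, 0), height - 1)
                    sx = min(max(x + kx - 2, 0), width - 1)
                    acc += kernel[ky][kx] * image[sy][sx]
            out[y][x] = acc
    return out


# Hypothetical smoothing kernel: 25 equal weights that sum to 1.
BOX_KERNEL = [[1.0 / 25.0] * 5 for _ in range(5)]

flat = [[100.0] * 8 for _ in range(6)]
smoothed = fir2d_5x5(flat, BOX_KERNEL)
```

Each output pixel needs 25 multiply-accumulate operations, which is the kind of regular, parallel workload that maps naturally onto FPGA multiply-accumulate blocks.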



Design, develop, and test your algorithms with the Video Virtual Socket Adapter.

The VSA System

Figure 1 shows a block diagram of the VSA system. Spectrum Digital’s DM642 EVM showcases TI’s DM642 digital media processor (TMS320DM642). On-board components include 32 MB SDRAM, 4 MB Linear Flash, two video decoders, one video encoder, two S-Video/composite video inputs, one S-Video/composite/VGA output, 10/100 Ethernet PHY, mic and headphone jacks, and an off-board connector driven through the DM642’s EMIF interface. You can develop DSP firmware on the DM642 EVM with TI’s Code Composer Studio and a JTAG emulator.

XEVMDM642 Virtex-4 Daughtercard

Spectrum Digital’s XEVM642 is a Virtex-4 SX35-based daughtercard that plugs into the DM642 EVM. In addition to the Virtex-4 device, the XEVM has memory, clocks, a JTAG port, and a Compact Flash card socket. From video algorithm acceleration and data compression filters to custom logic, the Virtex-4 FPGA is easily programmable with Xilinx System Generator for DSP and ISE™ software.

VHDL and Firmware

The Video VSA includes a set of VHDL modules and the firmware to directly control them (illustrated in Figure 2). These modules include a generic user logic module (the function that fits into the “virtual socket”), a video input module, a test pattern generator, a 2:1 video switch, a video output module, and a host interface module. All of these modules – and the firmware that controls them – are reusable.

The Video VSA modules are connected together in a way that provides a “virtual socket” where the user function can reside. Any appropriate functionality and implementation approach of the user function is possible; the surrounding Video VSA modules are connected together to form the infrastructure that allows you to focus your design effort on the user function.

The demo firmware comprises four main components, shown in Figure 3 as shaded blocks within the overall demo firmware framework.

Figure 1 – Block diagram of EVM642 and XEVM system. (Blocks: DM642 EVM with TI DM642, SDRAM, two video decoders, video encoder, and Ethernet PHY; XEVMDM642 with Xilinx XC4VSX35; PC-based stream player. Note: Shaded components are not utilized in this demo or by the current implementation of the Video VSA.)

Figure 2 – Video VSA system block diagram. (Blocks: video input, TPG, frame sync, user mode, video output, host interface, and demo user function (5 x 5 filter); layers: VSA firmware, demo firmware, demo software.)

Video Conversion Pipeline

This component comprises two independently running pipelined tasks. The first receives a video stream from the FPGA video output port through a DM642 video input port and converts it to YUV420 format. The second compresses it into an in-memory JPEG image for use by the MJPEG player server and HTTP server components. It passes new frames to the MJPEG player server when they are available.

MJPEG Player Server

This task awaits incoming connections from the PC-based MJPEG player client and streams out newly captured JPEG images while the connection is active. This task receives notification from the video conversion pipeline when a new JPEG image is available.

Demo Control

The demo control component is a task that polls the video lock status and the video standard for changes. This task re-initializes the video conversion pipeline when video lock has been restored. It also handles 525/625 video standard changes.

During the same polling loop, the demo’s processing window position is updated (if movement is enabled). This implements the window’s bouncing behavior.

The remainder of the demo control component is a series of demo-specific functions responsible for controlling the following demo settings:

• Processing window position, size, enabled status, and auto move

• Filter kernel (including chroma bypass)

• Video source selection (test pattern or live video)

Web Server

The Web server is configured to use a standard HTTP port and offers the following content/services:

• The current video frame is made available as a JPEG file

• A JavaScript-based player/control console for the demo is available as the default web page on the server; this page also loads two static logo images from the Web server

• Three dynamic CGI scripts process incoming HTTP POST configuration change requests used by both the MJPEG player and the default Web-based player to control the demo

• A dynamic website for running automated tests on the host interface

VSA Demo Application

The system includes a filter application that demonstrates one of many possible functions that you can implement in the VSA. The function is a two-dimensional 5 x 5 FIR filter with a configurable rectangular processing window: video samples inside the window are filtered, while all other samples pass through unmodified. The position and size of the processing window can be changed without interrupting the video stream. Figure 4 shows a snapshot of a live streaming video display featuring an edge-detect filter kernel.

The filter coefficients are represented in 16-bit signed 2’s complement fixed-point format, allowing implementations of high-precision video filters with gain. The coefficients are loadable at runtime and are designed to engage without disturbing the video stream. The resulting video is normalized and clamped in accordance with ITU-R BT.656/601 as a post-processing step in the filter.
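The windowed filtering described above can be sketched in software. The following is a minimal illustration, not the VSA's System Generator design: the number of fractional coefficient bits (`FRAC_BITS`), the edge-replication border handling, and all function names are assumptions made for the example, and only luma samples are handled. The clamp uses the nominal 8-bit luma range of 16 to 235 defined by ITU-R BT.601.

```python
# Sketch only: Q-format, border handling, and names are assumptions, not the VSA design.
FRAC_BITS = 8                      # assumed number of fractional coefficient bits
LUMA_MIN, LUMA_MAX = 16, 235       # nominal 8-bit luma range in ITU-R BT.601


def to_fixed(coeff):
    """Quantize a real coefficient to a 16-bit signed fixed-point integer."""
    q = int(round(coeff * (1 << FRAC_BITS)))
    return max(-32768, min(32767, q))


def filter_window(frame, kernel, x0, y0, width, height):
    """Apply a 5 x 5 FIR kernel inside the window; pass other samples through."""
    rows, cols = len(frame), len(frame[0])
    qk = [[to_fixed(c) for c in row] for row in kernel]
    out = [row[:] for row in frame]            # samples outside the window are copied
    for y in range(max(y0, 0), min(y0 + height, rows)):
        for x in range(max(x0, 0), min(x0 + width, cols)):
            acc = 0
            for ky in range(5):
                for kx in range(5):
                    # Replicate edge pixels when the kernel overhangs the frame.
                    sy = min(max(y + ky - 2, 0), rows - 1)
                    sx = min(max(x + kx - 2, 0), cols - 1)
                    acc += qk[ky][kx] * frame[sy][sx]
            value = acc >> FRAC_BITS           # normalize back to sample scale
            out[y][x] = max(LUMA_MIN, min(LUMA_MAX, value))
    return out
```

With a kernel whose only non-zero coefficient is a center value of 1.0, the window is passed through unchanged; a kernel whose coefficients sum to zero, such as a simple edge detector, maps flat image regions to the lower clamp value.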

The filter is fully implemented in a Xilinx System Generator for DSP workflow that operates under The MathWorks’s Simulink environment. System Generator for DSP provides abstractions that enable you to develop highly parallel systems in Xilinx FPGAs, providing system modeling and automatic code generation from Simulink and MATLAB, also from The MathWorks.

The purpose of the Video VSA demo is to showcase the process of customizing Video VSA modules for your application. A detailed application note is included with the package, along with a demo user guide to help you get a quick start on your development.

Conclusion

The new Video Virtual Socket Adapter from Xilinx enables rapid algorithm porting and verification for video system development, utilizing Xilinx Virtex-4 SX platform devices and TI DM642 digital media processor DSPs in the System Generator for DSP tool flow. The VSA hardware and associated TI DSP tools are available from Spectrum Digital (www.spectrumdigital.com). For all other VSA inquiries, please contact your local Xilinx representative or visit www.nuvation.com.


Figure 3 – VSA firmware component overview. Blocks: System Initialization (EVM initialization, resource allocation, and task creation); VSA Access Macros (connects to VSA’s host interface over EMIF); Video Port Interface (captured from FPGA video output stream); Demo Control (background task and demo parameter control); Video Conversion Pipeline (YUV conversion and JPEG compression); Web Server (control CGI and Web-based player); MJPEG Player Server (MJPEG streaming); Network Connectivity (TCP/IP stack and network drivers). Connection labels: Restart, Demo Parameters, Captured JPEG Frame, Captured Frame, Wraps/Calls, DMA Data.

Figure 4 – MJPEG player (showing live video with edge-detect function)

The Role of FPGAs in Digital Radio Subsystems

Digital techniques for reducing analog costs.

by Steve Cooper CTO Axis Network [email protected]

A significant goal for mobile wireless infrastructure suppliers is to develop base stations (BTS) light enough to be deployed next to an antenna and reliable enough not to require tower climbs for servicing. These products ultimately will have the lowest cost structure, both in capital expenditure (CAPEX) and operating expenditure (OPEX). CAPEX and OPEX are two of the biggest issues affecting operators, and therefore base station OEMs, today.

Hardware and site preparation are major contributors to CAPEX costs, whereas major OPEX costs are site leasing, backhaul, and electricity. Products such as remote radio heads (RRHs) or compact integrated base stations (CiBTS) will go a long way toward reducing these costs.

Key to making compact integrated products successful (in terms of size and reliability) is reducing power consumption. The power amplifier is the component in the base station that consumes the most power. A number of DSP algorithms that improve power amplifier efficiency are available or under development. FPGAs play a significant role in the implementation of these algorithms.

Historically Inefficient

In some UMTS base stations, only a tiny fraction of the total DC power consumed is actually transmitted as useful RF power. Around 50% of the radio frequency (RF) power output from the cabinet is dissipated in the feeder cable running up the tower. The remaining power is dissipated as heat, requiring large heatsinks, air conditioning, and large cabinets.

A base station with the enhancements outlined in this article can benefit from a ten-fold increase in conversion efficiency. This significantly reduces heat dissipation in the system, allowing convection-cooled products to be deployed and enabling a dramatic size reduction. Smaller convection-cooled products can be mounted at the antenna, saving the cost of the feeder cable.

Leasing and installation costs are directly linked to the size, weight, and complexity of a base station. Small, convection-cooled CiBTSs or RRHs provide many more deployment options, and hence a reduction in leasing costs. Naturally, a ten-fold increase in conversion efficiency causes a similarly significant reduction in electricity costs.

For these enhancements to work, it is critical that DSP algorithms maintain excellent signal performance and keep the power consumption of the transistors to a minimum.

Crest Factor Reduction

Telecommunications standards are proliferating (UMTS, HSDPA, HSUPA, WiMAX, DVB-T, DAB, UWB). To maximize spectral efficiency, each of these air interfaces uses complex modulation schemes that have a high peak-to-average power ratio (also known as PAPR, or “crest factor”). Figure 1 shows the different crest factors evident in orthogonal frequency division multiplexing (OFDM) signals. Signals with high crest factors require a large range of dynamic linearity from the amplifier. This means that the power amplifier has to be set to operate well away (backed off) from its most efficient point.

Figure 1 – Cumulative distribution function for OFDM signals (taken from “Peak to Average Power Ratio Reduction of OFDM Symbols” by T. Aaron Gulliver, Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC Canada)

DSP can reduce the peaks of the signal, while some techniques in the baseband processing of the base station can reduce the incidence of the peaks. For example, code selection and tone reservation are two approaches proposed for wideband code division multiple access (WCDMA) and OFDM, respectively. These approaches typically have good performance, although they require intervention in the baseband processing layer before the individual codes or tones are combined into a composite stream.

Fortunately, a number of peak-limiting algorithms can be implemented on the composite I and Q signal. In one approach, called peak windowing, the signal is attenuated in the region of each peak. An alternative method is to clip the signal using polar or Cartesian clipping. With Cartesian clipping, the in-phase and quadrature components are clipped independently. With polar clipping, the magnitude of the signal is clipped while preserving the phase. Although either method can be used to limit the crest factor of the signal, polar clipping provides better results in terms of overall signal distortion (lower error vector magnitude [EVM]).

By reducing the crest factor, it is possible to obtain significantly more RF power from the same power transistor. Alternatively, you can use smaller transistors and achieve the same output requirements. The cost of a power transistor is proportional to its peak power handling.

An unclipped UMTS waveform, such as 3GPP-defined Test Model 1, has a 10 dB peak-to-average ratio. Thus, a 20W UMTS base station without a crest factor reduction (CFR) algorithm requires 200W of peak power handling. Using one of the crest factor algorithms discussed here can reduce the peak requirement by half, saving significant cost and power per transmit path.

Moving silicon (and cost) away from the power amplifier and into the FPGA will become more prevalent as technology advances continue to add functionality to FPGAs, and the cost-per-gate continues to track the downward trend of Moore’s law.

In addition to reducing the total system cost, CFR will also significantly improve the power efficiency of the base station, because not only is the price of the power transistor proportional to its peak power, but so is its power consumption. In today’s UMTS base stations, the power transistor is biased to handle its peak power. Therefore the peak power sets the efficiency of the power transistor and the overall system power consumption.

For these reasons, CFR algorithms are becoming commonplace in UMTS systems. It is now possible to implement a CFR algorithm and obtain a significant clipping reduction by using a Xilinx® Virtex™-4 FPGA.

An unclipped UMTS signal has a cumulative distribution function, as shown in Figure 2. If clipping is turned on, the crest is clipped by 4 dB to 6.5 dB, as shown in Figure 3. Figure 4 is a plot that compares the results of both clipped and unclipped measurements. In the plots, the black trace has a similar spectral emission to the blue trace. However, the black trace is output from the amplifier at 2 dB higher average output power. With clipping, you can achieve roughly 60% more average power from the same power transistor. Plus, the increase in power consumption is only marginal, leading to significantly enhanced efficiency numbers.
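The difference between the Cartesian and polar clipping methods described in this section can be sketched on a single complex baseband sample. This is only an illustration with hypothetical function names; it is not the CFR algorithm used for the measurements in this article.

```python
import cmath


def cartesian_clip(sample, limit):
    """Clip the in-phase and quadrature parts independently to +/- limit."""
    i = max(-limit, min(limit, sample.real))
    q = max(-limit, min(limit, sample.imag))
    return complex(i, q)


def polar_clip(sample, limit):
    """Clip the magnitude to limit while preserving the phase."""
    magnitude = abs(sample)
    if magnitude <= limit:
        return sample
    return cmath.rect(limit, cmath.phase(sample))
```

For the sample 3 + 0.5j and a limit of 1, Cartesian clipping returns 1 + 0.5j, which changes the phase of the sample, whereas polar clipping returns a sample of magnitude 1 with the original phase. Note also that with a per-component limit, the magnitude after Cartesian clipping can still reach the limit multiplied by the square root of two.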

Achieving the same performance in adjacent channel and spectral emissions, but driving the amplifier harder, has a significant impact on amplifier efficiency. An amplifier operates in its most efficient region when it is most compressed. Results on typical UMTS 20W amplifiers show that driving the amplifier 2 dB higher results in a power consumption increase of only 25%.
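The decibel figures quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the article (20W average power, 10 dB peak-to-average ratio, 2 dB more output, 25% more consumption); the function and variable names are chosen for illustration.

```python
def db_to_ratio(db):
    """Convert a power ratio in decibels to a linear ratio."""
    return 10.0 ** (db / 10.0)


def peak_power_watts(average_watts, papr_db):
    """Peak power a transistor must handle for a given average power and PAPR."""
    return average_watts * db_to_ratio(papr_db)


# Figures quoted in the article.
unclipped_peak = peak_power_watts(20.0, 10.0)       # 20 W average, 10 dB PAPR
output_gain = db_to_ratio(2.0)                      # amplifier driven 2 dB harder
consumption_gain = 1.25                             # 25% more DC power drawn
efficiency_gain = output_gain / consumption_gain    # relative efficiency change
```

A 10 dB ratio is a factor of ten, which gives the 200W peak requirement, and halving that requirement corresponds to lowering the peak-to-average ratio by about 3 dB. A 2 dB increase is a factor of about 1.58, which is the source of the "roughly 60% more average power" figure. Dividing that by the 1.25 increase in consumption suggests a relative efficiency improvement of a little over 25%.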

Limitations of CFR

Unfortunately, as discussed, clipping the peaks of a signal degrades its purity and increases the occurrence of bit errors, especially in areas of weak reception. UMTS Release 99 uses the QPSK modulation scheme. This scheme is relatively tolerant of signal impurities; the 3GPP standard for UMTS allows as much as 17.5% EVM degradation. As UMTS networks are typically interference limited, the impact of the increased EVM is of limited importance, as other factors dominate the system bit error rate. Current state-of-the-art UMTS CFR algorithms are demonstrating clipping to a 6 dB peak-to-average ratio while meeting 3GPP Release 99 EVM requirements.

As modulation schemes change from QPSK (UMTS Release 99, shown in Figure 5) to higher level schemes such as 16 QAM (shown in Figure 6) and 64 QAM (used by HSDPA and WiMAX), the tolerance of the system to any impurity is reduced. As Figures 5, 6, and 7 show, the relative distance between each point on the constellation diagram is reduced. Impurities in the signal will cause the detection points to merge together, creating bit errors. This error can be seen in the constellation measurements shown in Figures 7 and 8. Current algorithms can clip only to a peak-to-average ratio of about 8 dB while meeting the tight EVM requirements for 64 QAM signals.

Clearly, clipping is of value in those systems that can tolerate higher levels of EVM degradation. But to improve the efficiency of systems using higher level modulation schemes, additional techniques are required.

Digital Pre-Distortion

Another important parameter affecting the power transistor choice is the adjacent channel power ratio (ACPR). The plots shown in Figure 4 were deliberately chosen to show an amplifier that was passing the 3GPP adjacent channel and spectral emission requirements without linearization. Linearization allows the amplifier to operate even further into its highest efficiency area. A number of available techniques have this effect. These techniques originated in the analog domain with feed-forward and cross-cancellation and have now moved into digital pre-distortion carried out in the I and Q domain.

Figure 2 – Cumulative distribution of unclipped signal

Figure 3 – Clipped cumulative distribution

Figure 4 – Plot showing the comparative results of both clipped and unclipped signals

Pre-distorters have been demonstrated that have almost perfect performance, removing all non-linearities and minimizing the adjacent channel power down to the noise floor of the system. However, the algorithms that achieve this performance are very processor-intensive and typically deployed in very large ASICs. The algorithms in the ASICs must be able to cope with many different amplifiers and topologies to address the broadest market. This in turn makes the ASIC more complex and power-hungry.

Instead of an ASIC, using an FPGA to implement pre-distortion allows you to use the flexibility of the device and implement a specific algorithm tailored to the specific amplifier being pre-distorted.

This is ideal for compact integrated products, as the transceiver, algorithm, and amplifier are permanently integrated together in one field-upgradable unit. It is not necessary for the algorithm to be overly complex; hence it takes less silicon space and can compete in cost with an all-encompassing ASIC.

By working closely with power transistor suppliers, it is possible to develop very code-efficient, custom digital pre-distortion algorithms. During factory testing, the best algorithm for each particular amplifier can be chosen and programmed into the FPGA.

Pre-distortion not only improves the spectral emissions of the amplifier but can also have a significantly positive effect on the signal EVM. This fact will likely lead to pre-distortion being widely used within WiMAX systems.

The characteristics of an amplifier change according to the signal passing through it, temperature and frequency effects, and device technology. To produce consistent optimized performance, pre-distortion algorithms require the wideband capture of the amplifier output. This analog signal must be converted into the digital domain using high-performance ADCs. Once in the digital domain, very rapid, real-time manipulation of large mathematical arrays is required. Essentially, the inverse of the non-linear response of the amplifier must be applied to the input signal. This inverse response must closely track the changes within the power transistor. The mathematical processing is often carried out by a dedicated DSP device.
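The idea of applying the inverse of the amplifier's non-linear response can be illustrated with a toy example. Everything in the sketch below is an assumption made for illustration: the amplifier is modelled as a memoryless third-order gain compression with a made-up coefficient, the model is known exactly rather than estimated from a captured output, and the inverse is found by Newton's method. A real pre-distorter must also track changes in the amplifier and typically handles phase distortion and memory effects.

```python
AM_COEFF = 0.2   # assumed third-order compression coefficient of the toy amplifier


def amplifier(x):
    """Toy memoryless amplifier: gain compresses as the input magnitude grows."""
    return x * (1.0 - AM_COEFF * abs(x) ** 2)


def predistort(x, iterations=8):
    """Return the input that makes the toy amplifier output equal x (Newton's method)."""
    target = abs(x)
    if target == 0.0:
        return x
    r = target
    for _ in range(iterations):
        error = r - AM_COEFF * r ** 3 - target
        r -= error / (1.0 - 3.0 * AM_COEFF * r ** 2)
    return x * (r / target)
```

Passing a sample through `predistort` and then `amplifier` returns the original sample, so the cascade behaves linearly. This only holds while the requested output magnitude stays below the saturation level of the toy model (about 0.86 with the coefficient assumed here).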

As FPGA performance continues its rapid advancement, the manipulation of the array required for digital pre-distortion can be carried out on the FPGA. This leads to a one-chip solution that can interface to the DAC for the RF transmit path and the ADC for the pre-distortion capture receiver. This same single chip can also carry out the signal processing required for digital upconversion, CFR, and all of the processing and algorithm manipulation for digital pre-distortion.

Combined, digital pre-distortion and CFR are the current state-of-the-art techniques for improving the efficiency of UMTS systems. The majority of existing systems use ASICs or ASSPs. However, as new amplifier technology and higher level modulation schemes are introduced, the algorithms required will need enhancement and upgrading in the field.

Figures 5 and 6 – Plots showing relative constellations of QPSK (left) and QAM16 (right)

Figure 7 – Plot showing 64 QAM constellations with good EVM

Other Analog Techniques

Of course, there are also techniques in the analog domain to improve efficiency. A technique known as Doherty has started to be deployed in UMTS systems. Doherty uses two output stage transistors biased at different points. One of the transistors is on all the time; the second only turns on as the signal approaches its peak. This reduces current consumption, as the transistors are not both turned on all the time, and when they are on they are operating in their more efficient regions. It is important to note that Doherty works at its best for signals with 6 dB of peak-to-average. In simple terms, the first amplifier covers the lower 3 dB and the second amplifier covers the upper 3 dB. Efficiencies of 32% have been demonstrated for signals with a 6 dB peak-to-average ratio.

For signals with more than 6 dB peak-to-average, the two transistors are not able to operate completely within their most efficient regions. Therefore, for signals such as 64 QAM (currently with 8 dB PAR), the improvements obtained with Doherty are not as significant.

Envelope Tracking

One efficiency enhancement technique generating significant interest is envelope tracking, in which the drain voltage of the transistor is varied at the same time as the signal passing through the transistor. When traffic is light, it continues to run efficiently by reducing the current consumption of the transistor. These techniques hold the promise of 35% efficient power amplifiers, even if the PAR is greater than 8 dB.

Supporting envelope tracking in the digital domain and combining it with digital pre-distortion allows for the development of reliable compact integrated products. These two algorithms are similar: they both require very fast tracking loops and processing of the I and Q data stream, and they both vary between different amplifier designs.

Envelope tracking requires access to the I and Q data. It is necessary for the I and Q stream to be both delayed and processed so that the power amplifier biasing signal is modulated at exactly the same moment as the composite analog waveform. This requires different digital modules implemented in the FPGA, according to the particular efficiency technique deployed. Figure 9 shows this technique at the block level.
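The delay-and-process step described above can be sketched as follows. This is a simplified illustration with hypothetical names: the envelope is taken as the magnitude of each I/Q sample, a minimum bias level (`floor`) is assumed so the transistor is never fully starved, and a simple sample delay line stands in for the adjustable delay that aligns the bias with the RF path. The actual delay value, and which path needs delaying, depend on the hardware.

```python
import math
from collections import deque


def envelope(i, q):
    """Instantaneous envelope magnitude, sqrt(I^2 + Q^2)."""
    return math.hypot(i, q)


def delayed(samples, delay):
    """Delay a sample stream by `delay` samples, padding the start with zeros."""
    line = deque([0.0] * delay)
    out = []
    for s in samples:
        line.append(s)
        out.append(line.popleft())
    return out


def bias_signal(iq_samples, bias_delay, floor=0.1):
    """Envelope-derived bias, delayed to line up with the RF path latency."""
    env = [max(floor, envelope(i, q)) for i, q in iq_samples]
    return delayed(env, bias_delay)
```

Calling `bias_signal([(3, 4), (0, 0), (6, 8)], 1)` yields a bias stream in which each envelope value appears one sample later than the I/Q sample that produced it.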

Conclusion

The Virtex-4 family of FPGAs is ideally suited to implement the existing algorithms required for a digital radio modem. In addition, the use of an FPGA within the digital modem provides the future-proofing necessary to allow software field upgrades for higher efficiency and higher level modulation systems.

For more information about how you can implement these techniques within digital radio systems, please contact the author at [email protected].

The author would like to thank Rohde and Schwarz for the use of the test equipment (FSQ and SMU) required to carry out the measurements.


Figure 8 – Plot showing 64 QAM constellations with bad EVM caused by clipping

Figure 9 – Block diagram of envelope tracking implementation using FPGAs to control and bias the power amplifier transistor efficiently. Blocks: DSP, FPGA, UMTS complex baseband signal, I² + Q², adjustable delay, envelope tracking processing, digital predistortion, RF chain, power amplifier, RF output.

by Tom Feist Director, DSP Tools Marketing Xilinx, [email protected]

Spatially multiplexed multiple-input multiple-output (MIMO) transmitters and receivers promise significant performance gains for wireless communications systems over their existing single-input single-output (SISO) counterparts. Next-generation wireless standards, such as 802.11n, will support data transmission rates as high as 600 Mbps and wireless local area network transmission rates in excess of 1 Gbps.

The design of these systems, however, forces a compromise in cost and power that can have significant consequences for handheld devices running on batteries. The challenge facing design teams is to determine the optimal balance between these design requirements for their particular application.

At the heart of this technology is the concept of multipath, which refers to the reflection of radio frequency (RF) signals in a physical environment. Whereas multipath degrades the performance of existing 802.11 devices, spatially multiplexed orthogonal frequency division multiplexing (OFDM) MIMO, a key element of the 802.11n standard, takes advantage of these reflections to “tune” transmissions, minimize errors, and improve overall performance. But at these bandwidths, scattering, diffraction, and absorption by objects in the transmission path are an important consideration. Designing a MIMO system requires that these effects be profiled as accurately as possible in the form of a channel model.

There are three primary sources of channel models: software-based mathematical models, often available from the standards committees; hardware-based MIMO channel emulators, either designed in-house or provided by companies such as Azimuth; and, best of all, the real-world environment in which the MIMO system is intended to operate. Verifying a MIMO system in the real world requires the ability to rapidly prototype the transmitter and receiver on a MIMO-oriented FPGA hardware platform, such as the VHS-ADC-V4 card from Lyrtech.




A practical approach to system implementation using MATLAB and Virtex-4 FPGAs.

The MIMO Performance Advantage

The benefit of spatially multiplexed MIMO technology is the ability to increase transmission speed with the number of antennas. The data rate of today’s existing SISO systems is determined by the formula:

R = Es * Bw

where R is the data rate (bits/second), Es is the spectral efficiency (bits/second/Hertz), and Bw is the communications bandwidth (Hz). For instance, for the 802.11a standard the peak data rate is determined by the formula:

Bw = 20 MHz

Es = 2.7 bps/Hz

R = 54 Mbps

An additional variable, Ns, is introduced into this equation when using MIMO. It is the number of independent data streams that are transmitted simultaneously in the same bandwidth but in different spatial paths. The spectral efficiency is now measured per stream as Ess, and the data rate of the MIMO system becomes:

R = Ess * Bw * Ns

Let’s compare the previous 802.11a example with what is obtainable with the current 802.11n proposal, operating at a 20 MHz bandwidth and using four antennas:

Bw = 20 MHz

Ess = 3.6 bps/Hz

Ns = 4

R = 288 Mbps

The use of MIMO technology has delivered a 5.3x data rate improvement for the 802.11n proposed standard.
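The two data rate calculations above can be reproduced with a short sketch. The function name and the choice of MHz as the bandwidth unit are made for illustration; the numbers are those given in the article.

```python
def data_rate_mbps(spectral_efficiency, bandwidth_mhz, streams=1):
    """R = Ess * Bw * Ns, with Bw in MHz so that R comes out in Mbps."""
    return spectral_efficiency * bandwidth_mhz * streams


siso = data_rate_mbps(2.7, 20)               # 802.11a example
mimo = data_rate_mbps(3.6, 20, streams=4)    # 802.11n proposal, four streams
improvement = mimo / siso
```

The improvement factor of about 5.3 is the product of two effects: four spatial streams instead of one, and a per-stream spectral efficiency that rises from 2.7 to 3.6 bps/Hz.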

MIMO System Hardware Complexity

The performance gains of a spatially multiplexed MIMO system come at the expense of hardware complexity. A transmit/receive system that uses multiple antennas transmits data not only between the corresponding antennas but also between adjacent antennas. As you can see in Figure 1, data is received in the form of a “MIMO channel matrix.”
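A small numerical example may help make the channel matrix concrete. The sketch below models a noise-free 2 x 2 channel in which each receive antenna sees a weighted sum of both transmitted symbols, and then recovers the symbols by applying the inverse of the channel matrix. The index convention (row for the receive antenna, column for the transmit antenna), the function names, and the assumption that the receiver knows the channel exactly are all illustrative choices; direct inversion is only one of several ways to separate the streams.

```python
def receive(h, x):
    """Each receive antenna sees a weighted sum of both transmit antennas."""
    return [h[0][0] * x[0] + h[0][1] * x[1],
            h[1][0] * x[0] + h[1][1] * x[1]]


def invert_2x2(h):
    """Closed-form inverse of a 2 x 2 complex matrix."""
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    if det == 0:
        raise ValueError("channel matrix is singular")
    return [[h[1][1] / det, -h[0][1] / det],
            [-h[1][0] / det, h[0][0] / det]]


def recover(h, y):
    """Undo the channel mixing by applying the inverse channel matrix."""
    return receive(invert_2x2(h), y)
```

If the two spatial paths are too similar, the determinant approaches zero and the inversion becomes numerically fragile, which is one reason the choice of algorithm matters in hardware.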

Figure 1 – MIMO channel. Blocks: Source (antennas 1 and 2, each after modulation and coding), Sink (antennas 1 and 2), channel coefficients h11, h12, h21, h22.

Linear algebra techniques such as singular value decomposition (SVD) or matrix inversion are required to decouple the channel matrix in the spatial domain and recover the transmitted data. Backwards compatibility requirements with the 802.11g standard limit the number of antennas for the 802.11n standard to either two or four, which subsequently limits the channel matrix size to either 2 x 2 or 4 x 4.

Developing a MIMO system prototype in hardware that performs at the actual system data rates requires the use of an FPGA-based hardware platform. The Xilinx® Virtex™-4 family of FPGAs provides far greater performance than a DSP processor for this class of applications by providing as many as 512 hardware multipliers capable of parallel operation. In designing this prototyping system, however, you are faced with two considerable challenges: the first is to design something as complex as an SVD or matrix inverse in hardware, and the second is to tune the implementation for optimal performance.

Implementing Matrix Operations on FPGAs

The specific SVD or matrix inversion algorithm selected for implementation will be a tradeoff between numerical stability and hardware efficiency. You will need to develop a high-level MATLAB model to determine the most efficient algorithm for a particular application. In the case of the SVD, this may involve choices between adaptive estimation techniques, vector rotations, or other simplifications that result from channel matrices with special properties such as symmetry.

Once an algorithm has been finalized, you will need to tune the hardware performance to overall system requirements. Maximizing the performance of a MIMO system in hardware requires partially parallelizing the multiplication operations in the key areas of the design that have the greatest impact on overall performance. The Givens rotation algorithm shown in Figure 2 provides a nice example of the performance gains possible through parallel multiplication operations. Givens rotations are commonly used to solve the symmetric eigenvalue problem and are a key building block of the QRD matrix inverse.

    function [v, w] = givens_rotation(x, y)
    % Rotate the vectors x and y so that the first element of w becomes zero.
    r_sqr = x(1)*x(1) + y(1)*y(1);
    r_inv = 1/sqrt(r_sqr);
    sin_phi = y(1)*r_inv;
    cos_phi = x(1)*r_inv;
    vt = x*cos_phi + y*sin_phi;
    wt = y*cos_phi - x*sin_phi;
    % When both leading elements are zero there is nothing to rotate,
    % so the inputs are passed through unchanged.
    if (x(1) == 0) & (y(1) == 0)
        v = x;
        w = y;
    else
        w = wt;
        v = vt;
    end

Figure 2 – Givens rotation algorithm

You can implement this algorithm using either multipliers or a CORDIC approximation method. The Xilinx AccelDSP™ Synthesis tool’s design exploration features were used to increase performance by inserting parallelism into the architecture without code rewrites. As shown in Table 1, this allowed performance gains of nearly 10x over the parallel CORDIC implementation (54 MSPS versus 5.7 MSPS). Algorithms based on Givens rotations have received greater attention recently because they lend themselves nicely to a parallel implementation.
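For readers unfamiliar with the CORDIC alternative mentioned above, the following is a minimal sketch of CORDIC in vectoring mode, which finds the magnitude and angle of a vector using a sequence of fixed micro-rotations. In hardware each step needs only shifts and adds; floating-point arithmetic is used here for readability. The iteration count and names are assumptions, and this is not the implementation generated by the AccelDSP Synthesis tool.

```python
import math

ITERATIONS = 16
# Pre-computed rotation angles atan(2^-i) and the constant CORDIC gain correction.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_vectoring(x, y):
    """Rotate (x, y) onto the positive x axis; return (magnitude, angle).

    Assumes x > 0 (right half plane). Each step needs only shifts and adds in
    hardware; floats are used here for readability.
    """
    angle = 0.0
    for i in range(ITERATIONS):
        shift = 2.0 ** -i
        if y >= 0:
            x, y, angle = x + y * shift, y - x * shift, angle + ANGLES[i]
        else:
            x, y, angle = x - y * shift, y + x * shift, angle - ANGLES[i]
    return x / GAIN, angle
```

The returned magnitude and angle correspond to the radius and rotation angle that the multiplier-based listing in Figure 2 obtains through a reciprocal square root, which is why the two approaches trade multipliers against iteration count.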

For large systems, the added hardware that results from increased parallelism must not exceed the resources of the target FPGA. The number of architectural possibilities you must evaluate can be considerable. The process of determining the optimal hardware architecture is well suited to a high-level algorithmic synthesis tool such as AccelDSP.

A MATLAB-Based FPGA Design Flow

MATLAB from The MathWorks provides a truly unique environment for the design and implementation of spatially multiplexed MIMO systems. The inherent language support for loops, complex numbers, vector and matrix operations, and mathematical functions provides a highly efficient modeling environment for the linear algebra algorithms required for MIMO.

Figure 3 illustrates the benefits of the AccelDSP Synthesis tool, including the flexibility to define and implement custom architectures for spatially multiplexed MIMO systems on FPGAs using floating-point MATLAB. Automated floating- to fixed-point conversion is provided to assist in solving the complex quantization issues resulting from the iterative nature of linear algebra functions such as an SVD. Once you have determined an acceptable fixed-point model, you can rapidly explore performance-versus-hardware tradeoffs using algorithmic synthesis, quickly increasing the number of dedicated hardware multipliers to improve performance and take full advantage of the flexibility of the Virtex-4 architecture. The generated RTL from AccelDSP Synthesis is automatically verified against the golden-source MATLAB to ensure bit-true functional correctness.
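The quantization issue mentioned above comes from replacing each floating-point value with the nearest value a fixed-point word can represent. The sketch below shows that basic step and a simple way to measure the error it introduces. It is an illustration only, with hypothetical names and a round-and-saturate policy chosen for the example; it does not describe how the AccelDSP Synthesis tool performs its conversion.

```python
def quantize(value, word_bits, frac_bits):
    """Round to signed fixed point with saturation; return the represented value."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    code = max(lo, min(hi, int(round(value * scale))))
    return code / scale


def max_quantization_error(values, word_bits, frac_bits):
    """Largest absolute difference between the float and fixed-point values."""
    return max(abs(v - quantize(v, word_bits, frac_bits)) for v in values)
```

For values inside the representable range, the error of a single quantization is at most half of one least significant bit. In iterative algorithms such as an SVD these small errors accumulate from one iteration to the next, which is why word lengths need to be checked against the floating-point model.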

Conclusion

Prototyping a spatially multiplexed MIMO system for use in real-world verification is dramatically simplified through the adoption of a MATLAB-based design flow for the channel-matrix DSP hardware development. Development and verification cycle times are reduced by using the MATLAB algorithm as the golden source for FPGA development and eliminating re-writes into other languages or design environments. Additionally, the high-level nature of MATLAB allows the AccelDSP Synthesis tool to quickly explore hardware alternatives for an algorithm, including the use of DSP blocks, RAMs, and pipelining.

The AccelDSP Synthesis tool and Lyrtech prototyping environment both have interfaces to the Xilinx System Generator for DSP design environment to provide an automated MATLAB-to-prototype design flow.

For more information about the AccelChip solution, visit www.accelchip.com.


Floating-PointAlgorithm

Fixed -PointConversion

ArchitectureDefinition

Floating-PointAlgorithm

AccelChip/AccelWare

RTLSynthesis

Create/IntegrateIP Blocks

Create RTLDesign

RefineArchitecture

VerifyRTL

RTLSynthesis

Typical MATLAB DSP Design Flow

AccelDSP Design Flow

Steps Performedby AccelDSP

Architecture DSP48s Slices Throughput

Resource Shared Multiplier 26 943 2.8 MSPS

Parallel Multipliers 46 1774 54 MSPS

Resource Shared CORDIC 0 870 1.3 MSPS

Parallel CORDIC 0 2237 5.7 MSPS

Development and verification cycle times are reduced by using the MATLAB algorithm as the golden source for FPGA development and eliminating re-writes into other languages or design environments.

Table 1 – The range of results obtained by synthesizing a 4 x 4 matrix using the AccelDSP Synthesis tool and targeting a Virtex-4 device.

Figure 3 – AccelDSP Synthesis design flow

by Manuel Uhm
Senior Marketing Manager, DSP Division
Xilinx, [email protected]

The flexibility of software-defined radios (SDRs) and cognitive radios (CRs) provides great value relative to interoperability, upgradability, and future-proofing. This flexibility also enables a highly desired attribute in SDRs and CRs: “adaptivity.” Adaptivity can range from a cognitive radio’s ability to adapt to its spectral environment to a software-defined radio’s ability to adapt a waveform to compensate for channel fading. Like flexibility, adaptivity is enabled by the reprogrammable and reconfigurable processors used in SDRs/CRs: FPGAs, DSPs, and general-purpose processors (GPPs).

Unfortunately, this adaptivity typically comes at a price – both in terms of power and system cost. Recent technological advances, however, are making adaptivity more affordable. For example, partially reconfigurable (PR) FPGAs that embed GPPs and DSP engines on platform FPGAs can provide adaptivity to a wide range of SDR and CR applications, while lowering the power and cost penalties.

Adaptivity in SDRs and CRs

As a key capability of SDRs/CRs (particularly for military or homeland security purposes), adaptivity can take many forms on the battlefield. With it, you can:

• Change waveforms to interoperate with other friendly communication devices

• Choose the most appropriate communications channel or network for transmission

• Create a mesh network through ad-hoc networking

• Adapt to the radio frequency (RF) environment by using spectral awareness to transmit in an open area of spectrum

• Adapt the waveform to compensate for channel fading

• Collaborate with multiple radios to receive a weak signal that could not otherwise be detected by individual radios

• Jam or null an interfering signal

• Accommodate damage to some of a radio’s processing resources by reconfiguring the remaining resources to support the most critical services

For the purposes of an SDR/CR, adaptivity falls into four broad classifications, as illustrated in Figure 1. The lowest functional level includes filters and transforms, such as Kalman filters, finite impulse response (FIR) filters, and fast Fourier transforms (FFTs). These low-level functions are basic building blocks for most SDRs/CRs. For example, you would probably have to adapt the parameters of a function such as an FIR filter to support a waveform with changing bit rates.

At the component level, adaptivity in an SDR/CR is useful in digital down converters (DDCs) and digital up converters (DUCs). These components must often adapt to waveforms that support different bit rates or sampling rates.

In an SDR/CR, adaptivity at the function or component level is “under the hood,” insofar as it is transparent to the end user. At those levels, it does not matter what modifications are necessary to support the service required. On the other hand, the next two levels – the application and services levels – are visible and, as such, you may desire some form of control over adaptivity.

Making the Adaptivity of SDR and Cognitive Radio Affordable

Going beyond flexibility to adaptivity in FPGAs.

Adaptivity at the application level supports modifications that occur within a given application. The most common applications in an SDR/CR are waveforms such as the Wideband Networking Waveform (WNW). These waveforms comprise various waveform components, and depending on the mission profile, you may require access to different waveforms on an as-needed basis.

The highest functional level of adaptivity is at the services level. Services like radio services, network awareness services, ad-hoc networking, and even anti-jam services must be able to adapt to changing conditions by calling on available applications as needed.

These levels of adaptivity are in many ways interdependent because adaptivity at each level depends on the previous levels for implementation. For example, you might call on the radio service to transmit data. The service would include adapting to available spectrum by scanning RF, and then selecting the best waveform for sending the data. If the waveform is an adaptive waveform, then certain channel characteristics (such as channel fading) might require modification of waveform components and functions to compensate.

[Figure 1 diagram: level of adaptivity (low to high) plotted against level of sophistication (low to high), rising from FUNCTION (filter) through COMPONENT (upconverter) and APPLICATION (waveform) to SERVICE (network awareness).]

Figure 1 – Levels of adaptivity in an SDR/CR

Supporting Adaptivity in FPGAs

FPGAs provide great value as processing platforms in today’s SDRs/CRs because of their processing throughput. They also offer a variety of adaptivity levels to this environment, as depicted in Figure 2.

[Figure 2 diagram: level of configurability (low to high) plotted against level of sophistication (low to high), rising from CONFIGURABLE (one time) through RECONFIGURABLE (multiple times) and PARTIALLY RECONFIGURABLE (partial device) to ULTIMATELY RECONFIGURABLE (individual CLBs).]

Figure 2 – Levels of configurability in FPGAs

At the lowest level, FPGAs can be one-time configurable (hence not reconfigurable). Obviously, such devices are not ideal for SDRs/CRs, as they cannot adapt to support new functionality after the first time they are programmed.

The FPGA’s inherent reconfigurability, and more specifically, the capability some FPGAs have to be dynamically reconfigured on the fly, makes them ideally suited for SDRs/CRs. However, in many cases these FPGAs have to be completely reconfigured, which limits the capability of the FPGA to support adaptivity at the component or function level because these levels involve more granularity. As a result, support is generally limited to adaptivity at the application level. Most reconfigurable FPGAs today, including the Xilinx® Spartan™ family of FPGAs, fit into this category.

PR FPGAs allow for the next level of adaptivity. Like reconfigurable FPGAs, you can dynamically reconfigure PR FPGAs multiple times. However, only a portion of the device need be reconfigured at any given time. This provides a level of granularity suitable for adaptivity at the component and even the function level. The Xilinx Virtex™-II and Virtex-4 device families are examples of PR FPGAs.

Finally, the “ultimate” level of reconfigurability is the ability to reconfigure on an individual basis the smallest atomic programmable unit of an FPGA, the configurable logic block (CLB). This allows for even finer levels of granularity for adaptivity at the function level. However, it is not clear that the benefit of such a fine-grained approach outweighs the cost of implementation associated with such a high level of sophistication.

Use Cases

SDRs/CRs can benefit from the adaptivity of a PR FPGA in many ways, from the function level to the service level. Two such cases are offered here. The first is an SDR supporting an adaptive waveform, which demonstrates adaptivity at the application level. The second is a multi-INT platform providing multiple intelligence-related applications, demonstrating adaptivity at the service level.

In the first use case, you would transmit voice or data on an adaptive waveform from your SDR to another radio. At some point, perhaps because of environmental conditions, the channel starts to fade. To the SDR, this is characterized by an increased bit error rate beyond a certain threshold. To maintain the channel, the radio determines that the waveform must be adapted to the new environmental conditions.

Adaptivity in this case could take many potential forms, including changing the modulation technique, changing the method of forward error correction, or changing the bit rate. For the example illustrated in Figure 3, let’s assume that the radio has determined that a change in modulation technique is optimal. The modulator component is represented in Figure 3 as a 16-QAM modulator. Hence, this needs to be swapped out with another available modulator component, in this case a BPSK, QPSK, or OFDM modulator. The OFDM modulator is chosen for its resistance to multipath.

To support this type of component-level adaptivity, a regular reconfigurable FPGA would have to be sufficiently large to load all possible components, even if many of them are not being used at any single point in time. Moreover, reconfiguring the entire device would result in losing the communications channel – an unacceptable outcome.

By contrast, a PR FPGA would only have to load the OFDM modulator component in an available portion of the device and then make the switch from the 16-QAM modulator to the OFDM modulator. The 16-QAM modulator could then be unloaded to free up resources for another application or component. The outcome is the same but the PR FPGA can be much smaller – resulting in significant power and cost savings.
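The load, switch, and unload sequence described for the PR FPGA can be sketched in a few lines of Python. This is only an illustrative model: the class `PartialRegionManager`, the function `adapt`, the region count, and the bit-error-rate threshold are all hypothetical names and values chosen for the example, not part of any Xilinx tool or radio framework.

```python
from dataclasses import dataclass, field


@dataclass
class PartialRegionManager:
    """Hypothetical model of the reconfigurable regions in a PR FPGA."""
    capacity: int                          # number of regions (assumed)
    loaded: list = field(default_factory=list)
    active: str = ""

    def load(self, name: str) -> None:
        # Load a component into a free region without disturbing the others.
        if name in self.loaded:
            return
        if len(self.loaded) >= self.capacity:
            raise RuntimeError("no free reconfigurable region")
        self.loaded.append(name)

    def switch_to(self, name: str) -> None:
        # Load the new modulator beside the active one, switch over,
        # then unload the old modulator to free its region.
        old = self.active
        self.load(name)
        self.active = name
        if old and old != name:
            self.loaded.remove(old)


def adapt(manager: PartialRegionManager, bit_error_rate: float,
          threshold: float = 1e-3, fallback: str = "OFDM") -> bool:
    """Swap modulators when the measured BER crosses an assumed threshold."""
    if bit_error_rate > threshold and manager.active != fallback:
        manager.switch_to(fallback)
        return True
    return False
```

Because the old modulator stays active until the new one is loaded, the model never passes through a state with no modulator, which mirrors the point that the communications channel is not lost during the swap.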

The second use case, as illustrated in Figure 4, involves a multi-INT platform that is capable of using services to invoke many possible applications, including radio, spectral analysis, surveillance, jamming, and anti-jamming. Although multi-INT platforms are not commonly used today, the advent of CR will bring about the next revolution in communications.

In this scenario, you may be receiving data. The radio service is being utilized to call on two applications – the spectral analysis application to characterize the spectrum and identify potential threats or signals of interest, and the radio application to receive the data. At some point, an interfering signal may be attempting to jam the receiver, severely impairing your ability to receive crucial intelligence.

In such a case, you may call on the anti-jamming service to null the interfering signal. This service would characterize the interfering signal using the spectral analysis application, and would then load the anti-jammer application to null the interferer. Once the jamming signal goes away, the anti-jammer application can be unloaded by the anti-jamming service. Other services could then load available applications on an as-needed basis, such as an ad-hoc networking application to create a mesh network.

Conclusion

It is clear that adaptivity at all levels, from functions to services, is a highly desired attribute in SDRs/CRs. Although you will primarily be exposed to adaptivity at the service and application level, adaptivity at the component and function level is necessary for implementation.

As CRs and multi-INT platforms become more prevalent, the need to dynamically adapt to changing conditions will increase and will ultimately become a competitive advantage for those vendors who are better able to accommodate different levels of adaptivity. PR FPGAs are ideally suited for driving adaptivity at all levels. PR FPGAs have sufficient granularity to allow you to reconfigure portions of the device down to the size of typical functions in an SDR/CR. They are also able to support whole applications.


[Figure 3 diagram. Block labels: Scrambler, Encoder, 16-QAM Mod, Upconverter; alternative modulators BPSK Mod, QPSK Mod, OFDM Mod; Reconfig via ICAP.]

Figure 3 – Adaptivity at the application level in a PR FPGA

[Figure 4 diagram. Block labels: Radio, Spectral Analysis, Jammer, Anti-Jammer, Ad-Hoc Networking; Reconfig via ICAP.]

Figure 4 – Adaptivity at the service level in a multi-INT platform

by Sébastien Roy
Professor, Department of Electrical and Computer Engineering
Université [email protected]

Louis Bélanger
Executive Vice President, Product Management and [email protected]

Wireless communications, through cellular telephony, came of age in the 1980s and 1990s, achieving as an industry a growth rate beyond microcomputers. It seems clear that cellular is on the verge of a major evolutionary leap, as the advent of 4G will spell the end of the connection-oriented voice-centric paradigm in favor of a network-oriented (read packet) resource-sharing multiservice framework. This evolution also implies the definitive advent of broadband wireless, an elusive proposition so far in the cellular world.

The Fast-Moving Wi-Fi World

Upstart wireless LAN (WLAN) technologies under the 802.11 (Wi-Fi) umbrella have leapfrogged cellular and other efforts edging towards broadband wireless (such as 802.16/WiMAX) and have led to the first widespread, commercially successful broadband wireless access technology. In fact, Wi-Fi is a runaway success around the globe.

However, all is not perfect in the WLAN world. Although the nominal bit rates are 11 Mbps (802.11b) and 54 Mbps (802.11a and 802.11g), the effective throughputs are actually much lower – owing to packet collisions, protocol overhead, and interference in the increasingly congested unlicensed bands at 2.4 GHz and 5 GHz. Furthermore, operation in these bands entails a strict regulatory transmit power constraint, thus limiting range and even bit rates beyond a certain distance. Compare this with Gigabit Ethernet, and you will see that a huge rate gap exists between the wired and wireless portions of the network.

MIMO Techniques and 802.11n

Enter MIMO (multiple input multiple output), a wireless technology allowing a huge increase in bit rates without consuming additional bandwidth. It is currently a very hot trend in the wireless industry, and with good reason. It basically works by having multiple antennas at both transmitter and receiver and performing appropriate signal processing at both ends. This can be used to effectively create a plurality of channels in space sharing the same bandwidth, a feat referred to as spatial multiplexing. MIMO happens to be the means through which the 802.11n workgroup plans to boost Wi-Fi nominal bit rates to hundreds of megabits per second, perhaps as much as 600 Mbps over a maximum bandwidth of 40 MHz.

Compliant terminals and access points would be equipped with anywhere from one to four antennas. In principle, having four antennas at both ends enables a four-fold rate increase within a given bandwidth. Wi-Fi channels have a nominal bandwidth of 20 MHz. 802.11n proposals advocate a combination of spatial multiplexing, advanced modulation, beamforming, and space-time coding, as well as channel bonding (merging two adjacent channels to form a 40 MHz aggregate channel) to achieve the highest bit rates.

The Design of an FPGA-Based MIMO Transceiver for Wi-Fi

We implemented a Virtex-based layered space time algorithm on a Laval University 802.11 MIMO testbed.

One recent specification proposal by the Enhanced Wireless Consortium details the 802.11n transmitter structure shown in Figure 1. The data from the MAC layer is first scrambled and then demultiplexed into a number NES of streams, which are routed to NES forward-error correction (FEC) encoders. These encoders can be either binary convolutional coders with puncturing or low-density parity check (LDPC) coders. The encoder outputs are then divided into a number NSS of spatial streams, each being interleaved and mapped onto a QAM constellation. Collectively, the spatial streams are then optionally subjected to space-time block coding (STBC), a form of two-dimensional linear coding generally intended to improve reliability and not necessarily bit rate (see sidebar, “Space-Time Techniques”).

Another optional feature is spatial mapping (beamforming), which relies on foreknowledge of the MIMO channel obtained through a priori sounding. This mechanism aims to maximize energy delivery to the receiving array. In general, the SNR at the receiver is influenced by spatial mapping and the STBC, while the bit rate is a function of the number of bitstreams and the size of the QAM constellation.

Whether you use the optional features or not, the receiver must somehow separate the superimposed spatial streams. This can be accomplished through spatial filtering, provided there are at least as many antennas as there are streams. One effective receiver architecture exploits both spatial filtering and successive interference cancellation (SIC) in a layered structure. Such a layered space-time (LST) receiver was recently the object of an FPGA implementation at Laval University, in collaboration with Lyrtech as part of an 802.11n-oriented research testbed under construction.

Part of this testbed is shown in Figure 2. It comprises a custom-built RF front-end supporting as many as four antennas at 2.4 GHz and 5 GHz, Skycross printed antennas, and Lyrtech SignalMaster Quad platforms with integrated multichannel ADC and DAC modules.

The structure of the LST receiver is shown in Figure 3. Our implementation constructs the covariance matrix from vector channel estimates according to

$$R_{xx} = \sum_{n=1}^{N} \hat{\mathbf{c}}_n \hat{\mathbf{c}}_n^{H} + \mathbf{I}\sigma_n^{2}$$

where $\hat{\mathbf{c}}_n$ is the vector channel estimate associated with the nth spatial stream, $(\cdot)^{H}$ is the conjugate transpose (Hermitian) operator, $\mathbf{I}$ is the identity matrix, and $\sigma_n^{2}$ is the thermal noise power on each antenna.

Co-author Sébastien Roy and colleague I. Laroche developed a novel matrix inversion circuit specifically for this application. The covariance matrix structure is exploited to jointly perform the inversions at all layers simultaneously, a complexity/speed gain of order NSS.
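The covariance construction described above (a sum of outer products of the channel estimates plus the noise power on the diagonal) can be illustrated with a small floating-point sketch. This is not the authors' VHDL implementation; the function name `covariance_matrix` and the list-of-lists representation are assumptions made for the example.

```python
def covariance_matrix(channel_estimates, noise_power):
    """Return R_xx = sum_n c_n c_n^H + noise_power * I.

    channel_estimates is a list of equal-length complex vectors, one per
    spatial stream; noise_power is the thermal noise power per antenna.
    """
    size = len(channel_estimates[0])
    r = [[0j] * size for _ in range(size)]
    for c in channel_estimates:
        for i in range(size):
            for j in range(size):
                # Outer product term c[i] * conj(c[j])
                r[i][j] += c[i] * c[j].conjugate()
    for i in range(size):
        r[i][i] += noise_power   # add sigma^2 on the diagonal
    return r
```

The result is Hermitian by construction, which is the kind of structure a dedicated inversion circuit can exploit.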

[Figure 1 diagram. Block labels: Scrambler, Encoder Parser, FEC Encoder, Stream Parser, Interleaver, QAM Mapping, Spatial Mapping (Optional), CS, IFFT, Insert GI and Windows, Analog and RF.]

Figure 1 – Transmitter block diagram

Figure 2 – Laval University MIMO testbed using Lyrtech VHS hardware

[Figure 3 diagram: Layered Space Time (LST) Decoder. Block labels: From FFTs, Vectorization, Vector Channel Estimator, Covariance Matrix Estimator, Matrix Inversion, MMSE Weight Computation, Stream Ordering, Optimum Combining, Detection/FEC Decoding, Interference Reconstruction, Interference Removal, Channel Buffer, Frame Buffer, Data Out.]

Figure 3 – LST decoder

Implementation Results

The entire receiver was implemented in VHDL on a Xilinx® Virtex™-II XC2V8000 FPGA. It achieves a top frequency of 88 MHz while utilizing 27% of the slices and 95% of the dedicated 18x18 multipliers. Implementation results for the matrix inversion unit by itself are listed in Table 1, which also shows results on a Xilinx Virtex-4 XC4VFX140 device, where the unit consumes 8% of the slices and 52% of the dedicated DSP48 multiply and accumulate (MAC) units with a top speed of 140 MHz.

Conclusion

MIMO techniques for 802.11 wireless LAN promise huge increases in bit rates. Several different approaches are still being debated by industry standardization bodies.

In collaboration between Laval University and Lyrtech, we implemented the LST technique, which exploits both spatial filtering and successive interference cancellation (SIC), in Virtex-II and Virtex-4 FPGAs on an 802.11n-oriented testbed. The next steps will be to optimize and test the LST implementation extensively over 2.4 GHz and 5 GHz channels, while the testbed will be extended to encompass other aspects of the upcoming 802.11n standard, including hardware-efficient LDPC codecs and beamforming techniques.


Space-Time Techniques

Given a single transmit antenna and two or more receive antennas, it is well known that spatial diversity can be exploited to improve link quality. Provided that the receive antennas are sufficiently spaced, each will observe the received signal through a different multipath channel. Therefore, the probability that all channels experience a deep fade simultaneously is very low, and you can exploit this diversity advantage in a number of ways. One simple method is to utilize a linear weight-and-sum structure, which co-phases and weighs each branch in proportion to its SNR before adding them up. This is called maximal-ratio combining (MRC) and it results in an optimal SNR in the presence of white noise only, without other sources of interference.

Space-Time Coding

The first and simplest of the space-time block coding schemes was proposed in 1998 by Alamouti as a clever means to procure the same diversity advantage as MRC when the antenna array is situated at the transmitter instead of the receiver.

The original code is designed for twin antennas at the transmitter and a single antenna at the receiver, although various generalizations exist for arbitrarily sized antenna arrays at both ends. A pair of symbols is combined and sent twice in succession over two signaling intervals as two vectors, which are spatially orthogonal. This allows the single antenna to decode each symbol with a diversity order of two.

Given a pair of symbols $s_1$ and $s_2$, the transmitter/coding matrix is:

$$X = \begin{bmatrix} s_1 & -s_2^{*} \\ s_2 & s_1^{*} \end{bmatrix}$$

where, consistent with the received-signal equations that follow, each row corresponds to a signaling interval and each column to an antenna. The received signals during the first and second signaling intervals are given by:

$$r_1 = h_1 s_1 - h_2 s_2^{*} + n_1$$

$$r_2 = h_1 s_2 + h_2 s_1^{*} + n_2$$

where $h_n$ is the channel (expressed as a complex gain according to baseband equivalent conventions) between the nth transmit antenna and the receiver, and $\{n_1, n_2\}$ are thermal white-noise components.

The receiver estimates $h_1$ and $h_2$ and decodes the first symbol by performing the following combination:

$$\begin{aligned} y_1 &= h_1^{*} r_1 + h_2 r_2^{*} \\ &= |h_1|^{2} s_1 - h_1^{*} h_2 s_2^{*} + h_1^{*} h_2 s_2^{*} + |h_2|^{2} s_1 + h_1^{*} n_1 + h_2 n_2^{*} \\ &= \left(|h_1|^{2} + |h_2|^{2}\right) s_1 + v_1, \end{aligned}$$

where $v_1 = h_1^{*} n_1 + h_2 n_2^{*}$ is the noise component after combining. Likewise, the second symbol is estimated as:

$$\begin{aligned} y_2 &= -h_2 r_1^{*} + h_1^{*} r_2 \\ &= \left(|h_1|^{2} + |h_2|^{2}\right) s_2 + v_2. \end{aligned}$$

Because each symbol’s amplitude is proportional to the sum of the individual channel powers, this is indeed equivalent to MRC.
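The combining step of the sidebar can be checked numerically. The sketch below follows the received-signal equations r1 = h1·s1 − h2·s2* + n1 and r2 = h1·s2 + h2·s1* + n2 and the combiners for y1 and y2; the function names are hypothetical, the channel gains are assumed to be known exactly, and the final division by |h1|² + |h2|² is added only so that the outputs can be compared directly with the transmitted symbols.

```python
def alamouti_encode(s1, s2):
    """Return the two transmitted vectors, one per signaling interval.

    Each tuple holds the symbols sent from transmit antennas 1 and 2.
    """
    return [(s1, -s2.conjugate()), (s2, s1.conjugate())]


def alamouti_receive(h1, h2, s1, s2, n1=0j, n2=0j):
    """Single-antenna received samples for the two signaling intervals."""
    (a1, a2), (b1, b2) = alamouti_encode(s1, s2)
    r1 = h1 * a1 + h2 * a2 + n1
    r2 = h1 * b1 + h2 * b2 + n2
    return r1, r2


def alamouti_combine(h1, h2, r1, r2):
    """Apply the y1 and y2 combiners and normalize by the channel power."""
    y1 = h1.conjugate() * r1 + h2 * r2.conjugate()
    y2 = -h2 * r1.conjugate() + h1.conjugate() * r2
    gain = abs(h1) ** 2 + abs(h2) ** 2
    return y1 / gain, y2 / gain
```

In the noise-free case the cross terms cancel exactly, so the normalized outputs reproduce s1 and s2 for any nonzero channel pair.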

Resource                     XC2V6000   XC2V8000   XC4VFX140
Slices                       17%        12%        8%
Four-Input LUTs              12%        8%         5%
Block RAM                    2%         2%         N/A
FIFO16/RAMB16                N/A        N/A        0%
MULT18x18                    70%        60%        N/A
DSP48                        N/A        N/A        52%
GCLK                         6%         6%         3%
Max Clock Frequency (MHz)    108.3      92.1       140.0

Table 1 – Implementation results for the matrix inversion unit on Virtex-II and Virtex-4 FPGAs


Layered Space-Time Architecture

The LST architecture was originally proposed by Foschini. With N antennas at both transmit and receive ends, this scheme allows the spatial multiplexing of N streams, while the receiver complexity grows only linearly with N.

Little or no processing is required at the transmitter. In the variant designated V-BLAST (Vertical Bell Labs LAyered Space-Time), each stream is simply assigned to a transmit antenna, so that for N=4 the transmission matrix is:

$$X = \begin{bmatrix} x_1^{1} & x_2^{1} & x_3^{1} & x_4^{1} & \cdots \\ x_1^{2} & x_2^{2} & x_3^{2} & x_4^{2} & \cdots \\ x_1^{3} & x_2^{3} & x_3^{3} & x_4^{3} & \cdots \\ x_1^{4} & x_2^{4} & x_3^{4} & x_4^{4} & \cdots \end{bmatrix}$$

where $x_k^{n}$ is the kth symbol in the nth stream.

D-BLAST (Diagonal BLAST) adds temporal staggering of the streams:

$$X = \begin{bmatrix} x_1^{1} & x_2^{1} & x_3^{1} & x_4^{1} & \cdots \\ 0 & x_1^{2} & x_2^{2} & x_3^{2} & \cdots \\ 0 & 0 & x_1^{3} & x_2^{3} & \cdots \\ 0 & 0 & 0 & x_1^{4} & \cdots \end{bmatrix}$$

Although D-BLAST can improve performance when combined with FEC of individual streams, it is spectrally inefficient because of the zeros introduced. Threaded space-time (TST) eliminates this problem by simply rotating the assignment of streams:

$$X = \begin{bmatrix} x_1^{1} & x_2^{4} & x_3^{3} & x_4^{2} & \cdots \\ x_1^{2} & x_2^{1} & x_3^{4} & x_4^{3} & \cdots \\ x_1^{3} & x_2^{2} & x_3^{1} & x_4^{4} & \cdots \\ x_1^{4} & x_2^{3} & x_3^{2} & x_4^{1} & \cdots \end{bmatrix}$$

The received signal vector at a given instant is:

$$\mathbf{r} = H\mathbf{x} + \mathbf{v}$$

where $H$ is the N x N channel matrix, $\mathbf{x}$ is a column of $X$, and $\mathbf{v}$ is the thermal noise vector. The received vectors corresponding to a whole frame are stored in a buffer for multi-pass processing. The individual received signals are then ordered according to their relative power. For the sake of simplicity and without loss of generality, the V-BLAST variety will be assumed henceforth, where each of the received signals corresponds directly to one of the streams at the transmitter.

Thus, the most powerful stream is detected first by passing the buffer contents through a linear combiner:

$$y_{(1)} = \mathbf{w}_{(1)}^{H}\,\mathbf{r}$$

where $(\cdot)^{H}$ denotes complex conjugate transposition. The weight vector $\mathbf{w}_{(1)}$ is typically chosen according to the zero-forcing criterion, which forces to zero the interfering contributions of the N-1 other signals in $y_{(1)}$. The vector is given as:

$$\mathbf{w}_{(1)} = R_{I(1)}^{+}\,\mathbf{h}_{(1)}$$

where $(\cdot)^{+}$ denotes the pseudo-inverse, and

$$R_{I(1)} = \begin{bmatrix} \mathbf{h}_{(2)} & \mathbf{h}_{(3)} & \cdots & \mathbf{h}_{(N)} \end{bmatrix} \begin{bmatrix} \mathbf{h}_{(2)}^{H} \\ \mathbf{h}_{(3)}^{H} \\ \vdots \\ \mathbf{h}_{(N)}^{H} \end{bmatrix}$$

is the interference covariance matrix from the point of view of the first stream, denoted by the subscript (1). The symbols can then be detected with a high degree of reliability.

Given an estimate $\hat{\mathbf{h}}_{(1)}$ of the corresponding channel, the first signal’s contribution can be subtracted from the received signal buffer. Processing then moves on to the next layer, which targets the second most powerful signal. The buffer contents are then $\mathbf{r} - \hat{\mathbf{h}}_{(1)}\hat{x}_{(1)}$, where $\hat{\mathbf{h}}_{(1)}$ is an estimate of $\mathbf{h}_{(1)}$ and $\hat{x}_{(1)}$ is an estimate of the corresponding transmitted stream $x_{(1)}$. It follows that

$$y_{(2)} = \mathbf{w}_{(2)}^{H}\left(\mathbf{r} - \hat{\mathbf{h}}_{(1)}\hat{x}_{(1)}\right).$$

The process continues until, at the last layer, only the weakest signal $x_{(N)}$ is left in the buffer. The contents are

$$\mathbf{r} - \sum_{n=1}^{N-1} \hat{\mathbf{h}}_{(n)}\hat{x}_{(n)} \approx \mathbf{r} - \sum_{n=1}^{N-1} \mathbf{h}_{(n)}x_{(n)} = \mathbf{h}_{(N)}x_{(N)} + \mathbf{v}.$$

Because in the absence of detection errors no interference is left at this point, MRC is used:

$$\mathbf{w}_{(N)} = \mathbf{h}_{(N)}$$

Only the interference-nulling capacity of the zero-forcing array combining is used for the strongest signal. On the other hand, detection of the weakest signal relies on successive interference cancellation. A combination of nulling and cancelling applies for the intermediate signals. An interesting aspect of this approach is that the weakest signal enjoys the highest degree of spatial diversity.
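The nulling-then-cancelling order of the layered receiver can be illustrated for the smallest case, N = 2. The sketch below is a simplification with several assumptions: the channel matrix is known exactly, streams are ordered by the power of their channel columns, symbols are QPSK, and the zero-forcing weights are taken from the inverse of the 2 x 2 channel matrix (a common textbook form, used here in place of the sidebar's pseudo-inverse expression). The function names are hypothetical.

```python
def qpsk_slicer(y):
    """Map a soft value to the nearest point of the set {+-1 +-1j}."""
    return complex(1 if y.real >= 0 else -1, 1 if y.imag >= 0 else -1)


def detect_two_streams(H, r, slicer=qpsk_slicer):
    """Detect two spatially multiplexed symbols by nulling, then cancelling.

    H[i][j] is the gain from transmit antenna j to receive antenna i and
    r is the length-2 received vector r = Hx + v.
    """
    columns = [(H[0][j], H[1][j]) for j in range(2)]
    power = [abs(c[0]) ** 2 + abs(c[1]) ** 2 for c in columns]
    first = 0 if power[0] >= power[1] else 1    # strongest stream first
    second = 1 - first

    # Layer 1: zero-forcing (nulling) via the matching row of H^-1.
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    inverse = [[H[1][1] / det, -H[0][1] / det],
               [-H[1][0] / det, H[0][0] / det]]
    x_first = slicer(inverse[first][0] * r[0] + inverse[first][1] * r[1])

    # Cancel the detected stream's contribution from the buffer.
    residual = [r[i] - columns[first][i] * x_first for i in range(2)]

    # Layer 2: no interference should remain, so use MRC with w = h.
    h = columns[second]
    y = (h[0].conjugate() * residual[0]
         + h[1].conjugate() * residual[1]) / power[second]
    x_second = slicer(y)

    detected = [None, None]
    detected[first], detected[second] = x_first, x_second
    return detected
```

The last layer combines both receive antennas coherently, which is why the weakest stream ends up with the most spatial diversity, provided the earlier decision was correct.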

by Tom Hill
Technical Marketing Engineer, DSP Tools Marketing
Xilinx, [email protected]

In a recent survey conducted by AccelChip Inc. (recently acquired by Xilinx), 53% of the respondents identified floating- to fixed-point conversion as the most difficult aspect of implementing an algorithm on an FPGA (Figure 1).

Although MATLAB is a powerful algorithm development tool, many of its benefits are reduced during the fixed-point conversion process. For example, new mathematical errors are introduced into the algorithm because of the reduced precision of the fixed-point arithmetic. You must rewrite code to replace high-level functions and operators with low-level models that reflect the actual hardware macro-architecture. And simulation run times can be as much as 50 times longer. For these reasons, MATLAB, the overwhelming choice for algorithm development, is often abandoned in favor of C/C++ for fixed-point modeling.

Floating- to Fixed-Point Conversion of MATLAB Algorithms Targeting FPGAs

Accelerating fixed-point model generation and verification using the AccelDSP Synthesis tool.

Generating Fixed-Point Models

The fixed-point representation of a floating-point MATLAB algorithm will not truly reflect the response of the final hardware if the high-level functions and operators are not replaced with hardware-accurate macro-architectures (Figure 2).

This is highlighted in Figure 3, which compares the fixed-point response of a MATLAB divide operator against a hardware-implementable CORDIC divide algorithm using a random set of input vectors quantized to 8-bit signed two's complement. Depending on the data values, a significant divergence exists between the calculated outputs.

During the fixed-point generation process, the AccelDSP™ Synthesis tool’s IP Explorer™ technology will automatically replace high-level MATLAB functions and operators with hardware-accurate representations (Figure 4). This step is transparent and does not require MATLAB code modifications. You can redefine the initial macro- and micro-architecture selections by using a synthesis directive.

Once these operations have been replaced with hardware-accurate macro-architectures, the process of quantization may begin.

Graphically Assisted Auto-Quantization

The FPGA fabric, unlike a fixed-point DSP processor, allows for variable, fixed-point word lengths. By not limiting a variable to a fixed 16- or 24-bit boundary, you can perform arithmetic calculations requiring bit growth without incurring additional numeric error. This is a tremendous advantage for applications such as radar, navigation, and guidance systems that require a high degree of numeric precision.

In most cases bit growth rules are straightforward and well understood. The result of an addition, for example, grows by one bit and the result of a multiplication grows to a length equal to the sum of the input word lengths (Figure 5). Making these determinations for variables in an actual design, however, is a highly iterative process. Allowing unchecked bit growth to occur is expensive in hardware and generally unnecessary. If you’re savvy, you can employ a variety of techniques to minimize word lengths while preserving numerical accuracy.

The process of determining an initial quantization value for a variable and the subsequent refinement of that value is well suited for automation. The AccelDSP Synthesis tool includes automated floating- to fixed-point conversion in which the floating-point MATLAB model is analyzed during simulation to determine the dynamic range requirements of the input data and constants. These values provide the starting points to an auto-quantization process that then draws upon a wealth of built-in experience, gained from more than 6,000 designs, to determine optimal word lengths for the downstream variables.

The initial fixed-point model obtained through auto-quantization provides a good starting point, but refinements to the model are generally necessary. This process is highly iterative and tightly coupled to analysis of the data effects. To minimize this iteration cycle time, the AccelDSP Synthesis tool provides an accelerated fixed-point simulation flow.

[Figure 1 chart. Categories: Floating-Point to Fixed-Point Conversion, RTL Generation, Test Bench Generation, Using Third-Party IP. Source: AccelChip Survey.]

Figure 1 – AccelChip DSP design challenges survey

[Figure 2 diagram: a divide operation with inputs a and b, written in MATLAB as y = a/b with the "/" operator, replaced by CORDIC, lookup table, or bipartite table implementations.]

Figure 2 – Replacing built-in operators and functions

Figure 3 – Fixed-point response of the MATLAB “/” versus CORDIC

[Figure 4 diagram: IP Explorer maps the divide operation a/b onto a library of hardware macro-architectures that includes CORDIC, Newton-Raphson, and Goldschmidt.]

Figure 4 – Automatic hardware-accurate IP insertion

[Figure 5 diagram: a chain of two additions and a multiplication annotated with formats Fixed [10 7], Fixed [11 7], Fixed [12 7], and Fixed [22 7], with the constant "1.3" quantized to unfixed [11 10].]

Figure 5 – Fixed-point bit growth
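The bit-growth rules stated above (an addition grows by one bit, a multiplication grows to the sum of the input word lengths) can be written down as a short sketch. The class `FixedFormat` and the two helper functions are hypothetical, and the reading of a format such as "Fixed [10 7]" as [word length, fraction length] is an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedFormat:
    word_length: int       # total number of bits
    fraction_length: int   # bits to the right of the binary point


def add_format(a: FixedFormat, b: FixedFormat) -> FixedFormat:
    """Full-precision format of a + b: align binary points, add one bit."""
    fraction = max(a.fraction_length, b.fraction_length)
    integer = max(a.word_length - a.fraction_length,
                  b.word_length - b.fraction_length)
    return FixedFormat(integer + fraction + 1, fraction)


def multiply_format(a: FixedFormat, b: FixedFormat) -> FixedFormat:
    """Full-precision format of a * b: word and fraction lengths both add."""
    return FixedFormat(a.word_length + b.word_length,
                       a.fraction_length + b.fraction_length)
```

Chaining these rules through a datapath shows how quickly unchecked growth accumulates, which is why intermediate results are usually requantized at chosen points.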

Analyzing Fixed-Point Data Effects

MATLAB provides a highly efficient environment for developing the mathematics of an algorithm, which you can generally accomplish with a small set of simulation vectors. When targeting that algorithm to fixed-point hardware, however, you will need increased data sets to accurately determine the real-world environment response. MATLAB, which is an interpreted simulator, may not provide the necessary performance for these larger, more CPU-intensive fixed-point simulations. For this, developers often turn to C/C++.

Accelerated Fixed-Point Simulation

The AccelDSP Synthesis tool’s M2C-Accelerator automatically generates a hardware-accurate fixed-point C++ model and test bench to accelerate fixed-point simulations. Eliminating the manual recoding step saves development time and minimizes the introduction of errors. Because C++ is compiled, it can provide as much as a 1000x simulation performance advantage (Figure 6). This level of performance is often necessary for the large vector sets required to understand fixed-point data effects.

If you wish to continue using the MATLAB visualization environment, including the plotting features, M2C-Accelerator also generates a fixed-point C/C++ DLL that can be simulated with the original MATLAB test bench script file.

When you have obtained the initial fixed-point results, the process of analysis and refinement can begin. The AccelDSP Synthesis tool provides a set of graphical aids, including tabulated reports, variable probes, and plots to assist in this process.

Observing Fixed-Point Bit Growth

A design must be considered in its entirety to effectively convert a floating-point algorithm into a fixed-point model. If left unchecked early in the datapath, bit growth can quickly escalate to produce unreasonable hardware, while overly constrained bit growth may result in an unacceptable loss of numeric accuracy. A common technique to gain better observability into bit-growth progression is to enter the variables into a spreadsheet. The AccelDSP Synthesis tool provides this same level of observability by generating a tabular, formatted Fixed Point Report (Figure 7).
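The kind of bit-growth bookkeeping that a spreadsheet captures can be sketched in a few lines. The example below is a minimal sketch, not the AccelDSP inference rules: it applies the textbook full-precision rules for signed fixed-point addition and multiplication, and the `Fixed` class and both function names are hypothetical. A format is written as (word length, fraction length).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fixed:
    """Signed fixed-point format: total word length and fraction length, in bits."""
    word: int
    frac: int

    @property
    def integer(self) -> int:
        # Integer bits, including the sign bit.
        return self.word - self.frac

def add_growth(a: Fixed, b: Fixed) -> Fixed:
    """Full-precision sum: align the binary points, then add one carry bit."""
    frac = max(a.frac, b.frac)
    integer = max(a.integer, b.integer) + 1
    return Fixed(integer + frac, frac)

def mul_growth(a: Fixed, b: Fixed) -> Fixed:
    """Full-precision product: word lengths add and fraction lengths add."""
    return Fixed(a.word + b.word, a.frac + b.frac)

x = Fixed(10, 7)
s1 = add_growth(x, x)    # first adder
s2 = add_growth(s1, x)   # second adder
p = mul_growth(s2, x)    # multiplier
print(s1, s2, p)
```

Under these assumed rules, two additions grow a 10-bit word to 11 and then 12 bits, and a single multiplication almost doubles it to 22 bits, which is why growth left unchecked early in the datapath becomes expensive so quickly.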

Before optimizing the hardware, you must obtain an acceptable fixed-point response. If the signal-to-noise ratio (SNR) of an output is not above a desired specification, then adjustments to the inferred quantization values are required. This process typically starts by looking for gross errors caused by variable overflows and underflows.
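As an illustration of the SNR check described above, the following minimal sketch quantizes a test signal to several fraction lengths and reports the SNR of each fixed-point version against the floating-point reference. The `quantize` and `snr_db` helpers, the sine-wave stimulus, and the chosen fraction lengths are all assumptions made for the example; they are not part of the AccelDSP flow.

```python
import math

def quantize(x: float, frac_bits: int) -> float:
    """Round x to a multiple of 2**-frac_bits (ties to even, no saturation)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def snr_db(reference, test) -> float:
    """SNR in dB of `test` against `reference`; inf if the two are identical."""
    signal = sum(r * r for r in reference)
    noise = sum((r - t) ** 2 for r, t in zip(reference, test))
    return math.inf if noise == 0 else 10.0 * math.log10(signal / noise)

# Floating-point reference: four periods of a sine wave.
reference = [math.sin(2 * math.pi * n / 64) for n in range(256)]

for frac_bits in (4, 8, 12):
    fixed = [quantize(v, frac_bits) for v in reference]
    print(frac_bits, round(snr_db(reference, fixed), 1))
```

Each additional fraction bit should raise the SNR, so a sweep like this gives a quick sense of how far a quantization value is from the desired specification.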

Overflows and Underflows

Poor assumptions about the dynamic range of the input data can lead to large fixed-point errors caused by overflowing the most significant bit (MSB) and (to a lesser degree) underflowing the least significant bit (LSB) of a variable. You will need to address these errors first before observing and correcting more subtle fixed-point errors.

Overflow and underflow reporting, inherent to MATLAB fixed-point data types, are not native to C/C++ and are often sacrificed during the model rewrite. The C++ models generated by M2C-Accelerator, however, include quantization routines that report all overflows and underflows encountered during a simulation. When these conditions occur, they are summarized in the “Verify FixedPoint Report” (Figure 8).
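To make the idea of a reporting quantization routine concrete, here is a minimal sketch of a quantizer that saturates and keeps count of range events. It is written in Python for brevity and is not the code that M2C-Accelerator generates; the `ReportingQuantizer` class, its saturating behavior, and its definition of underflow (a nonzero input that rounds to zero) are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class ReportingQuantizer:
    """Signed fixed-point quantizer that saturates and counts range events."""
    word: int
    frac: int
    overflows: int = 0
    underflows: int = 0

    def __call__(self, x: float) -> float:
        scale = 1 << self.frac
        lo = -(1 << (self.word - 1))
        hi = (1 << (self.word - 1)) - 1
        code = round(x * scale)
        if code > hi or code < lo:
            self.overflows += 1            # value does not fit below the MSB
            code = max(lo, min(hi, code))  # saturate to the nearest limit
        elif code == 0 and x != 0.0:
            self.underflows += 1           # nonzero value lost below the LSB
        return code / scale

q = ReportingQuantizer(word=8, frac=4)     # range [-8, 7.9375], step 0.0625
outputs = [q(v) for v in (1.5, 9.0, -20.0, 0.01, 0.0)]
print(outputs, q.overflows, q.underflows)
```

Running a full vector set through such a routine and printing the counters per variable gives the same kind of summary that a verification report tabulates.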

Once you have addressed any overflow and underflow issues, the refinement of the fixed-point model becomes more dependent on visualization. If additional fixed-point numeric errors persist, then you must analyze the effects of constants. Otherwise, you can continue the process of refining the hardware by reducing variable bit widths. In both cases, knowing the fixed-point error introduced by the quantization of a particular variable is a valuable aid in the refinement process.

Figure 6 – FFT example simulation run times (time in seconds, on a logarithmic axis from 10 to 100,000, for each simulation model format: MATLAB Script & Model & Quantizer; MATLAB Script, AccelChip M2C & Quantizer; C++ Script, AccelChip M2C & Quantizer)

Fixed-Point Visualization

Determining the appropriate fixed-point response of an algorithm to a given set of data is generally not an exact science. You will often have to make compromises in numerical accuracy to improve hardware efficiency. This process is highly iterative and tightly coupled to a visual analysis of the fixed-point effects displayed in plots. Observing an unacceptable SNR on an output signal, however, does not always indicate where a quantization value has been incorrectly specified. For that, additional analysis is necessary.

To assist in this process, AccelDSP Synthesis’s AccelProbe graphically compares the floating- and fixed-point values for any variable during a given simulation (Figure 9). With AccelProbe, you quickly gain a sense of how much a particular variable contributes to the cumulative error of the final result. You can “probe” a variable by adding the statement accel_probe(variable_name) to the MATLAB source.

The “Fixed-Point Histogram” plot gives you a sense of how often a value may be encountered during simulation. The additional hardware required to store a value in the upper or lower dynamic range may be of little value if that value rarely occurs.
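The reasoning behind such a histogram can be sketched as follows. This is a minimal illustration rather than the plot AccelDSP produces: the `msb_histogram` function and the sample data are hypothetical, and the bins are simply powers of two so that each bin corresponds to one more bit of dynamic range.

```python
import math
from collections import Counter

def msb_histogram(samples):
    """Count samples by the power-of-two bin that holds their magnitude.

    Bin k holds values with 2**(k-1) <= |x| < 2**k; zeros are skipped.
    """
    bins = Counter()
    for x in samples:
        if x != 0:
            bins[math.floor(math.log2(abs(x))) + 1] += 1
    return dict(sorted(bins.items()))

# Mostly small values, plus one rare large excursion.
samples = [0.3, 0.4, 0.6, 0.7, 0.9, 1.2, 1.5, 0.8, 0.55, 12.0]
print(msb_histogram(samples))
```

In this made-up data set a single sample lands three bins above the rest, so covering it without saturation would cost three extra integer bits on that variable. Whether that is worthwhile depends on how much error the occasional saturated value contributes to the output.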

Conclusion

When inventing the mathematics of a DSP algorithm, MATLAB is the natural choice and should be used unencumbered by hardware considerations. Converting an algorithm into a fixed-point model for implementation on an FPGA is an involved process that benefits greatly from the automation, acceleration, and visualization offered by the AccelDSP Synthesis tool.

For more information about the AccelDSP Synthesis tool, visit www.xilinx.com/dsp.


Figure 9 – Accel probe plot for a variable

Figure 8 – AccelDSP Verify FixedPoint Report

Figure 7 – AccelDSP Synthesis Fixed Point Report for an adaptive filter


BAE Systems Proves the Advantages of Model-Based Design

BAE Systems achieved an 80% reduction in software-defined radio development time with The MathWorks and Xilinx tools.

by Jack Wilber
Technical Writer
The MathWorks

Engineers at BAE Systems, engaged in the design and development of embedded systems involving FPGAs, are highly skilled in the well-established, traditional design process based on hand-coded VHDL.

Recently, they seized a rare opportunity to directly compare their process with a new approach built on model-based design. While developing a software-defined radio (SDR) waveform for MIL-STD-188 satellite communications, they ran two development efforts in parallel – one using the traditional approach, the other using model-based design with MathWorks and Xilinx® tools. They discovered that model-based design reduced development time by more than 80%.

Traditional Design Flow vs. Model-Based Design

The traditional design flow at BAE Systems involves three distinct phases (Figure 1):

• The system engineering phase involves translating a set of system requirements to a system architecture and a waveform design to be implemented, for which a model is constructed and performance is verified against system requirements.

• The hardware engineering phase produces a second model of the algorithm, this time by hand-coding in VHDL, and requires a second round of simulation, debugging, and analysis to verify that this implementation matches the system engineering model.

• The physical design phase involves converting the VHDL behavioral model to an FPGA-compatible netlist, integrating onto the FPGA hardware, and verifying that operation on the part matches expected behavioral performance.

Ideally, all of the information and detail defining the algorithm is carried forward from the system engineers to the hardware engineers who will implement the design, but there is inevitably some loss in this transfer. Engineers attempt to fully capture the detail of the design as a hardware specification and as input/output test vectors. Not only is this process quite lengthy, but it is also prone to error.

To overcome these limitations, engineers used model-based design with Simulink from The MathWorks and Xilinx System Generator to build a model that becomes an executable specification of the system for development teams to follow. This approach eliminates the need to pass written documentation to a software team for VHDL coding and reduces the three phases of the traditional design approach to two phases (Figure 2):

• The system and algorithm design phase now encompasses both the system and hardware engineering phases of the original process.

• The physical design phase, as before, focuses on integration and verification of performance on the hardware.


Figure 1 – The three phases of the traditional design flow at BAE Systems. System engineering (first design and debug): 1. Specification, 2. Model, 3. Simulate, 4. Analyze. Hardware engineering (second design and debug, working from the hardware spec and test vectors): 5. HDL Coding, 6. HDL Simulation, 7. HDL Analysis. Physical design: 8. Synthesis, 9. Place and Route, 10. Static Timing Analysis, 11. System Integration, 12. System Test.

Figure 2 – Model-based design eliminates hand-coding in VHDL. System and algorithm design (design and debug): 1. Specification, 2. Model, 3. Simulate, 4. Analyze. Physical design: 5. Synthesis, 6. Place and Route, 7. Static Timing Analysis, 8. System Integration, 9. System Test.

An Experiment in Concurrent Development

Two groups worked in parallel to develop the SDR waveform’s signal-processing chain, hardware interfaces, and clocking. One group, led by Robert Regis, a senior engineer at BAE Systems with more than 15 years of experience in VHDL development, used the traditional workflow. The other, led by Sean Gallagher, a senior engineer at Xilinx, used model-based design.

Each group tracked the hours spent on the following tasks:

• Algorithm interface specification and documentation

• Module design definition

• Modeling, simulation, and design verification

• VHDL coding

• VHDL code behavioral verification

• Hardware integration and lab testing

Regis and Gallagher each implemented the same subset of MIL-STD-188-165a (Figure 3). In this design, the transmitted data passes through a scrambler, differential encoder, Reed-Solomon encoder, matrix interleaver, convolutional encoder, and quadrature amplitude modulation (QAM) modulator to produce baseband complex samples, which in the transmitter are passed to a pair of digital-to-analog converters (DAC) through low-voltage differential signal (LVDS) serial links.

The receive chain reverses these steps: the complex baseband samples received by a pair of analog-to-digital converters (ADC) pass through a QAM demodulator, Viterbi decoder, matrix deinterleaver, Reed-Solomon decoder, differential decoder, and descrambler to produce a serial bitstream. A sync detect function identifies Reed-Solomon block boundaries.
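The way each receive block undoes its transmit counterpart can be illustrated with the simplest pair in the chain, the differential encoder and decoder. The sketch below uses a generic binary differential code; it is an assumption for illustration and is not claimed to match the exact definition in MIL-STD-188-165a or either team's implementation.

```python
def differential_encode(bits, initial=0):
    """y[n] = x[n] XOR y[n-1]: information is carried by bit transitions."""
    out, prev = [], initial
    for b in bits:
        prev ^= b
        out.append(prev)
    return out

def differential_decode(bits, initial=0):
    """x[n] = y[n] XOR y[n-1]: the inverse of differential_encode."""
    out, prev = [], initial
    for b in bits:
        out.append(b ^ prev)
        prev = b
    return out

data = [1, 0, 1, 1, 0, 0, 1]
encoded = differential_encode(data)
decoded = differential_decode(encoded)

# Invert every encoded bit, as a polarity ambiguity in the link would.
inverted = [b ^ 1 for b in encoded]
recovered = differential_decode(inverted)
print(decoded == data, recovered[1:] == data[1:])
```

Because the data rides on transitions rather than absolute levels, the decoder recovers everything after the first bit even when the whole stream arrives inverted.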

Both groups had access to an equivalent set of pre-validated components (IP cores): Regis used existing VHDL code for the Reed-Solomon encoder, while Gallagher used a Reed-Solomon encoder block available within System Generator. Similarly, both had access to a Reed-Solomon decoder, Viterbi decoder, and interleavers. In Regis’s case these were available as IP cores, and in Gallagher’s case they were incorporated through the inclusion of an associated System Generator block that, in turn, referenced and instantiated a Xilinx IP core.


Figure 3 – A generic satellite communications transceiver

Figure 4 – Simulink model of the satellite communications transceiver

Model-Based Design with Simulink and System Generator

BAE Systems engineers modeled and simulated the transceiver waveform using Simulink and the Communications Blockset (Figure 4). They used frame-based processing in the model to increase simulation speed. (Frame-based connections cause the model to pass an entire frame, or packet of data, between blocks, thereby reducing block execution management overhead and increasing simulation speed.)

The Simulink model gave engineers the flexibility to implement the waveform on a range of targets – for example, they could have used Real-Time Workshop from The MathWorks to automatically generate code for a DSP implementation that met the performance requirements of the original specification. To ensure a valid comparison with the traditional design flow, however, they implemented the design on an FPGA.

The Simulink model was handed off to Gallagher, who used it as a reference design in building the equivalent model in Xilinx System Generator. Gallagher used existing Xilinx blocks for Reed-Solomon encoding, matrix interleaving, and Viterbi decoding. He built other high-level blocks for which there was no direct substitute using lower-level Xilinx blocks (Figure 5).

Gallagher used scopes and bit-error-rate meters to debug the model and verify operational performance before using Xilinx System Generator to automatically generate VHDL and synthesize the FPGA.

The waveform involves the processing of single bits or scalar, complex values produced by the QAM modulator. In the receiver, real soft symbols produced by the QAM demodulator are represented using finite-precision values. Unlike the Simulink model, which uses double-precision floating-point values, the System Generator model therefore required an explicit choice of scaling for these values. Gallagher selected the scaling parameters by trial and error, reducing bit widths until they were as small as possible without a significant impact on performance.

Regis’s team required 645 hours to complete the design and development of the signal-processing chain (Table 1). Gallagher needed only 46 hours, a reduction in development time greater than 10 to 1. Factoring in equal amounts of hardware integration and lab testing – estimated at 137 hours – the end-to-end improvement is still greater than 4 to 1.
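These ratios can be reproduced directly from the column totals in Table 1. The short calculation below is only a worked check of the arithmetic; the dictionary keys are shorthand labels chosen for this example, and the 137 hours of integration and lab testing is applied to both approaches, as the text assumes.

```python
# Column totals from Table 1, in hours (integration and lab testing excluded).
traditional = {"spec": 120, "module": 158, "modeling": 0,
               "vhdl_coding": 135, "vhdl_verification": 232}
model_based = {"spec": 4, "module": 7, "modeling": 34.5,
               "vhdl_coding": 0, "vhdl_verification": 0}
integration = 137  # hardware integration and lab testing, assumed equal for both

trad = sum(traditional.values())
mbd = sum(model_based.values())
design_ratio = trad / mbd
end_to_end_ratio = (trad + integration) / (mbd + integration)

print(trad, mbd, round(design_ratio, 1), round(end_to_end_ratio, 2))
```

The design and development totals come to 645 and 45.5 hours, a ratio of about 14 to 1, and adding 137 hours to each side gives 782 against 182.5 hours, or roughly 4.3 to 1.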

David Haessig, a senior BAE Systems technical staff member who was involved in both efforts, points out that in addition to the obvious time savings provided by automatic code generation, model-based design accelerated the project by shifting debugging and analysis forward in the schedule. “When following the traditional design flow process, a large portion of the simulation and debugging work tends to occur later in the design, during VHDL coding,” Haessig says. “With model-based design the model defines the code, and you are therefore obliged to include in the model every detail needed to define the waveform.

“Typically the model is built and tested incrementally, and you deal with the bugs and the algorithmic issues as they occur. Debugging is handled almost entirely during the modeling phase of the design, with a bit-true, cycle-true model. With access to Simulink and MATLAB data visualization tools, bugs are much easier to identify and fix prior to VHDL. The alternative, debugging the VHDL code, is more difficult and tedious.”

Further Advantages of Model-Based Design

BAE Systems identified three additional factors that contributed to the substantial reduction in development time achieved using model-based design: clocking, defect discovery, and component interfaces.


Figure 5 – Simulink model of the satellite communications transceiver using Xilinx blocks (block labels include xlrsencode(126, 112), xlrsdecode(126,112), xlviterbi, and xlinterleaver)

In the traditional design flow, Regis spent substantial time hand-calculating a combination of clock enables and multiple clock domains necessary for generating the odd sample rates associated with Reed-Solomon encoding and decoding. In contrast, Gallagher relied on Xilinx System Generator to automatically derive a common clock at the highest rate and build enabling logic to throttle multiple rates.
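The idea of one fast clock throttled by enables can be sketched with a simple accumulator. This is a behavioral illustration only, not the logic that System Generator produces: the `clock_enables` function is hypothetical, and the 112-out-of-126 ratio is an assumed example of the kind of odd rate a Reed-Solomon code with 112 data symbols and 14 parity symbols per block would create.

```python
def clock_enables(num: int, den: int, cycles: int):
    """Enable pattern that is active on num out of every den fast-clock cycles."""
    acc, pattern = 0, []
    for _ in range(cycles):
        acc += num
        if acc >= den:
            acc -= den           # wrap the accumulator and fire the enable
            pattern.append(1)
        else:
            pattern.append(0)    # hold: downstream registers keep their value
    return pattern

# Assumed example: the data side runs at 112/126 of the coded symbol rate.
rs_pattern = clock_enables(112, 126, 126)

# Integer case: a quarter-rate domain fires once every four fast cycles.
quarter = clock_enables(1, 4, 12)
print(sum(rs_pattern), quarter)
```

Every register in the slower domain uses the same fast clock and simply ignores the cycles on which its enable is low, which avoids crossing between separately generated clock domains.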

As reflected in the time logged for verification in each approach, finding and repairing defects was greatly simplified in the approach using model-based design. “With Simulink and System Generator, the model is directly connected to the resulting code, which enables you to discover bugs at the modeling stage using source and sink blocks, not at the VHDL behavioral test stage using test benches,” Haessig explains.

Regis notes that while he had to carefully read the interface specifications of each block he used, Gallagher could wire blocks together with relative ease. “With System Generator, it is amazing how easily the IP blocks are connected together. There is no need to study data sheets or clocking and control options. This may be the most underrated aspect of model-based design.”

Looking Ahead: SCA-Compliant SDR

Based on the results of the experiment, Haessig expects model-based design with MathWorks tools to become an integral part of the BAE Systems software development process.

In addition, BAE Systems is exploring ways to use model-based design tools to enable waveform portability by automatically generating software communications architecture (SCA)-compliant software and firmware. Joint Tactical Radio System (JTRS) radios must follow SCA as a standard for achieving waveform portability. Responding to a request from the Joint Program Executive Office (JPEO), BAE Systems is involved in the deployment of waveforms that JPEO can use in current and future radio systems without redeveloping the waveform components. The same SDR code will be able to run on new hardware platforms without modification.

BAE Systems is also working with The MathWorks, Virginia Tech, Xilinx, and Zeligsoft to create an interface that will enable code generated by Real-Time Workshop and System Generator to be directly incorporated into SCA-compliant radios. According to Haessig, “This initiative holds great potential for reducing time to market and the cost of JTRS radio development, allowing seamless transition from simulation to an SCA-compliant implementation; increasing reusability and portability of components; and expanding the lifespan of source code.”

For more information, please visit the aerospace and defense (www.mathworks.com/industries/aerospace/) and communications (www.mathworks.com/industries/comms/) sections of The MathWorks website, or contact Dan Raun at [email protected], (508) 647-7098 or Mike McHenry at [email protected], (508) 647-7858.


Table 1 – Table of results

Traditional Approach (hours worked)

| Module | Algorithm Interface Specification/Documentation | Module Design Definition | Modeling, Simulation, and Design Verification | VHDL Coding | VHDL Code Behavioral Verification | Hardware Integration & Lab Testing | Notes |
|---|---|---|---|---|---|---|---|
| Reed-Solomon RS Encode | 40 | 40 | 0 | 40 | 60 | 20 | Integrate purchased IP |
| Reed-Solomon Decode | 20 | 80 | 0 | 60 | 100 | 20 | Integrate purchased IP |
| Scrambler / Descrambler | 1 | 1 | 0 | 1 | 6 | 3 | |
| Convolutional Encode | 1 | 1 | 0 | 1 | 1 | 1 | |
| Viterbi Decode | 8 | 8 | 0 | 8 | 16 | 24 | Integrate inhouse IP, development not shown |
| Differential Encoder / Decoder | 1 | 1 | 0 | 1 | 4 | 2 | |
| Interleaver / Deinterleaver | 40 | 16 | 0 | 16 | 36 | 60 | |
| PSK Modulator (2,4,8) | 5 | 5 | 0 | 4 | 3 | 3 | |
| RS Frame Sync | 4 | 6 | 0 | 4 | 6 | 4 | |
| TOTALS: | 120 | 158 | 0 | 135 | 232 | 137 | 782 overall |

Rapid Development Approach (hours worked)

| Module | Algorithm Interface Specification/Documentation | Module Design Definition | Modeling, Simulation, and Design Verification | VHDL Coding | VHDL Code Behavioral Verification | Hardware Integration & Lab Testing |
|---|---|---|---|---|---|---|
| Reed-Solomon RS Encode | 1 | 0.25 | 2 | 0 | 0 | * |
| Reed-Solomon Decode | 1 | 0.5 | 3 | 0 | 0 | * |
| Scrambler / Descrambler | 0 | 0.25 | 3 | 0 | 0 | * |
| Convolutional Encode | 0 | 0.25 | 1.5 | 0 | 0 | * |
| Viterbi Decode | 0 | 0.5 | 2 | 0 | 0 | * |
| Differential Encoder / Decoder | 0 | 0.25 | 1 | 0 | 0 | * |
| Interleaver / Deinterleaver | 0 | 0.5 | 2 | 0 | 0 | * |
| PSK Modulator (2,4,8) | 1 | 0.5 | 4 | 0 | 0 | * |
| RS Frame Sync | 1 | 4 | 16 | 0 | 0 | * |
| TOTALS: | 4 | 7 | 34.5 | 0 | 0 | |

Overall total for the rapid development approach: 45.5


Achieving High-Bandwidth DSP Simulations Using Ethernet Hardware-in-the-Loop

System Generator v8.1 offers a new Gigabit Ethernet hardware-in-the-loop interface that enables high-bandwidth co-simulation using the Xilinx ML402 FPGA platform.

by Ben Chan
Software Engineer II
Xilinx, [email protected]

Nabeel Shirazi
Senior Staff Software Engineer
Xilinx, [email protected]

Jonathan Ballagh
Staff Software Engineer
Xilinx, [email protected]

Designers often struggle with lengthy simulation times when designing large FPGA signal-processing systems. FPGA design tools such as Xilinx® System Generator for DSP aim to address this challenge by providing robust hardware-in-the-loop interfaces that allow you to bring FPGA hardware directly into your design simulations.

By emulating a portion of your design in hardware, these interfaces enable considerable simulation acceleration – typically by an order of magnitude or more. Using hardware-in-the-loop also brings you real-time FPGA hardware debugging and verification capabilities.

System Generator for DSP offers hardware-in-the-loop interfaces for many types of FPGA development platforms. These platforms typically expose different types of physical interfaces through which the PC communicates with the FPGA hardware. For example, a JTAG co-simulation interface allows any FPGA board with a JTAG header and Xilinx FPGA to be co-simulated inside System Generator for DSP. Other boards, such as the XtremeDSP™ Development Kit, communicate over a PCI bus connection. Until recently, co-simulation of systems with high memory bandwidth and throughput requirements (such as video and image processing) was limited exclusively to development boards that connected directly to a PC using PCI or PCMCIA interfaces.

Co-Simulation over Ethernet

System Generator for DSP 8.1 includes a new Ethernet co-simulation interface that for the first time brings high-bandwidth co-simulation capabilities to the Xilinx ML402 Evaluation Platform. The ML402 board connects to your PC either directly using a standard Ethernet cable or remotely across a network.

At the heart of the interface is the Xilinx tri-mode Ethernet MAC core, which supports operation in 10/100/1000 Mbps half- and full-duplex modes. When you generate a design using the Ethernet hardware co-simulation interface, System Generator for DSP automatically wraps your design with the logic necessary to communicate with the FPGA over an Ethernet connection during simulation (Figure 1).

Figure 1 – Block diagram of FPGA fabric using Ethernet hardware co-simulation interface

You may generate a design for Ethernet hardware co-simulation by double-clicking on the System Generator block in any design to open its configuration parameters box. From the compilation menu, select the ML402/Ethernet compilation (see Figure 2) under the hardware co-simulation menu. You can choose between two different modes of Ethernet co-simulation.

Figure 2 – Selecting Ethernet hardware co-simulation as the System Generator compilation type

Network-Based Co-Simulation

A network-based interface allows you to co-simulate FPGA hardware that is attached to a standard IPv4 network. Because these networks are virtually ubiquitous, the network-based interface provides a convenient way to reach a remote FPGA development board connected to a wired or wireless network. The interface manages the details of communication and error handling (retransmissions after packet loss) behind the scenes. System Generator for DSP uses the IP address of an ML402 board to determine which platform to communicate with during co-simulation (Figure 3).

Figure 3 – Specifying the IP address of an ML402 board for Ethernet hardware co-simulation

Point-to-Point Co-Simulation

The second Ethernet co-simulation mode is a point-to-point interface that uses raw Ethernet frames to enable high-bandwidth communication with the ML402 board over the data link layer. In contrast to the network-based counterpart, the point-to-point interface focuses on low-level communications on a local Ethernet segment. Co-simulation data is transmitted across a standard UTP Ethernet cable that connects the ML402 board directly to the PC. This means that you must have an available Ethernet jack exposed on your PC to make the connection.

The point-to-point interface supports the Gigabit Ethernet standard, which when configured to use jumbo frames substantially bolsters the performance of large data transfers. Using this interface allows you to co-simulate even the most bandwidth-intensive applications.

Device Configuration

Both Ethernet co-simulation interfaces support a novel approach to device configuration, using the Xilinx System ACE™ solution to support configuration over an Ethernet cable. The configuration process is performed over the same Ethernet connection used for co-simulation, thus eliminating the need for a second programming cable (such as a Xilinx Parallel Cable IV or Platform Cable USB). A CompactFlash card is installed on the ML402 board and contains a special boot-loader image that is automatically loaded into the FPGA at power up. This image allows the FPGA to be reconfigured with new FPGA co-simulation bitstreams that are transferred over the Ethernet cable at the start of a simulation. The entire configuration process is handled transparently by System Generator for DSP.

Design Example

A 5x5 filter operator design model named conv5x5_video_ex is included with the System Generator for DSP 8.1 software tool. This design shows how a 2D image filter can be realized efficiently using n-tap MAC FIR filters. The System Generator for DSP top-level design is shown in Figure 4.

Also included with the design is a hardware co-simulation test bench for streaming a looped video sequence through the 5x5 kernel at real-time frame rates. During each simulation cycle, individual video frames are transmitted to the FPGA for processing. Once in the FPGA, each frame is filtered using a 5x5 kernel, and then transmitted back to the PC for analysis in Simulink. Two Simulink Matrix Viewer blocks display the unfiltered and filtered images during simulation. Data flow through the test bench is shown in Figure 5.
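The decomposition of a 2D filter into 1-D multiply-accumulate FIR filters can be sketched in software. The example below is a minimal behavioral sketch and not the conv5x5_video_ex design: the function names, the small 8x8 test image, and the box kernel are assumptions, the kernel is applied without flipping (correlation form), and only the "valid" interior of the image is produced.

```python
def fir_row(row, taps):
    """'Valid' 1-D FIR over one image row: one multiply-accumulate per tap."""
    n = len(taps)
    return [sum(taps[k] * row[i + k] for k in range(n))
            for i in range(len(row) - n + 1)]

def filter_5x5(image, kernel):
    """Apply a 5x5 kernel as five 5-tap row filters whose outputs are summed."""
    out = []
    for r in range(len(image) - 4):
        # One 5-tap FIR per kernel row, each fed by a different image line.
        partial = [fir_row(image[r + j], kernel[j]) for j in range(5)]
        out.append([sum(col) for col in zip(*partial)])
    return out

# Small hypothetical test image and a box kernel (all ones).
image = [[(r * 8 + c) % 7 for c in range(8)] for r in range(8)]
box = [[1] * 5 for _ in range(5)]
result = filter_5x5(image, box)
print(len(result), len(result[0]))
```

In hardware the five image lines would typically come from line buffers so that all five row filters run in parallel, but that buffering is outside the scope of this sketch.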

Benchmarking

The 5x5 filter design example was compiled for point-to-point Ethernet hardware co-simulation and co-simulated using the Xilinx ML402 development board. The simulation speed in hardware was compared against the simulation speed in software. Specifically, the benchmark considers the number of processed frames being read back per second, and the results are compared to the software simulation time of the filter operation on a single frame.

Figure 6 summarizes the simulation speedup achieved through Ethernet co-simulation with respect to a pure software simulation. The results show a significant speedup in simulation by a factor of approximately 50 to 1,000 times. In reality, the achievable speedup may vary based on different factors: design complexity, number of I/O ports, and volume of I/O data. The figure also reflects two other important factors regarding the Ethernet settings – the link speed and maximum frame size – that can affect co-simulation performance.

With the increase of link speed, we see a dramatic reduction in simulation time because more bandwidth is available for co-simulation data. With jumbo frames enabled on a gigabit connection, the co-simulation performance is further bolstered by increasing the maximum frame size to ensure the greatest efficiency of burst data transfers.
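A rough calculation shows why a larger maximum frame size helps burst transfers. The sketch below counts how many Ethernet frames are needed to move one 128x128 video frame at several maximum frame sizes. It rests on assumptions that the article does not state: 8-bit pixels, a 14-byte Ethernet header inside the quoted frame size, and no additional co-simulation protocol overhead.

```python
import math

ETH_HEADER = 14           # destination MAC + source MAC + EtherType, in bytes
VIDEO_FRAME = 128 * 128   # bytes per video frame, assuming 8-bit pixels

def ethernet_frames_needed(max_frame_size: int,
                           payload_bytes: int = VIDEO_FRAME) -> int:
    """Number of Ethernet frames needed to carry payload_bytes."""
    return math.ceil(payload_bytes / (max_frame_size - ETH_HEADER))

for size in (1514, 2048, 4096, 8192):
    print(size, ethernet_frames_needed(size))
```

Under these assumptions a standard 1514-byte frame needs 11 transfers per video frame, while an 8192-byte jumbo frame needs 3, so the fixed per-frame cost is paid far less often.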

Conclusion

The System Generator for DSP Ethernet hardware co-simulation interfaces provide convenient, high-bandwidth solutions for simulating video and image processing applications on the Xilinx ML402 platform. These interfaces make it possible to simulate remote FPGA platforms, or for higher performance, a board attached directly to the host PC using an Ethernet cable. By using the System ACE solution, device configuration is accomplished over an Ethernet connection, eliminating the need for a second programming cable. As shown from the benchmark results, the interface can enable simulation speedups by several orders of magnitude.

Both the Ethernet co-simulation interfaces and video processing reference design are distributed with the Xilinx System Generator v8.1 software tool.

To learn more about System Generator and Ethernet co-simulation, see the User Guide at www.xilinx.com/system_generator.htm.


Figure 4 – System Generator for DSP 5x5 filter operator example

Figure 5 – System Generator for DSP 5x5 filter streaming video test bench (in Simulink, a looped 128x128 video sequence feeds an unfiltered video display and, over the Ethernet cable, the 5x5 image kernel in the FPGA fabric of the ML402 board, whose output feeds a filtered video display)

Figure 6 – System Generator for DSP 5x5 filter benchmark results. Chart title: Ethernet Co-Simulation Performance (5x5 Filter Operation of 128x128 Video Frames). Vertical axis: simulation speedup w.r.t. software, 0 to 1400. Horizontal axis: Ethernet setting (link speed, maximum frame size): 10 Mbps, 1514 bytes; 100 Mbps, 1514 bytes; 1 Gbps, 1514 bytes; 1 Gbps, 2048 bytes; 1 Gbps, 4096 bytes; 1 Gbps, 8192 bytes. Bar labels as extracted: 46x, 345x, 809x, 928x, 1237x, 1114x.

by Luc Langlois Global Technical Marketing Manager, DSP [email protected]

Crafting DSP algorithms for optimumperformance in hardware often requiressophisticated design techniques, such aspipelining and overclocked control logic.Such is the case for implementations usingthe Xilinx® Virtex™-4 DSP48 slice,which attains maximum efficiency whenoperating at its peak clock rate of 500MHz with internal registers enabled.

However, synchronizing calculations ina structure of overclocked pipeline registerscan be daunting when using traditionaltime-domain analysis of waveforms tovisualize dataflow. The z-transform is aviable alternative. In this article, I’ll presenta simple, efficient methodology for analyz-ing high-performance DSP algorithmsusing the z-transform to obtain predictableresults without guesswork. My exampleswill demonstrate quick pencil-and-papercalculation techniques of key performancemetrics (such as latency) using three differ-ent structures of finite impulse response(FIR) filters, with an emphasis on Virtex-4DSP48-based implementations.

Hardware DSP Analysis Techniques Using the Z-Transform

Harness the power of the Virtex-4 DSP48 architecture without guesswork.

The Z-Transform

DSP uses the z-transform to operate on sampled signals in discrete time, as opposed to the Laplace and Fourier transforms used for analog signals in continuous time. Hardware designers will recognize the standard notation, $z^{-1}$, for a unit-sample delay, commonly implemented with a register. This refers to an important property of the z-transform: a delay in the time domain corresponds to the z-transform of the signal without delay, multiplied by a power of z in the frequency domain. The expression of this relationship between a signal delayed by k unit samples and its z-transform is:

$$x[n-k] \leftrightarrow z^{-k} X(z)$$
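The delay property is easy to check numerically. The following is a minimal sketch, not part of the original article: the test signal, the evaluation point `z0`, and the helper names `z_transform` and `delay` are all illustrative assumptions.

```python
# Numerically check the delay property x[n-k] <-> z^-k X(z)
# for a finite, causal test sequence.

def z_transform(samples, z):
    """Evaluate X(z) = sum_n x[n] * z^-n for a finite causal sequence."""
    return sum(x * z ** (-n) for n, x in enumerate(samples))

def delay(samples, k):
    """Delay a causal sequence by k unit samples (k registers in hardware)."""
    return [0] * k + list(samples)

x = [3, -1, 4, 1, -5]        # arbitrary test signal (assumption)
z0 = complex(0.9, 0.4)       # arbitrary nonzero evaluation point (assumption)
k = 3                        # number of unit-sample delays

lhs = z_transform(delay(x, k), z0)        # transform of the delayed signal
rhs = z0 ** (-k) * z_transform(x, z0)     # z^-k times the original transform
delay_property_holds = abs(lhs - rhs) < 1e-12
print(delay_property_holds)
```

Both sides agree to within floating-point rounding, which is what lets a register be replaced by a factor of $z^{-1}$ in the algebra.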

The Signal Flow Graph

The signal flow graph is a time-tested tool for visualizing DSP algorithms. Figure 1 is the signal flow graph of a direct-form FIR filter.

A signal flow graph comprises three elements:

• Branch node: sends a copy of the input signal to several output paths

• Summing node: outputs the sum of all signals flowing into it

• Delay element: stores a delayed sample of the input signal
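The three elements map directly onto a few lines of software. The sketch below is an illustration only: the function name `direct_form_fir` and the coefficient values are assumptions, and it models a direct-form FIR filter in the time domain rather than any particular hardware implementation.

```python
def direct_form_fir(x, b):
    """Direct-form FIR built from branch nodes, a summing node and delay elements."""
    delay_line = [0] * (len(b) - 1)          # delay elements (registers), initially zero
    y = []
    for sample in x:
        taps = [sample] + delay_line         # branch nodes: x[n], x[n-1], x[n-2], ...
        y.append(sum(bk * t for bk, t in zip(b, taps)))  # summing node
        delay_line = taps[:-1]               # every sample moves one delay element along
    return y

b = [1, 2, 3, 4]                             # illustrative coefficients b0..b3 (assumption)
impulse = [1, 0, 0, 0, 0, 0]
response = direct_form_fir(impulse, b)
print(response)
```

Feeding in a unit impulse returns the coefficients in order, followed by zeros, which is the impulse response expected of a direct-form FIR filter with no added latency.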

Note the filter coefficients $b_0 \ldots b_3$, which multiply the signal flowing through each branch. For simplicity, multipliers are not explicitly shown.

Figure 1 – Signal flow graph of direct-form FIR filter
[Signal flow graph labels: input x[n]; output y[n]; $z^{-1}$ delay elements; nodes X, $z^{-1}X$, $z^{-2}X$, $z^{-3}X$; coefficients b0, b1, b2, b3.]

Analysis

Referring to Figure 1, I recommend the following method to analyze the signal flow graph using z-transforms:

1. Annotate each node of the time-skew buffer with the z-transform of the input signal, multiplied by increasing negative powers of z as the signal moves through delay elements.

2. Derive the output by tracing the signal through the graph, multiplying the input signal by the coefficient in each branch, and summing the resulting products in the summing nodes:

$$Y(z) = b_0 X + b_1 z^{-1} X + b_2 z^{-2} X + b_3 z^{-3} X$$

3. Simplify:

$$Y(z) = \sum_{k=0}^{N-1} b_k z^{-k} X$$

The result is the familiar FIR filter sum-of-products, equivalent to discrete-time convolution in the time domain:

$$y[n] = x[n] * b[n]$$

Pipelining for Performance

Applying the analysis method previously discussed to a high-performance FIR filter structure known as the parallel systolic form results in Figure 2. It is derived by pipelining the time-skew buffer and the adder chain of the direct form, producing a structure that maps naturally to the Virtex-4 DSP48 slice (for more details, see Chapter 4 of the XtremeDSP User Guide).

Figure 2 – Signal flow graph of parallel systolic FIR filter
[Signal flow graph labels: input x[n]; output y[n]; $z^{-1}$ registers in the time-skew buffer and adder chain; nodes X, $z^{-2}X$, $z^{-4}X$, $z^{-6}X$; coefficients b0, b1, b2, b3.]

A glance at this structure would suggest extra latency compared to the direct form of Figure 1, but how can you quickly determine the exact amount of latency without the time-consuming exercise of actually building a model for simulation? With pencil and paper, a three-step analysis process using z-transforms will answer this in no time.

1. The z-transform annotated time-skew buffer is shown in Figure 2. Note the even powers of z in the time-skew buffer because of the double registers.

2. Derive the output by tracing the signal through the graph. The trick here is to recognize that each signal crossing of a register from left to right in the adder tree causes a $z^{-1}$ to multiply the entire bracketed expression thus far:

$$Y(z) = \{[(b_0 z^{-1} X + b_1 z^{-2} X) z^{-1} + b_2 z^{-4} X] z^{-1} + b_3 z^{-6} X\} z^{-1}$$

3. Simplify:

$$Y(z) = z^{-4} \sum_{k=0}^{N-1} b_k z^{-k} X$$

Again the familiar FIR filter sum-of-products appears, revealing a latency of four sample periods when factored out.

Splitting the Unit Delay

The semi-parallel FIR filter is a structure of time-shared DSP48s, each operating on a subset of coefficient taps at an overclocked computation rate $f_{clk}$ relative to the data sampling rate $f_s$ (also referred to as throughput). Extensive pipelining (displayed as red squares) allows clocking of the DSP48 at its maximum computation rate (500 MHz in a Virtex-4 -12 speed grade device), for an optimal trade-off of throughput versus filter order. The defining quantity is taps/DSP48 = $f_{clk}/f_s$ = number of computation phases required to compute each output value Y (Chapter 5 of the XtremeDSP User Guide).
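The defining ratio taps/DSP48 = $f_{clk}/f_s$ can be turned into a quick sizing calculation. The sketch below is a back-of-the-envelope helper, not a Xilinx tool: the function name `semi_parallel_plan` and the 125 MHz sample rate are assumptions chosen so that a 500 MHz clock gives 4x overclocking.

```python
import math

def semi_parallel_plan(num_taps, f_clk_hz, f_s_hz):
    """Return (taps_per_dsp48, num_dsp48) for a semi-parallel FIR filter.

    taps_per_dsp48 is f_clk / f_s, the number of computation phases available
    per output sample; num_dsp48 counts the time-shared multiply DSP48s only
    (any separate accumulator stage is not included).
    """
    phases = f_clk_hz // f_s_hz
    if phases < 1:
        raise ValueError("sample rate exceeds the clock rate")
    return phases, math.ceil(num_taps / phases)

# 8 taps, 500 MHz clock, hypothetical 125 MHz sample rate (4x overclocking)
plan = semi_parallel_plan(8, 500_000_000, 125_000_000)
print(plan)
```

With these numbers each DSP48 handles four taps, so two time-shared DSP48s cover the eight taps, matching the two coefficient groups (b0 to b3 and b4 to b7) of the 8-tap example.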

Figure 3 shows an 8-tap semi-parallel FIR structure using DSP48s, with 4x overclocking. The time-skew buffer is a cascade of addressable shift registers with shift enable, known as SRL16E, in FPGA fabric. The enable signal (not shown) is asserted every $1/f_s$ to shift data through the time-skew buffer at the sampling rate, resulting in unit delays (whole powers of z).

Figure 3 – Eight-tap, semi-parallel FIR filter with 4x overclocking ($f_{clk}/f_s = 4$)
[Block diagram labels: SRL16E (x2); DSP48 (x3); Accumulator; coefficient sets b0 ... b3 and b4 ... b7; node annotations with fractional and whole powers of z; phase output $Y_{ph\_0}$.]

At each of four computation phases, shift register addressing operating at $f_{clk} = 4f_s$ (not shown) selects one of four data samples in each SRL16E and presents it to the DSP48 input register, while the corresponding coefficient, denoted $b_k$, is fetched from distributed memory. The four computation phases are summed sequentially to produce the output Y in the accumulator, which is then cleared to start calculation of the next output. Overclocking all registers at $f_{clk} = 4f_s$ accounts for the fractional delays $z^{-1/4}$.

Aligning the computation phases to ensure proper synchronization of these operations can be a daunting task using time-domain waveforms to visualize dataflow in simulation. Z-transforms reduce the job to simple algebra by grouping unit and fractional delays as lumped sums in the exponent, as follows:

1. The z-transform annotated time-skew buffer is shown in Figure 3 for the first of four computation phases. Note the combination of fractional and whole powers of z in the time-skew buffer.

2. Each 4x overclocked register causes the signal to accumulate a $z^{-1/4}$ delay. The post-adder register applies its $z^{-1/4}$ delay by multiplying the entire bracketed signal expression thus far. The first computation phase $Y_{ph\_0}$ is (the trailing $z^{-1/4}$ is the effect of the post-adder register):

$$Y_{ph\_0} = (b_0 z^{-1} X + b_4 z^{-5} X)\, z^{-1/4}$$

The full output expression of the 8-tap FIR filter is the accumulated sum of the four consecutive computation phases, each of which is shown delayed an extra $1/(4f_s)$ from the previous phase in the sequence:

$$
\begin{aligned}
Y ={} & (b_0 z^{-1} X + b_4 z^{-5} X)\, z^{-1/4} && \text{(1st computation phase)} \\
 & + z^{-1/4} (b_1 z^{-2} X + b_5 z^{-6} X)\, z^{-1/4} && \text{(2nd computation phase)} \\
 & + z^{-1/2} (b_2 z^{-3} X + b_6 z^{-7} X)\, z^{-1/4} && \text{(3rd computation phase)} \\
 & + z^{-3/4} (b_3 z^{-4} X + b_7 z^{-8} X)\, z^{-1/4} && \text{(4th computation phase)}
\end{aligned}
$$

The accumulator has the effect of “realigning” each computation phase to produce the sequentially accumulated sum Y at the next $1/f_s$ boundary. This is accounted for with extra fractional delay (the leading factor in each line is the re-alignment of that computation phase in the accumulator):

$$
\begin{aligned}
Y ={} & z^{-3/4} (b_0 z^{-1} X + b_4 z^{-5} X)\, z^{-1/4} \\
 & + z^{-1/2} z^{-1/4} (b_1 z^{-2} X + b_5 z^{-6} X)\, z^{-1/4} \\
 & + z^{-1/4} z^{-1/2} (b_2 z^{-3} X + b_6 z^{-7} X)\, z^{-1/4} \\
 & + z^{0} z^{-3/4} (b_3 z^{-4} X + b_7 z^{-8} X)\, z^{-1/4}
\end{aligned}
$$

3. Simplify:

$$Y(z) = z^{-2} \sum_{k=0}^{N-1} b_k z^{-k} X$$

Again the familiar FIR filter sum-of-products appears, revealing a latency of two sample periods when factored out.

Conclusion

With a simple, efficient methodology using the z-transform for the analysis of DSP algorithms, you can easily apply high-performance techniques such as pipelining and overclocked control logic to your hardware DSP designs.

The techniques described in this article were presented at Speedway 2005 DSP sessions. The Spring 2006 Speedway Design Workshop Series features two new DSP-related workshops: “Xilinx DSP Development Workshop” and “Xilinx DSP for Video Workshop.” For more information, visit http://em.avnet.com/xlxspringspeedway.
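The pencil-and-paper latency results of this article (four sample periods for the parallel systolic form, two for the semi-parallel form) can also be checked mechanically by tracking only the exponents of z. The sketch below is an illustration under stated assumptions: the helper names `tap`, `add`, `reg` and `latency` are hypothetical, and each signal is stored simply as a map from tap index k to its accumulated delay, which works because every coefficient appears exactly once.

```python
from fractions import Fraction

def tap(k, delay):
    """Signal b_k * z^-delay * X, stored as {tap index: accumulated delay}."""
    return {k: Fraction(delay)}

def add(*signals):
    """Summing node: each tap appears once, so merging the maps is enough."""
    merged = {}
    for s in signals:
        merged.update(s)
    return merged

def reg(signal, delay=1):
    """Register crossing: multiply the whole expression by z^-delay."""
    return {k: d + Fraction(delay) for k, d in signal.items()}

def latency(signal):
    """Delay left after factoring out sum(b_k z^-k X); None if taps disagree."""
    extra = {d - k for k, d in signal.items()}
    return extra.pop() if len(extra) == 1 else None

# Parallel systolic form: nested brackets of the 4-tap output expression
systolic = reg(add(reg(add(reg(add(tap(0, 1), tap(1, 2))), tap(2, 4))), tap(3, 6)))

# Semi-parallel form: four phases, each register at 4*fs contributes z^-1/4
q = Fraction(1, 4)
phases = []
for p in range(4):
    mac = reg(add(tap(p, p + 1), tap(p + 4, p + 5)), q)  # post-adder register
    skewed = reg(mac, p * q)                  # phase p starts p/(4*fs) later
    phases.append(reg(skewed, (3 - p) * q))   # re-alignment in the accumulator
semi_parallel = add(*phases)

print(latency(systolic), latency(semi_parallel))
```

Every tap in each structure ends up with the same extra delay, so the common factor can be pulled out of the sum exactly as in the Simplify steps.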

by Niall Battson, DSP Technical Marketing Manager, Xilinx, [email protected]

Although the well-known finite impulse response (FIR) filter algorithm is extremely simple, the number of variants in the implementation specifics is immense. These implementation specifics have kept research institutions busy and DSP hardware engineers struggling to reach optimal performance and usage of the silicon available to them.

Introduced in September 2004, Xilinx® Virtex™-4 devices demonstrated that 400 MHz DSP designs (in the slowest speed grade) were feasible, especially for designs that had an abundance of FIR filters. This is certainly the case in wireless and defense systems today, especially in the radio portion.

Implementing Optimal Filters Quickly

You can obtain high performance with minimal resources in Virtex-4 FPGAs using the new FIR Compiler.

Figure 1 shows a typical three-carrier UMTS digital up converter, in which the FIR filters consume about 60% of the design. This significant 2x performance improvement over Virtex-II Pro FPGAs enables DSP designs to shrink in resource utilization significantly (often by more than 50%), allowing support for more channels, more functionality, and a lower power or lower cost solution.

At the forefront of this performance leap is the XtremeDSP™ slice (also referred to as the DSP48). The XtremeDSP slice is a unique high-performance multiplier and arithmetic unit with great flexibility, laid out in a column structure in the FPGA with dedicated cascade routing between each slice.

However, to take advantage of the significant improvements in Virtex-4 DSP, hardware engineers must adopt a new implementation style for their FIR filters. This implementation is based on an adder chain architecture and takes specific advantage of the XtremeDSP slice. But as the FIR filter is one of the most ubiquitous and fundamental building blocks in DSP systems, the amount of time spent understanding and reworking old designs can easily become very significant. To alleviate this impact on hardware engineers and accelerate time to market for those adopting DSP in FPGAs, Xilinx has created the FIR Compiler.

The FIR Compiler v1.0 is a new line of powerful and comprehensive IP from Xilinx. The FIR Compiler allows hardware engineers and DSP algorithm engineers to rapidly generate the high-performance filters that Virtex-4 devices promise. Furthermore, the FIR Compiler allows you to make trade-offs between differing high-performance hardware implementations of your FIR filter specification:

• Adder-chain based multiply accumulate FIR (MACFIR)

• Adder tree-based MACFIR

• Distributed arithmetic FIR (DAFIR)

The more traditional adder tree-based MACFIR architecture is still an excellent fit for low-cost Spartan™ devices, as these devices do not have the XtremeDSP slice. The DAFIR is extremely valuable for low-bit-width applications and logic-slice-heavy FPGAs.

These easy trade-offs give you the ability to select the most resource- and power-efficient solutions.

Trade-Offs and Optimizations

One of the fundamental trade-offs that the FIR Compiler enables is data rate versus area. For example, a 16-bit single-rate 64-tap filter will yield three very different results, depending on the data rate required. Given that clock frequencies of 400 MHz are readily obtainable with the FIR Compiler, Figures 2, 3, and 4 illustrate the optimal structures for 6.35 MHz, 25 MHz, and 100 MHz data rates, respectively.

Note how the available clock cycles in the lower data rate designs are exploited to result in the smaller resource solutions. Also note how the XtremeDSP slice is used and the cascade routing exploited by the adder chain structures for more than single multiplier implementations.

(For more detailed information on these architectures, please refer to the Virtex-4 XtremeDSP Slice User Guide.)

The trade-off between data rate and area is very simply made in the FIR Compiler tool through the sample frequency parameter in the GUI (see Figure 5). The structures and size of the filter that can be implemented are very different based on the data rate requirement. The FIR Compiler makes automatic decisions about whether to use a block memory or distributed memory structure, as well as the number of multipliers required to meet the sample frequency entered. These automatic capabilities keep resource usage to a minimum and greatly reduce design time for filter implementations.
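The data-rate-versus-area trade-off discussed in this article can be estimated with simple arithmetic: the more clock cycles available per input sample, the more taps a single multiplier can serve. The sketch below is a rough estimate only and is not the FIR Compiler's actual decision procedure; the function name `multipliers_needed` is hypothetical, and the estimate ignores coefficient symmetry and any extra accumulator stage.

```python
import math

def multipliers_needed(num_taps, f_clk_hz, f_sample_hz):
    """Estimate the multiplier count for a single-rate, single-channel FIR filter."""
    cycles_per_sample = f_clk_hz // f_sample_hz   # clock cycles available per sample
    if cycles_per_sample < 1:
        raise ValueError("sample rate exceeds the clock rate")
    return math.ceil(num_taps / cycles_per_sample)

F_CLK = 400_000_000   # 400 MHz clock, as quoted in the article

# 64-tap filter at 25 MSPS: 16 cycles per sample, so each multiplier serves 16 taps
at_25_msps = multipliers_needed(64, F_CLK, 25_000_000)

# 64-tap filter at 400 MSPS: one cycle per sample, so every tap needs its own multiplier
at_400_msps = multipliers_needed(64, F_CLK, 400_000_000)

print(at_25_msps, at_400_msps)
```

The two results line up with the labels of the 25 MSPS structure (four 16 x 16 coefficient memories) and the fully parallel structure (coefficients K0 to K63) in the article's figures.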


Figure 1 – FIR filters consume the majority of digital radio designs.
[Block diagram labels: complex baseband channels 0, 1, and 2; per channel an RRC filter, a halfband interpolator, an interpolate-by-3 filter, a mixer with exp(jα0t), exp(jα1t), or exp(jα2t), and a gain stage (gain 0, 1, 2); summed complex composite output at Fs = 46.08 MSPS. Sample rates marked: Fs = 3.84 MSPS (Fs = Fchip), Fs = 7.68 MSPS (Fs = 2 x Fchip), Fs = 15.36 MSPS (Fs = 4 x Fchip), Fs = 46.08 MSPS (Fs = 12 x Fchip). Annotation: FIR filters of differing data rates.]

Higher Clock Performance, Smaller Designs

Clearly, one of the most valuable aspects of the FIR Compiler is its ability to shrink the size of a design by exploiting the performance of the silicon in the FPGA. Figures 2, 3, and 4 demonstrate that resources are kept to a minimum by achieving high clock frequencies. It also means a higher data-rate capability for the parallel filter shown in Figure 4.

Figure 6 really emphasizes this point, adjusting the clock frequency for a 33-tap filter while maintaining a constant sample rate of 10 MSPS. It also compares the differences between symmetrical and non-symmetrical coefficient sets, emphasizing the benefits offered by both. Overall, you should aim to maximize clock frequency, as the significant reduction in area cannot be overlooked.

The Complexity of the FIR Compiler

FIR filter specifications are, however, more complex than what I have discussed so far. As shown in Figure 1, both interpolation and multi-channel capabilities are critical in designing the system. Multiple channels of data are very common in video (red, green, and blue); wireless communications (antenna diversity, adaptive antenna arrays); and general DSP processing (complex data).

You can exploit these multiple streams of data and use a single FIR filter structure in a time-division multiplexed fashion to filter the channels. This provides a significant resource utilization reduction over multiple instances of the same filter. However, the clock must run faster for a multi-channel filter than for a single-channel filter; specifically, the required clock frequency is the single-channel clock frequency multiplied by the number of channels. The promised high-performance clock capabilities of the FIR Compiler make these multi-channel filters feasible and easy to generate in the tool, and they greatly reduce resource utilization.
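Time-division multiplexing of channels through one filter can be illustrated in software. The sketch below is a behavioral model under stated assumptions, not the FIR Compiler's implementation: the names `fir` and `tdm_fir` are hypothetical, the channels are interleaved sample by sample, and each unit delay of the single-channel filter becomes a delay of `num_channels` samples in the shared filter.

```python
def fir(x, b):
    """Causal FIR: y[n] = sum_k b[k] * x[n - k], with x[n] = 0 for n < 0."""
    return [sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
            for n in range(len(x))]

def tdm_fir(channels, b):
    """Filter several equal-length channels with one shared FIR structure."""
    c = len(channels)
    # Interleave: ch0[0], ch1[0], ..., ch0[1], ch1[1], ... (runs at c times the rate)
    interleaved = [ch[n] for n in range(len(channels[0])) for ch in channels]
    # Same coefficients, but taps are spaced c samples apart in the shared delay line
    out = [sum(b[k] * interleaved[m - k * c] for k in range(len(b)) if m - k * c >= 0)
           for m in range(len(interleaved))]
    # De-interleave the results back into per-channel outputs
    return [out[i::c] for i in range(c)]

b = [1, -2, 3]                                         # illustrative coefficients
rgb = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]    # three made-up channels
shared = tdm_fir(rgb, b)
separate = [fir(ch, b) for ch in rgb]
print(shared == separate)
```

The shared filter produces the same outputs as three separate filters, at the cost of processing three times as many samples per unit time, which is where the higher clock requirement comes from.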

Figure 2 – 6.35 MSPS single-rate 64-tap FIR filter
[Diagram labels: DSP48 slice (opmode = 0100101); input data memory, 64 x 18; coefficient memory, 64 x 18; control logic (data address, coefficient address, WE, CE, load); 16-bit input xn; 40-bit output yn.]

Figure 3 – 25 MSPS single-rate 64-tap FIR filter
[Diagram labels: DSP48 slices (opmodes 0000101, 0010101, 0010010); four 16 x 16 coefficient memories; SRL16E storing 16 coefficients; 16-bit input x(n); 40-bit output y(n).]

Figure 4 – 400 MSPS single-rate 64-tap FIR filter
[Diagram labels: DSP48 slices (opmodes 0000101, 0010101) in an adder chain; coefficients K0, K1, K2, ..., K62, K63; 16-bit input x(n); 40-bit output y(n).]

Multi-rate filters are also extremely common in DSP designs, especially in digital radios (demonstrated in Figure 1). With these filters, the input and output data rates are not the same. For an interpolation filter, the output rate is higher by a factor of the interpolation ratio; in a decimation filter, the output rate is lower by a factor of the decimation ratio. You can exploit these differences in input and output frequencies by using a well-known technique that creates what is known as a polyphase filter to reduce the computational requirement. This polyphase filter technique, added to the high clock frequency performance and automatic generation, means that the FIR Compiler can rapidly create extremely resource- and power-efficient multi-rate filters. The tool can optimize the structure even further if the filter is a half-band multi-rate filter, as required in digital radios. Any hardware engineer implementing digital radios will take advantage of these capabilities.
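The polyphase idea can be shown for an interpolation filter in a few lines. The sketch below is a textbook-style illustration, not the FIR Compiler's generated structure: the function names are hypothetical, and the coefficients and input samples are made up. The direct version inserts zeros between samples and filters at the high rate; the polyphase version splits the coefficients into `ratio` short sub-filters that run at the low input rate.

```python
def fir(x, b):
    """Causal FIR: y[n] = sum_k b[k] * x[n - k], with x[n] = 0 for n < 0."""
    return [sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
            for n in range(len(x))]

def interpolate_direct(x, b, ratio):
    """Zero-stuff by `ratio`, then filter at the high output rate."""
    stuffed = []
    for sample in x:
        stuffed.append(sample)
        stuffed.extend([0] * (ratio - 1))
    return fir(stuffed, b)

def interpolate_polyphase(x, b, ratio):
    """Run `ratio` short sub-filters at the input rate and interleave their outputs."""
    branches = [fir(x, b[p::ratio]) for p in range(ratio)]   # sub-filter p uses b[p], b[p+ratio], ...
    return [branches[p][n] for n in range(len(x)) for p in range(ratio)]

b = [1, 2, 3, 4, 5, 6, 7]       # illustrative 7-tap prototype filter
x = [1, -2, 3, 0, 5]            # made-up input samples
direct = interpolate_direct(x, b, 3)
polyphase = interpolate_polyphase(x, b, 3)
print(direct == polyphase)
```

Both versions give identical outputs, but each polyphase output sample needs only the coefficients of one sub-filter (at most three of the seven here), because the products with stuffed zeros are never computed.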

In addition to the filter types I’ve described, the FIR Compiler also offers the ability to change coefficients in the filter on the fly. This is very important in video filtering and agile digital radio receivers. You can select a fully reloadable coefficient filter or a filter that contains as many as 16 different coefficient sets; a control port selects the set being employed. Once again, the FIR Compiler can make an optimal choice between block memory or distributed memory.

Furthermore, you can combine all of these capabilities to provide a very wide range of possible filters, with the most complicated being multiple channel, multirate FIR filters with reloadable coefficients. Table 1 shows performance and resource utilization for numerous complicated FIR filters, including the filters in Figure 1, and demonstrates the capabilities of the FIR Compiler.

Conclusion

The FIR Compiler is available in both System Generator and Core Generator™ software and is an extremely valuable tool for both DSP algorithm and hardware engineers. It provides rapid generation of difficult-to-implement, high-performance FIR filters and positively impacts design time and risk.

Most importantly, the filters generated take full advantage of the FPGA; consequently, their performance reaches the maximum 400 MHz offered by a Virtex-4 device (-10 slowest speed grade) with extremely efficient resource utilization.

For more information about the FIR Compiler, visit www.xilinx.com/ipcenter.

Figure 5 – First page of FIR Compiler GUI
[Screenshot annotation: clock and sample frequency parameters.]

Figure 6 – 33-tap 10 MSPS FIR filter resource utilization for differing clock performance
[Chart: performance goal versus clock frequency (MHz, 0 to 400) on a Virtex-4 SX55-10, for symmetric and non-symmetric FIR filters, with slice / block RAM utilization listed for the 33-tap filter. Annotation: the higher the clock frequency, the smaller the FIR filter.]

Table 1 – Designs generated using the FIR Compiler.

Resource Utilization

Filter Type | Clock Frequency (MHz) | Slices | DSP48 | Block RAM
395 MSPS, 128-Tap, Decimate by 4, Single-Channel, 16-Bit-Data FIR Filter | 395 | 500 | 33 | 0
3.5 MSPS, 196-Tap, Interpolate by 2, 8-Channel, 12-Bit-Data FIR Filter | 399 | 300 | 15 | 14
22 MSPS, 20-Tap, Single-Rate, 3-Channel, 18-Bit-Data FIR Filter | 400 | 123 | 5 | 0
3.84 MSPS, 47-Tap, Interpolate by 2 RRC, 6-Channel, 16-Bit-Data FIR Filter | 305 | 234 | 4 | 0
7.68 MSPS, 23-Tap, Half-Band Interpolator, 6-Channel, 16-Bit-Data FIR Filter | 333 | 119 | 1 | 2

by Ali Behboodian, Applications Engineer, The [email protected]

Embedded systems have transformed technology products, from everyday consumer electronic devices to complex industrial systems. As hardware and memory become less expensive and more powerful, embedded systems will become even more pervasive. At the same time, the designs will become more complex. To meet this demand, engineers must find ways to efficiently develop software and hardware at an even faster rate. A methodology that addresses this is model-based design.

The MathWorks Simulink product family enables you to apply model-based design in a graphical, interactive environment, where you can visualize your system models and subsystem designs using intuitive block diagrams. The models are hierarchical and you can partition the system into functional units. The graphical environment allows you to understand the design and the interactions of the subsystems more easily than text-based models.

In this article, I’ll present a model-based design methodology in the context of the design and implementation of the Sobel edge-detection algorithm on an FPGA. Note that you can readily apply these concepts for embedded designs in a wide range of applications in different industries, such as aerospace and defense, automotive, communications, consumer electronics, and medical electronics.

Model-Based Design


A methodology that addresses today’s growing challenges of designing embedded systems.

Figure 1 shows the elements of model-based design. The center focus of this design methodology is a model, whose four main elements are:

• Executable specifications

• Design with simulation

• Implementation with code generation

• Continuous test and verification

I’ll explain the above four elements and apply them to the design and implementation of the Sobel edge-detection algorithm. For a more comprehensive application of model-based design, see the article, “The Design and Implementation of a GPS Receiver Channel” from Issue 1 of DSP Magazine (www.xilinx.com/publications/magazines/dsp_01/dsp_gps01.htm).

Executable Specification

As designs become larger and more complicated, it becomes necessary to first describe them at a high level of abstraction. Simulink, together with application-specific blocksets such as the Signal Processing Blockset, the Communications Blockset, and the Video and Image Processing Blockset, provides an excellent graphical environment for a high-level description of embedded algorithms. System engineers usually develop this high-level description.

A high-level Simulink model serves several purposes:

• It enables designers to perform simulations by directly executing the Simulink model

• It is used throughout the development process for testing, verification, and implementation

• It allows developers to identify bugs early on and avoid costly bug discovery towards the end of development

• It eliminates the need for paper-based specification, which is easily prone to misinterpretations, and replaces it with the executable specification

• Each member of a design team can understand and execute the model and can focus further on developing parts of the main model

We call this high-level model the executable specification, or golden reference.

The executable specification for the Sobel edge-detection algorithm is illustrated in Figure 2. The algorithm comprises two 2D filters, each with a 3 x 3 kernel (one filter estimating the edges in the x direction and one in the y direction), two square operations, and a threshold operation. The model is the start of a path that will lead all the way to an FPGA implementation.

Figure 2 also shows the input image to the algorithm as well as the output of the algorithm. In the Simulink environment, you can also examine and visualize every signal throughout the model.

Note that the input and output images in the executable specification are test vectors for the algorithm. You can use these test vectors throughout the design process to validate your design against the executable specification. Because the entire design is performed in the Simulink environment, there is no need for extra overhead in porting the test vectors into different applications, or creating test harnesses in HDL that are prone to human errors. The test harness used in the executable specification is used throughout the design.

Figure 1 – Elements of model-based design

Figure 2 – Executable specification/golden reference for the Sobel edge-detection algorithm

Design with Simulation

When designing the executable specification, the system engineer generally does not keep the implementation details in mind, but rather designs the algorithm to match the behavioral requirements for the system. Once the system engineer submits the executable specification to the development team, the team may need to make modifications to it to fit the design into a real-time embedded system that may have limited resources, such as memory or processing power. These modifications may cause the output of the new design to deviate from the original design. Design engineers should decide if the deviation is acceptable.

In this section, I’ll make two modifications to the algorithm to make it suitable for hardware implementation and demonstrate how to continuously verify the design against the executable specification.

Redesigning the Algorithm

Let’s say that the developers decide to eliminate the square operations in Figure 2 and replace them with the absolute value operations for more efficient hardware implementation. Generally, such changes in the model are required for hardware implementation and are mostly done by experienced engineers in a design team. Simulink provides an environment where you can redesign an algorithm and validate your designs in a relatively short time. After switching the square operations with the absolute value operations, the final result does not exactly match the output of the executable specification, but the difference is quite small and in this case acceptable.
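The effect of swapping squares for absolute values can be sketched outside Simulink. The code below is a plain-Python illustration, not the article's model: the kernel values are the standard Sobel kernels, while the function names, the threshold, the way the threshold is applied to each variant, and the test images are all assumptions.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def filter2d(img, kernel):
    """Apply a 3 x 3 kernel over the interior pixels (no border padding)."""
    h, w = len(img), len(img[0])
    return [[sum(kernel[i][j] * img[r + i][c + j] for i in range(3) for j in range(3))
             for c in range(w - 2)] for r in range(h - 2)]

def edges_square(img, thr):
    """Reference: threshold on gx^2 + gy^2 (compared against thr^2)."""
    gx, gy = filter2d(img, SOBEL_X), filter2d(img, SOBEL_Y)
    return [[int(a * a + b * b > thr * thr) for a, b in zip(ra, rb)]
            for ra, rb in zip(gx, gy)]

def edges_abs(img, thr):
    """Hardware-friendly variant: threshold on |gx| + |gy|."""
    gx, gy = filter2d(img, SOBEL_X), filter2d(img, SOBEL_Y)
    return [[int(abs(a) + abs(b) > thr) for a, b in zip(ra, rb)]
            for ra, rb in zip(gx, gy)]

# Vertical step edge: both variants mark the same pixels
step = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
print(edges_square(step, 128) == edges_abs(step, 128))

# A made-up textured image: the two variants may differ on some pixels
textured = [[(r * r + 3 * c * c) % 256 for c in range(8)] for r in range(8)]
ref, alt = edges_square(textured, 300), edges_abs(textured, 300)
mismatches = sum(a != b for ra, rb in zip(ref, alt) for a, b in zip(ra, rb))
print(mismatches)
```

Because |gx| + |gy| is never smaller than the square root of gx² + gy², every pixel flagged by the reference is also flagged by the absolute-value variant at the same threshold; any differences are extra detections on diagonal gradients.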

Fixed-Point Implementation

Because the ultimate goal is to implement the algorithm in an FPGA, for my example I must convert my double-precision design to a fixed-point design. This can be done easily using Simulink. I used the double-precision model I developed to directly develop a fixed-point model without introducing any new blocks.

Simulink allows you to determine the number of bits and scaling for data as well as mathematical operations, and provides a great environment for analyzing the fixed-point operation of a system.

In the fixed-point design, the inputs to the filters are signed 9-bit integers and the outputs of the filters are signed 11-bit integers. The developers can tune the bit width and scaling related to the internal computations of the blocks. This gives huge leverage to the designer to compromise between matching the output of the executable specification while using the least number of bits necessary to save area on the device.
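One way to arrive at widths like these is a worst-case range calculation. The sketch below is an illustration under a stated assumption that is not spelled out in the article: the pixels are taken to be 8-bit unsigned values (0 to 255) carried in a signed word, and the filter is a standard 3 x 3 Sobel kernel. The function names are hypothetical.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

def signed_bits_needed(lo, hi):
    """Smallest two's-complement width that holds every integer in [lo, hi]."""
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits

def filter_output_range(kernel, in_lo, in_hi):
    """Worst-case output range of a linear filter for inputs in [in_lo, in_hi]."""
    pos = sum(k for row in kernel for k in row if k > 0)    # sum of positive weights
    neg = sum(-k for row in kernel for k in row if k < 0)   # magnitude of negative weights
    return pos * in_lo - neg * in_hi, pos * in_hi - neg * in_lo

input_bits = signed_bits_needed(0, 255)            # assumed pixel range 0..255
out_lo, out_hi = filter_output_range(SOBEL_X, 0, 255)
output_bits = signed_bits_needed(out_lo, out_hi)
print(input_bits, (out_lo, out_hi), output_bits)
```

Under this assumption the filter output spans -1020 to 1020, which fits a signed 11-bit word, and the 0 to 255 input needs a signed 9-bit word, consistent with the widths quoted above.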

Figure 3 shows the new fixed-point design after replacing the square operations with absolute value operations. In this figure, the new design is compared to the executable specification and the difference is shown both visually and numerically. Continuous test and verification is a key part of model-based design and is crucial to the success of a project. Simulink provides an excellent environment for this purpose.

Elaboration of the Design

In my example, the input to the edge-detection algorithm has been a two-dimensional image of 200 x 100 pixels. In a real-time system, the input is most likely not a matrix but a serial stream of data; for example, this serial stream of data can be generated by a charge-coupled device (CCD). Therefore, I need to modify the structure of the design such that the edge-detection algorithm accepts and performs 2D filtering on a serial stream of data.

To this end, I first serialized the input image. Then I performed the 2D filtering on this serial data. I later de-serialized the stream of data to be able to compare the output to the executable specification. This operation is done only for the bottom filter. I also added two delay elements to compensate for the buffering in the serializer block. As expected, the new design still produces exactly the same results as before.

This design also showcases the multi-rate capability of Simulink. The output rate of the serializer block is 20,000 times higher than the input rate. (Remember that the image size is 200 x 100. Because the image rate is 1 image per second, the sample rate after serialization is 20,000 samples per second.) Figure 4 illustrates the elaborated design.
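Filtering a serialized image can be sketched with a sliding buffer that holds two image lines plus three pixels. The code below is a behavioral illustration, not the Simulink model: the line-buffer approach, the function names, and the small test image are assumptions, and the compensating delay elements mentioned above are not modeled.

```python
from collections import deque

KERNEL = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # illustrative 3 x 3 kernel

def filter2d_matrix(img, kernel):
    """Reference: 3 x 3 filtering on the whole image (interior pixels only)."""
    h, w = len(img), len(img[0])
    return [[sum(kernel[i][j] * img[r + i][c + j] for i in range(3) for j in range(3))
             for c in range(w - 2)] for r in range(h - 2)]

def filter2d_stream(pixels, width, kernel):
    """Consume pixels in raster order; emit one output per complete 3 x 3 window."""
    window = deque(maxlen=2 * width + 3)        # two line buffers plus three pixels
    out = []
    for idx, p in enumerate(pixels):
        window.append(p)
        row, col = divmod(idx, width)
        if row >= 2 and col >= 2:               # a full 3 x 3 window has arrived
            acc = 0
            for i in range(3):
                for j in range(3):
                    back = (2 - i) * width + (2 - j)   # how many pixels ago it arrived
                    acc += kernel[i][j] * window[-1 - back]
            out.append(acc)
    return out

img = [[(7 * r + 13 * c) % 256 for c in range(6)] for r in range(5)]   # made-up image
serial = [p for row in img for p in row]                                # serializer
streamed = filter2d_stream(serial, 6, KERNEL)
reference = [v for row in filter2d_matrix(img, KERNEL) for v in row]
print(streamed == reference)
print(200 * 100)   # samples per image for the article's 200 x 100 frame
```

The streamed outputs match the matrix-based reference in raster order, which is the same kind of check the de-serializer enables against the executable specification.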

Figure 3 – The image on the left shows the output of the executable specification. The one in the middle shows the output of the new design, which uses fixed-point arithmetic and replaces the square operations of Figure 2 with absolute value operations. The one on the right depicts the difference between the two designs. The mean difference is 2.793%.

Figure 4 – Elaborated design. A serializer and deserializer are designed and the 2D filter is operating on a stream of 1D data.

Implementation

Now that I’ve elaborated the design of one of the 2D filters in the Sobel edge-detection algorithm, I can hand the elaborated design to the hardware designers for HDL implementation. There are two different approaches to consider in this section. The first approach assumes that the hardware designers will hand-code the filter algorithm in VHDL or Verilog. The second approach assumes that the developers will translate the Simulink model from the last section to a Simulink model based on Xilinx System Generator blocks and automatically generate HDL code. In both cases, the developers will verify their design against the executable specification and check the validity of their design in the Simulink environment.

Manual HDL, Co-Simulation, and Verification

The HDL designer on the development team can use the 2D filter design depicted in the bottom window of Figure 4 to write the corresponding VHDL or Verilog code. Once the code is written, the HDL designer can use Link for ModelSim, also from The MathWorks, to simulate the HDL design using ModelSim in the Simulink environment and compare the output of the HDL design to the output of the executable specification. Note that in this process, there is no need to generate an HDL test bench. The Simulink model feeds the input test vector to ModelSim through Link for ModelSim and extracts the data from ModelSim back to the Simulink environment. The HDL designer can readily verify whether the HDL code runs in accordance with the specifications. The model in Figure 5 co-simulates ModelSim and Simulink and allows you to verify the validity of the VHDL code. As you can see, the mean difference is the same as the previous model.

Automatic HDL Generation, Xilinx System Generator

Using Xilinx System Generator, you can build and debug DSP systems in Simulink using the Xilinx blockset. You can also automatically generate VHDL or Verilog code and run hardware-in-the-loop simulations. Figure 6 illustrates a filter design (using Xilinx System Generator blocks) that is equivalent to the filter depicted in the bottom window of Figure 4. The simulation in the Simulink environment is bit-true and cycle-true. Once you have verified the results of the System Generator design against the executable specification, you can automatically generate synthesizable VHDL or Verilog code for the filter.

Conclusion

Model-based design helps you create better embedded software and hardware by increasing the accuracy and speed of system development. You can confidently begin integration, test, and deployment of your embedded application knowing that you have identified design errors and met your requirements.

Model-based design provides a proven solution that reduces development time and cost and fosters quality and innovation in the development of embedded systems. For more information, visit www.mathworks.com/applications/dsp_comm/.


Figure 5 – Co-simulation with ModelSim

Figure 6 – Modeling of the 2D filter depicted in the bottom window of Figure 4 using Xilinx System Generator

Get Published

Would you like to be published in DSP Magazine? It's easier than you think! Submit an article draft for our Web-based or printed DSP Magazine and we will assign an editor and a graphic artist to work with you to make your work look as good as possible.

For more information on this exciting and highly rewarding program, please contact:

Forrest Couch
Publisher, Xcell Publications
[email protected]

Accelerating FFTs in Hardware Using a MicroBlaze Processor

A simple FFT, generated as hardware from C language, illustrates how quickly a software concept can be taken to hardware and how little you need to know about FPGAs to use them for application acceleration.

by John Williams, Ph.D.
CEO
PetaLogix
[email protected]

Scott Thibault, Ph.D.
President
Green Mountain Computing Systems, [email protected]

David Pellerin
CTO
Impulse Accelerated Technologies, [email protected]

FPGAs are compelling platforms for hardware acceleration of embedded systems. These devices, by virtue of their massively parallel structures, provide embedded systems designers with new alternatives for creating high-performance applications.

There are challenges to using FPGAs as software platforms, however. Historically, low-level hardware descriptions have had to be written in VHDL or Verilog, languages that are not generally part of a software programmer’s expertise. Other challenges have included deciding how and when to partition complex applications between hardware and software and how to structure an application to take maximum advantage of hardware parallelism.

Tools providing C compilation and optimization for FPGAs can help solve these problems by providing a new level of programming abstraction. When FPGAs first appeared two decades ago, the primary method of design for these devices was the venerable schematic. FPGA application developers used schematics to assemble low-level components (registers, logic gates, and larger blocks such as counters and adders/subtractors) to create FPGA-based systems. As FPGA devices became more complex and applications targeting them grew larger, schematics were gradually replaced by higher level methods involving hardware description languages like VHDL and Verilog. Now, with ever-higher FPGA gate densities and the proliferation of FPGA embedded processors, there is strong demand for even higher levels of abstraction. C represents that next generation of abstraction, allowing you to access the resources of FPGAs for application acceleration.

For applications that involve embedded processors, a C-to-hardware tool such as Impulse C (Figure 1) can abstract away many of the details of hardware-to-software communication, allowing you to focus on application partitioning without having to worry about the low-level details of the hardware. This also allows you to experiment with alternative software/hardware implementations.

Figure 1 – Impulse C custom hardware accelerators run in the FPGA fabric to accelerate µClinux processor-based applications.

Although such tools can dramatically improve your ability to create FPGA-based applications, for the highest performance you still need to understand certain aspects of the underlying hardware. In particular, you must understand how partitioning decisions and C coding styles will impact performance, size, and power usage. For example, the acceleration of critical computations and inner-code loops must be balanced against the expense of moving data between hardware and software. Fortunately, modern tools for FPGA compilation provide various types of analysis tools that can help you more clearly understand and respond to these issues.

Practically speaking, the initial results of software-to-hardware compilation from C-language descriptions will not equal the performance of hand-coded VHDL, but the turnaround time to get those first results working may be an order of magnitude better. Performance improvements occur iteratively, through an analysis of how the application is being compiled to the hardware and through the experimentation that C-language programming allows.

Graphical tools (see Figure 2) can help to provide initial estimates of algorithm throughput such as loop latencies and pipeline effective rates. Using such tools, you can interactively change optimization options or iteratively modify and recompile C code to obtain higher performance. Such design iterations may take only a matter of minutes when using C, whereas the same iterations may require hours or even days when using VHDL or Verilog.

Figure 2 – A dataflow graph allows C programmers to analyze the generated hardware and perform explorative optimizations to balance tradeoffs between size and speed. Illustrated in this graph is the final stage of a six-stage pipelined loop. This graph also helps C programmers understand how sequential C statements are parallelized and optimized.

Case Study: Accelerating an FFT

The Fast Fourier Transform (FFT) is an example of a DSP function that must accept sample data on its inputs and generate the resulting transformed values on its outputs. Using C-to-hardware tools, you can combine traditional C programming methods with hardware/software partitioning to create an accelerated DSP application. The FFT developed for this example is compatible with any Xilinx® FPGA target, and demonstrates that you can achieve results similar to hand-coded HDL without resorting to low-level programming methods.

Our FFT, illustrated in Figure 3, utilizes a 32-bit stream input, a 32-bit stream output, and two clocks, allowing the FFT to be clocked at a different rate than the embedded processor with which it communicates. The algorithm itself is described using relatively straightforward, hardware-independent C code, with some minor C-level optimizations for increased parallelism and performance.

Figure 3 – The FFT includes a 32-bit stream input, a 32-bit stream output, and two clocks, allowing the FFT to be clocked at a different rate than the embedded processor.

The FFT is a divide-and-conquer algorithm that is most easily expressed recursively. Of course, recursion is not possible on the FPGA, so the algorithm must be implemented using iteration instead. In fact, almost all software implementations are written iteratively (using a loop) for efficiency. Once the algorithm has been implemented as a loop, we are able to enable the automatic pipelining capabilities of the Impulse compiler.

Pipelining introduces a potentially high degree of parallelism in the generated logic, allowing us to achieve the best possible throughput. Our radix-4 FFT algorithm on 256 samples requires approximately 3,000 multiplications and 6,000 additions. Nonetheless, using the pipelining feature of Impulse C, we were able to generate hardware to compute the FFT in just 263 clock cycles.
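To make the recursion-to-iteration step concrete, the following is a minimal sketch of an iterative, in-place FFT written as plain nested loops. It is not the authors' code: it uses the simpler radix-2 decomposition rather than the radix-4 algorithm described in the article, and the names fft_iterative and bit_reverse, as well as the convention that the caller supplies a precomputed twiddle table (much as a hardware design might read twiddles from a ROM), are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reorder the samples into bit-reversed index order (n is a power of two). */
static void bit_reverse(float *re, float *im, size_t n)
{
    size_t j = 0;
    for (size_t i = 1; i < n; i++) {
        size_t bit = n >> 1;
        while (j & bit) {          /* carry propagation of a reversed increment */
            j ^= bit;
            bit >>= 1;
        }
        j ^= bit;
        if (i < j) {
            float t = re[i]; re[i] = re[j]; re[j] = t;
            t = im[i]; im[i] = im[j]; im[j] = t;
        }
    }
}

/* In-place radix-2 decimation-in-time FFT, loops only, no recursion.
 * tw_re/tw_im hold exp(-2*pi*i*k/n) for k = 0 .. n/2-1 (hypothetical
 * convention for this sketch; the caller precomputes the table). */
void fft_iterative(float *re, float *im, size_t n,
                   const float *tw_re, const float *tw_im)
{
    assert(n != 0 && (n & (n - 1)) == 0);   /* n must be a power of two */
    bit_reverse(re, im, n);
    for (size_t len = 2; len <= n; len <<= 1) {       /* one pass per stage */
        size_t half = len / 2;
        size_t step = n / len;                        /* stride into twiddle table */
        for (size_t base = 0; base < n; base += len) {
            for (size_t k = 0; k < half; k++) {       /* one butterfly */
                float wr = tw_re[k * step], wi = tw_im[k * step];
                size_t a = base + k, b = a + half;
                float tr = re[b] * wr - im[b] * wi;
                float ti = re[b] * wi + im[b] * wr;
                re[b] = re[a] - tr;  im[b] = im[a] - ti;
                re[a] += tr;         im[a] += ti;
            }
        }
    }
}
```

The innermost butterfly loop has a fixed body with no data-dependent control flow, which is the kind of loop a pipelining compiler can overlap across iterations. For a four-point transform the twiddle table is simply {1, -i}, so the input {1, 2, 3, 4} yields {10, -2+2i, -2, -2-2i}.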

We then integrated the resulting FFT hardware processing core into an embedded Linux (µClinux) application running on the Xilinx MicroBlaze™ soft-processor core. MicroBlaze µClinux is a free Linux-variant operating system ported at the University of Queensland and commercially supported by PetaLogix.

The software side of the application, running under the control of the operating system, interacts with the FFT through data streams to send and receive data and to initialize the hardware process. The streams themselves are defined using abstract communication methods provided in the Impulse C libraries. These stream communication functions include functions for opening and closing data streams and for reading and writing those streams. Other functions allow the size (width and depth) of the streams to be defined.

By using these functions on both the software and hardware sides of the application, it is easy to create applications in which hardware/software communication is abstracted through a software API. The Impulse compiler generates appropriate FIFO buffers and Fast Simplex Link (FSL) interconnections for the target platform, thereby saving you from the low-level hardware design that would otherwise be needed.
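The stream abstraction described above can be pictured as a bounded FIFO of 32-bit words with open, close, read, and write operations. The sketch below is a plain-C software model of that idea only. The type stream_t, the functions stream_open, stream_close, stream_write, and stream_read, and the depth of four words are all hypothetical names and values chosen for illustration; they are not the Impulse C API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STREAM_DEPTH 4   /* FIFO depth in words; arbitrary for this sketch */

typedef struct {
    uint32_t data[STREAM_DEPTH];  /* 32-bit wide storage */
    unsigned head;                /* index of the oldest word */
    unsigned count;               /* words currently buffered */
    bool     open;
} stream_t;

void stream_open(stream_t *s)
{
    assert(s != NULL);
    s->head = 0;
    s->count = 0;
    s->open = true;
}

void stream_close(stream_t *s)
{
    assert(s != NULL);
    s->open = false;
}

/* Queue one word. Returns false if the stream is closed or the FIFO is full. */
bool stream_write(stream_t *s, uint32_t word)
{
    if (!s->open || s->count == STREAM_DEPTH)
        return false;
    s->data[(s->head + s->count) % STREAM_DEPTH] = word;
    s->count++;
    return true;
}

/* Dequeue the oldest word into *word. Returns false if the FIFO is empty.
 * Words written before a close can still be drained afterwards. */
bool stream_read(stream_t *s, uint32_t *word)
{
    if (s->count == 0)
        return false;
    *word = s->data[s->head];
    s->head = (s->head + 1) % STREAM_DEPTH;
    s->count--;
    return true;
}
```

In this model a full FIFO simply rejects the write; a blocking producer would instead wait until the consumer has drained a word, which is the back-pressure behavior a hardware FIFO provides between two processes.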

Embedded Linux Integration

The default Impulse C tool flow targets a standalone MicroBlaze software system. In some applications, however, a fully featured operating system like µClinux is required. Advantages of embedded Linux include a familiar development environment (applications may be prototyped on desktop Linux machines), a feature-rich set of networking and file storage capabilities, a tremendous array of existing software, and no per-unit distribution royalties.

The µClinux (pronounced “you-see-Linux”) operating system is a port of the open-source Linux version 2.4. The µClinux kernel is a compact operating system appropriate for a wide variety of 32-bit processor cores that lack a memory management unit (MMU). µClinux supports a huge range of microprocessor architectures, including the Xilinx MicroBlaze processor, and is deployed in millions of consumer and industrial embedded systems worldwide.

Integrating an Impulse C hardware core into µClinux is straightforward; the Impulse tools include support for µClinux and can generate the required hardware/software interfaces automatically, as well as a makefile and associated software libraries to implement the streaming and other functions mentioned previously. The Xilinx FSL hardware interface, combined with a freely available generic FSL device driver in the MicroBlaze µClinux kernel, makes the process of connecting the software application to the Impulse C hardware accelerator relatively easy.

    /* example 1 - simple use of ImpulseC-generated HW coprocessor and
     * Linux FSL driver
     */

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BUFSIZE 1024

    /* Application-dependent I/O routines, defined elsewhere */
    void get_input_data(unsigned int *buffer);
    void send_output_data(unsigned int *buffer);

    int main(void)
    {
        unsigned int buffer[BUFSIZE];

        /* Open the FSL device (Impulse HW coprocessor) */
        int fd = open("/dev/fslfifo0", O_RDWR);
        if (fd < 0) {
            perror("open /dev/fslfifo0");
            return 1;
        }

        while (1) {
            /* Get incoming data - application dependent */
            get_input_data(buffer);

            /* Send data to ImpulseC HW processor on FSL port */
            write(fd, buffer, BUFSIZE * sizeof(buffer[0]));

            /* Read the processed data back from the HW coprocessor */
            read(fd, buffer, BUFSIZE * sizeof(buffer[0]));

            /* Do something with the data - application dependent */
            send_output_data(buffer);
        }
    }

Figure 4 – Simple communication between µClinux applications and ImpulseC hardware using the generic FSL FIFO device driver

The generic FSL device driver maps the FSL ports onto regular Linux device nodes, named /dev/fslfifo0 through /dev/fslfifo7, with the numbers corresponding to the physical FSL channel ID.
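Because the node name is derived directly from the channel ID, an application can build the path at run time instead of hard-coding it. The helper below is a small sketch of that mapping; the function name fsl_device_path and the constant FSL_NUM_CHANNELS are hypothetical, and the limit of eight channels is taken from the node names listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define FSL_NUM_CHANNELS 8   /* /dev/fslfifo0 .. /dev/fslfifo7, per the text */

/* Write the device node path for an FSL channel into buf.
 * Returns 0 on success, -1 if the channel ID is out of range
 * or the buffer is too small to hold the path. */
int fsl_device_path(int channel, char *buf, size_t len)
{
    assert(buf != NULL);
    if (channel < 0 || channel >= FSL_NUM_CHANNELS)
        return -1;
    int n = snprintf(buf, len, "/dev/fslfifo%d", channel);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting string can be passed straight to open(), so selecting a different accelerator becomes a matter of changing one integer.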

The FIFO semantics of the FSL channels map naturally onto the standard Linux software FIFO model, and to the streaming programming model of Impulse C. An FSL port may be opened, read, or written to, just like a normal file. Here is a simple example that shows how easily a software application can interface to a hardware co-processing core through the FSL interconnect (Figure 4).
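One detail worth keeping in mind when treating a FIFO device like a file: POSIX read() and write() are allowed to transfer fewer bytes than requested, and can be interrupted by signals. The article does not say whether the generic FSL driver ever returns short counts, so the following is a defensive sketch rather than a description of that driver. The helper names write_full and read_full are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write exactly len bytes, retrying after short writes and EINTR.
 * Returns 0 on success, -1 on error. */
int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before any data moved */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes, retrying after short reads and EINTR.
 * Returns 0 on success, -1 on error or if end-of-file arrives early. */
int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;             /* end-of-file before the buffer was filled */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

These helpers take the same arguments as read() and write(), so they can stand in for those calls wherever a whole buffer must be sent to or received from the coprocessor.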

You can easily modify this basic structure to further exploit the parallelism available. One easy performance improvement is to overlap I/O and computation, using a double-buffering approach (Figure 5).

From these basic building blocks, you are ready to tune and optimize your application. For example, it becomes a simple matter to instantiate a second FFT core in the system, connect it to the MicroBlaze processor, and integrate it into an embedded Linux application.

An interesting benefit of the embedded Linux integration approach is that it allows developers to take advantage of all that Linux has to offer. For example, with the FFT core mapped onto FSL channel 0, we can use MicroBlaze Linux shell commands to drive and test the core:

$ cat input.dat > /dev/fslfifo0 & cat /dev/fslfifo0 > output.dat

Linux symbolic links permit us to alias the device names onto something more user-friendly:

$ ln -s /dev/fslfifo0 fft_core

$ cat input.dat > fft_core & cat fft_core > output.dat

Conclusion

Although our example demonstrates how you can accelerate a single embedded application using one FSL-attached accelerator, Xilinx Platform Studio tools also permit multiple MicroBlaze CPUs to be instantiated in the same system, on the same FPGA. By connecting these CPUs with FSL channels and employing the generic FSL device driver architecture, it becomes possible to create a small-scale, single-chip multiprocessor system with fast inter-processor communication. In such a system, each CPU may have one or more hardware acceleration modules (generated using Impulse C), providing a balanced and scalable multi-processor hybrid architecture. The result is, in essence, a single-chip, hardware-accelerated cluster computer.

To discover what reconfigurable cluster-on-chip technology combined with C-to-hardware compilation can do for your application, visit www.petalogix.com and www.impulsec.com.


    /* example 2 - Overlapping communication and computation to exploit
     * parallelism
     */

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BUFSIZE 1024

    /* Application-dependent I/O routines, defined elsewhere */
    void get_input_data(unsigned int *buffer);
    void send_output_data(unsigned int *buffer);

    int main(void)
    {
        unsigned int buffer1[BUFSIZE], buffer2[BUFSIZE];
        unsigned int *buf1 = buffer1;
        unsigned int *buf2 = buffer2;
        unsigned int *tmp;

        /* Open the FSL device (Impulse HW coprocessor) */
        int fd = open("/dev/fslfifo0", O_RDWR);
        if (fd < 0) {
            perror("open /dev/fslfifo0");
            return 1;
        }

        /* Get incoming data - application dependent */
        get_input_data(buf1);

        while (1) {
            /* Send data to ImpulseC HW processor on FSL port */
            write(fd, buf1, BUFSIZE * sizeof(buf1[0]));

            /* Read more data while HW coprocessor is working */
            get_input_data(buf2);

            /* Read the processed data back from the HW processor */
            read(fd, buf1, BUFSIZE * sizeof(buf1[0]));

            /* Do something with the data - application dependent */
            send_output_data(buf1);

            /* Swap buffers */
            tmp = buf1;
            buf1 = buf2;
            buf2 = tmp;
        }
    }

Figure 5 – Overlapping communication and computation for greater system throughput

Virtex™-4 SX 35 XtremeDSP™ Development Kit for Digital Communication Applications

Creating extremely high-performance digital communications signal-processing solutions can present significant challenges in both design complexity and time to market. The XtremeDSP™ Development Platform from Xilinx provides a complete development solution, so your designs will be faster, easier, and earlier to market.

Virtex-4 SX FPGAs feature up to 512 XtremeDSP slices, each capable of running at 500 MHz. This performance makes them the ideal co-processors for your DSP processors and the best way to increase your system performance by several orders of magnitude.

The XtremeDSP Development Platform, together with the Xilinx System Generator for DSP software and Xilinx DSP IP algorithms, provides the ideal development environment for developing Virtex-II Pro based signal-processing designs.

Your Complete Development Platform

Developed with Nallatech, the Virtex-4 SX XtremeDSP Development Platform offers everything you need to create high-performance signal-processing designs more quickly and efficiently.

• Exceptional Performance – The dual-channel, high-performance ADCs and DACs, coupled with a user-programmable Virtex-4 SX-10 FPGA, make this platform ideal for implementing high-performance digital communication systems such as Software Defined Radios. The SX 35 FPGA features over 55,000 logic cells and 192 XtremeDSP slices.

• Ease of Use – Combining the Xilinx System Generator for DSP software tool and the XtremeDSP Development Kit provides an easy transition to using FPGAs for high-performance signal processing, from algorithm concept to hardware verification. The System Generator tool interfaces with MATLAB®/Simulink® and enables you to perform hardware co-simulation on the XtremeDSP Development Platform via PCI or JTAG. This provides simulation acceleration by an order of magnitude and allows you to debug and verify the design on the FPGA.

• Comprehensive Support – Reduce your time to knowledge with the Xilinx DSP Design Flow and DSP Implementation Techniques courses. You can also take advantage of senior DSP support engineer expertise on the Xilinx Hotline.

Hardware co-simulation with the XtremeDSP Platform and Xilinx System Generator for DSP


Finish Faster with Xilinx DSP Design Solutions

Hardware Platform Specifications
• XtremeDSP development board consisting of a motherboard (“BenONE-Kit Motherboard”) populated with a daughter card (“BenADDA DIME-II Module”).

BenONE-Kit Motherboard
• Supports the supplied BenADDA DIME-II module only
• Spartan-II™ FPGA for 3.3V/5V PCI or USB interface
• Host interfacing via 3.3V/5V PCI 32-bit/33-MHz or USB v1.1 interfaces
• Status LEDs
• JTAG configuration headers
• User 0.1-inch pitch pin headers connected directly to user programmable FPGA I/O

BenADDA DIME-II Module
• Virtex-4 SX 35 user FPGA: XC4VSX35
• Two independent ADC channels: AD6645 ADC (14 bits up to 105 MSPS)
• Two independent DAC channels: AD9772 DAC (14 bits up to 160 MSPS)
• Support for external clock, on-board oscillator, and programmable clocks
• Two banks of ZBT-SRAM (133 MHz, 512 Kx32 bits per bank)
• Multiple clocking options: internal and external
• Status LEDs

Also Included with the XtremeDSP Platform
• External power supply (US Mains cable with separate UK, European, or Australian Mains adapters)
• Wide-ranging input (90-264 Vac), multiple-output power supply generating +5 Volts @ 5A, +12 Volts @ 2A, and -12 Volts @ 800mA
• USB v1.1-compatible cable, two meters long
• Five MCX-to-BNC cables for connecting to the ADC/DAC and external clock connectors
• PCI back-plate and two screws
• 2x BNC jack-to-jack adapters for use in loop-back configurations
• Large carrying case

XtremeDSP Installation Pack
• Nallatech FUSE Software CD: enables control and configuration of FPGAs and provides tools to transfer data between the Kit and a host PC via a GUI or a C-based API

Applications
This multi-purpose board can be used for many digital communications applications including:
• Narrow-band systems (QAM demodulation, carrier timing recovery, channel coding)
• Spread-spectrum systems (e.g. chip rate processing, RACH, path profiling, TCC)
• Multi-carrier systems (e.g. OFDM, MIMO, TCC)
• And many more.

Take the Next Step
Purchase your XtremeDSP Platform at www.xilinx.com/store. For more information, visit www.xilinx.com/dsp. To learn more about the complete Nallatech platform offering, visit www.nallatech.com.

Price: $2,495

Corporate Headquarters

Xilinx, Inc.

2100 Logic Drive

San Jose, CA 95124

Tel: (408) 559-7778

Fax: (408) 559-7114

Web: www.xilinx.com

European Headquarters

Xilinx

Citywest Business Campus

Saggart,

Co. Dublin

Ireland

Tel: +353-1-464-0311

Fax: +353-1-464-0324

Web: www.xilinx.com

Japan

Xilinx, K.K.

Shinjuku Square Tower 18F

6-22-1 Nishi-Shinjuku

Shinjuku-ku, Tokyo

163-1118, Japan

Tel: 81-3-5321-7711

Fax: 81-3-5321-7765

Web: www.xilinx.co.jp

Asia Pacific

Xilinx Asia Pacific Pte. Ltd.

No. 3 Changi Business Park Vista, #04-01

Singapore 486051

Tel: (65) 6544-8999

Fax: (65) 6789-8886

RCB no: 20-0312557-M

Web: www.xilinx.com

The Programmable Logic CompanySM

Distributed By:

© 2005 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

Figure – DIME-II module functional diagram (on-module FPGA with two AD6645 ADC channels, two AD9772A DAC channels, two banks of ZBT memory, clock management, and communication links to the DIME-II motherboard)


DSP10000-8-ILT (v1.0) Course Specification

© 2006 Xilinx, Inc. All rights reserved. All Xilinx trademarks, registered trademarks, patents, and disclaimers are as listed at http://www.xilinx.com/legal.htm. All other trademarks and registered trademarks are the property of their respective owners. All specifications are subject to change without notice.

DSP Design Flow

Course Description
The DSP Design Flow course provides the advanced tools and expertise you need to develop advanced, low-cost DSP designs. This intermediate course in implementing DSP functions focuses on learning how to use System Generator for DSP, design implementation tools, HDL co-simulation, and hardware-in-the-loop verification. Through hands-on exercises, you will implement a design from algorithm concept to hardware verification by using Xilinx FPGA capabilities.

After completing this comprehensive training, you will have the necessary skills to:

• Describe the different design flows for implementing DSP functions, with a large focus on System Generator
• Identify Xilinx FPGA capabilities and know how to implement a design from algorithm concept to hardware simulation
• Implement a design from start to finish by using System Generator
• Perform hardware-in-the-loop and HDL co-simulations and improve productivity
• Integrate the ChipScope Pro block in a design and analyze the design
• Develop a hardware co-simulation model using System Generator Board Description Builder
• Integrate a System Generator design as a peripheral in a MicroBlaze™ processor-based system
• Utilize timing analyzer block to improve design performance

Course Outline
Note: Target architectures include Virtex™-4, Virtex-II Pro, and Spartan™-3E FPGAs.

Day 1
• Introduction
• DSP Design Flows in FPGAs
• Lab 1: Creating a 12 x 8 MAC Using the Xilinx System Generator
• Digital Filtering
• Lab 2: Designing a FIR Filter
• HDL Co-Simulation
• Lab 3: MAC FIR Filter Verification Using Simultaneous Co-Simulations

Day 2
• Looking Under the Hood
• Lab 4: Looking Under the Hood
• Controlling the System
• Lab 5: Controlling the System
• Multirate Systems
• Lab 6: Designing a MAC-Based FIR Using the DSP48 Slice

Day 3
• Advanced Features
• Lab 7: Integrating the ChipScope Pro Analyzer
• Lab 8: A System Generator Design as an XPS Peripheral
• Lab 9: Multiple Clock Domains Design Using Shared Memories
• Lab 10: Improving Design Performance Using Timing Analyzer
• Lab 11: Designing Using the PicoBlaze™ MicroController
• Lab 12: Creating Parametric Designs

Lab Descriptions This lab-intensive class gives you hands-on experience by using System Generator for DSP to visualize, simulate, verify, and implement DSP algorithms in Xilinx FPGAs. The labs start at a descriptive level and build on each other. You should expect each successive lesson’s challenges to increase. In addition, the labs included in the Advanced Features module provide you experience with other tools such as the ChipScope Pro analyzer and the Embedded Development Kit. System Generator for DSP 8.1 features are identified, including hardware and software co-simulation verification.

Register Today
Xilinx delivers public and private courses in locations throughout the world. Please contact Xilinx Education Services for more information, to view schedules, or to register online.

Visit www.xilinx.com/education, and click on the region where you want to attend a course.

North America, send your inquiries to [email protected], or contact the registrar at 877-XLX-CLAS (877-959-2527). To register online, search by Keyword "DSP" in the Training Catalog at https://xilinx.onsaba.net/xilinx.

Europe, send your inquiries to [email protected], call +44-870-7350-548, or send a fax to +44-870-7350-620.

Asia Pacific, contact our training providers at: www.xilinx.com/support/training/asia-learning-catalog.htm, send your inquiries to [email protected], or call: +852-2424-5200.

Japan, see the Japanese training schedule at: www.xilinx.co.jp/support/training/japan-learning-catalog.htm, send your inquiries to [email protected], or call: +81-3-5321-7772.

You must have your tuition payment information available when you enroll. We accept credit cards (Visa, MasterCard, or American Express) as well as purchase orders and training credits.

Level – Intermediate
Course Duration – 3 days
Price – $1500 USD or 15 Training Credits
Course Part Number – DSP10000-8-ILT
Who Should Attend? – System engineers/designers, logic designers, and experienced hardware engineers who are implementing DSP algorithms using MathWorks MATLAB and Simulink and using Xilinx System Generator for DSP

Prerequisites
• Fundamentals of MATLAB/Simulink and Xilinx FPGAs
• Basics of digital signal processing theory for functions such as FIR (Finite Impulse Response) filters, oscillators and mixers, and FFT (Fast Fourier Transform) algorithms

Software Tools
• ISE™ 8.1i
• System Generator for DSP 8.1
• EDK 8.1
• ISIM Simulator 8.1
• ChipScope™ 8.1
• Mentor Graphics ModelSim PE 6.0c
• MATLAB with Simulink R14 SP1


DSP20000-7-ILT (v1.0) Course Specification

DSP Implementation Techniques for Xilinx FPGAs

Course Description
This course shows you how to take advantage of the features available in the Xilinx FPGA architecture, including the Virtex™-4 FPGA, and describes how DSP algorithms can be implemented efficiently. The techniques also demonstrate which decisions at the system level have the greatest impact on the implementation process and product costs.

• Describe how DSP algorithms can be implemented efficiently by using Xilinx FPGA technology
• Identify the capabilities and features of the various Xilinx FPGA families to implement efficient DSP algorithms
• Establish methods for the accurate estimation of silicon area consumption and cost
• Evaluate which algorithms are best suited for FPGA implementation and identify which algorithms are less desirable
• Assess how system-level decisions impact hardware implementation and how hardware implementation can enhance results at the system level

Course Outline

Day 1
• On the Same Wavelength
• Basic terminology and acronyms used in DSP design
• Sample rates and bit widths used in DSP applications
• DSP building blocks and processing requirements
• Some Bits About Numbers
• Numbering formats, range, and precision
• Mathematical operations using a variety of formats
• Tuning the Receiver
• Structure and Resources of Xilinx Devices
• Estimating DSP building block sizes

Day 2
• Tuning the Receiver (continued)
• Implementing the multiplication function
• Bit-width impact on system-level decisions
• Memories are Made of This
• Block versus distributed memory
• Memory aspect ratios and their manipulation
• SRL16E and the delay function
• Selective Filters
• FIR filter specifications and implementation
• Selecting a technique for a given specification
• Effects of halfband and interpolated filters

Day 3
• One Filter Does Not Make a System
• Options to be considered with multiple channels
• Interpolation and decimation
• Rate changing and its effect on FIR filter choice
• Filtering algorithms that exploit device architecture
• Importance of connectivity versus isolated functions
• Do Not Block the Datapath
• Numeric controlled oscillators and mixers
• Strategies for FFT implementation
• Achieving bandwidth requirements of the FFT
• Using the FPGA as an efficient co-processor

Course Exercises
• MAC Rates and Memory Requirements
• Constructing a 128-Tap FIR Filter
• Fractional Number Formats
• Twos Complement Arithmetic
• Summation by Addition Tree
• Summation by Addition Chain
• Full Adder: How Many Slices?
• Summation Structure Sizes
• Serial Summation Structure
• 8-Bit by 12-Bit Multiplier
• KCM Multipliers
• Distributed RAM for FIFO
• Size Estimates for Delay Structures
• Using the SRL16E as a FIFO
• Creating Larger RAM Structures
• Selecting a MAC FIR Technique
• Parallel FIR Filter Size
• Symmetry, Interpolation, and Phases
• Decimation Filter
• “fs/4” Mixing and Decimation
• Designing a Numeric Controlled Oscillator (NCO)
• FFT: Benchmarks and Transform Time
• Collection Time = Processing Time
• 128-Point FFT in 1.28 µs

Level – Advanced
Course Duration – 3 days
Price – $1800 USD or 18 Training Credits
Course Part Number – DSP20000-7-ILT
Who Should Attend? – Engineers and designers who have an interest in developing products that use digital signal processing

Prerequisites
A fundamental understanding of digital signal processing theory, including an understanding of the following principles:
• Sample rates
• Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters
• Oscillators and mixers
• Fast Fourier Transform (FFT) algorithm

© 2005 Xilinx, Inc. All rights reserved. All Xilinx trademarks, registered trademarks, patents, and disclaimers are as listed at http://www.xilinx.com/legal.htm. All other trademarks and registered trademarks are the property of their respective owners. All specifications are subject to change without notice.








RIO22000-8-ILT (v2.0) Course Specification


Designing with Multi-Gigabit Serial I/O

Course Description Learn how to employ RocketIO™ MGT serial transceivers in yourVirtex™-II Pro design. Understand and utilize the features of the RocketIO transceiver blocks, such as CRC, 8b/10b encoding, channel bonding, clock correction, and comma detection. Additional highlighted topics include debugging techniques, use of the Architecture Wizard, synthesis and implementation considerations, and standards compliance. This course balances lecture modules and practical hands-on labs.


� Effectively use all of the advanced RocketIO features, such as CRC, channel bonding, clock correction, comma detection, 8b/10b encoding/decoding, programmable termination, and pre-emphasis

� Utilize the ports and attributes of RocketIO transceivers that control the RocketIO features

� Use the Architecture Wizard to instantiate RocketIO primitives in your design

� Achieve compatibility with high-speed I/O standards by using RocketIO transceivers

Course Outline

Day 1 � Introduction

� Clocking and Resets

� 8b/10b Encoder and Decoder Details

� Lab 1: 8b/10b Disparity and Bypass Lab

� Commas and Deserializer Alignment Details

� Lab 2: Commas and K-Characters Lab

� Cyclical Redundancy Check Details

� Lab 3: Cyclical Redundancy Check Lab

� Clock Correction Details

� Lab 4: Clock Correction Lab

Day 2 � Channel Bonding Details

� Lab 5: Channel Bonding Lab

� Architecture Wizard Overview

� Implementing a RocketIO Design

� Lab 6: Synthesis and Implementation Lab

� IP Overview: Aurora Reference Design

� Lab 7: Aurora Protocol Engine Lab

� Common Serial I/O Standards Compliance

� Physical Media Attachment Overview

Lab Descriptions

• Lab 1: 8b/10b Disparity and Bypass Lab – Utilize the 8b/10b encoder/decoder and manipulate running disparity. Learn how to bypass the 8b/10b encoder/decoder.
• Lab 2: Commas and K-Characters Lab – Use programmable comma detection to align a serial data stream.
• Lab 3: CRC Lab – Modify a design to use the CRC feature for both the user mode and the Fibre Channel mode of CRC.
• Lab 4: Clock Correction Lab – Utilize the clock correction logic to compensate for frequency differences between the TX and RX sides of a link.
• Lab 5: Channel Bonding Lab – Modify a design to use two transceivers bonded together to form one virtual channel.
• Lab 6: Synthesis and Implementation Lab – Use the Architecture Wizard to instantiate RocketIO primitives, synthesize a design, and implement the design.
• Lab 7: Aurora Protocol Engine Lab – Use the Aurora reference design to send and receive data.
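As a rough illustration of the ideas behind Labs 1 and 2, the sketch below models comma alignment and code-group disparity in plain Python. It is not taken from the course material and does not model the RocketIO hardware or its ports. The K28.5 encodings and the 7-bit comma sequences are the standard 8b/10b values; the function names (`find_comma_offset`, `align`, `disparity`) and the use of bit strings are assumptions made for the example.

```python
from typing import List

# Illustrative software model only; this is not the RocketIO comma-detection logic.

# K28.5 as 10-bit code groups, written in transmission order (bits a b c d e i f g h j).
K28_5_RD_NEG = "0011111010"  # sent when the running disparity is negative
K28_5_RD_POS = "1100000101"  # sent when the running disparity is positive

# The 7-bit comma sequences that begin those two code groups.
COMMAS = ("0011111", "1100000")


def find_comma_offset(bits: str) -> int:
    """Return the index of the first comma in a bit string, or -1 if none is found."""
    hits = [bits.find(comma) for comma in COMMAS]
    hits = [h for h in hits if h >= 0]
    return min(hits) if hits else -1


def align(bits: str) -> List[str]:
    """Split a bit string into 10-bit code groups, starting at the first comma."""
    start = find_comma_offset(bits)
    if start < 0:
        return []  # no comma seen, so the group boundary is unknown
    aligned = bits[start:]
    usable = len(aligned) - len(aligned) % 10  # drop a trailing partial group
    return [aligned[i:i + 10] for i in range(0, usable, 10)]


def disparity(group: str) -> int:
    """Ones minus zeros in one code group; valid 8b/10b groups give -2, 0, or +2."""
    return group.count("1") - group.count("0")
```

For example, with the hypothetical stream `"101" + K28_5_RD_NEG + "1010101010" + "11"`, `find_comma_offset` returns 3 and `align` returns the two complete code groups, discarding the three leading bits and the trailing partial group. A hardware deserializer does the same job by shifting its word boundary once a comma is detected.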



North America, send your inquiries to [email protected], or contact the registrar at 877-XLX-CLAS (877-959-2527). To register online, search by Keyword "High-Speed" in the Training Catalog at https://xilinx.onsaba.net/xilinx.

Europe, send your inquiries to [email protected], call +44-870-7350-548 or send a fax to +44-870-7350-620.




Level – Intermediate
Course Duration – 2 days
Price – $1000 USD or 10 Training Credits
Course Part Number – RIO22000-8-ILT
Who Should Attend? – FPGA designers and logic designers

Prerequisites

• Verilog or VHDL experience (or the Introduction to Verilog or the Introduction to VHDL course)
• Synthesis and simulation experience
• FPGA design experience or the Fundamentals of FPGA Design course
• Knowledge of high-speed serial I/O protocols and standards (SONET, Gigabit Ethernet, InfiniBand) is a plus

Software Tools

• ISE 8.1i
• ModelSim PE 6.0


A series of compelling, highly technical DSP product demonstrations, presented by Xilinx DSP experts, is now available on-line. These comprehensive DSP videos provide excellent, step-by-step tutorials and quick refreshers on a wide array of key topics. The videos are segmented into short chapters to respect your time and make for easy viewing.

Ready for viewing, anytime you are

A complete on-line archive is easily accessible at your fingertips. Also, a free DVD containing all of the demonstrations is available at www.xilinx.com/dod.

Order yours today!

©2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

FREE on-line training with Demos On Demand

Pb-free devices available now

www.xilinx.com/dod

• DSP Video Starter Kit
• DSP Video Co-processing Kit
• System Generator for DSP
• Algorithm to Hardware in 60 Minutes
• XtremeDSP Slice
• FPGAs for Signal Processing
• Designing QAM Demodulators
• Spectrum Channelization

Turbocharge your DSP performance

Achieve high-definition, higher frame rates or multiple video streams

When complementing a TI DSP, Xilinx XtremeDSP co-processing offers the performance, versatility, and economy for today’s high-end video and imaging applications. Whether it’s high-definition, motion estimation, video scaling, or any number of compute-intensive functions, a Xilinx Virtex-4 or Spartan-3/3E FPGA can boost your DSP performance. XtremeDSP co-processing delivers higher resolution, higher frame rate video processing than a standalone DSP processor, plus the ability to handle multiple video streams.

Reduce power and cost per channel in wireless systems

For implementing custom wireless functions, such as multi-carrier crest factor reduction (CFR), digital pre-distortion (DPD), MIMO and other advanced antenna processing, our FPGAs lower your costs and power per channel. With up to 256 GMAC/s performance, you have the advantage of offloading compute-intensive tasks from a TI DSP to a Xilinx FPGA while increasing channel density in your wireless system.

Visit us at www.xilinx.com/dsp/coprocessing to learn more about the highest-performance DSP in the industry, and download your FREE evaluation copy of XtremeDSP software.

www.xilinx.com/dsp/coprocessing

© 2005 Xilinx, Inc., All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

PN 0010925
