EURASIP Journal on Embedded Systems

Design and Architectures for Signal and Image Processing

Guest Editors: Markus Rupp, Ahmet T. Erdogan, and Bertrand Granado

Copyright © 2009 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2009 of “EURASIP Journal on Embedded Systems.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Editor-in-Chief
Zoran Salcic, University of Auckland, New Zealand

Associate Editors

Sandro Bartolini, Italy
Neil Bergmann, Australia
Shuvra Bhattacharyya, USA
Ed Brinksma, The Netherlands
Paul Caspi, France
Liang-Gee Chen, Taiwan
Dietmar Dietrich, Austria
Stephen A. Edwards, USA
Alain Girault, France
Rajesh K. Gupta, USA

Thomas Kaiser, Germany
Bart Kienhuis, The Netherlands
Chong-Min Kyung, Korea
Miriam Leeser, USA
John McAllister, UK
Koji Nakano, Japan
Antonio Nunez, Spain
Sri Parameswaran, Australia
Zebo Peng, Sweden
Marco Platzner, Germany

Marc Pouzet, France
S. Ramesh, India
Partha S. Roop, New Zealand
Markus Rupp, Austria
Asim Smailagic, USA
Leonel Sousa, Portugal
Jarmo Henrik Takala, Finland
Jean-Pierre Talpin, France
Jurgen Teich, Germany
Dongsheng Wang, China

Contents

Design and Architectures for Signal and Image Processing, Markus Rupp, Ahmet T. Erdogan, and Bertrand Granado
Volume 2009, Article ID 674308, 3 pages

Multicore Software-Defined Radio Architecture for GNSS Receiver Signal Processing, Heikki Hurskainen, Jussi Raasakka, Tapani Ahonen, and Jari Nurmi
Volume 2009, Article ID 543720, 10 pages

An Open Framework for Rapid Prototyping of Signal Processing Applications, Maxime Pelcat, Jonathan Piat, Matthieu Wipliez, Slaheddine Aridhi, and Jean-Francois Nezan
Volume 2009, Article ID 598529, 13 pages

Run-Time HW/SW Scheduling of Data Flow Applications on Reconfigurable Architectures, Fakhreddine Ghaffari, Benoit Miramond, and Francois Verdier
Volume 2009, Article ID 976296, 13 pages

Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes, Massimo Rovini, Giuseppe Gentile, Francesco Rossi, and Luca Fanucci
Volume 2009, Article ID 723465, 15 pages

Comments on “Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes”, Kiran K. Gunnam, Gwan S. Choi, and Mark B. Yeary
Volume 2009, Article ID 704174, 3 pages

Reply to “Comments on Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes”, Massimo Rovini, Giuseppe Gentile, Francesco Rossi, and Luca Fanucci
Volume 2009, Article ID 635895, 2 pages

OLLAF: A Fine Grained Dynamically Reconfigurable Architecture for OS Support, Samuel Garcia and Bertrand Granado
Volume 2009, Article ID 574716, 11 pages

Trade-Off Exploration for Target Tracking Application in a Customized Multiprocessor Architecture, Jehangir Khan, Smail Niar, Mazen A. R. Saghir, Yassin El-Hillali, and Atika Rivenq-Menhaj
Volume 2009, Article ID 175043, 21 pages

A Prototyping Virtual Socket System-On-Platform Architecture with a Novel ACQPPS Motion Estimator for H.264 Video Encoding Applications, Yifeng Qiu and Wael Badawy
Volume 2009, Article ID 105979, 20 pages

FPSoC-Based Architecture for a Fast Motion Estimation Algorithm in H.264/AVC, Obianuju Ndili and Tokunbo Ogunfunmi
Volume 2009, Article ID 893897, 16 pages

FPGA Accelerator for Wavelet-Based Automated Global Image Registration, Baofeng Li, Yong Dou, Haifang Zhou, and Xingming Zhou
Volume 2009, Article ID 162078, 10 pages

A System for an Accurate 3D Reconstruction in Video Endoscopy Capsule, Anthony Kolar, Olivier Romain, Jade Ayoub, David Faura, Sylvain Viateur, Bertrand Granado, and Tarik Graba
Volume 2009, Article ID 716317, 15 pages

Performance Evaluation of UML2-Modeled Embedded Streaming Applications with System-Level Simulation, Tero Arpinen, Erno Salminen, Timo D. Hamalainen, and Marko Hannikainen
Volume 2009, Article ID 826296, 16 pages

Cascade Boosting-Based Object Detection from High-Level Description to Hardware Implementation, K. Khattab, J. Dubois, and J. Miteran
Volume 2009, Article ID 235032, 12 pages

Very Low-Memory Wavelet Compression Architecture Using Strip-Based Processing for Implementation in Wireless Sensor Networks, Li Wern Chew, Wai Chong Chia, Li-minn Ang, and Kah Phooi Seng
Volume 2009, Article ID 479281, 16 pages

Data Cache-Energy and Throughput Models: Design Exploration for Embedded Processors, Muhammad Yasir Qadri and Klaus D. McDonald-Maier
Volume 2009, Article ID 725438, 7 pages

Hardware Architecture for Pattern Recognition in Gamma-Ray Experiment, Sonia Khatchadourian, Jean-Christophe Prevotet, and Lounis Kessal
Volume 2009, Article ID 737689, 15 pages

Evaluation and Design Space Exploration of a Time-Division Multiplexed NoC on FPGA for Image Analysis Applications, Linlin Zhang, Virginie Fresse, Mohammed Khalid, Dominique Houzet, and Anne-Claire Legrand
Volume 2009, Article ID 542035, 15 pages

Efficient Processing of a Rainfall Simulation Watershed on an FPGA-Based Architecture with Fast Access to Neighbourhood Pixels, Lee Seng Yeong, Christopher Wing Hong Ngau, Li-Minn Ang, and Kah Phooi Seng
Volume 2009, Article ID 318654, 19 pages

Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 674308, 3 pages
doi:10.1155/2009/674308

Editorial

Design and Architectures for Signal and Image Processing

Markus Rupp (EURASIP Member),1 Ahmet T. Erdogan,2 and Bertrand Granado3

1 Institute of Communications and Radio-Frequency Engineering (INTHFT), Vienna University of Technology, 1040 Vienna, Austria
2 The School of Engineering and Electronics, The University of Edinburgh, Edinburgh EH9 3JL, UK
3 ENSEA, Cergy-Pontoise University, Boulevard du Port, 95011 Cergy-Pontoise Cedex, France

Correspondence should be addressed to Markus Rupp, [email protected]

Received 8 December 2009; Accepted 8 December 2009

Copyright © 2009 Markus Rupp et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This Special Issue of the EURASIP Journal on Embedded Systems is intended to present innovative methods, tools, design methodologies, and frameworks for the algorithm-architecture matching approach in the design flow, including system-level design and hardware/software codesign, RTOS, system modeling and rapid prototyping, system synthesis, design verification, and performance analysis and estimation.

Today, typical sequential design flows are in use, and they are reaching their limits due to:

(i) The complexity of today’s systems designed with the emerging submicron technologies for integrated circuit manufacturing,

(ii) The intense pressure on the design cycle time in order to reach shorter time-to-market and reduce development and production costs,

(iii) The strict performance constraints that have to be reached in the end, typically low and/or guaranteed application execution time, integrated circuit area, and overall system power dissipation.

Because in such a design methodology the system is seen as a whole, this special issue also covers the following topics:

(i) New and emerging architectures: SoC, MPSoC, configurable computing (ASIPs), (dynamically) reconfigurable systems using FPGAs

(ii) Smart sensors: audio and image sensors for high performance and energy efficiency

(iii) Applications: automotive, medical, multimedia, telecommunications, ambient intelligence, object recognition, and cryptography

(iv) Resource management techniques for real-time operating systems in a codesign framework

(v) Systems and architectures for real-time image processing

(vi) Formal models, transformations, and architectures for reliable embedded system design.

We received 30 submissions, of which we eventually accepted 17 for publication.

The paper entitled “Multicore software defined radio architecture for GNSS receiver signal processing” by H. Hurskainen et al. describes a multicore Software Defined Radio (SDR) architecture for Global Navigation Satellite System (GNSS) receiver implementation. Three GNSS SDR architectures are discussed: (1) a hardware-based SDR that is feasible for embedded devices but relatively expensive, (2) a pure SDR approach that has a high level of flexibility and a low bill of materials, but is not yet suited for handheld applications, and (3) a novel architecture that uses a programmable array of multiple processing cores that exhibits both flexibility and potential for mobile devices.

The paper entitled “An open framework for rapid prototyping of signal processing applications” by M. Pelcat et al. presents an open source Eclipse-based framework which aims to facilitate the exploration and development processes in this context. The framework includes a generic graph editor (Graphiti), a graph transformation library (SDF4J), and an automatic mapper/scheduler tool with simulation and code generation capabilities (PREESM). The input of the framework is composed of a scenario description and two graphs: one graph describes an algorithm and the second graph describes an architecture. As an example, a prototype for a 3GPP long-term evolution (LTE) algorithm on a multicore digital signal processor is built, illustrating both the features and the capabilities of this framework.

The paper entitled “Run-time HW/SW scheduling of data flow applications on reconfigurable architectures” by F. Ghaffari et al. presents an efficient dynamic and run-time hardware/software scheduling approach. This scheduling heuristic consists in mapping online the different tasks of a highly dynamic application in such a way that the total execution time is minimized. The scheduling method is applied to several image processing applications. The presented experiments include simulation and synthesis results on a Virtex V-based platform. These results show better performance than existing methods.

The paper entitled “Techniques and architectures for hazard-free semiparallel decoding of LDPC codes” by M. Rovini et al. describes three different techniques to properly reschedule the decoding updates, based on the careful insertion of “idle” cycles, to prevent the hazards of the pipeline mechanism in LDPC decoding. Alongside these, different semiparallel architectures of a layered LDPC decoder suitable for use with such techniques are analyzed. Taking the LDPC codes for the wireless local area network (IEEE 802.11n) as a case study, a detailed analysis of the performance attained with the proposed techniques and architectures is reported, and results of the logic synthesis on a 65 nm low-power CMOS technology are shown.

The paper entitled “OLLAF: a fine grained dynamically reconfigurable architecture for OS support” by S. Garcia and B. Granado presents OLLAF, a fine grained dynamically reconfigurable architecture (FGDRA), specially designed to efficiently support an OS. The studies presented here show the contribution of this architecture in terms of hardware context management and preemption support, as well as the gain that can be obtained by using OLLAF instead of a classical FPGA in terms of context management and preemption overhead.

The paper entitled “Trade-off exploration for target tracking application in a customized multiprocessor architecture” by J. Khan et al. presents the design of an FPGA-based multiprocessor-system-on-chip (MPSoC) architecture optimized for multiple target tracking (MTT) in automotive applications. The paper explains how the MTT application is designed and profiled to partition it among different processors. It also explains how different optimizations were applied to customize the individual processor cores to their assigned tasks and to assess their impact on performance and FPGA resource utilization, resulting in a complete MTT application running on an optimized MPSoC architecture that fits in a contemporary medium-sized FPGA and that meets the real-time constraints of the given application.

The paper entitled “A prototyping virtual socket system-on-platform architecture with a novel ACQPPS motion estimator for H.264 video encoding applications” by Y. Qiu and W. M. Badawy presents a novel adaptive crossed quarter polar pattern search (ACQPPS) algorithm to realize an enhanced inter prediction for H.264. Moreover, an efficient prototyping system-on-platform architecture is also presented, which can be utilized for a realization of an H.264 baseline profile encoder with the support of an integrated ACQPPS motion estimator and related video IP accelerators. The implementation results show that the ACQPPS motion estimator can achieve very high estimated image quality, comparable to that from the full search method in terms of peak signal-to-noise ratio (PSNR), while keeping the complexity at an extremely low level.

The paper entitled “FPSoC-based architecture for a fast motion estimation algorithm in H.264/AVC” by O. Ndili and T. Ogunfunmi presents an architecture based on a modified hybrid fast motion estimation (FME) algorithm. The presented results show that the modified hybrid FME algorithm outperforms previous state-of-the-art FME algorithms, while its losses compared with FSME (full search motion estimation), in terms of PSNR performance and computation time, are insignificant.

The paper entitled “FPGA accelerator for wavelet-based automated global image registration” by B. Li et al. presents an architecture for wavelet-based automated global image registration (WAGIR), which is fundamental for most remote sensing image processing algorithms and extremely computation intensive. They propose a block wavelet-based automated global image registration (BWAGIR) architecture based on a block resampling scheme. The architecture with 1 processing unit outperforms the CL cluster system with 1 node by at least 7.4X, and the MPM massively parallel machine with 1 node by at least 3.4X. The BWAGIR with 5 units achieves a speedup of about 3X against the CL with 16 nodes, and a speed comparable with the MPM with 30 nodes.

The paper entitled “A system for an accurate 3D reconstruction in video endoscopy capsule” by A. Kolar et al. presents the hardware and software development of a wireless multispectral vision sensor which allows transmitting a 3D reconstruction of a scene in real time. The paper also presents a method to acquire the images at a 25 frames/s video rate with discrimination between the texture and the projected pattern. This method uses an energetic approach, a pulsed projector, and an original 64 × 64 CMOS image sensor with programmable integration time. Multiple images are taken with different integration times to obtain an image of the pattern which is more energetic than the background texture. Also presented is a 3D reconstruction processing that allows a precise and real-time reconstruction. This processing, which is specifically designed for an integrated sensor, and its integration in an FPGA-like device have a low power consumption compatible with a VCE examination. The paper presents experimental results with the realization of a large-scale demonstrator using an SOPC prototyping board.

The paper entitled “Performance evaluation of UML2-modeled embedded streaming applications with system-level simulation” by T. Arpinen et al. presents an efficient method to capture an abstract performance model of a streaming data real-time embedded system (RTES). This method uses an MDA (model driven architecture) approach. The goal of the performance modeling and simulation is to achieve early estimates on PE, memory, and on-chip network utilization, task response times, and other information that is used for design-space exploration. UML2 is used for performance model specification. The application workload modeling is carried out using UML2 activity diagrams. The platform is described with structural UML2 diagrams and model elements annotated with performance values. The focus here is on modeling streaming data applications. It is characteristic of streaming applications that a long sequence of data items flows through a stable set of computation steps (tasks) with only occasional control messaging and branching.

The paper entitled “Cascade boosting-based object detection from high-level description to hardware implementation” by K. Khattab et al. presents an implementation of boosting-based object detection algorithms, which are considered the fastest accurate object detection algorithms today, though their implementation in a real-time solution is still a challenge. A new parallel architecture, which exploits the parallelism and the pipelining in these algorithms, is proposed. The method to develop this architecture was based on a high-level SystemC description. SystemC enables PC simulation that allows simple and fast testing and leaves the structure open to any kind of hardware or software implementation, since SystemC is independent of all platforms.

The paper entitled “Very low memory wavelet compression architecture using strip-based processing for implementation in wireless sensor networks” by L. W. Chew et al. presents a hardware architecture for strip-based image compression using the SPIHT algorithm. The lifting-based 5/3 DWT, which supports a lossless transformation, is used in the proposed work. The wavelet coefficients output from the DWT module are stored in a strip buffer in a predefined location using a new 1D addressing method for SPIHT coding. In addition, a proposed modification of the traditional SPIHT algorithm is also presented. In order to improve the coding performance, a degree-0 zerotree coding methodology is applied during the implementation of SPIHT coding. To facilitate the hardware implementation, the proposed SPIHT coding eliminates the use of lists in its set-partitioning approach and is implemented in two passes. The proposed modification reduces both the memory requirement and the complexity of the hardware coder.

The paper entitled “Data cache-energy and throughput models: design exploration for embedded processors” by M. Y. Qadri and K. D. McDonald-Maier proposes cache-energy models. These models strive to provide a complete application-based analysis. As a result they could facilitate the tuning of a cache and an application for a given power budget. The models presented in this paper are an improved extension of energy and throughput models for a data cache, in that the leakage energy is indicated for the entire processor rather than simply the cache on its own. The energy model covers the per cycle energy consumption of the processor; the leakage energy statistics of the processor in the data sheet cover the cache and all peripherals of the chip. It is also improved in terms of refinement of the miss rate, which has been split into two terms: a read miss rate and a write miss rate. This was done as the read energy and write energy components correspond to the respective miss rate contributions of the cache. The model-based approach presented was used to predict the processor’s performance with sufficient accuracy. An example application for design exploration that could facilitate the identification of an optimal cache configuration and code profile for a target application was discussed.

The paper entitled “Hardware architecture for pattern recognition in gamma-ray experiment” by S. Khatchadourian et al. presents an intelligent way of triggering data in the HESS (high energy stereoscopic system) phase II experiment. The system relies on the utilization of image processing algorithms in order to increase the trigger efficiency. The proposed trigger scheme is based on a neural system that extracts the interesting features of the incoming images and rejects the background more efficiently than classical solutions. The paper presents the basic principles of the algorithms as well as their hardware implementation in FPGAs.

The paper entitled “Evaluation and design space exploration of a time-division multiplexed NoC on FPGA for image analysis applications” by L. Zhang et al. presents an adaptable fat tree NoC architecture for field programmable gate arrays (FPGAs) designed for image analysis applications. The authors propose a dedicated communication architecture for image analysis algorithms. This communication mechanism is a generic NoC infrastructure dedicated to dataflow image processing applications, mixing circuit-switching and packet-switching communications. The complete architecture integrates two dedicated communication architectures and reusable IP blocks. Communications are based on the NoC concept to support the high bandwidth required for a large number and type of data. For data communication inside the architecture, an efficient time-division multiplexed (TDM) architecture is proposed. This NoC uses a fat tree (FT) topology with virtual channels (VC) and flit packet-switching with fixed routes. Two versions of the NoC are presented in this paper. The results of their implementations and their design space exploration (DSE) on an Altera Stratix II are analyzed and compared with a point-to-point communication and illustrated with a multispectral image application.

The paper entitled “Efficient processing of a rainfall simulation watershed on an FPGA-based architecture with fast access to neighborhood pixels” by L. S. Yeong et al. describes a hardware architecture to implement the watershed algorithm using rainfall simulation. The speed of the architecture is increased by utilizing a multiple memory bank approach to allow parallel access to the neighborhood pixel values. In a single read cycle, the architecture is able to obtain all five values of the center and four neighbors for a 4-connectivity watershed transform. The proposed rainfall watershed architecture consists of two parts. The first part performs the arrowing operation and the second part assigns each pixel to its associated catchment basin.

Markus Rupp
Ahmet T. Erdogan
Bertrand Granado

Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 543720, 10 pages
doi:10.1155/2009/543720

Research Article

Multicore Software-Defined Radio Architecture for GNSS Receiver Signal Processing

Heikki Hurskainen, Jussi Raasakka, Tapani Ahonen, and Jari Nurmi

Department of Computer Systems, Tampere University of Technology, P. O. Box 553, 33101 Tampere, Finland

Correspondence should be addressed to Heikki Hurskainen, [email protected]

Received 27 February 2009; Revised 22 May 2009; Accepted 30 June 2009

Recommended by Markus Rupp

We describe a multicore Software-Defined Radio (SDR) architecture for Global Navigation Satellite System (GNSS) receiver implementation. A GNSS receiver picks up very low power signals from multiple satellites and then uses dedicated processing to demodulate and measure the exact timing of these signals, from which the user’s position, velocity, and time (PVT) can be estimated. Three GNSS SDR architectures are discussed: (1) a hardware-based SDR that is feasible for embedded devices but relatively expensive, (2) a pure SDR approach that has a high level of flexibility and a low bill of materials, but is not yet suited for handheld applications, and (3) a novel architecture that uses a programmable array of multiple processing cores that exhibits both flexibility and potential for mobile devices. We present the CRISP project where the multicore architecture will be realized, along with numerical analysis of the application requirements on the platform’s processing cores and network payload.

Copyright © 2009 Heikki Hurskainen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Global navigation has been a challenge to mankind for centuries. However, in the modern world it has become easier with the help of Global Navigation Satellite Systems (GNSSs). The NAVSTAR Global Positioning System (GPS) [1] has been the most famous implementation of GNSS and the only fully operational system available for civilian users, although this situation is changing.

Galileo [2] is emerging as a competitor and complement to GPS, as both are satellite navigation systems based on Code Division Multiple Access (CDMA) techniques. CDMA is a technique that allows multiple transmitters to use the same carrier simultaneously by multiplying pseudorandom noise (PRN) codes onto the transmitted signal. The PRN code rate is higher than the data symbol rate, which spreads the energy of a data symbol over a wider bandwidth. The PRN codes used are unique to each transmitter, and thus a transmitter can be identified at reception when the received signal is correlated with a replica of its PRN code.
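As a rough illustration of this despreading principle, the following C++ sketch (ours, not from the paper; C++ is the language the authors later use for their software receiver) correlates a received block against a hypothetical ±1 PRN replica; a large correlation magnitude identifies the transmitter and its code alignment.

```cpp
#include <cstddef>
#include <vector>

// Correlate a received signal block with a local +/-1 PRN replica.
// A large |sum| indicates that the replica's satellite is present and
// code-aligned; sweeping the replica's delay locates the code phase.
double correlate(const std::vector<double>& rx,
                 const std::vector<int>& prn)   // chip values in {+1, -1}
{
    double sum = 0.0;
    for (std::size_t n = 0; n < rx.size(); ++n)
        sum += rx[n] * prn[n % prn.size()];     // despread: multiply-accumulate
    return sum;
}
```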

The Russian GLONASS system, originally based on Frequency Division Multiple Access (FDMA), is adding a CDMA feature to the system with GLONASS-K satellites [3]. China has also shown interest in implementing its own system, called Compass, during the following decade [4]. The GPS modernization program [5] introduces additional signals with new codes and modulation. Realization of the new navigation systems and the modernization of GPS produce updates and upgrades to the system specifications.

Besides changing specifications, GNSS is also facing challenges from an environmental point of view. The resulting multipath effects make it more difficult to determine the exact signal timing crucial for navigation algorithms. Research on multipath mitigation algorithms is active, since accurate navigation capability in environments with heavy multipath is desired. Along with interference issues, multipath mitigation is also one of the biggest drivers for the introduction of new GNSS signal modulations.

Designing a true GNSS receiver is not a trivial task. A true GNSS receiver should be reconfigurable and flexible in design so that the possibilities of new specifications and algorithms can be exploited, and the price should be low enough to enable mass market penetration.

Page 12: 541420

2 EURASIP Journal on Embedded Systems

2. GNSS Principles and Challenges

2.1. Navigation and Signal Processing. Navigation can be performed when four or more satellites are visible to the receiver. The pseudoranges from the receiver to the satellites and navigation data (containing ephemeris parameters) are needed [1, 6, 7].

When pseudoranges (ρ) are measured by the receiver, they can be used to solve the unknowns, the user’s location (x, y, z)_u and clock bias b_u, with the known positions of the satellites (x, y, z)_i. The relation between pseudorange, satellite position, and user position is illustrated in

\rho_i = \sqrt{(x_i - x_u)^2 + (y_i - y_u)^2 + (z_i - z_u)^2} + b_u. \quad (1)
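Equation (1) is nonlinear in the user unknowns, so receivers solve it iteratively. As a brief illustration (standard textbook material, e.g., [1, 6], not spelled out in this paper), linearizing (1) around a current estimate (\hat{x}_u, \hat{y}_u, \hat{z}_u, \hat{b}_u) with predicted geometric range \hat{r}_i gives

\Delta\rho_i = \rho_i - \hat{\rho}_i \approx -\frac{x_i - \hat{x}_u}{\hat{r}_i}\,\Delta x - \frac{y_i - \hat{y}_u}{\hat{r}_i}\,\Delta y - \frac{z_i - \hat{z}_u}{\hat{r}_i}\,\Delta z + \Delta b_u,

and with four or more satellites the stacked equations are solved in the least-squares sense for (\Delta x, \Delta y, \Delta z, \Delta b_u), after which the estimate is updated and the step repeated until convergence.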

The transmitted signal contains low rate navigation data (50 Hz for the GPS Standard Positioning Service (SPS)), a repeating PRN code sequence (1023 chips at 1.023 MHz for GPS SPS), and a high rate carrier (GPS SPS is transmitted in the L1 band, which is centered at 1575.42 MHz) [1]. For the Galileo E1 Open Service (OS) and the future GPS L1C it also contains a Multiplexed Binary Offset Carrier (MBOC) modulation [8, 9]. These signal components are illustrated in Figure 1.

The signal processing for GNSS can be divided into analog and digital parts. Since the carrier frequencies of GNSS are high (>1 GHz), it is infeasible to perform digital signal processing directly on the carrier. In the analog part of the receiver, which is called the radio front-end, the received signal is amplified, filtered, downconverted, and finally quantized and sampled to digital format.

The digital signal processing part (i.e., baseband processing) has two major tasks. First, the Doppler frequencies and code phases of the satellites need to be acquired. The details of the acquisition process are well explained in the literature, for example, [1, 7]. There are a number of ways to implement acquisition, with parallel methods being faster than serial ones, but at the cost of consuming more resources. The parallel methods can be applied either as convolution in the time domain (matched filters) or as multiplication in the frequency domain (using FFT and IFFT).
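A minimal sketch of the frequency-domain method follows; the fft/ifft helpers are assumed here (e.g., thin wrappers around an FFT library) and all names are illustrative rather than taken from the paper.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Assumed helpers, e.g., wrappers around an FFT library; not from the paper.
cvec fft(const cvec& x);
cvec ifft(const cvec& X);

// Frequency-domain acquisition: circular correlation of one code period of
// received samples with the local PRN replica. The index of the magnitude
// peak is the code-phase estimate; the whole search is repeated over a grid
// of Doppler-shifted replicas to also recover the Doppler frequency.
std::size_t acquireCodePhase(const cvec& rx, const cvec& code)
{
    cvec prod = fft(rx);
    const cvec CODE = fft(code);
    for (std::size_t k = 0; k < prod.size(); ++k)
        prod[k] *= std::conj(CODE[k]);   // multiplication == circular correlation
    const cvec corr = ifft(prod);
    std::size_t best = 0;
    for (std::size_t n = 1; n < corr.size(); ++n)
        if (std::abs(corr[n]) > std::abs(corr[best]))
            best = n;
    return best;                         // code phase, in samples
}
```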

Second, after successful acquisition the signals found are tracked. In tracking, the frequency and phase of the receiver are continuously fine-tuned to keep receiving the acquired signals. Also, the GNSS data is demodulated and the precise timing is formed from the signal phase measurements. A detailed description of the tracking process can be found, for example, in [1, 7]. The principles of data demodulation are also illustrated in Figure 1.

2.2. Design Challenges of GNSS. The environment we are living in is constantly changing in topographic, geometric, economic, and political ways. These changes are driving the GNSS evolution.

Besides new systems (e.g., Galileo, Compass), the existing ones (i.e., GPS, GLONASS) are being modernized. This leads to a constantly evolving field of specifications, which may increase the frustration and uncertainty among receiver designers and manufacturers.

The signal spectrum of future GNSS signals is growing with the new systems. Currently GPS L1 (centered at 1575.42 MHz) is the only commercially exploited GNSS frequency band. The Galileo system’s E1 OS signal will be sharing the same band. Another common band of future GPS and Galileo signals will be centered at 1176.45 MHz (GPS L5 and Galileo E5a).

The GPS modernization program is also activating the L2 frequency band (centered at 1227.60 MHz) for civilian use by implementing the L2C (L2 Civil) signal [10]. This band has already been assigned for navigation use, but only for authorized users via the GPS Precise Positioning Service (PPS) [1].

To improve signal code tracking and multipath performance, the new Binary Offset Carrier (BOC) modulation was originally introduced as the baseline for Galileo and modern GPS L1 signal development [11]. The later agreement between European and US GNSS authorities further specified the usage of Multiplexed BOC (MBOC) modulation in both systems. In MBOC modulation two different binary subcarriers are added to the signal, either in time multiplexed mode (TMBOC) or summed together with predefined weighting factors as Composite BOC (CBOC) [8, 9, 12].

Like any other wireless communication, satellite navigation also suffers from multipath in environments prone to it (e.g., urban canyons, indoors). The problem caused by multipath is even bigger in navigation than in communication since precise timing also needs to be resolved. The field of multipath mitigation is actively researched, and new algorithms and architectures are presented frequently, for example, in [13–15].

Besides GNSS there are also other wireless communication technologies that are developing rapidly, and the direction of development is towards multipurpose low cost receivers (user handsets) with enhanced capabilities [16].

3. Overview of SDR GNSS Architectures

In this section we present three architectures for a Software-Defined Radio (SDR) GNSS receiver. A simplified definition of SDR is given in [17]: “Radio in which some or all of the physical layer functions are software defined.”

The root SDR architecture was presented in [18]. Figure 2 illustrates an example of GNSS receiver functions mapped onto this canonical architecture. Only the reception part of the architecture is presented since current GNSS receivers do not transmit. Radio Frequency (RF) conversion handles the signal processing before digitalization. The Intermediate Frequency (IF) processing block transfers the frequency of the received signal from IF to baseband and may also take care of Doppler removal in GNSS. The baseband processing segment handles the accurate timing and demodulation, thus enabling the construction of the navigation data bits. The division into IF and baseband sections can vary depending on the chosen solution since the complex envelope of the received signal can also be handled at baseband. The desired navigation output (Position, Velocity, and Time (PVT)) is solved in the last steps of the GNSS receiver chain.

Figure 1: Principles for GNSS signal modulation in transmission and demodulation in reception. (Shown: the satellite transmission side with PRN code, binary subcarrier, and carrier; the transmission medium, i.e., space; and the receiver side, where replica PRN, subcarrier, and carrier signals are used for navigation data recovery.)

Figure 2: Canonical SDR architecture adapted to GNSS. It is modified from [18]. (Chain: radio with LNA, local oscillator and frequency synthesis, downconversion, AGC, and A/D conversion; IF processing with carrier wipeoff and code correlation steered by carrier and code NCOs and code generation; baseband processing; navigation processing; and user interface.)

Current state-of-the-art mass market receivers are based on a chipset or single-chip receiver [19]. The chipset or single-chip receiver is usually implemented as an Application Specific Integrated Circuit (ASIC). ASICs have high Nonrecurring Engineering (NRE) costs, but when produced in high volumes they have a very low price per unit. ASICs can also be optimized for small size and small power consumption. Both of these features are desired in handheld, battery operated devices. On the other hand, ASICs are fixed solutions and impossible to reconfigure. Modifications in design are also very expensive to realize with ASIC technology.

This approach has proven to be successful in mass market receivers because of its price and power consumption advantages, although it may not hold its position with the growing demand for flexibility and shortened time to market.

3.1. Hardware Accelerated SDR Receiver Architecture. The first SDR receiver architecture discussed in this paper is the approach where the most demanding parts of the receiver are implemented on a reconfigurable hardware platform, usually in the form of a Field Programmable Gate Array (FPGA) programmed with a Hardware Description Language (HDL). This architecture, comprising a radio front-end circuit, reconfigurable baseband hardware, and navigation software, is well known and presented in numerous publications, for example, [16, 20–22]. FPGAs have proved to be suitable for performing GNSS signal processing functions [23]. The building blocks for hardware accelerated SDR receivers are illustrated in Figure 3.

In this architecture the RF conversion is performed by the analog radio. The last step of the conversion transforms the signal from analog to digital format. IF processing and baseband functionalities are performed in the accelerating hardware. The source, PVT for the GNSS case, is constructed in navigation processing.

The big advantage of reconfigurable FPGAs in comparison to ASIC technologies is the savings in design, NRE, and mask costs due to the shorter development cycle. The risk is also smaller with FPGAs, since possible bugs in the design can be fixed by upgrades later on. On the other hand, FPGAs are much higher in unit price and power consumption.

A true GNSS receiver poses some implementation challenges. The specifications are designed to be compatible (i.e., the systems do not interfere with each other too much), and true interoperability is reached at the receiver level. One example of the interoperability design challenges is the selection of the number of correlators and their spacing for tracking, since different modulations have different requirements for the correlator structure.

3.1.1. Challenges with the Radio Front End. Although the focus of this paper is mainly on baseband functions, the radio should not be forgotten. The block diagram of a GNSS single frequency radio front end is given on the left-hand side of Figure 3. In the radio the received signal is first amplified with the Low Noise Amplifier (LNA) and then, after necessary filtering, it is downconverted to a low IF, for example, to 4 MHz [24]. The signal is converted to digital format after downconversion.

Figure 3: Hardware accelerated baseband architecture. From left to right: analog radio part, reconfigurable baseband hardware, and navigation software running on a GPP. (Radio front end ASIC: low noise amplifier (LNA), local oscillator and frequency synthesis, downconverter, and A/D conversion; reconfigurable hardware (FPGA): automatic gain control (AGC), acquisition engine, and tracking channels 1 to N; general purpose processor: navigation processing and user interface.)

The challenges for GNSS radio design come from the increasing number of frequency bands. To call a receiver a true GNSS receiver, and also to get the best performance, more than one frequency band should be processed by the radio front-end. Dual- and/or multifrequency receivers are likely choices for future receivers, and thus it is important to study potential architectures [25].

Another challenge comes from the increased bandwidth of the new signals. With increased bandwidth the radio becomes more vulnerable to interference. For mass market consumer products, the radio design should also meet certain price and power consumption requirements. Only solutions with reasonable price and power consumption will survive.

3.1.2. Baseband Processing. The fundamental signal processing for GNSS was presented in Figure 1. The carrier and code removal processes are illustrated in more detail in Figure 4. The incoming signal is divided into in-phase and quadrature-phase components by multiplying it with locally generated sine and cosine waves. Both phases are then correlated in identical branches with several closely delayed versions (for GPS: early, prompt, and late) of the locally generated PRN code [1]. The results are then integrated and fed to discriminator computation and a feedback filter. Numerically Controlled Oscillators (NCOs) are used to steer the local replicas.

An example of the different needs of the new GNSS signals is the addition of 2 correlator fingers (for the bump-jumping algorithm) due to the Galileo BOC modulation [26]. In Figure 4 the additional correlator components needed for Galileo tracking are marked with a darker shade. For the most part the GPS and Galileo signals in the L1 band use the same components. The main difference is that due to the BOC family modulation Galileo needs additional correlators, very-early (VE) and very-late (VL), to remove the uncertainty of the main peak location estimation [27]. The increasing number of correlators is related to the increase in complexity, measured by the number of transistors in the final design [13].
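A hedged sketch of such a tracking channel’s inner loop follows (our illustration, not the paper’s implementation; finger spacing, sign conventions, and the loop filter are simplified):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// One integrate-and-dump interval of a tracking channel: carrier wipeoff
// followed by early/prompt/late code correlation, then a noncoherent
// early-minus-late discriminator for the code loop.
struct Epl { std::complex<double> early, prompt, late; };

Epl correlateEpl(const std::vector<double>& rx,   // digitized IF samples
                 const std::vector<int>& code,    // local PRN chips, +/-1
                 double codePhase,                // initial code phase, chips
                 double chipsPerSample,
                 double carrPhase,                // initial carrier phase, rad
                 double carrPhaseStep,            // rad/sample (IF + Doppler)
                 double spacing = 0.5)            // finger offset, chips
{
    Epl out{};
    const long N = static_cast<long>(code.size());
    auto chipAt = [&](double chips) {
        long idx = static_cast<long>(std::floor(chips)) % N;
        if (idx < 0) idx += N;
        return static_cast<double>(code[idx]);
    };
    for (std::size_t n = 0; n < rx.size(); ++n) {
        // Carrier wipeoff: mix with the local sine/cosine replica.
        const std::complex<double> s = rx[n] *
            std::exp(std::complex<double>(0.0, -(carrPhase + carrPhaseStep * n)));
        const double chips = codePhase + chipsPerSample * n;
        out.early  += s * chipAt(chips + spacing);
        out.prompt += s * chipAt(chips);
        out.late   += s * chipAt(chips - spacing);
    }
    return out;
}

// Noncoherent early-minus-late discriminator (one common choice).
double elDiscriminator(const Epl& c)
{
    const double e = std::abs(c.early), l = std::abs(c.late);
    return (e - l) / (e + l + 1e-12);
}
```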

The level of hardware acceleration depends on the selected algorithms. Acquisition is needed more rarely than tracking, and thus it is more suitable for software implementation. FFT-based algorithms are more desirable to implement in software, since hardware description languages usually lack direct support for floating-point calculus. Tracking, on the other hand, is a process containing mostly multiplications and accumulations using relatively small word lengths. What makes it more suitable for hardware implementation is that the number of these relatively simple computations is high, with a real-time deadline.

3.2. Ideal SDR GNSS Receiver Architecture. The ideal SDR is characterized by assigning all functions after the analog radio to a single processor [18]. In the ideal case all hardware problems are turned into software problems.

A fundamental block diagram of a software receiver is illustrated in Figure 5 [28]. The architecture of the radio front-end is the same as that illustrated in Figure 3. After the radio, the digitized signals are fed to buffers for software usage. Then all of the digital signal processing, acquisition, and tracking functions are performed in software.

In the literature, for example, [28, 29], the justification and reasoning for SDR GNSS is strongly attributed to the well-known Moore’s law, which states that the capacity of integrated circuits doubles every 18–24 months [30]. Ideal SDR solutions should become feasible if and when the available processing power increases. Currently reported SDR GPS receiver implementations work in real time only if the clock speed of the processor is from 900 MHz [31] to 3 GHz [29], which is too high for mobile devices but not, for example, for a laptop PC.

In recent years, the availability of GNSS radio front ends with USB has improved, making the implementation of a pure software receiver on a PC platform quite straightforward. The area where pure software receivers have already made a breakthrough is postprocessing applications. Postprocessing with software receivers allows fast algorithm prototyping and signal analysis.

Figure 4: GPS/Galileo tracking channel. (The in-phase branch is shown: sin/cos carrier wipeoff driven by the carrier NCO, integrate-and-dump (I & D) correlators at the VE, E, P, L, and VL fingers fed by the code generator and code NCO, and discriminator computation and filtering; the quadrature branch is not shown.)

Figure 5: Software receiver architecture. On the left-hand side: analog radio part; on the right-hand side: baseband and navigation implemented as software running on a GPP. (GNSS radio front end ASIC feeding buffers and buffer control, an acquisition engine, tracking channels 1 to N, navigation processing, and the user interface.)

Typical postprocessing applications are ionospheric monitoring, geodetic applications, and other scientific applications [21, 32].

Software is definitely more flexible than hardware when compared in terms of time to market, bill of materials, and reconfigurability of the implementation. But with a required clock frequency of around 1 GHz or more, the generated heat and battery life will be an issue for small handheld devices.

3.3. SDR with Multiple Cores. What about having an array of reconfigurable cores for baseband processing? In a multicore architecture baseband processing is divided among multiple processing cores. This reduces the clock frequency needed to a range achievable by embedded devices and provides an increased level of parallelism, which also eases the work load per processing unit.

An example of the GNSS receiver architecture with the reconfigurable baseband approach is illustrated in Figure 6. In this example one of the four cores is acting as an acquisition engine and the remaining three are performing the tracking functions. A fixed set of cores is not desirable since the need for acquisition and tracking varies over time. For example, when the receiver is turned on, all cores should be performing acquisition to guarantee the fastest possible Time To First Fix (TTFF); after satellites have been found, more of the acquisition cores are moved to the tracking task, as sketched below.
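A minimal sketch of such mode-dependent core allocation (purely illustrative; the policy and names are ours, not CRISP’s):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum class CoreRole { Acquisition, Tracking };

// Reassign core roles as the receiver state evolves: at cold start every
// core acquires (fastest TTFF); as satellites are found, cores migrate to
// tracking while at least one core keeps acquiring new satellites.
// Assumes numCores >= 1.
std::vector<CoreRole> allocateRoles(std::size_t numCores,
                                    std::size_t coresNeededForTracking)
{
    std::vector<CoreRole> roles(numCores, CoreRole::Acquisition);
    const std::size_t trackers = std::min(coresNeededForTracking, numCores - 1);
    for (std::size_t i = 0; i < trackers; ++i)
        roles[i] = CoreRole::Tracking;
    return roles;
}
```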

If (and when) manufactured in large volumes, the (properly scaled) array of processing cores can eventually be implemented in an ASIC circuit. This lowers the per unit price and makes this solution more appealing for mass markets, while still being reconfigurable and having a high degree of flexibility.

In the next section we present one future realization ofthis architecture.

4. CRISP Platform

Cutting edge Reconfigurable ICs for Stream Processing (CRISP) [33] is a project in the Framework Programme 7 (FP7) of the European Union (EU). The objectives of CRISP are to research the optimal utilization, efficient programming, and dependability of a reconfigurable multiprocessor platform for streaming applications. The CRISP consortium is a good mixture of academic and industrial know-how, with partners Recore (NL), University of Twente (NL), Atmel (DE), Thales Netherlands (NL), Tampere University of Technology (FI), and NXP (NL).

Figure 6: Software reconfigurable baseband receiver architecture. From left: analog radio part, baseband implemented on an array of reconfigurable cores, and navigation software running on a GPP. (GNSS radio front end ASIC feeding the reconfigurable platform, which hosts an acquisition engine and tracking channels, with navigation processing and the user interface on a general purpose processor.)

The three-year project started at the beginning of 2008.

The reconfigurable CRISP platform designed and implemented within the project, also called the General Streaming Processor (GSP), will consist of two separate devices: the General Purpose Device (GPD) and the Reconfigurable Fabric Device (RFD). The GPD contains an off-the-shelf General Purpose Processor (GPP) with memories and peripheral connections, whereas the RFD consists of 9 reconfigurable cores. The array of reconfigurable cores is illustrated in Figure 7 [34], with “R” depicting a router.

The reconfigurable cores are Montium cores. (It was recently decided to use the Xentium processing tile as the Reconfigurable Core in the CRISP GSP. The Xentium has at least similar performance to the Montium with respect to cycle count, but is designed for better programmability, e.g., hardware supporting optimal software pipelining.) Montium [35] is a reconfigurable processing core. It has five Arithmetic and Logical Units (ALUs), each having two memories, resulting in a total of 10 internal memories. The cores communicate via a Network-on-Chip (NoC) which includes two global memories. The device interfaces to other devices and the outer world via standard interfaces.

Within the CRISP project the GNSS receiver is one of the two applications designed as a proof of concept for the platform. The other is a radar beamforming application, which has much higher computational demands than a standalone GNSS receiver.

4.1. Specifying the GNSS Receiver for the Multicore Platform. In the CRISP project our intention is to specify, implement, and integrate a GNSS receiver application supporting GPS and Galileo L1 Open Service (OS) signals on the multicore platform. In this case, the restriction to L1 band usage comes from the selected radio [24], but in principle the multicore approach can be extended to multifrequency receivers if a suitable radio front-end is used.

4.1.1. Requirements for the Tile Processor. The requirements of the GNSS L1 application have been studied in [36].

Table 1: Estimation of GNSS baseband process complexity for a Montium Tile Processor running at 200 MHz, max performance of 1 GMAC/s [36].

Process Usage (MMAC/s) Usage of TP (%)

Acquisition (GPS) 43.66 4.4

Acquisition (Galileo) 196.15 19.6

Tracking (GPS) 163.67 16.4

Tracking (Galileo) 229.14 22.9

The results, restated in Table 1, indicated that a single Montium core running at a 200 MHz clock speed is barely capable of executing the minimum required amount of acquisition and tracking processes. This analysis did not take into account the processing power needed for the baseband-to-navigation handover nor for navigation processing itself. With this it is evident that an array of cores (more than one) is needed for GNSS L1 purposes. The estimations given in Table 1 are based on the reported [35] performance of the Montium core. The acquisition figures are computed for a search speed of one satellite per second, and the tracking figures are for a single channel.

The results presented in Table 1 reflect the complexity of the processes when the input stream is sampled at 16.368 MHz, which is the output frequency of the selected radio front end for the CRISP platform [24]. This is approximately 16 times the navigation signal fundamental frequency of 1.023 MHz.

The GNSS application can also be used with a lower rate input stream without a significant loss in application performance. For this paper, we analyzed the effect of input stream decimation on the complexity of the main baseband processes. The other parameters, such as the acquisition time and number of frequency bins for acquisition, and the number of active correlators per channel for tracking, remained the same as in [36].

Figures 8 and 9 illustrate the effect of decimation by factors 1, 2, 4, 8, and 16 on the utilization of the Montium Tile Processor. Decimation factor 1 equates to the case where no decimation is applied, that is, the results shown in Table 1.

Figure 7: Array of 9 reconfigurable cores [34] with an example mapping of the GNSS application illustrated; the selection of cores is random. “R” depicts a router and “IF” an interface. (On the RFD, each of the nine reconfigurable cores attaches to the network via a network IF; two smart memories, serial and parallel IFs, JTAG, chip IFs, and a test IF connect the array to the RF front-end data input and the channel data output. In the example mapping, one core runs acquisition 0 and six cores run tracking channels 0 to 5.)

The presented figures show how the complexity of both processes, measured as Montium Tile Processor utilization percentage, decreases exponentially as the decimation factor increases. The behavior is the same for GPS and Galileo signals, except that the utilization with Galileo signals is a bit larger than with GPS in all studied cases.

To ease the computational load of the Tile Processor, decimation of the input stream seems to be a feasible choice. The amount of decimation should be sufficient to effect meaningful savings in TP utilization without significantly degrading the performance of the application. For the current GPS SPS signal, decimation by a factor of 4 (4.092 MHz) is feasible without significant loss in receiver performance. A factor of 8 (2.046 MHz) is equal to the Nyquist rate for 1.023 MHz, which is the PRN code rate used in the GPS SPS signal.

In the Galileo case, factor 4 is the maximum decimation factor. This is because with a sampling frequency of approximately 4 MHz the BOC(1,1) component of the Galileo E1 OS signal can still be received with a maximum loss of only −0.9 dB when compared with the reception of the whole MBOC bandwidth [12]. (This applies also to the modern GPS L1C signals, but they are not specified in our application [36].)
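A decimator of this kind can be as simple as the following sketch (a boxcar average-and-downsample stage; illustrative only, since in CRISP the decimation is done in FPGA hardware):

```cpp
#include <cstddef>
#include <vector>

// Decimate the input sample stream by an integer factor using a boxcar
// (moving-average) filter followed by downsampling. For the scenario in
// the text, factor = 4 takes the 16.368 MHz stream down to 4.092 MHz.
std::vector<double> decimate(const std::vector<double>& in, unsigned factor)
{
    std::vector<double> out;
    out.reserve(in.size() / factor);
    for (std::size_t n = 0; n + factor <= in.size(); n += factor) {
        double acc = 0.0;
        for (unsigned k = 0; k < factor; ++k)
            acc += in[n + k];          // low-pass average before downsampling
        out.push_back(acc / factor);
    }
    return out;
}
```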

Table 2: Estimation of GNSS baseband process complexity with a decimated (by factor 4) input stream. Montium Tile Processor running at 200 MHz, max performance of 1 GMAC/s.

Process Usage (MMAC/s) Usage of TP (%)

Acquisition (GPS) 9.57 0.96

Acquisition (Galileo) 43.66 4.37

Tracking (GPS) 40.92 4.09

Tracking (Galileo) 57.28 5.73

In the ideal case the decimation of the input stream should change with the receiver mode (GPS/Galileo). Since in CRISP the decimation of the radio stream will be implemented as hardware in the FPGA that connects the radio to the parallel interface of the final CRISP prototype platform, run time configuration of the decimation factor is not feasible. For this reason, in the rest of the paper we will focus on the scenario where a fixed decimation factor of 4 is used, resulting in a stream sample rate of 4.092 MHz.

Table 2 shows the baseband complexity estimation for the case when the input stream is decimated by a factor of four. When this is compared to the original complexity figures shown in Table 1, it can be seen that the utilization of the TP is over four times smaller.

Figure 8: Acquisition process utilization of Montium Tile Processor resources as a function of the decimation factor of the input stream. (Utilization in percent versus decimation factors 0 to 16, with separate curves for GPS and Galileo.)

Figure 9: Tracking process utilization of Montium Tile Processor resources as a function of the decimation factor of the input stream. (Utilization in percent versus decimation factors 0 to 16, with separate curves for GPS and Galileo.)

4.1.2. Requirements for the Network-on-Chip. To analyze the multicore GNSS receiver application we built a functional software receiver in the C++ language, running on a PC. A detailed analysis of the software receiver will be given in a separate paper [37].

In our SW receiver each process was implemented as a software thread. By approximating one process per core, this approach enabled us to estimate the link payload by logging the communication between the threads, as sketched below.
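A sketch of how such per-link logging can be instrumented (illustrative; the actual instrumentation of [37] may differ):

```cpp
#include <atomic>
#include <cstddef>

// Per-link byte counter: every message passed between two "process"
// threads is accounted here, and a once-per-millisecond sampler reads and
// resets the counter, yielding bytes/ms traces like those in Figure 10.
struct LinkMeter {
    std::atomic<std::size_t> bytes{0};

    void onSend(std::size_t messageBytes) { bytes += messageBytes; }

    // Called once per simulated millisecond by the logging thread.
    std::size_t sampleAndReset() { return bytes.exchange(0); }
};
```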

We estimated a scenario where one core was allocated to perform acquisition and six cores were mapped to the tracking process. This scenario is illustrated in Figure 7. Digitized RF front-end data is input to the NoC via an interface.

Figure 10: Link payloads for the GPS acquisition process (a) and the average payload of the GPS tracking processes (b). (Payload in bytes/ms over a 5-second run.)

A specific chip interface is used to connect the RFD to the GPD, and it is used to forward channel data (channel phase measurement data related to pseudorange measurements, and navigation data) to the GPD. The selected mapping is a compromise between the minimal operative setup (one acquisition and four tracking) and the needs of the dependability testing processes, where individual cores may be taken offline for testing purposes.

The scenario was simulated with a prerecorded set of real GPS signals. Since signal sources for Galileo navigation were not available, the Galileo case was not tested. The link payloads caused by the cores communicating while the software ran for 5 seconds are illustrated in Figure 10.

The results show that, in GPS mode, our GNSS application causes a payload on each link/processing core with a constant baseline of 4096 bytes/millisecond. This is caused by the radio front-end input, that is, the incoming signal. In this scenario we used real GPS front end data sampled at 4.092 MHz, each byte representing one sample, so the input stream alone accounts for roughly 4.1 kB per millisecond. This sampling rate is also equal to the potential decimation scenario discussed earlier.

With a higher sampling rate the link payload baseline will rise, but on the other hand one byte can be preprocessed to contain more than one sample, decreasing the traffic caused by the radio front-end input.
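For instance, if the front end quantizes to 2-bit samples, four samples fit in one byte; a packing step along these lines (our illustration, not from the paper) would cut the baseline payload by a factor of four at the same sample rate:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack 2-bit quantized samples (values 0..3) four to a byte so that the
// NoC link carries a quarter of the bytes for the same sample rate.
std::vector<std::uint8_t> pack2bit(const std::vector<std::uint8_t>& samples)
{
    std::vector<std::uint8_t> out((samples.size() + 3) / 4, 0);
    for (std::size_t n = 0; n < samples.size(); ++n)
        out[n / 4] |= static_cast<std::uint8_t>((samples[n] & 0x3) << (2 * (n % 4)));
    return out;
}
```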

The first peak in the upper part of Figure 10 is caused by the acquisition process output. When the GNSS application starts, FFT-based acquisition is started, and the results are ready after 60 milliseconds, at which point they are transmitted to the tracking channels. This peak is also the largest individual payload event caused by the GNSS application.

After a short initialization period the tracking processes start to produce channel output. An average of the simulated GPS tracking link/processing core payloads is illustrated in Figure 10(b). Every 20 milliseconds a navigation data symbol (the data rate is 50 Hz in GPS) is transmitted, and once a second a higher transmission peak is caused by the loop phase measurement data, which is transmitted to the GPD for pseudorange estimation.

In Galileo mode, the payload caused by the incoming signal will be equal, since the same radio input will be used for both GPS and Galileo. However, the transmission of data symbols will cause a bigger payload, since the data rate of the Galileo E1 signals is 250 symbols per second [8]. The Galileo phase measurement rate will remain the same as in GPS mode.

From the results it is seen that the link payload caused by the incoming RF signal is the largest one in both operating modes; if the link payload needs to be optimized, its reduction is the first thing to be studied. The results also indicate that when the GNSS application is running smoothly, the link payloads it causes are predictable.

Note that this estimation does not contain any overhead caused by the network protocol or any data other than navigation-related data (dependability, real-time mapping of the processes). These issues will be studied in our future work.

4.2. Open Issues. Besides the additional network load caused by traffic other than the GNSS application itself, some other issues also remain open. There may be challenges in designing software for a multicore environment. Power consumption as well as the final bill of materials (BOM) (i.e., the final price of the multicore product) remains an open issue at the time of this writing. In the future these issues will be studied and suitable optimizations performed after the prototyping and proof of concept have been completed successfully.

5. Conclusions

In this paper we discussed three Software-Defined Radio (SDR) architectures for a Global Navigation Satellite System (GNSS) receiver. The use of flexible architectures in GNSS receivers was justified by the need to implement support for upcoming navigation systems and newly developed algorithms, especially for multipath mitigation. The hardware-accelerated SDR architecture is quite close to current mass market solutions; there the ASIC is replaced with a reconfigurable piece of hardware, usually an FPGA. The second architecture, the ideal (or pure) SDR receiver, uses a single processor to realize all necessary signal processing functions. Real-time receivers remain a challenge, but postprocessing applications are already taking advantage of this architecture.

The third architecture, SDR with multiple cores, is a novel approach for GNSS receivers. This approach benefits both from a high degree of flexibility and, when properly designed and scaled, from a reasonably low unit price in high-volume production. In this paper we also presented the CRISP project, where such a multicore architecture will be realized, along with an analysis of the GNSS application requirements for the multicore platform.

We extended the previously published analysis of processing tile utilization to cover the effect of input stream decimation. Decimation by a factor of four seems to offer a good compromise between core utilization and application performance.

We implemented a software GNSS receiver with its processes realized as threads and used it to analyze the GNSS application communication payload on individual links. This analysis indicated that the incoming signal represents the largest part of the communication in the network between processing cores.

Acknowledgments

The authors want to thank Stephen T. Burgess from Tampere University of Technology for his useful comments about the manuscript. This work was supported in part by the FUGAT project funded by the Finnish Funding Agency for Technology and Innovation (TEKES). Parts of this research are conducted within the FP7 Cutting edge Reconfigurable ICs for Stream Processing (CRISP) project (ICT-215881) supported by the European Commission.

References

[1] E. D. Kaplan and C. J. Hegarty, Eds., Understanding GPS: Principles and Applications, Artech House, Boston, Mass, USA, 2nd edition, 2006.

[2] J. Benedicto, S. E. Dinwiddy, G. Gatti, R. Lucas, and M. Lugert, "GALILEO: Satellite System Design and Technology Developments," European Space Agency, November 2000.

[3] S. Revnivykh, "GLONASS Status and Progress," December 2008, http://www.oosa.unvienna.org/pdf/icg/2008/icg3/04.pdf.

[4] G. Gibbons, "International system providers meeting (ICG-3) reflects GNSS's competing interest, cooperative objectives," Inside GNSS, December 2008.

[5] U.S. Air Force, "GPS Modernization Fact Sheet," 2006, http://pnt.gov/public/docs/2006/modernization.pdf.

[6] M. S. Braasch and A. J. van Dierendonck, "GPS receiver architectures and measurements," Proceedings of the IEEE, vol. 87, no. 1, pp. 48–64, 1999.

[7] K. Borre, D. M. Akos, N. Bertelsen, P. Rinder, and S. H. Jensen, A Software-Defined GPS and Galileo Receiver: A Single-Frequency Approach, Birkhauser, Boston, Mass, USA, 2007.

[8] "Galileo Open Service, Signal in Space Interface Control Document (OS SIS ICD)," Draft 1, February 2008.

[9] "Interface Specification: Navstar GPS Space Segment/User Segment L1C Interfaces," IS-GPS-800, August 2007.

[10] R. D. Fontana, W. Cheung, and T. Stansell, "The modernized L2C signal: leaping forward into the 21st century," GPS World, pp. 28–34, September 2001.

[11] "Galileo Joint Undertaking: Galileo Open Service, Signal in Space Interface Control Document (OS SIS ICD)," GJU, May 2006.

[12] G. W. Hein, J.-A. Avila-Rodriguez, S. Wallner, et al., "MBOC: the new optimized spreading modulation recommended for GALILEO L1 OS and GPS L1C," in Proceedings of the IEEE/ION Position, Location, and Navigation Symposium (PLANS '06), pp. 883–892, San Diego, Calif, USA, April 2006.

[13] H. Hurskainen, E. S. Lohan, X. Hu, J. Raasakka, and J. Nurmi, "Multiple gate delay tracking structures for GNSS signals and their evaluation with Simulink, SystemC, and VHDL," International Journal of Navigation and Observation, vol. 2008, Article ID 785695, 17 pages, 2008.

[14] S. Kim, S. Yoo, S. Yoon, and S. Y. Kim, "A novel unambiguous multipath mitigation scheme for BOC(kn, n) tracking in GNSS," in Proceedings of the International Symposium on Applications and the Internet Workshops, p. 57, 2007.

[15] F. Dovis, M. Pini, and P. Mulassano, "Multiple DLL architecture for multipath recovery in navigation receivers," in Proceedings of the 59th IEEE Vehicular Technology Conference (VTC '04), vol. 5, pp. 2848–2851, May 2004.

[16] F. Dovis, A. Gramazio, and P. Mulassano, "SDR technology applied to Galileo receivers," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GPS '02), Portland, Ore, USA, September 2002.

[17] "SDR Forum," January 2009, http://www.sdrforum.org.

[18] J. Mitola, "The software radio architecture," IEEE Communications Magazine, 1995.

[19] P. G. Mattos, "A single-chip GPS receiver," GPS World, October 2005.

[20] P. J. Mumford, K. Parkinson, and A. G. Dempster, "The Namuru open GNSS research receiver," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GNSS '06), vol. 5, pp. 2847–2855, Fort Worth, Tex, USA, September 2006.

[21] S. Ganguly, A. Jovancevic, D. A. Saxena, B. Sirpatil, and S. Zigic, "Open architecture real time development system for GPS and Galileo," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GNSS '04), pp. 2655–2666, Long Beach, Calif, USA, September 2004.

[22] H. Hurskainen, T. Paakki, Z. Liu, J. Raasakka, and J. Nurmi, "GNSS receiver reference design," in Proceedings of the 4th Advanced Satellite Mobile Systems Conference (ASMS '08), pp. 204–209, Bologna, Italy, August 2008.

[23] J. Hill, "Navigation signal processing with FPGAs," in Proceedings of the National Technical Meeting of the Institute of Navigation, pp. 420–427, June 2004.

[24] Atmel, "GPS Front End IC ATR0603," Datasheet, 2006.

[25] M. Detratti, E. Lopez, E. Perez, and R. Palacio, "Dual-frequency RF front end solution for hybrid Galileo/GPS mass market receivers," in Proceedings of the IEEE Consumer Communications and Networking Conference (CCNC '08), pp. 603–607, Las Vegas, Nev, USA, January 2008.

[26] P. Fine and W. Wilson, "Tracking algorithms for GPS offset carrier signals," in Proceedings of the ION National Technical Meeting (NTM '99), San Diego, Calif, USA, January 1999.

[27] H. Hurskainen and J. Nurmi, "SystemC model of an interoperative GPS/Galileo code correlator channel," in Proceedings of the IEEE Workshop on Signal Processing Systems (SIPS '06), pp. 327–332, Banff, Canada, October 2006.

[28] D. M. Akos, "The role of Global Navigation Satellite System (GNSS) software radios in embedded systems," GPS Solutions, May 2003.

[29] C. Dionisio, L. Cucchi, and R. Marracci, "SOFTREC G3, software receiver and signal analysis for GNSS bands," in Proceedings of the 10th IEEE International Symposium on Spread Spectrum Techniques and Applications (ISSSTA '08), Bologna, Italy, August 2008.

[30] G. E. Moore, "Cramming more components onto integrated circuits," Proceedings of the IEEE, vol. 86, no. 1, pp. 82–85, 1998.

[31] S. Soderholm, T. Jokitalo, K. Kaisti, H. Kuusniemi, and H. Naukkarinen, "Smart positioning with Fastrax's software GPS receiver solution," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GNSS '08), pp. 1193–1200, Savannah, Ga, USA, September 2008.

[32] J. H. Won, T. Pany, and G. W. Hein, "GNSS software defined radio: real receiver or just a tool for experts?" Inside GNSS, pp. 48–56, July-August 2006.

[33] "CRISP Project," December 2008, http://www.crisp-project.eu.

[34] P. Heysters, "CRISP Project Presentation," June 2008, http://www.crisp-project.eu/images/publications/D6.1 CRISP project presentation 080622.pdf.

[35] P. M. Heysters, G. K. Rauwerda, and L. T. Smit, "A flexible, low power, high performance DSP IP core for programmable systems-on-chip," in Proceedings of the IP/SoC, Grenoble, France, December 2005.

[36] H. Hurskainen, J. Raasakka, and J. Nurmi, "Specification of GNSS application for multiprocessor platform," in Proceedings of the International Symposium on System-on-Chip (SoC '08), pp. 128–133, Tampere, Finland, November 2008.

[37] J. Raasakka, H. Hurskainen, and J. Nurmi, "Modeling multicore software GNSS receiver with real time SW receiver," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GNSS '09), Savannah, Ga, USA, September 2009.

Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 598529, 13 pages
doi:10.1155/2009/598529

Research Article

An Open Framework for Rapid Prototyping of Signal Processing Applications

Maxime Pelcat,1 Jonathan Piat,1 Matthieu Wipliez,1 Slaheddine Aridhi,2 and Jean-Francois Nezan1

1 IETR/Image and Remote Sensing Group, CNRS UMR 6164/INSA Rennes, 20 avenue des Buttes de Coesmes, 35043 Rennes Cedex, France

2 HPMP Division, Texas Instruments, 06271 Villeneuve Loubet, France

Correspondence should be addressed to Maxime Pelcat, [email protected]

Received 27 February 2009; Revised 7 July 2009; Accepted 14 September 2009

Recommended by Markus Rupp

Embedded real-time applications in communication systems have significant timing constraints, thus requiring multiple computation units. Manually exploring the potential parallelism of an application deployed on multicore architectures is greatly time-consuming. This paper presents an open-source Eclipse-based framework which aims to facilitate the exploration and development processes in this context. The framework includes a generic graph editor (Graphiti), a graph transformation library (SDF4J), and an automatic mapper/scheduler tool with simulation and code generation capabilities (PREESM). The input of the framework is composed of a scenario description and two graphs, one describing an algorithm and the other describing an architecture. The rapid prototyping results of a 3GPP Long-Term Evolution (LTE) algorithm on a multicore digital signal processor illustrate both the features and the capabilities of this framework.

Copyright © 2009 Maxime Pelcat et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

The recent evolution of digital communication systems (voice, data, and video) has been dramatic. Over the last two decades, low data-rate systems (such as dial-up modems, first and second generation cellular systems, 802.11 wireless local area networks) have been replaced or augmented by systems capable of data rates of several Mbps, supporting multimedia applications (such as DSL, cable modems, 802.11b/a/g/n wireless local area networks, 3G, WiMax, and ultra-wideband personal area networks).

As communication systems have evolved, the resulting increase in data rates has necessitated a higher system algorithmic complexity. A more complex system requires greater flexibility in order to function with different protocols in different environments. Additionally, there is an increased need for the system to support multiple interfaces and multicomponent devices. Consequently, this requires the optimization of device parameters over varying constraints such as performance, area, and power. Achieving this device optimization requires a good understanding of the application complexity and the choice of an appropriate architecture to support this application.

An embedded system commonly contains several processor cores in addition to hardware coprocessors. The embedded system designer needs to distribute a set of signal processing functions onto a given hardware with predefined features. The functions are then executed as software code on the target architecture; this action will be called a deployment in this paper. A common approach to implement a parallel algorithm is the creation of a program containing several synchronized threads in which execution is driven by the scheduler of an operating system. Such an implementation does not meet the hard timing constraints required by real-time applications and the memory consumption constraints required by embedded systems [1]. One-time manual scheduling developed for single-processor applications is also not suitable for multiprocessor architectures: manual data transfers and synchronizations quickly become very complex, leading to wasted time and potential deadlocks. Furthermore, the task of finding an optimal deployment of an algorithm mapped onto a multicomponent architecture is not straightforward. When performed manually, the result is inevitably a suboptimal solution. These issues raise the need for new methodologies which allow the exploration of several solutions to achieve a more optimal result.

Several features must be provided by a fast prototyping process: description of the system (hardware and software), automatic mapping/scheduling, simulation of the execution, and automatic code generation. This paper draws on previously presented works [2–4] in order to generate a more complete rapid prototyping framework. This complete framework is composed of three complementary tools based on Eclipse [5] that provide a full environment for the rapid prototyping of real-time embedded systems: Parallel and Real-time Embedded Executives Scheduling Method (PREESM), Graphiti, and Synchronous Data Flow for Java (SDF4J). This framework implements the Algorithm-Architecture Matching (AAM) methodology, previously called Algorithm-Architecture Adequation (AAA) [6]. The focus of this rapid prototyping activity is currently static code mapping/scheduling, but dynamic extensions are planned for future generations of the tool.

From the graph descriptions of an algorithm and of an architecture, PREESM can find the right deployment, provide simulation information, and generate a framework code for the processor cores [2]. These rapid prototyping tasks can be combined and parameterized in a workflow. In PREESM, a workflow is defined as an oriented graph representing the list of rapid prototyping tasks to execute on the input algorithm and architecture graphs in order to determine and simulate a given deployment. A rapid prototyping process in PREESM consists of a succession of transformations. These transformations are associated in a data flow graph representing a workflow that can be edited in the Graphiti generic graph editor. The PREESM input graphs may also be edited using Graphiti. The PREESM algorithm models are handled by the SDF4J library. The framework can be extended by modifying the workflows or by connecting new plug-ins (for compilation, graph analyses, and so on).

In this paper, the differences between the proposed framework and related works are explained in Section 2. The framework structure is described in Section 3. Section 4 details the features of PREESM that can be combined by users in workflows. The use of the framework is illustrated by the deployment of a wireless communication algorithm from the 3rd Generation Partnership Project (3GPP) Long-Term Evolution (LTE) standard in Section 5. Finally, conclusions are given in Section 6.

2. State of the Art of Rapid Prototyping and Multicore Programming

There exist numerous solutions to partition algorithms onto multicore architectures. If the target architecture is homogeneous, several solutions exist which generate multicore code from C with additional information (OpenMP [7], CILK [8]). In the case of heterogeneous architectures, languages such as OpenCL [9] and the Multicore Association Application Programming Interface (MCAPI [10]) define ways to express parallel properties of a code. However, they are not currently linked to efficient compilers and runtime environments. Moreover, compilers for such languages would have difficulty in extracting and solving the bottlenecks of the implementation that appear inherently in graph descriptions of the architecture and the algorithm.

The Poly-Mapper tool from PolyCore Software [11] offers functionalities similar to PREESM but, in contrast to PREESM, its mapping/scheduling is manual. Ptolemy II [12] is a simulation tool that supports many models of computation. However, it also has no automatic mapping, and currently its code generation for embedded systems focuses on single-core targets. Another family of frameworks for data flow based programming is based on the CAL [13] language and includes OpenDF [14]. OpenDF employs a more dynamic model than PREESM but its related code generation does not currently support multicore embedded systems.

Closer to PREESM are the Model Integrated Computing (MIC [15]), the Open Tool Integration Environment (OTIE [16]), the Synchronous Distributed Executives (SynDEx [17]), the Dataflow Interchange Format (DIF [18]), and SDF for Free (SDF3 [19]). Neither MIC nor OTIE can be accessed online. According to the literature, MIC focuses on the transformation between algorithm domain-specific models and metamodels, while OTIE defines a single system description that can be used during the whole signal processing design cycle.

DIF is designed as an extensible repository of representation, analysis, transformation, and scheduling of data flow languages. DIF is a Java library which allows the user to go from graph specification using the DIF language to C code generation. However, the hierarchical Synchronous Data Flow (SDF) model used in the SDF4J library and PREESM is not available in DIF.

SDF3 is an open-source tool implementing some data flow models and providing analysis, transformation, visualization, and manual scheduling as a C++ library. SDF3 implements the Scenario-Aware Data Flow (SADF [20]) model and provides a Multiprocessor System-on-Chip (MP-SoC) binding/scheduling algorithm to output MP-SoC configuration files.

SynDEx and PREESM are both based on the AAM methodology [6] but the tools do not provide the same features. SynDEx is not open source, it has its own model of computation that does not support schedulability analysis, and code generation is possible but not provided with the tool. Moreover, the architecture model of SynDEx is at too high a level to account, in the mapping/scheduling, for the bus contentions and DMAs used in modern chips (multicore processors or MP-SoCs).

The features that differentiate PREESM from the related works and similar tools are:

(i) the tool is open source and accessible online;

(ii) the algorithm description is based on a single well-known and predictable model of computation;


Figure 1: An Eclipse-based Rapid Prototyping Framework. (Diagram: within the Eclipse framework, the Graphiti core provides the generic graph editor plug-in, SDF4J provides the data flow graph transformation library, and PREESM provides the rapid prototyping plug-ins: graph transformation, scheduler, and code generator.)

(iii) the mapping and the scheduling are totally automatic;

(iv) the functional code for heterogeneous multicore embedded systems can be generated automatically;

(v) the algorithm model provides a helpful hierarchical encapsulation, thus simplifying the mapping/scheduling [3].

The PREESM framework structure is detailed in the next section.

3. An Open-Source Eclipse-Based Rapid Prototyping Framework

3.1. The Framework Structure. The framework structure is presented in Figure 1. It is composed of several tools to increase reusability in several contexts.

The first step of the process is to describe both the target algorithm and the target architecture graphs. A graphical editor reduces the development time required to create, modify, and edit those graphs. The role of Graphiti [21] is to support the creation of algorithm and architecture graphs for the proposed framework. Graphiti can also be quickly configured to support any type of file format used for generic graph descriptions.

The algorithm is currently described as a Synchronous Data Flow (SDF [22]) graph. The SDF model is a good solution to describe algorithms with static behavior. SDF4J [23] is an open-source library providing the usual transformations of SDF graphs in the Java programming language. The extensive use of SDF and its derivatives in the programming model community led to the development of SDF4J as an external tool. Due to the greater specificity of the architecture description compared to the algorithm description, it was decided to perform the architecture transformation inside the PREESM plug-ins.

The PREESM project [24] involves the development of a tool that performs the rapid prototyping tasks. The PREESM tool uses the Graphiti tool and the SDF4J library to design algorithm and architecture graphs and to generate their transformations. The PREESM core is an Eclipse plug-in that executes sequences of rapid prototyping tasks, or workflows. The tasks of a workflow are delegated to PREESM plug-ins. There are currently three PREESM plug-ins: the graph transformation plug-in, the scheduler plug-in, and the code-generation plug-in.

The three tools of the framework are detailed in the next sections.

3.2. Graphiti: A Generic Graph Editor for Editing Architectures, Algorithms, and Workflows. Graphiti is an open-source plug-in for the Eclipse environment that provides a generic graph editor. It is written using the Graphical Editor Framework (GEF). The editor is generic in the sense that any type of graph may be represented and edited. Graphiti is used routinely with the following graph types and associated file formats: CAL networks [13, 25], a subset of IP-XACT [26], GraphML [27], and PREESM workflows [28].

3.2.1. Overview of Graphiti. A type of graph is registered within the editor by a configuration. A configuration is an XML (Extensible Markup Language [29]) file that describes:

(1) the abstract syntax of the graph (types of vertices and edges, and attributes allowed for objects of each type);

(2) the visual syntax of the graph (colors, shapes, etc.);

(3) transformations from the file format in which the graph is defined to Graphiti's XML file format G, and vice versa (Figure 2).

Two kinds of input transformations are supported, from XML to XML and from text to XML (Figure 2). XML is transformed to XML with Extensible Stylesheet Language Transformations (XSLT [30]), and text is parsed to its Concrete Syntax Tree (CST), represented in XML, according to an LL(k) grammar by the Grammatica [31] parser. Similarly, two kinds of output transformations are supported, from XML to XML and from XML to text.

Graphiti handles attributed graphs [32]. An attributed graph is defined as a directed multigraph G = (V, E, μ) with V the set of vertices and E the multiset of edges (there can be more than one edge between any two vertices). μ is a function μ : ({G} ∪ V ∪ E) × A → U that associates instances with attributes from the attribute name set A and values from U, the set of possible attribute values. A built-in type attribute is defined so that each instance i ∈ {G} ∪ V ∪ E has a type t = μ(i, type) and only admits attributes from a set A_t ⊂ A given by A_t = τ(t). Additionally, a type t has a visual syntax σ(t) that defines its color, shape, and size.

Figure 2: Input/output with Graphiti's XML format G. (Diagram: (a) input transformations, where XML is converted by XSLT transformations and text is parsed to a CST in XML before conversion to G; (b) output transformations, where G is converted by XSLT transformations back to XML or text.)

Figure 3: A sample graph. (Diagram: a "produce" vertex with port out, a "do something" vertex with ports acc, in, and out, and a "consume" vertex with port in.)
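For concreteness, a minimal C rendering of this attributed-graph structure (all names are invented; Graphiti itself is a Java/Eclipse plug-in, so this is only a sketch of the formal definition above):

/* Hypothetical sketch: every instance (the graph, a vertex, or an
   edge) carries a built-in type and a list of name/value attributes
   restricted to the set A_t = tau(t) of its type t. */
typedef struct {
    const char *name;    /* attribute name, an element of A  */
    const char *value;   /* attribute value, an element of U */
} attribute_t;

typedef struct {
    const char *type;    /* t = mu(i, type) */
    attribute_t *attrs;  /* only attributes allowed by tau(type) */
    int attr_count;
} instance_t;

typedef struct {
    instance_t base;     /* edges are attributed instances too    */
    int source, target;  /* vertex indices; parallel edges allowed */
} edge_t;

typedef struct {
    instance_t base;     /* the graph G itself is attributed */
    instance_t *vertices;
    edge_t *edges;
    int vertex_count, edge_count;
} graph_t;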

To edit a graph, the user selects a file and the matching configuration is computed based on the file extension. The transformations defined in the configuration file are then applied to the input file and result in a graph defined in Graphiti's XML format G, as shown in Figure 2. The editor uses the visual syntax defined by σ in the configuration to draw the graph, vertices, and edges. For each instance of type t the user can edit the relevant attributes allowed by τ(t) as defined in the configuration. Saving a graph consists of writing the graph in G and transforming it back to the input file's native format.

3.2.2. Editing a Configuration for a Graph Type. To create a configuration for the graph represented in Figure 3, a node (a single type of vertex) must be defined. A node has a unique identifier called id and accepts a list of values initially equal to [0] (Figure 4). Additionally, ports need to be specified on the edges, so the configuration describes an edgeType element (Figure 5) that carries sourcePort and targetPort parameters to store an edge's source and target ports, respectively, such as acc, in, and out in Figure 3.

Graphiti is a stand-alone tool, totally independent of PREESM. However, Graphiti generates the workflow graphs, IP-XACT, and GraphML files that are the main inputs of PREESM. The GraphML files contain the algorithm model. These inputs are loaded and stored in PREESM by the SDF4J library. This library, discussed in the next section, executes the graph transformations.

3.3. SDF4J: A Java Library for Algorithm Data Flow Graph Transformations. SDF4J is a library defining several data flow oriented graph models such as SDF and Directed Acyclic Graph (DAG [33]). It provides the user with several classic SDF transformations such as hierarchy flattening, SDF to Homogeneous SDF (HSDF [34]) transformation, and some clustering algorithms. This library also gives the possibility to expand optimization templates. It defines its own graph representation based on the GraphML standard and provides the associated parser and exporter class. SDF4J is freely available (GPL license) for download.

<vertexType name="node">
  <attributes>
    <color red="163" green="0" blue="85"/>
    <shape name="roundedBox"/>
    <size width="40" height="40"/>
  </attributes>
  <parameters>
    <parameter name="id" type="java.lang.String" default=""/>
    <parameter name="values" type="java.util.List">
      <element value="0"/>
    </parameter>
  </parameters>
</vertexType>

Figure 4: The type of vertices of the graph shown in Figure 3.

<edgeType name="edge">
  <attributes>
    <directed value="true"/>
  </attributes>
  <parameters>
    <parameter name="source port" type="java.lang.String" default=""/>
    <parameter name="target port" type="java.lang.String" default=""/>
  </parameters>
</edgeType>

Figure 5: The type of edges of the graph shown in Figure 3.

3.3.1. SDF4J SDF Graph Model. An SDF graph is used to simplify the application specifications. It allows the representation of the application behavior at a coarse grain level. This data flow representation models the application operations and specifies the data dependencies between these operations.

An SDF graph is a finite directed, weighted graph G = <V, E, d, p, c> where:

(i) V is the set of nodes; a node computes an input data stream and outputs the result;

(ii) E ⊆ V × V is the edge set, representing channels which carry data streams;

(iii) d : E → N ∪ {0} is a function with d(e) the number of initial tokens on an edge e;

(iv) p : E → N is a function with p(e) representing the number of data tokens produced at e's source to be carried by e;

(v) c : E → N is a function with c(e) representing the number of data tokens consumed from e by e's sink node.

Figure 6: A SDF graph. (Diagram: four operations op1 to op4 connected by edges whose production and consumption rates, such as 3, 2, 2, and 4, may differ.)
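As a minimal illustration (SDF4J itself is a Java library; the names below are invented), the definition above reduces to a small edge-list structure in C:

/* Hypothetical C rendering of the SDF model G = <V, E, d, p, c>. */
typedef struct {
    int src, snk;  /* endpoint indices into the node set V      */
    int d;         /* d(e): initial tokens (delays) on the edge */
    int p;         /* p(e): tokens produced per firing of src   */
    int c;         /* c(e): tokens consumed per firing of snk   */
} sdf_edge_t;

typedef struct {
    int num_nodes;       /* |V| */
    int num_edges;       /* |E| */
    sdf_edge_t *edges;   /* the weighted edge set E */
} sdf_graph_t;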

This model offers strong compile-time predictability properties but has limited expressive capability. The SDF implementation enabled by SDF4J supports the hierarchy defined in [3], which increases the model expressiveness. This specific implementation is straightforward for the programmer and allows user-defined structural optimizations. This model is also intended to lead to a better code generation using common C patterns like loops and function calls. It is highly expandable, as the user can associate any properties with the graph components (edge, vertex) to produce a customized model.

3.3.2. SDF4J SDF Graph Transformations. SDF4J implements several algorithms intended to transform the base model or to optimize the application behavior at different levels.

(i) The hierarchy flattening transformation aims to flatten the hierarchy (remove hierarchy levels) at the chosen depth in order to later extract as much parallelism as possible from the designer's hierarchical description.

(ii) The HSDF transformation (Figure 7) transforms the SDF model into an HSDF model in which the amounts of tokens exchanged on edges are homogeneous (production = consumption). This model reveals all the potential parallelism in the application but dramatically increases the number of vertices in the graph (a minimal sketch of the underlying balance equation follows this list).

(iii) The internalization transformation, based on [35], is an efficient clustering method minimizing the number of vertices in the graph without decreasing the potential parallelism in the application.

(iv) The SDF to DAG transformation converts the SDF or HSDF model into the DAG model, which is commonly used by scheduling methods [33].
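The HSDF expansion is driven by the SDF balance equation: for an edge with production rate p and consumption rate c, the source and sink repetition counts must satisfy p · q_src = c · q_snk so that no tokens accumulate. A minimal sketch under these definitions (a standalone illustration, not SDF4J code):

#include <stdio.h>

static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

int main(void)
{
    /* Rates of the single edge op1 -> op2 in Figure 7: p = 3, c = 1. */
    int p = 3, c = 1;
    int g = gcd(p, c);
    int q_src = c / g;   /* repetitions of op1 */
    int q_snk = p / g;   /* repetitions of op2 */
    /* Prints "op1 x1, op2 x3": the HSDF graph duplicates op2 three
       times, with every resulting edge carrying exactly one token. */
    printf("op1 x%d, op2 x%d\n", q_src, q_snk);
    return 0;
}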

3.4. PREESM: A Complete Framework for Hardware and Software Codesign. In the framework, the role of the PREESM tool is to perform the rapid prototyping tasks. Figure 8 depicts an example of a classic workflow which can be executed in the PREESM tool. As seen in Section 3.3, the data flow model chosen to describe applications in PREESM is the SDF model. This model, described in [22], has the great advantage of enabling the formal verification of static schedulability. The typical number of vertices to schedule in PREESM is between one hundred and several thousand. The architecture is described using the IP-XACT language, an IEEE standard from the SPIRIT consortium [26]. The typical size of an architecture representation in PREESM is between a few cores and several dozen cores. A scenario is defined as a set of parameters and constraints that specify the conditions under which the deployment will run.

Figure 7: A SDF graph and its HSDF transformation. (Diagram: a two-actor SDF graph with rates 3 and 1 expanded into one op1 instance feeding three op2 instances, every edge carrying one token.)

As can be seen in Figure 8, prior to entering the scheduling phase, the algorithm goes through three transformation steps: the hierarchy flattening transformation, the HSDF transformation, and the DAG transformation (see Section 3.3.2). These transformations prepare the graph for the static scheduling and are provided by the Graph Transformation Module (see Section 4.1). Subsequently, the DAG (the converted SDF graph) is processed by the scheduler [36]. As a result of the deployment by the scheduler, code is generated and a Gantt chart of the execution is displayed. The generated code consists of scheduled function calls, synchronizations, and data transfers between cores. The functions themselves are handwritten.

The plug-ins of the PREESM tool implement the rapid prototyping tasks that a user can add to the workflows. These plug-ins are detailed in the next section.

4. The Current Features of PREESM

4.1. The Graph Transformation Module. In order to generate an efficient schedule for a given algorithm description, the application defined by the designer must be transformed. The purpose of this transformation is to reveal the potential parallelism of the algorithm and simplify the work of the task scheduler. To provide the user with flexibility while optimizing the design, the entire graph transformation provided by the SDF4J library can be instantiated in a workflow with parameters allowing the user to control each of the three transformations. For example, the hierarchy flattening transformation can be configured to flatten a given number of hierarchy levels (depth) in order to keep some of the user's hierarchical construction and to maintain the number of vertices to schedule at a reasonable level. The HSDF transformation provides the scheduler with a graph of high potential parallelism, as all the vertices of the SDF graph are repeated according to the SDF graph's basic repetition vector. Consequently, the number of vertices to schedule is larger than in the original graph. The clustering transformation prepares the algorithm for the scheduling process by grouping vertices according to criteria such as strong connectivity or strong data dependency between vertices. The grouped vertices are then transformed into a hierarchical vertex which is then treated as a single vertex in the scheduling process. This vertex grouping reduces the number of vertices to schedule, speeding up the scheduling process. The user can freely use the available transformations in his workflow in order to control the criteria for optimizing the targeted application and architecture.

Figure 8: Example of a workflow graph: from SDF and IP-XACT descriptions to the generated code. (Diagram: Graphiti architecture, algorithm, and scenario editors feed IP-XACT, hierarchical SDF, and scenario inputs into the PREESM framework, which chains hierarchy flattening, HSDF transformation, SDF-to-DAG transformation, mapping/scheduling (producing a Gantt chart and a DAG with implementation information), and code generation.)

As can be seen in the workflow displayed in Figure 8, the graph transformation steps are followed by the static scheduling step.

4.2. The PREESM Static Scheduler. Scheduling consists of statically distributing the tasks that constitute an application between the available cores in a multicore architecture and minimizing parameters such as final latency. This problem has been proven to be NP-complete [37]. A static scheduling algorithm is usually described as a monolithic process and carries out two distinct functionalities: choosing the core to execute a specific function and evaluating the cost of the generated solutions.

The PREESM scheduler splits these functionalities into three submodules [4] which share minimal interfaces: the task scheduling, the edge scheduling, and the Architecture Benchmark Computer (ABC) submodules. The task scheduling submodule produces a scheduling solution for the application tasks mapped onto the architecture cores and then queries the ABC submodule to evaluate the cost of the proposed solution. The advantage of this approach is that any task scheduling heuristic may be combined with any ABC model, leading to many different scheduling possibilities. For instance, an ABC minimizing the deployment memory or energy consumption can be implemented without modifying the task scheduling heuristics.

The interface offered by the ABC to the task scheduling submodule is minimal. The ABC gives the number of available cores, receives a deployment description, and returns costs to the task scheduling (infinite if the deployment is impossible). The time keeper calculates and stores timings for the tasks and the transfers when necessary for the ABC.
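This narrow interface can be summarized in a few lines; a hypothetical C sketch (PREESM itself implements this in Java, and all names here are invented):

#include <limits.h>

typedef struct deployment deployment_t;  /* mapping and ordering of tasks */

typedef struct {
    /* number of operators the task scheduling heuristic may target */
    int  (*available_cores)(void);
    /* cost of a (possibly partial) deployment; LONG_MAX stands for
       the "infinite" cost of an impossible deployment */
    long (*evaluate_cost)(const deployment_t *d);
} abc_interface_t;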

The ABC needs to schedule the edges in order to calculate the deployment cost. However, it is not designed to make any deployment choices; this task is delegated to the edge scheduling submodule. The router in the edge scheduling submodule finds potential routes between the available cores.

The choice of module structure was motivated by the behavioral commonality of the majority of scheduling algorithms (see Figure 9).

4.2.1. Scheduling Heuristics. Three algorithms are currently coded; they are modified versions of the algorithms described in [38].

(i) A list scheduling algorithm schedules tasks in the order dictated by a list constructed from estimating a critical path. Once a mapping choice has been made, it will never be modified. This algorithm is fast but has limitations due to this last property. List scheduling is used as a starting point for other refinement algorithms.

(ii) The FAST algorithm is a refinement of the list scheduling solution which uses probabilistic hops. It changes the mapping choices of randomly chosen tasks; that is, it associates these tasks with another processing unit. It runs until stopped by the user and keeps the best latency found. The algorithm is multithreaded to exploit the multicore parallelism of a host computer.

(iii) A genetic algorithm is coded as a refinement of the FAST algorithm. The n best solutions of FAST are used as the base population for the genetic algorithm. The user can stop the processing at any time while retaining the last best solution. This algorithm is also multithreaded.

The FAST algorithm has been developed to solve complex deployment problems. In the original heuristic, the final order of the tasks to schedule, as defined by the list scheduling algorithm, was not modified by the FAST algorithm; FAST only modifies the mapping choices of the tasks. In large-scale applications, the initial order of the tasks produced by the list scheduling algorithm occasionally becomes suboptimal. In the modified version of the FAST scheduling algorithm, the ABC recalculates the final order of a task when the heuristic maps it to a new core. The task switcher algorithm used to recalculate the order simply looks for the earliest appropriately sized hole in the core schedule for the mapped task (see Figure 10).
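The hole search itself is simple; a minimal sketch of the idea in C (invented names; not PREESM's actual Java code):

typedef struct { long start, end; } slot_t;

/* slots: busy intervals on one core, sorted by start time.
   Returns the earliest start time >= ready at which a task of
   duration dur fits, either in a hole between busy intervals
   or after the last one. */
long earliest_hole(const slot_t *slots, int n, long ready, long dur)
{
    long t = ready;
    for (int i = 0; i < n; i++) {
        if (slots[i].start - t >= dur)
            return t;             /* the hole before slot i fits */
        if (slots[i].end > t)
            t = slots[i].end;     /* skip past the busy interval */
    }
    return t;                     /* append after the last slot  */
}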

4.2.2. Scheduling Architecture Model. The current architecture representation was driven by the need to accurately model multicore architectures and hardware coprocessors with intercore message-passing communication. This communication is handled in parallel with the computation using Direct Memory Access (DMA) modules. This model is currently used to closely simulate the Texas Instruments TMS320TCI6487 processor (see Section 5.3.2). The model will soon be extended to shared memory communications and more complex interconnections. The term operator represents either a processor core or a hardware coprocessor. Operators are linked by media, each medium representing a bus and the associated DMA. The architectures can be either homogeneous (with all operators and media identical) or heterogeneous. For each medium, the user defines a DMA set up time and a bus data rate. As shown in Figure 9, the architecture model is only processed in the scheduler by the ABC and not by the heuristic and edge scheduling submodules.

4.2.3. Architecture Benchmark Computer. Scheduling often requires much time. Testing intermediate solutions with precision is an especially time-consuming operation. The ABC submodule was created by reusing the useful concept of time scalability introduced in SystemC Transaction Level Modeling (TLM) [39]. This language defines several levels of system temporal simulation, from untimed to cycle-accurate precision. This concept motivated the development of several ABC latency models with different timing precisions. Three ABC latency models are currently coded (see Figure 11).

Figure 9: Scheduler module structure. (Diagram: from the DAG, IP-XACT, and scenario inputs, the task scheduling submodule exchanges task schedules and costs with the Architecture Benchmark Computer (ABC), which holds the number of cores and a time keeper and delegates edge schedules to the edge scheduling submodule and its router.)

(i) The loosely-timed model takes into account task and transfer times but no transfer contention.

(ii) The approximately-timed model associates each intercore communication medium with its constant rate and simulates contentions.

(iii) The accurately-timed model adds set up times which simulate the duration necessary to initialize a parallel transfer controller like the Texas Instruments Enhanced Direct Memory Access (EDMA [40]). This set up time is scheduled on the core which sends the transfer (a compact sketch of the three cost levels follows this list).
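Under these descriptions, the three levels differ only in which terms enter the transfer cost; a hypothetical sketch (invented names, not PREESM code):

typedef enum { LOOSELY, APPROXIMATELY, ACCURATELY } abc_level_t;

/* Cost in cycles of one intercore transfer of n bytes.
   rate:  bytes per cycle of the medium
   wait:  cycles spent waiting for a busy medium (contention)
   setup: cycles needed to program the DMA controller */
long transfer_cost(abc_level_t lvl, long n, double rate,
                   long wait, long setup)
{
    long cycles = (long)(n / rate);            /* raw transfer time */
    if (lvl >= APPROXIMATELY) cycles += wait;  /* add contention    */
    if (lvl >= ACCURATELY)    cycles += setup; /* add DMA set up    */
    return cycles;
}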

The task and architecture properties feeding the ABC submodule are evaluated experimentally and include media data rates, set up times, and task timings. ABC models evaluating parameters other than latency are planned in order to minimize memory size, memory accesses, cadence (i.e., average runtime), and so on. Currently, only latency is minimized due to the limitations of the list scheduling algorithms: these costs cannot be evaluated on partial deployments.

4.2.4. Edge Scheduling Submodule. When a data block is transferred from one operator to another, transfer tasks are added and then mapped to the corresponding medium. A route is associated with each edge carrying data from one operator to another, which may possibly go through several other operators. The edge scheduling submodule routes the edges and schedules their route steps. The existing routing process is basic and will be developed further once the architecture model has been extended. Edge scheduling can be executed with different algorithms of varying complexity, which results in another level of scalability. Currently, two algorithms are implemented:

(i) the simple edge scheduler follows the scheduling order given by the task list provided by the list scheduling algorithm;

(ii) the switching edge scheduler reuses the task switcher algorithm discussed in Section 4.2.1 for edge scheduling. When a new communication edge needs to be scheduled, the algorithm looks for the earliest hole of appropriate size in the medium schedule.

Figure 10: Switchable scheduling heuristics. (Diagram: from the DAG, IP-XACT, and scenario inputs, the scheduler can combine list scheduling, FAST, or genetic algorithm task scheduling, driven by latency, cadence, or memory, with the currently latency-only edge scheduling, trading accuracy against speed.)

The scheduler framework enables the comparison of different edge scheduling algorithms using the same task scheduling submodule and architecture model description. The main advantage of the scheduler structure is the independence of the scheduling algorithms from the cost type and benchmark complexity.

4.3. Generating Code from a Static Schedule. Using the AAM methodology from [6], code can be generated from the static scheduling of the input algorithm on the input architecture (see the workflow in Figure 8). This code consists of an initialization phase and a loop endlessly repeating the algorithm graph. From the deployment generated by the scheduler, the code generation module generates a generic representation of the code in XML. The specific code for the target is then obtained after an XSLT transformation. The code generation flow for the Texas Instruments tricore processor TMS320TCI6487 (see Section 5.3.2) is illustrated in Figure 12.
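The generated per-core source therefore has a fixed shape: initialization calls followed by an endless loop of scheduled actor calls, transfers, and synchronizations. A hypothetical C sketch of that shape (all names invented; the real calls come from the IDL prototypes and the communication libraries):

#include <stddef.h>

#define SIZE 1024
static int bufA[SIZE], bufB[SIZE];

/* Stubs standing in for handwritten actors and the comm library. */
static void actor_a_init(void) {}
static void actor_b_init(void) {}
static void actor_a_loop(int *out) { (void)out; }
static void actor_b_loop(const int *in, int *out) { (void)in; (void)out; }
static void send_dma(int medium, const int *buf, size_t n)
{ (void)medium; (void)buf; (void)n; }
static void wait_sync(int sem) { (void)sem; }

void core0_main(void)
{
    actor_a_init();               /* initialization phase */
    actor_b_init();
    for (;;) {                    /* endless repetition of the graph */
        actor_a_loop(bufA);       /* scheduled actor call            */
        send_dma(0, bufA, SIZE);  /* EDMA-style intercore transfer   */
        wait_sync(1);             /* intercore synchronization       */
        actor_b_loop(bufA, bufB);
    }
}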

PREESM currently supports the C64x and C64x+ based processors from Texas Instruments with the DSP-BIOS operating system [41] and x86 processors with the Windows operating system. The supported intercore communication schemes include TCP/IP with sockets, Texas Instruments EDMA3 [42], and RapidIO link [43].

An actor is a task with no hierarchy. A function must be associated with each actor, and the prototype of the function must be defined to add the right parameters in the right order. A CORBA Interface Definition Language (IDL) file is associated with each actor in PREESM. An example of an IDL file is shown in Figure 13. This file gives the generic prototypes of the initialization and loop function calls associated with a task. IDL was chosen because it is a language-independent way to express an interface.

Figure 11: Switchable ABC models. (Diagram: the ABC can switch between the accurately-timed model (bus contention plus set up times), the approximately-timed model (bus contention only), and the loosely-timed model (unscheduled communication), trading accuracy against speed; memory- and cadence-based models are envisaged.)

Depending on the type of medium between the operators in the PREESM architecture model, the XSLT transformation generates calls to the appropriate predefined communication library. Specific code libraries have been developed to manage the communications and synchronizations between the target cores [2].

5. Rapid Prototyping of a Signal Processing Algorithm from the 3GPP LTE Standard

The framework functionalities detailed in the previous sections are now applied to the rapid prototyping of a signal processing application from the 3GPP LTE radio access network physical layer.

5.1. The 3GPP LTE Standard. The 3GPP [44] is a group formed by telecommunication organizations to standardize the third generation (3G) mobile phone system specification. This group is currently developing a new standard: the Long-Term Evolution (LTE) of the 3G. The aim of this standard is to bring data rates of tens of megabits per second to wireless devices. The communication between the User Equipment (UE) and the evolved base station (eNodeB) starts when the UE requests a connection to the eNodeB via a random access preamble (Figure 14). The eNodeB then allocates radio resources to the user for the rest of the random access procedure and sends a response. The UE answers with an L2/L3 message containing an identification number. Finally, the eNodeB sends back the identification number of the connected UE. If several UEs sent the same random access preamble at the same time, only one connection is granted and the other UEs will need to send a new random access preamble. After the random access procedure, the eNodeB allocates resources to the UE, and uplink and downlink logical channels are created to exchange data continuously. The decoding algorithm, at the eNodeB, of the UE random access preamble is studied in this section. This algorithm is known as the Random Access CHannel Preamble Detection (RACH-PD).

Figure 12: Code generation. (Diagram: the scheduler deploys the algorithm onto a three-core C64x+ architecture model; code generation emits one XML file per core, an XSL transformation combining IDL prototypes, the C64x+.xsl stylesheet, communication libraries, and actor code produces one C file per core, and the TI Code Composer compiler builds the per-core executables.)

module antenna_delay {
  typedef long cplx;
  typedef short param;
  interface antenna_delay {
    void init(in cplx antIn);
    void loop(in cplx antIn, out char waitOut, in param antSize);
  };
};

Figure 13: Example of an IDL prototype.

Figure 14: Random access procedure. (Sequence diagram between UE and eNodeB: random access preamble, random access response, L2/L3 message, message for early contention resolution.)

5.2. The RACH Preamble Detection. The RACH is a contention-based uplink channel used mainly in the initial transmission requests from the UE to the eNodeB for connection to the network. The UE, seeking connection with a base station, sends its signature in a RACH preamble dedicated time and frequency window in accordance with a predefined preamble format. Signatures have special autocorrelation and intercorrelation properties that maximize the ability of the eNodeB to distinguish between different UEs. The RACH preamble procedure implemented in the LTE eNodeB can detect and identify each user's signature and is dependent on the cell size and the system bandwidth. Assume that the eNodeB has the capacity to handle the processing of this RACH preamble detection every millisecond in a worst case scenario.

Figure 15: The random access slot structure. (Diagram: a RACH burst of n ms within the preamble bandwidth, with guard periods GP1 and GP2 at each end around a 2x N-sample preamble.)

The preamble is sent over a specified time-frequency resource, denoted as a slot, available with a certain cycle period and a fixed bandwidth. Within each slot, a Guard Period (GP) is reserved at each end to maintain time orthogonality between adjacent slots [45]. This preamble-based random access slot structure is shown in Figure 15.

The case study in this article assumes a RACH-PD for a cell size of 115 km. This is the largest cell size supported by LTE and is also the case requiring the most processing power. According to [46], preamble format no. 3 is used with 21,012 complex samples as a cyclic prefix for GP1, followed by a preamble of 24,576 samples, followed by the same 24,576 samples repeated. In this case the slot duration is 3 ms, which gives a GP2 of 21,996 samples (the four parts sum to 92,160 samples, that is, 3 ms at the 30.72 MHz LTE sampling rate). As per Figure 16, the algorithm for the RACH preamble detection can be summarized in the following steps [45].

(1) After the cyclic prefix removal, the preprocessing (Preproc) function isolates the RACH bandwidth by shifting the data in frequency and filtering it with downsampling. It then transforms the data into the frequency domain.

(2) Next, the circular correlation (CirCorr) function correlates the data with several prestored preamble root sequences (or signatures) in order to discriminate between simultaneous messages from several users. It also applies an IFFT to return to the temporal domain and calculates the energy of each root sequence correlation.

Figure 16: Random Access Channel Preamble Detection (RACH-PD) Algorithm. (Diagram: per antenna (1 to N) and preamble repetition (1 to P), RACH preprocessing chains the antenna interface, frequency shift, FIR bandpass filter, DFT, and subcarrier demapping; per root sequence (1 to R), RACH circular correlation chains ZC root sequence multiplication, zero padding, IFFT, power computation, and power accumulation, followed by noise floor estimation and peak search.)

(3) Then, the noise floor threshold (NoiseFloorThr) function collects these energies and estimates the noise level for each root sequence.

(4) Finally, the peak search (PeakSearch) function detects all signatures sent by the users in the current time window. It additionally evaluates the transmission timing advance corresponding to the approximate user distance.

In general, depending on the cell size, three parameters of the RACH may be varied: the number of receive antennas, the number of root sequences, and the number of times the same preamble is repeated. The 115 km cell case implies 4 antennas, 64 root sequences, and 2 repetitions.

5.3. Architecture Exploration

5.3.1. Algorithm Model. The goal of this exploration is to determine through simulation the architecture best suited to the 115 km cell RACH-PD algorithm. The RACH-PD algorithm behavior is described as an SDF graph in PREESM. A static deployment enables static memory allocation, removing the need for runtime memory administration. The algorithm can be easily adapted to different configurations by tuning the HSDF parameters. Using the same approach as in [47], a valid scheduling derived from the representation in Figure 16 can be described by the compact expression:

(8Preproc)(4(64(InitPower(2((SingleZCProc)(PowAcc))))PowAcc))(64NoiseFloorThreshold)PeakSearch.

We can separate the preamble detection algorithm into 4 steps:

(1) preprocessing step: (8Preproc),

(2) circular correlation step: (4(64(InitPower(2((SingleZCProc)(PowAcc))))PowAcc)),

(3) noise floor threshold step: (64NoiseFloorThreshold),

(4) peak search step: PeakSearch.
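A hypothetical loop skeleton mirroring this expression for the 115 km case (4 antennas, 64 root sequences, 2 repetitions); all function names are invented stand-ins for the actual actors:

#define N_ANT  4
#define N_ROOT 64
#define N_REP  2

/* Invented actor prototypes; in PREESM these are handwritten. */
void preproc(int i);
void init_power(int a, int r);
void single_zc_proc(int a, int r, int k);
void pow_acc(int a, int r, int k);
void pow_acc_final(int a);
void noise_floor_threshold(int r);
void peak_search(void);

void rach_pd_iteration(void)
{
    for (int i = 0; i < 8; i++)                /* (8 Preproc)          */
        preproc(i);
    for (int a = 0; a < N_ANT; a++) {          /* circular correlation */
        for (int r = 0; r < N_ROOT; r++) {
            init_power(a, r);
            for (int k = 0; k < N_REP; k++) {  /* 2 repetitions        */
                single_zc_proc(a, r, k);
                pow_acc(a, r, k);
            }
        }
        pow_acc_final(a);                      /* trailing PowAcc      */
    }
    for (int r = 0; r < N_ROOT; r++)           /* noise floor step     */
        noise_floor_threshold(r);
    peak_search();                             /* peak search step     */
}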

Each of these steps is mapped onto the available cores and will appear in the exploration results detailed in Section 5.3.4. The given description generates 1,357 operations; this does not include the communication operations necessary in the case of multicore architectures. Placing these operations by hand onto the different cores would be greatly time-consuming. As seen in Section 4.2, the rapid prototyping PREESM tool offers automatic scheduling, avoiding the problem of manual placement.

Figure 17: Four architectures explored. (Diagram: a single C64x+ core; two, three, and four C64x+ cores interconnected by EDMA.)

5.3.2. Architecture Exploration. The four architectures explored are shown in Figure 17. The cores are all homogeneous Texas Instruments TMS320C64x+ Digital Signal Processors (DSPs) running at 1 GHz [48]. The connections are made via DMA links. The first architecture is a single-core DSP such as the TMS320TCI6482. The second architecture is dual-core, with each core similar to that of the TMS320TCI6482. The third is tri-core and is equivalent to the new TMS320TCI6487 [40]. Finally, the fourth architecture is a theoretical quad-core, for exploration only. The exploration goal is to determine the number of cores required to run the RACH-PD algorithm in a 115 km cell and how to best distribute the operations on the given cores.

5.3.3. Architecture Model. To solve the deployment problem, each operation is assigned an experimental timing (in terms of CPU cycles). These timings are measured with deployments of the actors on a single C64x+. Since the C64x+ is a 32-bit fixed-point DSP core, the algorithms must be converted from floating-point to fixed-point prior to these deployments. The EDMA is modelled as a nonblocking medium (see Section 4.2.2) transferring data at a constant rate and with a given set up time. Assuming the EDMA has the same performance from L2 internal memory to L2 internal memory as the EDMA3 of the TMS320TCI6482 (see [42]), the transfer of N bytes via EDMA should take approximately transfer(N) = 135 + (N ÷ 3.375) cycles. Consequently, in the PREESM model, the average data rate used for simulation is 3.375 GBytes/s and the EDMA set up time is 135 cycles.

Figure 18: Timings of the RACH-PD algorithm schedule on target architectures. (Bar chart: schedule latency for 1 core and for 2, 3, and 4 cores with EDMA, under the loosely timed, approximately timed, and accurately timed models, against the real-time limit of 4 ms.)
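The EDMA timing model above (transfer(N) = 135 + N ÷ 3.375 cycles) reduces to a one-line function; a minimal sketch to make the numbers concrete:

#include <stdio.h>

/* EDMA transfer model: 135 set up cycles plus one cycle per
   3.375 bytes (i.e., 3.375 GBytes/s at 1 GHz). */
static long edma_cycles(long n_bytes)
{
    return 135 + (long)(n_bytes / 3.375);
}

int main(void)
{
    /* e.g., a 4096-byte block costs about 1348 cycles. */
    printf("%ld cycles\n", edma_cycles(4096));
    return 0;
}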

5.3.4. Architecture Choice. The PREESM automatic scheduling process is applied for each architecture. The workflow used is close to that of Figure 8. The simulation results obtained are shown in Figure 18. The list scheduling heuristic is used with the loosely-timed, approximately-timed, and accurately-timed ABCs. Due to the 115 km cell constraints, preamble detection must be processed in less than 4 ms.

The experimental timings were measured on code executions using a TMS320TCI6487. The timings feeding the simulation are measured in loops, each calling a single function with the L1 cache activated. For more details about the C64x+ cache, see [48]. This represents the application behavior when local data access is ideal and will lead to an optimistic simulation. The RACH application is well suited for a parallel architecture, as the addition of one core reduces the latency dramatically. Two cores can process the algorithm within a time frame close to the real-time deadline with the loosely and approximately timed models, but high data transfer contention and a high number of transfers disqualify this solution when the accurately timed model is used.

The 3-core solution is clearly the best one: its CPU loads (less than 86% with the accurately-timed ABC) are satisfactory and do not justify the use of a fourth core, as can be seen in Figure 18. The high data contention in this case study justifies the use of several ABC models: simple models for fast results, and more complex models to dimension the system correctly.

Figure 19: TMS320TCI6487 architecture. (Diagram: three C64x+ cores (GEM 0 to GEM 2), each with local L2 memory, connected through the switched central resources (SCR) to the EDMA3, intercore interruptions, hardware semaphores, and DDR2 external memory.)

5.4. Code Generation. Code libraries developed for the TMS320TCI6487 and the code automatically generated by PREESM (see Section 4.3) were used in this experiment. Details of the code libraries and code optimizations are given in [2]. The architecture of the TMS320TCI6487 is shown in Figure 19. The communication between the cores is performed by copying data with the EDMA3 from one core's local L2 memory to another core's L2 memory. The cores are synchronized using intercore interruptions. Two modes are available for memory sharing: in symmetric mode, each CPU has 1 MByte of L2 memory, while in asymmetric mode, core-0 has 1.5 MBytes, core-1 has 1 MByte, and core-2 has 0.5 MByte.

In the PREESM-generated code, the sizes of the statically allocated buffers are 1.65 MBytes for one core, 1.25 MBytes for a second core, and 200 kBytes for a third core. The asymmetric mode is chosen to fit this memory distribution. As the necessary memory is higher than the internal L2, some buffers are manually chosen to go in the external memory, and the L2 cache [40] is activated. A memory minimization ABC in PREESM would help this process, targeting memory objectives while mapping the actors on the cores.

Modeling the RACH-PD algorithm in PREESM while varying the architectures (1-, 2-, 3-, and 4-core-based) enabled the exploration of multiple solutions under the criterion of meeting the stringent latency requirement. Once the target architecture is chosen, PREESM can be set up to generate a framework code for the simulated solution. As highlighted and explained in the previous paragraph, the buffers statically allocated by the generated code were larger than the physical memory of the target architecture. This necessitated manually moving some of the noncritical buffers to external memory. The generated code, representing a priori a good deployment solution, when executed on the target had an average load of 78% per core while meeting the real-time deadline. Hence, the goal of decoding a RACH-PD every 4 ms on the TMS320TCI6487 is successfully accomplished. A simplified view of the code execution is shown in Figure 20. The execution of the generated code led to a realistic assessment of a deployment very close to that predicted with the accurately timed ABC, where the simulation had shown an average load per core of around 80%. These results show that prototyping the application with PREESM allows different solutions to be assessed by simulation and gives the designer a realistic picture of the multicore solution before solving complex mapping problems. This global result needs to be tempered, because one week of manual memory optimizations and some manual constraints were necessary to obtain such a fast deployment. New ABCs computing the costs of semaphores for synchronizations and the memory balance between the cores will reduce this manual optimization time.

Figure 20: Execution of the RACH-PD algorithm on a TMS320TCI6487. (Gantt chart: preprocessing and circular correlation of 32 signatures at a time distributed over CPU 0 to CPU 2 in 4 ms periods, with noise floor estimation and peak search on CPU 0 at maximal cadence.)

6. Conclusions

The intent of this paper was to detail the functionalities of a rapid prototyping framework comprising the Graphiti, SDF4J, and PREESM tools. The main features of the framework are the generic graph editor, the graph transformation module, the automatic static scheduler, and the code generator. With this framework, a user can describe and simulate the deployment, choose the most suitable architecture for the algorithm, and generate an efficient framework code. The framework has been successfully tested on the RACH-PD algorithm from the 3GPP LTE standard. The RACH-PD algorithm with 1357 operations was deployed on a tri-core DSP, and the simulation was validated by the generated code execution. In the near future, an increasing number of CPUs will be available in complex Systems on Chips. Developing methodologies and tools to efficiently partition code on these architectures is thus an increasingly important objective.

References

[1] E. A. Lee, “The problem with threads,” Computer, vol. 39, no. 5, pp. 33–42, 2006.
[2] M. Pelcat, S. Aridhi, and J. F. Nezan, “Optimization of automatically generated multi-core code for the LTE RACH-PD algorithm,” in Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP '08), Brussels, Belgium, November 2008.
[3] J. Piat, S. S. Bhattacharyya, M. Pelcat, and M. Raulet, “Multi-core code generation from interface based hierarchy,” in Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP '09), Sophia Antipolis, France, September 2009.
[4] M. Pelcat, P. Menuet, S. Aridhi, and J.-F. Nezan, “Scalable compile-time scheduler for multi-core architectures,” in Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP '09), Sophia Antipolis, France, September 2009.
[5] “Eclipse Open Source IDE,” http://www.eclipse.org/downloads.
[6] T. Grandpierre and Y. Sorel, “From algorithm and architecture specifications to automatic generation of distributed real-time executives: a seamless flow of graphs transformations,” in Proceedings of the 1st ACM and IEEE International Conference on Formal Methods and Models for Co-Design (MEMOCODE '03), pp. 123–132, 2003.
[7] “OpenMP,” http://openmp.org/wp.
[8] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, “Cilk: an efficient multithreaded runtime system,” Journal of Parallel and Distributed Computing, vol. 37, no. 1, pp. 55–69, 1996.
[9] “OpenCL,” http://www.khronos.org/opencl.
[10] “The Multicore Association,” http://www.multicore-association.org/home.php.
[11] “PolyCore Software Poly-Mapper tool,” http://www.polycoresoftware.com/products3.php.
[12] E. A. Lee, “Overview of the Ptolemy project,” Technical Memorandum UCB/ERL M01/11, University of California, Berkeley, Calif, USA, 2001.
[13] J. Eker and J. W. Janneck, “CAL language report,” Tech. Rep. ERL Technical Memo UCB/ERL M03/48, University of California, Berkeley, Calif, USA, December 2003.
[14] S. S. Bhattacharyya, G. Brebner, J. Janneck, et al., “OpenDF: a dataflow toolset for reconfigurable hardware and multicore systems,” ACM SIGARCH Computer Architecture News, vol. 36, no. 5, pp. 29–35, 2008.
[15] G. Karsai, J. Sztipanovits, A. Ledeczi, and T. Bapty, “Model-integrated development of embedded software,” Proceedings of the IEEE, vol. 91, no. 1, pp. 145–164, 2003.
[16] P. Belanovic, An open tool integration environment for efficient design of embedded systems in wireless communications, Ph.D. thesis, Technische Universität Wien, Vienna, Austria, 2006.
[17] T. Grandpierre, C. Lavarenne, and Y. Sorel, “Optimized rapid prototyping for real-time embedded heterogeneous multiprocessors,” in Proceedings of the 7th International Workshop on Hardware/Software Codesign (CODES '99), pp. 74–78, 1999.
[18] C.-J. Hsu, F. Keceli, M.-Y. Ko, S. Shahparnia, and S. S. Bhattacharyya, “DIF: an interchange format for dataflow-based design tools,” in Proceedings of the 3rd and 4th International Workshops on Computer Systems: Architectures, Modeling, and Simulation (SAMOS '04), vol. 3133 of Lecture Notes in Computer Science, pp. 423–432, 2004.
[19] S. Stuijk, Predictable mapping of streaming applications on multiprocessors, Ph.D. thesis, Technische Universiteit Eindhoven, Eindhoven, The Netherlands, 2007.
[20] B. D. Theelen, “A performance analysis tool for scenario-aware streaming applications,” in Proceedings of the 4th International Conference on the Quantitative Evaluation of Systems (QEST '07), pp. 269–270, 2007.
[21] “Graphiti Editor,” http://sourceforge.net/projects/graphiti-editor.
[22] E. A. Lee and D. G. Messerschmitt, “Synchronous data flow,” Proceedings of the IEEE, vol. 75, no. 9, pp. 1235–1245, 1987.
[23] “SDF4J,” http://sourceforge.net/projects/sdf4j.
[24] “PREESM,” http://sourceforge.net/projects/preesm.
[25] J. W. Janneck, “NL—a network language,” Tech. Rep., ASTG Technical Memo, Programmable Solutions Group, Xilinx, July 2007.
[26] SPIRIT Schema Working Group, “IP-XACT v1.4: a specification for XML meta-data and tool interfaces,” Tech. Rep., The SPIRIT Consortium, March 2008.
[27] U. Brandes, M. Eiglsperger, I. Herman, M. Himsolt, and M. S. Marshall, “GraphML progress report, structural layer proposal,” in Proceedings of the 9th International Symposium on Graph Drawing (GD '01), P. Mutzel, M. Jünger, and S. Leipert, Eds., pp. 501–512, Springer, Vienna, Austria, 2001.
[28] J. Piat, M. Raulet, M. Pelcat, P. Mu, and O. Deforges, “An extensible framework for fast prototyping of multiprocessor dataflow applications,” in Proceedings of the 3rd International Design and Test Workshop (IDT '08), pp. 215–220, Monastir, Tunisia, December 2008.
[29] “W3C XML standard,” http://www.w3.org/XML.
[30] “W3C XSLT standard,” http://www.w3.org/Style/XSL.
[31] “Grammatica parser generator,” http://grammatica.percederberg.net.
[32] J. W. Janneck and R. Esser, “A predicate-based approach to defining visual language syntax,” in Proceedings of IEEE Symposium on Human-Centric Computing (HCC '01), pp. 40–47, Stresa, Italy, 2001.
[33] J. L. Pino, S. S. Bhattacharyya, and E. A. Lee, “A hierarchical multiprocessor scheduling framework for synchronous dataflow graphs,” Tech. Rep., University of California, Berkeley, Calif, USA, 1995.
[34] S. Sriram and S. S. Bhattacharyya, Embedded Multiprocessors: Scheduling and Synchronization, CRC Press, Boca Raton, Fla, USA, 1st edition, 2000.
[35] V. Sarkar, Partitioning and scheduling parallel programs for execution on multiprocessors, Ph.D. thesis, Stanford University, Palo Alto, Calif, USA, 1987.
[36] O. Sinnen and L. A. Sousa, “Communication contention in task scheduling,” IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 6, pp. 503–515, 2005.
[37] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, San Francisco, Calif, USA, 1990.
[38] Y.-K. Kwok, High-performance algorithms of compile-time scheduling of parallel processors, Ph.D. thesis, Hong Kong University of Science and Technology, Hong Kong, 1997.
[39] F. Ghenassia, Transaction-Level Modeling with SystemC: TLM Concepts and Applications for Embedded Systems, Springer, New York, NY, USA, 2006.
[40] “TMS320TCI6487 DSP platform,” Texas Instruments product bulletin SPRT405.
[41] “TMS320 DSP/BIOS user's guide,” SPRU423F.
[42] B. Feng and R. Salman, “TMS320TCI6482 EDMA3 performance,” Technical Document SPRAAG8, Texas Instruments, November 2006.
[43] “RapidIO,” http://www.rapidio.org/home.
[44] “The 3rd Generation Partnership Project,” http://www.3gpp.org.
[45] J. Jiang, T. Muharemovic, and P. Bertrand, “Random access preamble detection for long term evolution wireless networks,” US patent no. 20090040918.
[46] “3GPP technical specification group radio access network; evolved universal terrestrial radio access (EUTRA) (Release 8),” 3GPP, TS 36.211 (V8.1.0).
[47] S. S. Bhattacharyya and E. A. Lee, “Memory management for dataflow programming of multirate signal processing algorithms,” IEEE Transactions on Signal Processing, vol. 42, no. 5, pp. 1190–1201, 1994.
[48] “TMS320C64x/C64x+ DSP CPU and instruction set,” Reference Guide SPRU732G, Texas Instruments, February 2008.

Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 976296, 13 pages
doi:10.1155/2009/976296

Research Article

Run-Time HW/SW Scheduling of Data Flow Applications on Reconfigurable Architectures

Fakhreddine Ghaffari, Benoit Miramond, and Francois Verdier

ETIS Laboratory, UMR 8051, ENSEA, University of Cergy-Pontoise, CNRS, 6 avenue du Ponceau, BP 44, 95014 Cergy-Pontoise Cedex, France

Correspondence should be addressed to Fakhreddine Ghaffari, [email protected]

Received 1 March 2009; Revised 22 July 2009; Accepted 7 October 2009

Recommended by Markus Rupp

This paper presents an efficient dynamic run-time hardware/software scheduling approach. The scheduling heuristic consists in mapping online the different tasks of a highly dynamic application in such a way that the total execution time is minimized. We consider soft real-time, data-flow-oriented applications for which the execution time is a function of the nature of the input data. The target architecture is composed of two processors connected to a dynamically reconfigurable hardware accelerator. Our approach takes advantage of the reconfiguration property of the considered architecture to adapt the processing to the system dynamics. We compare our heuristic with another similar approach. We present the results of our scheduling method on several image processing applications. Our experiments include simulation and synthesis results on a Virtex-5-based platform. These results show a better performance than existing methods.

Copyright © 2009 Fakhreddine Ghaffari et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

One of the main steps of the HW/SW codesign of a mixed (software and hardware) electronic system is the scheduling of the application tasks on the processing elements (PEs) of the platform. Scheduling an application formed by N tasks on M target processing units consists in finding a feasible partitioning in which the N tasks are launched onto their corresponding M units, together with an ordering on each PE for which the total execution time of the application meets the real-time constraints. This problem of multiprocessor scheduling is known to be NP-hard [1, 2], which is why we propose a heuristic approach.

Many applications, in particular in image processing (e.g., an intelligent embedded camera), have data-dependent execution times according to the nature of the input to be processed. In this kind of application, the implementation is often stressed by real-time constraints, which demand adaptive computation capabilities. In this case, according to the nature of the input data, the system must adapt its behaviour to the dynamics of the evolution of the data and continue to meet the variable needs of the required computation (in quantity and/or in type). Examples of applications where the processing needs change in quantity (the computation load is variable) come from intelligent image processing, where the duration of the processing can depend on the number of objects in the image (motion detection, tracking, etc.) or on the number of areas of interest (contour detection, labelling, etc.).

We can also quote the run-time use of different filters according to the texture of the processed image (here it is the type of processing which varies). Another example of a dynamic application is video encoding, where the run-length encoding (RLE) of frames depends on the information within the frames.

For these dynamic applications, many implementation options are possible. In this paper we consider an intelligent embedded camera for which we propose a new design approach compared to classical worst-case implementations.

Our method consists in evaluating the application context online and adapting its implementation onto the different targeted processing units by launching a run-time partitioning algorithm. The online modification of the partitioning result can also be a solution for fault tolerance, by reassigning at run time the tasks of a faulty target unit to other operational targets [3]. This also requires revising the scheduling strategy. More precisely, the scheduling result must change at run time in two cases.

(i) Firstly, to evaluate the partitioning result: after each modification of the task implementations, we need to know the new total execution time, and this is only possible by rescheduling all the tasks.

(ii) Secondly, by modifying the scheduling result we can obtain a better total execution time which meets the real-time constraint without modifying the partitioning. This is because the characteristics of the tasks (mainly their execution times) are modified according to the nature of the input data.

In that context, the choice of the implementation of the scheduler is of major importance and depends on the heuristic complexity. Indeed, with our method the decisions taken online by our scheduler can be very time consuming. A software implementation of the proposed scheduling strategies would then delay the application tasks. For this reason, we propose in this work a hardware implementation of our scheduling heuristic.

With this implementation, the scheduler takes only a few clock cycles, so we can easily call it at run time without penalty on the total execution time of the application.

The primary contribution of our work is the concept of an efficient online scheduling heuristic for heterogeneous multiprocessor platforms. This heuristic provides good results for both hardware tasks (on the FPGA) and software tasks (on the targeted general-purpose processors), as well as an extensive speedup through the hardware implementation of the heuristic. Finally, the implementation of our scheduler allows the system to adapt itself to the application context in real time. We have simulated and synthesized our scheduler targeting an FPGA (Xilinx Virtex-5) platform. We have tested the scheduling technique on several image processing applications implemented onto a heterogeneous target architecture composed of two processors coupled with a configurable logic unit (FPGA).

The remainder of this paper is organized as follows. Section 2 presents related works on hardware/software scheduling approaches. Section 3 introduces the framework of our scheduling problem. Section 4 presents the proposed approach. Section 5 shows the experimental results, and finally Section 6 concludes this paper.

2. Related Works

The field of study which tries to find an execution order for a set of tasks that meets system design objectives (e.g., minimizing the total application execution time) has been widely covered in the literature. In [4–6] the problem of HW/SW scheduling for system-on-chip platforms with a dynamically reconfigurable logic architecture is exhaustively studied. Moreover, several works deal with scheduling algorithms implemented in hardware [7–9]. Scheduling in such systems is based on priorities; therefore, an obvious solution is to implement priority queues. Many hardware architectures for the queues have been proposed: binary tree comparators, FIFO queues plus a priority encoder, and a systolic array priority queue [7]. Nevertheless, all these approaches are based on a fixed-priority static scheduling technique. Moreover, most of the proposed hardware approaches address the implementation of only one scheduling algorithm (e.g., Earliest Deadline First) [9, 10]. Hence they are inefficient and not appropriate for systems where the required scheduling behavior changes at run time. Also, system performance for tasks with data-dependent execution times should be improved by using dynamic schedulers instead of static (compile-time) scheduling techniques [11, 12].

In our work, we propose a new hardware-implemented approach which computes task priorities at run time based on the characteristics of each task (execution time, graph dependencies, etc.). Our approach is dynamic in the sense that the execution order is decided at run time, and it supports a heterogeneous (HW/SW) multiprocessor architecture.

The idea of dynamic partitioning/scheduling is based on the dynamic reconfiguration of the target architecture. FPGAs [13, 14] increasingly offer very attractive reconfiguration capabilities: partial or total, static or dynamic.

The reconfiguration latency of dynamically reconfigurable devices represents a major problem that must not be neglected. Several references can be found addressing temporal partitioning for reconfiguration latency minimization [15]. Moreover, configuration prefetching techniques are used to minimize the reconfiguration overhead. A similar technique to lighten this overhead is developed in [16] and is integrated into an existing design environment: a prefetch and replacement unit modifies the schedule and significantly reduces the latency even for highly dynamic tasks.

In fact, there are two different approaches in the literature. The first approach reduces the reconfiguration overhead by modifying the scheduling results. The second one distinguishes between scheduling and reconfiguration: the reconfiguration occurs only if the HW/SW partitioning step needs it, and the scheduling algorithm is needed only to validate this partitioning result. After partitioning, the implementation of each task is unchanged and reconfiguration is no longer necessary. Scheduling then aims at finding the best execution time for a given implementation strategy; since scheduling does not change the partitioning decision, it does not take the reconfiguration time into account.

In this paper, we focus only on the scheduling strategy in the second case. We assume that the reconfiguration aspects are taken into account during the HW/SW partitioning step (the decision of task implementation). Furthermore, we addressed this last step in our previous works [17].

3. Problem Definition

3.1. Target Architecture. The target architecture is depicted in Figure 1. It is a heterogeneous architecture which contains two software processing units, a Master Processor and a Slave Processor, a hardware processing unit, the Reconfigurable Computing Unit (RCU), and shared memory resources. The software processing units are Von Neumann monoprocessing systems and execute only a single task at a time.

Figure 1: The target architecture.

Each hardware task (implemented on the RCU) occupies a tile on the reconfigurable area [18]. The size of the tile is the same for all the tasks, to facilitate the placement and routing of the RCU. We choose, for example, the tile size of the task which uses the maximum of resources on the RCU (by "resource" we designate here the logic elements used by the RCU to map any task).

The RCU unit can be reconfigured partially or totally. Each hardware task is represented by a partial bitstream. All the bitstreams are stored in the contexts memory (the memory shared between the processors and the RCU in Figure 1). These bitstreams are loaded into the RCU before scheduling, to reconfigure the FPGA according to the run-time partitioning results [17]. The HW/SW partitioning result can change at run time according to the temporal characteristics of the tasks [6]. In [17] we proposed an HW/SW partitioning approach based on HW→SW and SW→HW task migrations. Task migration consists in accelerating the tasks which become critical by moving their implementations from the software units to the hardware unit, and in decelerating the tasks which become noncritical by returning them to the software units.

After each new HW/SW partitioning result, the scheduler must provide an evaluation of this solution by providing the corresponding total execution time. The scheduler is thus itself subject to a real-time constraint, since it is launched at run time. With this approach of dynamic partitioning/scheduling, the target architecture becomes very flexible: it can self-adapt even to very dynamic applications.

3.2. Application Model. The considered applications are data flow oriented applications such as image processing, audio processing, or video processing. To model this kind of application we consider a Data Flow Graph (DFG) (an example is depicted in Figure 2), which is a directed acyclic graph where nodes are processing functions and edges describe the communication between tasks (data dependencies between tasks). The size of the DFG depends on the functional partitioning of the application, and hence on the number of tasks and edges. We can notice that the structure of the DFG has a great effect on the execution time of the scheduling operations. A fine-grained DFG makes the system easier to predict, because the task execution times do not vary considerably, thus limiting timing-constraint violations. On the other hand, with a very fine-grained DFG the number of tasks explodes and the communications between tasks become unmanageable.

Figure 2: An example of DFG with 6 tasks.

Each node of the DFG represents a specific task in the application. For each task there can be up to three different implementations: a hardware implementation (HW) placed in the FPGA, a software implementation running on the master processor (MS), and another software implementation running on the slave processor (SL).

Each node of Figure 2 is annotated with two labels: one giving the implementation (MS, SL, or HW) and the other the execution time of the task. Similarly, each edge is annotated with the communication time between the two nodes (two tasks).

Each task of the DFG is characterized by the following four parameters:

(a) Texe (execution time),

(b) Impl (implementation on the RCU or on the master processor or on the slave processor),

(c) Nbpred (number of predecessor tasks),

(d) Nbsucc (number of successor tasks).

All the tasks of a DFG are thus modeled identically, and the only real-time constraint is on the total execution time. At each scheduler invocation, this total execution time corresponds to the longest path in the mapped task graph. It then depends both on the application partitioning and on the chosen order of execution on the processors.
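As an illustration, this task model can be captured by a small C structure; the field and type names below are ours, not taken from the authors' implementation.

    typedef enum { IMPL_MS, IMPL_SL, IMPL_HW } impl_t;  /* Master, Slave, or RCU */

    typedef struct {
        unsigned texe;    /* (a) execution time, dependent on the input data  */
        impl_t   impl;    /* (b) implementation chosen by HW/SW partitioning  */
        unsigned nbpred;  /* (c) number of predecessor tasks                  */
        unsigned nbsucc;  /* (d) number of successor tasks                    */
    } task_t;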

4. Proposed Approach

The applications are periodic. In one period, all the tasks of the DFG must be executed. In image processing, for instance, the period is the execution time needed to process one image. The scheduling must occur online at the end of the execution of all the tasks, and when a violation of the real-time constraints is predicted. Hence the result of partitioning/scheduling will be applied on the next period (the next image, for image processing applications).

For all software tasks do {
    Compute ASAP
    The task with the minimum ASAP is chosen
    If (equality of ASAP)
        Compute urgency
        The task with the maximum urgency is chosen
        If (equality of urgency)
            Compare execution times
            The task with the maximum execution time is chosen
}

Algorithm 1: Principle of our scheduling policy.

Our run-time scheduling policy is dynamic, since the execution order of the application tasks is decided at run time. For the tasks implemented on the RCU, we assume that the hardware resources are sufficient to execute in parallel all the hardware tasks chosen by the partitioning step. Therefore the only condition for launching their execution is the satisfaction of all data dependencies; that is to say, a task may begin execution only after all its predecessors have completed.

For the tasks implemented on the software processors, the conditions for launching are the following.

(1) The satisfaction of all data dependencies.

(2) The availability of the software unit.

A task can thus be in one of four states.

(i) Waiting.

(ii) Running.

(iii) Ready.

(iv) Stopped.

A task is in the waiting state when it waits for the end of execution of one or several predecessor tasks. When a software processing unit has finished the execution of a task, new tasks may become ready for execution, provided of course that all their dependencies have been completed.

A task is stopped in the case of preemption or after finishing its execution.

The states of the processing units (Master, Slave, and HW) in our target architecture are: execution, reconfiguration, or idle.

In the following, we explain the principle of our approach as well as a hardware implementation of the proposed HW/SW scheduler.

As explained in Algorithm 1, the basic idea of our scheduling heuristic is to decide task priorities according to three criteria.

The first criterion is the As Soon As Possible (ASAP) time: the task which has the shortest ASAP date is launched first.

The second criterion is the urgency time: the task which has the maximum urgency has priority to be launched before the others. This new criterion is based on the nature of the successors of the task. The urgency criterion is employed only if there is equality on the first criterion for at least two tasks. If there is still equality on this second criterion, we compare the last criterion, which is the execution time of the tasks: the task with the higher execution time is launched first.

We use these criteria to choose among two or several software tasks (on the Master or on the Slave) for running.

4.0.1. The Urgency Criterion. The urgency criterion is based on the implementation of a task and the implementations of its successors. A task is considered urgent when it is implemented on a software unit (Master or Slave) and has one or more successor tasks implemented on different units (the hardware unit or the other software unit).

Figure 3 shows three examples of DFG. In Figure 3(a), task C is implemented on the Slave processor and is followed by task D, which is implemented on the RCU. Thus the urgency (Urg) of task C is the execution time of its successor (Urg(C) = 13). In example (b) it is task B which is followed by a task D implemented on a different unit (the Master processor). In the last example (c), both tasks B and C are urgent, but task B is more urgent than task C, since its successor has an execution time higher than that of the successor of task C.

When a task has several successors with different implementations, the urgency is the maximum of the execution times of the successors.

In the general case, when the direct successor of task A has the same implementation as A and itself has a successor with a different implementation, this latter task feeds the urgency back to task A.

We show the scheduling result for case (a) when the urgency criterion is respected in Figure 3(d), and when it is not in Figure 3(e). We can notice for all the examples of DFG in Figure 3 that the urgency criterion makes the best choice to obtain a minimum total execution time. The third criterion (the execution time) is an arbitrary choice and very rarely has an impact on the total execution time.
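A compact C sketch of this three-criteria selection (minimum ASAP, then maximum urgency, then maximum execution time) among the ready software tasks is shown below; the parallel arrays loosely mirror the hardware signals described in Section 4.1, and all names are illustrative.

    int select_next_sw_task(int n, const unsigned asap[], const unsigned urgency[],
                            const unsigned texe[], const int ready[])
    {
        int best = -1;
        for (int t = 0; t < n; t++) {
            if (!ready[t])
                continue;                              /* only "Ready" software tasks compete */
            if (best < 0
                || asap[t] < asap[best]                /* 1st criterion: minimum ASAP time    */
                || (asap[t] == asap[best]
                    && urgency[t] > urgency[best])     /* 2nd criterion: maximum urgency      */
                || (asap[t] == asap[best]
                    && urgency[t] == urgency[best]
                    && texe[t] > texe[best]))          /* 3rd criterion: maximum Texe         */
                best = t;
        }
        return best;  /* -1 when no software task is ready */
    }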

We can also notice that our scheduler supports the dynamic creation and deletion of tasks. These online services are only possible when keeping a fixed structure of the DFG along the execution; in that case the dependencies between tasks are known a priori. Dynamic deletion is then possible by assigning a null execution time to the tasks which are not active, and dynamic creation by assigning their execution time when they become active.

This scheduling strategy needs an online computation of several criteria for all the software tasks in the DFG.

We first tried to implement this new scheduling policy on a processor. Figure 4 shows the computation time of our scheduling method when implemented on an Intel Core 2 Duo CPU with a frequency of 2.8 GHz and 4 GB of RAM. We can notice that the average computation time of the scheduler is about 12 milliseconds per image. These experiments were done on an image processing application (the DFG depicted in Figure 12) whose processing period per image is 19 milliseconds. So the scheduling (with this software implementation) takes about 63% of the one-image processing time on a desktop computer.

Figure 3: Case examples of urgency computing. (a) Urg[C] = 13; (b) Urg[B] = 5; (c) Urg[B] = 8, Urg[C] = 2; (d) case of DFG (a), task B before task C; (e) case of DFG (a), task C before task B.

Figure 4: Execution time of the software implementation of the scheduler.

We can conclude that, in an embedded context, a software implementation of this strategy is incompatible with real-time constraints.

In the following, we describe an optimized hardware implementation of our scheduler.

4.1. Hardware Scheduler Architecture. In this section, we describe the proposed architecture of our scheduler. This architecture is shown in Figure 5 for a DFG example of three tasks. It is divided into four main parts.

(1) The DFG IP Sched (the middle part surrounded by a dashed line in the figure).

(2) The DFG Update (DFG Up in the figure).

(3) The MS Manager (SWTM).

(4) The Slave Manager (SLTM).

The basic idea of this hardware architecture is to parallelize the scheduling of the processing tasks as much as possible. At best, all the tasks of the DFG can be scheduled in parallel, given an architecture with infinite resources.

We associate to the application DFG a modified graph with the same structure, composed of IP nodes (each IP represents a task). Therefore, in the best case, where tasks are independent, we could schedule all the tasks in the DFG in only one clock cycle.

To also parallelize the management of the software execution times, we associate a hardware module with each software unit:

(i) the Master Task Manager (SWTM in Figure 5),

(ii) the Slave Task Manager (SLTM in Figure 5).

These two modules manage the order of the task executions and compute the processor execution time for each one.

The input signals of this scheduler architecture are the following.

(i) A pointer in memory to the implementations of all the tasks. We have three kinds of implementation (RCU, Master, and Slave). With the signals SW and HW we can code these three possibilities.

(ii) The measured execution time of each task (Texe).

(iii) The Clock signal and the Reset.

Figure 5: An example of the scheduler architecture for a DFG of three tasks.

The output signals are the following.

(i) The total execution time after scheduling all tasks (Texe_Total).

(ii) The signal All_Done, which indicates the end of the scheduling.

(iii) Scheduled_DFG, a pointer to the scheduling result matrix to be sent to the operating system (or any simple executive).

(iv) Nb_Task and Nb_Task_Slave, the number of tasks scheduled on the Master and the number of tasks scheduled on the Slave, respectively. These two signals were added solely for the purpose of simulation in ModelSim (to check the scheduling result); in the real case we do not need them, since this information comes from the partitioning block.

The last part is the DFG Update block (DFG Up), which updates the result matrix after each scheduling of a task.

In the following paragraphs, we detail each part of this architecture.

4.1.1. The DFG IP Sched Block. In this block there are N components (N is the number of tasks in the application). With each task we associate an IP component which computes the intrinsic characteristics of this task (urgency, ASAP, Ready state, etc.). It also computes the total execution time for the entire graph.

The proposed architecture of this IP is shown in Figure 6 (in the appendix).

For each task the implementation PE and the execution time are fixed, so the role of this IP is to calculate the start time of the task and to define its state. This is done by taking into account the state of the corresponding target (Master, Slave, or RCU). The IP then iterates along the DFG structure to determine a total execution ordering and to assign the start times.

This IP also calculates the urgency criterion of critical tasks according to the implementations and execution times of their successors.

If a task is implemented on the RCU, it is launched as soon as all its predecessors are done. The scheduling time of hardware tasks thus depends on the number of tasks that can run in parallel: the IP can schedule all the hardware tasks that can run in parallel in a single clock cycle.

For the software tasks (on the Master or on the Slave), the scheduling takes one clock cycle per task. Thus the computing time of the hardware scheduler only depends on the result of the HW/SW partitioning.

4.1.2. The DFG Update Block. When a DFG is scheduled, the result modifies the DFG into a new structure. The DFG Update block (Figure 7 in the appendix) generates new edges (dependencies between tasks) after scheduling, in order to impose a total order of execution on each computing unit according to the scheduling results.

We represent the dependencies between tasks in the DFG by a matrix where the rows represent the predecessors and the columns the successors. For example, Figure 8 depicts the matrix of dependencies corresponding to the DFG of Figure 2. After scheduling, the resulting matrix is the update of the original one: it contains more dependencies than the latter. This is the role of the DFG Update block.
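A minimal C sketch of this matrix representation and of the update step, under our own naming: serializing two tasks on the same processor amounts to setting one extra entry (e.g., the B-to-C edge added in Figure 8).

    #define N_TASKS 6                            /* size of the example DFG of Figure 2 */

    unsigned char dep[N_TASKS][N_TASKS];         /* dep[p][s] = 1 iff there is an edge p -> s */

    /* Invoked by the DFG Update step when the scheduler decides that, on a given
       processor, task b must run right after task a: one extra dependency is
       enough to impose the chosen total order on that unit. */
    void add_schedule_edge(int a, int b)
    {
        dep[a][b] = 1;
    }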

4.1.3. The MS Manager Block. The objective of this module is to schedule the software tasks according to the algorithm given above. Figure 9 in the appendix presents the architecture of the Master Manager block. The input signal ASAP_SW represents the ASAP times of all the tasks. The Urgency_Time signal represents the urgency of each task of the application. The SW_Ready signal gathers the Ready signals of all the software tasks.

The MIN_ASAP_TASKS signal represents all the "Ready" tasks having the same minimum ASAP time.

The MAX_CT_TASKS signal represents all the "Ready" tasks having the same maximum urgency. The tasks which satisfy the two preceding criteria are represented by the Tasks_Ready signal. The Task_Scheduled signal determines the single software task which will be scheduled. With this signal, it is possible to select the proper value of the TEXE_SW signal and then to update the SW_Total_Time signal. A single clock cycle is necessary to schedule a single software task.

By analogy, the Slave Manager block has the same role as the MS Manager block; from the scheduling point of view there is no difference between the two processors.

4.2. HW/SW Scheduler Outputs. In this section, we describe how the results of our scheduler are processed by a target module such as an executive or a Real-Time Operating System (RTOS). As depicted in Figure 8, the output of our run-time HW/SW scheduler is an n × n matrix, where n is the total number of tasks in the DFG. Figure 10 shows the scheduling result of the DFG depicted in Figure 12. This matrix will be used by a centralized Operating System (OS) to fill its task queues for the three computing units.

The table shown in Figure 11 is a compilation of the results of both the partitioning and scheduling operations.

Figure 6: An IP representing one task. (Combinational logic: SW_Ready = SW and Ready and not Done; SL_Ready = SL and Ready and not Done; HW_Ready = not SW and not SL and Ready and not Done.)

Figure 7: The DFG updating architecture.

The OS browses the matrix row by row. Whenever it finds a "1", it puts the task whose number corresponds to the column into the waiting state. At the end of a task execution, the corresponding waiting tasks on each unit become either Ready or Running.

A task is in the Ready state only when all its dependencies are done but the target unit is busy. Thus there is no Ready state for the hardware tasks.

It should be noted that if the OS runs on the Master processor, for example, the latter will be interrupted each time the OS has to execute.
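As a sketch, and assuming hypothetical helper names, a centralized OS can consume the result matrix with a simple per-task dependency counter.

    #define N_TASKS 20                             /* size of the DFG of Figure 12      */

    static unsigned char dep[N_TASKS][N_TASKS];    /* filled from the scheduler output  */
    static int pending[N_TASKS];                   /* predecessors left; initialized    */
                                                   /* from the matrix at each period    */
    extern void dispatch_or_mark_ready(int s);     /* hypothetical OS primitive         */

    void on_task_done(int p)
    {
        for (int s = 0; s < N_TASKS; s++) {
            if (dep[p][s] && --pending[s] == 0) {
                /* all dependencies done: run s, or mark it Ready if its software
                   unit is still busy (hardware tasks can start immediately) */
                dispatch_or_mark_ready(s);
            }
        }
    }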

5. Experiments and Results

With the idea of covering a wide range of data-flow applications, we led experiments on real and artificial applications. In the context of this paper we present the summary of the results obtained on three case studies in the domain of real-time image processing:

(i) a motion detection application,

(ii) an artificial extension of this detection application,

(iii) a robotic vision application.

Figure 8: Matrix result after scheduling. (a) Before scheduling, the matrix contains only the edges of the DFG of Figure 2 (A → B, A → C, B → D, C → E, C → F). (b) After scheduling, it also contains the serialization edges added by the scheduler (B → C on the Slave processor and F → E on the Master processor).

The second case study is a complex DFG which contains different classical structures (fork, join, sequential). This DFG is depicted in Figure 12. It contains twenty tasks. Each task can be implemented on a software computation unit (Master or Slave processor) or on the reconfigurable RCU. The original DFG is the model of an image processing application: motion detection on a fixed image background. This application is composed of 10 sequential tasks (from ID 1 to ID 10 in Figure 12). We added 10 other virtual tasks to obtain a complex DFG containing the different possible parallel structures. This type of parallel program paradigm (fork, join, etc.) arises in many application areas.

In order to test the presented scheduling approach, we have performed a large number of experiments in which several scenarios of HW/SW partitioning results were analyzed.

As an example, Figure 12 presents the scheduling result when tasks 3, 4, 7, 8, 11, 12, 17, 18, and 20 are implemented in hardware. As explained in Section 4.1, new dependencies (dotted lines) are added in the original graph to impose a total order on each processor. In this figure all the execution times are in milliseconds (ms).

We also led our experiments on a more dynamic application from the robotic vision domain [19]. It consists in a subset of a cognitive system allowing a robot equipped with a CCD camera to navigate and to perceive objects. The global architecture in which the visual system is integrated is biologically inspired and based on the interactions between the processing of the visual flow and the robot movements. In order to learn its environment, the system identifies keypoints in the landscape.

Keypoints are detected in a sampled scale space based on an image pyramid, as presented in Figure 13. The application is dynamic in the sense that the number of keypoints depends on the scene observed by the camera. Consequently, the execution time of the Search and Extract tasks in the graph changes dynamically (see [19] for more details about this application).

5.1. Comparison Results. Throughout our experiments, we compared the results of our scheduler with those given by the HCP (Heterogeneous Critical Path) algorithm developed by Bjorn-Jorgensen and Madsen [20]. This algorithm is an approach to scheduling on a heterogeneous multiprocessor architecture. It starts with the calculation of priorities for each task associated with a processor. A task is chosen depending on the length of its critical path (CPL): the task which has the largest minimum CPL has the highest priority. We compared with this method because it has been shown to outperform several other approaches (MD, MCP, PC, etc.) [21].

The summary of the experiments is presented in Figure 14. Each column gives average values for one of the three presented applications with different partitioning strategies.

Comparing the first and second rows shows that our scheduling method provides consistent results.

The quality of the scheduling solutions found by our method and by the HCP method is similar. Moreover, our method obtains better results for the complex icam application: the HCP method returns an average total execution time of 69 milliseconds, whereas our method returns only 58 milliseconds for the same DFG. For the icam simple application, the DFG is completely sequential, so whatever the scheduling method, the result is always the same. For the robotic vision application, we find the same total execution time with the two methods, because of the existence of a critical path in the DFG which always sets the overall execution time. We also measured the execution overhead of the proposed scheduling algorithm when it is implemented in software (third row of Figure 14) and in hardware (fourth row).

Since the scheduling overhead depends on the number of tasks in the application, we only indicate the average values in Figure 14. As an example, Figure 15 presents the execution time of the hardware scheduler (in cycles) according to the number of software tasks.

From Figure 15, it may be concluded that when the result of partitioning changes at runtime, the computation time needed for our scheduler to schedule all the DFG tasks depends heavily on this modification of the task implementations. So:

(1) the partitioning result and the DFG structure have a great impact on the scheduler computation time;

(2) the maximum schedule computation time corresponds to the case where all tasks are on the software units, since each software task takes one clock cycle;

Figure 9: The module of the MS Manager.

Table 1: Device utilization summary after synthesis.

Used logic utilization          | Icam simple | Icam complex | Robotic vision
Number of slice registers       | <1%         | <1%          | <1%
Number of slice LUTs            | 3%          | 6%           | 9%
Number of fully used bit slices | 5%          | 5%           | 6%
Number of bonded IOBs           | 24%         | 64%          | 100%
Number of BUFG/BUFGCTRLs        | 3%          | 3%           | 3%
Scheduler frequency             | 23.94 MHz   | 19.54 MHz    | 17.44 MHz

Figure 10: Scheduling result (the 20 × 20 dependency matrix obtained for the DFG of Figure 12).

(3) the minimum schedule computation time depends on the DFG structure: it is reached when all tasks are on the hardware and is given by the longest sequential sequence of tasks (each such task taking one clock cycle).

In our case (Figure 12) this sequence is formed by 10 tasks, so the minimum schedule computation time is equal to 10 clock cycles for this application.

The results confirm that a software implementation of our scheduling algorithm is incompatible with online scheduling. Instead, the hardware implementation proposed in this paper brings determinism, better scalability, and a ×20000 speedup.

5.2. Synthesis Results. We have synthesized our scheduler architecture for an FPGA target platform (Virtex-5, device XC5VLX330-2FF1760) [22] as the RCU of Figure 1. Table 1 shows the device utilization summary for the three considered applications when we choose a bus size of 16 bits. We notice that for the presented complex DFG, our scheduler uses only 6% of the device slice LUTs, which is reasonable. These results are obtained with a design frequency of about 19.54 MHz. The Virtex-5 XC5VLX330 device provides 207,360 slice registers, 207,360 slice LUTs, 9,687 fully used bit slices, 1,200 IOBs, and 32 BUFG/BUFGCTRLs.

These results are confirmed in Figure 16, where we synthesize the same scheduler for the three applications, but with a 10-bit bus size, as explained in the following paragraph.

5.3. Accuracy of Execution Time Values. The accuracy of the execution time values is defined by the size of the bus which must convey this information from one module to another. The size of this bus is a determining parameter in the scheduler synthesis.
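As an illustration of this trade-off, execution times can be rescaled and saturated to fit the chosen bus width before being fed to the scheduler IP; this is our own sketch, not the authors' implementation.

    #include <stdint.h>

    /* Rescale a measured execution time to a coarser unit (via 'shift') and
       saturate it to the largest value a 'bus_bits'-wide bus can carry
       (bus_bits is assumed to be less than 32, e.g., 10, 16, or 32 minus one). */
    uint32_t quantize_texe(uint32_t texe, unsigned shift, unsigned bus_bits)
    {
        uint32_t max = (1u << bus_bits) - 1u;  /* largest representable value  */
        uint32_t q = texe >> shift;            /* coarser unit -> fewer bits   */
        return (q > max) ? max : q;            /* saturate instead of wrapping */
    }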

Table 2: Behaviour of the scheduler in dynamic situations.

                                                  | Total execution time of application | Scheduling time | Size of the scheduler IP | Need to resynthesize
Variation of execution time of tasks              | Impacted     | Not impacted | Not impacted | No
Variation of partitioning results                 | Impacted     | Impacted     | Not impacted | No
Variation of the DFG structure (fork, join, etc.) | Impacted     | Impacted     | Not impacted | Yes
Variation of the application                      | Impacted     | Impacted     | Impacted     | Yes
Variation of the execution time precision         | Not impacted | Impacted     | Impacted     | Yes

Figure 11: Lists of management for a centralised OS.

As shown in Figure 17, when the size of the bus increases, the number of hardware resources used also increases and the frequency of the circuit decreases. But for our scheduler, even with a 32-bit bus, the IP keeps a relatively small size compared to the total number of available resources (20% of the slice LUTs). This is another advantage of our scheduler architecture.

In the general case, the designer has to make a trade-off between the accuracy of the performance measures (in this case the execution time) and the cost in number of hardware resources and in maximum frequency of the circuit.

5.4. Summary. Through the various synthesis results, we confirm the effectiveness of the hardware architecture for our proposed scheduling method. With these three applications, we have swept most of the existing DFG structures: the sequential one in the application icam simple, the fork and join in the application icam complex, and the parallelism in the robotic vision application.

Figure 12: DFG application. The interrupted lines represent the scheduling results.

This scheduling heuristic gives better results than the HCP method. Moreover, the proposed hardware architecture is very efficient in terms of resource utilization and scheduling latency.

Figure 13: DFG graph of the robotic vision application (keypoint detection on an image pyramid: oversampling, subsampling, Gaussian filtering, gradient, DoG, Search, and Extract tasks at low, medium, and high frequencies).

Figure 14: Execution time for 3 applications.

                              | Icam simple (10 tasks) | Icam complex (20 tasks) | Robotic vision (30 tasks)
Our total execution time (ms) | 47                     | 58                      | 10943
HCP total execution time (ms) | 47                     | 69                      | 10943
SW scheduling time (ms)       | 7.5                    | 12                      | 21
HW scheduling time (ms)       | 0.00041764             | 0.00056274              | 0.000916976

These features allow our scheduling IP to run online and to meet the needs of the dynamic nature of most of today's applications.

6. Conclusions

In this paper, we presented a complete run-time hardware/software scheduling approach. The results of our experiments show the efficiency of adapting the scheduling to a dynamic change of the partitioning, which can be due to a new mode of a dynamic application or to fault detection. As developed in this paper, a dynamic HW/SW scheduling approach has many advantages over traditional static approaches. In addition, the efficiency of our hardware implementation gives our scheduler a minimal overhead in an online execution context.

In conclusion, Table 2 summarizes the behavior of our scheduling approach in different situations of dynamicity.

We show through this table in which cases it is necessary to modify the scheduler IP and resynthesize it, and in which cases the IP can adapt to the dynamic system.

Figure 15: Variation of scheduling computation time according to task implementations.

Figure 16: Scalability of the method according to application complexity (synthesis results for a bus size of 10 bits: frequency and number of slice LUTs for the three benchmarks).

Figure 17: Impact of bus size on the scheduler synthesis results for the robotic vision application (frequency and number of slice LUTs for bus sizes of 10, 16, and 32 bits).

Our future work consists in integrating our scheduling approach among the services of an RTOS for dynamically reconfigurable systems.

Appendix

Block Diagrams of the Hardware Scheduler

See Figures 6, 7, and 9.

References

[1] M. Garey and D. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, San Francisco, Calif, USA, 1979.
[2] Z. A. Mann and A. Orban, “Optimization problems in system-level synthesis,” in Proceedings of the 3rd Hungarian-Japanese Symposium on Discrete Mathematics and Its Applications, Tokyo, Japan, 2003.
[3] C. Haubelt, D. Koch, and J. Teich, “Basic OS support for distributed reconfigurable hardware,” in Proceedings of the 3rd and 4th International Workshops on Computer Systems: Architectures, Modeling, and Simulation (SAMOS '04), vol. 3133, pp. 30–38, Samos, Greece, July 2004.
[4] J. Noguera and R. M. Badia, “Dynamic run-time HW/SW scheduling techniques for reconfigurable architectures,” in Proceedings of the 10th International Symposium on Hardware/Software Codesign (CODES '02), pp. 205–210, ACM Press, New York, NY, USA, 2002.
[5] Y. Lu, T. Marconi, K. Bertels, and G. Gaydadjiev, “Online task scheduling for the FPGA-based partially reconfigurable systems,” in Proceedings of the 5th International Workshop on Reconfigurable Computing: Architectures, Tools and Applications (ARC '09), pp. 216–230, Karlsruhe, Germany, March 2009.
[6] R. Pellizzoni and M. Caccamo, “Real-time management of hardware and software tasks for FPGA-based embedded systems,” IEEE Transactions on Computers, vol. 56, no. 12, pp. 1666–1680, 2007.
[7] S.-W. Moon, J. Rexford, and K. G. Shin, “Scalable hardware priority queue architectures for high-speed packet switches,” IEEE Transactions on Computers, vol. 49, no. 11, pp. 1215–1227, 2000.
[8] D. Picker and R. Fellman, “A VLSI priority packet queue with inheritance and overwrite,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 2, pp. 245–253, 1995.
[9] B. K. Kim and K. G. Shin, “Scalable hardware earliest-deadline-first scheduler for ATM switching networks,” in Proceedings of the 18th IEEE Real-Time Systems Symposium, pp. 210–218, San Francisco, Calif, USA, December 1997.
[10] T. Pop, P. Pop, P. Eles, and Z. Peng, “Analysis and optimization of hierarchically scheduled multiprocessor embedded systems,” International Journal of Parallel Programming, vol. 36, no. 1, pp. 37–67, 2008.
[11] S. Fekete, J. van der Veen, J. Angermeier, D. Gohringer, M. Majer, and J. Teich, “Scheduling and communication-aware mapping of HW-SW modules for dynamically and partially reconfigurable SoC architectures,” in Proceedings of the Dynamically Reconfigurable Systems Workshop (DRS '07), Zurich, Switzerland, March 2007.
[12] B. Miramond and J.-M. Delosme, “Design space exploration for dynamically reconfigurable architectures,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE '05), pp. 366–371, 2005.
[13] Altera Corp., http://www.altera.com/.
[14] Xilinx Corp., http://www.xilinx.com/.
[15] K. M. G. Purna and D. Bhatia, “Temporal partitioning and scheduling data flow graphs for reconfigurable computers,” IEEE Transactions on Computers, vol. 48, no. 6, pp. 579–590, 1999.
[16] J. Resano, D. Mozos, D. Verkest, F. Catthoor, and S. Vernalde, “Specific scheduling support to minimize the reconfiguration overhead of dynamically reconfigurable hardware,” in Proceedings of the 41st Annual Conference on Design Automation (DAC '04), pp. 119–124, ACM Press, San Diego, Calif, USA, June 2004.
[17] F. Ghaffari, M. Auguin, M. Abid, and M. B. Jemaa, “Dynamic and on-line design space exploration for reconfigurable architectures,” in Transactions on High-Performance Embedded Architectures and Compilers I, P. Stenstrom, Ed., vol. 4050 of Lecture Notes in Computer Science, pp. 179–193, Springer, Berlin, Germany, 2007.
[18] J.-Y. Mignolet, V. Nollet, P. Coene, D. Verkest, S. Vernalde, and R. Lauwereins, “Infrastructure for design and management of relocatable tasks in a heterogeneous reconfigurable system-on-chip,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE '03), Messe, Munich, Germany, March 2003.
[19] F. Verdier, B. Miramond, M. Maillard, E. Huck, and T. Lefebvre, “Using high-level RTOS models for HW/SW embedded architecture exploration: case study on mobile robotic vision,” EURASIP Journal on Embedded Systems, vol. 2008, no. 1, Article ID 349465, 2008.
[20] P. Bjorn-Jorgensen and J. Madsen, “Critical path driven cosynthesis for heterogeneous target architectures,” in Proceedings of the 5th International Workshop on Hardware/Software Codesign (CODES/CASHE '97), pp. 15–19, Braunschweig, Germany, March 1997.
[21] Y.-K. Kwok and I. Ahmad, “Static scheduling algorithms for allocating directed task graphs to multiprocessors,” ACM Computing Surveys, vol. 31, no. 4, pp. 406–471, 1999.
[22] Virtex II Pro, Xilinx Corp., http://www.xilinx.com/.

Page 47: 541420

Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 723465, 15 pages
doi:10.1155/2009/723465

Research Article

Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes

Massimo Rovini, Giuseppe Gentile, Francesco Rossi, and Luca Fanucci

Department of Information Engineering, University of Pisa, Via G. Caruso 16, 56122 Pisa, Italy

Correspondence should be addressed to Massimo Rovini, [email protected]

Received 4 March 2009; Revised 18 May 2009; Accepted 27 July 2009

Recommended by Markus Rupp

The layered decoding algorithm has recently been proposed as an efficient means for the decoding of low-density parity-check (LDPC) codes, thanks to the remarkable improvement in the convergence speed (2x) of the decoding process. However, pipelined semi-parallel decoders suffer from violations or "hazards" between consecutive updates, which not only violate the layered principle but also enforce the loops in the code, thus spoiling the error-correction performance. This paper describes three different techniques to properly reschedule the decoding updates, based on the careful insertion of "idle" cycles, to prevent the hazards of the pipeline mechanism. Also, different semi-parallel architectures of a layered LDPC decoder suitable for use with such techniques are analyzed. Then, taking the LDPC codes for the wireless local area network (IEEE 802.11n) as a case study, a detailed analysis of the performance attained with the proposed techniques and architectures is reported, and results of the logic synthesis on a 65 nm low-power CMOS technology are shown.

Copyright © 2009 Massimo Rovini et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Improving the reliability of data transmission over noisy channels is the key issue of modern communication systems, and particularly of wireless systems, whose spatial coverage and data rate are increasing steadily.

In this context, low-density parity-check (LDPC) codes have gained momentum in the scientific community, and they have recently been adopted as forward error correction (FEC) codes by several communication standards, such as the second-generation digital video broadcasting (DVB-S2, [1]), the wireless metropolitan area networks (WMANs, IEEE 802.16e, [2]), the wireless local area networks (WLANs, IEEE 802.11n, [3]), and the 10 Gbit Ethernet (10GBASE-T, IEEE 802.3an).

LDPC codes were first discovered by Gallager in the early 1960s [4] but were long put aside until MacKay and Neal, sustained by the advances in very large-scale integration (VLSI) technology, rediscovered them in the early 1990s [5]. The renewed interest and the success of LDPC codes are due to (i) the remarkable error-correction performance, even at low signal-to-noise ratios (SNRs) and for small block-lengths, (ii) the flexibility in the design of the code parameters, (iii) the decoding algorithm, very suitable for hardware parallelization, and, last but not least, (iv) the advent of structured or architecture-aware (AA) codes [6]. AA-LDPC codes reduce the decoder area and power consumption and improve the scalability of its architecture, thus allowing the full exploitation of the complexity/throughput design trade-offs. Furthermore, AA-codes perform so close to random codes [6] that they are the common choice of all the latest LDPC-based standards.

Nowadays, data services and user applications impose severe low-complexity and low-power constraints on the design of practical decoders and demand very high throughput. The adoption of a fully parallel decoder architecture leads to impressive throughput but unfortunately is also so complex in terms of both area and routing [7] that a semi-parallel implementation is usually preferred (see [6, 8]).

So, to counteract the reduced throughput, designers can act at two levels: at the algorithmic level, by efficiently rescheduling the message-passing algorithm to improve its convergence rate, and at the architectural level, with the pipelining of the decoding process, to shorten the iteration time. The first matter can be solved with the turbo-decoding message-passing (TDMP) [6] or the layered decoding algorithm [9], while pipelined architectures are mandatory especially when the decoder employs serial processing units.

Figure 1: Tanner graph of a simple 3 × 4 base-matrix and principle of vectorization.

However, the pipeline mechanism may dramatically corrupt the error-correction performance of a layered decoder by letting the processing units not always work on the most updated messages. This issue, known as pipeline "hazard", arises when the dependence between the elaborations is violated. The idea is then to reschedule the sequence of updates and to delay with "idle" cycles the decoding process until newer data are available.

As an improvement over similar state-of-the-art works [10–13], this paper proposes three systematic techniques to optimally reschedule the decoding process, in a way that minimizes the number of idle cycles and achieves the maximum throughput. Also, this paper discusses different semi-parallel architectures, based on serial processing units and all supporting the reordering strategies, so as to attain the best trade-off between complexity and throughput for every LDPC code.

Semi-parallel architectures of LDPC decoders have recently been addressed in several papers, although none of them formally solves the issue of pipeline hazards and decoding idling. Gunnam et al. describe in [10] a pipelined semi-parallel decoder for WLAN LDPC codes, but the authors do not mention the issue of pipeline hazards; only the need of properly scrambling the sequence of data in order to clear some memory conflicts is described.

Boutillon et al. consider in [13] methods and architectures for layered decoding; the authors mention the problem of pipeline hazards (cut-edge conflict) and of using an output order different from the natural one in the processing units; nonetheless, the issue is not investigated further, and they simply modify the decoding algorithm to compute partial updates as in [14]. Although this approach allows the decoder to operate in full pipeline with no idle cycles, it is actually suboptimal in terms of both performance and complexity.

Similarly, Bhatt et al. propose in [11] a pipelined block-serial decoder architecture based on partial updates but, again, they do not investigate the dependence between elaborations.

In [12], Fewer et al. implement a semi-parallel TDMP decoder, but the authors boost the throughput by decoding two codewords in parallel and not by means of pipelining.

This paper is organised as follows. Section 2 recalls the basics of LDPC and AA-LDPC codes, and Section 3 summarizes the layered decoding algorithm. Section 4 introduces three different techniques to reduce the dependence between consecutive updates and analytically derives the related number of idle cycles. After this, Section 5 describes the VLSI architectures of a pipelined block-serial LDPC layered decoder. Section 6 briefly reviews the WLAN codes used as a case study, while the performance of the related decoder is analysed in Section 7. Then, the results of the logic synthesis on a 65 nm low-power CMOS technology are discussed in Section 8, along with the comparison with similar state-of-the-art implementations. Finally, conclusions are drawn in Section 9.

2. Architecture-Aware Block-LDPC Codes

LDPC codes are linear block-codes described by a parity-check matrix H establishing a certain number of (even) parity constraints on the bits of a codeword. Figure 1 shows the parity-check matrix of a very simple LDPC code with length N = 4 bits and with M = 3 parity constraints. LDPC codes are also effectively described in a graphical way through a Tanner graph [15], where each bit in the codeword is represented with a circle, known as variable-node (VN), and each parity-check constraint with a square, known as check-node (CN).

Recently, the joint design of code and decoder has blossomed in many works (see [8, 16]), and several principles have been established for the design of implementation-oriented AA-codes [6]. These can be summarized into (i) the arrangement of the parity-check matrix in squared subblocks, and (ii) the use of deterministic patterns within the subblocks. Accordingly, AA-LDPC codes are also referred to as block-LDPC codes [8].

The pattern used within the blocks is the vital facet for a low-cost implementation of the interconnection network of the decoder and can be based either on permutations, as in [6] and for the class of π-rotation codes [17], or on circulants, that is, cyclic shifts of the identity matrix, as in [8] and in all the recent standards [1–3].

AA-LDPC codes are defined by the number of block-columns nc, the number of block-rows nr, and the block-size B, which is the size of the component submatrices. Their parity-check matrix H can be conveniently viewed as the expansion of a base-matrix HB with size nr × nc. The expansion is accomplished by replacing the 1's in HB with permutations or circulants, and the 0's with null subblocks. Thus, the block-size B is also referred to as the expansion factor, for a codeword length of the resulting LDPC code equal to N = B · nc and code rate r = 1 − nr/nc.

A simple example of expansion or vectorization of a base-matrix is shown in Figure 1. The size, number, and location of the nonnull blocks in the code are the key parameters to get good error-correction performance and low complexity of the related decoder.
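To make the expansion step concrete, the short sketch below (Python with NumPy; the helper name expand_base_matrix and the toy base-matrix are our own illustration, not taken from any standard) builds H from a base-matrix of circulant shift values, with −1 denoting a null block:

import numpy as np

def expand_base_matrix(hb, B):
    """Expand a base-matrix of shift values into the binary matrix H:
    an entry s >= 0 becomes the B x B identity cyclically shifted by s,
    and an entry -1 becomes the B x B all-zero block."""
    nr, nc = hb.shape
    H = np.zeros((nr * B, nc * B), dtype=np.uint8)
    I = np.eye(B, dtype=np.uint8)
    for r in range(nr):
        for c in range(nc):
            if hb[r, c] >= 0:
                # np.roll along the columns implements the circulant shift
                H[r*B:(r+1)*B, c*B:(c+1)*B] = np.roll(I, hb[r, c], axis=1)
    return H

# Toy example: a 2 x 4 base-matrix expanded with block-size B = 3,
# giving N = B*nc = 12 codeword bits and M = B*nr = 6 parity checks
hb = np.array([[0, -1,  2,  1],
               [1,  0, -1,  2]])
H = expand_base_matrix(hb, B=3)
assert H.shape == (6, 12)

Each block-row of H then gathers B parity checks that share no codeword bit, which is precisely the layer granularity exploited by the decoder architectures of Section 5.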


3. Decoding of LDPC Codes

LDPC codes are decoded with the belief propagation (BP) or message-passing (MP) algorithm, which belongs to the broader class of maximum a posteriori (MAP) algorithms. The BP algorithm has been proved to be optimal if the graph of the code does not contain cycles, but it can still be used and considered as a reference for practical codes with cycles. In the latter case, the sequence of the elaborations, also referred to as schedule, considerably affects the achievable performance.

The most common schedule for BP is the so-called two-phase or flooding schedule (FS) [18], where all the parity-check nodes first, and then all the variable nodes, are updated in sequence.

A different approach, taking into account the distribution of closed paths and girths in the code, has been described by Xiao and Banihashemi in [19]. Although probabilistic schedules are shown to outperform deterministic ones, the random activation strategy of the processing nodes is not very suitable for HW implementation and adds significant complexity overheads.

The most attractive schedule is the shuffled or layered decoding [6, 9, 18, 20]. Compared to the FS, the layered schedule almost doubles the decoding convergence speed, both for codes with cycles and for cycle-free codes [20]. This is achieved by looking at the code as a connection of smaller supercodes [6] or layers [9], exchanging intermediate reliability messages. Specifically, a posteriori messages are made available to the next layers immediately after computation, and not at the next iteration as in a conventional flooding schedule.

Layers can be any set of either CNs or VNs, and, accordingly, CN-centric (or horizontal) and VN-centric (or vertical) algorithms have been analyzed in [18, 20]. However, CN-centric solutions are preferable, since they can exploit serial, flexible, and low-complexity CN processors.

The horizontal layered decoding (HLD) is summarized in Algorithm 1 and consists in the exchange of probabilistic reliability messages around the edges of the Tanner graph (see Figure 1) in the form of logarithms of likelihood ratios (LLRs); given the random variable x, its LLR is defined as

LLR(x) = log( Pr(x = 1) / Pr(x = 0) ).  (1)

In Algorithm 1, λn is the nth a priori LLR of the received bits, with n = 0, 1, . . . , N − 1 and N the length of the codeword, M is the overall number of parity-check constraints, and Nit is the number of decoding iterations. Also, N(m) is the set of VNs connected to the mth CN, ε(q)m,n represents the check-to-variable (c2v) reliability message sent from CN m to VN n at iteration q, and yn is the total information or soft-output (SO) of the nth bit in the codeword (see Figure 1).

For the sake of an easier notation, it is assumed here that a layer corresponds to a single row of the parity-check matrix. Before being used by the next CN or layer, SOs are refined with the involved c2v message, as shown in line 13, and thanks to this mechanism faster convergence is achieved.

input: a priori LLRs λn, n = 0, 1, . . . , N − 1
output: a posteriori hard-decisions ŷn = sign(yn)
(1) // Messages initialization
(2) q = 0; yn = λn; ε(0)m,n = 0, ∀n = 0, . . . , N − 1, ∀m = 0, . . . , M − 1;
(3) while (q < Nit & !Convergence) do
(4)   // Loop on all layers
(5)   for m ← 0 to M − 1 do
(6)     // Check-node update
(7)     forall n ∈ N(m) do
(8)       // Sign update
(9)       −sign(ε(q+1)m,n) = ∏ j∈N(m)\n −sign(yj − ε(q)m,j);
(10)      // Magnitude update
(11)      |ε(q+1)m,n| = M-min* j∈N(m)\n ( |yj − ε(q)m,j| );
(12)      // Soft-output update
(13)      yn = yn − ε(q)m,n + ε(q+1)m,n
(14)    end
(15)  end
(16)  q++;
(17) end

Algorithm 1: Horizontal layered decoding.

Magnitudes are updated with the M-min* binary operator [21], defined as

M-min*(a, b) = min(a, b) + log( e^|a−b| / (1 + e^|a−b|) )

for a, b ≥ 0. Following an approach similar to Jones et al. [22], the updating rule of magnitudes is further simplified with the method described in [23], which proved to yield very good performance. Here, only two values are computed and propagated for the magnitude of c2v messages; specifically, if we define

jmin = arg min j∈N(m) ( |yj − ε(q)m,j| )  (2)

as the index of the smallest variable-to-check (v2c) message entering CN m, then a dedicated c2v message is computed in response to VN jmin:

|ε(q+1)m,jmin| = M-min* j∈N(m), j≠jmin ( |yj − ε(q)m,j| ) = αm,  (3)

while all the remaining VNs receive one common, nonmarginalized value of magnitude, given by

|ε(q+1)m,n| n≠jmin = M-min*( αm, |yjmin − ε(q)m,jmin| ) = βm.  (4)
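To illustrate the two-output simplification of (2)-(4), here is a minimal software model of one check-node update (Python; the function names and the list-based message layout are our own assumptions, a CN degree of at least two is assumed, and the sign bookkeeping follows the usual product-of-signs rule):

import math

def m_min_star(a, b):
    """M-min*(a, b) = min(a, b) + log(e^|a-b| / (1 + e^|a-b|)),
    evaluated in the numerically stable form min(a, b) - log(1 + e^-|a-b|)."""
    return min(a, b) - math.log1p(math.exp(-abs(a - b)))

def check_node_update(v2c):
    """Two-output check-node update: from the v2c messages of one CN,
    return the new c2v messages, where the VN with the smallest input
    magnitude receives alpha_m (eq. (3)) and all the others beta_m (eq. (4))."""
    mags = [abs(v) for v in v2c]
    jmin = mags.index(min(mags))                    # eq. (2)
    alpha = None
    for j, m in enumerate(mags):                    # M-min* over all j != jmin
        if j != jmin:
            alpha = m if alpha is None else m_min_star(alpha, m)
    beta = m_min_star(alpha, mags[jmin])
    sign_all = 1
    for v in v2c:
        sign_all *= 1 if v >= 0 else -1
    return [(sign_all * (1 if v >= 0 else -1)) *    # exclude the own sign
            (alpha if j == jmin else beta)
            for j, v in enumerate(v2c)]

The soft-output refinement of line 13 of Algorithm 1 then reduces, for each connected VN, to adding the new c2v message and subtracting the old one.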

4. Decoding Pipelining and Idling

The data-flow of a pipelined decoder with serial processing units is sketched in Figure 2. A centralized memory unit keeps the updated soft-outputs, computed by the node processors (NPs) according to Algorithm 1. If we denote with dk the number of nonnull blocks in layer k, that is, the degree of layer k, then the processor takes dk clock cycles to serially load its inputs. Then, refined values are written back in memory (after scrambling or permutation) with a latency of LSO clock cycles, and this operation takes again dk clock cycles. Overall, the processing time of layer k is then 2dk + LSO clock cycles, as shown in Figure 3(a).

Figure 2: Outline of the flow of soft-outputs in an LDPC layered decoder with serial processing units.

If the decoder works in pipeline, time is saved by overlapping the phases of elaboration, writing-out, and reading, so that data are continuously read from and written into memory, and a new layer is processed every dk clock cycles (see Figure 3(b)).

Although highly desirable, the pipeline mechanism is particularly challenging in a layered LDPC decoder, since the soft-outputs retrieved from memory and used for the current elaboration may not always be up-to-date, as newer values may still be in the pipeline. This issue, known as pipeline hazard, prevents the use, and so the propagation, of always up-to-date messages and spoils the error-correction performance of the decoding algorithm.

The solution investigated in this paper is to insert null or idle cycles between consecutive updates, so that a node processor is suspended to wait for newer data. The number of idle cycles must be kept as small as possible, since it affects the iteration time and so the decoding throughput. Its value depends on the actual sequence of layers updated by the decoder as well as on the order followed to update messages within a layer.

Three different strategies are described in this section to reduce the dependence between consecutive updates in the HLD algorithm and, accordingly, the number of idle cycles. These differ in the order followed for acquisition and writing-out of the decoding messages and constitute a powerful tool for the design of "layered", hazard-free, LDPC codes.

4.1. System Notation. Without any lack of generality, let us identify a layer with one single parity-check node and, focusing on the set Sk of soft-outputs participating in layer k, let us define the following subsets:

(i) Ak = Sk ∩ Sk−1, the set of SOs in common with layer k − 1;

(ii) Bk = {Sk ∩ Sk+1} \ Sk−1, the set of SOs in common with layer k + 1 and not in Ak;

(iii) Ck = Sk−1 ∩ Sk ∩ Sk+1, the set of SOs in common with both layers k − 1 and k + 1;

(iv) Ek = {Sk ∩ Sk−2} \ {Sk−1 ∪ Sk+1}, the set of SOs in common with layer k − 2 and not in Ak or Bk;

(v) Fk = {Sk ∩ Sk+2} \ {Sk−2 ∪ Sk−1 ∪ Sk+1}, the set of SOs in common with layer k + 2 but not in Ek, Ak, or Bk;

(vi) Gk = {Sk−2 ∩ Sk ∩ Sk+2} \ {Sk−1 ∪ Sk+1}, the set of SOs in common with both layers k − 2 and k + 2, but not in Ak or Bk;

(vii) Rk, the set of remaining SOs.

In the definitions above, the notation A \ B means the relative complement of B in A, or the set-theoretic difference of A and B. Let us also define the following cardinalities: dk = |Sk| (degree of layer k), αk = |Ak|, βk = |Bk|, χk = |Ck|, εk = |Ek|, φk = |Fk|, γk = |Gk|, and ρk = |Rk|.
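These subsets are mechanical to derive from the layer supports; the sketch below (Python; it assumes each layer k is given as the set Sk of its SO indices and that the layer schedule is cyclic, so indices wrap modulo the number of layers) mirrors the definitions above:

def layer_subsets(S, k):
    """Compute the subsets of Section 4.1 for layer k, where S is the
    list of per-layer soft-output supports (Python sets)."""
    L = len(S)
    Sm1, Sp1 = S[(k - 1) % L], S[(k + 1) % L]
    Sm2, Sp2 = S[(k - 2) % L], S[(k + 2) % L]
    A = S[k] & Sm1
    B = (S[k] & Sp1) - Sm1
    C = Sm1 & S[k] & Sp1
    E = (S[k] & Sm2) - (Sm1 | Sp1)
    F = (S[k] & Sp2) - (Sm2 | Sm1 | Sp1)
    G = (Sm2 & S[k] & Sp2) - (Sm1 | Sp1)
    R = S[k] - (A | B | E | F)      # note C ⊆ A and G ⊆ E by construction
    return A, B, C, E, F, G, R

The cardinalities dk, αk, βk, and so on are then simply len(S[k]), len(A), len(B), and so forth.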

4.2. Equal Output Processing. First, let us consider a very straightforward and implementation-friendly architecture of the node processor that updates (and so delivers) the soft-output messages with the same order used to take them in.

In such a case, it would be desirable to (i) postpone the acquisition of messages updated by the previous layer, that is, messages in Ak, and (ii) output the messages in Bk as soon as possible, to let the next layer start earlier. Actually, the last constraint only holds when Ak does not include any message common to layer k + 1, that is, when Ck = ∅; otherwise, the set Bk could be acquired at any time before Ak.

Figure 4 shows the I/O data streams of an equal output processing (EOP) unit. Here, LSO is the latency of the SO data-path, including the elaboration in the NP, the scrambling, and the two memory accesses (reading and writing). Focusing on layer k + 1, the set Ck+1 cannot be assigned to any specific position within Ak+1, since the whole Ak+1 is acquired according to the same order used by layer k to output (and so also acquire) the sets Bk and Ck. For this reason, the situation plotted in Figure 4 is only for the sake of a clearer drawing.

With reference to Figure 4, pipeline hazards are cleared if Ik idle cycles are spent between layers k and k + 1, so that

Ik + |Sk+1 \ Ak+1| ≥ LSO + |Sk \ (Ak ∪ Bk)| · u(|Ck|),  (5)

with u(x) = 1 for x > 0 and u(x) = 0 otherwise. This means that if Ck is empty, then the messages in Sk \ (Ak ∪ Bk) do not need to be waited for. The solution to (5) with minimum latency is

Ik = LSO − (dk+1 − αk+1) + (dk − αk − βk) · u(χk).  (6)

Note that (5) and (6) only hold under the hypothesis of Ck leading within Ak. If this is not the case, up to |Ak \ Ck| extra idle cycles could be added if Ck is output last within Ak.

So far, we have only focused on the interaction between two consecutive layers; however, violations could also arise between layers k and k + 2. Despite this possibility, this issue is not treated here, as it is typically mitigated by the same idle cycles already inserted between layers k and k + 1 and between layers k + 1 and k + 2.

Figure 3: Pipelined and not-pipelined data-flow: (a) not pipelined; (b) pipelined.

Figure 4: Input and output data streams in an NP with EOP.

4.3. Reversed Output Processing. Depending on the particular structure of the parity-check matrix H, it may occur that most of the messages of layer k in common with layer k − 1 are also shared with layer k + 1, that is, Ak ≈ Ck and Bk ≈ ∅. If this condition holds, as for the WLAN LDPC codes (see Figure 11), it can be worth reversing the output order of SOs, so that the messages in Ak can be both acquired last and output first.

Figure 5(a) shows the I/O streams of a reversed output processing (ROP) unit. Exploiting the reversal mechanism, the set Bk is acquired second-last, just before Ak, so that it is available earlier for layer k + 1.

Following a reasoning similar to EOP, the situation sketched in Figure 5(a), where Ck is delivered first within Ak, is just for an easier representation, and the condition for hazard-free layered decoding is now

I2l_k + |Sk+1 \ Ak+1| ≥ LSO + |Ak \ Ck| · u(|Bk|).  (7)

Indeed, when Bk = ∅, one could output Ck first in Ak and so get rid of the term |Ak \ Ck|. However, since Ck is actually left floating within Ak, (7) represents again a best-case scenario, and up to |Ak \ Ck| extra idle cycles could be required. From (7), the minimum-latency solution is

I2l_k = LSO − (dk+1 − αk+1) + (αk − χk) · u(βk).  (8)

Similarly to EOP, the ROP strategy also suffers from pipeline hazards between three consecutive layers, and, because of the reversed output order, the issue is more relevant now. This situation is sketched in Figure 5(b), where the sets Ek, Fk, and Gk are managed similarly to Ak, Bk, and Ck. The ROP strategy is then instructed to acquire the set Ek later and to output Fk earlier. However, the situation is complicated by the fact that the set Fk−1 ∪ Gk−1 may not entirely coincide with Ek+1; rather, it is Ek+1 ⊆ (Fk−1 ∪ Gk−1), since some of the messages in Fk−1 ∪ Gk−1 can be found in Bk+1. This is highlighted in Figure 5(b), where those messages of Fk−1 and Gk−1 not delivered to Ek+1 are shown in dark grey.

To clear the hazards between three layers, additional idle cycles are added in the number of

I3l_k = max{ ACQk+1 − WRk−1, 0 },  (9)

where ACQk+1 is the acquisition margin on layer k + 1, and WRk−1 is the writing-out margin on layer k − 1. These can be computed under the assumption of no hazard between layers k − 1 and k (i.e., Ck ∪ Ak is aligned with Ck−1 ∪ Bk−1 thanks to I2l_k, as shown in Figure 5(b)) and are given by

ACQk+1 = I2l_k + dk+1 − (αk+1 + βk+1 + εk+1),
WRk−1 = (εk−1 − γk−1 + |Gk−1 \ Ek+1|) · u(φk−1).  (10)

The margin WRk−1 is actually nonnull only if Fk−1 ≠ ∅; otherwise, WRk−1 = 0 under the hypothesis that (i) the set Gk−1 is output first within Ek−1, and (ii) within Gk−1, the messages not in Ek+1 are output last.


Figure 5: Organization of the input and output data streams in an NP with ROP: (a) pipeline hazards in the update of two consecutive layers; (b) pipeline hazards in the update of three consecutive layers (messages of Gk−1 and Fk−1 not in Ek+1 are shown in dark grey).

Figure 6: Input and output data streams in an NP with UOP.

Overall, the number of idle cycles of ROP is given by

Ik = I2l_k + I3l_k.  (11)

4.4. Unconstrained Output Processing. Fewer idle cycles are expected if the orders used for input and output are not constrained to each other. This implies that layer k can still delay the acquisition of the messages updated by layer k − 1 (i.e., messages in Ak) as usual, but, at the same time, the messages common to layer k + 1 (i.e., in Bk ∪ Ck) can also be delivered earlier.

The input and output data streams of an unconstrained output processing (UOP) unit are shown in Figure 6. Now, hazard-free layered decoding is achieved when

Ik + |Sk+1 \ Ak+1| ≥ LSO,  (12)

which yields

Ik = LSO − (dk+1 − αk+1).  (13)


Figure 7: Layered decoder architecture with variable-to-check buffer.

Regarding the interaction between three consecutive layers, if the messages common to layer k + 2 (i.e., in Fk ∪ Gk) are output just after Bk ∪ Ck, and if, on layer k + 2, the set Ek+2 is taken just before Ak+2, then there is no risk of pipeline hazard between layers k and k + 2.

4.5. Decoding of Irregular Codes. A serial processor cannot process consecutive layers with decreasing degrees, dk+1 < dk, as the pipeline of the internal elaborations would be corrupted and the output messages of the two layers would overlap in time. This is but another kind of pipeline hazard, and again, it can be solved by delaying the update of the second layer with Δdk = dk − dk+1 idle cycles.

Since this type of hazard is independent of that seen above, the same idle cycles may help to solve both issues. For this reason, the overall number of idle cycles becomes

I′k = max{ Ik, Δdk, 0 },  (14)

with Ik computed according to (6), (11), or (13).
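Condensing the results so far, the idle-cycle computation fits in a few lines; the sketch below (Python; the helper names are ours) implements (6) for EOP, the two-layer term (8) of ROP (the three-layer correction (9)-(11) is omitted for brevity), (13) for UOP, and the final combination (14):

def u(x):
    """Step function of (5)-(8): u(x) = 1 for x > 0, else 0."""
    return 1 if x > 0 else 0

def idle_eop(L_SO, d, a, b, chi, d_next, a_next):
    # eq. (6)
    return L_SO - (d_next - a_next) + (d - a - b) * u(chi)

def idle_rop_two_layer(L_SO, a, chi, b, d_next, a_next):
    # eq. (8); eq. (11) would add the three-layer term I3l_k of (9)
    return L_SO - (d_next - a_next) + (a - chi) * u(b)

def idle_uop(L_SO, d_next, a_next):
    # eq. (13)
    return L_SO - (d_next - a_next)

def idle_total(I_k, d, d_next):
    # eq. (14): also covers the hazard caused by decreasing layer degrees
    return max(I_k, d - d_next, 0)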

4.6. Optimal Sequence of Layers. For a given reordering strategy, the overall number of idle cycles per decoding iteration is a function of the actual sequence of layers used for the decoding. For a code with Λ layers, the optimal sequence of layers p̂ minimizing the time spent in idle is given by

p̂ = arg min p∈P Σ k=0..Λ−1 I′k(p),  (15)

where I′k(p) is the number of idle cycles between layers k and k + 1 for the generic permutation p and is given by (14), and P is the set of the possible permutations of layers.

The minimization problem in (15) can be solved by means of a brute-force computer search and results in the definition of a permuted parity-check matrix H, whose layers are scrambled according to the optimal permutation p̂. Then, within each layer of H, the order in which the nonnull subblocks are updated is given by the strategy in use among EOP, ROP, and UOP.
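As a sketch of such a computer search, the routine below (Python; restricted to UOP, for which I′k depends only on the layer degrees and on the overlap αk+1 between adjacent layers) exhaustively scans the layer permutations; since the schedule is cyclic, the first layer can be fixed, leaving (Λ − 1)! candidates, tractable offline for Λ ≤ 12:

from itertools import permutations

def iteration_idle(S, order, L_SO):
    """Idle cycles per iteration (the objective of (15)) for UOP, with S
    the list of per-layer SO supports and a cyclic layer order."""
    total, L = 0, len(order)
    for i in range(L):
        k, k1 = order[i], order[(i + 1) % L]
        d, d1 = len(S[k]), len(S[k1])
        alpha1 = len(S[k1] & S[k])        # overlap with the previous layer
        I = L_SO - (d1 - alpha1)          # eq. (13)
        total += max(I, d - d1, 0)        # eq. (14)
    return total

def best_layer_order(S, L_SO=5):
    """Brute-force solution of (15): return the cyclic layer sequence
    with the minimum total idle time (layer 0 fixed as the start)."""
    L = len(S)
    return min(((0,) + p for p in permutations(range(1, L))),
               key=lambda order: iteration_idle(S, order, L_SO))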

4.7. Summary and Results. The three methods proposed in this section are differently effective in minimizing the overall time spent in idle. Although UOP is expected to yield the smallest latency, the results strongly depend on the considered LDPC code, and ROP and EOP can be very close to UOP. As a case example, results will be shown in Section 7 for the WLAN LDPC codes.

However, the effectiveness of the individual methods must be weighed against the requirements of the underlying decoder architecture and the costs of its hardware implementation, which are the subject of Section 5. Thus, UOP generally requires greater complexity in hardware, and EOP or ROP can be preferred for particular codes.

5. Decoder Architectures

Low complexity and high throughput are key features demanded of every competitive LDPC decoder, and, to this extent, semi-parallel architectures are widely recognised as the best design choice.

As shown in [6, 8, 12], to mention just a few, a semi-parallel architecture includes an array of processing elements with size usually equal to the expansion factor B of the base-matrix HB. Therefore, the HLD algorithm described in Section 3 must be understood in a vectorized form as well, and, in order to exploit the code structure, a layer counts B consecutive parity-check nodes. Layers (in the number of nr = M/B) are updated in sequence by the B check-node units (CNUs), and an array of B SOs (yn) and of B c2v messages (ε(q)m,n) are concurrently updated at every clock cycle. Since the parity-check equations in a layer are independent by construction, that is, they do not share SOs, the analysis of Section 4 still holds in a vectorized form.

The CNUs are designed to serially update the c2v magnitudes according to (3) and (4), and any arbitrary order of the c2v messages (and so of the SOs, see line 13 of Algorithm 1) can easily be achieved by properly multiplexing between the two values, as also shown in [23]. It must be pointed out that the 2-output approximation described in Section 3 is pivotal to a low-complexity implementation of EOP, ROP, or UOP in the CNU. However, the same strategies could also be used with a different (or even no) approximation in the CNU, although the cost of the related implementation would probably be higher.

Three VLSI architectures of a layered decoder will be described, which differ in the management of the memory units of both SO and c2v messages and so result in different implementation costs in terms of memory (RAM and ROM) and logic.

Figure 8: Layered decoder with three-port SO and c2v memories.

5.1. Local Variable-to-Check Buffer. The most straightforward architecture of a vectorized layered decoder is shown in Figure 7. Here, the arrays of v2c messages μ(q)m,n entering the CNUs during the update of layer m = 0, 1, . . . , nr − 1 are computed on-the-fly as μ(q)m,n = yn − ε(q)m,n with n ∈ N(m), and both the arrays of c2v and SO messages are retrieved from memory.

Then, the updated c2v messages are used to refine every array of SOs belonging to layer m: according to line 13 of Algorithm 1, this is done by adding the new c2v array ε(q+1)m,n to the input v2c array μ(q)m,n. Since the CNUs work in pipeline, while the update of layer m is still in progress, the array of the v2c messages belonging to layer m + 1 is already being computed as μ(q)m+1,n′ = yn′ − ε(q)m+1,n′, with n′ ∈ N(m + 1). For this reason, μ(q)m,n needs to be temporarily stored in a local buffer, as shown in Figure 7. The buffer is vectorized as well and stores B × dc,max messages, with dc,max the maximum CN degree in the code.

Before being stored back in memory, the array yn is circularly shifted and made ready for its next use by applying compound or incremental rotations [12]; this operation is carried out by the circular shifting network of Figure 7, and more details about its architecture are available in [24].

The v2c buffer is the key element that allows the architecture to work in pipeline. It has to sustain one reading and one writing access concurrently and can be efficiently implemented with shift-register-based architectures for EOP (first-in first-out, FIFO, buffer) and ROP (last-in first-out, LIFO, buffer). On the contrary, UOP needs to map the buffer onto a dual-port memory bank, whose (reading) address is provided by an extra configuration memory (ROM).

5.2. Double Memory Access. The buffer of Arch. V-A can be removed if the v2c messages are computed twice on-the-fly, as shown in Figure 8: the first time to feed the array of CNUs, and then to update the SOs. To this aim, a further reading is required to get the arrays yn and ε(q)m,n from memory, and so recompute the array μ(q)m,n at the CNUs' output.

It follows that three-port memories are needed for both SO and c2v messages, since three concurrent accesses have to be supported: two readings (see ports r1 and r2 in Figure 8) and one writing. This memory can be implemented by distributing data over several banks of customary dual-port memory, in such a way that two readings always involve different banks. Actually, in a layered decoder the same memory location needs to be accessed several times per iteration, and concurrently with several other data, so that resorting to only two memory banks would be unfeasible. On the other hand, the management of a higher number of banks would add a significant overhead to the complexity of the whole design.

The proposed solution is sketched in Figure 9 and is based on only two banks (A and B), but, to clear access conflicts, some data are redundantly stored in both banks (see elements C1 and C2 in the example of Figure 9).

The most trivial and expensive solution is achieved when both banks are a full copy, or mirror, of the original memory, as in [11], which corresponds to 100% redundancy. As an alternative, data can be selectively assigned to the two banks through a computer search aiming at minimum redundancy.

Roughly speaking, if we denote by σi the cardinality of the set of data (SO or c2v messages) read concurrently with the ith datum, for i = 0, 1, . . . , N − 1, then the higher Σi σi is (for a given N), the higher is the expected redundancy. So, a small redundancy ρc2v is experienced by the c2v memory, since each c2v message can collide with at most two other data (i.e., maxi{σi} = 2), while a higher redundancy ρSO is associated with the SO memory, since every SO can face up to 2dVN,n conflicts, with dVN,n the degree of the nth variable node, typically greater than 1 (especially for low-rate codes).
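The bank assignment can be prototyped with a simple greedy pass over the list of conflicting pairs; the sketch below (Python) is our own illustration of the idea, not the authors' actual partitioning tool, and duplicates a datum only when a concurrent pair would otherwise be confined to one single bank:

def partition_two_banks(n_items, concurrent_pairs):
    """Assign each datum to bank 'A' or 'B'; when a pair read in the same
    cycle ends up confined to one common bank, store one datum redundantly
    in both banks. Returns the per-item bank sets and the redundancy ratio."""
    banks = [set() for _ in range(n_items)]
    for i, j in concurrent_pairs:
        if not banks[i]:
            banks[i].add('A')
        if not banks[j]:
            banks[j].add('B' if banks[i] == {'A'} else 'A')
        if banks[i] == banks[j] and len(banks[i]) == 1:
            banks[j].add('A' if 'B' in banks[j] else 'B')   # duplicate j
    for b in banks:                     # items never read concurrently
        if not b:
            b.add('A')
    redundancy = sum(len(b) == 2 for b in banks) / n_items
    return banks, redundancy

A production flow would rather minimize the redundancy globally (e.g., by backtracking over the assignment), in line with the computer search mentioned above; even this greedy pass, though, shows why ρc2v stays small while ρSO grows with the variable-node degree.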

Indeed, the issue of memory partitioning and the reordering techniques described in Section 4 are linked to each other: whenever the CNUs are in idle, only one reading is performed. Therefore, an overall system optimization aiming at minimizing, at the same time, the iteration latency and the amount of memory redundancy could be pursued; however, due to the huge optimization space, this task is almost unfeasible and is not considered in this work.

Figure 9: Three-port memory: data partitioning and architecture.

Figure 10: Layered decoder with v2c three-port memory.

5.3. Storage of Variable-to-Check Messages. During the elaboration of a generic layer, a certain v2c message is needed twice, and a local buffer or multiple memory reading operations were implemented in Arch. V-A and Arch. V-B, respectively.

A third way of solving the problem is to compute the array of v2c messages only once per iteration, as in Arch. V-A, but, instead of using a local buffer, the v2c messages are precomputed and stored in the SO memory, ready for the next use, as sketched in Figure 10. A similar architecture is used in [10, 16], but the issue of decoding pipelining is not clearly stated there.

In this way, the SO memory turns into a v2c memory with the following meaning: the array yn updated by layer m is stored in memory after marginalization with the c2v message εm′,n, with m′ the index of the next layer reusing the same array of SOs, yn. In other words, the array of v2c messages involved in the next update of the same block-column n is precomputed. Therefore, the data stored in the v2c memory are used twice, first to feed the array of CNUs, and then for the SO update.

Similarly to Arch. V-B, a three-port memory would be required because of the decoding pipeline; the same considerations of Section 5.2 still hold, and an optimum partitioning of the v2c memory onto two banks with some redundancy can be found. Note that, as opposed to Arch. V-B, a customary dual-port memory is enough for the c2v messages.

As far as the complexity is concerned, at first glance this solution seems preferable to Arch. V-B, since it needs only two stages of parallel adders while the c2v memory is not split. However, the management of the reading ports of the v2c memory introduces significant overheads, since, after the update of the soft-outputs yn by layer m, the memory controller must be aware of which is the next layer m′ using the same soft-outputs yn. This information needs to be stored in a dedicated configuration memory, whose size and area can be significant, especially in a multilength, multirate decoder.


Figure 11: Parity-check base-matrix of the block-LDPC code for IEEE 802.11n with codeword size N2 = 1944 and rate r = 2/3. Black squares correspond to cyclic shifts s of the identity matrix (0 ≤ s ≤ B − 1), also indicated in the square, while empty squares correspond to all-zero submatrices.

6. A Case Study: The IEEE 802.11n LDPC Codes

6.1. LDPC Code Construction. The WLAN standard [3] defines AA-LDPC codes based on circulants of the identity matrix. Three different codeword lengths are supported, N0 = 648, N1 = 1296, and N2 = 1944, each coming with four code rates, 1/2, 2/3, 3/4, and 5/6, for a total of 12 different codes. As a distinguishing feature, a different block-size is used for each codeword length, that is, B0 = 27, B1 = 54, and B2 = 81, respectively; accordingly, every code counts nc = Ni/Bi = 24 block-columns, while the block-rows (layers) are in the number of nr = (1 − r)nc = 12, 8, 6, 4 for code rates 1/2, 2/3, 3/4, and 5/6, respectively.

An example of the base-matrix HB for the code with length N2 = 1944 and rate r = 2/3 is shown in Figure 11.

6.2. Multiframe Decoder Architecture. In order to attain an adequate throughput for every WLAN code, the decoder must include a number of CNUs at least equal to max{Bi} = 81. This means that two thirds of the processors would remain unused with the shortest codes.

In the latter case, the throughput can be increased thanks to a multiframe approach, where Fi = ⌊max{Bi}/Bi⌋ frames of the code with block-size Bi are decoded in parallel. A similar solution is described in [12], but in that case two different frames are decoded in time-division multiplexing by exploiting the two nonoverlapped phases of the flooding algorithm. Here, Fi frames are decoded concurrently; more specifically, three different frames of the shortest code can be assigned to three clusters of 27 CNUs each.

Note that, to work properly, the circular shifting network must support concurrent subrotations, as described in [24].

7. Decoder Performance

To give a practical example of the reordering strategies described in Section 4, Figure 12 shows the data flow related to the update of layer 0 for the WLAN code of Figure 11. While 6 idle cycles are required following the original, natural order of updates (see Figure 12(a)), EOP needs 5 cycles (see Figure 12(b)), ROP reduces them to 1 (see Figure 12(c)), while no idle cycle is used by UOP (see Figure 12(d)). The subsets defined in Section 4.1 are also shown in Figure 12, along with the optimal sequence of layers followed for decoding.

7.1. Latency and Throughput. The latency of a pipelined LDPC decoder can be expressed as

Tdec = tclk · { Nit · (NB + Iit) + Lpipe + 2LIO },  (16)

with tclk = 1/fclk the clock period, Nit the number of iterations, NB the number of nonnull blocks in the code, Iit = Σ k=0..nr−1 I′k the number of idle cycles per iteration, Lpipe the cycles needed to empty the decoder pipeline, and, finally, LIO the cycles for the input/output interface. Among the parameters above, Nit is set for good error-correction performance, NB is a code-dependent parameter, and LIO is fixed by the I/O management; thus, for a minimum latency, the designer can only act on Iit, whose value can be optimised with the techniques of Section 4.

Focusing on the IEEE 802.11n codes, Table 1 shows the overall number of cycles for 12 iterations (Ldec = Tdec/tclk), the number of idle cycles per iteration (Iit), the percentage of idle cycles with respect to the total (idling %), and the throughput at the clock frequency of 240 MHz.

The latter is expressed in information bits decoded per time unit and is also referred to as net throughput:

Γn = Fi · r · Ni / Tdec,  (17)

where Fi is the number of frames decoded in parallel. For this reason, the figures of Table 1 for the short codes are very similar to those for the long codes (N0F0 = N2F2); on the contrary, the middle codes do not benefit from the same mechanism (i.e., F1 = 1), and their throughput is scaled down by a factor 2/3.
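Equations (16) and (17) map directly into a few lines of code; the sketch below (Python; the helper names are ours) reproduces, for instance, the N2 = 1944, r = 2/3 UOP entry of Table 1:

def decoding_latency(N_it, N_B, I_it, L_pipe, L_IO, f_clk):
    """Eq. (16): T_dec = (N_it*(N_B + I_it) + L_pipe + 2*L_IO) / f_clk."""
    return (N_it * (N_B + I_it) + L_pipe + 2 * L_IO) / f_clk

def net_throughput(F_i, r, N_i, T_dec):
    """Eq. (17): information bits decoded per second."""
    return F_i * r * N_i / T_dec

# Cross-check with Table 1 (UOP, N2 = 1944, r = 2/3): L_dec = 1216 cycles
T_dec = 1216 / 240e6                                 # seconds at f_clk = 240 MHz
print(net_throughput(1, 2/3, 1944, T_dec) / 1e6)     # ~256 Mbps, as in Table 1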

The results of Table 1 are given for every technique of Section 4 as well as for the original codes before optimization. Although EOP clearly outperforms the original codes, better results are achieved with ROP and UOP for the WLAN case example, where at most 14% and 11% of the decoding time are spent in idle, respectively. On average, the decoding time decreases from 7.6 to 6.7 μs with EOP, and even to 5.3 μs with ROP and 5.1 μs with UOP. This behaviour can be explained by considering that, for the WLAN codes, the term (dk − αk − βk) · u(χk) found in (6) for EOP is significantly nonnull, while, comparing (8) to (13), ROP and UOP basically differ in the term (αk − χk) · u(βk), which is negligible for the WLAN codes.

Figure 12: An example of optimization of the base-matrix of the IEEE 802.11n LDPC code with N2 = 1944 and r = 2/3 with EOP, ROP, and UOP: (a) original base-matrix (sequence of layers: 0,1,2,3,4,5,6,7); (b) EOP (optimised sequence of layers: 0,5,6,7,4,2,3,1); (c) ROP (optimised sequence of layers: 0,2,7,5,6,3,4,1); (d) UOP (optimised sequence of layers: 0,4,7,6,1,2,6,3). Critical propagations are highlighted in dark gray.

7.2. Error-Correction Performance. Figure 13 compares the floating-point frame error rate (FER) after 12 decoding iterations of a pipelined decoder using EOP, ROP, and UOP with a reference curve obtained by simulating the original parity-check matrix before optimization, in a nonpipelined decoder. Two simulations were run for each strategy, one with the proper number of idle cycles (curves with full markers), and the other without idle cycles, referred to as full pipeline mode (curves with empty markers).

As expected, the three strategies reach the reference curve of the HLD algorithm when properly idled. Then, in case of full pipeline (Ik = 0, ∀k), the performance of EOP is spoiled, while ROP and UOP only pay about 0.6 and 0.3 dB, respectively. This means that the reordering has significantly reduced the dependence between layers, and only a few hazards arise without idle cycles.

Similarly to EOP, no received codeword is successfully decoded even at high SNRs (i.e., FER = 1) if the original code descriptors are simulated in full pipeline. This confirms once more the importance of idle cycles in a pipelined HLD decoder and motivates the need of an optimization technique.

Table 1: Performance of an LDPC decoder for IEEE 802.11n with 12 iterations: LSO = 5 and fclk = 240 MHz.

Code length              N0 = 648                  N1 = 1296                 N2 = 1944
Code rate          1/2   2/3   3/4   5/6     1/2   2/3   3/4   5/6     1/2   2/3   3/4   5/6
NB                  88    88    88    88      86    88    88    85      86    88    85    79
LIO                 72    72    72    72      48    48    48    48      72    72    72    72
Original
  Ldec            2299  1763  1779  1486    2106  1715  1886  1653    2107  1775  1752  1603
  Iit               91    46    47    22      81    46    60    43      77    47    48    41
  idling %         47%   31%   31%   17%     46%   32%   38%   31%     44%   31%   32%   30%
  Γn (Mbps)        101   176   197   262      74   121   124   157     111   175   200   243
EOP
  Ldec            1927  1691  1575  1462    1819  1643  1527  1377    1855  1691  1538  1352
  Iit               60    40    30    20      57    40    30    20      56    40    30    20
  idling %         37%   28%   23%   16%     37%   29%   23%   17%     36%   28%   23%   17%
  Γn (Mbps)        121   184   222   266      85   126   153   188     126   184   228   288
ROP
  Ldec            1308  1216  1290  1403    1223  1168  1239  1330    1283  1228  1243  1305
  Iit                8     0     6    15       7     0     6    16       8     1     5    16
  idling %        7.3%    0%  5.5%   13%    6.8%    0%  5.5%   14%    7.4%    1%  4.8%   14%
  Γn (Mbps)        178   256   271   277     127   178   188   195     182   253   282   298
UOP
  Ldec            1308  1216  1243  1380    1187  1168  1195  1260    1259  1216  1195  1164
  Iit                8     0     2    13       4     0     2    10       6     0     1     4
  idling %        7.3%    0%  1.9%   11%      4%    0%    2%  9.3%    5.6%    0%  0.9%    4%
  Γn (Mbps)        178   256   282   282     131   178   195   206     185   256   293   334

Considering the same scenario of Figure 13, Figure 14 shows the convergence speed, measured in average number of iterations, of the layered decoding algorithm. The curves confirm that HLD needs one half of the number of iterations of the flooding schedule, on average, and show that the full pipeline mode is also penalized in terms of speed.

8. Implementation Results

The complexity of an LDPC decoder for the IEEE 802.11n codes was derived through logic synthesis on a low-power 65 nm CMOS technology targeting fclk = 240 MHz. Every architecture of Section 5 was considered for implementation, each one supporting the three reordering strategies, for a total of 9 combinations. For good error-correction performance, input LLRs and c2v messages were represented on 5 bits, while internal SO and v2c messages on 7 bits.

Table 2 summarizes the complexity of the different designs in terms of logic, measured in equivalent Kgates, and number of RAM and ROM bits. Equivalent gates are counted by referring to the low-drive, 2-input NAND cell, whose area is 2.08 μm2 in the target technology library. Arch. V-A needs the highest number of memory bits due to the local variable-to-check buffer, but its logic is smaller, since it requires no additional hardware resources (adders) and fewer configuration bits.

Because of the partitioning of both the SO and the c2v memories, Arch. V-B needs more logic resources and more memory bits than Arch. V-C (both for data and configuration). The redundancy ratios ρSO and ρc2v of the SO and c2v memories in Arch. V-B, respectively, and ρv2c of the v2c memory in Arch. V-C are also reported in Table 2.

As a matter of fact, the three architectures are very similar in complexity and performance, and, for a given set of LDPC codes, the designer can select the most suitable solution by trading off decoding latency and throughput at the system level with the requirements of logic and memory in terms of area, speed, and power consumption at the technology level.

Table 3 compares the design of a decoder for IEEE 802.11n based on Arch. V-C with UOP with similar state-of-the-art implementations: a parallel decoder by Blanksby and Howland [7], a 2048-bit rate-1/2 TDMP decoder by Mansour and Shanbhag [25], a similar design for WLAN by Gunnam et al. [10], and a decoder for WiMAX by Brack et al. [26]. Here, for a fair comparison, the throughput is expressed in channel bits decoded per time unit; that is, it is the channel throughput Γc = Ni/Tdec = Γn/r.

For the comparison, we focused on the architectural efficiency ηA, defined as

ηA = (Tdec · fclk) / (Nit · NB) = (N · fclk) / (Γc · Nit · NB),  (18)

which represents the average number of clock cycles needed to update one block of H. In decoders based on serial functional units it is ηA ≥ 1, and the higher ηA is, the less efficient the architecture. Actually, ηA can reach 1 only when the dependence between consecutive layers is solved at the code-design level. This is the case of two WiMAX codes (specifically, the class 1/2 and class 2/3B codes), which are hazard-free (or layered) "by construction", thus explaining the very low value of ηA achieved by [26]. However, [26] is as efficient as our design (ηA ≈ 1.3) on the remaining nonlayered WiMAX codes, but the authors do not perform layered decoding on such codes.

Figure 13: Error-correction performance of the IEEE 802.11n, N2 = 1944, rate-1/2 LDPC code after 12 decoding iterations.

Table 2: IEEE 802.11n LDPC decoder complexity analysis.

                      EOP       ROP       UOP
Arch. V-A
  logic (Kgates)     71.29     71.62     74.65
  RAM bits          61,722    61,722    61,722
  ROM bits          23,159    23,159    40,788
Arch. V-B
  logic (Kgates)     75.45     75.75     77.99
  RAM bits          53,622    54,837    57,024
  ρSO                29.2%     29.2%     33.3%
  ρc2v                1.1%      4.6%      9.1%
  ROM bits          36,582    36,582    51,849
Arch. V-C
  logic (Kgates)     71.83     72.14     74.60
  RAM bits          53,217    53,217    53,784
  ρv2c               29.2%     29.2%     33.3%
  ROM bits          34,508    34,508    43,553

For decoders with parallel processing units (see [7, 25]), the architectural efficiency becomes a measure of the parallelization used in the processing units, and it can be expressed as ηA ≈ 1/d̄, with d̄ being the average check-node degree. Indeed, in a two-phase decoder, the number of blocks can be equivalently defined as the overall number of exchanged messages divided by the number of functional units. If E is the number of edges in the code, then NB = 2E/(N + rN), which is an index of the parallelization used in the processors.

Figure 14: IEEE 802.11n, N2 = 1944, rate-1/2 LDPC code: average decoding speed for a maximum of 100 iterations.

The different designs were also compared in terms of energy efficiency, defined as the energy spent per coded bit and per decoding iteration. This is computed as

ηE = Edec / (N · Nit) = P / (Γc · Nit),  (19)

with Edec = P · Tdec the decoding energy and P the power consumption. The latter was estimated with Synopsys Power Compiler and was averaged over three different SNRs (corresponding to different convergence speeds); it includes the power dissipated in the memory units (about 70% of the total). In terms of energy, our design is more efficient than [25] and gets close to the parallel decoder in [7].
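Both figures of merit are easy to cross-check from the tabulated data; a sketch using (18) and (19) (Python; the helper names are ours):

def arch_efficiency(N, f_clk, gamma_c, N_it, N_B):
    """Eq. (18): average clock cycles spent per updated block of H."""
    return (N * f_clk) / (gamma_c * N_it * N_B)

def energy_efficiency(P, gamma_c, N_it):
    """Eq. (19): energy per coded bit and per iteration (J/bit/iter)."""
    return P / (gamma_c * N_it)

# Cross-check with the [this] column of Table 3 (rate-5/6, N2 = 1944 code):
eta_A = arch_efficiency(1944, 240e6, 401e6, 12, 79)   # ~1.23, inside 1.103-1.306
eta_E = energy_efficiency(0.162, 401e6, 12) * 1e12    # ~33.7 pJ/bit/iter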

Since the design in [10] is for the same WLAN LDPC codes and implements a similar layered decoding algorithm with the same number of processing units, a closer inspection is compulsory. Thanks to the idle optimization, our solution is more efficient in terms of throughput, the saving in efficiency ranging from 16% to 23%. Then, although our design saves about 70 mW in power consumption with respect to [10], the related energy efficiency has not been included in Table 3, since the reference scenario used to estimate the power consumption (238 mW) was not clearly defined. Finally, although curves for error-correction performance are not available in [10], penalties are expected in view of the smaller accuracy used to represent v2c (5 bits) and SO (6 bits) messages.


Table 3: State-of-the-art LDPC decoder implementations.

                           [this]              [7]                [10]                [25]               [26]
Technology                 65 nm CMOS          0.16 μm CMOS 5-LM  0.13 μm TSMC CMOS   0.18 μm 1.8 V      0.13 μm CMOS
                                                                                      TSMC CMOS
Algorithm                  layered             flooding           layered             TDMP               flooding/layered
CPU arch.                  serial              parallel           serial              parallel           serial
Nb. of CPUs                81                  1536               81                  64                 96
Msg. width (c2v + SO)      5 + 7               4 + 4              5 + 6               4 + 5              6
Clock freq. (MHz)          240                 64                 500                 125                333
Rates                      1/2, 2/3, 3/4, 5/6  1/2                1/2, 2/3, 3/4, 5/6  1/2 : 1/16 : 7/8   1/2, 2/3, 3/4, 5/6
Codeword length, N         648, 1296, 1944     1024               648, 1296, 1944     2048               576 : 96 : 2304
Block size, B              27, 54, 81          1                  27, 54, 81          64                 24 : 4 : 96
Nb. of blocks, NB          79–88               4.33               79–88               96                 76–88
Iterations, Nit            12                  64                 5                   10                 16
Γc (Mbps)                  262–401             1,024              541–1,618           640                177–999
Area, Kgates (mm2)         100.7 (0.207)       1750 (52.5)        99.9 (1.85)         220 (14.3)         489.9 (2.964)
RAM bits                   56,376              —                  55,344              51,680             NA
Power consumption (W)      0.162               0.69               0.238               0.787              NA
ηA (cycles/bit/iter)       1.103–1.306         0.231              1.361–1.521         0.417              1.01–1.31
ηE (pJ/bit/iter)           33.7–51.5           10.5               —                   123                —

9. Conclusions

An effective method to counteract the pipeline hazards typical of block-serial layered decoders of LDPC codes has been presented in this paper. This method is based on the rearrangement of the decoding elaborations in order to minimize the number of idle cycles inserted between updates and resulted in three different strategies named equal, reversed, and unconstrained output (EOP, ROP, and UOP) processing.

Then, different semi-parallel VLSI architectures of a layered decoder for architecture-aware LDPC codes supporting the methods above have been described and applied to the design of a decoder for the IEEE 802.11n LDPC codes.

The synthesis of the proposed decoder on a 65 nm low-power CMOS technology reached the clock frequency of 240 MHz, which corresponds to a net throughput ranging from 131 to 334 Mbps with UOP and 12 decoding iterations, outperforming similar designs.

This work has proved that the layered decoding algorithm can be extended, with no modifications or approximations, to every LDPC code, despite the interconnections of its parity-check matrix, provided that idle cycles are used to maintain the dependencies between the updates of the algorithm.

Also, the paradigm of code-decoder codesign has been reinforced in this work: not only have the described techniques shown to be very effective in counteracting pipeline hazards, but they also provide useful guidelines for the design of good, hazard-free LDPC codes. To this extent, the assumption that consecutive layers must not share soft-outputs, as in the WiMAX class 1/2 and 2/3B codes, is overcome, thus leaving more room for the optimization of the code performance at the level of the code design.

References

[1] “Satellite digital video broadcasting of second generation (DVB-S2),” ETSI Standard EN302307, February 2005.

[2] IEEE Computer Society, “Air Interface for Fixed and Mobile Broadband Wireless Access Systems,” IEEE Std 802.16eTM-2005, February 2006.

[3] “IEEE P802.11nTM/D1.06,” Draft amendment to Standard for high throughput, 802.11 Working Group, November 2006.

[4] R. Gallager, Low-density parity-check codes, Ph.D. dissertation, Massachusetts Institute of Technology, 1960.

[5] D. MacKay and R. Neal, “Good codes based on very sparse matrices,” in Proceedings of the 5th IMA Conference on Cryptography and Coding, 1995.

[6] M. M. Mansour and N. R. Shanbhag, “High-throughput LDPC decoders,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 11, no. 6, pp. 976–996, 2003.

[7] A. Blanksby and C. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder,” IEEE Journal of Solid-State Circuits, vol. 37, no. 3, pp. 404–412, 2002.

[8] H. Zhong and T. Zhang, “Block-LDPC: a practical LDPC coding system design approach,” IEEE Transactions on Circuits and Systems I, vol. 52, no. 4, pp. 766–775, 2005.

[9] D. E. Hocevar, “A reduced complexity decoder architecture via layered decoding of LDPC codes,” in Proceedings of the IEEE Workshop on Signal Processing Systems (SiPS ’04), pp. 107–112, 2004.

[10] K. Gunnam, G. Choi, W. Wang, and M. Yeary, “Multi-rate layered decoder architecture for block LDPC codes of the IEEE 802.11n wireless standard,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS ’07), pp. 1645–1648, May 2007.

[11] T. Bhatt, V. Sundaramurthy, V. Stolpman, and D. McCain, “Pipelined block-serial decoder architecture for structured LDPC codes,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’06), vol. 4, pp. 225–228, April 2006.

[12] C. P. Fewer, M. F. Flanagan, and A. D. Fagan, “A versatile variable rate LDPC codec architecture,” IEEE Transactions on Circuits and Systems I, vol. 54, no. 10, pp. 2240–2251, 2007.

[13] E. Boutillon, J. Tousch, and F. Guilloud, “LDPC decoder, corresponding method, system and computer program,” US patent no. 7,174,495 B2, February 2007.

[14] M. Rovini, F. Rossi, P. Ciao, N. L’Insalata, and L. Fanucci, “Layered decoding of non-layered LDPC codes,” in Proceedings of the 9th Euromicro Conference on Digital System Design (DSD ’06), August-September 2006.

[15] R. Tanner, “A recursive approach to low complexity codes,” IEEE Transactions on Information Theory, vol. 27, no. 5, pp. 533–547, 1981.

[16] H. Zhang, J. Zhu, H. Shi, and D. Wang, “Layered approx-regular LDPC: code construction and encoder/decoder design,” IEEE Transactions on Circuits and Systems I, vol. 55, no. 2, pp. 572–585, 2008.

[17] R. Echard and S.-C. Chang, “The π-rotation low-density parity check codes,” in Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM ’01), pp. 980–984, November 2001.

[18] F. Guilloud, E. Boutillon, J. Tousch, and J.-L. Danger, “Generic description and synthesis of LDPC decoders,” IEEE Transactions on Communications, vol. 55, no. 11, pp. 2084–2091, 2006.

[19] H. Xiao and A. H. Banihashemi, “Graph-based message-passing schedules for decoding LDPC codes,” IEEE Transactions on Communications, vol. 52, no. 12, pp. 2098–2105, 2004.

[20] E. Sharon, S. Litsyn, and J. Goldberger, “Efficient serial message-passing schedules for LDPC decoding,” IEEE Transactions on Information Theory, vol. 53, no. 11, pp. 4076–4091, 2007.

[21] F. Zarkeshvari and A. Banihashemi, “On implementation of min-sum algorithm for decoding low-density parity-check (LDPC) codes,” in Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM ’02), vol. 2, pp. 1349–1353, November 2002.

[22] C. Jones, E. Valles, M. Smith, and J. Villasenor, “Approximate-MIN constraint node updating for LDPC code decoding,” in Proceedings of the IEEE Military Communications Conference (MILCOM ’03), vol. 1, pp. 157–162, October 2003.

[23] M. Rovini, F. Rossi, N. L’Insalata, and L. Fanucci, “High-precision LDPC codes decoding at the lowest complexity,” in Proceedings of the 14th European Signal Processing Conference (EUSIPCO ’06), September 2006.

[24] M. Rovini, G. Gentile, and L. Fanucci, “Multi-size circular shifting networks for decoders of structured LDPC codes,” Electronics Letters, vol. 43, no. 17, pp. 938–940, 2007.

[25] M. M. Mansour and N. R. Shanbhag, “A 640-Mb/s 2048-bit programmable LDPC decoder chip,” IEEE Journal of Solid-State Circuits, vol. 41, no. 3, pp. 684–698, 2006.

[26] T. Brack, M. Alles, F. Kienle, and N. Wehn, “A synthesizable IP core for WiMax 802.16E LDPC code decoding,” in Proceedings of the 17th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC ’06), pp. 1–5, September 2006.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 704174, 3 pages
doi:10.1155/2009/704174

Letter to the Editor

Comments on “Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes”

Kiran K. Gunnam,1, 2 Gwan S. Choi,2 and Mark B. Yeary3

1 Channel Architecture, Storage Peripherals Group, LSI Corporation, Milpitas, CA 95035, USA
2 Department of ECE, Texas A&M University, College Station, TX 77843, USA
3 Department of ECE, University of Oklahoma, Norman, OK 73019, USA

Correspondence should be addressed to Kiran K. Gunnam, [email protected]

Received 7 December 2009; Accepted 7 December 2009

This is a comment article on the publication “Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes”, Rovini et al. (2009). We mention that there has been similar work reported in the literature before, and the previous work has not been cited correctly, for example Gunnam et al. (2006, 2007). This brief note serves to clarify these issues.

Copyright © 2009 Kiran K. Gunnam et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The recent work by Rovini and others in [1] states that “Gunnam et al. describe in [10] a pipelined semi-parallel decoder for WLAN LDPC codes, but the authors do not mention the issue of the pipeline hazards; only, the need of properly scrambling the sequence of data in order to clear some memory conflicts is described.” On the contrary, we gave a detailed explanation of our decoder architecture and of the concepts of out-of-order processing in [2–6]. The approach of Unconstrained Output Processing (UoP) proposed in [1] is similar to our approach outlined in [2–6]. So we would like to clarify this matter further.

We describe in [2–6] a pipelined semi-parallel decoder for WLAN LDPC codes that used scheduling of layered processing and out-of-order block processing to minimize the pipeline hazards and memory stall cycles. The following paragraph from [4, Page 1, Column 2] correctly describes our work: “This paper introduces the following concepts to LDPC decoder implementation: Block serial scheduling [5], value-reuse, scheduling of layered processing, out-of-order block processing, master-slave router, dynamic state. All these concepts are termed as on-the-fly computation as the core of these concepts is based on minimizing memory and re-computations by employing just-in-time scheduling.” More detailed explanations and illustrations can be found in the presentations [5, 6], which were available online from October 2006 and May 2007, respectively.

Also, [1] did not cite our work on layer reordering for optimizing the pipeline and memory accesses. In [3, Page 4, Column 2], last paragraph, we clearly mention that “It is possible to do the decoding using a different sequence of layers instead of processing the layers from 1 to j, which is typically used to increase the parallelism such that it is possible to process two block rows simultaneously [4]. In this work, we use the concept of reordering of layers for increased parallelism as well as for low complexity memory implementation and also for inserting additional pipeline stages without incurring overhead.”

Our proposal of out-of-order processing (OoP) for the layered decoding [2–6] is to process the circulants in a layer out of order (not necessarily sequentially) to remove the pipeline and memory conflicts. This includes performing the partial state processing and the other related steps (Rold message generation, Q message generation, and CNU partial state processing, that is, the processing step of finding Min1, Min2, and the Min1 ID) in out-of-order fashion, as well as processing the Rnew messages out of order. For instance, while processing layer 2, the blocks/circulants which depend on layer 1 will be processed last to allow for the pipeline latency. Also, the Rnew selection is out-of-order (these messages will come from the most recently updated connected block), so that it can feed the data required for the PS processing of the second layer. A dependent circulant, or connected circulant, is the non-zero circulant that supplies the last updated information of the P message to the specified non-zero circulant. The dependent layer is the layer which contains the dependent circulant. So circulants in the second layer will get the latest P update based on the Rnew messages from different connected circulants in different connected layers. Thus OoP for PS processing is across one layer (i.e., at any time the CNU partial state processing is concerned with starting and completing one layer; however, the circulants in the layer are processed out of order to satisfy the pipeline and memory constraints), while OoP for Rnew message generation is across several layers. Also, the P update (Q + Rnew) in [2, equation (9)] is computed on-the-fly along with the reading of the Q message of the last updated circulant in the same block column from the Q memory and the Rnew message generation, that is, at the precise moment when it is needed; this avoids the use of a P memory and needs a single-port read and single-port write Q memory whose storage capacity is equal to the code length multiplied by the word length of the Q message. The bandwidth of this memory, measured in terms of number of Q messages, is equal to the decoder parallelization [2–6]. Other decoder hardware architectures and implementations use both P memory and Q memory, use mirror memories, or use more complicated multiported memory. Illustrations of out-of-order processing were given in [5, 6].
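As an illustration only, the on-the-fly P update described above can be sketched in a few lines of Python. This is a didactic scalar model with hypothetical names, using a simplified min-sum check-node update; it is not the hardware of [2–6]:

```python
# Didactic sketch of one layer update with the on-the-fly P update:
# Q = P - Rold is read, the CNU produces Rnew, and P = Q + Rnew is formed
# at the precise moment it is needed, so no P memory is required.

def min_sum_cnu(q_msgs):
    """Simplified check-node update: sign product and two minima (Min1/Min2)."""
    total_sign = 1.0
    for q in q_msgs.values():
        total_sign *= 1.0 if q >= 0 else -1.0
    mags = sorted(abs(q) for q in q_msgs.values())
    min1, min2 = mags[0], mags[1]
    return {c: total_sign * (1.0 if q >= 0 else -1.0) *
               (min2 if abs(q) == min1 else min1)
            for c, q in q_msgs.items()}

def process_layer(p, r_mem, layer, order):
    """'order' lists the layer's circulants, possibly out of order so that
    circulants depending on the previous layer are processed last."""
    q_msgs = {c: p[c] - r_mem[layer][c] for c in order}  # Q message generation
    r_new = min_sum_cnu(q_msgs)                          # CNU processing
    for c in order:
        p[c] = q_msgs[c] + r_new[c]   # P = Q + Rnew, computed on the fly
        r_mem[layer][c] = r_new[c]
```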

We gave more explanation in [5]: “The decoder hardware architecture is proposed to support out-of-order processing to remove pipeline and memory accesses or to satisfy any other performance or hardware constraint. Remaining hardware architectures will not support out-of-order processing without further involving more logic and memory. For the above hardware decoder architecture, the optimization of decoder schedule belongs to the class of NP-complete problems. So there are several classic optimization algorithms such as dynamic programming that can be applied. We apply the following classic approach of optimal substructure.”

Step 1. “We will try different layer schedules (j!, i.e., j factorial, if there are j layers). For simplicity, we will try only a subset of possible sequences so as to have more spread between the original layers.”

Step 2. “Given a layer schedule or a re-ordered H matrix, we will optimize the processing schedule of each layer. For this, we use the classic approach of optimal substructure, that is, the solution to a given optimization problem can be obtained by the combination of optimal solutions to its subproblems. So first we optimize the processing order to minimize the pipeline conflicts. Then we optimize the resulting processing order to minimize the memory conflicts. So for each layer schedule, we are measuring the number of stall cycles (our cost function).”

Step 3. “We choose a layer schedule which minimizes the cost function, that is, meets the requirements with fewer stall cycles due to pipeline conflicts and memory conflicts, and also minimizes the memory accesses (such as FS memory accesses, to minimize the number of ports needed, to save the access power, and to minimize the muxing requirement and any interface memory access requirements).”
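To make the three steps concrete, the following Python sketch searches a subset of the layer permutations and scores each candidate by its stall cycles. The cost model, names, and pipeline-depth parameter are hypothetical; this is not the actual optimizer described in [5]:

```python
from itertools import permutations

def stall_cycles(layer_order, deps, pipeline_depth=5):
    """Count stall cycles for one schedule: in this simplified model, a
    circulant stalls when the circulant it depends on was updated fewer
    than 'pipeline_depth' cycles earlier."""
    stalls, history = 0, []
    for layer in layer_order:
        # process independent circulants first, dependent ones last
        for c in sorted(layer, key=lambda c: c in deps):
            if c in deps and deps[c] in history[-pipeline_depth:]:
                age = len(history) - 1 - history.index(deps[c])
                stalls += max(0, pipeline_depth - 1 - age)
            history.append(c)
    return stalls

def best_schedule(layers, deps, max_tries=24):
    """Step 1: try a subset of the j! layer orders; Steps 2-3: keep the
    order whose per-layer circulant schedule gives the fewest stalls."""
    candidates = list(permutations(layers))[:max_tries]
    return min(candidates, key=lambda order: stall_cycles(order, deps))
```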

Also we would like to mention how we calculate the architecture efficiencies. We mention in [2], “Here, all calculations for the decoded throughput are based on an average of 5 decoding iterations to achieve a frame error rate of 10^-4, while itmax is set to 15.” If we are considering the actual system throughput, then we should consider how many maximum iterations the system can run and what the additional overhead is from LLR/Q memory and hard-decision memory statistical buffering and loading and unloading times. We have close to 1.5 iterations of overhead due to statistical buffering. So mixing the average number of iterations with the actual system throughput to calculate the decoder core architecture efficiency is not a fair metric. In our works [2–6], for the decoders based on one-circulant processing, the number of clock cycles for decoding each block/circulant is 1. Note that the [2] and [3] designs are similar and have a pipeline depth of 5. The Clock Cycles per Iteration (CCI) for most of the IEEE 802.11n and IEEE 802.16e H matrices, after reordering of layers, out-of-order processing, data forwarding, and speculative computations, is the number of blocks in the H matrix. The only exceptions are the rate-5/6 matrices of IEEE 802.16e and IEEE 802.11n. For the 802.16e 5/6 matrix, the CCI is 87 clock cycles to process 80 blocks. For the 802.11n 5/6 matrix, the CCI is 85 clock cycles to process 79 blocks. We gave the worst case for CCI as the total number of blocks in the H matrix plus 2 cycles of overhead per layer in [3, Page 6, Column 1, Lines 23–28]. So if we were to report the Architecture Efficiency = CCI/Ideal CCI, the architecture efficiency for our decoders would be 1 for all the codes except 2 cases. For the 802.16e 5/6 matrix, this number would be 1.0875. For the 802.11n 5/6 matrix, the number would be 1.0759. Even if we used the above worst-case number reported in [3] for all the codes, even though it is not necessary, the architecture efficiency number would vary from 1.0759 to 1.29 for 802.11n codes and from 1.0875 to 1.3158 for 802.16e codes.
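The two quoted efficiency figures follow directly from the CCI values; a quick arithmetic check (plain Python, not taken from [2–6]):

```python
# Architecture efficiency = CCI / ideal CCI, the ideal CCI being the
# number of non-zero blocks in the H matrix (1 clock cycle per block).
cases = {
    "IEEE 802.16e, rate 5/6": (87, 80),  # (CCI, blocks in H)
    "IEEE 802.11n, rate 5/6": (85, 79),
}
for code, (cci, blocks) in cases.items():
    print(f"{code}: {cci / blocks:.4f}")  # -> 1.0875 and 1.0759
```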

Also, our work covers more aspects. We can apply OoP for PS processing across multiple layers. While waiting for the data from the currently processed layer 1, we can start processing the independent circulants in the next layer 2 that do not depend on the current layer 1, and also the circulants in layer 3 that do not depend on layer 1 and layer 2. In [5]: “also we will sequence the operations in layer such that we process the block first that has dependent data available for the longest time. This naturally leads us to true out-of-order processing across several layers. In practice we won’t do out-of-order partial state processing involving more than 2 layers.”

References

[1] M. Rovini, G. Gentile, F. Rossi, and L. Fanucci, “Techniques and architectures for hazard-free semi-parallel decoding of LDPC codes,” EURASIP Journal on Embedded Systems, vol. 2009, Article ID 723465, 15 pages, 2009.

[2] K. Gunnam, G. Choi, W. Wang, and M. Yeary, “Multi-rate layered decoder architecture for block LDPC codes of the IEEE 802.11n wireless standard,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS ’07), pp. 1645–1648, New Orleans, La, USA, May 2007.

[3] K. K. Gunnam, G. S. Choi, M. B. Yeary, and M. Atiquzzaman, “VLSI architectures for layered decoding for irregular LDPC codes of WiMax,” in Proceedings of the IEEE International Conference on Communications (ICC ’07), pp. 4542–4547, Glasgow, UK, June 2007.

[4] K. K. Gunnam, G. S. Choi, W. Wang, E. Kim, and M. B. Yeary, “Decoding of quasi-cyclic LDPC codes using an on-the-fly computation,” in Proceedings of the 40th Asilomar Conference on Signals, Systems and Computers, pp. 1192–1199, October-November 2006.

[5] K. Gunnam, Area and energy efficient VLSI architectures for low density parity-check decoders using an on-the-fly computation, Ph.D. presentation, Texas A&M University, College Station, Tex, USA, October 2006, http://dropzone.tamu.edu/~kirang/10112006.pdf.

[6] K. Gunnam, G. Choi, W. Wang, and M. Yeary, “Multi-rate layered decoder architecture for block LDPC codes of the IEEE 802.11n wireless standard,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS ’07), pp. 1645–1648, New Orleans, La, USA, May 2007.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 635895, 2 pages
doi:10.1155/2009/635895

Letter to the Editor

Reply to “Comments on Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes”

Massimo Rovini, Giuseppe Gentile, Francesco Rossi, and Luca Fanucci

Department of Information Engineering, University of Pisa, Via G. Caruso, 56122 Pisa, Italy

Correspondence should be addressed to Massimo Rovini, [email protected]

Received 7 December 2009; Accepted 7 December 2009

This is a reply to the comments by Gunnam et al., “Comments on ‘Techniques and architectures for hazard-free semi-parallel decoding of LDPC codes’”, EURASIP Journal on Embedded Systems, vol. 2009, Article ID 704174, on our recent work “Techniques and architectures for hazard-free semi-parallel decoding of LDPC codes”, EURASIP Journal on Embedded Systems, vol. 2009, Article ID 723465.

Copyright © 2009 Massimo Rovini et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

After a careful reading of the comments by Gunnam et al. [1], we identified two main points, which are discussed hereafter.

1.1. Point 1: Cited Papers. Gunnam et al. claim that we did not cite their work [2] correctly, and they refer to four other publications of their own to provide further explanation. Actually, the introductory section of our work [3] aims at providing an overview of the state-of-the-art architectures on the subject. The five works by Gunnam et al. basically propose the same LDPC architecture, whose features are described across the five publications. As a matter of fact, to be fair and balanced with the other state-of-the-art architectures, we decided to cite only one of their works, namely the one providing the most details regarding the architecture and the implementation results [2]. Finally, the selected paper was correctly cited in our work [3], with no misleading information or wrong assertion regarding the architecture described by Gunnam et al.

1.2. Point 2: Architectural Efficiency. In our paper [3], we defined a metric to compare the efficiency of different LDPC architectures in terms of the (average) number of clock cycles per block and per iteration, with the term “block” referring to a circulant of the parity check matrix. We applied this metric to our design as well as to other available implementations, including [2]; in this process, we used the throughput figures reported in each referenced paper.

Gunnam et al. claim that this is not a fair metric because it involves the average number of iterations. Actually, we hardly understand the point raised. On one hand, it is common practice to refer to the average number of iterations to express the system throughput. On the other hand, Gunnam et al. themselves use the average number of iterations in [2] to evaluate their throughput figures. Moreover, Gunnam et al. state that the overhead of the statistical buffering has not been taken into account. Although there is no mention of statistical buffering within the cited paper [2], this does not affect the system throughput but rather the decoding latency. Summarizing, we are quite confident regarding the fairness of the considered Architectural Efficiency metric and of the data provided in our paper.

2. Conclusion

In this brief reply we have provided a detailed explanation regarding the points raised in [1]. The comments by Gunnam et al. are indeed very useful to better understand their decoder architecture. So, in the future, we will cite [1] as the most effective description of their decoder.


References

[1] K. K. Gunnam, G. S. Choi, and M. B. Yeary, “Comments on ‘Techniques and architectures for hazard-free semi-parallel decoding of LDPC codes’,” EURASIP Journal on Embedded Systems, vol. 2009, Article ID 704174, 3 pages, 2009.

[2] K. Gunnam, G. Choi, W. Wang, and M. Yeary, “Multi-rate layered decoder architecture for block LDPC codes of the IEEE 802.11n wireless standard,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS ’07), pp. 1645–1648, New Orleans, La, USA, May 2007.

[3] M. Rovini, G. Gentile, F. Rossi, and L. Fanucci, “Techniques and architectures for hazard-free semi-parallel decoding of LDPC codes,” EURASIP Journal on Embedded Systems, vol. 2009, Article ID 723465, 15 pages, 2009.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 574716, 11 pages
doi:10.1155/2009/574716

Research Article

OLLAF: A Fine Grained Dynamically Reconfigurable Architecture for OS Support

Samuel Garcia and Bertrand Granado

ETIS Laboratory, CNRS UMR 8051, University of Cergy-Pontoise, ENSEA, 6 Avenue du Ponceau, F-95000 Cergy-Pontoise, France

Correspondence should be addressed to Samuel Garcia, [email protected]

Received 15 March 2009; Revised 24 June 2009; Accepted 22 September 2009

Recommended by Markus Rupp

Fine Grained Dynamically Reconfigurable Architectures (FGDRAs) offer flexibility for embedded systems together with great processing efficiency, by exploiting optimization opportunities at the architectural level thanks to their fine configuration granularity. But this increases design complexity, which should be abstracted by tools and an operating system. In order to have a usable solution, a good inter-overlapping between tools, OS, and platform must exist. In this paper we present OLLAF, an FGDRA specially designed to efficiently support an OS. The studies presented here show the gain that can be obtained, by using OLLAF instead of a classical FPGA, in terms of hardware context management and preemption overhead.

Copyright © 2009 S. Garcia and B. Granado. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Many modern applications, for example robot navigation, have a dynamic behavior, but today's hardware targets are still static, and this dynamic behavior is managed in software. This management lowers the computation performance in terms of time and expressivity. To obtain the best performance we need a dynamical computing paradigm. This paradigm exists as DRA (Dynamically Reconfigurable Architecture), and some DRA components are already functional. A DRA component contains several types of resources: logic cells, dedicated routing logic, and input/output resources. The logic cells implement functions that may be described by the designer. The routing logic connects the logic cells to each other and is also configured by the designer. The I/O resources allow communication outside the reconfigurable area.

Several types of configurable components exist. For example, fine grain architectures such as FPGAs (Field Programmable Gate Arrays) may adapt their functioning and routing at bit level. Other, coarse grain, architectures may be adapted by reconfiguring dedicated operators (e.g., multipliers, ALU units, etc.) at a coarser level (bit vectors). In a DRA, the functioning of the components may change online, during a run. An FGDRA (Fine Grained Dynamically Reconfigurable Architecture) can reach very high performance for a great number of algorithms because of its bit-level reconfiguration, but this level of reconfiguration induces a great complexity. This complexity makes it hard to use even for an expert, and it can be abstracted at some level in two ways: at design time by providing design tools, and at run time by providing an operating system. This operating system, in order to handle dynamic applications efficiently, has to be able to respond rapidly to events. This can be achieved by providing dedicated services like hardware preemption that lower configuration and context transfer times. In our previous work [1], we demonstrated that we need to adapt the operating system to an FGDRA, but also that we need to modify an FGDRA to have efficient operating system support.

In this paper we present OLLAF, an FGDRA specially designed to support dynamic applications and a specific FGDRA operating system.

This paper is organized as follows. First, an explanation of the problematics of this work is presented in Section 2. Section 3 presents the OLLAF FGDRA architecture and its particularities. In Section 4, an analysis of preemption costs in OLLAF, in comparison with other existing platforms, including commercial FPGAs using several preemption methods, is presented. Section 5 presents application scenarios and compares context management overhead using OLLAF with FPGAs, especially the Virtex family. Conclusions are then drawn in Section 6, as well as perspectives on this work.

2. Context and Problematics

Fine Grained Dynamically Reconfigurable Architectures (FGDRAs), such as FPGAs, allow, thanks to their fine reconfiguration grain, taking better advantage of optimization opportunities at the architectural level. This feature leads, in most applications, to a better performance/consumption factor compared with other classical architectures. Moreover, the ability to dynamically reconfigure itself at run time allows an FGDRA to reach a dynamicity very close to that encountered using microprocessors.

The model used in microprocessor development gains its efficiency from a great overlapping between platforms, tools, and OS. First between OS and tools, as most mainstream OSs offer specifically adapted tools to support their API. Also between tools and platform: as an example, RISC processors have an instruction set specifically adapted to the output of most compilers. Finally, between platform and OS, by integrating some OS related components into hardware; the MMU is an example of such an overlapping. As for microprocessors, the key point for FGDRAs to maximize the efficiency of a design model is the inter-overlapping between platforms, tools, and OS.

This article presents a study of our original FGDRA called OLLAF, specifically designed to enhance the efficiency of the OS services necessary to manage such an architecture. OLLAF features a great inter-overlapping between OS and platform. This particular study mainly focuses on the contribution of this architecture in terms of configuration management overhead compared to other existing FGDRA solutions.

2.1. Problematics. Several studies have been conducted on FGDRA management that demonstrated the interest of using an operating system to manage such a platform.

Few of them actually propose to bring modifications to the FGDRA itself in order to enhance the efficiency of particular services such as fast reconfiguration or task relocation. Most recent studies concentrate on implementing an OS to manage an already existing, commercially available FPGA, most often from the Virtex family. This FPGA family is actually the only recent industrial FPGA family to allow partial reconfiguration, thanks to an interface called ICAP.

In a previous study, we presented a method allowing to drastically decrease the preemption overhead of an FPGA based task, using a Virtex FPGA [1]. In that previous work, as in the one presented here, we made a difference between configuration, which relates to the configuration bitstream, and context. The context is the data that have to be saved by the operating system, prior to a preemption, in order to be able to resume the task later without any data loss. In that study, we thus proposed a method to manage the context, the configuration being managed in a traditional way. The conclusions of this study were encouraging but revealed that if we want to go further, we have to work at the architecture level. That is why we proposed an architecture called OLLAF [2], specially designed to answer the problematics related to FGDRA management by an operating system. Among those, we wanted to address problems such as context management and task configuration loading speed, these two features being of primary concern for an efficient preemptive management of the system.

2.2. Related Works. Several research efforts have been conducted in the field of OSs for FGDRAs [3–6]. All those studies present an OS more or less customized to enable specific FGDRA related services. Examples of such services are partial reconfiguration management, hardware task preemption, or hardware task migration. They are all designed on top of a commercial FPGA coupled with a microprocessor. This microprocessor may be a softcore processor, an embedded hardwired core, or even an external processor.

Some works have also been published about the design of specific architectures for dynamical reconfiguration. In [7] the authors discuss the first multicontext reconfigurable device. This concept has been implemented by NEC on the Dynamically Reconfigurable Logic Engine (DRLE) [8]. In the same period, the concept of Dynamically Programmable Gate Arrays (DPGA) was introduced; it was proposed in [9] to implement a DPGA in the same die as a classic microprocessor to form one of the first Systems on Chip (SoC) including dynamically reconfigurable logic. In 1995, Xilinx even filed a patent on a multicontext programmable device, proposed as an XC4000E FPGA with multiple configuration planes [10]. In [11], the authors study the use of a configuration cache; this feature is provided to lower costly external transfers. This paper shows the advantages of coupling configuration caches, partial reconfiguration, and multiple configuration planes.

More recently, in [12], the authors propose to add special material to an FGDRA to support OS services; they worked on top of a classic FPGA. The work presented in this paper tries to take advantage of those previous works, both about hardware reconfigurable platforms and about OSs for FGDRAs.

Our previous work on OSs for FGDRAs was related to the preemption of hardware tasks on FPGAs [1]. For that purpose we explored the use of a scanpath at task level. In order to accelerate the context transfer, we explored the possibility of using multiple parallel scanpaths. We also provided the Context Management Unit, or CMU, which is a small IP that manages the whole process of saving and restoring task contexts.

In that study, both the CMU and the scanpath were built to be implemented on top of any available FPGA. This approach showed a number of limitations that can be summarized in this way: implementing this kind of OS related material on top of an existing FPGA introduces unacceptable overhead on both the tasks and the OS services. In other words, most OS related material should, as much as possible, be hardwired inside the FGDRA.


3. OLLAF Architecture Overview

3.1. Specifications of an FGDRA with OS Support. We have designed an FGDRA with OS support following the specifications below.

It should first address the problem of the configuration speed of a task. This is one of the primary concerns, because if the system spends more time configuring itself than actually running tasks, its efficiency will be poor. The configuration speed will thus have a big impact on the scheduling strategy.

In order to enable more choices of scheduling schemes, and to match some real-time requirements, our FGDRA platform must also include preemption facilities. For the same reasons as for configuration, the speed of the context saving and restoring processes will be one of our primary concerns. On this particular point, the previous work discussed in Section 2 will be adapted and reused.

Scheduling on a classical microprocessor is just a matter of time: the problem is to distribute the computation time between different tasks. In the case of an FGDRA, the system must distribute both computation time and computation resources. Scheduling in such a system is then no longer a one-dimensional problem, but a three-dimensional one: one dimension is time, and the two others represent the surface of reconfigurable resources. Performing an efficient scheduling at run time that minimizes processing time is then a very hard problem that the FGDRA should help getting closer to solving. The primary concern on this subject is to ensure easy task relocation. For that, the reconfigurable logic core should be split into several equivalent blocks. This allows moving a task from one block to any other block, or from a group of blocks to another group of blocks of the same size and the same form factor, without any change to the configuration data. The size of those blocks is a tradeoff between flexibility and scheduling efficiency.

Another aspect of an operating system is to provide intertask communication services. In our case we distinguish two cases. The first is the case of a task running on top of our FGDRA and communicating with another task running on a different computing unit; this case will not be covered here, as this problem concerns the whole heterogeneous platform, not only the FGDRA computing units. The second case is when two, or more, tasks running on top of the same FGDRA communicate with each other. This communication channel should remain the same wherever the tasks are placed on the FGDRA reconfigurable core and whatever the state of those tasks is (running, pending, waiting, etc.). That means that the FGDRA platform must provide a rationalized communication medium, including exchange memories.

The same arguments can also be applied to inputs/outputs. Here again two cases exist: first, the case of I/O being a global resource of the whole platform; second, the case of special I/O directly bound to the FGDRA.

3.2. Proposed Solutions. Figure 1 shows a global view of OLLAF, our original FGDRA designed to efficiently support OS services such as preemption or configuration transfers.

Figure 1: Global view of OLLAF.

In the center stands the reconfigurable logic core of the FGDRA. This core is a dual plane, an active plane and a hidden one, organized in columns. Each column can be reconfigured separately and offers the same set of services. A task is mapped on an integer number of columns. This topology has been chosen for two reasons. First, using a partial reconfiguration by column transforms the scheduling problem into a two-dimensional problem (time + 1D surface), which is easier to handle for minimizing the processing time. Secondly, as every column is the same and offers the same set of services, tasks can be moved from one column to another without any change to the configuration data.

In the figure, at the bottom of each column, one can notice two hardware blocks called CMU and HCM. The CMU is an IP able to automatically manage the saving and restoring of task contexts. The HCM, standing for Hardware Configuration Manager, is pretty much the same but handles configuration data, also called the bitstream. More details about this controller can be found in [1]. On each column a local cache memory named LCM is added. This memory is a first level of cache memory storing contexts and configurations close to the column where they might most probably be required. The internal architecture of the core provides adequate material to work with the CMU and the HCM. More about this is discussed in the next section.

On the right of the figure stands a big block called "HW Sup + HW RTK + CCR". This block contains a hardware supervisor running a custom real-time kernel specially adapted to handle FGDRA related OS services and platform level communication services. In the first prototype presented here, this hardware supervisor is a classical 32-bit microprocessor. Along with this hardware supervisor, a central memory is provided for OS use only. Basically, this memory stores the configurations and contexts of every task that may run on the FGDRA. The supervisor communicates with all columns using a dedicated control bus. The hardware supervisor can initiate context transfers, from and to the hidden plane, by writing into the CMU's and HCM's registers through this control bus.

Finally, at the top of Figure 1, one can see the application communication medium. This communication medium provides a communication port to each column. Those communication ports are directly bound to the reconfigurable interconnection matrix of the core. If I/O had to be bound to the FGDRA, they would be connected to this communication medium in the same way the reconfigurable columns are.

Figure 2: Functional view of an LE, from the task designer's point of view.

This architecture has been developed as a VHDL model in which the size and number of columns are generic parameters.

3.3. Logic Core Overview. OLLAF's logic core is functionally the same as the logic fabric found in any common FPGA. Each column is an array of Logic Elements (LEs) surrounded by a programmable interconnect network. The basic functional architecture of an LE can be seen in Figure 2: it is composed of a LUT and a D flipflop; several multiplexers and/or programmable inverters can also be used.

All the material added in the reconfigurable logic core to support the OS concerns the configuration memories. That means that, from a user's point of view, designing for OLLAF is similar to designing for any common FPGA. It also means that if we improve the functionality of those LEs, the results presented here will not change.

Configuration data and context data (flipflop contents) constitute two separate paths. A context swap can be performed without any change in configuration. This can be interesting for checkpointing or when running more than one instance of the same task.

3.4. Configuration, Preemption, and OS Interaction. In the previous sections an architectural view of our FGDRA has been exposed. In this section, we discuss the impact of this architecture on OS services. We consider here the three services most specifically related to the FGDRA:

(i) First, the configuration management service: on the hardware side, each column provides an HCM and an LCM. That means that configurations have to be prefetched into the LCM. The associated service running on the hardware supervisor will thus need to take that into account. This service must manage an intelligent cache to prefetch a task's configuration on the columns where it might most probably be mapped.

(ii) Second, the preemption service: the same principles as those applied to configuration management hold here, except that contexts also have to be saved. The context management service must ensure that there never exists more than one valid context for each task in the entire FGDRA. Contexts must thus be transferred as soon as possible from the LCM to the centralized global memory of the hardware supervisor. This service will also have a big impact on the scheduling service, as the ability to perform preemption with a very low overhead allows the use of more flexible scheduling algorithms.

Figure 3: Dual plane configuration memory.

(iii) Finally, the scheduling service, and in particular the space management part of the scheduling: it takes advantage of the column topology and the centralized communication scheme. The reconfigurable resource can then be managed as a virtually infinite space containing an undetermined number of columns. The job is to dynamically map the virtual space onto the real space (the actual reconfigurable logic core of the FGDRA), as sketched below.
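As a rough illustration of this space management job, the following Python sketch (hypothetical API; contiguous column groups are assumed for simplicity) maps a task of a given width onto the first fitting group of free columns:

```python
def place_task(free_columns, task_width):
    """First-fit search for 'task_width' contiguous free columns;
    returns the starting column index, or None if the task must wait.
    Since all columns are equivalent, any fitting group is a valid
    placement and requires no change to the configuration data."""
    run_start, run_len = None, 0
    for idx, free in enumerate(free_columns):
        if free:
            run_start = idx if run_len == 0 else run_start
            run_len += 1
            if run_len == task_width:
                return run_start
        else:
            run_len = 0
    return None

# Example: a 4-column core with column 1 busy; a 2-column task maps to 2..3.
print(place_task([True, False, True, True], 2))  # -> 2
```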

3.5. Context Management Scheme. In [1], we proposed a context management scheme based on a scanpath, a local context memory, and the CMU. The context management scheme in OLLAF differs in two ways. First, all context management related material is hardwired. Second, we added two more stages in order to further lower the preemption overhead and to ensure the consistency of the system.

As the context management material is added at the hardware level and no longer at the task level, it needed to be split differently. As the programmable logic core is column based, it was natural to implement context management at column level. A CMU and an LCM have thus been added to each column, and one scanpath is provided for each column's set of flipflops.

In order to lower the preemption overhead, our reconfigurable logic core uses a dual plane, an active plane and a hidden plane. Flipflops used in logic elements are thus replaced with two flipflops plus switching material. The architecture of this dual-plane flipflop can be seen in Figure 3. Run and scan are then no longer two working modes but two parallel planes which can be swapped. With this topology, the context of a task can be shifted in while the previous task is still running, and shifted out while the next one is already running. The effective task switching overhead is then taken down to one clock cycle, as illustrated in Figure 5.
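A behavioral Python model of this dual-plane scheme may help fix the idea (a sketch with hypothetical names, not the VHDL model of OLLAF): shifting happens on the hidden plane while the active plane runs, and the swap itself is the only visible overhead.

```python
class DualPlaneColumn:
    """Behavioral model of one column's dual-plane flipflops (sketch)."""

    def __init__(self, n_ff):
        self.planes = [[0] * n_ff, [0] * n_ff]  # two parallel register planes
        self.active = 0                          # index of the active plane

    def shift_hidden(self, context):
        """Scan a saved context into the hidden plane while the active
        plane keeps running (no cost for the executing task)."""
        self.planes[1 - self.active] = list(context)

    def swap(self):
        """Plane switch: the prepared context becomes active in one clock."""
        self.active = 1 - self.active

    def read_hidden(self):
        """Scan the preempted task's context out of the now-hidden plane."""
        return list(self.planes[1 - self.active])
```

In this model, the scenario of Figure 5 is shift_hidden(T2's context), swap(), then read_hidden() to save T1, matching the single visible cycle of overhead.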

Contexts are transferred by the CMU into the LCM in the hidden plane through a scanpath. Because the contexts of all columns can be transferred in parallel, the LCM is placed at column level. This is particularly useful when a task uses more than one column. In the first prototype, those memories can store 3 configurations and 3 contexts. The LCM optimizes access to a bigger memory called the Central Context Repository (CCR).

Page 71: 541420

EURASIP Journal on Embedded Systems 5

Figure 4: Context memories hierarchy (dual plane: 1 context plus the active one, fixed 1-clock access; LCM: ~10 contexts, fixed access time depending on column size, 1 clock per logic element; CCR: >100 contexts, random access at bus speed).

Figure 5: Typical preemption scenario. T2's configuration and context are scanned into the hidden plane while T1 executes; the two planes are then swapped with a single clock cycle (1 Tclk) of overhead, and T1's context is scanned out and saved while T2 runs.

The CCR is a large memory space storing the context of each task instance run by the system. The LCM should then store the contexts of the tasks that are most likely to be the next to run on the corresponding column.

After a preemption of the corresponding task, a context can be stored in more than one LCM in addition to the copy stored in the CCR. In such a situation, care must be taken to ensure the consistency of the task execution. For that purpose, each time a context saving is performed, the context is tagged by the CMU with a version number. The operating system keeps track of this version number and also increments it each time a context saving is performed. In this way the system can check the validity of a context before a context restoration. The system must also update the context copy in the CCR as soon as possible after a context saving is performed, following a write-through policy.
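The versioning policy can be summarized by the following sketch (hypothetical data structures; in OLLAF the CMU side is hardwired as described above):

```python
class ContextManager:
    """Sketch of version-tagged context consistency with write-through."""

    def __init__(self):
        self.ccr = {}     # task id -> (version, context), master copy
        self.latest = {}  # task id -> latest version number known to the OS

    def save(self, task, context, lcm):
        version = self.latest.get(task, 0) + 1   # tag incremented on save
        self.latest[task] = version
        lcm[task] = (version, context)           # copy left in a column's LCM
        self.ccr[task] = (version, context)      # write-through to the CCR

    def restore(self, task, lcm):
        # assumes the task was saved at least once before being restored
        if task in lcm and lcm[task][0] == self.latest[task]:
            return lcm[task][1]                  # LCM copy is up to date
        return self.ccr[task][1]                 # stale copy: use the CCR
```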

The dual plane, the LCM, and the CCR form a memory hierarchy specially designed to minimize preemption overhead, as seen in Figure 4. The same memory scheme is also used for configuration management, except that a configuration does not change during execution, so it does not need to be saved and no version control is required. The programmable logic core uses a dual configuration plane equivalent to the dual plane used for contexts. Each column has an HCM, which is a simplified version of the CMU (without the saving mechanism). The LCM is designed to be able to store an integer number of both contexts and configurations.

In the best case, the preemption overhead can then be reduced to one clock cycle.

A typical preemption scenario is presented in Figure 5. In this scenario we consider the case where the contexts and configurations of both tasks are already stored in the LCM. Let us consider that a task T1 is preempted to run another task T2; the task preemption scenario is then as follows:

(i) T1 is running and the scheduler decides to preempt it to run T2 instead,

(ii) T2's configuration and, if any, its context are shifted onto the hidden plane,

(iii) once the transfer is completed, the two planes are switched,

(iv) now T2 is running and T1's context can be shifted out to be saved,

(v) T1's context is updated as soon as possible in the CCR.

4. Preemption Cost Analysis

4.1. OLLAF versus Other FPGA Based Works. This section presents an analytic comparison of preemption management efficiency between different solutions using commercial FPGA platforms and our FGDRA OLLAF. The comparison was made over six different management methods for transferring the context and the configuration upon preemption, including the methods in use in OLLAF.

The six considered methods are

XIL: a solution based on the Xilinx XAPP290 [13], using the ICAP interface to transfer both context and configuration and using the readback bitstream for context extraction,

Scan: a solution using a simple scanpath for context transfer, as described in both [1, 14], and using the ICAP interface for configuration transfer,

PCS8: a solution similar to the Scan solution but using 8 parallel scanpaths, as described in [1], to transfer the context; the ICAP interface is still used for configuration transfer,

DPScan: a solution that uses a dual-plane scanpath, similar to the one used in OLLAF, for context transfer, and ICAP for configuration transfer. This method is also studied in [14], where it is referred to as a shadow scan chain,

MM: a solution that uses ICAP for configuration transfer and the memory-mapped solution proposed in [14] for context transfer,

OLLAF: a solution that uses separate dual-plane scanpaths for configuration transfer and context transfer, as used in the FGDRA architecture proposed in this article.

We define the preemption overhead H as the cost of a preemption for the system in terms of time, expressed as a number of clock cycles, or "tclk". In the same way, all transfer times are expressed and estimated in numbers of clock cycles, as we want to focus on the architectural view only. Task sizes are parameterized by n, the number of flipflops used.

The preemption overhead can be due to context transfers (two transfers: one from the previously running task to save its context and one to the next task to restore its context), configuration transfers (to configure the next task), and eventually context data extraction (if the context data are spread among other data, as in the XIL solution).

The first five solutions use the ICAP interface as the configuration transfer method. Using this method, transfers are made as configuration bitstreams. A configuration bitstream contains both a configuration and a context. In the same way, for the XIL solution, which also uses the ICAP interface for context saving, the readback bitstream contains both a configuration and a context; in this case only the context is useful, but we need to transfer both configuration and context and then spend some extra time extracting the context.

According to [14], we can estimate that for an n-flipflop IP, and so an n-bit context, the configuration is 20n bits. That means a typical ICAP bitstream of 21n bits.

The analytic expressions of H for each case are estimated as follows.

XIL. Assuming that it uses a 32-bit-wide access bus, the ICAP interface can transfer 32 bits per clock cycle. A complete preemption process requires the transfer of two complete bitstreams at this rate. In [14], the authors estimate that it takes 20 clock cycles to extract each context bit from the readback bitstream. This time must then also be taken into account in the preemption overhead:

H = 21n/32 + 21n/32 + 20n ≈ 21.3n. (1)

Scan. Using a simple scanpath for context transfer requires 1 clock cycle per flipflop for each context transfer. As we use the ICAP interface for configuration transfer, this implies, as mentioned earlier, the effective transfer of a complete bitstream. That means that the context of the next task is transferred twice, even if only one of the two copies contains the real useful data:

H = 21n/32 + 2n ≈ 2.66n. (2)

PCS8. Using 8 parallel scanpaths requires 1 clock cycle for 8 flipflops. The configuration transfer remains the same as for the previous solution:

H = 21n/32 + 2n/8 ≈ 0.9n. (3)

DPScan. Using a dual-plane scanpath, the context transfers can be hidden, so their cost is always 1 clock cycle. The configuration transfer remains the same as for the previous solutions:

H = 21n/32 + 1 ≈ 0.66n + 1. (4)

MM. Using 32-bit memory accesses, this case is similar to PCS8 but with 32 parallel paths instead of 8. The configuration transfer remains the same as for the previous solutions:

H = 21n/32 + 2n/32 ≈ 0.69n. (5)

OLLAF. In OLLAF, both context and configuration transfers can be hidden, so the total cost of the preemption is always 1 clock cycle whatever the size of the task:

H = 1. (6)

As a point of comparison, considering a typical operating system clock tick of 10 ms and assuming a typical clock frequency of 100 MHz, the OS tick is 10^6 tclk.

To make our comparison, we consider two tasks T1 and T2. We consider a DES56 cryptographic IP that requires 862 flipflops and a 16-tap FIR filter that requires 563 flipflops. Both of those IPs can be found on www.opencores.org. To ease the computation, we consider two tasks using the average number of flipflops of the two considered IPs. So for T1 and T2, we get n = (862 + 563)/2 ≈ 713. Table 1 shows the overhead H for each presented method.
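The Table 1 figures can be recomputed directly from equations (1)-(6); the short script below uses the rounded coefficients given in the text, so results may differ from the table by a cycle or so:

```python
# Preemption overhead H (in clock cycles) for an n-flipflop task,
# using the rounded coefficients of equations (1)-(6).
n = 713  # average flipflop count of the DES56 and 16-tap FIR IPs

H = {
    "XIL":    21.3 * n,      # eq. (1): two bitstreams + bit extraction
    "Scan":   2.66 * n,      # eq. (2): bitstream + two serial scans
    "PCS8":   0.9  * n,      # eq. (3): bitstream + 8-wide scan
    "DPScan": 0.66 * n + 1,  # eq. (4): bitstream + hidden context scan
    "MM":     0.69 * n,      # eq. (5): bitstream + 32-wide memory access
    "OLLAF":  1,             # eq. (6): everything hidden, one plane swap
}
for method, cycles in H.items():
    print(f"{method:7s} {round(cycles):>6d} tclk")
```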

Page 73: 541420

EURASIP Journal on Embedded Systems 7

Table 1: Comparison of task preemption overhead for a 713-flipflop task.

|          | XIL   | Scan | PCS8 | DPScan | MM  | OLLAF |
| H (tclk) | 15188 | 1897 | 642  | 472    | 492 | 1     |

Table 2: Comparison of task preemption overhead for a whole 1M-flipflop FGDRA.

|          | XIL         | Scan        | PCS8       | DPScan     | MM         | OLLAF |
| H (tclk) | 21.3 × 10^6 | 2.66 × 10^6 | 900 × 10^3 | 660 × 10^3 | 690 × 10^3 | 1     |

Those results show that, in this case, using our method leads to a preemption overhead around 500 times smaller than with the best other method.

If we now consider that not only one task is preempted but the whole FGDRA surface, assuming a 1-million-LE logic core, the estimated overhead for each method is shown in Table 2. In the XIL case the preemption overhead is about 20 times more than the tick period, which is not acceptable. Those results clearly show the benefit of OLLAF over existing FPGAs concerning preemption. Using existing methods, the preemption overhead is linearly dependent on the size of the task. In OLLAF, this overhead does not depend on the size of the task and is always of only one clock cycle.

In OLLAF, both context and configuration transfers are hidden thanks to the use of the dual plane. The latency L between the moment a preemption is requested and the moment the new task effectively begins to run can also be studied. This latency only depends on the size of the columns. In the worst case this latency will be far shorter than the OS tick period. The OS tick period being in any case the shortest time in which the system must respond to an event, we can consider that this latency will not affect the system at all.

5. Dynamic Application Case Studies

In this section, we consider a few application cases to demonstrate the contribution of the OLLAF architecture, especially for the implementation of dynamical applications. Applications are presented here as a task dependency graph, each task being characterized by its execution time, its size as a number of occupied columns, and possibly its periodicity.

In this study, we consider an OLLAF prototype with four columns. The study consists of comparing the execution of a particular application case using three different context transfer methods. The first considered context transfer method is the use of an ICAP-like interface; this is the reference method, as it is the one considered in most of today's works on reconfigurable computing. The second considered method is the method used in the OLLAF architecture as presented earlier. We consider here an LCM with a size of 3 configurations and 3 contexts. Then, in order to study in more detail the respective contributions of the dual plane and of the LCM, we also consider a method consisting of an OLLAF-like architecture but using only one plane. As the use of a dual plane has a major impact on the reconfigurable logic core's performance, this last case is of primary concern to justify this cost.

Figure 6: Memory view of the considered implementation of OLLAF: the CCR, the per-column LCMs, and the dual (configuration/execution) planes, with the three transfer levels TM, TL1, and TL0.

Table 3: Transfer times and lengths in clock periods for each level.

|                    | TM       | TL1      | TL0   |
| Tr. length (#Tclk) | 53760    | 16384    | 1     |
| Transfer time      | 537.6 μs | 16.38 μs | 10 ns |

Figure 6 shows a memory hierarchy view of OLLAF. The CCR is the main memory, the LCMs constitute the local column caches, and the dual plane is the highest and fastest level. TL0, TL1 and TM represent the three transfer levels in the OLLAF architecture. The "ICAP-like" case implies only TM, the "OLLAF simple" one implies TM and TL1, and finally the OLLAF case involves the three transfer levels. Each transfer level is characterized by the time necessary to transfer the whole context of one column. In this study we chose to use a reconfigurable logic core composed of four columns of 16384 Logic Elements each. Using this layout, the context and configuration of a column amount to 1680 Kbits. Table 3 gives the transfer time for one column's context and configuration in clock periods, assuming a working frequency of 100 MHz. Those parameters will be useful as the study now consists of counting the number of transfers at each level for every application case and transfer method. We thus study the temporal cost of context transfers for a whole sequence of each application case. We have to distinguish two cases: the very first execution, where the caches are empty, and all later executions of the sequence, where the caches and planes already contain contexts and configurations.
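The TM figure of Table 3 follows from the stated layout; the sketch below checks it and records the per-level transfer times reused in the case studies (TL1 and TL0 are taken as stated in Table 3):

```python
# Checking the TM transfer length from the stated layout: 1680 Kbits of
# context + configuration per column, moved over a 32-bit bus at 100 MHz.
F_CLK = 100e6                          # Hz
COLUMN_BITS = 1680 * 1024              # bits per column (context + config)

tm_cycles = COLUMN_BITS // 32          # -> 53760 cycles
print(tm_cycles, tm_cycles / F_CLK)    # -> 53760 cycles, 537.6 us

# Per-level transfer times as given in Table 3 (used in Tables 4-6):
T_LEVEL = {"TM": 537.6e-6, "TL1": 16.38e-6, "TL0": 10e-9}  # seconds
```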

Page 74: 541420

8 EURASIP Journal on Embedded Systems

Figure 7: First case: a simple linear application (T1: T = 40 ms, S = 1, P = 40 ms; T2: T = 10 ms, S = 3; T3: T = 15 ms, S = 2).

Figure 8: Second case: two dynamically chosen tasks (T1: T = 40 ms, S = 1, P = 40 ms; T2: T = 10 ms, S = 3; then either T3: T = 15 ms, S = 2 or T4: T = 10 ms, S = 2).

Each application presented here involves a first task T1 with a periodicity of 40 ms; each time an execution of this task finishes, the remaining sequence begins (creation of task T2, etc.) and a new instance of T1 is run. This corresponds to a typical real-time imaging system: a task is in charge of capturing a picture; each time a picture has been fully captured, this picture is processed by a set of other tasks while the next picture is being captured.

5.1. Considered Cases. The first case, as seen in Figure 7, is an application composed of three linearly dependent tasks. It presents no particular dynamicity and thus serves as a reference case.

The second considered case, as seen in Figure 8, presents a dynamical branch. By that we mean that, depending on the result of task T2's processing, the system may run T3 or T4. As those two tasks present different characteristics, the overall behavior of the system differs depending on the input data. This is a typical example of a dynamic application; in such cases, system management must be performed online. In order to study such a dynamical case, we gave a probability to each possible branch. Here we consider that the probability of task T3 is 20% while the probability of T4 is 80%. Those probabilities are chosen arbitrarily in order to be able to perform a statistical study of this application. In a real case those probabilities may not be known in advance, as they depend on input data; we could then consider online profiling in order to improve the efficiency of the caching system, but this is beyond the scope of this article. One can note that the MPEG encoding algorithm is an example of an algorithm presenting this kind of dynamicity.

In the last considered case, shown in Figure 9, the dynamicity lies not in which task will be executed next but in how many instances of the same task will be executed. This can be seen as dynamic thread creation. This kind of case can be found in some computer vision algorithms where a first task detects objects and then a particular treatment is applied to each detected object. As we cannot know in advance how many objects will be detected, the treatments applied to each of those objects must be dynamically created. In this particular case, we consider that the system can handle from 0 up to 4 objects in each scene. That means that, depending on the input data, from 0 up to 4 instances of the task Tdyn can be created and executed. The probabilities for each possible number of detected objects are shown on the probability graph of Figure 9; we chose a Gaussian-like probability distribution, which is typically realistic.

This case is particularly interesting for many reasons.First the loading condition of the task T2 dynamically

depends on the previous iteration of the sequence. As anexample, if no object has been detected in the previous scene,then no Tdyn has been created and thus T2 is still fullyoperational into the active plane, it may only eventually haveto be reseted. If now 3 or more objects has been detected andthus all the three free columns has been used, then the fullcontext of T2 have to be loaded from the second plane or insome cases from the local caches.

Another interesting aspect occurs when 4 objects aredetected and so 4 Tdyn are created and must be executed.In that case if three first Tdyn are executed, one on eachfree column, and then the fourth is executed on one randomcolumn, then a new image will be arrived before processingof the current one is finished, in other terms, the deadlineis missed. However, by scheduling those four Tdyn instancesusing a simple round robin algorithm with a quantum timeof 5 ms, real time treatment can be achieved. It should benoticed that this scheduling is only possible if preemption isallowed by the platform.

5.2. Results. Tables 4, 5, and 6 show execution results foreach presented application case in terms of transfer cost.For each case, we show the number of transfers that occursper sequence iteration at each possible stage depending onthe considered architecture. We also give the Total timespent in transferring context. Those results do not take intoaccount transfers that are hidden due to a parallelization withthe execution of a task in the considered column, as thosetransfers do not present any temporal cost for the system.Concerning level TL1 and TL0, multiple transfers can occur

Page 75: 541420

EURASIP Journal on Embedded Systems 9

T1

T2

P = 40 ms

T = 40 msS = 1

T = 10 msS = 3

Create #?

Tdyn Tdyn Tdyn · · ·

T = 15 msS = 1

Pro

babi

lity

(%)

40

30

20

10

0

0 1 2 3 4

# dynamical task created

Figure 9: Third case: dynamical creation of multiple instances of a task.

Table 4: Results for case 1 execution.

First iteration Next iterations

#TM #TL1 #TL0 Total time #TM #TL1 #TL0 Total time

ICAP-like 3 — — 1.61 ms 4 — — 2.15 ms

OLLAF simple 1 2 — 570 μs 0 1 — 32.8 μs

OLLAF 1 1 3 554 μs 0 0 2 20 ns

Table 5: Results for case 2 execution.

First iteration Next iterations

#TM #TL1 #TL0 Total time #TM #TL1 #TL0 Total time

ICAP-like 3 — — 1.61 ms 4 — — 2.15 ms

OLLAF simple 3 2 — 1.65 ms 0 1 — 32.8 μs

OLLAF 1 2 3 570 μs 0 0.5 2 8.21 μs

Table 6: Results for last case execution.

First iteration Next iterations

#TM #TL1 #TL0 Total time #TM #TL1 #TL0 Total time

ICAP-like 6.6 — — 3.55 ms 5.5 — — 2.96 ms

OLLAF simple 1 3.2 — 590 μs 0 3.1 — 50.8 μs

OLLAF 1 1 3.2 554 μs 0 0 2.1 21 ns

in parallel (one on each column), in those cases only onetransfer is counted as the temporal cost is always of onetransfer at considered stage.

Considering the results using OLLAF, for the firstiteration of the sequence, give information about the con-tribution of the dual planes while the results for the nextiterations using “OLLAF simple” give information about thecontribution of the LCM only. If we now consider the resultfor next iterations using OLLAF, we can see that a major gainis obtained by combining LCM and a dual planes. In the cases

considered here, this gain is a factor between 103 for case 2and 106 for case 1 and 3 compared to the ICAP solution.

We also have to consider the scalability of proposedsolutions. Transfers at level TL0 are not dependent of eithercolumn size or number of columns in the consideredplatform. TL1 transfer time depend on the size of eachcolumn but not on the number of columns in use. TMtransfers not only depends on the column size but also onthe number of column as all transfers at this level share thesame source memory (CCR) and the same bus. We can see

Page 76: 541420

10 EURASIP Journal on Embedded Systems

Case 1 Case 2 Case 31

103

106

Tota

ltra

nsf

erti

me

(log

10(n

s))

OLLAFOLLAF simpleICAP like

Figure 10: Summary of Total transfer cost per sequence.

that using the classical approach will face some scalabilityissues while OLLAF offer a far better scalability potential astransfers cost is far less dependent on the platform size.

Figure 10 gives a summarized view of results. It presentthe total transfer cost per sequence iteration in normalexecution (i.e., not for the first execution). Results arepresented here in nanoseconds using a decimal logarithmicscale. This figure reveal the contribution of the OLLAFarchitecture in terms of context transfer overhead reduction.In all the three cases, OLLAF is the best solution. Case 3shows that it is well adapted to dynamic applications.

Those results not only prove the benefit of the OLLAFarchitecture, but they also demonstrate that the use of LCMallows to take better advantage of dual planes.

6. Conclusion

In this paper we presented a Fine Grained Dynamical-ly Reconfigurable Architecture called OLLAF, speciallydesigned to enhance the efficiency of Operating System’sservices necessary to its management.

Case study considering several typical applications withdifferent degrees of dynamicity revealed that this architecturepermits to obtain a far better efficiency for task loadingand execution context saving services than actual FPGAtraditionally used as FGDRA in most recent studies. Inthe best case, task switching can be achieved in just oneclock cycle. More realistic statistical analysis showed thatfor any basic dynamic case considered, the OLLAF platformalways outperform commercially available solution by afactor around 103 to 106 concerning contexts transfer costs.The analysis showed that this result can be achieved thinks tothe combination of a dual planes and an LCM.

This feature allows fast preemption and thus permit tohandle dynamic applications efficiently. This also open thedoor to lot of different scheduling strategies that cannot beconsidered using classical architecture.

Future works will be led on the development of an onlinescheduling service taking into account new possibilitiesoffered by OLLAF. We could include prediction mechanismin this scheduler performing smart configurations and

contexts prefetch. Being able to predict in most cases thefuture task that will run in a particular column will permit totake even better advantage of the context and configurationmanagement scheme proposed in OLLAF.

This work contribute to make FGDRAs a much morerealistic option as universal computing resource, and makethem one possible solution to keep the evolution of electronicsystem going in the more than moore fashion. For thatpurpose, we claim that we have to put a lot of efforts to builda strong consistence between design tools, Operating Systemsand platforms.

References

[1] S. Garcia, J. Prevotet, and B. Granado, “Hardware task contextmanagement for fine grained dynamically reconfigurablearchitecture,” in Proceedings of the Workshop on Design andArchitectures for Signal and Image Processing (DASIP ’07),Grenoble, France, November 2007.

[2] S. Garcia and B. Granado, “OLLAF: a fine grained dynamicallyreconfigurable architecture for os support,” in Proceedings ofthe Workshop on Design and Architectures for Signal and ImageProcessing (DASIP ’08), Grenoble, France, November 2008.

[3] H. Simmler, L. Levinson, and R. Manner, “Multitasking onFPGA coprocessors,” in Proceedings of the 10th InternationalConference on Field-Programmable Logic and Applications (FPL’00), vol. 1896 of Lecture Notes in Computer Science, pp. 121–130, Villach, Austria, August 2000.

[4] G. Chen, M. Kandemir, and U. Sezer, “Configuration-sensitiveprocess scheduling for FPGA-based computing platforms,”in Proceedings of the Design, Automation and Test in EuropeConference and Exhibition (DATE ’04), vol. 1, pp. 486–493,Paris, France, February 2004.

[5] H. Walder and M. Platzner, “Reconfigurable hardware oper-ating systems: from design concepts to realizations,” inProceedings of the International Conference on Engineering ofReconfigurable Systems and Algorithms (ERSA ’03), pp. 284–287, 2003.

[6] G. Wigley, D. Kearney, and D. Warren, “Introducing recon-figme: an operating system for reconfigurable computing,”in Proceedings of the 12th International Conference on FieldProgrammable Logic and Application (FPL ’02), vol. 2438, pp.687–697, Montpellier, France, September 2002.

[7] X.-P. Ling and H. Amano, “Wasmii : a data driven computeron virtuel hardware,” in Proceedings of the IEEE Workshopon FPGAs for Custom Computing Machines, pp. 33–42, Napa,Calif, USA, April 1993.

[8] Y. Shibata, M. Uno, H. Amano, K. Furuta, T. Fujii, and M.Motomura, “A virtual hardware system on a dynamicallyreconfigurable logic device,” in Proceedings of the IEEE Sympo-sium on FPGAs for Custom Computing Machines (FCCM ’00),Napa Valley, Calif, USA, April 2000.

[9] A. DeHon, “DPGA-coupled microprocessors: commodity ICsfor the early 21st century,” in Proceedings of the IEEE Workshopon FPGAs for Custom Computing Machines (FCCM ’94), pp.31–39, Napa Valley, Calif, USA, April 1994.

[10] Xilinx, “Time multiplexed programmable logic device,” USpatent no. 5646545, 1997.

[11] Z. Li, K. Compton, and S. Hauck, “Configuration cachingtechniques for FPGA,” in Proceedings of the IEEE Symposiumon FPGA for Custom Computing Machines (FCCM ’00), NapaValley, Calif, USA, April 2000.

Page 77: 541420

EURASIP Journal on Embedded Systems 11

[12] V. Nollet, P. Coene, D. Verkest, S. Vernalde, and R. Lauw-ereins, “Designing an operating system for a heterogeneousreconfigurable SoC,” in Proceedings of the 17th InternationalParallel and Distributed Processing Symposium (IPDPS ’03), p.174, Nice, France, April 2003.

[13] Xilinx, “Two flows for partial reconfiguration: module basedor difference based,” Xilinx, Application Note, Virtex, Virtex-E, Virtex-II, Virtex-II Pro Families XAPP290 (v1.2), Septem-ber 2004.

[14] D. Koch, C. Haubelt, and J. Teich, “Efficient hardware check-pointing: concepts, overhead analysis, and implementation,”in Proceedings of the 17th International Conference on FieldProgrammable Logic and Applications (FPL ’07), Amsterdam,The Netherlands, August 2007.

Page 78: 541420

Hindawi Publishing CorporationEURASIP Journal on Embedded SystemsVolume 2009, Article ID 175043, 21 pagesdoi:10.1155/2009/175043

Research Article

Trade-Off Exploration for Target Tracking Application ina Customized Multiprocessor Architecture

Jehangir Khan,1 Smail Niar,1 Mazen A. R. Saghir,2 Yassin El-Hillali,1

and Atika Rivenq-Menhaj1

1 Universite de Valenciennes et du Hainaut-Cambresis, ISTV2 - Le Mont Houy, 59313 Valenciennes Cedex 9, France2 Department of Electrical and Computer Engineering, Texas A&M University at Qatar, 23874 Doha, Qatar

Correspondence should be addressed to Smail Niar, [email protected]

Received 16 March 2009; Revised 30 July 2009; Accepted 19 November 2009

Recommended by Markus Rupp

This paper presents the design of an FPGA-based multiprocessor-system-on-chip (MPSoC) architecture optimized for MultipleTarget Tracking (MTT) in automotive applications. An MTT system uses an automotive radar to track the speed and relativeposition of all the vehicles (targets) within its field of view. As the number of targets increases, the computational needs of the MTTsystem also increase making it difficult for a single processor to handle it alone. Our implementation distributes the computationalload among multiple soft processor cores optimized for executing specific computational tasks. The paper explains how wedesigned and profiled the MTT application to partition it among different processors. It also explains how we applied differentoptimizations to customize the individual processor cores to their assigned tasks and to assess their impact on performance andFPGA resource utilization. The result is a complete MTT application running on an optimized MPSoC architecture that fits in acontemporary medium-sized FPGA and that meets the application’s real-time constraints.

Copyright © 2009 Jehangir Khan et al. This is an open access article distributed under the Creative Commons Attribution License,which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Technological progress has certainly influenced every aspectof our lives and the vehicles we drive today are no exception.Fuel economy, interior comfort, and entertainment featuresof these vehicles draw ample attention but the most impor-tant objective is to aid the driver in avoiding accidents.

Road accidents are primarily caused by misjudgment of adelicate situation by the driver. The main reason behind thedriver’s inability to judge a potentially dangerous situationcorrectly is the mental and physical fatigue due the stressfuldriving conditions. In cases where visibility is low due topoor weather or due to night-time driving, the stress on thedriver increases even further.

An automatic early warning and collision avoidancesystem onboard a vehicle can greatly reduce the pressure onthe driver. In the literature, such systems are called DriverAssistance Systems (DASs). DASs not only automatize safetymechanisms in a vehicle but also help drivers take correct andquick decisions in delicate situations. These systems providethe driver with a realistic assessment of the dynamic behavior

of potential obstacles before it is too late to react and avoid acollision.

In the past few years, various types of DASs have beenthe subject of research studies [1–3]. Most of these worksconcentrate on visual aid to the driver by using a videocamera. Cameras are usually used for recognizing road signs,lane departure warnings, parking assistance, and so forth.Identification of potential obstacles and taking correctiveaction are still left to the driver. Moreover, cameras havelimitations in bad weather and low-visibility conditions.

Our system uses a radar installed in a host vehicleto scan its field of view (FOV) for potential targets, andpartitions the scanned data into sets of observations, ortracks [4]. Potentially dangerous obstacles are then singledout and visual and audio alerts are generated for thedriver so that a preventive action can be taken. The outputsignals generated by the system can also be routed to anautomatic control system and safety mechanisms in caseof the vehicles equipped for fully autonomous driving. Weaim to use a low-cost automotive radar and complementit with an embedded tracking system for achieving higher

Page 79: 541420

2 EURASIP Journal on Embedded Systems

performance. The objective is to reduce the cost of the systemwithout sacrificing its accuracy and precision.

The principle contributions of this work are as follows.

(i) design and development of a new MTT systemspecially adapted to the requirements of automotivesafety applications,

(ii) feasible and scalable Implementation of the systemin a low-cost configurable and flexible platform(FPGA),

(iii) optimization of the system to meet the real timeperformance requirements of the application and toreduce the hardware size to the minimum possiblelimit, this not only helps to reduce the energyconsumption but also creates room for adding morefunctionality into the system using the same low-costplatform.

We implement our system in FPGA using a multiprocessorarchitecture which is inherently flexible and adaptable.FPGAs are increasingly being used as the platforms of choicefor implementing complex embedded systems due to theirhigh performance, flexibility, and fast design times. Multi-processor architectures have also become popular for severalreasons. For example, monitoring processor properties overthe last three decades shows that the performance of a singleprocessor has leveled off in the last decade. Using multipleprocessors with a lower frequency, results in comparableperformance in terms of instructions per second to a singlehighly clocked processor and reduces power consumptionsignificantly [5]. Dedicated fully hardware implementationmay be useful for high-speed processing but it does notoffer the flexibility and programmability desired for systemevolution. Fully hardware implementations also requirelonger design time and are inherently inflexible. Applicationswith low-power requirements and hardware size constraints,are increasingly resorting to MPSoC architectures. The moveto MPSoC design elegantly addresses the power issues facedon the hardware side while ensuring the speed performance.

2. MTT Terminology and Building Blocks

2.1. Terminology. In the context of target tracking applica-tions, a target represents an obstacle in the way of the hostvehicle. Every obstacle has an associated state represented bya vector that contains the parameters defining the target’sposition and its dynamics in space (e.g., its distance, speed,azimuth or elevation, etc.).

A state vector with n elements is called n-state vector. Aconcatenation of target states defining the target trajectoryor movement history at discrete moments in time is called atrack.

The behavior of a target can ideally be represented byits true state. The true state of a target is what characterizesthe target’s dynamic behavior and its position in space in a100% correct and exact manner. A tracking system attemptsto estimate the state of a target as close to this ideal state aspossible. The closer a tracking system gets to the true state,

the more precise and accurate it is. For achieving this goal, atracking system deals with three types of states

(i) The Observed State or the Observation corresponds tothe measurement of a target’s state by a sensor (radarin our application) at discrete moments in time. Itis one of the two representations of the true stateof the target. The observed state is obtained througha measurement model also termed as the observationmodel (refer to Section 4.2). The measurement modelmathematically relates the observed state to the truestate, taking into account the sensor inaccuracies andthe transmission channel noises. The sensor inac-curacies and the transmission noises are collectivelycalled measurement noise.

(ii) The Predicted State or the Prediction is the secondrepresentation of the target’s true state. Predictionis done for the next cycle before the sensor sendsthe observations. It is a calculated “guess” of thetarget’s true state before the observation arrives. Thepredicted state of a target is obtained through aprocess model (refer to Section 4.1). The process modelmathematically relates the predicted state to the truestate while taking into account the errors due to theapproximations of the random variables involved inthe prediction process. These errors are collectivelytermed as process noise.

(iii) The Estimated State or the Estimate is the correctedstate of the target that depends on both the obser-vation and the prediction. The correction is doneafter the observation is received from the sensor. Theestimated state is calculated by taking into accountthe variances of the observation and the prediction.To get a state that is more accurate than both theobserved and predicted states, the estimation processcalculates a weighted average of the observed andpredicted states favoring the one with lower variancemore over the one with larger variance.

In this paper, the term scan refers to the periodic sweepof radar field of view (FOV) giving observations of all thedetected targets. The FOV is usually a conical region in spaceinside which an obstacle can be detected by the radar. Thearea of this region depends upon the radar range (distance)and its view angle in azimuth.

The radar Pulse Repetition Time (PRT) is the time intervalbetween two successive radar scans. The PRT for the radarunit we are using is 25 ms. This is the time window withinwhich the tracking system must complete the processing ofthe information received during a scan. After this interval,new observations are available for processing. As we shallsee latter, the PRT puts an upper limit on the latency of theslowest module in the application.

2.2. MTT Building Blocks. A generalized view of a MultipleTarget Tracking (MTT) system is given in Figure 1. Thesystem can broadly be divided into two main blocks, namelyData Association and Filtering & Prediction. The two blockswork in a closed loop. The Data Association block is

Page 80: 541420

EURASIP Journal on Embedded Systems 3

Data association

Filtering& prediction

Observationsfrom the

sensor

Trackmaintenance

Observation to trackassignment

Gate computation

Estimatedtargetstates

Figure 1: A simplified view of MTT.

further divided into three subblocks: Track maintenance,Observation-to-Track Assignment and Gate Computation.

Figure 1 represents a text book view of the MTT sys-tem as presented in [6, 9]. Practical implementation andinternal details may vary depending on the end use andthe implementation technology. For example, the Filtering& Prediction module may be implemented choosing froma variety of algorithms such as α-β filter [4, 6], mean-shift algorithm [7], Kalman Filter [6, 8, 9], and so forth.Similarly, the Data Association module is usually modeled asan Assignment Problem. The assignment problem itself maybe solved in a variety of ways, for example, by using theAuction algorithm [10], or the Hungarian/Munkres algorithm[11, 12].

The choice of algorithms for the subblocks is drivenby factors like the application environment, the amount ofavailable processing resources, the hardware size of the endproduct, the track precision, and the system response time,and so forth.

3. Hardware Software Codesign Methodology

For designing our system, we followed the y-chart codesignmethodology depicted in Figure 2.

On the right hand side, the software design considera-tions are taken into account. This includes the choice of theprogramming language, the software development tools, andso forth. On the left hand side, the hardware design tools, thechoice of processors, the implementation platform, and theapplication programming interface (API) of the processorsare defined. In the middle, the MPSoC hardware is generatedand the software is mapped onto the processors.

After constructing the initial architecture, its perfor-mance is evaluated. If further performance improvementis needed, we track back to the initial steps and optimizevarious aspects of the software and/or the hardware toachieve the desired performance. The modalities of the“track back” step are mostly dependent on the experienceand expertise of the designer. For this work, we used amanual track back approach based on the profiling statisticsof the application. As a part of our ongoing work, we areformalizing the design approach to help the designer in

HW aspects SW aspects

Architectural platform

Programming model

Application design

Applicationdevelopmentenvironment

SW to HW mapping

Code generationprocess

Performance analysis

Improvearchitecture

Re-arrangeapplication

Improvemappingstrategies

Figure 2: The Y-chart flow for codesign.

choosing the right configuration parameters and mappingstrategies.

Following the codesign methodology, we first developedour application, details of which are described in the nextsection. After developing the application, we move on tothe architectural aspects of the system which are detailed inSection 6.

4. Application Design and Development:Our Approach

As stated above, the choice of algorithms for the MTT systemand the internal details are driven by various factors. Wedesigned the application for mapping onto a multiprocessorsystem. A multiprocessor architecture can be exploitedvery efficiently if the underlying application is dividedinto simpler modules which can run in parallel. Moreover,simple multiple modules can be managed and improvedindependently of one another as long as the interfaces amongthem remain unchanged.

For the purpose of a modular implementation, weorganized our MTT application into submodules as shownin Figure 3. The functioning of the system is explained asfollows. Assuming a recursive processing as shown by theloop in Figure 1, tracks would have been formed on theprevious radar scan. When new observations are receivedfrom the radar, the processing loop is executed.

In the first cycle of the loop, at most 20 of the incomingobservations would simply pass through the Gate Checker,the Cost Matrix Generator, and the Assignment Solver onto the filters’ inputs. A filter takes an observation as aninaccurate representation of the true state of the target, andthe amount of inaccuracy of the observation depends on themeasurement variance of the radar. The filter estimates thecurrent state of the target and predicts its next state before

Page 81: 541420

4 EURASIP Journal on Embedded Systems

Data association

Track maintenance

Obs-to-track assignment

Filtering & prediction

Correction(measurement

update)

Prediction(time update)

Observationlessgate identifier

New targetidentifier

TrackInit/del

Gate checkerCost matrix

generatorAssignment

solver

Gate computation

Radar

ObservationsEstimates

Figure 3: The Proposed MTT implementation.

the next observation is available. The estimation processand the MTT application as a whole rely on mathematicalmodels. The mathematical models we used in our approachare detailed below.

4.1. Process Model. The process model mathematicallyprojects the current state of a target to the future. This canbe presented in a linear stochastic difference equation as

Yk = AYk−1 + BUk +Wk−1. (1)

In (1), Yk−1 and Yk are n-dimensional state vectors thatinclude the n quantities to be estimated. Vector Yk−1

represents the state at scan k−1, while Yk represents the stateat scan k.

The n× n matrix A in the difference equation (1) relatesthe state at scan k − 1 to the state at scan k, in the absenceof either a driving function or process noise. Matrix A is theassumed known state transition matrix which may be viewedas the coefficient of state transformation from scan k − 1to scan k, in the absence of any driving signal and processnoise. The n × l matrix B relates the optional control inputUk ∈ Rl to the state Yk, whereas Wk−1 is zero-mean additivewhite Gaussian process noise (AWGN) with assumed knowncovarianceQ. Matrix B is the assumed known control matrix,andUk is the deterministic input, such as the relative positionchange associated with the host-vehicle motion.

4.2. Measurement Model. To express the relationshipbetween the true state and the observed state (measuredstate), a measurement model is formulated. It is described asa linear expression:

Zk = HYk +Vk. (2)

Here Zk is the measurement or observation vectorcontaining two elements, distance d and azimuth angle

θ. The 2 × n observation matrix H in the measurementequation (2) relates the current state to the measurement(observation) vector Zk. The term Vk in (2) is a randomvariable representing the measurement noise.

For implementation, we chose the example case given in[8]. In the rest of the paper, the numerical values of all thematrix and vector elements are borrowed from this example.In this example, the matrices and vectors in equations (1)and (2) have the forms shown below:

Yk =

⎡⎢⎢⎢⎢⎢⎢⎣

y11

y21

y31

y31

⎤⎥⎥⎥⎥⎥⎥⎦

, A =

⎡⎢⎢⎢⎢⎢⎢⎣

1 T 0 0

0 1 0 0

0 0 1 T

0 0 0 1

⎤⎥⎥⎥⎥⎥⎥⎦

, Zk =⎡⎣dθ

⎤⎦. (3)

Here y11 is the target range or distance; y21 is range rateor speed; y31 is the azimuth angle; y41 is angle rate orangular speed. In vector Zk, the element d is the distancemeasurement and θ is the azimuth angle measurement.Matrix B and control input Uk are ignored here because theyare not necessary in our application.

The radar Pulse Repetition Time (PRT) is denoted by Tand it is 0.025 seconds for the specific radar unit we are usingin our project.

Having devised the process and measurement models, weneed an estimator which would use these models to estimatethe true state. We use the Kalman filter which is a recursiveLeast Square Estimator (LSE) considered to be the optimalestimator for linear systems with Additive White GaussianNoise (AWGN) [9, 13].

4.3. Kalman Filter. The Filtering & Prediction block inFigure 3 is particularly important as the number of filtersemployed in this block is the same as the maximum numberof targets to be tracked. In our work, we fixed this number at

Page 82: 541420

EURASIP Journal on Embedded Systems 5

20 as the radar we are using can measure the coordinates ofa maximum of 20 targets. Hence this block uses 20 similarfilters running in parallel. If the number of the detectedtargets is less than 20, the idle filters are switched off toconserve energy.

Given the process and the measurement models in (1)and (2), the Kalman filter equations are

Y−k = AYk−1 + BUk, (4)

P−k = APk−1AT +Q, (5)

K = P−k HT(HP−k H

T + R)−1

, (6)

Yk = Y−k + K(Zk −HY−k

), (7)

Pk = (I − KH)P−k . (8)

Here Y−k is the state prediction vector; Yk−1 is the stateestimation vector, K is the Kalman gain matrix, P−k is theprediction error covariance matrix, Pk is the estimation errorcovariance matrix and I is an identity matrix of the samedimensions as Pk. Matrix R represents the measurementnoise covariance and it depends on the characteristics of theradar.

The newly introduced vectors and matrices in (4) to (8)have the following forms:

Yk =

⎡⎢⎢⎢⎢⎢⎢⎣

y11

y21

y31

y41

⎤⎥⎥⎥⎥⎥⎥⎦

, Y−k =

⎡⎢⎢⎢⎢⎢⎢⎣

y−11

y−21

y−31

y−41

⎤⎥⎥⎥⎥⎥⎥⎦

, H =⎡⎣1 0 0 0

0 0 1 0

⎤⎦,

R =⎡⎣r11 r12

r21 r22

⎤⎦ =

⎡⎣106 0

0 2.9∗ 10−4

⎤⎦,

Q =

⎡⎢⎢⎢⎢⎢⎢⎣

q11 q12 q13 q14

q21 q22 q23 q24

q31 q32 q33 q34

q41 q42 q43 q44

⎤⎥⎥⎥⎥⎥⎥⎦=

⎡⎢⎢⎢⎢⎢⎢⎣

0 0 0 0

0 330 0 0

0 0 0 0

0 0 0 1.3∗ 10−8

⎤⎥⎥⎥⎥⎥⎥⎦.

(9)

Here Y−11 is the range prediction, Y−21 is the speedprediction, Y−31 is the azimuth angle prediction, Y−41 is theangular speed prediction, Y11 is the range estimate, Y21 thespeed estimate, Y31 is the angle estimate and lastly Y41 is theangular speed estimate, all for instant k.

Matrices K and P−k have the following forms:

K =

⎡⎢⎢⎢⎢⎢⎢⎣

k11 k11

k21 k122

k31 k32

k41 k42

⎤⎥⎥⎥⎥⎥⎥⎦

, P−k =

⎡⎢⎢⎢⎢⎢⎢⎣

p−11 p−21 p−31 p−41

p−21 p−22 p−23 p−24

p−31 p−32 p−33 p−34

p−41 p−42 p−43 p−44

⎤⎥⎥⎥⎥⎥⎥⎦. (10)

Matrix Pk is similar in form to P−k except for thesuperscript “−”.The scan index k has been ignored in the

Seed valuesfor Yk−1 & Pk−1

Prediction(time update)

State prediction (V-C.1)

Error cov. pred. (V-C.2)

Correction(measurement update)

Filter gain (V-C.3)

State estimation (V-C.4)

Error cov. estim. (V-C.5)

Figure 4: The Kalman filter.

elements of these matrices and vectors for the sake ofnotational simplicity. The Kalman filter cycles through theprediction-correction loop shown pictorially in Figure 4. Inthe prediction step (also called time update), the filterpredicts the next state and the error covariance associatedwith the state prediction using (4) and (5), respectively.In the correction step (also called measurement update), itcalculates the filter gain, estimates the current state and theerror covariance of the estimation using (6) through (8),respectively.

Figure 5 shows the position of a target estimated by theKalman filter against the true position and the observedposition (measured by the radar). The efficacy of the filtercan be appreciated by fact that the estimated position followsthe true position very closely as compared with the observedposition after the 20 transitional iterations.

In the case of a system dedicated to tracking a singletarget, the estimated state given by the filter would be usedto null the offset between the current pointing angle ofthe radar and the angle at which the target is currentlysituated. This operation would need a control loop and anactuator to correct the pointing angle of the radar. But sincewe are dealing with multiple targets at the same time, wehave to identify which of the incoming observed states toassociate with the predicted states to get the estimation foreach target. This is the job of data association function. Thedata association submodules are explained one by one in thefollowing sections.

4.4. Gate Computation. The first step in data association isthe gate computation. The Gate Computation block receivesthe predicted states Y−k and the predicted error covariance P−kfrom the Kalman Filters for all the currently known targets.Using these two quantities the Gate Computation blockdefines the probability gates which are used to verify whetheran incoming observation can be associated with an existingtarget. The predicted states Y−k are located at the center ofthe gates. The dimensions of the gates are proportional to theprediction error covariance P−k . If the innovation “Zk−HY−k ”(also called the residual) for an observation, is greater thanthe gate dimensions, the observation fails the gate and hence

Page 83: 541420

6 EURASIP Journal on Embedded Systems

Position (true, measured and estimated)

Dis

tan

ce(m

)

14

16

18

20

22

24

Iterations (n)

400 600 800 1000 1200 1400 1600 1800

True Measured

Estimated

Figure 5: Estimated target position.

it cannot be associated with the concerned prediction. If anobservation passes a gate, then it may be associated with theprediction at the center of that gate. In fact, observations formore than one targets may pass a particular gate. In suchcases all these observations are associated with the singleprediction. The Gating process may be viewed as the firstlevel of “screening out” the unlikely prediction-observationassociations. In the second level of screening, namely theassignment solver (discussed latter in Section 4.7), a strictlyone-to-one coupling is established between observations andpredictions.

The gate computation model is summarized as follows.Define Y to be the innovation or the residual vector (Zk−

HY−k ). In general, for a track i, the residual vector is

Yi = Zk −HY−i . (11)

Now define a rectangular region such that an observationvector Zk (with elements zkl) is said to satisfy the gate of agiven track if all elements yil of residual vector Yi satisfy therelationship

∣∣∣Zkl −HY−il∣∣∣ =

∣∣∣Yil∣∣∣ ≤ KGlσr . (12)

In (11) and (12), i is an index for track i, G is gate and l isreplaced either by d or by θ, whichever is appropriate (see(17) and (18)). The term σr is the residual standard deviationand is defined in terms of the measurement variance σ2

z andprediction variance σ2

y−k. A typical choice for KGl is [KGl ≥

3.0]. This large choice of gating coefficient is typically madein order to compensate for the approximations involved inmodeling the target dynamics through the Kalman filtercovariance matrix [4]. This concept comes from the famous3 sigma rule in statistics.

In its matrix form for scan k and track i, (11) can besimplified down to

Yik =⎡⎣ yik11

yik21

⎤⎦ =

⎡⎣di − y−k11

θi − y−k31

⎤⎦. (13)

Consequently (12) gives∣∣∣∣∣∣yik11

yik21

∣∣∣∣∣∣ ≤ KGlσr . (14)

The residual standard deviations for the two state vectorelements are defined as follows

σrd =√r11 + p−22, (15)

σrθ =√r22 + p−44. (16)

From (14), (15), and (16), we get

∣∣ yik11∣∣ = ∣∣ yikd

∣∣ ≤ 3.0√r11 + p−22 (17)

∣∣ yik21∣∣ = ∣∣ yikθ

∣∣ ≤ 3.0√r22 + p−44 (18)

Equations (17) and (18) together put the limits on theresiduals yikd and yikθ . In other words, the difference betweenan incoming observation and prediction for track i mustcomply with (17) and (18) for the observation to be assignedto track i. The Gate Checker subfunction, explained next,tests all the incoming observations for this compliance.

4.5. Gate Checker. The Gate Checker tests whether anincoming observation fulfills the conditions set in (17) and(18). Incoming observations are first considered by theGate Checker for updating the states of the known targets.Gate checking determines which observation-to-predictionpairings are probable. At this stage the pairing between thepredictions and the observations are not done in a strictlyone-to-one fashion. A single observation may be pairedwith several predictions and vice versa, if (17) and (18) arecomplied with. In effect, the Gate Checker sets or resets thebinary elements of an N×N matrix termed as the Gate Maskmatrix M where N is the maximum number of targets to betracked,

M =

Predictions︷ ︸︸ ︷⎡⎢⎢⎢⎢⎢⎢⎢⎣

m11 m12 · · · m1N

m21 m22 · · · m2N

...... · · · ...

mN1 mN2 · · · mNN

⎤⎥⎥⎥⎥⎥⎥⎥⎦

⎫⎪⎪⎪⎪⎪⎪⎪⎬⎪⎪⎪⎪⎪⎪⎪⎭

observations,

mij =⎧⎨⎩

1 if obs i obey (V-D.7) & (V6D.8) for track j,

0 otherwise.(19)

If an observation i fulfills both the conditions of (17) and(18) for a prediction j, the corresponding element mij ofmatrixM is set to 1 otherwise it is reset to 0. MatrixM wouldtypically have more than one 1′s in a column or a row. Theultimate goal for estimating the states of the targets is to haveonly one ′1′ in a row or a column for a one-to-one couplingof observations and predictions. To achieve this goal, the firststep is to attach a cost to every possible coupling. This is doneby the Cost Generator block explained next.

Page 84: 541420

EURASIP Journal on Embedded Systems 7

4.6. Cost Matrix Generator. The Mask matrix is passed onto the Cost Matrix Generator which attributes a cost to eachpairing. The costs associated with all the pairings are puttogether in a matrix called the Cost Matrix C.

The cost ci j for associating an observation i with aprediction j is the statistical distance d2

i j between theobservation and the prediction when mij is 1. The cost isan arbitrarily large number when mij is 0. The statisticaldistance d2

i j is calculated as follows.Define

Si j = HP−k HT + R. (20)

Here i is an index for observation i and j is the index forprediction j in a scan, Si j is the residual covariance matrix.The statistical distance d2

i j is the norm of the residual vector,

d2i j = YT

i j S−1i j Yi j (21)

C =

Predictions︷ ︸︸ ︷⎡⎢⎢⎢⎢⎢⎢⎢⎣

c11 c12 · · · c1N

c21 c22 · · · c2N

...... · · · ...

cN1 cN2 · · · cNN

⎤⎥⎥⎥⎥⎥⎥⎥⎦

⎫⎪⎪⎪⎪⎪⎪⎪⎬⎪⎪⎪⎪⎪⎪⎪⎭

Observations

ci j =⎧⎨⎩

Arbitrary large number if mij is 0

d2i j if mij is 1

(22)

Equation (20) can be written in its matrix form andsimplified down to

Si j =⎡⎣p

−11 + r11 p−13

p−31 p−33 + r22

⎤⎦. (23)

Using (13), (21), and (23), d2i j is calculated as follows:

d2i j =

[ yik11 yik21 ][p−33+r22 −p−13

−p−31 p−11+r11

][yik11

yik21

]

((p−11 + r11

)∗ (p−33 + r22)− p−11 ∗ p−33

) . (24)

Recall here that yik11 = yikd and yik21 = yikθ .The cost matrix demonstrates a conflict situation where

several observations are candidates to be associated with aparticular prediction and vice versa. A conflict situation isillustrated in Figure 6.

The three rectangles represent the gates constructed bythe Gate Computation module. The predicted states aresituated at the center of the gates. Certain parts of thethree gates overlap one another. Some of the incomingobservations would fall into these overlapping regions ofthe gates. In such cases all the predictions at the center ofthe concerned gates are eligible candidates for associationwith the observations falling in the overlapping regions. The

2 way conflict

3 way conflict

Gate 1

Gate 2

Gate 3

1

3

3

21

2

PredictionsObservations

Figure 6: Conflict situation in data association.

mask matrix M and the cost matrix C corresponding to thissituation are shown below,

M =

Predictions︷ ︸︸ ︷⎡⎢⎢⎢⎣

0 1 0

1 0 0

1 1 1

⎤⎥⎥⎥⎦

⎫⎪⎪⎪⎬⎪⎪⎪⎭

Observations,

C =

Predictions︷ ︸︸ ︷⎡⎢⎢⎢⎣

∞ d212 ∞

d221 ∞ ∞d2

31 d232 d2

33

⎤⎥⎥⎥⎦

⎫⎪⎪⎪⎬⎪⎪⎪⎭

Observations.

(25)

The prediction with the smallest statistical distance d2i j

from the observation is the strongest candidate. To resolvethese kinds of conflicts, the cost matrix is passed on to theAssignment Solver block which treats it as the assignmentproblem [10, 12].

4.7. Assignment Solver. The assignment solver determinesthe finalized one-to-one pairing between predictions andobservations. The pairings are made in a way to ensureminimum total cost for all the finalized pairings. Theassignment problem is modeled as follows.

Given a cost matrix C with elements ci j , find a matrixX = [xi j] such that

C =n∑

i=1

m∑

j=1

ci jxi j is minimized (26)

subject to:

i

xi j = 1, ∀ j,

j

xi j = 1, ∀i.(27)

Here xi j is a binary variable used for ensuring that anobservation is associated with one and only one prediction

Page 85: 541420

8 EURASIP Journal on Embedded Systems

and a prediction is associated with one and only oneobservation. This requires xi j to be either 0 or 1, that is,xi j ∈ {0, 1}.

Matrix X can be found by using various algorithms. Themost commonly used among them are the Munkres algorithm[12] and the Auction algorithm [10]. We use the former in ourapplication due to its inherent modular structure.

Matrix X below shows a result of the Assignment Solverfor a 4 × 4 cost matrix. It shows that observation 1 is to bepaired with prediction 3, observation 2 with prediction 1 andso on:

X =

Predictions︷ ︸︸ ︷⎡⎢⎢⎢⎢⎢⎢⎣

0 0 1 0

1 0 0 0

0 0 0 1

0 1 0 0

⎤⎥⎥⎥⎥⎥⎥⎦

⎫⎪⎪⎪⎪⎪⎪⎬⎪⎪⎪⎪⎪⎪⎭

Observations. (28)

The finalized observation-prediction pairs are passed on tothe the relevant Kalman filters to start a new cycle of the loopfor estimating the current states of the targets, predictingtheir next states and the error covariances associated withthese states.

All the steps IV-C through IV-G are repeated indefinitelyin the loop in Figure 3. However, there are certain caseswhere some additional steps have to be taken too. Togetherthese steps are called Track Maintenance. The Track Mainte-nance and the circumstances where it becomes relevant areexplained in the next section.

4.8. Track Maintenance. The Track Maintenance block con-sists of three functions namely the New Target Identifier, theobs-less Gate Identifier and the Track Init/Del.

In real conditions there would be one or more targetsdetected in a scan which did not exist in the previous scans.On the other hand there would be situations where one ormore of the already known targets would no longer be in theradar range. In the first case we have to ensure if it is reallya new target or a false alarm. The New target Identificationsubblock takes care of such cases. In the latter case we haveto ascertain that the target has really disappeared from theradar FOV. The Observation-less Gate Identification subblockis responsible for dealing with such situations.

A new target is identified when its observation fails allthe already established gates, that is, when all the elements ofa row in the Gate Mask matrix M are zero. Such observationsare candidates for initiating new tracks after confirmation.The confirmation strategies we use in our work are based onempirical results cited in [4]. In this work, 3 observationsout of 5 scans for the same target initiate a new track. Thenew target identifier starts a counter for the newly identifiedtarget. If the counter reaches 3 in five scans, the target isconfirmed and a new track is initiated for it. The counteris reset every five scans thus effectively forming a slidingwindow.

The disappearance of a target means that, no observa-tions fall in the gate built around its predicted state. This isindicated when an entire column of the Mask matrix is filled

with zeros. The tracks for such targets have to be deleted afterconfirmation of their disappearance. The disappearance isconfirmed if the concerned gates go without an observationfor 3 consecutive scans out of 5. The obs-less gate identifierstarts a counter when an empty gate is detected. If thecounter reaches 3 in three consecutive scans out of 5, thedisappearance of the target is confirmed and its track isdeleted from the list. The counter is reset every five scans.

The Track Init/Del function prompts the system toinitiate new tracks or to delete existing ones when needed.

5. Implementation Platform and the Tools

For the system under discussion we work with Altera’s NiosIIdevelopment kit StratixII edition as the implementationplatform. The kit is built around Altera’s StratixII EP2S60FPGA.

5.1. Design Tools. The NiosII development kits are com-plemented with Altera’s Embedded Design Suite (EDS).The EDS offers a user friendly interface for designingNiosII based multiprocessor systems. A library of ready-to-use peripherals and customizable interconnect structurefacilitates creating complex systems. The EDS also providesa comprehensive API for programming and debugging thesystem. The NiosII processor can easily be reinforced withcustom hardware accelerators and/or custom instructionsto improve its performance. The designer can choose fromthree different implementations of the NiosII processor andcan add or remove features according to the requirements ofthe application.

The EDS consists of the three tools, namely the Quar-tusII, the SOPC Builder and the NiosII IDE.

The system design starts with creating a QuartusIIproject. After creating a project, the user can invoke theSOPC Builder tool from within the QuartusII. The designerchooses processors, memory interfaces, peripherals, busbridges, IP cores, interface cores, common microprocessorperipherals and other system components from the SOPCBuilder IP library. The designer can add his/her own customIP blocks and peripherals to the SOPC Builder componentlibrary. Using the SOPC Builder, the designer generates theAvalon switch fabric that contains all the decoders, arbiters,data path, and timing logic necessary to bind the chosenprocessors, peripherals, memories, interfaces, and IP cores.

Once the system integration is complete, RTL code isgenerated for the system. The generated RTL code is sentback into the QuartusII project directory where it can besynthesized, placed and routed and finally an FPGA can beconfigured with the system hardware.

After configuring the FPGA with a Nios II basedhardware, the next step is to develop and/or compile softwareapplications for the processor(s) in the system. The NiosIIIDE is used to manage the NiosII C/C++ application andsystem library or board support package (BSP) projectsand makefiles. The C/C++ application contains the softwareapplication files developed by the user. The system libraryincludes all the header files and drivers related to the system

Page 86: 541420

EURASIP Journal on Embedded Systems 9

Munkres algorithm analysis with GProf for 10 iterations

Ru

nti

me

(ms)

for

10it

erat

ion

s

0500

100015002000250030003500400045005000

Number of targets

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Call to MunkresStep 1Step 2Step 3

Step 4Step 5Step 6

Figure 7: Munkres algorithm profile obtained through GProf.

Munkres analysis with performance counter for 10 iterations

Ru

nti

me

(ms)

0

200

400

600

800

1000

1200

1400

1600

Number of targets

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Call to MunkresStep 1Step 2Step 3

Step 4Step 5Step 6

Figure 8: Profile of Munkres Algorithm obtained through Perfor-mance Counter.

hardware components. The system library can be used toselect project settings such as the choice of stdin, stdout,stderr devices, system clock timer, system time stamp timer,various memory locations, and so forth. Thus using thesystem library, the designer can choose the optimum systemconfiguration for an application.

5.2. Application Profiling and the Profiling Tools. A NiosIIapplication can be profiled in several ways, the most popularamong them being the use of the GProf profiler tool and thePerformance Counter peripheral.

5.2.1. GProf. The Gprof profiler tool, called nios2-elf-gprof,can be used without making any hardware changes to theNiosII system. This tool directs the compiler to add calls tothe profiler library functions into the application code.

The profiler provides an overview of the run-time behav-ior of the entire system and also reveals the dependenciesamong application modules. However adding instructions toeach function call for use by the GNU profiler affects thecode’s behavior in numerous ways. Each function becomes

larger because of the additional function calls to collectprofiling information. Collecting the profiling informationincreases the entry and exit time of each function. Theprofiling data is a sampling of the program counter taken atthe resolution of the system timer tick. Therefore, it providesan estimation, not an exact representation of the processortime spent in different functions [14].

5.2.2. Performance Counter. A performance counter periph-eral is a block of counters in hardware that measure theexecution time taken by the user-specified sections of theapplication code. It can monitor as many as seven codesections. A pair of counters tracks each code section. A 64-bit time counter counts the number of clock ticks duringwhich the code in the section is running while a 32-bit eventcounter counts the number of times the code section runs.These counters accurately measure the execution time takenby designated sections of the C/C++ code. Simple, efficientand minimally intrusive macros are used to mark the startand end of the blocks of interest (the measured code sections)in the program [14].

Figure 7 shows the Munkres algorithm’s profile obtainedthrough Gprof. The algorithm was executed on NiosII/swith 100 MHz clock and 4 KB instruction cache. The callto Munkres represents the processor time of the overallalgorithm for up to 20 obstacles. Step 1 through Step 6represent the behavior of individual subfunctions whichconstitute the algorithm.

Figure 8 shows the profile of the same algorithm obtainedthrough the performance counter for the same processorconfiguration. Clearly the two profiles have exactly the sameform. The difference is that while Gprof estimates that for20 obstacles the algorithm takes around 4500 ms to find asolution, the performance counter calculates the executiontime to be around 1500 ms. This huge difference is due tothe overhead added by Gprof when it calls its own libraryfunctions for profiling the code.

We profiled the application with both the tools. TheGprof was used for identifying the dependencies and theperformance counter for precisely measuring the latencies.All the performances cited in the rest of the paper are thoseobtained by using performance counter.

6. System Architecture

We coded our application in ANSI C following the generallyaccepted efficient coding practices and the O3 compilationoption. Before deciding to allocate processing resources tothe application modules, we profiled the application toknow the latencies, resource requirements and dependenciesamong the modules. Guided by the profiling results, wedistributed the application over different processors asdistinct functions communicating in a producer-consumerfashion as shown in Figure 9. Similar considerations havebeen proposed in [2, 15, 16].

The proposed multiprocessor architecture includes dif-ferent implementations of the NiosII processor and variousperipherals as system building blocks.

Page 87: 541420

10 EURASIP Journal on Embedded Systems

FIFO FIFO FIFO

FIFO

FIFO

FIFO

FIFO

Track maintenanceprocessor #23

I-cache D-cache

Local on-chip mem

Assignment solverprocessor #22

I-cache D-cache

Local on-chip mem

Gating moduleprocessor #21

I-cache D-cache

Local on-chip mem

Radar

Sharedoff-chipmem.

Shared memory interconnect

Track maint to KF interconnect

Assign solver to KF interconnect

KF to gating interconnect

· · ·

KF1processor #1

I-cache D-cache

Local on-chip mem

KF2processor #2

I-cache D-cache

Local on-chip mem

KF20processor #20

I-cache D-cache

Local on-chip mem

Figure 9: The proposed MPSoC architecture.

The NiosII is a configurable soft-core RISC processorthat supports adding or removing features on a system-by-system basis to meet performance or cost goals. A NiosIIbased system consists of NiosII processor core(s), a set of on-chip peripherals, on-chip memory and interfaces to off-chipmemory and peripherals, all implemented on a single FPGAdevice. Because NiosII processor systems are configurable,the memories and peripherals can vary from system tosystem.

The architecture hides the hardware details from the pro-grammer, so programmers can develop NiosII applicationswithout specific knowledge of the hardware implementation.

The NiosII architecture uses separate instruction anddata buses, classifying it as Harvard architecture. Both theinstruction and data buses are implemented as Avalon-MM master ports that adhere to the Avalon-MM interfacespecification. The data master port connects to both memoryand peripheral components while the instruction masterport connects only to memory components.

The Kalman filter, as mentioned earlier, is recursivealgorithm looping around prediction and correction steps.Both these steps involve matrix operations on floatingpoint numbers. These operations demand heavy processingresources to complete in a timely way. This makes the filtera strong candidate for mapping onto a separate processor.Thus for tracking 20 targets at a time, we need 20 identicalprocessors executing Kalman filters.

The Gate Computation block regularly passes infor-mation to Gate Checker which in turn, is in constantcommunication with Cost Matrix Generator. In view of thesedependencies, we group these three blocks together, collec-tively call them the Gating Module and map them onto asingle processor to minimize interprocessor communication.

Interprocessor communication would have required addi-tional logic and would have added to the complexity of thesystem. Avoiding unnecessary interprocessor communica-tion is also desirable for reducing power consumption.

The assignment-solver is an algorithm consisting of sixdistinct iterative steps [12]. Looping through these stepsdemands a long execution time. Moreover, these steps havedependencies among them. Hence the assignment solver hasto be kept together and cannot be combined with any of theother functions. So we allocated a separate processor to theassignment solver.

The three blocks of the Track Maintenance subfunctionindividually don’t demand heavy computational resources,so we group them together for mapping onto a processor.As can be seen in Figure 9, every processor has an I-cache,a D-cache and a local memory. Since the execution time ofthe individual functions and their latencies to access a largeshared memory, are not identical, dependence exclusively ona common system bus would become a bottleneck. Addi-tionally, since the communication between various functionsis of producer-consumer nature, complicated synchronizationand arbitration protocols are not necessary. Hence we choseto have a small local memory for every processor and alarge off-chip memory device as shared memory for noncritical sections of the application modules. As a resultthe individual processors have lower latencies for accessingtheir local memories containing the performance criticalcodes. In Sections 7.2 and 7.4 we will demonstrate how tosystematically determine the optimal sizes of these cachesand the local memories.

Every processor communicates with its neighboringprocessors through buffers. These buffers are dual-portFIFOs with handshaking signals indicating when the buffers

Page 88: 541420

EURASIP Journal on Embedded Systems 11

are full or empty and hence regulating the data transferbetween the processors. This arrangement forms a systemlevel pipeline among the processors. At the lower level, theprocessors themselves have a pipelined architecture (refer toTable 1). Thus the advantages of pipelined processing aretaken both at the system level as well as at the processor level.An additional advantage of this arrangement is that changesmade to the functions running on different processors do nothave any drastic effects on the overall system behavior as longas the interfaces remain unchanged. The buffers are flushedwhen they are full and the data transfer resumes after mutualconsent of the concerned processors. The loss of informationduring this procedure does not affect the accuracy becausethe data sampling frequency as set by the radar PRT, is highenough to compensate for this minor loss.

Access to the I/O devices is memory-mapped. Both datamemory and peripherals are mapped into the address spaceof the data master port of the NiosII processors. The NiosIIprocessor uses the Avalon switch fabric as the interface to itsembedded peripherals. The switch fabric may be viewed as apartial cross-bar where masters and slaves are interconnectedonly if they communicate. The Avalon switch fabric withthe slave-side arbitration scheme, enables multiple mastersto operate simultaneously [17]. The slave-side arbitrationscheme minimizes the congestion problems characterizingthe traditional bus.

In the traditional bus architectures, one or more busmasters and bus slaves connect to a shared bus. A singlearbiter controls the bus, so that multiple bus masters do notsimultaneously drive the bus. Each bus master requests thearbiter for control of the bus and the arbiter grants accessto a single master at a time. Once a master has control ofthe bus, the master performs transfers with any bus slave.When multiple masters attempt to access the bus at the sametime, the arbiter allocates the bus resources to a single master,forcing all other masters to wait.

The Avalon system interconnect fabric uses a multimaster architecture with slave-side arbitration. Multiple masters can be active at the same time, simultaneously transferring data to independent slaves. Arbitration is performed at the slave: the arbiter decides which master gains access to the slave only if several masters initiate a transfer to the same slave in the same cycle. The arbiter logic multiplexes all address, data, and control signals from the masters to a shared slave, evaluates the address and control signals from each master, and determines which master, if any, gains access to the slave next. If a slave is not shared, it is always available to its master, and hence multiple masters can simultaneously communicate with their independent slaves without going through an arbiter.

6.1. System Software. When processors are used in a system, the use of system software or an operating system is inevitable. Many NiosII systems have simple requirements for which a minimal operating system or small-footprint system software, such as Altera's Hardware Abstraction Layer (HAL) or a third-party real-time operating system, is sufficient. We use the former because the available third-party real-time operating systems have large memory footprints, while one of our objectives is to minimize the memory requirements.

The HAL is a lightweight runtime environment that provides a simple device driver interface for programs to communicate with the underlying hardware. The HAL application programming interface (API) is integrated with the ANSI C standard library. The API facilitates access to devices and files using familiar C library functions.
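Because the HAL is integrated with the ANSI C standard library, a peripheral can be exercised with ordinary file I/O, as in the short sketch below; the device name /dev/uart_0 is an assumption that depends on how the peripheral was named in the SOPC system.

/* Minimal HAL usage sketch: accessing a character device through the
 * familiar C library. The device path is hypothetical. */
#include <stdio.h>

int main(void)
{
    FILE *uart = fopen("/dev/uart_0", "w");   /* HAL exposes devices as files */
    if (uart != NULL) {
        fprintf(uart, "tracked obstacles: %d\n", 20);
        fclose(uart);
    }
    return 0;
}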

The HAL device driver abstraction provides a clear distinction between application and device driver software. This driver abstraction promotes reusable application code that is independent of the underlying hardware. Changes in the hardware configuration automatically propagate to the HAL device driver configuration, preventing changes in the underlying hardware from creating bugs. In addition, the HAL standard makes it straightforward to write drivers for new hardware peripherals that are consistent with existing peripheral drivers [17].

6.2. Constraints. The main constraints that we have to comply with are as follows.

We need the overall response time of the system to be less than the radar PRT, which is 25 ms. This means that the slowest application module must have a response time of less than 25 ms. Hence the first objective is to meet this deadline.

The FPGA (StratixII EP2S60) we are using for this system contains a total of 318 KB of configurable on-chip memory. This memory has to make up the processors' instruction and data caches, their internal registers, peripheral port buffers, and locally connected dedicated RAM or ROM. Thus the second constraint is that the total on-chip memory utilization must not exceed this limit. We can use off-chip memory devices, but they are not only very slow in comparison to the on-chip memory, they also have to be shared among the processors. Controlling access to shared memory needs arbitration circuitry, which adds to the complexity of the system and further increases the access time. On the other hand, we cannot totally eliminate the off-chip memory, for the reasons stated above. In fact we must balance our reliance on the off-chip and on-chip memory in such a way that neither do the on-chip memory requirements exceed the available amount of memory nor does the system become too slow to meet the time constraints.

Another constraint is the amount of logic utilization. We must choose our hardware components carefully to minimize the use of the programmable logic on the FPGA. Excessive use of programmable logic not only complicates the design and consumes the FPGA resources but also increases power consumption. For these reasons we optimize the hardware features of the individual processors and leave out certain options when they are not absolutely essential for meeting the time constraints.

7. Optimization Strategies

To meet the constraints discussed above, we plan our optimization strategies as follows.


Table 1: Different NiosII implementations and their features.

                                   Nios II/f      Nios II/s      Nios II/e
                                   (Fast)         (Standard)     (Economy)
Pipeline                           6 stages       5 stages       None
HW multiplier and barrel shifter   1 cycle        3 cycles       Emulated in software
Branch prediction                  Dynamic        Static         None
Instr. cache                       Configurable   Configurable   None
Data cache                         Configurable   None           None
Logic elements                     1400–1800      1200–1400      600–700

(i) Select the appropriate processor type for each module to execute it in the most efficient way.

(ii) Identify the optimum cache configuration for each module and customize the concerned processor accordingly.

(iii) Explore the need for custom instruction hardware for each module and implement the hardware where necessary.

(iv) Identify the performance-critical sections in each module and map them onto the fast on-chip memory to improve the performance while keeping the on-chip memory requirements as low as possible.

(v) Look for redundancies in the code and remove them to improve the performance.

In the following sections we explain these strategies one by one.

7.1. Choice of NiosII Implementations. The NiosII processor comes in three customizable implementations. These implementations differ in the FPGA resources they require and in their speeds. NiosII/e is the slowest and consumes the least amount of logic resources, while NiosII/f is the fastest and consumes the most logic resources. NiosII/s falls in between NiosII/e and NiosII/f with respect to logic resource requirements and speed.

Table 1 shows the salient features of the three implementations of the NiosII processor.

Note that code written for one implementation of the processor will run on either of the other two, albeit at a different execution speed. Hence changing from one processor implementation to another requires no modifications to the software code.

The choice of the right processor implementation depends on the speed requirements of a particular application module and on the availability of sufficient FPGA logic resources. Optimization of the architecture trades off speed for resource saving, or vice versa, depending on the requirements of the application.

A second criterion for selecting a particular implementation of the NiosII processor is the need (or lack thereof) for instruction and data caches. For example, if we can achieve the required performance for a module without any cache, NiosII/e would be the right choice for running that module.

Figure 10: Kalman filter performance for different cache sizes (execution time in seconds versus D-cache size, 0–64 KB, for I-cache sizes of 4, 8, 16, 32, and 64 KB, NiosII/F at 100 MHz).

Figure 11: Kalman filter performance with 4 KB I-cache (execution time in ms for the overall Kalman filter and the subfunctions Mat add, Mat Mul, Mat sub, Mat trans, and Mat inv, with and without floating-point custom instructions, NiosII/F, all memory sections off-chip).

On the other hand, if a certain application module needs instruction and data caches to achieve a desired performance, NiosII/f would be chosen to run it. If an instruction cache alone enables the processor to run an application module with the desired performance, then we use NiosII/s for that module. The objective is to achieve the desired speed with the least possible amount of hardware.

7.2. I-Cache and D-Cache. The NiosII architecture supports cache memories on both the instruction master port (instruction cache) and the data master port (data cache). Cache memory resides on-chip as an integral part of the NiosII processor core. The cache memories can improve the average memory access time for NiosII processor systems that use slow off-chip memory, such as SDRAM, for program and data storage.

The cache memories are optional. The need for higher memory performance (and, by association, the need for cache memory) is application dependent. Many applications require the smallest possible processor core and can trade off performance for size. A NiosII processor core might include one, both, or neither of the cache memories. Furthermore, for cores that provide a data and/or instruction cache, the sizes of the cache memories are user-configurable. The inclusion of cache memory does not affect the functionality of programs, but it does affect the speed at which the processor fetches instructions and reads/writes data.

Figure 12: Cache behavior for the Gating module (execution time in seconds versus D-cache size, 0–64 KB, for I-cache sizes of 4, 8, 16, 32, and 64 KB, NiosII/F at 100 MHz).

The optimal cache configuration is application specific. For example, if a NiosII processor system includes only fast on-chip memory (i.e., it never accesses the slow off-chip memory), an instruction or data cache is unlikely to offer any performance gain. As another example, if the critical loop of a program is 2 KB but the size of the instruction cache is 1 KB, this instruction cache will not improve execution speed. In fact, an instruction cache may degrade performance in this situation [17]. We must determine the optimum instruction and data cache sizes that are necessary for achieving the desired performance for each module.

Both the instruction and data cache sizes for NiosII/f can range from 0 KB to 64 KB in the discrete steps 0 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, and 64 KB. We experimented with various combinations of I-cache and D-cache sizes to determine the optimum cache sizes for each module. In the following sections we discuss the outcome of these experiments and the guidance that we took from them.

7.2.1. Kalman Filter Cache Requirements. Using the performance counter with a NiosII/F processor, we measured the performance of the Kalman filter with different instruction and data cache sizes. Figure 10 shows the influence of the I-cache and D-cache sizes on the processor time of the Kalman filter running on NiosII/f with a 100 MHz clock using the off-chip RAM.
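For reference, a measurement of this kind can be written as in the sketch below, assuming the SOPC system contains Altera's performance counter peripheral (the PERFORMANCE_COUNTER_BASE macro comes from the generated system.h, and its name depends on the component instance) and a kalman_filter_iteration() function standing in for the code under test.

/* Timing one Kalman iteration with the Altera performance counter. */
#include "system.h"
#include "altera_avalon_performance_counter.h"

#define KALMAN_SECTION 1

extern void kalman_filter_iteration(void);   /* code under test (assumed) */

void measure_kalman(void)
{
    PERF_RESET(PERFORMANCE_COUNTER_BASE);
    PERF_START_MEASURING(PERFORMANCE_COUNTER_BASE);

    PERF_BEGIN(PERFORMANCE_COUNTER_BASE, KALMAN_SECTION);
    kalman_filter_iteration();
    PERF_END(PERFORMANCE_COUNTER_BASE, KALMAN_SECTION);

    /* Print cycle counts and times, given the CPU clock frequency. */
    perf_print_formatted_report((void *)PERFORMANCE_COUNTER_BASE,
                                ALT_CPU_FREQ, 1, "Kalman");
}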

Two very important conclusions can be drawn from this figure. First, whatever the I-cache or D-cache size, the processor time does not exceed 15 ms. Second, beyond a 16 KB I-cache and a 2 KB D-cache, the execution time is mostly independent of the D-cache size. Based on these observations, we can say that 16 KB is the optimum I-cache size for the processors executing the Kalman filters. However, as mentioned earlier, for tracking a maximum of 20 obstacles we need 20 of these processors. Viewed in isolation, 16 KB may not seem a large amount of memory, but replicating it 20 times is not practical. To find out the total amount of memory required by this configuration, we compiled a QuartusII project with a NiosII/f having a 16 KB I-cache. The total on-chip block memory used by a single processor accounted for 7% of the memory available on our FPGA (StratixII EP2S60). Besides, we have to keep in mind that the other processors in the system also have on-chip memory requirements. Consequently we have to settle for a smaller I-cache, and hence a lower speed, to avoid this prohibitive on-chip memory usage.
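To put this constraint in numbers: at roughly 7% of the on-chip block memory per processor, 20 such Kalman processors would demand about 140% of the FPGA's block memory, and the 20 I-caches alone (20 × 16 KB = 320 KB) would already exceed the 318 KB total stated in Section 6.2, before counting the Gating, Munkres, and Track Maintenance processors.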

The good news here is that even with a 4 KB I-cache and no D-cache, the processor time is below the 25 ms threshold. Thus an I-cache of 4 KB is the right choice for the Kalman filters in these circumstances. Furthermore, since we do not use a D-cache, replacing NiosII/f by NiosII/s helps reduce the logic size from the 1400–1800 LEs range down to the 1200–1400 range, which amounts to a sizeable gain in size considering the 20 processors for the filters.

Figure 11 shows the performance of the Kalman filter on NiosII/s with a 4 KB I-cache, no D-cache, a 100 MHz clock, and using off-chip memory exclusively. Even with all memory sections kept in the off-chip device and no floating-point custom instructions, the runtime is around 15 ms. Thus we can conserve the scarce on-chip memory by using only 4 KB of I-cache for the processors running the Kalman filters without slowing the system down beyond tolerable limits. The on-chip block memory usage in this case drops to only 3% of that exploitable on the FPGA, a drop of more than 50%. For 20 Kalman processors the total on-chip memory usage is then 60% of that available to the user.

7.2.2. Gating Module Cache Requirements. The gating module's behavior with respect to the I-cache and D-cache is shown in Figure 12. A remarkable speedup is observed when the I-cache size changes from 4 KB to 8 KB, and again when it changes from 8 KB to 16 KB. Beyond 16 KB the speedup from the I-cache is insignificant.

The D-cache size does not matter much as long as it is more than zero. The overall processor runtime is at its minimum (70 ms) when the I-cache size is 16 KB and the D-cache size is 2 KB. Therefore, the right I-cache and D-cache sizes for the Gating module are 16 KB and 2 KB, respectively. The total on-chip memory usage for the processor with this configuration is 8% of that available on the FPGA. This also includes the memory used by the internal registers of the processor.

Using these cache sizes we charted the performance of the processor while varying the number of obstacles from 2 through 20, as shown in Figure 13. The Innov d and Innov a calculators are two subroutines used by the Gate Mask Generator function to calculate the distance and angle innovations. The sum of the times taken by these two subroutines is roughly equal to the time taken by the Gate Mask Generator. The Gate Checker and the Gate Mask Generator functions are in turn called by Cost Mat Gen, which is the top-level function of the Gating module. Cost Mat Gen thus represents the overall behavior of the whole Gating module.

Figure 13: Gating module performance with 16 KB I-cache and 2 KB D-cache (execution time in ms versus number of obstacles, 2–20, for Cost Mat Gen, Gate Checker, Gate Mask Generator, and the Innov d and Innov a calculators, NiosII/F at 100 MHz using off-chip SDRAM).

Although the overall runtime for 20 obstacles is at its minimum (70 ms) for the given configuration, it is still much higher than the 25 ms we are aiming for. In Sections 7.3.2 and 7.4.2 we discuss the techniques employed for further improving this execution time.

7.2.3. Munkres Algorithm Cache Requirements. Using a cost matrix with floating-point elements and a range of instruction and data cache sizes, the Munkres algorithm showed the behavior depicted in Figure 14.

The first observation here is that when the D-cache size is more than zero, the runtime decreases profoundly whatever the I-cache size. Looking closely at the figure, we can eliminate 4 KB from the list of candidates for the I-cache size. An 8 KB I-cache along with a 16 KB D-cache results in the minimum execution time, that is, 71.07 ms. Hence this is the optimum I-cache/D-cache combination for this module. A NiosII system with these cache sizes uses 9% of the on-chip block memory available on the FPGA.

Figure 15 shows the performance of the algorithm using this system composition for numbers of obstacles ranging from 2 to 20.

We notice here that the two main contributors to the total runtime are Step 4 and Step 6. This is because these two functions contain nested loops and are invoked multiple times during the solution-finding process.

Figure 14: Cache performance for the Munkres algorithm (execution time in seconds versus D-cache size, 0–64 KB, for I-cache sizes of 4, 8, 16, 32, and 64 KB, NiosII/F at 100 MHz, 20 obstacles).

Figure 15: Munkres algorithm performance with 8 KB I-cache and 16 KB D-cache (execution time in ms versus number of obstacles, 2–20, for the overall call to Munkres and Steps 1 through 6, NiosII/F at 100 MHz using off-chip SDRAM).

The overall runtime for 20 obstacles is 71 ms, which is higher than the 25 ms bound. We need to optimize the processor further to decrease this runtime. In Sections 7.3.3, 7.4.3, and 7.5 we explain the steps taken to achieve this goal.

7.3. Floating Point Custom Instructions. The floating-point custom instructions, optionally available on the NiosII processor, implement single-precision floating-point arithmetic operations in hardware. They accelerate floating-point operations in NiosII C/C++ applications. The basic set of floating-point custom instructions includes single-precision floating-point addition, subtraction, and multiplication. Floating-point division is available as an extension to the basic instruction set.

The NiosII software development tools recognize C code that takes advantage of the floating-point instructions present in the processor core. When the floating-point custom instructions are present in the target hardware, the NiosII compiler generates code that uses the custom instructions for floating-point operations, including addition, subtraction, multiplication, division, and the newlib math library [14].

The best choice for a hardware design depends on a balance among floating-point usage, hardware resource usage, and system performance. While the floating-point custom instructions speed up floating-point arithmetic, they add substantially to the size of the hardware design. When resource usage is an issue, it is advisable to rework the algorithms to minimize floating-point arithmetic (see Section 7.5).

We used the floating-point custom instructions in the processors to assess the trade-offs between performance and hardware size for each processor. Sections 7.3.1, 7.3.2, and 7.3.3 examine the outcome of this assessment and the recommendations based thereon.

7.3.1. Kalman Filter and Floating Point Custom Instructions. As mentioned earlier, the Kalman filter's runtime never exceeds 15 ms, so there is no immediate need to accelerate it further at the cost of precious FPGA resources. Nevertheless, we tested the impact of the floating-point custom instructions on the Kalman filter's performance to better understand the trade-offs and to explore opportunities for eventual future optimization. Figure 11 shows the results of these tests.

An overall speedup of more than 50% is achieved in comparison to the scenario where no floating-point custom instructions are used. The most significant improvement is witnessed in the case of the Mat Mul subfunction. This improvement can be attributed to two factors: first, Mat Mul relies heavily on floating-point multiplication, and second, it is called 11 times in a single iteration of the filter algorithm. Floating-point custom instructions are most effective in such situations, hence this remarkable improvement. This speedup comes at the cost of bulkier hardware: the hardware size increases by 8% when the floating-point custom instructions are used. We stick to our earlier decision of using a regular NiosII/s with a 4 KB I-cache and no other add-ons for the Kalman filter in the present work. Since the use of the floating-point instructions reduces the execution time for the Kalman filter considerably, in our future work we will take this option to process more than one target per processor.

7.3.2. Gating Module and Floating Point Custom Instructions. In the case of the Gating Module, the use of the floating-point custom instructions is a necessity rather than an option. The reason is that even with the optimum cache size selection, the Gating Module takes 70 ms to execute. Moreover, the Gating module runs on only one processor, so we do not have to replicate the floating-point custom instruction hardware. Figure 16 shows the performance of the Gating Module after the floating-point custom instructions are added to the processor.

Floating-point custom instructions on the NiosII processor for the Gating Module improve the overall performance by approximately 50%. If we compare Figure 16 with Figure 13, we notice two interesting differences. The first and very obvious difference is the drop of the overall runtime for 20 obstacles from 70 ms to 37 ms.

The second difference is that the curve for the Gate Checker, which was earlier above those of Innov a and Innov d, is now below them. This change in behavior is due to the fact that, in addition to floating-point multiplication and division, the Gate Checker uses the sqrt() function of the ANSI C math library. The sqrt() function itself relies on multiply and divide operations internally. Hence the floating-point custom instructions improve the performance of the Gate Checker more than that of Innov a and Innov d, which do not use the sqrt() function.
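To illustrate why sqrt() appears on this path, the following sketch shows a gate check of the kind described, under the simplifying assumption of a circular gate; the actual gate geometry and function signatures in the application may differ.

/* Illustrative gate check: accept an observation if its distance from
 * the predicted position lies within the gate radius. The circular
 * gate is an assumption made for this sketch. */
#include <math.h>

int gate_check(double x_obs, double y_obs,
               double x_pred, double y_pred, double gate_radius)
{
    double dx = x_obs - x_pred;
    double dy = y_obs - y_pred;
    return sqrt(dx * dx + dy * dy) <= gate_radius;  /* sqrt() from math.h */
}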

Although by using the floating-point custom instructions we managed to bring the execution time for the Gating Module down from 70 ms to 37 ms, we are still above our 25 ms target. In Section 7.4.2 we explore further possibilities for improving the performance of the Gating Module.

7.3.3. Munkres Algorithm and Floating Point Custom Instructions. Floating-point custom instructions bring the Munkres algorithm's execution time from 71 ms down to 47 ms for 20 obstacles, as shown in Figure 17.

Although this is a 33.8% improvement, 47 ms is almost twice the time we aim to attain, that is, 25 ms. This motivates us to look for ways to decrease this time further. To reach this goal, we employ several techniques, as explained in Sections 7.4.3 and 7.5.

7.4. On-Chip versus Off-Chip Memory Sections. HAL-based systems are linked using an automatically generated linker script that is created and managed by the NiosII IDE. The linker script controls the mapping of the code and the data within the available memory sections. It creates standard code and data sections (.text, .data, and .bss), plus a section for every physical memory device in the system.

Typically, the .text section is reserved for the program instructions. The .data section is the part of the object module that contains initialized static data, for example, initialized static variables, string constants, and so forth. The .bss (Block Started by Symbol) section defines the space for noninitialized static data. The heap section is used for dynamic memory allocation, for example, when malloc() or new are used in C or C++ code, respectively. The stack section is used for holding the return addresses (program counter) when function calls occur.

In general, the NiosII design flow automatically specifies a sensible default partitioning. However, we may wish to change the partitioning in certain situations. For example, to improve the performance, we can place performance-critical code and data in the fast on-chip RAM. In such cases, we have to allocate the memory sections manually.

Figure 16: Gating module performance with 16 KB I-cache, 2 KB D-cache, and floating-point custom instructions (execution time in ms versus number of obstacles, 2–20, for Cost Mat Gen, Gate Checker, Gate Mask Generator, and the Innov d and Innov a calculators, NiosII/F at 100 MHz).

Figure 17: Munkres algorithm performance with 8 KB I-cache, 16 KB D-cache, and floating-point custom instructions (execution time in ms versus number of obstacles, 2–20, for the overall call to Munkres and Steps 1 through 6, NiosII/F at 100 MHz using off-chip SDRAM).

We can control the placement of the .text, .data, heap, and stack memory partitions by altering the NiosII system library or BSP settings. By default, the heap and the stack are placed in the same memory partition as the .rwdata section. We can place any of the memory sections in the on-chip RAM if needed to achieve the desired performance.
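As a concrete sketch of such manual placement, a performance-critical buffer can be pinned to on-chip RAM with the GCC section attribute, assuming the SOPC system contains an on-chip memory component named onchip_ram (the linker script creates a section named after each memory device); the variable name is illustrative.

/* Pin a performance-critical array into the fast on-chip memory.
 * ".onchip_ram" assumes a memory component of that name. */
static float kalman_scratch[256] __attribute__((section(".onchip_ram")));

The stack and heap, by contrast, are relocated through the system library/BSP settings mentioned above rather than through code changes.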

Ideally we would put all the memory sections in the fast on-chip memory, but the amount of on-chip memory in the FPGA is limited. Hence we have to rely greatly on the off-chip SDRAM or SSRAM. However, accessing the off-chip memory is inherently far slower than accessing the on-chip memory. Moreover, different processors would have to go through the arbitration logic to access the shared off-chip memory device, which would increase the memory access time even further. Consequently, on the one hand we cannot use the off-chip memory exclusively, since it would slow the system down beyond the acceptable limits. On the other hand, we have to minimize our dependence on on-chip memory for each processor due to its scarcity. We therefore have to balance our reliance on dedicated on-chip memory and the shared off-chip memory without compromising the performance too much.

Table 2: Memory requirements of various application modules.

Section name                     Memory footprint

Kalman filter
  Whole code + initialized data  81 KB
  .text section alone            69.6 KB
  .data section alone            10.44 KB
  stack section alone            approximately 2 KB
  heap section alone             approximately 1 KB

Gating module
  Whole code + initialized data  63 KB
  .text section alone            51.81 KB
  .data section alone            8.61 KB
  stack section alone            approximately 2 KB
  heap section alone             approximately 1 KB

Munkres algorithm
  Whole code + initialized data  62 KB
  .text section alone            52.34 KB
  .data section alone            10.44 KB
  stack section alone            approximately 2 KB
  heap section alone             approximately 1 KB

Compiling the application modules in the NiosII IDE gives us an estimate of the memory needs of these modules. We selected the appropriate compiler compression options to generate compact object code for each module. Table 2 summarizes the memory requirements of all the application modules.

We can see in this table that the memory requirements of the whole code and the .text sections of all the modules are too high to be accommodated in the on-chip memory. However, if a certain module uses malloc() or new abundantly, placing the heap section in the on-chip memory can improve its speed by a large margin. Similarly, if a module makes frequent calls to other functions, putting the stack section in the on-chip memory can help achieve a higher execution speed for that module.

We performed experiments by placing the memory sections for the different modules in the off-chip and the on-chip memories and observed some interesting results. These results are discussed in the following sections.


7.4.1. Kalman Filter and Memory Sections. Although the Kalman filter takes 15 ms with only a 4 KB I-cache and no further optimization, we investigated the prospects of improving it further through on-chip memory placement. The outcome of this investigation is summarized in Figure 18. As before, Kalman represents the overall algorithm and the other bars are its constituent subfunctions. These results were obtained with a 4 KB I-cache and using floating-point instructions.

Even with all memory sections in the off-chip device, the runtime is 6.35 ms, while moving only the stack section to the on-chip memory reduces this time by more than 50%. Since the stack section requires only 2 KB, we can bring the time down to 3.2 ms by connecting 2 KB of dedicated on-chip memory to the processors for the stack and using a NiosII/S with a 4 KB I-cache and floating-point custom instructions. One of our experiments showed that if we use a NiosII/f with a 16 KB I-cache and a 2 KB D-cache, with all the other optimizations implemented, we can reduce the processing time for the filter to 1 ms. This opens up a new avenue for our future work, where we shall route several targets into one Kalman filter to reduce the number of processors for the filters. With this arrangement we have two options. We can reduce the number of processors for the filters from 20 to 2, thereby losing some of the flexibility. Alternatively, we can run all 20 filters on separate processors and guarantee the flexibility of being able to switch to other types of filters instead of the Kalman filter, depending on the target characteristics. At the moment we use the latter option.

7.4.2. Gating Module and Memory Sections. The Gating Module's performance for the different memory placement experiments is shown in Figure 19. Here we can deduce that a minimum execution time of 22 ms for 20 obstacles can be achieved by keeping all the memory sections in the on-chip memory. But this would require the on-chip RAM to be at least 61 KB, which, combined with the I-cache and D-cache, adds up to 79 KB. Obviously this is a very high requirement considering the limited amount of on-chip memory. The next best solution, 23 ms, is obtained when we place the stack and the heap sections in the on-chip memory. There is a very small speed loss, but in this case only 3 KB of dedicated on-chip memory is sufficient to obtain this speedup. Clearly this is a considerable saving in on-chip memory compared to the earlier requirement of 79 KB. So the Gating Module can operate satisfactorily using a NiosII/F processor with a 16 KB I-cache, a 2 KB D-cache, 3 KB of dedicated on-chip RAM, and floating-point custom instructions.

The Innov d and Innov a calculators together account for more than half the execution time taken by the Gating module (cf. Figure 16). Executing these two on separate processors in parallel will pave the way for scaling the system to more than 20 targets. As an alternative scaling solution, we are currently experimenting with DSP-VLIW processors to exploit data-level parallelism and hardware accelerators, since after the optimizations we have enough space available for adding more circuitry (see Section 10).

7.4.3. Munkres Algorithm and Memory Sections. Placing the various memory sections on-chip does not have a noteworthy influence on the Munkres algorithm's performance, although there is some improvement, as shown in Figure 20. We gain only 6 ms if all the memory sections are put on-chip. The next best gain is achieved by putting the heap on-chip; this is due to the use of a few malloc() statements in the code. While neither of these gains is enough to reduce the execution time below 25 ms, the former is not even feasible given the memory footprint of the algorithm. We have to look elsewhere for a practicable solution. The following section explains our approach to this issue.

7.5. Floating Point versus Integer Cost Matrix for Munkres Algorithm. The Munkres algorithm operates on the cost matrix iteratively to find an optimum solution. It looks for the minimum value in every column and row of the cost matrix such that only one value in each row and column is selected. It arrives at a solution when the sum of the selected elements of the cost matrix reaches its minimum. This procedure remains the same whether the elements of the cost matrix are floating-point numbers or integers. We found that if we truncate the fractional part of the floating-point elements of the cost matrix, the final solution is the same as in the case of the floating-point cost matrix. Hence we can replace the floating-point cost matrix by a "representative" integer cost matrix without sacrificing the accuracy of the final solution. This does not require that all the elements of the cost matrix be different from one another; the algorithm still finds a unique solution even if all the elements of the cost matrix have the same numerical value, so "distinct" integer values in the cost matrix are not required. The advantage of this manipulation, however, is that with the integer cost matrix the mathematical operations become simpler and faster, reducing the runtime of the algorithm by a large margin. Additionally, using an integer cost matrix obviates the need for the floating-point custom instruction hardware. Consequently the size of the processor is reduced by 8%.
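A minimal sketch of this conversion is shown below; the function and matrix names and the 20-target bound are illustrative, and the C cast performs exactly the truncation described above.

/* Build the "representative" integer cost matrix by truncating the
 * fractional parts of the floating-point costs. */
#define MAX_TARGETS 20

void truncate_cost_matrix(const float fcost[MAX_TARGETS][MAX_TARGETS],
                          int icost[MAX_TARGETS][MAX_TARGETS], int n)
{
    int i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            icost[i][j] = (int)fcost[i][j];   /* truncates toward zero */
}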

We made the necessary modifications to the Munkres algorithm and the cost matrix generating function to incorporate this rearrangement. A glimpse of the advantage of these transformations can be seen in Figure 21, which shows the optimal cache configuration for the integer version of the Munkres algorithm.

While an 8 KB I-cache and a 16 KB D-cache are certainly still the best choices, the point worth noting here is that with the integer cost matrix the runtime for the overall algorithm drops to 24 ms, as opposed to 82 ms with the floating-point cost matrix (refer to Figure 14). This drop takes place while all the memory sections are placed in the off-chip SDRAM.

The Munkres algorithm analysis shows the following facts. Step 4 and Step 6 are the most time-consuming subfunctions of the algorithm, in that order (cf. Figure 17). Although, after optimization, a single processor executes the algorithm for 20 targets within the required time interval, to scale the system up to more than 20 targets these subfunctions can be executed on separate processors in parallel, reducing the execution time of the algorithm. In a similar fashion to the Gating Module, as an alternative we are currently experimenting with DSP-VLIW processors to exploit data-level parallelism and hardware accelerators.

Figure 18: Effects of on-chip and off-chip memory sections on Kalman filter performance (execution time in ms for the subfunctions Mat inv, Mat trans, Mat sub, Mat Mul, and Mat add, with all memory sections off-chip, with the .text, .data, heap, or stack section individually on-chip, and with all memory sections on-chip, NiosII with floating-point custom instructions).

Figure 19: Effects of on-chip and off-chip memory sections on the Gating module (execution time in ms versus number of obstacles, 2–20, with all memory sections off-chip, heap on-chip, heap and stack on-chip, and all memory sections on-chip, NiosII/F with 16 KB I-cache, 2 KB D-cache, and a 100 MHz clock).

7.6. Track Maintenance. So far we have not mentioned the track maintenance block of the MTT application in the context of optimization. The reason for this deliberate omission is that this block requires a very short processing time: a simple NiosII/e processor executes it in 8 ms. In the future we may even eliminate this processor and run the track maintenance block as a second task on one of the other processors.

Figure 20: Effects of on-chip and off-chip memory sections on Munkres algorithm performance for 20 obstacles (execution time in ms, in the 40–48 ms range, with all memory sections off-chip, with the .text, .data, heap, or stack section individually on-chip, and with all memory sections on-chip).

Figure 21: Cache behavior for the Munkres algorithm with integer cost matrix (execution time in seconds versus D-cache size, 0–64 KB, for I-cache sizes of 4, 8, 16, 32, and 64 KB, NiosII/F at 100 MHz, 20 obstacles).

8. Discussion

After the success of the last optimization of the Munkres algorithm discussed in Section 7.5, we investigated the other modules for similar optimizations. We found that this technique cannot be extended to all the modules in the application, for the following reasons.

(i) The Kalman filter calculates the predicted states, prediction error covariances, estimated states, and estimation error covariances for the targets. The differences in the values of these quantities from one radar scan to another are very small and occur to the right of the decimal point. It takes hundreds of scans for these changes to flow over to the left of the decimal point. Hence the integer parts of these floating-point numbers remain unchanged for hundreds of scans. Nevertheless, these small differences play an important role not only in the filter itself but also in the Gating Module.

In the filter, the estimated state and estimation error covariance are fed back to the prediction stage of the filter. The prediction stage uses them as the basis of the predictions for the next scan. If we used just the integer parts of these quantities, there would be no change in the estimated values for hundreds of scans. Obviously, this would introduce an error into the predictions. Due to the cyclic feedback between the prediction and correction stages of the filter, an avalanche of errors would be generated within a few seconds.

(ii) The predicted states and the prediction error covariances are also used by the Gating Module to locate the centers of the probability gates and to calculate the differences between the measured and the predicted target coordinates, that is, innov d and innov a, respectively. If we used only the integer parts of the predicted and measured coordinates, two catastrophic errors would be introduced into the system. First, because of the nonchanging integer parts of the predicted coordinates, the gates would be centered at the same fixed locations for hundreds of scans. Second, for the same reason, innov d and innov a would remain zero for hundreds of cycles. Zero innovations mean that the predicted coordinates are exactly identical to the measured coordinates, which is practically impossible.

(iii) The Gating Module uses the prediction error covariance to calculate the dimensions of the probability gates. Using the constant integer part of the covariance would fix the gate dimensions to a constant size for hundreds of scans. This again is unrealistic and would inject even more error into the system.

(iv) For the Munkres algorithm (the Assignment Solver) the case is different. The Assignment Solver is the last step of the application loop. By the time the application reaches this step, most of the floating-point operations have already been completed, resulting in the Cost Matrix. The output of the Assignment Solver is the matrix X, which has either 1's or 0's as its elements. The 1's in the matrix are used to identify the most probable observation-prediction pairs. No arithmetic operations are performed on the matrix X.

Altera provides a tool called the C2H compiler, which is intended to transform C code into a hardware accelerator. At one stage in our work we tried this tool, but it turned out to have some serious limitations. It can be used only for code operating on integers, so, for the reasons stated above, we could use it only for the Munkres algorithm. Moreover, the tool can accelerate only a single function, and that function must not involve complex computations. So we could not use it for Step 4 and Step 6 of the algorithm, where we needed it most. The tool simply stops working when we try to accelerate either of these two functions.

In the case of small functions (like Step 3) where it does work, the hardware size of the accelerator is almost half that of the processor to which it is attached, while the speedup is nominal. In brief, if this tool is improved to remove these limitations, it can be very useful. In its current state it is far from its stated goals.

9. Related Work

To our knowledge, comprehensive literature about the implementation of a complete MTT system in an FPGA does not exist. Works about the application of MTT to DASs are even harder to find.

Some work has been done on different isolated components of the MTT system, but in different contexts. For example, an implementation of the Kalman filter only is proposed in [18]. It is not only limited to the filter, it is also a fully hardware implementation. As mentioned in the introduction, fully hardware designs lack the flexibility and programmability needed by ever-evolving modern embedded applications. Moreover, the authors report two alternative implementations of the Kalman filter, namely the Scalar-Based Direct Algorithm Mapping (SBDAM) and the Matrix-Based Systolic Array Engineering (MBSAE), consuming 4564 and 8610 logic cells for a single filter, respectively. Apart from the large sizes, the internal components of both implementations are manually organized and reorganized to obtain the desired performance. This is obviously not scalable and repeatable in a complex system like ours, where the filter is not the only component to be optimized.

An attempt to implement an MTT system in hardware for a maritime application is documented in [19]. In addition to being a completely hardware implementation, the work presented there is inconclusive.

The data association aspect of MTT has been dealt with nicely in [11], but the physical implementation of the system is not a consideration in this work. Only MATLAB simulations are reported for that part of the MTT.

Although the title of [20] sounds very close to our work, this work describes the theory of the Extended Kalman Filter (EKF) with a smoothing window. The paper discusses the velocity estimation of slow-moving vehicles and emphasizes the necessity of reducing the linearization errors in the process. While the paper presents a viable solution to the problem of linearization errors in the EKF, the physical implementation of the EKF or the tracking system does not figure among the objectives of the work.

A systolic-array-based FPGA implementation of the Kalman filter only is reported in [21]. This work concentrates on the use of a matrix manipulation algorithm (Modified Faddeev) for reducing the complexity of the computation. This article again presents an interesting account of implementing the Kalman filter in an efficient way. In cases where very fast filtering is the main objective, this may be a good solution.

In fact, software forms of algorithms like the EKF [20] and the Modified Faddeev based implementation of the Kalman filter [21] can be easily integrated into our system. For example, the EKF is useful in situations where a target exhibits abrupt changes in its dynamic behavior, as in hilly regions. Similarly, other algorithms like [21] can be added if required. So the works discussed above can be considered complementary to our work rather than competitors.

Most of the available works treat the individual components of the MTT (mainly the Kalman filter) in isolation. However, putting these and other components together to design a coherent MTT application, and adapting it to automotive safety use, is not a trivial task.

Our work is unique in several aspects. In contrast to the works mentioned above, we consider a complete MTT system implementation. Our reconfigurable MPSoC architecture is inherently flexible, programmable, and scalable. Thus it can evolve very easily with advances in technology and with improvements in application algorithms. Moreover, the use of several concurrently running processors meets the overall real-time deadlines. Several low-frequency processors running concurrently consume less power than a single processor with a high clock frequency doing the same job [5]. The reconfigurability of the processors and other components in our design allows them to be customized according to application requirements while keeping the hardware size as small as possible. The system we propose is a complete plug-and-play solution that can be easily integrated with the existing electronic systems onboard a vehicle.

10. Summary and Conclusion

We presented the procedure we adopted for designing and optimizing an application-specific MPSoC. The Multiple Target Tracking (MTT) application is designed for Driver Assistance Systems (DAS). These systems are used in vehicles for collision avoidance and early warning, assisting the driver.

First, we described a general view of the MTT application, and then we presented our own approach to the design and development of the customized application for Driver Assistance. We developed the mathematical models for the application and coded the application in ANSI C. We divided the application into easily manageable modules that can be executed in parallel.

After developing the application, we profiled it to identify performance bottlenecks and dependencies among the application modules. This helped us in allocating processing resources to the application modules and in laying out our optimization strategies. Using three different hardware implementations of the NiosII soft-core embedded processor and other components, we devised a heterogeneous MPSoC architecture for the system.

To formulate our optimization strategies we also identified the constraints to be met. The constraints include the 25 ms time limit for the application execution, the limited amount of available on-chip memory, and the size of the system hardware.

To avoid overusing the on-chip memory, we optimized the I-cache and D-cache sizes for each application module. Understanding the I-cache and D-cache requirements not only helped us in accelerating the system but also in selecting the right configuration of the NiosII processor for each module.

The optimum cache configurations reduced the execution times by at least 50%. The Gating Module and the Assignment Solver needed further acceleration to arrive at the 25 ms cut-off set by the radar PRT. We incorporated the floating-point custom instruction hardware in the relevant processors to accelerate them further. Floating-point custom instructions reduced the runtime from 70 ms to 37 ms (47% speedup) for the Gating Module and from 71 ms to 47 ms (34% speedup) for the Assignment Solver. To bring these times below 25 ms, we needed to speed these modules up even more.

Table 3: Summary of the final system.

                         Kalman   Gating   Munkres   Track Maint.
Number of proc.          20       1        1         1
NiosII type              S        F        F         E
I-cache in KB            4        16       8         0
D-cache in KB            0        2        16        0
Local mem in KB          0        3        3         0
% Mem. used on FPGA      60       8        9         1
FP custom instructions   No       Yes      No        No
Run time in ms           15       23       24        8

Figure 22: System hardware size. FPGA resource usage in logic elements (LEs) on the StratixII EP2S60 (60,000 LEs total): filters 26,000; FIFOs and interconnects 12,000; gating module 1,726; assignment solver 1,700; track maintenance 600; unused 17,974.

Shifting the whole application to the fast on-chip memory could greatly improve the speed; however, it is not feasible due to the large memory footprint of the application and the limited amount of on-chip memory. We experimented with placing different memory sections, like the stack and the heap, in the fast on-chip RAM. Placing only the stack and the heap memory sections on-chip for the Gating Module brought the runtime down to 23 ms, which is below the 25 ms cut-off, and hence we settled for it.


For the Assignment Solver (Munkres algorithm) we could gain only 6 ms in runtime by putting the entire module in the on-chip memory. This gain is neither enough to get us to our goal, nor can we afford to put the entire module on-chip in the finalized system.

Exploring the algorithm, we found that its final output remains unchanged if we drop the fractional part of the floating-point elements of the input Cost Matrix. This manipulation of the input matrix reduced the runtime of the algorithm to 24 ms without compromising the accuracy of the final solution. It also allowed us to do away with the floating-point custom instructions. Consequently we use the lighter NiosII/s instead of the heavier NiosII/f for the assignment solver.

Speed was not the only objective in choosing the system components and the optimization strategies; we also wanted to keep the on-chip memory utilization and the hardware size in check. We traded speed for FPGA resource economy where we could afford it, for example, in the case of the Kalman filters.

Taking into account the results of the optimizations, we finalized the components and their respective features. Table 3 summarizes the salient features of the finalized architecture with reference to Figure 9. The whole system fits in a single StratixII EP2S60 FPGA. The design uses 42,000 of the 60,000 logic elements (LEs) available on the FPGA, as shown in Figure 22, and it meets the runtime constraints of the application.

References

[1] A. Techmer, "Application development of camera-based driver assistance systems on a programmable multi-processor architecture," in Proceedings of the IEEE Intelligent Vehicles Symposium (IV '07), pp. 1211–1216, Istanbul, Turkey, June 2007.

[2] M. Beekema and H. Broeders, "Computer Architectures for Vision-Based Advanced Driver Assistance Systems," http://www.xs4all.nl/∼hc11/paper ca st.pdf.

[3] STMicroelectronics and Mobileye, "STMicroelectronics and Mobileye deliver second generation system-on-chip for vision-based driver assistance systems," Press release, May 2008.

[4] S. Blackman and R. Popoli, Design and Analysis of Modern Tracking Systems, Artech House, Boston, Mass, USA, 1999.

[5] F. Schirrmeister (Imperas, Inc.), "Multi-core processors: fundamentals, trends, and challenges," in Proceedings of the Embedded Systems Conference, 2007, ESC351.

[6] E. Brookner, Tracking and Kalman Filtering Made Easy, John Wiley & Sons, New York, NY, USA, 1998.

[7] V. Nedovi, "Tracking moving video objects using mean-shift algorithm," Project Report, http://staff.science.uva.nl/∼vnedovic/MMIR2004/vnedovicProjReport.pdf.

[8] Z. Salcic and C.-R. Lee, "FPGA-based adaptive tracking estimation computer," IEEE Transactions on Aerospace and Electronic Systems, vol. 37, no. 2, pp. 699–706, 2001.

[9] R. E. Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, vol. 82, pp. 35–45, 1960.

[10] D. P. Bertsekas and D. A. Castañon, "A forward/reverse auction algorithm for asymmetric assignment problems," http://web.mit.edu/dimitrib/www/For Rev Asym Auction.pdf.

[11] P. Konstantinova et al., "A study of target tracking algorithm using global nearest neighbor approach," in Proceedings of the International Conference on Computer Systems and Technologies (CompSysTech '03), Sofia, Bulgaria, June 2003.

[12] Munkres' Assignment Algorithm, Modified for Rectangular Matrices, http://csclab.murraystate.edu/bob.pilgrim/445/munkres.html.

[13] G. Welch and G. Bishop, "An introduction to the Kalman Filter," 2001, http://www.cs.unc.edu/∼welch/kalman/.

[14] Altera Corporation, http://www.altera.com/literature/an/an391.pdf.

[15] R. Joost and R. Salomon, "Advantages of FPGA-based multiprocessor systems in industrial applications," in Proceedings of the 31st Annual Conference of the IEEE Industrial Electronics Society (IECON '05), pp. 445–450, Raleigh, NC, USA, November 2005.

[16] H. Penttinen, T. Koskinen, and M. Hännikäinen, "Leon3 MP on Altera FPGA," Final Project Report, Altera Innovate Nordic, August 2007.

[17] Altera Corporation, NIOS II Processor Reference Handbook, http://www.altera.com/literature/hb/nios2/n2cpu nii5v1.pdf.

[18] Z. Salcic and C.-R. Lee, "Scalar-based direct algorithm mapping FPLD implementation of a Kalman filter," IEEE Transactions on Aerospace and Electronic Systems, vol. 36, no. 3, part 1, pp. 879–888, 2000.

[19] Y. Boismenu, Etude d'une carte de tracking radar, Thèse de doctorat, Université de Bourgogne, Dijon, France, 2000.

[20] A. Goransson and B. Sohlberg, "Tracking low velocity vehicles from radar measurements," in Proceedings of the IASTED International Conference on Circuits, Signals, and Systems (CSS '03), pp. 51–55, Cancun, Mexico, May 2003.

[21] G. Chen and L. Guo, "The FPGA implementation of Kalman filter," in Proceedings of the 5th WSEAS International Conference on Signal Processing, Computational Geometry and Artificial Vision, Malta, 2005.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 105979, 20 pages
doi:10.1155/2009/105979

Research Article

A Prototyping Virtual Socket System-On-Platform Architecture with a Novel ACQPPS Motion Estimator for H.264 Video Encoding Applications

Yifeng Qiu and Wael Badawy

Department of Electrical and Computer Engineering, University of Calgary, Alberta, Canada T2N 1N4

Correspondence should be addressed to Yifeng Qiu, [email protected]

Received 25 February 2009; Revised 27 May 2009; Accepted 27 July 2009

Recommended by Markus Rupp

H.264 delivers streaming video in high quality for various applications. The coding tools involved in H.264, however, make its video codec implementation very complicated, raising the need for algorithm optimization and hardware acceleration. In this paper, a novel adaptive crossed quarter polar pattern search (ACQPPS) algorithm is proposed to realize an enhanced inter prediction for H.264. Moreover, an efficient prototyping system-on-platform architecture is also presented, which can be utilized for a realization of an H.264 baseline profile encoder with the support of an integrated ACQPPS motion estimator and related video IP accelerators. The implementation results show that the ACQPPS motion estimator can achieve very high estimated image quality, comparable to that from the full search method in terms of peak signal-to-noise ratio (PSNR), while keeping the complexity at an extremely low level. With the integrated IP accelerators and optimized techniques, the proposed system-on-platform architecture sufficiently supports H.264 real-time encoding at low cost.

Copyright © 2009 Y. Qiu and W. Badawy. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Digital video processing technology aims to improve the coding validity and efficiency of digital video images [1]. It involves the video standards and their realizations. With the joint efforts of ITU-T VCEG and ISO/IEC MPEG, H.264/AVC (MPEG-4 Part 10) has been established as the most advanced standard so far, targeting very high data compression. H.264 is able to provide good video quality at bit rates that are substantially lower than what previous standards need [2–4]. It can be applied to a wide variety of applications with various bit rates and video streaming resolutions, intending to cover practically all aspects of audio and video coding processing within its framework [5–7].

H.264 includes many profiles, levels, and feature definitions. There are seven sets of capabilities, referred to as profiles, targeting specific classes of applications: the Baseline Profile (BP) for low-cost applications with limited computing resources, widely used in videoconferencing and mobile communications; the Main Profile (MP) for broadcasting and storage applications; the Extended Profile (XP) for streaming video with relatively high compression capability; the High Profile (HiP) for high-definition television applications; the High 10 Profile (Hi10P), going beyond present mainstream consumer product capabilities; the High 4:2:2 Profile (Hi422P), targeting professional applications using interlaced video; and the High 4:4:4 Profile (Hi444P), supporting up to 12 bits per sample, efficient lossless region coding, and an integer residual color transform for RGB video. The levels in H.264 are defined as Level 1 to 5, each of which specifies the bit, frame, and macroblock (MB) rates to be realized in the different profiles.

One of the primary issues with H.264 video applications lies in how to realize the profiles, levels, tools, and algorithms featured in the H.264/AVC draft. Thanks to the rapid development of FPGA [8] techniques and of embedded software system design and verification tools, designers can utilize a hardware-software (HW/SW) codesign environment based on a reconfigurable and programmable FPGA infrastructure as a dedicated solution for H.264 video applications [9, 10].


The motion estimation (ME) scheme has a vital impact on H.264 video streaming applications and is the main function by which a video encoder achieves image compression. The block-matching algorithm (BMA) is an important and widely used technique to estimate the motion of a regular block and generate the motion vector (MV), which is the critical information for temporal redundancy reduction in video encoding. Because of its simplicity and coding efficiency, the BMA has been adopted as the standard motion estimation method in a variety of video standards, such as MPEG-1, MPEG-2, MPEG-4, H.261, H.263, and H.264. Fast and accurate block-based search techniques and hardware acceleration are highly demanded to reduce the coding delay and maintain satisfactory estimated video image quality. A novel adaptive crossed quarter polar pattern search (ACQPPS) algorithm and its hardware architecture are proposed in this paper to provide an advanced motion estimation search method with high performance and low computational complexity.
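For readers unfamiliar with the BMA, the following sketch shows the sum of absolute differences (SAD) criterion at the heart of block matching for a 16 x 16 macroblock; the frame layout (8-bit luma samples with a row stride) is an assumption, and a search algorithm such as ACQPPS differs from full search only in which candidate displacements it evaluates with this cost.

/* SAD cost of matching the current 16x16 block against one candidate
 * block in the reference frame; the displacement minimizing this cost
 * becomes the motion vector. */
#include <stdlib.h>

unsigned sad16x16(const unsigned char *cur, const unsigned char *ref,
                  int stride)
{
    unsigned sad = 0;
    int x, y;
    for (y = 0; y < 16; y++)
        for (x = 0; x < 16; x++)
            sad += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}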

Moreover, an integrated IP accelerated codesign system, built on an efficient hardware architecture, is also proposed. By integrating H.264 IP accelerators into the system framework, a complete system-on-platform solution can be set up to realize an H.264 video encoding system. Through the codevelopment and coverification of the system-on-platform, the architecture and IP cores developed by designers can be easily reused and transplanted from one platform to another without significant modification [11]. These factors make a system-on-platform solution outperform a pure software solution and remain more flexible than a fully dedicated hardware implementation for H.264 video codec realizations.

The rest of the paper is organized as follows. In Section 2, the H.264 baseline profile and its applications are briefly analyzed. In Section 3, the ACQPPS algorithm is proposed in detail, while Section 4 describes the hardware architecture for the proposed ACQPPS motion estimator. Furthermore, the hardware architecture and host interface features of the proposed system-on-platform solution are elaborated in Section 5, and the related techniques for system optimization are illustrated in Section 6. The complete experimental results are presented and analyzed in Section 7. Section 8 concludes the paper.

2. H.264 Baseline Profile

2.1. General Overview. The profiles and levels specify conformance points, which are designed to facilitate interoperability between the various applications of the H.264 standard that have similar functional requirements. A profile defines a set of coding tools or algorithms that can be utilized in generating a compliant bitstream, whereas a level places constraints on certain key parameters of the bitstream.

The H.264 baseline profile was designed to minimize computational complexity and provide high robustness and flexibility over a broad range of network environments and conditions. It is typically regarded as the simplest profile in the standard; it includes all the H.264 tools except the following: B-slices, weighted prediction, field (interlaced) coding, picture/macroblock adaptive switching between frame and field coding (MB-AFF), context adaptive binary arithmetic coding (CABAC), SP/SI slices, and slice data partitioning. This profile normally targets video applications with low computational complexity and low delay requirements.

For example, in the field of mobile communications, the H.264 baseline profile will play an important role, because its compression efficiency is doubled in comparison with the coding schemes currently specified by H.263 Baseline, H.263+, and the MPEG-4 Simple Profile.

2.2. Baseline Profile Bitstream. For mobile and videoconferencing applications, H.264 BP, MPEG-4 Visual Simple Profile (VSP), H.263 BP, and H.263 Conversational High Compression (CHC) are usually considered. In practice, H.264 outperforms all the other considered encoders for video streaming encoding: H.264 BP allows average bit rate savings of about 40% compared to H.263 BP, 29% compared to MPEG-4 VSP, and 27% compared to H.263 CHC [12].

2.3. Hardware Codec Complexity. The implementation complexity of any video coding standard heavily depends on the characteristics of the platform on which it is mapped, for example, FPGA, DSP, ASIC, or SoC. A basic analysis of the H.264 BP hardware codec implementation complexity can be found in [13, 14].

In general, the main bottleneck of H.264 video encoding is the combination of multiple reference frames and large search ranges.

Moreover, the H.264 video codec complexity ratio is on the order of 10 for basic configurations and can grow up to two orders of magnitude for complex ones [15].

3. The Proposed ACQPPS Algorithm

3.1. Overview of the ME Methods. For motion estimation, the full search algorithm (FS) of BMA exhaustively checks all possible block positions within the search window to find the best matching block with the minimal matching error (MME). It usually produces a globally optimal solution to the motion estimation, but demands a very high computational complexity.

To reduce the required operations, many fast algorithms have been developed, including the 2D logarithmic search (LOGS) [16], the three-step search (TSS) [17], the new three-step search (NTSS) [18], the four-step search (FSS) [19], the block-based gradient descent search (BBGDS) [20], the diamond search (DS) [21], the hexagonal search (HEX) [22], the unrestricted center-biased diamond search (UCBDS) [23], and so forth. The basic idea behind these multistep fast search algorithms is to check a few block points at the current step and restrict the search in the next step to the neighborhood of the point that minimizes the block distortion measure.

Page 101: 541420

EURASIP Journal on Embedded Systems 3

These algorithms, however, assume that the error surface of the minimum absolute difference increases monotonically as the search position moves away from the global minimum on the error surface [16]. This assumption is reasonable in a small region near the global minimum, but not absolutely true for real video signals. To avoid being trapped in an undesirable local minimum, some adaptive search algorithms have been devised, intending to achieve the global optimum or suboptimum with adaptive search patterns. One of those algorithms is the adaptive rood pattern search (ARPS) [24].

Recently, several valuable algorithms have been developed to further improve the search performance, such as the Enhanced Predictive Zonal Search (EPZS) [25, 26] and the Unsymmetrical-Cross Multi-Hexagon-grid Search (UMHexagonS) [27], which have even been adopted in the H.264 reference software as standard motion estimation algorithms. These schemes, however, are not especially suitable for hardware implementation, as their search principles are complicated. If a hardware architecture is required for the realization of an H.264 encoder, these algorithms are usually not regarded as efficient solutions.

To improve the search performance and reduce the computational complexity as well, an efficient and fast method, the adaptive crossed quarter polar pattern search (ACQPPS) algorithm, is therefore proposed in this paper.

3.2. Algorithm Design Considerations. It is known that a small search pattern with compactly spaced search points (SPs) is more appropriate than a large search pattern containing sparsely spaced search points for detecting small motions [24]. Conversely, a large search pattern has the advantage of quickly detecting large motions, avoiding being trapped into a local minimum along the search path, which leads to unfavorable estimation, an issue that the small search pattern encounters. It is therefore desirable to use different search patterns, that is, adaptive search patterns, in view of the variety of estimated motion behaviors.

Three main aspects are considered to improve or speed up the matching procedure of adaptive search methods: (1) the type of motion prediction; (2) the selection of the search pattern shape and direction; (3) the adaptive length of the search pattern. The first two aspects reduce the number of search points, and the last one gives a more accurate search result for a large motion.

For the proposed ACQPPS algorithm under the H.264 encoding framework, a median type of predicted motion vector, that is, the median vector predictor (MVP) [28], is produced to determine the initial search range. The shape and direction of the search pattern are adaptively selected, and the length (radius) of the search arm is adjusted to improve the search. Two main steps are involved in the motion search: (1) the initial search stage and (2) the refined search stage. In the initial search stage, some initial search points are selected to obtain an initial MME point. In the refined search, a unit-sized square pattern is applied iteratively to obtain the final best motion vector.
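As background for the MVP, H.264 derives the motion vector predictor as the component-wise median of the motion vectors of the neighboring blocks to the left (A), above (B), and above-right (C) of the current block. The minimal C sketch below illustrates this computation; it deliberately ignores the standard's detailed neighbor availability and substitution rules (an assumption made here for brevity; unavailable neighbors would first be replaced according to those rules).

    typedef struct { int x, y; } MV;

    /* Median of three values, computed as max(a, min(b, c)) after
     * ordering a <= b. */
    static int median3(int a, int b, int c)
    {
        if (a > b) { int t = a; a = b; b = t; }  /* ensure a <= b     */
        if (b > c) b = c;                        /* b = min(b, c)     */
        return (a > b) ? a : b;                  /* max(a, min(b, c)) */
    }

    /* Component-wise median of the neighboring MVs A, B, C. */
    static MV median_mvp(MV A, MV B, MV C)
    {
        MV p = { median3(A.x, B.x, C.x), median3(A.y, B.y, C.y) };
        return p;
    }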

3.3. Shape of the Search Pattern. To determine the subsequent search steps according to whether the current best matching point is positioned at the center of the search range, a new search pattern is devised to detect the potentially optimal search points in the initial search stage. The basic concept is to pick some initial points along a polar (circular) search pattern, where the center of the search circles is the current block position.

Under the assumption that the matching error surface increases or decreases monotonically, however, some redundant checking points may exist in the initial search stage; such points need not be examined under the assumption of a unimodal distortion surface. To reduce the number of initial checking points while keeping the probability of finding the optimal matching points as high as possible, a fractional, or quarter, polar search pattern is used accordingly.

Moreover, it is known that the accuracy of the motion predictor is very important to the adaptive pattern search. To improve the performance of the adaptive search, extra related motion predictors can be used besides the initial MVP. The extra motion predictors utilized by the ACQPPS algorithm only require an extension and a contraction of the initial MVP, which can be easily obtained. Therefore, with the crossing of the quarter circle and the motion predictors, the search method is equipped with adaptive crossed quarter polar patterns for efficient motion search.

3.4. Adaptive Directions of the Search Pattern. The search direction, defined by the direction of the quarter circle contained in the pattern, comes from the MVP. Figure 1 shows the possible patterns designed, and Figure 2 depicts how the direction of a search pattern is determined. The patterns employ the directional information of a motion predictor to increase the possibility of finding the best MME point for the refined search. To determine an adaptive direction of the search pattern, the following rules are applied.

(3.4.1) If the predicted MV (motion predictor) = 0, set up an initial square search pattern with a pattern size of 1 around the search center, as shown in Figure 2(a).

(3.4.2) If the predicted MV falls onto a coordinate axis, that is, PredMVy = 0 or PredMVx = 0, the pattern direction is chosen to be E, N, W, or S, as shown in Figures 1(a), 1(c), 1(e), and 1(g). In this case, the point at the initial motion predictor overlaps with an initial search point on the N, W, E, or S coordinate axis.

(3.4.3) If the predicted MV does not fall onto any coordinate axis, and Max{|PredMVy|, |PredMVx|} > 2∗Min{|PredMVy|, |PredMVx|}, the pattern direction is chosen to be E, N, W, or S, as shown in Figure 2(b).

(3.4.4) If the predicted MV does not fall onto any coordinate axis, and Max{|PredMVy|, |PredMVx|} ≤ 2∗Min{|PredMVy|, |PredMVx|}, the pattern direction is chosen to be NE, NW, SW, or SE, as shown in Figure 2(c).


Figure 1: Possible adaptive search patterns designed: (a) E pattern, (b) NE pattern, (c) N pattern, (d) NW pattern, (e) W pattern, (f) SW pattern, (g) S pattern, (h) SE pattern. Each pattern consists of the initial SPs along the quarter circle and the points at the predicted MV and its extension.

3.5. Size of the Search Pattern. To simplify the selection of the search pattern size, the horizontal and vertical components of the motion predictor are utilized. The size of the search pattern, that is, the radius of the designed quarter polar search pattern, is simply defined as

R = Max{|PredMVy|, |PredMVx|}, (1)

where R is the radius of the quarter circle, and PredMVy and PredMVx are the vertical and horizontal components of the motion predictor, respectively.

3.6. Initial Search Points. After the direction and size of a search pattern are decided, some search points are selected for the initial search stage. Each search point represents a block to be checked with intensity matching. The initial search points include (when the MVP is not zero):

(1) the predicted motion vector point;

(2) the center point of the search pattern, which represents the candidate block in the current frame;

(3) some points on the directional axis;

(4) the extension predicted motion vector point (the point with a prolonged length of the motion predictor) and the contraction predicted motion vector point (the point with a contracted length of the motion predictor).


Figure 2: (a) Initial SPs with a square pattern of size 1 when PredMV = 0; (b) N/E/W/S search pattern selected when Max{|PredMVy|, |PredMVx|} > 2∗Min{|PredMVy|, |PredMVx|}; (c) NW/NE/SW/SE search pattern selected when Max{|PredMVy|, |PredMVx|} ≤ 2∗Min{|PredMVy|, |PredMVx|}.

Table 1: A look-up table for the definition of the vertical and horizontal components of initial search points on the NW/NE/SW/SE axes.

R    |SPx|  |SPy|        R    |SPx|  |SPy|
0    0      0            6    4      4
1    1      1            7    5      5
2    2      2            8    6      6
3    2      2            9    6      6
4    3      3            10   7      7
5    4      4            —    —      —


Normally, if no overlapping exists, there will be a total of seven search points selected in the initial search stage, in order to find the point with the MME, which serves as the basis for the subsequent refined search stage.

If a search point is on the NW, NE, SW, or SE axis, the decomposed coordinates of that point satisfy

R = √((SPx)² + (SPy)²), (2)

where SPx and SPy are the horizontal and vertical components of a search point on the NW, NE, SW, or SE axis. Because |SPx| is equal to |SPy| in this case,

R = √2 · |SPx| = √2 · |SPy|. (3)

Obviously, neither |SPx| nor |SPy| is then an integer, whereas R is always an integer-based radius for block processing. To simplify the definition of a search point on the NW, NE, SW, or SE axis and reduce its computational complexity, a look-up table (LUT) is employed, as listed in Table 1. The values of |SPx| and |SPy| are predefined according to the radius R, so that they are integers. Figure 3 illustrates some examples of initial search points defined with the look-up table.

When the radius R > 20, the values of |SPx| and |SPy| can be determined by

|SPx| = |SPy| = Round(R/√2). (4)
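As an illustration of how the diagonal components are obtained, the following C sketch combines the Table 1 look-up with the rounding formula of (4). It is a minimal model that assumes (our assumption, since the excerpt of Table 1 stops at R = 10) that radii between 11 and 20 may also be served by the rounding formula, which reproduces the tabulated values exactly.

    #include <math.h>

    /* |SPx| = |SPy| for an initial search point on the NW/NE/SW/SE
     * axis.  R = 0..10 follows the LUT of Table 1; larger radii use
     * Round(R / sqrt(2)) from (4). */
    static int diag_sp_component(int R)
    {
        static const int lut[11] = { 0, 1, 2, 2, 3, 4, 4, 5, 6, 6, 7 };
        if (R >= 0 && R <= 10)
            return lut[R];
        return (int)lround(R / sqrt(2.0));
    }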

There are two initial search points related to the extended motion predictors: one with a prolonged length of the motion predictor (the extension version) and the other with a reduced length of the motion predictor (the contraction version). Two scaled factors are defined adaptively according to the radius R, so that the lengths of those two initial search points can be easily derived from the original motion predictor, as shown in Table 2. The scaled factors are chosen such that the initial search points related to the extension and contraction of the motion predictor are distributed reasonably around the motion predictor point, in order to obtain better motion predictor points.


Figure 3: (a) An example of the initial search points defined for the E pattern using the look-up table; (b) an example of the initial search points defined for the NE pattern using the look-up table. In each example, the point at the predicted MV is marked, and the components SPx and SPy of the initial SPs along the quarter circle of radius R are determined by the look-up table.

Table 2: Definition of the scaled factors for the initial search points related to the motion predictor.

R        Scaled factor for extension (SFE)        R        Scaled factor for contraction (SFC)
0 ∼ 2    3                                        0 ∼ 10   0.5
3 ∼ 5    2                                        >10      0.75
6 ∼ 10   1.5                                      —        —
>10      1.25                                     —        —

Therefore, the initial search points related to the motion predictor can be identified as

EMVP = SFE · MVP, (5)

CMVP = SFC · MVP, (6)

where MVP is a point representing the median vector predictor, SFE and SFC are the scaled factors for the extension and contraction, respectively, and EMVP and CMVP are the initial search points with the prolonged and contracted lengths of the predicted motion vector, respectively. If a horizontal or vertical component of EMVP or CMVP is not an integer after the scaling, the component value is truncated to an integer for video block processing.
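The scaled predictor points can be computed directly from Table 2 and (5)-(6). The C sketch below, in which the helper names are illustrative rather than taken from the paper, truncates non-integer components as described above.

    typedef struct { int x, y; } MV;    /* as in the earlier sketch */

    /* Scaled factor for the extension point (Table 2, left half). */
    static double sfe_for(int R)
    {
        if (R <= 2)  return 3.0;
        if (R <= 5)  return 2.0;
        if (R <= 10) return 1.5;
        return 1.25;
    }

    /* Scaled factor for the contraction point (Table 2, right half). */
    static double sfc_for(int R) { return (R <= 10) ? 0.5 : 0.75; }

    /* EMVP = SFE * MVP or CMVP = SFC * MVP; the cast truncates
     * non-integer components toward zero, as required. */
    static MV scale_mvp(MV mvp, double factor)
    {
        MV p = { (int)(factor * mvp.x), (int)(factor * mvp.y) };
        return p;
    }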

3.7. Algorithm Procedure

Step 1. Get a predicted motion vector (MVP) for the candidate block in the current frame for the initial search stage.

Step 2. Find the adaptive direction of the search pattern by rules (3.4.1)–(3.4.4), determine the pattern size R with (1), and choose the initial SPs in the reference frame along the quarter circle and the predicted MV using the look-up table and (5)-(6).

Step 3. Check the initial search points with the block pixel intensity measurement, and take the MME point with the minimum SAD as the search center for the next search stage.

Step 4. Refine the local search by applying the unit-sized square pattern to the MME point (search center), checking its neighboring points with the block pixel intensity measurement. If, after the search, the MME point is still the search center, stop searching and output the final motion vector for the candidate block, corresponding to the final best matching point identified in this step. Otherwise, set the new MME point as the search center and apply the square pattern search to that MME point again, until the stop condition is satisfied.
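The four steps translate into a compact search loop. The following C sketch outlines the search for one block; Frame, sad_block(), initial_points(), neighbors8(), and get_median_mvp() are hypothetical helpers standing in for the SAD measurement, the initial SP generation of Step 2, the unit-sized square pattern, and the MVP derivation.

    #include <limits.h>

    typedef struct { int x, y; } MV;     /* as in the earlier sketches */
    typedef struct Frame Frame;          /* opaque frame type (assumed) */

    unsigned sad_block(const Frame *c, const Frame *r, int bx, int by, MV mv);
    int  initial_points(MV mvp, MV out[7]);
    void neighbors8(MV center, MV out[8]);
    MV   get_median_mvp(int bx, int by);

    MV acqpps_search(const Frame *cur, const Frame *ref, int bx, int by)
    {
        MV mvp = get_median_mvp(bx, by);                 /* Step 1 */
        MV cand[7];
        int n = initial_points(mvp, cand);               /* Step 2 */

        MV best = { 0, 0 };
        unsigned best_sad = UINT_MAX;
        for (int i = 0; i < n; ++i) {                    /* Step 3 */
            unsigned s = sad_block(cur, ref, bx, by, cand[i]);
            if (s < best_sad) { best_sad = s; best = cand[i]; }
        }
        for (;;) {                                       /* Step 4 */
            MV center = best, nb[8];
            neighbors8(center, nb);       /* unit-sized square pattern */
            for (int i = 0; i < 8; ++i) {
                unsigned s = sad_block(cur, ref, bx, by, nb[i]);
                if (s < best_sad) { best_sad = s; best = nb[i]; }
            }
            if (best.x == center.x && best.y == center.y)
                return best;              /* MME stayed at the center */
        }
    }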

3.8. Algorithm Complexity. As ACQPPS is a predictive and adaptive multistep algorithm for motion search, its computational complexity depends exclusively on the object motions contained in the video sequences and the scenarios for estimation processing. The main overhead of the ACQPPS algorithm lies in the block SAD computations. Other algorithm overhead, such as the selection of the adaptive search pattern direction and the determination of the search arm and initial search points, is merely consumed by a combination of if-condition judgments and can thus be ignored when compared with the block SAD calculations.

If large, quick, and complex object motions are included in the video sequences, the number of search points (NSP) will reasonably increase. Conversely, if small, slow, and simple object motions are shown in the sequences, the ACQPPS algorithm requires only a few processing steps to finish the motion search, that is, the number of search points is correspondingly reduced.

Unlike ME algorithms with fixed search ranges, for example, the full search algorithm, it is impractical to identify precisely the number of computational steps for ACQPPS. On average, however, an approximation equation can be utilized to represent its computational complexity.


Figure 4: A hardware architecture for the ACQPPS motion estimator. The initial search processing unit (with the look-up table and motion predictor storage) and the refined search processing unit exchange the MME point and generated MVs with a pipelined multilevel SAD calculator and an SAD comparator; current and reference video frame storage feeds an 18 × 18 register array with the reference block data and a 16 × 16 register array with the current block data, producing the reference and residual data.

The worst case of motion search for a video sequence uses the 4 × 4 block size, if a fixed block size is employed. In this case, the number of search points for ACQPPS motion estimation is usually around 12 ∼ 16, according to practical motion search results. Therefore, in terms of image size and frame rate, the algorithm complexity can be identified as

C ≈ 16 × Block SAD computations × Number of blocks in a video frame × Frame rate, (7)

where the block size is 4 × 4 for the worst case of computations. For a standard software implementation, each 4 × 4 block SAD calculation actually requires 16 subtractions and 15 additions, that is, 31 arithmetic operations. Accordingly, the complexity of ACQPPS is approximately 14 and 60 times less than that required by the full search algorithm with the [−7, +7] and [−15, +15] search ranges, respectively. In practice, the ACQPPS complexity is roughly at the same level as the simple DS algorithm.
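The 4 × 4 SAD kernel that dominates this cost is straightforward. A plain C version, shown below, performs exactly the 16 subtractions counted above, with the 15 additions merged into one running accumulator; the stride parameter (the frame width in pixels) is an assumption of this sketch.

    /* SAD of one 4x4 block between current and reference pixels. */
    static unsigned sad4x4(const unsigned char *cur,
                           const unsigned char *ref, int stride)
    {
        unsigned sad = 0;
        for (int y = 0; y < 4; ++y)
            for (int x = 0; x < 4; ++x) {
                int d = cur[y * stride + x] - ref[y * stride + x];
                sad += (unsigned)(d < 0 ? -d : d);   /* |difference| */
            }
        return sad;
    }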

4. Hardware Architecture of ACQPPS Motion Estimator

The ACQPPS algorithm is designed with low complexity, which makes it appropriate for implementation in a hardware architecture. The hardware architecture takes advantage of the pipelining and parallel operation of the adaptive search patterns and utilizes a fully pipelined multilevel SAD calculator to improve the computational efficiency and, therefore, reasonably reduce the required clock frequency.

As mentioned above, the computation of the motion vector for the smallest block shape, that is, the 4 × 4 block, is the worst case for calculation, where the worst case refers to the percentage usage of the memory bandwidth. It is necessary that the computational efficiency be as high as possible in the worst case. All other block shapes can be constructed from 4 × 4 blocks, so computing the distortion as 4 × 4 partial solutions and adding the results can solve all the other block shapes.

4.1. ACQPPS Hardware Architecture. An architecture for the ACQPPS motion estimator is shown in Figure 4. There are two main stages for the motion vector search, the initial and refined search, indicated by a hardware semaphore. In the initial search stage, the architecture utilizes the previously calculated motion vectors to produce an MVP for the current block. Some initial search points are generated utilizing the MVP and the LUT to define the search range of the adaptive patterns. After an MME point is found in this stage, the search refinement takes effect, applying the square pattern around the MME points iteratively to obtain a final best MME point, which indicates the final best MV for the current block. For motion estimation, the reference frames are stored in SRAM or DRAM, while the current frame and the produced MVs are stored in dual-port memory (BRAM). Meanwhile, the LUT also uses the BRAM to facilitate the generation of the initial search points.

Figure 5 illustrates the data search flow of the ACQPPS hardware IP for each block motion search. The initial search processing unit (ISPU) is used to generate the initial search points and then perform the initial motion search. To generate the initial search points, the previously calculated MVs and an LUT are employed. The LUT contains the vertical and horizontal components of the initial search points defined in Table 1. Both the produced MVs and the LUT values are stored in BRAM, where they can be accessed through two independent data ports in parallel to facilitate the processing. When the initial search stage is finished, the refined search processing unit (RSPU) is enabled. It employs the square pattern around the MME point derived in the initial search stage to refine the local motion search. The local refined search steps may be performed iteratively a few times, until the MME point remains at the search center after the refinement steps.


Figure 5: (a) The data search flow for an individual block motion estimation when the MVP is not zero; (b) the data search flow when the MVP is zero. In both flows, the current block data are preloaded into the 16 × 16 register array, and the reference data of each offset search point are loaded into the 18 × 18 register array to enable SAD calculation. The initial search stage covers MVP generation, the decision of the search pattern, the generation of the initial SPs (including the MVP and (0, 0) offset points), the removal of overlapping initial search points, and the identification of the initial MME point position; the refined search stage then generates new offset SPs using the diamond or square pattern around the MME point until the MME point is the search center, yielding the final MV for the current block. Note: the clock cycles for each task are not on an exact timing scale and are for illustration purposes only.

The search data flow of the ACQPPS IP architecture conforms to the algorithm steps defined in Section 3.7, with further improvement and optimization from the hardware's parallel and pipelining features.

4.2. Fully Pipelined SAD Calculator. As the main ME operations are related to SAD calculations, which have a critical impact on the performance of a hardware-based motion estimator, a fully pipelined SAD calculator is designed to speed up the SAD computations. Figure 6 displays the basic architecture of the pipelined SAD calculator, with processing support for variable block sizes. According to the VBS indicated by the block shape and enable signals, the SAD calculator employs the appropriate parallel and pipelining adder operations to generate the SAD result for a searched block. With the parallel calculations of the basic processing units (BPUs), it takes 4 clock cycles to finish a 4 × 4 block SAD computation (one BPU per 4 × 4 block SAD) and 8 clock cycles to produce a final SAD result for a 16 × 16 block.

To support the VBS feature, different block shapes can be processed based on the prototype of the BPU. In this case, a 16 × 16 macroblock is divided into 16 basic 4 × 4 blocks.


Figure 6: An architecture for the pipelined multilevel SAD calculator. Register data arrays holding the current 4 × 4 blocks 0–15 and the reference 4 × 4 blocks feed four basic processing units (BPU1–BPU4) through multiplexers under data selection control; accumulators (ACC) then combine the partial results into the 4 × 4, 4 × 8, 8 × 4, 8 × 8 or 8 × 16, and 16 × 8 or 16 × 16 SAD outputs.

Figure 7: Organization of variable block sizes based on basic 4 × 4 blocks (indices 0–15 in raster order within the macroblock).

Organization of VBS using 4 × 4 blocks:
16 × 16: {0, 1, . . . , 14, 15}
16 × 8: {0, 1, . . . , 6, 7}, {8, 9, . . . , 14, 15}
8 × 16: {0, 1, 4, 5, 8, 9, 12, 13}, {2, 3, 6, 7, 10, 11, 14, 15}
8 × 8: {0, 1, 4, 5}, {2, 3, 6, 7}, {8, 9, 12, 13}, {10, 11, 14, 15}
8 × 4: {0, 1}, {2, 3}, {4, 5}, {6, 7}, {8, 9}, {10, 11}, {12, 13}, {14, 15}
4 × 8: {0, 4}, {1, 5}, {2, 6}, {3, 7}, {8, 12}, {9, 13}, {10, 14}, {11, 15}
4 × 4: {0}, {1}, . . . , {14}, {15}

Computing stages for VBS using 4 × 4 blocks:
16 × 16: Stage 1 {0, 1, 2, 3}; Stage 2 {4, 5, 6, 7}; Stage 3 {8, 9, 10, 11}; Stage 4 {12, 13, 14, 15}
16 × 8: Stage 1 {0, 1, 2, 3}/{8, 9, 10, 11}; Stage 2 {4, 5, 6, 7}/{12, 13, 14, 15}
8 × 16: Stage 1 {0, 1}/{2, 3}; Stage 2 {4, 5}/{6, 7}; Stage 3 {8, 9}/{10, 11}; Stage 4 {12, 13}/{14, 15}
8 × 8: Stage 1 {0, 1}/{2, 3}/{8, 9}/{10, 11}; Stage 2 {4, 5}/{6, 7}/{12, 13}/{14, 15}
8 × 4: Stage 1 {0, 1}/{2, 3}/{4, 5}/{6, 7}/{8, 9}/{10, 11}/{12, 13}/{14, 15}
4 × 8: Stage 1 {0}/{1}/{2}/{3}/{8}/{9}/{10}/{11}; Stage 2 {4}/{5}/{6}/{7}/{12}/{13}/{14}/{15}
4 × 4: Stage 1 {0}/{1}/ . . . /{14}/{15}

The other six block sizes in H.264, that is, 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, and 4 × 8, can be organized as combinations of the basic 4 × 4 blocks, as shown in Figure 7, which also describes the computing stages for each variable-sized block constructed from the basic 4 × 4 blocks to obtain the VBS SAD results.

For instance, a largest 16 × 16 block requires 4 stages of parallel data loading from the register arrays to the SAD calculator to obtain a final block SAD result. In this case, the schedule of data loading is {0, 1, 2, 3} → {4, 5, 6, 7} → {8, 9, 10, 11} → {12, 13, 14, 15}, where "{}" indicates each parallel pixel data input with the current and reference block data.
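A behavioural C model of this combination step may clarify the idea: the 16 basic 4 × 4 SADs (indexed 0–15 in raster order, as in Figure 7) are simply summed according to the grouping for the desired block shape. This is a sketch of the combination logic, not the RTL itself.

    /* 16x16 SAD from the 16 partial 4x4 SADs, accumulated over the
     * four stages {0,1,2,3} -> {4,5,6,7} -> {8,9,10,11} -> {12,13,14,15}. */
    static unsigned sad16x16_from_4x4(const unsigned s[16])
    {
        unsigned total = 0;
        for (int stage = 0; stage < 4; ++stage)
            total += s[4 * stage]     + s[4 * stage + 1]
                   + s[4 * stage + 2] + s[4 * stage + 3];
        return total;
    }

    /* One of the four 8x8 SADs, using the groupings of Figure 7:
     * {0,1,4,5}, {2,3,6,7}, {8,9,12,13}, {10,11,14,15}. */
    static unsigned sad8x8_from_4x4(const unsigned s[16], int which)
    {
        static const int g[4][4] = {
            { 0, 1, 4, 5 }, { 2, 3, 6, 7 },
            { 8, 9, 12, 13 }, { 10, 11, 14, 15 }
        };
        return s[g[which][0]] + s[g[which][1]]
             + s[g[which][2]] + s[g[which][3]];
    }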

4.3. Optimized Memory Structure. When the square pattern is used to refine the MV search results, the mapping of the memory architecture is important to speed up the performance. In our design, the memory architecture is mapped onto a 2D register space for the refined stage. The maximum size of this space is 18 × 18 at pixel bit depth, that is, the mapped register memory can accommodate the largest 16 × 16 macroblock plus the edge redundancy needed for the rotated data shift and storage operations.

A simple combination of parallel register shifts and related data fetches from SRAM reduces the memory bandwidth and facilitates the refinement processing, as many of the pixel data searched in this stage remain unchanged: for an offset of (dx, dy), a fraction (16 − |dx|)(16 − |dy|)/256 of a 16 × 16 block's pixels is reused. For example, 87.89% and 93.75% of the pixel data stay unchanged when the (1, −1) and (1, 0) offset searches for the 16 × 16 block are executed, respectively.

4.4. SAD Comparator. The SAD comparator is utilized to compare the previously generated block SAD results and obtain the final estimated MV, which corresponds to the best MME point with the minimum SAD, that is, the lowest block pixel intensity difference. To select and compare the proper block SAD results shown in Figure 6, the signals for the different block shapes and computing stages are employed to determine the appropriate mode of the minimum SAD to be utilized.

For example, if the 16 × 16 block size is used for motion estimation, the 16 × 16 block data are loaded into the BPUs for SAD calculation. Each 16 × 16 block requires 4 computing stages to obtain a final block SAD result. In this case, the result mode of "16 × 8 or 16 × 16 SAD" is selected first. Meanwhile, the computing stage signal is also used to indicate the valid input to the SAD comparator for retrieving the proper SAD results from the BPUs, and thus to obtain the MME point with the minimum SAD for this block size.

The best MME point position obtained by the SAD comparator is further employed to produce the best matched reference block data and the residual data, which are important to other video encoding functions, such as the mathematical transforms and motion compensation.

5. Virtual Socket System-on-Platform Architecture

The bitstream and hardware complexity analysis derived in Section 2 helps guide both the architecture design for the prototyping IP accelerated system and the optimized implementation of an H.264 BP encoding system based on that architecture.

5.1. The Proposed System-on-Platform Architecture. The variety of options, switches, and modes required in a video bitstream results in increasing interactions between the different video tasks or function-specific IP blocks.

Consequently, function-oriented and fully dedicated architectures become inefficient if high levels of flexibility are not provided in the individual IP modules. For the architectures to remain efficient, the hardware blocks need optimization to deal with the increasing complexity of visual object processing. Besides, the hardware must remain flexible enough to manage and allocate the various resources, memories, and computational video IP accelerators for the different encoding tasks. Given that programmable solutions are preferable for video codec applications, with programmable and reconfigurable processing cores the heterogeneous functionality and algorithms can be executed on the same hardware platform and upgraded flexibly by software manipulations.

To accelerate the performance of the processing cores, parallelization is demanded. The parallelization can take place at different levels, such as task, data, and instruction. Furthermore, specific video processing algorithms performed by the IP accelerators or processing cores can improve the execution efficiency significantly. The requirements of H.264 video applications are so demanding that multiple acceleration techniques may have to be combined to meet real-time conditions. Programmable, reconfigurable, heterogeneous processors are the preferable choice for an implementation of an H.264 BP video encoder. Architectures supporting concurrent performance and hardware video IP accelerators are well suited to achieving the real-time requirements imposed by the H.264 standard.

Figure 8 shows the proposed extensible system-on-platform architecture. The architecture consists of a programmable and reconfigurable processing core built upon an FPGA, and two extensible cores with a RISC and a DSP. The RISC can take charge of general sequence control and IP integration information, provide the mode selections for video coding, and configure basic operations, while the DSP can be utilized to process particular or flexible computational tasks.

The processing cores are connected through the heterogeneous integrated on-platform memory spaces for the exchange of control information. The PCI/PCMCIA standard bus provides a data transfer solution for the host connected to the platform framework, and reconfigures and controls the platform in a flexible way. The desired video IP accelerators are integrated in the system platform architecture to improve the encoding performance for H.264 BP video applications.

5.2. Virtual Socket Management. The concept of a virtual socket is thus introduced in the proposed system-on-platform architecture. The virtual socket is a solution for the host-platform interface, which can map a virtual memory space from the host environment to the physical storage on the architecture. It is an efficient mechanism for the management of the virtual memory interface and the heterogeneous memory spaces on the system framework, enabling a truly integrated, platform-independent environment for hardware-software codevelopment.


Figure 8: The proposed extensible system-on-platform hardware architecture. Inside the FPGA, IP modules 1 to N and BRAM connect through the IP memory interface and the virtual socket (VS) controller to the SRAM and DRAM, with interrupt lines, a local bus mux interface, and a PCI bus interface linking the platform to the extensible RISC and DSP cores and the host.

Through the virtual socket interface, a few virtual socket application programming interface (API) function calls can be employed to make the generic hardware functional IP accelerators automatically map the virtual memory addresses from the host system to the different memory spaces on the hardware platform. Therefore, with the efficient virtual socket memory organization, the hardware abstraction layer provides the system architecture with simplified memory access, interrupt-based control, and shielded interactions between the platform framework and the host system. Through the integration of IP accelerators into the hardware architecture, the system performance is improved significantly.

The codesigned virtual socket host-platform interface management and the system-on-platform hardware architecture together provide a useful embedded system approach for the realization of an advanced and complicated H.264 video encoding system. Hence, the IP accelerators on the FPGA, together with the extensible DSP and RISC, constitute an efficient programmable embedded solution to perform dedicated, real-time video processing tasks. Moreover, given the various video configurations for H.264 encoding, the physically implemented virtual socket interface and its APIs easily enable the encoder configurations, data manipulations, and communications between the host computer system and the hardware architecture, which in turn facilitates the system development for H.264 video encoders.

5.3. Integration of IP Accelerators. An IP accelerator, as illustrated here, can be any H.264-compliant hardware block that is defined to handle a computationally extensive task for video applications without a specific design for interaction controls between the IP and the host. For encoding, the basic modules to be integrated include the Motion Estimator, Discrete Cosine Transform and Quantization (DCT/Q), Deblocking Filter, and Context Adaptive Variable Length Coding (CAVLC), while the Inverse Discrete Cosine Transform and Inverse Quantization (IDCT/Q−1) and Motion Compensation (MC) serve decoding. An IP memory interface is provided by the architecture to achieve the integration. All IP modules are connected to the IP memory interface, which gives the accelerators a direct way to exchange data between the host and the memory spaces. Interrupt signals can be generated by the accelerators when demanded. Moreover, to control the concurrent performance of the accelerators, an IP bus arbitrator is designed and integrated in the IP memory interface, allowing the interface controller to allocate appropriate memory operation time to each IP module and to avoid the memory access conflicts possibly caused by heterogeneous IP operations.

IP interface signals are configured to connect the IP modules to the IP memory interface. Each accelerator is likely to have its own interface requirements for interaction between the platform and the IP modules. To make the integration easy, certain common interface signals are defined to link the IP blocks and the memory interface together. With the IP interface signals, the accelerators can focus on their own computational tasks, and the architecture efficiency can thus be improved. In practice, the IP modules can be flexibly reused, extended, and migrated to other independent platforms very easily. Table 3 defines the necessary IP interface signals for the proposed architecture. The IP modules only need to issue the memory requests and access parameters to the IP memory interface, and the rest of the tasks are handled by the platform controllers. This feature is especially useful when the motion estimator is integrated in the system.

5.4. Host Interface and API Function Calls. The host interface provides the architecture with the necessary data for video processing. It can also control the video accelerators to operate in sequential or parallel mode, in accordance with the H.264 video codec specifications. The hardware-software partitioning is simplified so that the host interface can focus on the data communication and flow control for the video tasks, while the hardware accelerators deal with the local memory accesses and video codec functions.


Table 3: IP interface signals.

Interface signals                                Description
Clk, Reset, Start                                Platform signals for the IP
Input Valid, Output Valid                        Valid strobes for IP memory access
Data In, Data Out                                Input and output memory data for the IP
Memory Read                                      IP request for memory read
Mem HW Accel, Offset, Count                      IP number, offset, and data count provided by the IP/host for memory read
Mem HW Accel1, Offset1, Count1                   IP number, offset, and data count provided by the IP/host for memory write
Mem Read Req, Mem Write Req                      IP bus requests for memory access
Mem Read Release Req, Mem Write Release Req      IP bus release requests for memory access
Mem Read Ack, Mem Write Ack                      IP bus request grants for memory access
Mem Read Release Ack, Mem Write Release Ack      IP bus release grants for memory access
Done                                             IP interrupt signal

Therefore, the software abstraction layer covers the data exchange and video task flow control for the hardware performance.

A set of related virtual socket API functions is defined to implement the host interface features. The virtual socket APIs are software function calls coded in C/C++, which perform the data transfers and signal interactions between the host and the hardware system-on-platform. The virtual socket API, as a software infrastructure, can be utilized by a variety of video applications to control the hardware features defined. With the virtual socket APIs, the manipulation of video data in the local memories can be executed conveniently, and the efficiency of the hardware and software interactions can be kept high.
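Since the paper does not list the actual virtual socket API signatures, the following C sketch only illustrates the kind of host-side call sequence described: map the interface, transfer the frame data in, start the IP accelerators, wait for the interrupt-driven completion, and read the results back. All names here (vs_open, vs_write, and so on) are illustrative placeholders, not the real API.

    #include <stddef.h>

    /* Hypothetical virtual socket API declarations (assumptions). */
    typedef int vs_handle;
    enum { VS_BRAM_CUR_FRAME, VS_SRAM_BITSTREAM, VS_IP_ENCODER };

    vs_handle vs_open(const char *device);
    void   vs_write(vs_handle h, int region, const void *buf, size_t len);
    void   vs_start_ip(vs_handle h, int ip);
    void   vs_wait_done(vs_handle h);   /* blocks on the IP interrupt */
    size_t vs_read(vs_handle h, int region, void *buf, size_t maxlen);
    void   vs_close(vs_handle h);

    /* Encode one frame through the IP accelerated platform. */
    int encode_frame(const unsigned char *yuv, size_t len,
                     unsigned char *bs, size_t bs_max, size_t *bs_len)
    {
        vs_handle h = vs_open("wildcard4");
        vs_write(h, VS_BRAM_CUR_FRAME, yuv, len);  /* DMA current frame */
        vs_start_ip(h, VS_IP_ENCODER);             /* trigger encoding  */
        vs_wait_done(h);                           /* interrupt-based   */
        *bs_len = vs_read(h, VS_SRAM_BITSTREAM, bs, bs_max);
        vs_close(h);
        return 0;
    }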

6. System Optimizations

6.1. Memory Optimization. Owing to the significant memory access requirements of video encoding tasks, a large number of clock cycles is consumed by the processing core while waiting for data fetches from the local memory spaces. To reduce or avoid this memory access overhead, the memory storage of the video frame data can be organized to utilize multiple independent memory spaces (SRAM and DRAM) and dual-port memory (BRAM), in order to enable parallel and pipelined memory access during video encoding. This optimization practically provides the system architecture with multiport memory storage, reducing the data access bandwidth on each individual memory space.

Furthermore, with dual-port data access, DMA can be scheduled to transfer a large amount of video frame data through the PCI bus and the virtual socket interface in parallel with the operations of the encoding tasks, so that the processing core does not suffer memory and encoding latency. In this case, the data control flow of the video encoding is managed to run the DMA transfers and IP accelerator operations in fully parallel and pipelined stages.

6.2. Architecture Optimization. As the main video encoding functions (such as ME, DCT/Q, IDCT/Q−1, MC, Deblocking Filter, and CAVLC) can be accelerated by IP modules, the interconnection between those video processing accelerators has an important impact on the overall system performance. To make the IP accelerators execute the main computational encoding routines in fully parallel and pipelined mode, the IP integration architecture has to be optimized. A few caches are inserted between the video IP accelerators to facilitate concurrent encoding. The caches can be organized as parallel dual-port memory (BRAM) or pipelined memory (FIFO). The interconnection control of the data streaming between the IP modules is defined using those caches, aiming to eliminate the extra overhead of the processing routines, so that the encoding functions can operate in fully parallel and pipelined stages.

6.3. Algorithm Optimization. The complexity of the encoding algorithms can be modified while the IP accelerators are being shaped. This optimization is taken after choosing the most appropriate modes, options, and configurations for H.264 BP applications. It is known that the motion estimator incurs the major overhead in the encoding computations; to reduce the complexity of motion estimation, the very efficient and fast ACQPPS algorithm and its corresponding hardware architecture have been realized based on the reduction of spatio-temporal correlation redundancy. Other algorithm optimizations can also be applied. For example, a simple optimization applies to the mathematical transform and quantization: as many blocks tend to have minimal residual data after motion compensation, the transform and quantization of motion-compensated blocks can be skipped if the SAD of such blocks is lower than a prescribed threshold, in order to increase the processing speed.
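The transform/quantization skip just described amounts to a simple threshold test on each motion-compensated block. A minimal C sketch follows; the Block type, the helper names, and the threshold value are assumptions for illustration, since the paper does not fix them.

    #define SKIP_THRESHOLD 128           /* assumed tuning parameter */

    typedef struct Block Block;          /* holds SAD and residual data */
    unsigned block_sad(const Block *blk);
    void mark_block_skipped(Block *blk);
    void forward_dct(Block *blk);
    void quantize(Block *blk);

    /* Bypass DCT/Q when the motion-compensated residual is negligible. */
    void transform_block(Block *blk)
    {
        if (block_sad(blk) < SKIP_THRESHOLD) {
            mark_block_skipped(blk);     /* no coefficients are coded */
            return;
        }
        forward_dct(blk);
        quantize(blk);
    }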

The memory, algorithm, and architecture optimizations combined in the system can meet the major challenges in the realization of a video encoding system. These optimization techniques reduce the encoding complexity and the memory bandwidth, with a well-defined parallel and pipelined data streaming control flow, in order to implement a simplified H.264 BP encoder.

6.4. An IP Accelerated Model for Video Encoding. An optimized IP accelerated model is presented in Figure 9 for the realization of a simplified H.264 BP video encoder. In this architecture, BRAM, SRAM, and DRAM are used as multiport memories to facilitate the video processing. The current video frame is transferred by DMA and stored in BRAM. Meanwhile, the IP accelerators fetch the data from BRAM and start the video encoding routines.


Figure 9: An optimized architecture for the simplified H.264 BP video encoding system. The virtual socket controller and the IP memory interface (with its control) connect the BRAM, SRAM, and DRAM to the ME, DCT/Q, IDCT/Q−1, MC, Deblocking, and CAVLC accelerators, with the caches BRAM(1)–BRAM(4) and FIFO(1)–FIFO(2) inserted between them.

Figure 10: The video task partitioning and data control flow for the optimized system architecture. The current frame is transferred to BRAM; the ME is triggered to start, fetches the data for each block, and generates the MVs and residual data, saved to BRAM(1) and BRAM(2). In pipelined processes, DCT/Q generates the block coefficients, saved to FIFO(2) and BRAM(3); CAVLC produces the bitstreams, saved to SRAM/DRAM; IDCT/Q−1 regenerates the block coefficients, saved to FIFO(1) and BRAM(4); and, in parallel, MC generates the reconstructed block pixels while Deblocking generates the filtered block pixels. The per-block cycle repeats until the frame is finished, upon which an interrupt ends the processing.

As BRAM is a dual-port memory, the overhead of the DMA transfer is eliminated by this dual-port cache.

This IP accelerated system model combines the memory, algorithm, and architecture optimization techniques to reduce or eliminate the overhead resulting from the heterogeneous video encoding tasks. The video encoding model provided by this architecture is compliant with the H.264 standard specifications.

The data control flow based on this video task partitioning is shown in Figure 10. From the data streaming it is evident that parallel and pipelined operations dominate the encoding tasks, yielding an efficient processing performance.


7. Implementations

The proposed ACQPPS algorithm is integrated and verified under the H.264 JM reference software [28], while the hardware architectures, including the ACQPPS motion estimator and the system-on-platform framework, are synthesized with Synplify Pro 8.6.2 and implemented using Xilinx ISE 8.1i SP3, targeting a Virtex-4 XC4VSX35FF668-10 on the WILDCARD-4 [29].

The system hardware architecture can readily process QCIF/SIF/CIF video frames with the support of the on-platform design resources. The Virtex-4 XC4VSX35 contains 3,456 Kb of BRAM [30], 192 XtremeDSP (DSP48) slices [31], and 15,360 logic slices, which are equivalent to almost 1 million logic gates. Moreover, the WILDCARD-4 integrates a large 8 MB SRAM and 128 MB DRAM. With these design resources and memory support, whole QCIF/SIF/CIF video frames can be stored directly in the on-platform memories for efficient hardware processing.

For example, if a CIF YUV (YCbCr) 4 : 2 : 0 video sequence is encoded with the optimized hardware architecture proposed in Figure 9, the total size of each current frame is 148.5 KB. Each current CIF frame can therefore be transferred from the host system and stored directly in BRAM for motion estimation and video encoding, whereas the generated reference frames are stored in SRAM or DRAM. The SRAM and DRAM can accommodate a maximum of up to 55 and 882 CIF reference frames, respectively, which is more than enough for practical video encoding.

7.1. Performance of ACQPPS Algorithm. A variety of video sequences containing different amounts of motion, listed in Table 4, is examined to verify the algorithm performance for real-time encoding (30 fps). All sequences are in YUV (YCbCr) 4 : 2 : 0 format, with the luminance component processed for ME. The frame size of the sequences varies from QCIF to SIF and CIF, which is the typical testing condition, and the targeted bit rate ranges from 64 Kbps to 2 Mbps. SAD is used as the intensity matching criterion. The search window is [−15, +15] for FS. EPZS uses the extended diamond pattern and the PMVFAST pattern [32] for its primary and secondary refined search stages; it also enables the window-based, temporal, and spatial memory predictors to perform an advanced motion search. UMHexagonS utilizes search range prediction and the default scale factor optimized for the different image sizes. Encoded frames are produced in a sequence of IPP...PPP, as H.264 BP encoding is employed. For the reconstructed video quality evaluation, the frame-based average peak signal-to-noise ratio (PSNR) and the number of search points (NSP) per MB (16 × 16 pixels) are measured. Video encoding is configured with support for full-pel motion accuracy, a single reference frame, and VBS. As VBS is a complicated feature defined in H.264, to make the calculation of NSP over different block sizes easy and practical, all search points for variable block estimation are normalized to search points at the MB level, so that the NSP results can be evaluated reasonably.

The implementation results in Tables 6 and 7 show that the estimated image quality produced by ACQPPS, in terms of PSNR, is very close to that of FS, while the average number of search points is dramatically reduced.

Table 4: Video sequences for experiment with real-time frame rate.

Sequence (bit rate Kbps) Size/frame rate No. of frames

Foreman (512) QCIF/30 fps 300

Carphone (256) QCIF/30 fps 382

News (128) QCIF/30 fps 300

Miss Am (64) QCIF/30 fps 150

Suzie (256) QCIF/30 fps 150

Highway (192) QCIF/30 fps 2000

Football (2048) SIF/30 fps 125

Table Tennis (1024) SIF/30 fps 112

Foreman (1024) CIF/30 fps 300

Mother Daughter (128) CIF/30 fps 300

Stefan (2048) CIF/30 fps 90

Highway (512) CIF/30 fps 2000

Table 5: Video sequences for experiment with low bit and frame rates.

Sequence (bit rate Kbps) Size/frame rate No. of frames

Foreman (90) QCIF/7.5 fps 75

Carphone (56) QCIF/7.5 fps 95

News (64) QCIF/15 fps 150

Miss Am (32) QCIF/15 fps 75

Suzie (90) QCIF/15 fps 75

Highway (64) QCIF/15 fps 1000

Football (256) SIF/10 fps 40

Table Tennis (150) SIF/10 fps 35

Foreman (150) CIF/10 fps 100

Mother Daughter (64) CIF/10 fps 100

Stefan (256) CIF/10 fps 30

Highway (150) CIF/10 fps 665

The PSNR difference between ACQPPS and FS is in the range of −0.13 dB ∼ 0 dB. In most cases, the PSNR degradation of ACQPPS is less than 0.06 dB compared to FS, and in some cases the PSNR results of ACQPPS are approximately equivalent or equal to those generated by FS. When compared with the other fast search methods, that is, DS (small pattern), UCBDS, TSS, FSS, and HEX, ACQPPS outperforms them, always yielding a higher PSNR; it obtains an average PSNR of +0.56 dB higher than those algorithms on the evaluated video sequences.

Besides, the ACQPPS performance is comparable to that of the complicated and advanced EPZS and UMHexagonS algorithms, as it achieves an average PSNR within −0.07 dB ∼ +0.05 dB of EPZS and −0.04 dB ∼ +0.08 dB of UMHexagonS, respectively.

In addition to real-time video sequence encoding at 30 fps, many other application cases, such as mobile scenarios and videoconferencing, require video encoding in low bit and frame rate environments at less than 30 fps.


Table 6: Average PSNR performance for the experiment with real-time frame rate.

Sequence FS DS UCBDS TSS FSS HEX EPZS UMHexagonS ACQPPS

Foreman (QCIF) 38.48 38.09 37.93 38.27 38.19 37.87 38.45 38.44 38.42

Carphone (QCIF) 36.43 36.23 36.16 36.30 36.24 36.04 36.42 36.37 36.37

News (QCIF) 37.44 37.26 37.35 37.28 37.29 37.25 37.43 37.35 37.43

Miss Am (QCIF) 39.07 39.01 39.01 39.00 38.94 39.01 38.98 39.01 39.03

Suzie (QCIF) 38.65 38.46 38.47 38.59 38.54 38.45 38.61 38.58 38.60

Highway (QCIF) 38.23 37.99 38.13 38.11 38.09 38.06 38.18 38.17 38.13

Football (SIF) 31.37 31.23 31.23 31.22 31.24 31.20 31.40 31.37 31.36

Table Tennis (SIF) 33.87 33.71 33.79 33.62 33.72 33.71 33.87 33.84 33.84

Foreman (CIF) 36.30 35.91 35.83 35.72 35.70 35.69 36.27 36.24 36.25

Mother Daughter (CIF) 36.26 36.16 36.24 36.21 36.21 36.22 36.26 36.23 36.26

Stefan (CIF) 33.87 33.46 33.36 33.45 33.39 33.30 33.89 33.82 33.82

Highway (CIF) 37.96 37.70 37.83 37.79 37.77 37.76 37.89 37.87 37.83

Table 7: Average number of search points per MB for the experiment with real-time frame rate.

Sequence FS DS UCBDS TSS FSS HEX EPZS UMHexagonS ACQPPS

Foreman (QCIF) 2066.73 60.64 109.70 124.85 122.37 109.26 119.02 125.95 55.63

Carphone (QCIF) 1872.04 46.82 91.54 108.44 106.52 94.82 114.83 121.91 54.02

News (QCIF) 1719.92 33.72 74.48 88.28 90.30 81.12 81.36 79.73 41.32

Miss Am (QCIF) 1471.96 30.70 64.35 74.95 76.52 68.43 62.94 56.27 32.32

Suzie (QCIF) 1914.32 44.19 88.19 108.19 104.97 93.21 96.98 88.74 47.59

Highway (QCIF) 1791.86 40.27 85.14 101.49 100.94 90.27 85.04 84.24 46.12

Football (SIF) 2150.45 68.42 118.21 131.82 129.91 117.62 184.81 202.19 72.63

Table Tennis (SIF) 2031.72 55.66 105.56 120.09 121.27 108.79 128.36 124.95 54.25

Foreman (CIF) 1960.07 76.83 124.56 128.85 125.76 117.21 122.22 124.26 67.20

Mother Daughter (CIF) 1473.73 35.08 70.39 82.18 82.80 73.67 80.38 63.51 40.89

Stefan (CIF) 1954.21 69.32 116.72 118.23 122.37 113.50 137.59 149.80 58.91

Highway (CIF) 1730.90 45.63 90.81 104.90 103.98 93.22 78.82 75.98 47.57

Accordingly, the preferred settings for video encoding are usually 7.5 fps ∼ 15 fps for QCIF and 10 fps ∼ 15 fps for SIF/CIF with various low bit rates, for example, 90 Kbps for QCIF and 150 Kbps for SIF/CIF, to maximize the perceived video quality [40, 41]. To further evaluate the ME algorithms in low bit and frame rate cases, the video sequences in Table 5 are used, and Tables 8 and 9 give the corresponding performance results.

These experiments show that the PSNR difference between ACQPPS and FS is still small, within an acceptable range of −0.49 dB ∼ −0.02 dB; in most cases there is less than 0.2 dB PSNR discrepancy between them. Moreover, ACQPPS still clearly outperforms DS, UCBDS, TSS, FSS, and HEX. In mobile scenarios, quick and considerable motion displacements usually exist under low frame rate video encoding; in such cases ACQPPS is particularly better than those fast algorithms, achieving a PSNR gain of up to +2.42 dB on the tested sequences. Compared with EPZS and UMHexagonS, ACQPPS yields an average PSNR within −0.36 dB ∼ +0.06 dB and −0.15 dB ∼ +0.07 dB, respectively.

Normally, ACQPPS produces a favorable PSNR for sequences not only with small object motions but also with large amounts of motion. In particular, if a sequence includes large object motions or a considerable amount of motion, the advantage of the ACQPPS algorithm is obvious, as ACQPPS can adaptively choose different shapes and sizes of the search pattern, which suits an efficient large motion search.

This search advantage can be observed when ACQPPS is compared with DS. DS has a simple diamond pattern for very low complexity motion search. For video sequences with slow and small motions, for example, Miss Am (QCIF) and Mother Daughter (CIF) at 30 fps, the PSNR performance of DS and ACQPPS is relatively close, indicating that DS performs well for simple motion search. When complicated and large amounts of motion are included in the video images, however, DS is unable to yield a good PSNR, as its motion search is easily trapped in an undesirable local minimum. For example, the PSNR differences between DS and ACQPPS are 0.34 dB and 0.44 dB when Foreman (CIF) is tested at 1 Mbps, 30 fps and at 150 Kbps, 10 fps, respectively. Furthermore, ACQPPS produces an average PSNR of +0.02 dB ∼ +0.36 dB over DS for real-time video encoding, and +0.07 dB ∼ +1.94 dB in the low bit and frame rate environment.


Table 8: Average PSNR performance for experiment with low bit and frame rates.

| Sequence | FS | DS | UCBDS | TSS | FSS | HEX | EPZS | UMHexagonS | ACQPPS |
| Foreman (QCIF) | 34.88 | 34.44 | 34.19 | 34.40 | 34.42 | 33.99 | 34.85 | 34.80 | 34.80 |
| Carphone (QCIF) | 34.12 | 33.99 | 33.96 | 34.02 | 33.99 | 33.84 | 34.08 | 34.04 | 34.06 |
| News (QCIF) | 35.28 | 35.21 | 35.20 | 35.11 | 35.20 | 35.19 | 35.25 | 35.21 | 35.24 |
| Miss Am (QCIF) | 38.36 | 38.23 | 38.33 | 38.25 | 38.23 | 38.31 | 38.28 | 38.27 | 38.34 |
| Suzie (QCIF) | 36.54 | 36.40 | 36.34 | 36.44 | 36.39 | 36.26 | 36.52 | 36.50 | 36.50 |
| Highway (QCIF) | 36.19 | 35.80 | 36.01 | 35.96 | 35.94 | 35.90 | 36.13 | 36.09 | 35.98 |
| Football (SIF) | 25.11 | 24.82 | 24.92 | 24.79 | 24.84 | 24.89 | 25.08 | 25.10 | 25.01 |
| Table Tennis (SIF) | 27.57 | 26.65 | 26.95 | 26.85 | 26.84 | 26.86 | 27.57 | 27.60 | 27.45 |
| Foreman (CIF) | 31.95 | 31.29 | 31.32 | 31.16 | 31.25 | 31.06 | 31.90 | 31.79 | 31.73 |
| Mother Daughter (CIF) | 36.07 | 35.84 | 35.95 | 35.90 | 35.92 | 35.91 | 36.07 | 36.02 | 36.05 |
| Stefan (CIF) | 27.02 | 24.59 | 24.92 | 24.11 | 24.12 | 24.96 | 26.89 | 26.67 | 26.53 |
| Highway (CIF) | 37.21 | 36.88 | 36.98 | 36.94 | 36.92 | 36.94 | 37.12 | 37.09 | 37.01 |

Table 9: Average number of search points per MB for experiment with low bit and frame rates.

| Sequence | FS | DS | UCBDS | TSS | FSS | HEX | EPZS | UMHexagonS | ACQPPS |
| Foreman (QCIF) | 2020.51 | 90.20 | 140.01 | 134.64 | 133.92 | 125.12 | 163.63 | 190.38 | 98.94 |
| Carphone (QCIF) | 1836.04 | 58.40 | 102.76 | 112.65 | 111.28 | 100.74 | 141.56 | 160.32 | 71.81 |
| News (QCIF) | 1680.68 | 34.74 | 74.11 | 87.22 | 88.92 | 79.64 | 96.40 | 102.30 | 52.61 |
| Miss Am (QCIF) | 1406.26 | 32.60 | 64.56 | 75.39 | 75.08 | 67.74 | 68.05 | 63.20 | 44.22 |
| Suzie (QCIF) | 1823.23 | 52.96 | 94.39 | 110.43 | 106.30 | 95.11 | 115.96 | 112.17 | 64.30 |
| Highway (QCIF) | 1710.86 | 42.06 | 84.42 | 97.99 | 97.34 | 87.36 | 98.22 | 97.77 | 58.13 |
| Football (SIF) | 1914.43 | 80.13 | 132.67 | 123.20 | 125.01 | 119.62 | 192.88 | 246.76 | 92.51 |
| Table Tennis (SIF) | 1731.44 | 50.10 | 98.45 | 97.71 | 100.73 | 93.82 | 159.45 | 182.19 | 64.39 |
| Foreman (CIF) | 1789.76 | 91.32 | 140.31 | 124.01 | 124.24 | 120.48 | 154.55 | 170.89 | 88.62 |
| Mother Daughter (CIF) | 1467.56 | 42.21 | 78.32 | 87.45 | 87.75 | 78.42 | 90.40 | 78.14 | 52.36 |
| Stefan (CIF) | 1663.89 | 65.44 | 110.17 | 100.53 | 102.36 | 103.71 | 153.97 | 194.64 | 78.69 |
| Highway (CIF) | 1715.63 | 52.26 | 97.20 | 109.24 | 107.49 | 96.94 | 91.74 | 92.27 | 64.45 |

The number of search points (NSP) for each method, which mainly represents the algorithm complexity, is also obtained to measure the search efficiency of the different approaches. The NSP results show that the search efficiency of ACQPPS is higher than that of the other algorithms, as ACQPPS produces very good performance, in terms of PSNR, with a reasonably small NSP. The NSP of ACQPPS is among the lowest of all methods.

If ACQPPS is compared with DS, it is shown that ACQPPS has an NSP similar to that of DS. The NSP of ACQPPS is usually slightly higher than that of DS; however, the increase is limited and very reasonable, and in turn brings ACQPPS a much better PSNR for the encoded video quality. Furthermore, for video sequences containing complex and quick object motions, for example, Foreman (CIF) and Stefan (CIF) at 30 fps, the NSP of ACQPPS can even be less than that of DS, which verifies that ACQPPS has a much more satisfactory search efficiency than DS, due to its highly adaptive search patterns.

In general, the complexity of ACQPPS is very low while its search performance is high, which makes it especially suitable for hardware architecture implementation.

7.2. Design Resources for ACQPPS Motion Estimator. As the complexity and search points of ACQPPS have been greatly reduced, the design resources used by the ACQPPS architecture can be kept at a very low level. The main part of the design resources is the SAD calculator. Each BPU requires one 32-bit processing element (PE) to implement SAD calculations. Every PE has two 8-bit pixel data inputs, one from the current block and the other from the reference block. Besides, every PE contains 16 subtractors, 8 three-input adders, and 1 latch register, and does not require extra interim registers or accumulators. As a whole, a 32 × 4 PE array is needed to implement the pipelined multilevel SAD calculator, which requires in total 64 subtractors, 32 three-input adders, and 4 latch registers. Other related design resources mainly include an 18 × 18 register array, a 16 × 16 register array, and a few accumulators, subtractors, and comparators, which are used to generate the block SAD results, residual data, and final estimated MVs. Moreover, some other multiplexers, registers, memory access, and data flow control logic gates are also needed in the architecture. A comparison of design resources between ACQPPS and other ME architectures [33–36] is presented in Table 10. The results show that the proposed ACQPPS architecture can utilize greatly reduced design


Table 10: Performance comparison between proposed ACQPPS and other motion estimation hardware architectures.

| | [33] | [34] | [35] | [36] | Proposed architecture |
| Type | ASIC | ASIC | ASIC | ASIC | FPGA + DSP |
| Algorithm | FS | FS | FS | FS | ACQPPS |
| Search range | [−16, +15] | [−32, +31] | [−16, +15] | [−16, +15] | Flexible |
| Gate count | 103 K | 154 K | 67 K | 108 K | 35 K |
| Supported block sizes | All | All | 8×8, 16×16, 32×32 | All | All |
| Freq. [MHz] | 66.67 | 100 | 60 | 100 | 75 |
| Max fps of CIF | 102 | 60 | 30 | 56 | 120 |
| Min Freq. [MHz] for CIF 30 fps | 19.56 | 50 | 60 | 54 | 18.75 |

Table 11: Design resources for system-on-platform architecture.

| Target FPGA | Critical Path | Gates | DFFs/Latches | LUTs | CLB Slices | Resource |
| XC4VSX35FG668-10 | 5 ns | 279,774 | 3,388 | 3,161 | 3,885 | 25% |

Table 12: DMA performance for video sequence transfer.

| Sequence format (WildCard-4) | DMA Write (ms) | DMA Read (ms) | DMA R/W (ms) |
| QCIF 4:2:0 YCrCb | 0.556 | 0.491 | 0.515 |
| CIF 4:2:0 YCrCb | 2.224 | 1.963 | 2.059 |

resources to realize a high-performance motion estimator for H.264 encoding.
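For illustration, the per-cycle work of one PE described above can be modeled in C roughly as follows; the function name and array-based interface are assumptions made only for this sketch, not the register-level design.

```c
#include <stdint.h>
#include <stdlib.h>

/* Behavioral model of one 32-bit processing element (PE): per clock
 * cycle it absorbs 16 pixel pairs (current block vs. reference block)
 * and accumulates their absolute differences into a partial SAD. */
static uint32_t pe_cycle(const uint8_t cur[16], const uint8_t ref[16])
{
    uint32_t partial_sad = 0;
    for (int i = 0; i < 16; i++)                 /* the 16 subtractors */
        partial_sad += (uint32_t)abs((int)cur[i] - (int)ref[i]);
    return partial_sad;                          /* latched for the adder tree */
}
```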

7.3. Throughput of ACQPPS Motion Estimator. Unlike the FS, which has a fixed search range, the search points and search range of ACQPPS depend on the video sequences. ACQPPS search points will be increased if a video sequence contains considerable or quick motions. On the contrary, search points can be reduced if a video sequence includes slow or small amounts of motion.

The ME scheme with a fixed block size can typically be applied to the throughput analysis. In that case, the worst case is motion estimation using 4 × 4 blocks, which is the most time consuming among the fixed block sizes. Hence, the overall throughput result produced by the ACQPPS architecture can be reasonably generalized and evaluated.

In general, if the clock frequency is 50 MHz and the memory (SRAM, BRAM, and DRAM) structure is organized as DWORD (32-bit) for each data access, the ACQPPS hardware architecture will need an average of approximately 12.39 milliseconds for motion estimation in the worst case of using 4 × 4 blocks. For a real hardware implementation, the typical throughput in the worst case of 4 × 4 blocks can represent the overall motion search ability of this motion estimator architecture.

Therefore, the ACQPPS architecture can complete the motion estimation for more than 4 CIF (352 × 288) video sequences, or equivalently one 4CIF (704 × 576) video sequence, at a 75 MHz clock frequency within each 33.33 millisecond time slot (30 fps), meeting the real-time encoding requirement for a low design cost and low bit rate implementation. The throughput of the ACQPPS architecture can be compared with those of a variety of other recently developed motion estimator hardware architectures, as illustrated in Table 10. The comparison results show that the proposed ACQPPS architecture achieves higher throughput than the other hardware architectures, with a reduced operational clock frequency. Generally, it requires only a very low clock frequency, that is, 18.75 MHz, to generate the motion estimation results for CIF video sequences at 30 fps.
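As a quick consistency check, assuming the worst-case time above scales inversely with clock frequency:

\[
12.39\,\text{ms} \times \frac{50\,\text{MHz}}{75\,\text{MHz}} \approx 8.26\,\text{ms}, \qquad \frac{33.33\,\text{ms}}{8.26\,\text{ms}} \approx 4.0,
\]

that is, roughly four CIF frames fit into one 30 fps time slot at 75 MHz, consistent with the claim above.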

7.4. Realization of System Architecture. Table 11 lists the design resources utilized by the system-on-platform framework. The implementation results indicate that the system architecture uses approximately 25% of the FPGA design resources when there is no hardware IP accelerator integrated in the platform system. If video functions are needed, more design resources will be demanded in order to integrate and accommodate the necessary IP modules. Table 12 gives the performance results of the platform DMA video frame transfer feature.

Different DMA burst sizes result in different DMA data transfer rates. In our case, the maximum DMA burst size is defined to accommodate a whole CIF 4:2:0 video frame, that is, 38,016 DWORDs for each DMA data transfer buffer. Accordingly, the DMA transfer results verify that it takes an average of only approximately 2 milliseconds to transfer a whole CIF 4:2:0 video frame on the WildCard-4. This transfer performance can sufficiently support up to the level 4 bitstream rate for the H.264 BP video encoding system.
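The stated burst size follows directly from the 4:2:0 sampling structure:

\[
352 \times 288 + 2 \times (176 \times 144) = 152{,}064\ \text{bytes} = 38{,}016\ \text{DWORDs}.
\]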

7.5. Overall Encoding Performance. In view of the complexity analysis of H.264 video tasks described in Section 2, the most time consuming task is motion estimation. Other encoding tasks have much less overhead. Therefore, the video tasks can be scheduled to operate in parallel and pipelining stages, as displayed in Figures 9 and 10, for the proposed architecture


Table 13: An overall performance comparison for H.264 BP video encoding systems.

| Implementation | [37] | [38] | [39] | Proposed architecture |
| Architecture | ASIC | Codesign | Codesign | Codesign (extensible multiple processing cores) |
| ME Algorithm | Full Search (FS) | Full Search (FS) | Hexagon (HEX) | ACQPPS |
| Freq. [MHz] | 144 | 100 | 81 | 75 |
| Max fps of CIF | 272.73 | 5.125 | 18.6 | 120 |
| Min Freq. [MHz] for CIF 30 fps | 15.84 | 585 | 130.65 | 18.75 |
| Core Voltage Supply | 1.2 V | 1.2 V | 1.2 V | 1.2 V |
| I/O Voltage Supply | 1.8/2.5/3.3 V | 1.8/2.5/3.3 V | 2.5/3.3 V | 2.5/3.3 V |

model. In this case, the overall encoding time for a video sequence is approximately equal to the following:

Encoding time = Total motion estimation time
    + Processing time of DCT/Q for the last block
    + max{Processing time of IDCT/Q−1 + MC + Deblocking for the last block,
          Processing time of CAVLC for the last block}.    (8)

The processing time of DCT/Q, IDCT/Q−1, MC, Deblocking Filter, and CAVLC for a divided block directly depends on the architecture design of each module. On average, the overhead of those video tasks for encoding an individual block is much less than that of motion estimation. As a whole, the encoding time contributed by those video tasks for the last block can even be ignored when compared with the total processing time of the motion estimator for a whole video sequence. Therefore, to simplify the overall encoding performance analysis for the proposed architecture model, the total encoding overhead of the system architecture for a video sequence can be approximately regarded as

Encoding time ≈ Total motion estimation time. (9)

This simplified analysis of the system encoding performance is valid as long as the video tasks operate in concurrent and pipelined stages with efficient optimization techniques. Accordingly, when the proposed ACQPPS motion estimator is integrated into the system architecture to perform the motion search, the overall encoding performance for the proposed architecture model can be generalized.

A performance comparison is presented in Table 13, where the proposed architecture is compared with some other recently developed H.264 BP video encoding systems [37–39], including both fully dedicated hardware and codesign architectures. The results indicate that this proposed system-on-platform architecture, when integrated with the IP accelerators, can yield a very good performance which is comparable to or even better than other H.264 video encoding systems. In particular, when compared with other codesign architectures, the proposed system has much higher encoding throughput, about 30 and 6 times higher than the processing abilities of the architectures presented in [38, 39], respectively. The high performance of the proposed architecture derives directly from the efficient ACQPPS motion estimation architecture and the techniques employed for the system optimizations.

8. Conclusions

An integrated reconfigurable hardware-software codesign IP accelerated system-on-platform architecture is proposed in this paper. The efficient virtual socket interface and optimization approaches for hardware realization have been presented. The system architecture is flexible for the host interface control and extensible with multiple cores, which can construct a useful integrated and embedded system approach for dedicated functions.

An advanced application of this proposed architecture is to facilitate the development of an H.264 video encoding system. As motion estimation is the most complicated and important task in a video encoder, a novel block-based adaptive motion estimation search algorithm, ACQPPS, and its hardware architecture are developed to reduce the complexity to an extremely low level while keeping the encoding performance, in terms of PSNR and bit rate, as high as possible. It is beneficial to integrate video IP accelerators, especially the ACQPPS motion estimator, into the architecture framework to improve the overall encoding performance. The proposed system architecture is mapped onto an integrated FPGA device, WildCard-4, toward an implementation of a simplified H.264 BP video encoder.

In practice, with the proposed system architecture, the realization of multistandard video codecs beyond H.264 video applications can be greatly facilitated and efficiently verified. It can be expected that the advantages of the proposed architecture will become even more desirable for prototyping future video encoding systems, as new video standards emerge continually, for example, the coming H.265 draft.

Acknowledgment

The authors would like to thank the Alberta Informatics Circle of Research Excellence (iCore), Xilinx Inc., the Natural Science and Engineering Research Council of Canada (NSERC), the Canada Foundation for Innovation (CFI), and the Department of Electrical and Computer Engineering at the University of Calgary for their support.


References

[1] M. Tekalp, Digital Video Processing, Signal Processing Series, Prentice Hall, Englewood Cliffs, NJ, USA, 1995.

[2] "Information technology—generic coding of moving pictures and associated audio information: video," ISO/IEC 13818-2, September 1995.

[3] "Video Coding for Low Bit Rate Communication," ITU-T Recommendation H.263, March 1996.

[4] "Coding of audio-visual objects—part 2: visual, amendment 1: visual extensions," ISO/IEC 14496-4/AMD 1, April 1999.

[5] Joint Video Team of ITU-T and ISO/IEC JTC 1, "Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264 — ISO/IEC 14496-10 AVC)," JVT-G050r1, May 2003; JVT-K050r1 (non-integrated form) and JVT-K051r1 (integrated form), March 2004; Fidelity Range Extensions JVT-L047 (non-integrated form) and JVT-L050 (integrated form), July 2004.

[6] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.

[7] S. Wenger, "H.264/AVC over IP," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 645–656, 2003.

[8] B. Zeidman, Designing with FPGAs and CPLDs, Publishers Group West, Berkeley, Calif, USA, 2002.

[9] S. Notebaert and J. D. Cock, Hardware/Software Co-design of the H.264/AVC Standard, Ghent University, White Paper, 2004.

[10] W. Staehler and A. Susin, IP Core for an H.264 Decoder SoC, Universidade Federal do Rio Grande do Sul (UFRGS), White Paper, October 2008.

[11] R. Chandra, IP-Reuse and Platform Base Designs, STMicroelectronics Inc., White Paper, February 2002.

[12] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, "Rate-constrained coder control and comparison of video coding standards," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 688–703, 2003.

[13] J. Ostermann, J. Bormans, P. List, et al., "Video coding with H.264/AVC: tools, performance, and complexity," IEEE Circuits and Systems Magazine, vol. 4, no. 1, pp. 7–28, 2004.

[14] M. Horowitz, A. Joch, F. Kossentini, and A. Hallapuro, "H.264/AVC baseline profile decoder complexity analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 704–716, 2003.

[15] S. Saponara, C. Blanch, K. Denolf, and J. Bormans, "The JVT advanced video coding standard: complexity and performance analysis on a tool-by-tool basis," in Proceedings of the Packet Video Workshop (PV '03), Nantes, France, April 2003.

[16] J. R. Jain and A. K. Jain, "Displacement measurement and its application in interframe image coding," IEEE Transactions on Communications, vol. 29, no. 12, pp. 1799–1808, 1981.

[17] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion compensated interframe coding for video conferencing," in Proceedings of the IEEE National Telecommunications Conference (NTC '81), vol. 4, pp. 1–9, November 1981.

[18] R. Li, B. Zeng, and M. L. Liou, "A new three-step search algorithm for block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, pp. 438–442, 1994.

[19] L.-M. Po and W.-C. Ma, "A novel four-step search algorithm for fast block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 3, pp. 313–317, 1996.

[20] L.-K. Liu and E. Feig, "A block-based gradient descent search algorithm for block motion estimation in video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 4, pp. 419–421, 1996.

[21] S. Zhu and K. K. Ma, "A new diamond search algorithm for fast block-matching motion estimation," in Proceedings of the International Conference on Information, Communications and Signal Processing (ICICS '97), vol. 1, pp. 292–296, Singapore, September 1997.

[22] C. Zhu, X. Lin, and L.-P. Chau, "Hexagon-based search pattern for fast block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 5, pp. 349–355, 2002.

[23] J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim, "A novel unrestricted center-biased diamond search algorithm for block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 4, pp. 369–377, 1998.

[24] Y. Nie and K.-K. Ma, "Adaptive rood pattern search for fast block-matching motion estimation," IEEE Transactions on Image Processing, vol. 11, no. 12, pp. 1442–1449, 2002.

[25] H. C. Tourapis and A. M. Tourapis, "Fast motion estimation within the H.264 codec," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '03), vol. 3, pp. 517–520, Baltimore, Md, USA, July 2003.

[26] A. M. Tourapis, "Enhanced predictive zonal search for single and multiple frame motion estimation," in Visual Communications and Image Processing, vol. 4671 of Proceedings of SPIE, pp. 1069–1079, January 2002.

[27] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, "Fast integer pel and fractional pel motion estimation for AVC," JVT-F016, December 2002.

[28] K. Suhring, "H.264 JM Reference Software v.15.0," September 2008, http://iphome.hhi.de/suehring/tml/download.

[29] Annapolis Micro Systems, "Wildcard™-4 Reference Manual," 12968-000 Revision 3.2, December 2005.

[30] Xilinx Inc., "Virtex-4 User Guide," UG070 (v2.3), August 2007.

[31] Xilinx Inc., "XtremeDSP for Virtex-4 FPGAs User Guide," UG073 (v2.1), December 2005.

[32] A. M. Tourapis, O. C. Au, and M. L. Liou, "Predictive motion vector field adaptive search technique (PMVFAST)—enhanced block based motion estimation," in Proceedings of the IEEE Visual Communications and Image Processing (VCIP '01), pp. 883–892, January 2001.

[33] Y.-W. Huang, T.-C. Wang, B.-Y. Hsieh, and L.-G. Chen, "Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '03), vol. 2, pp. 796–798, May 2003.

[34] M. Kim, I. Hwang, and S. Chae, "A fast VLSI architecture for full-search variable block size motion estimation in MPEG-4 AVC/H.264," in Proceedings of the IEEE Asia and South Pacific Design Automation Conference, vol. 1, pp. 631–634, January 2005.

[35] J.-F. Shen, T.-C. Wang, and L.-G. Chen, "A novel low-power full-search block-matching motion-estimation design for H.263+," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 7, pp. 890–897, 2001.

[36] S. Y. Yap and J. V. McCanny, "A VLSI architecture for advanced video coding motion estimation," in Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP '03), vol. 1, pp. 293–301, June 2003.

[37] S. Mochizuki, T. Shibayama, M. Hase, et al., "A 64 mW high picture quality H.264/MPEG-4 video codec IP for HD mobile applications in 90 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, no. 11, pp. 2354–2362, 2008.

[38] R. R. Colenbrander, A. S. Damstra, C. W. Korevaar, C. A. Verhaar, and A. Molderink, "Co-design and implementation of the H.264/AVC motion estimation algorithm using co-simulation," in Proceedings of the 11th IEEE EUROMICRO Conference on Digital System Design Architectures, Methods and Tools (DSD '08), pp. 210–215, September 2008.

[39] Z. Li, X. Zeng, Z. Yin, S. Hu, and L. Wang, "The design and optimization of H.264 encoder based on the Nexperia platform," in Proceedings of the 8th IEEE International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD '07), vol. 1, pp. 216–219, July 2007.

[40] S. Winkler and F. Dufaux, "Video quality evaluation for mobile applications," in Visual Communications and Image Processing, vol. 5150 of Proceedings of SPIE, pp. 593–603, Lugano, Switzerland, July 2003.

[41] M. Ries, O. Nemethova, and M. Rupp, "Motion based reference-free quality estimation for H.264/AVC video streaming," in Proceedings of the 2nd International Symposium on Wireless Pervasive Computing (ISWPC '07), pp. 355–359, February 2007.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 893897, 16 pages
doi:10.1155/2009/893897

Research Article

FPSoC-Based Architecture for a Fast Motion Estimation Algorithm in H.264/AVC

Obianuju Ndili and Tokunbo Ogunfunmi

Department of Electrical Engineering, Santa Clara University, Santa Clara, CA 95053, USA

Correspondence should be addressed to Tokunbo Ogunfunmi, [email protected]

Received 21 March 2009; Revised 18 June 2009; Accepted 27 October 2009

Recommended by Ahmet T. Erdogan

There is an increasing need for high quality video on low power, portable devices. Possible target applications range from entertainment and personal communications to security and health care. While H.264/AVC answers the need for high quality video at lower bit rates, it is significantly more complex than previous coding standards and thus results in greater power consumption in practical implementations. In particular, motion estimation (ME) in H.264/AVC consumes the largest power in an H.264/AVC encoder. It is therefore critical to speed up integer ME in H.264/AVC via fast motion estimation (FME) algorithms and hardware acceleration. In this paper, we present our hardware oriented modifications to a hybrid FME algorithm, our architecture based on the modified algorithm, and our implementation and prototype on a PowerPC-based Field Programmable System on Chip (FPSoC). Our results show that the modified hybrid FME algorithm on average outperforms previous state-of-the-art FME algorithms, while its losses when compared with FSME, in terms of PSNR performance and computation time, are insignificant. We show that although our implementation platform is FPGA-based, our implementation results compare favourably with previous architectures implemented on ASICs. Finally we also show an improvement over some existing architectures implemented on FPGAs.

Copyright © 2009 O. Ndili and T. Ogunfunmi. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Motion estimation (ME) is by far the most powerful compression tool in the H.264/AVC standard [1, 2], and it is generally carried out in two stages: integer-pel, then fractional-pel as a refinement of the integer-pel search. ME in H.264/AVC features variable block sizes, quarter-pixel accuracy for the luma component (one-eighth pixel accuracy for the chroma component), and multiple reference pictures. However, the power of ME in H.264/AVC comes at the price of increased encoding time. Experimental results [3, 4] have shown that ME can consume up to 80% of the total encoding time of H.264/AVC, with integer ME consuming the greater proportion. In order to meet real-time and low power constraints, it is desirable to speed up the ME process. Two approaches to ME speed-up include designing fast ME algorithms and accelerating ME in hardware.

Considering the algorithm approach, there are traditional, single-search fast algorithms such as new three-step search (NTSS) [5], four-step search (4SS) [6], and diamond search (DS) [7]. However, these algorithms were developed for a fixed block size and cannot efficiently support variable block size ME (VBSME) for H.264/AVC. In addition, while these algorithms are good for small search ranges and low resolution video, at higher definition for some high motion sequences such as "Stefan," these algorithms can drop into a local minimum in the early stages of the search process [4]. In order to have more robust fast algorithms, some hybrid fast algorithms that combine earlier single-search techniques have been proposed. One such algorithm was proposed by Yi et al. [8, 9]: a fast ME algorithm known variously as the Simplified Unified Multi-Hexagon (SUMH) search or Simplified Fast Motion Estimation (SFME) algorithm. SUMH is based on UMHexagonS [4], a hybrid fast motion estimation algorithm. Yi et al. show in [8] that with similar or


even better rate-distortion performance, SUMH reduces ME time by about 55% and 94% on average when compared with UMHexagonS and Fast Full Search, respectively. In addition, SUMH yields a bit rate reduction of up to 18% when compared with Full Search in low complexity mode. Both SUMH and UMHexagonS are nonnormative parts of the H.264/AVC standard.

Considering ME speed-up via hardware acceleration, although there has been some previous work on VLSI architectures for VBSME in H.264/AVC, the overwhelming majority of these works have been based on the Full Search Motion Estimation (FSME) algorithm. This is because FSME presents a regular-patterned search window, which in turn provides good candidate-level data reuse (DR) with regular searching flows. Good candidate-level DR results in a reduction of data access power. Power consumption for an integer ME module mainly comes from two parts: data access power to read reference pixels from local memories, and computational power consumed by the processing elements. For FSME, the data access power is reduced because the reference pixels of neighbouring candidates overlap considerably. On the other hand, because of the exhaustive search done in FSME, the computational complexity, and thus the power consumed by the processing elements, is large.

Several low-power integer ME architectures with corresponding fast algorithms were designed for standards prior to H.264/AVC [10–13]. However, these architectures do not support H.264/AVC. Additionally, because the irregular searching flows of fast algorithms usually lead to poor intercandidate DR, the power reduction at the algorithm level is usually constrained by the power reduction at the architecture level. There is therefore an urgent need for architectures with hardware oriented fast algorithms for portable systems implementing H.264/AVC [14]. Note also that because the data flow of FME is very similar to that of fractional-pel search, some hardware reuse can be achieved [15].

For H.264/AVC, previous works on architectures for fast motion estimation (FME) [14–18] have been based on diverse FME algorithms.

Rahman and Badawy in [16] and Byeon et al. in [17] base their works on UMHexagonS. In [14], Chen et al. propose a parallel, content-adaptive, variable block size 4SS algorithm, upon which their architecture is based. In [15], Zhang and Gao base their architecture on the following search sequence: Diamond Search (DS), Cross Search (CS), and finally fractional-pel ME.

In this paper, we base our architecture on SUMH, which has been shown in [8] to outperform UMHexagonS. We present hardware oriented modifications to SUMH. We show that the modified SUMH has a better PSNR performance than that of the parallel, content-adaptive, variable block size 4SS proposed in [14]. In addition, our results (see Section 2) show that for the modified SUMH, the average PSNR loss is 0.004 dB to 0.03 dB when compared with FSME, while when compared with SUMH, most of the sequences show an average improvement of up to 0.02 dB, and two of the sequences show an average loss of 0.002 dB. Thus in general, there is an improvement over SUMH. In terms of percentage computational time savings, while SUMH saves 88.3% to 98.8% when compared with FSME, the modified SUMH saves 60.0% to 91.7% when compared with FSME. Finally, in terms of percentage bit rate increase, when compared with FSME, the modified SUMH shows a bit rate improvement (decrease in bit rate) of 0.02% in the sequence "Coastguard." The worst bit rate increase is in "Foreman," and that is 1.29%. When compared with SUMH, there is a bit rate improvement of 0.03% to 0.34%.

The rest of this paper is organized as follows. In Section 2 we summarize integer-pel motion estimation in SUMH and present the hardware oriented SUMH along with simulation results. In Section 3 we briefly present our proposed architecture based on the modified SUMH. We also present our implementation results as well as comparisons with prior works. In Section 4 we present our prototyping efforts on the XUPV2P development board. This board contains an XC2VP30 Virtex-II Pro FPGA with two hardwired PowerPC 405 processors. Finally, our conclusions are presented in Section 5.

2. Motion Estimation Algorithm

2.1. Integer-Pel SUMH Algorithm. H.264/AVC uses block matching for motion vector search. Integer-pel motion estimation uses the sum of absolute differences (SAD) as its matching criterion. The mathematical expression for SAD is given in

\[
\mathrm{SAD}(d_x, d_y) = \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \bigl| a(x, y) - b(x + d_x,\ y + d_y) \bigr|, \tag{1}
\]

\[
(\mathrm{MV}_x, \mathrm{MV}_y) = (d_x, d_y)\big|_{\min\, \mathrm{SAD}(d_x, d_y)}. \tag{2}
\]

In (1), a(x, y) and b(x, y) are the pixels of the current and candidate blocks, respectively. (dx, dy) is the displacement of the candidate block within the search window. X × Y is the size of the current block. In (2), (MVx, MVy) is the motion vector of the best matching candidate block.
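To make (1) and (2) concrete, a minimal C sketch of block-matching SAD follows; the function names, the flat row-stride frame layout, and the omission of search-window bounds checks are simplifying assumptions for illustration only.

```c
#include <stdint.h>
#include <stdlib.h>

/* SAD of an X-by-Y block at displacement (dx, dy), per (1).
 * a: current frame, b: reference frame, both with row stride `stride`. */
static uint32_t sad(const uint8_t *a, const uint8_t *b, int stride,
                    int X, int Y, int dx, int dy)
{
    uint32_t s = 0;
    for (int y = 0; y < Y; y++)
        for (int x = 0; x < X; x++)
            s += abs((int)a[y * stride + x] -
                     (int)b[(y + dy) * stride + (x + dx)]);
    return s;
}

/* Exhaustive illustration of (2): keep the displacement whose SAD is
 * minimal over a +/-range search window (bounds checks omitted). */
static void best_mv(const uint8_t *a, const uint8_t *b, int stride,
                    int X, int Y, int range, int *mvx, int *mvy)
{
    uint32_t best = UINT32_MAX;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            uint32_t s = sad(a, b, stride, X, Y, dx, dy);
            if (s < best) { best = s; *mvx = dx; *mvy = dy; }
        }
}
```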

H.264/AVC features seven interprediction block sizes, which are 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4. These are referred to as block modes 1 to 7. An up layer block is a block that contains sub-blocks. For example, mode 5 or 6 is the up layer of mode 7, and mode 4 is the up layer of mode 5 or 6.

SUMH [8] utilizes five key steps for intensive-search, integer-pel motion estimation. They are cross search, hexagon search, multi big hexagon search, extended hexagon search, and extended diamond search. For motion vector (MV) prediction, SUMH uses the spatial median and up layer predictors, while for SAD prediction, the up layer predictor is used. In median MV prediction, the median value of the adjacent blocks on the left, top, and top-right (or top-left) of the current block is used to predict the


MV of the current block. The complete flow chart of the integer-pel motion vector search in SUMH is shown in Figure 1.

The convergence and intensive search conditions are determined by arbitrary thresholds shifted by a blocktype shift factor. The blocktype shift factor specifies the number of bits to shift to the right in order to get the corresponding thresholds for different block sizes. There are 8 blocktype shift factors corresponding to 8 block modes: 1 dummy block mode and the 7 block modes in H.264/AVC. The 8 block modes are 16×16 (dummy), 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4. The array of 8 blocktype shift factors corresponding, respectively, to these 8 block modes is given in

blocktype shift factor = {0, 0, 1, 1, 2, 3, 3, 1}. (3)

The convergence search condition is described in pseudocode in

    min_mcost < (ConvergeThreshold >> blocktype_shift_factor[blocktype]),    (4)

where min_mcost is the minimum motion vector cost. The intensive search condition is described in pseudocode in

    (blocktype == 1 &&
     min_mcost > (CrossThreshold1 >> blocktype_shift_factor[blocktype]))
    || (min_mcost > (CrossThreshold2 >> blocktype_shift_factor[blocktype])),    (5)

where the thresholds are empirically set as follows: ConvergeThreshold = 1000, CrossThreshold1 = 800, and CrossThreshold2 = 7000.
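For reference, the tests in (3)–(5) can be transcribed almost directly into C; this sketch assumes min_mcost and blocktype are supplied by the surrounding encoder loop, and uses underscored identifiers in place of the spaced names above.

```c
/* Early-termination tests of SUMH as given in (3)-(5). */
static const int blocktype_shift_factor[8] = { 0, 0, 1, 1, 2, 3, 3, 1 };

enum {
    ConvergeThreshold = 1000,
    CrossThreshold1   = 800,
    CrossThreshold2   = 7000
};

static int convergence_condition(int min_mcost, int blocktype)
{
    return min_mcost <
           (ConvergeThreshold >> blocktype_shift_factor[blocktype]);
}

static int intensive_search_condition(int min_mcost, int blocktype)
{
    return (blocktype == 1 &&
            min_mcost > (CrossThreshold1 >> blocktype_shift_factor[blocktype]))
        || (min_mcost > (CrossThreshold2 >> blocktype_shift_factor[blocktype]));
}
```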

2.2. Hardware Oriented SUMH Algorithm. The goal of our hardware oriented modification is to make SUMH less sequential without incurring performance losses or increases in the computation time.

The sequential nature of SUMH arises from the fact that there are a lot of data dependencies. The most severe data dependency arises during the up layer predictor search step. This dependency forces the algorithm to sequentially and individually conduct the search for the 41 possible SADs in a 16 × 16 macroblock. The sequence begins with the 16 × 16 macroblock and then computes the SADs of the subblocks in each quadrant of the 16 × 16 macroblock. Performing the algorithm in this manner consumes a lot of computational time and power, yet its rate-distortion benefits can still be obtained in a parallel implementation. In our modification, we skip this search step.

The decision control structures in SUMH are another feature that makes the algorithm unsuitable for hardware implementation. In a parallel and pipelined implementation, these structures would require that the pipeline be flushed at random times. This in turn wastes clock cycles and adds overhead to the hardware's control circuit.

In our modification, we always consider the convergence condition not satisfied and the intensive search condition satisfied. This removes the decision control structures that make SUMH unsuitable for parallel processing. Another effect of this modification is that we expect to have a better rate-distortion performance. On the other hand, the expected disadvantage of this modification is an increase in computation time. However, as shown by our complexity analysis and results, this increase is minimal and is easily compensated for by hardware acceleration.

Further modifications we make to SUMH are the removal of the small local search steps and the convergence search step.

Our modifications to SUMH allow us to process in parallel all the candidate macroblocks (MBs) for one current macroblock (CMB). We use the so-called HF3V2 2-stitched zigzag scan proposed in [19] in order to satisfy the data dependencies between CMBs. These data dependencies arise because of the side information used to predict the MV of the CMB. Note that if we desire to process several CMBs in parallel, we will need to set the value of the MV predictor to the zero displacement MV, that is, MV = (0, 0). Experiments in [20–22], as well as our own experiments [23], show that when the search window is centered around MV = (0, 0), the average PSNR loss is less than 0.2 dB compared with when the median MV is also used. Figure 2 shows the complete flow chart of the modified integer-pel SUMH.

2.3. Complexity Analysis of the Motion Estimation Algorithms. We consider a search range s. The number of search points to be examined by the FSME algorithm is directly proportional to the square of the search range: there are (2s + 1)² search points. Thus the algorithm complexity of Full Search is O(s²).

We obtain the algorithm complexity of the modified SUMH algorithm by considering the algorithm complexity of each of its search steps as follows.

(1) Cross search: there are s search points both horizontally and vertically, yielding a total of 2s search points. Thus the algorithm complexity of this search step is O(2s).

(2) Hexagon and extended hexagon search: there are 6 search points each in both of these search steps, yielding a total of 12 search points. Thus the algorithm complexity of this search step is constant, O(1).

(3) Multi-big hexagon search: there are (1/4)s hexagons with 16 search points per hexagon. This yields a total of 4s search points. Thus the algorithm complexity of this search step is O(4s).

(4) Diamond search: there are 4 search points in this search step. Thus the algorithm complexity of this search step is constant, O(1).

Therefore, in total there are 1 + 2s + 12 + 4 + 4s search points in the modified SUMH, and its algorithm complexity is O(6s).
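As a quick check of this count, for s = 16 it gives

\[
1 + 2(16) + 12 + 4 + 4(16) = 113
\]

search points, matching the modified SUMH entry in Table 1.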

In order to obtain the algorithm complexity of SUMH, we consider its worst case complexity, even though the


[Figure 1: Flow chart of integer-pel search in SUMH. Start by checking the predictors; if the convergence condition is satisfied, perform a small local search and stop. Otherwise, if the intensive search condition is satisfied, perform cross search, hexagon search, and multi big hexagon search; then perform the up layer predictor search, small local search, and extended hexagon search. If the convergence condition is still not satisfied, perform the extended diamond search; finally perform the convergence search and stop.]

[Figure 2: Flow chart of modified integer-pel search: start (check center and median MV predictor) → cross search → hexagon search → multi big hexagon search → extended hexagon search → extended diamond search → stop.]

Table 1: Complexity of algorithms in million operations per second (MOPS).

| Algorithm | Number of search points for search range s = ±16 | Number of MOPS for CIF video at 30 Hz |
| FSME | 1089 | 17103 |
| Best case SUMH | 5 | 78 |
| Worst case SUMH | 127 | 1995 |
| Median case SUMH | 66 | 1037 |
| Modified SUMH | 113 | 1775 |

algorithm may terminate much earlier. The worst case complexity of SUMH is similar to that of the modified SUMH, except that it adds 14 more search points. This number is obtained by adding 4 search points each for the 2 small local searches and 1 convergence search, and 2 search points for the worst case up layer predictor search. Thus for the worst case SUMH, there are in total 14 + 1 + 2s + 12 + 4 + 4s search points, and its algorithm complexity is O(6s). Note that in the best case, SUMH has only 5 search points: 1 for the initial search candidate and 4 for the convergence search.

Another way to define the complexity of each algorithm is in terms of the number of required operations. We can then express the complexity as Million Operations Per Second (MOPS). To compare the algorithms in terms of MOPS we assume the following.

(1) The macroblock size is 16 × 16.

(2) The SAD cost function requires 2 × 16 × 16 data loads, 16 × 16 = 256 subtraction operations, 256 absolute operations, 256 accumulate operations, 41 compare operations, and 1 data store operation. This yields a total of 1322 operations for one SAD computation.

(3) CIF resolution is 352 × 288 pixels = 396 macroblocks.

(4) The frame rate is 30 frames per second.

(5) The total number of operations required to encode CIF video in real time is 1322 × 396 × 30 × za, where za is the number of search points for each algorithm.

Thus there are 15.7 × za MOPS per algorithm, where one OP (operation) is the amount of computation it takes to obtain one SAD value.
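As a worked example: 1322 × 396 × 30 ≈ 15.7 × 10⁶ operations per search point per second, so for the modified SUMH at s = ±16, za = 113 gives 15.7 × 113 ≈ 1775 MOPS, in agreement with Table 1.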

In Table 1 we compare the computational complexities of the considered algorithms in terms of MOPS. As expected, FSME requires the largest number of MOPS. The number of MOPS required for the modified SUMH is about 10% less than that required for the worst case SUMH and about 40% more than that required for the median case SUMH.

2.4. Performance Results for the Modified SUMH Algorithm. Our experiments are done in JM 13.2 [24]. We use the following standard test sequences: "Stefan" (large motion), "Foreman" and "Coastguard" (large to moderate motion), and "Silent" (small motion). We chose these sequences because we consider them extreme cases in the spectrum of low bit-rate video applications. We also use the following


Table 2: Simulation conditions.

| Sequence | Quantization parameter | Search range | Frame size | No. of frames |
| Foreman | 22, 25, 28, 31, 33, 35 | 32 | CIF | 100 |
| Mother-daughter | 22, 25, 28, 31, 33, 35 | 32 | CIF | 150 |
| Stefan | 22, 25, 28, 31, 33, 35 | 16 | CIF | 90 |
| Flower | 22, 25, 28, 31, 33, 35 | 16 | CIF | 150 |
| Coastguard | 18, 22, 25, 28, 31, 33 | 32 | QCIF | 220 |
| Carphone | 18, 22, 25, 28, 31, 33 | 32 | QCIF | 220 |
| Silent | 18, 22, 25, 28, 31, 33 | 16 | QCIF | 220 |

Table 3: Comparison of speed-up ratios with full search.

(Each cell gives SUMH / modified SUMH.)

| Sequence | QP 18 | QP 22 | QP 25 | QP 28 | QP 31 | QP 33 | QP 35 |
| Foreman | N/A | 48.55 / 8.16 | 41.55 / 6.86 | 32.68 / 5.66 | 25.87 / 4.77 | 21.68 / 4.23 | 19.11 / 3.74 |
| Stefan | N/A | 15.35 / 4.62 | 13.16 / 4.21 | 12.20 / 3.93 | 10.67 / 3.50 | 10.05 / 3.23 | 8.96 / 3.06 |
| Mother-daughter | N/A | 16.63 / 2.49 | 19.31 / 2.72 | 21.56 / 3.01 | 28.63 / 3.47 | 35.43 / 4.20 | 43.90 / 5.08 |
| Flower | N/A | 9.73 / 3.07 | 10.72 / 3.29 | 11.32 / 3.49 | 12.94 / 3.78 | 13.77 / 4.02 | 15.02 / 4.21 |
| Coastguard | 86.34 / 12.06 | 70.12 / 10.31 | 58.05 / 9.01 | 43.62 / 7.98 | 36.04 / 6.80 | 30.10 / 6.13 | N/A |
| Silent | 21.86 / 3.54 | 16.74 / 3.18 | 13.17 / 2.99 | 11.90 / 2.82 | 9.29 / 2.66 | 8.56 / 2.64 | N/A |
| Carphone | 24.67 / 4.14 | 29.44 / 4.62 | 37.12 / 5.38 | 46.97 / 6.02 | 53.97 / 7.07 | 64.07 / 8.82 | N/A |

Table 4: Comparison of percentage time savings with full search.

(Each cell gives SUMH / modified SUMH.)

| Sequence | QP 18 | QP 22 | QP 25 | QP 28 | QP 31 | QP 33 | QP 35 |
| Foreman | N/A | 97.94 / 87.75 | 97.59 / 85.43 | 96.94 / 82.34 | 96.13 / 79.04 | 95.38 / 76.36 | 94.76 / 73.31 |
| Stefan | N/A | 93.48 / 78.38 | 92.40 / 76.29 | 91.80 / 74.61 | 90.63 / 71.46 | 90.05 / 69.05 | 88.83 / 67.35 |
| Mother-daughter | N/A | 93.98 / 60.00 | 94.82 / 63.34 | 95.36 / 66.85 | 96.50 / 71.22 | 97.17 / 76.21 | 97.72 / 80.35 |
| Flower | N/A | 89.72 / 67.45 | 90.67 / 69.62 | 91.16 / 71.37 | 92.27 / 73.56 | 92.71 / 75.14 | 93.34 / 76.27 |
| Coastguard | 98.84 / 91.71 | 98.57 / 90.30 | 98.27 / 88.91 | 97.70 / 87.47 | 97.22 / 85.29 | 96.67 / 83.70 | N/A |
| Silent | 95.42 / 71.77 | 94.02 / 68.62 | 92.40 / 66.61 | 91.60 / 64.56 | 89.23 / 62.47 | 88.32 / 62.20 | N/A |
| Carphone | 95.94 / 75.87 | 96.60 / 78.36 | 97.30 / 81.41 | 97.87 / 83.41 | 98.14 / 85.87 | 98.43 / 88.66 | N/A |

sequences: "Mother-daughter" (small motion, talking head and shoulders), "Flower" (large motion with camera panning), and "Carphone" (large motion). The sequences are coded at 30 Hz. The picture sequence is IPPP with the I-frame refresh rate set at every 15 frames. We consider 1 reference frame. The rest of our simulation conditions are summarized in Table 2.

Figure 3 shows curves that compare the rate-distortion efficiencies of Full Search ME, SUMH, and the modified SUMH. Figure 4 shows curves that compare the rate-distortion efficiencies of Full Search ME and the single- and multiple-iteration parallel content-adaptive 4SS of [14]. In

Tables 3 and 4, we show a comparison of the speed-up ratios of SUMH and the modified SUMH. Table 5 shows the average percentage bit rate increase of the modified SUMH when compared with Full Search ME and SUMH. Finally, Table 6 shows the average Y-PSNR loss of the modified SUMH when compared with Full Search ME and SUMH.

From Figures 3 and 4, we see that the modified SUMH has a better rate-distortion performance than the proposed parallel content-adaptive 4SS of [14], even under smaller search ranges. In Section 3 we show comparisons of our supporting architecture with the supporting architecture


[Figure 3: Comparison of rate-distortion efficiencies for the modified SUMH. Four R-D curves of Y-PSNR (dB) versus bit rate (kbps) comparing Full Search, SUMH, and the modified SUMH: (a) Stefan (CIF, SR = 16), (b) Foreman (CIF, SR = 32), (c) Silent (QCIF, SR = 16), (d) Coastguard (QCIF, SR = 32); 1 reference frame, IPPP.]

proposed in [14]. Note, though, that the architecture in [14] is implemented on an ASIC (TSMC 0.18-μm 1P6M technology), while our architecture is implemented on an FPGA.

From Figure 3 and Table 6 we also observe that the largest PSNR losses occur in the "Foreman" sequence, while the least PSNR losses occur in "Silent." This is because the "Foreman" sequence has both high local object motion and greater high-frequency content. It therefore performs the worst under a given bit rate constraint. On the other hand, "Silent" is a low motion sequence. It therefore performs much better under the same bit rate constraint.

Given the tested frames from Table 2 for each sequence, we observe additionally from Table 6 that Full Search performs better than the modified SUMH for sequences with larger local object (foreground) motion but little or no background motion. These sequences include "Foreman," "Carphone," "Mother-daughter," and "Silent." However, the rate-distortion performance of the modified SUMH improves for sequences with large foreground and background motions. Such sequences include "Flower," "Stefan," and "Coastguard." We therefore suggest that a yet greater improvement in the rate-distortion performance of


[Figure 4: Comparison of rate-distortion efficiencies for the parallel content-adaptive 4SS of [25] (reproduced from [25]). Four R-D curves of PSNR (dB) versus bit rate (kbps) comparing FS, the proposed content-adaptive parallel-VBS 4SS, and the single-iteration parallel-VBS 4SS: (a) Stefan, (b) Foreman, (c) Silent, (d) Coastguard (all CIF, SR = 32, 1 reference frame, IPPP).]

the modified SUMH algorithm can be achieved by improving its local motion estimation.

For Table 3, we define the speed-up ratio as the ratio of the ME coding time of Full Search to the ME coding time of the algorithm under consideration. From Table 3 we see that the speed-up ratio increases as the quantization parameter (QP) decreases. This is because there are fewer skip mode macroblocks as QP decreases. From our results in Table 3, we further calculate the percentage time savings t for ME calculation, according to

\[
t = \left(1 - \frac{1}{r}\right) \times 100, \tag{6}
\]

where r are the data points in Table 3. The percentage time savings obtained are displayed in Table 4. From Table 4, we find that SUMH saves 88.3% to 98.8% in ME computation time compared to Full Search, while the modified SUMH saves 60.0% to 91.7%. Therefore, the modified SUMH does not incur much loss in terms of ME computation time.
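As a worked example: for "Foreman" at QP 22, the modified SUMH speed-up ratio r = 8.16 in Table 3 gives t = (1 − 1/8.16) × 100 ≈ 87.7%, matching the corresponding entry in Table 4.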

In our experiments we set rate-distortion optimization to the high complexity mode (i.e., rate-distortion optimization is turned on), in order to ensure that all of the algorithms compared have a fair chance to yield their highest rate-distortion performance. From Table 5 we find that the

Table 5: Average percentage bit rate increase for modified SUMH.

| Sequence | Compared with Full Search | Compared with SUMH |
| Foreman | 1.29 | −0.04 |
| Stefan | 0.40 | −0.34 |
| Mother-daughter | 0.15 | −0.05 |
| Flower | 0.19 | −0.17 |
| Coastguard | −0.02 | −0.03 |
| Silent | 0.56 | −0.33 |
| Carphone | 0.27 | −0.06 |

average percentage bit rate increase of the modified SUMH is very low. When compared with Full Search, there is a bit rate improvement (decrease in bit rate) in "Coastguard" of 0.02%. The worst bit rate increase is in "Foreman," and that is 1.29%. When compared with SUMH, there is a bit rate improvement (decrease in bit rate) going from 0.04% (in "Coastguard") to 0.34% (in "Stefan").

From Table 6 we see that the average PSNR loss for the modified SUMH is very low. When compared to Full Search, the PSNR loss for the modified SUMH ranges from 0.006 dB to


0.03 dB. When compared to SUMH, most of the sequences show a PSNR improvement of up to 0.02 dB, while two of the sequences show a PSNR loss of 0.002 dB.

Thus in general, the losses when compared with Full Search are insignificant, while on the other hand there is an improvement when compared with SUMH. We therefore conclude that the modified SUMH can be used instead of Full Search ME, without much penalty, for ME in H.264/AVC.

3. Proposed Supporting Architecture

Our top-level architecture for fast integer VBSME is shown in Figure 5. The architecture is composed of search window (SW) memory, current MB memory, an address generation unit (AGU), a control unit, a block of processing units (PUs), an SAD combination tree, a comparison unit, and a register for storing the 41 minimum SADs and their associated motion vectors.

While the current and reference frames are stored off-chip in external memory, the current MB (CMB) data and the search window (SW) data are stored in on-chip, dual-port block RAMs (BRAMs). The SW memory has N 16×16 BRAMs that store N candidate MBs, where N is related to the search range s. N can be chosen to be any factor or multiple of |s| so as to achieve a tradeoff between speed and hardware costs. For example, if we consider a search range of s = ±16, then we can choose N such that N ∈ {. . . , 32, 16, 8, 4, 2, 1}. The AGU generates addresses for blocks being processed.

There are N PUs, each containing 16 processing elements (PEs) in a 1D array. A PU, shown in Figure 6, calculates 16 4×4 SADs for one candidate MB, while a PE, shown in Figure 8, calculates the absolute difference between two pixels, one each from the candidate MB and the current MB. From Figure 6, groups of 4 PEs in the PU calculate 1 column of 4×4 SADs. These are stored via demultiplexing in registers D1–D4, which hold the inputs to the SAD combination tree, one of which is shown in Figure 7. For N PUs there are N SAD combination trees. Each SAD combination tree further combines the 16 4×4 output SADs from one PU to yield a total of 41 SADs per candidate MB. Figure 7 shows that the 16 4×4 SADs are combined such that registers D6 contain 4×8 SADs, D7 contain 8×8 SADs, D8 contain 8×16 SADs, D9 contain 16×8 SADs, D10 contain 8×4 SADs, and finally D11 contains the 16×16 SAD. These SADs are compared appropriately in the comparison unit (CU). The CU consists of 41 N-input comparing elements (CEs). A CE is shown in Figure 9.

3.1. Address Generation Unit. For each of the N MBs being processed simultaneously, the AGU generates the addresses of the top row and the leftmost column of 4×4 sub-blocks. The address of each sub-block is the address of its top left pixel. From the addresses of the top row and leftmost column of 4×4 sub-blocks, we obtain the addresses of all other block partitions in the MB.

The interface of the AGU is fixed, and we parameterize it by the address of the current MB, the search type, and the

Table 6: Average Y-PSNR loss for modified SUMH.

| Sequence | Compared with Full Search | Compared with SUMH |
| Foreman | 0.0290 dB | −0.0065 dB |
| Stefan | 0.0058 dB | −0.0125 dB |
| Mother-daughter | 0.0187 dB | −0.0020 dB |
| Flower | 0.0042 dB | −0.0002 dB |
| Coastguard | 0.0078 dB | 0.0018 dB |
| Silent | 0.0098 dB | 0.0018 dB |
| Carphone | 0.0205 dB | −0.0225 dB |

Table 7: Search passes for modified SUMH.

| Pass | Description |
| 1-2 | Horizontal scan of cross search. Candidate MBs separated by 2 pixels |
| 3-4 | Vertical scan of cross search. Candidate MBs separated by 2 pixels |
| 5 | Hexagon search has 6 search points |
| 6–13 | Multi big hexagon search has (1/4)|s| hexagons, each containing 16 search points |
| 14 | Extended hexagon search has 6 search points |
| 15 | Diamond search has 4 search points |

search pass. The search type is modified SUMH; however, we can expand our architecture to support other types of search, for example, Full Search, and so forth. The search pass depends on the search step and the search range. We show, for instance, in Table 7 that there are 15 search passes for the modified SUMH considering a search range s = ±16. There is a separation of 2 pixels between 2 adjacent search points in the cross search; therefore address generation for search passes 1 to 4 in Table 7 is straightforward. For the remaining search passes 5–15, tables of constant offset values are obtained from the JM reference software [24]. These offset values are the separations in pixels between the minimum MV from the previous search pass and the candidate search point. In general, the affine address equations can be represented by

\[
AE_x = i\,C_x, \qquad AE_y = i\,C_y, \tag{7}
\]

where AE_x and AE_y are the horizontal and vertical addresses of the top left pixel in the MB, i is a multiplier, and C_x and C_y are constants obtained from the JM reference software.
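The following C sketch illustrates this style of address generation. The diamond offset table shown is a hypothetical stand-in for the constant tables taken from the JM reference software, so only the structure, not the values, should be read as the paper's design.

```c
/* Offset-table-driven candidate address generation, per (7). */
typedef struct { int cx, cy; } offset_t;

/* Example: the 4 small-diamond offsets of search pass 15 (illustrative). */
static const offset_t diamond_offsets[4] = {
    { 0, -1 }, { -1, 0 }, { 1, 0 }, { 0, 1 }
};

/* Top-left pixel address of the i-th candidate MB in a pass, relative
 * to the minimum MV of the previous pass (min_x, min_y). */
static void agu_candidate(int min_x, int min_y, int i,
                          const offset_t *tab, int *aex, int *aey)
{
    *aex = min_x + tab[i].cx;   /* horizontal address AEx */
    *aey = min_y + tab[i].cy;   /* vertical address AEy   */
}
```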

3.2. Memory. Figures 10 and 11 show the CMB and search window (SW) memory organization for N = 8 PUs. Both CMB and SW memories are synthesized into BRAMs. Considering a search range of s = ±16, there are 15 search passes for the modified SUMH search flow chart shown in Figure 2. These search passes are shown in Table 7. In each search pass, 8 MBs are processed in parallel; hence the SW memory organization shown in Figure 11. SW memory is 128 bytes wide, and the required memory size is 2048 bytes. For the same search range s = ±16, if FSME were used along with levels A and B data reuse, the SW size would be


[Figure 5: The proposed architecture for fast integer VBSME: SW memory holding candidate MBs 1 to N, the current MB (CMB) memory, the AGU and control unit, PUs 1 to N, the SAD combination tree, a comparison unit of CEs 1 to 41, and a register that stores the minimum 41 SADs and associated MVs, connected to external memory.]

[Figure 6: The architecture of a Processing Unit (PU): 16 PEs in a 1D array feed adder trees whose 4×4 SAD outputs are demultiplexed, under control, into registers D1–D4 (with intermediate registers D0 and D5).]

48 × 48 pixels, that is, 2304 bytes [25]. Thus by using the modified SUMH, we achieve an 11% on-chip memory savings even without a data reuse scheme.

In each clock cycle, we load 64 bits of data. This means that it takes 256 cycles to load data for one search pass and 3840 (256 × 15) cycles to load data for one CMB. Under similar conditions for FSME, it would take 288 clock cycles to load data for one CMB. Thus the ratio of the required memory bandwidth for the modified SUMH to the required memory bandwidth for FSME is 13.3. While this ratio is undesirably high, it is well mitigated by the fact that there

are only 113 search locations for one CMB in the modified SUMH, compared to 1089 search locations for one CMB in FSME. In other words, the amount of computation for one CMB in the modified SUMH is approximately 0.1 of that for FSME. Thus there is an overall power savings in using the modified SUMH instead of FSME.

3.3. Processing Unit. Table 8 shows the pixel data schedule of the N PUs for the cross-search passes. In Table 8 we consider as an illustrative example the cross search and a search range s = ±16, hence the given pixel coordinates.


[Figure 7: SAD combination tree: top and bottom 4×4 SADs held in registers D5 are summed pairwise into 4×8 SADs (D6), 8×8 SADs (D7), 8×16 SADs (D8), 16×8 SADs (D9), 8×4 SADs (D10), and the 16×16 SAD (D11).]

Table 8: Data schedule for processing unit (PU).

| Clock | PU1 | ... | PU8 | Comments |
| 1–16 | (−15, 0)–(0, 0) ... (−15, −15)–(0, −15) | ... | (−1, 0)–(14, 0) ... (−1, −15)–(14, −15) | Search pass 1: left horizontal scan of cross search |
| 17–32 | (1, 0)–(16, 0) ... (1, −15)–(16, −15) | ... | (15, 0)–(30, 0) ... (15, −15)–(30, −15) | Search pass 2: right horizontal scan of cross search |
| 33–48 | (0, 15)–(15, 15) ... (0, 0)–(15, 0) | ... | (0, 1)–(15, 1) ... (0, −14)–(15, −14) | Search pass 3: top vertical scan of cross search |
| 49–64 | (0, −1)–(15, −1) ... (0, −16)–(15, −16) | ... | (0, −15)–(15, −15) ... (0, −30)–(15, −30) | Search pass 4: bottom vertical scan of cross search |

Table 8 shows that it takes 16 cycles to output the 16 4 × 4 SADs from each PU.

3.4. SAD Combination Tree. The data schedule for the SAD combination is shown in Table 9. There are N SAD combination (SC) trees, each processing the 16 4 × 4 SADs output from one PU. It takes 5 cycles to combine the 16 4 × 4 SADs and output the 41 SADs for the 7 interprediction block sizes in H.264/AVC: 1 16 × 16 SAD, 2 16 × 8 SADs, 2 8 × 16 SADs, 4 8 × 8 SADs, 8 8 × 4 SADs, 8 4 × 8 SADs, and 16 4 × 4 SADs.
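As an illustration of this combination step, the short C sketch below merges the 16 raster-ordered 4 × 4 SADs of one candidate into the 41 SADs (16 + 8 + 8 + 4 + 2 + 2 + 1 = 41). It is a behavioral sketch under our reading of the tree, not the authors' RTL.

/* Behavioral sketch: merge 16 raster-ordered 4x4 SADs into the
   41 SADs of the 7 H.264/AVC block sizes. */
static void combine_sads(const int s4[16], int out[41])
{
    int k = 0, i, s8[4];
    for (i = 0; i < 16; i++) out[k++] = s4[i];            /* 16 4x4 */
    for (i = 0; i < 8; i++) {                             /* 8 4x8: vertical pairs */
        int c = i & 3, r = (i >> 2) * 2;
        out[k++] = s4[r * 4 + c] + s4[(r + 1) * 4 + c];
    }
    for (i = 0; i < 8; i++) {                             /* 8 8x4: horizontal pairs */
        int r = i >> 1, c = (i & 1) * 2;
        out[k++] = s4[r * 4 + c] + s4[r * 4 + c + 1];
    }
    for (i = 0; i < 4; i++) {                             /* 4 8x8: quadrants */
        int r = (i >> 1) * 2, c = (i & 1) * 2;
        s8[i] = s4[r * 4 + c] + s4[r * 4 + c + 1]
              + s4[(r + 1) * 4 + c] + s4[(r + 1) * 4 + c + 1];
        out[k++] = s8[i];
    }
    out[k++] = s8[0] + s8[2];  out[k++] = s8[1] + s8[3];  /* 2 8x16 */
    out[k++] = s8[0] + s8[1];  out[k++] = s8[2] + s8[3];  /* 2 16x8 */
    out[k++] = s8[0] + s8[1] + s8[2] + s8[3];             /* 1 16x16 */
}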


Figure 8: Processing element (PE).

Table 9: Data schedule for the SAD combination (SC) units (identical for SC1 through SC8).

Clock 17: 16 4 × 4 SADs
Clock 18: 8 4 × 8 SADs and 8 8 × 4 SADs
Clock 19: 4 8 × 8 SADs
Clock 20: 2 8 × 16 SADs and 2 16 × 8 SADs
Clock 21: 1 16 × 16 SAD

3.5. Comparison Unit. The data schedule for the CU is shown in Table 10. The CU consists of 41 CEs, each processing the N SADs of the same interprediction block size coming from the N PUs. Each CE compares SADs pairwise; it therefore takes log2 N + 1 cycles to output the 41 minimum SADs. Thus, given N = 8, the CU consumes 4 cycles.
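Behaviorally, each CE performs a pairwise minimum reduction; a C sketch of one CE for N = 8 follows (our own illustration, with idx standing in for the MV identifier carried alongside each SAD):

/* Pairwise min-reduction of N = 8 SADs (3 cycles) plus one compare
   against the running minimum (1 cycle): log2(N) + 1 = 4 cycles. */
static void ce_min8(const int sad_in[8], const int idx_in[8],
                    int *min_sad, int *min_idx)
{
    int s[8], x[8], w, i;
    for (i = 0; i < 8; i++) { s[i] = sad_in[i]; x[i] = idx_in[i]; }
    for (w = 4; w >= 1; w >>= 1)          /* one "cycle" per tree level */
        for (i = 0; i < w; i++)
            if (s[i + w] < s[i]) { s[i] = s[i + w]; x[i] = x[i + w]; }
    if (s[0] < *min_sad) { *min_sad = s[0]; *min_idx = x[0]; }
}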

3.6. Summary of Dataflow. The dataflow represented by the data schedules in Tables 8–10 may be summarized by the algorithmic state machine (ASM) chart shown in Figure 12. The ASM chart also represents the mapping of the modified SUMH algorithm of Figure 2 onto our proposed architecture of Figure 5.

In our ASM chart, there are 6 states and 2 decision boxes. The states are labeled S1 to S6, while the decision boxes are labeled Q1 and Q2. In each state box, we provide a summary description of the state as well as its output variables in italic font.

Figure 9: Comparing element (CE).

Figure 10: Data arrangement in the current macroblock (CMB) memory.

Figure 11: Data arrangement in the search window (SW) memory.

From Figure 12 we see that the implementation of the modified SUMH on our proposed architecture IP core starts in state S1, where the motion vector (MV) predictors are checked. This is done by the PowerPC processor, which is part of our SoC prototyping platform (see Section 4). The MV predictors are stored in external memory and accessed from there by the PowerPC processor. The output of state S1 is the MV predictors. In the next state, S2, the minimum MV cost is obtained and mode decision is performed to obtain the right blocktype. This is also done by the PowerPC processor, and the outputs of this state are the minimum MV, its SAD cost, its blocktype, and its address. The minimum MV cost is obtained by minimizing the cost in

Jmotion(m, REF | λmotion) = SAD(dx, dy, REF, m) + λmotion · ( R(m − p) + R(REF) ),   (8)

where m = (mx, my)^T is the current MV being considered, REF denotes the reference picture, λmotion is the Lagrangian multiplier, SAD(dx, dy, REF, m) is the SAD cost obtained as in (1), p = (px, py) is the MV used for the prediction, R(m − p) represents the number of bits used for MV coding, and R(REF) is the number of bits for coding REF.
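A minimal C sketch of evaluating (8) is given below; bits_mv() and bits_ref() are hypothetical rate estimators for the MV difference and the reference index, standing in for the code lengths used by the encoder.

/* Sketch of the Lagrangian MV cost of (8); bits_mv() and bits_ref()
   are hypothetical rate estimators, not real encoder API calls. */
extern int bits_mv(int mvd);
extern int bits_ref(int ref);

static int mv_cost(int sad, int lambda,
                   int mx, int my, int px, int py, int ref)
{
    int rate = bits_mv(mx - px) + bits_mv(my - py) + bits_ref(ref);
    return sad + lambda * rate;   /* J = SAD + lambda * R */
}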

Table 10: Data schedule for the comparison unit (CU).

Clock   CE1–CE16     CE17–CE32             CE33–CE36    CE37–CE40              CE41
22      8 4×4 SADs   8 4×8, 8 8×4 SADs     8 8×8 SADs   8 8×16, 8 16×8 SADs    8 16×16 SADs
23      4 4×4 SADs   4 4×8, 4 8×4 SADs     4 8×8 SADs   4 8×16, 4 16×8 SADs    4 16×16 SADs
24      2 4×4 SADs   2 4×8, 2 8×4 SADs     2 8×8 SADs   2 8×16, 2 16×8 SADs    2 16×16 SADs
25      1 4×4 SAD    1 4×8, 1 8×4 SAD      1 8×8 SAD    1 8×16, 1 16×8 SAD     1 16×16 SAD

Figure 12: Algorithmic state machine chart for the modified SUMH algorithm (states S1–S6, decision boxes Q1 and Q2).

In state S3, some of the outputs from state S2 are passed into our proposed architecture IP core. In state S4, the AGU computes the addresses of the candidate blocks, using the address of the MV predictor as the base address, and the control unit waits for the initialization of the search window data in the BRAMs. The outputs of state S4 are the addresses of the candidate blocks and a flag indicating that BRAM initialization is complete. In state S5, the processing units and SAD combination trees compute the SADs of the candidate blocks. The outputs of S5 are the computed SADs and the unchanged AGU addresses. In state S6, the CU compares these SADs with previously computed SADs and obtains the 41 minimum SADs. The outputs of S6 are the 41 minimum SADs and their corresponding addresses.

In the decision box Q1, we check whether the current search pass is the last search pass of a particular search step, for example, the cross search step. If not, we continue with the remaining passes of that search step. If so, we go to decision box Q2, where we check whether this is the last search pass of the modified SUMH algorithm. If not, we move on to the next search step, for example, the hexagon search. If so, we check the MV predictors of the next current macroblock, according to the HF3V2 2-stitched zigzag scan proposed in [19].
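Read as software, the ASM chart reduces to the nested loop sketched below. All names are illustrative stand-ins: S1 and S2 run on the PowerPC, S3 to S6 on the IP core, and n_passes[] is a hypothetical per-step pass count (e.g., 4 passes for the cross search of Table 7).

/* Behavioral sketch of the ASM flow; all names are illustrative. */
enum { N_STEPS = 4 };                        /* e.g. cross, hexagon, ... */
extern const int n_passes[N_STEPS];

extern void s1_check_mv_predictors(void);    /* S1: PowerPC */
extern void s2_min_cost_and_mode(void);      /* S2: PowerPC */
extern void s3_load_ip_inputs(void);         /* S3 */
extern void s4_gen_addresses_wait_bram(void);/* S4 */
extern void s5_compute_sads(void);           /* S5 */
extern void s6_update_41_minimums(void);     /* S6 */

static void register_one_cmb(void)
{
    s1_check_mv_predictors();
    s2_min_cost_and_mode();
    s3_load_ip_inputs();
    for (int step = 0; step < N_STEPS; step++)             /* Q2 */
        for (int pass = 0; pass < n_passes[step]; pass++) {/* Q1 */
            s4_gen_addresses_wait_bram();
            s5_compute_sads();
            s6_update_41_minimums();
        }
}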

Page 131: 541420

EURASIP Journal on Embedded Systems 13

Table 11: Synthesis results.

Process (μm) 0.13 (FPGA)

Number of slices 11.4K

Number of slice flip flops 16.4K

Number of 4-input LUTs 18.7K

Total equivalent gate count 388K

Max frequency (MHz) 145.2

Algorithm Modified SUMH

Video specifications CIF 30-fps

Search range ±16

Block size 16× 16 to 4× 4

Minimum required frequency (MHz) 24.1

Number of 16× 8-bit dual-port RAMs 129

Memory utilization (Kb) 398

Voltage (V) 1.5

Power consumed (mW) 25

3.7. Synthesis Results and Analysis. The proposed architecture has been implemented in Verilog HDL. Simulation and functional verification of the architecture were done using the Mentor Graphics ModelSim tool [26]. We then synthesized the architecture using the Xilinx synthesis tool (XST), which is part of the Xilinx integrated software environment (ISE) [27]. After synthesis, place and route was done targeting the Virtex-II Pro XC2VP30 Xilinx FPGA on our development board. Finally, we obtained a power analysis of our design using the XPower tool, which is also part of Xilinx ISE.

Our synthesis results are shown in Table 11. From Table 11 we see that our architecture can achieve a maximum frequency of 145.2 MHz. The FPGA power consumption of our architecture, obtained using the Xilinx XPower tool, is 25 mW. The total equivalent gate count is 388 K.

Our simulations in ModelSim confirm the dataflow described in Sections 3.1 to 3.6. We find that it takes 27 cycles to obtain the minimum SAD from each search pass, after initialization. The 27 cycles break down as 1 cycle for the AGU, 1 cycle to read data from on-chip memory, 16 cycles for the PU, 5 cycles for the SAD combination tree, and 4 cycles for the comparison unit. Therefore, it takes 405 (15 × 27) cycles to complete the search for 1 CMB, 1 reference frame, and s = ±16. For a CIF image (396 MBs) at 30 Hz and considering 5 reference frames, a minimum clock frequency of approximately 24.1 MHz (405 × 396 × 30 × 5 cycles per second) is required. Thus, with a maximum possible clock speed of 145.2 MHz, our architecture can process CIF sequences in real time within a search range of ±16 using 5 reference frames.
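This cycle budget can be reproduced with a few lines of C, using only the numbers quoted above:

/* Reproducing the frequency budget (numbers from the text). */
#include <stdio.h>

int main(void)
{
    const int cycles_per_pass = 1 + 1 + 16 + 5 + 4; /* AGU, read, PU, SC, CU = 27 */
    const int passes_per_mb   = 15;                 /* modified SUMH, s = +/-16   */
    const int mbs_cif         = 396;                /* 352x288 / (16x16)          */
    const int fps             = 30, ref_frames = 5;
    long cycles_per_mb = (long)cycles_per_pass * passes_per_mb;   /* 405        */
    long min_hz = cycles_per_mb * mbs_cif * fps * ref_frames;     /* ~24.1 MHz  */
    printf("%ld cycles/MB, %.1f MHz minimum\n", cycles_per_mb, min_hz / 1e6);
    return 0;
}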

We provide Table 12, which compares our architecture with previous state-of-the-art architectures implemented on ASICs. Note that a direct comparison of our implementation with implementations done in ASIC technology is not possible because the platforms are different: ASICs still provide the highest performance in terms of area, power consumption, and maximum frequency. We therefore provide Table 12 not for direct comparison, but to show that our implementation achieves ASIC-like levels of performance. This is desirable because it indicates that an ASIC implementation of our architecture would yield even better performance. Our Verilog implementation was kept portable in order to simplify FPGA-to-ASIC migration.

From Table 12 we see that our architecture achieves many desirable results. The most remarkable is that the power consumption is very low despite the fact that our implementation is done on an FPGA, which typically consumes more power than an ASIC. Beyond the low power consumption, another favorable result is that the algorithm we use has better PSNR performance than the algorithms used in the other works. We also note that our architecture achieves the highest maximum frequency. By extension, our architecture is the only one that can support high-definition (HD) 1080p sequences at 30 Hz with a search range s = ±16 and 1 reference frame; this needs a minimum frequency of approximately 85.9 MHz.

In the next section we discuss our prototyping efforts and compare our results with similar works.

4. Architecture Prototype

The top-level prototype design of our architecture is shown in Figure 13. It is based on the prototype design in [25], in which Canals et al. propose an FPSoC-based architecture for the Full Search block matching algorithm; their implementation is done on a Virtex-4 FPGA.

Our prototype is built on the XUPV2P development board available from Digilent Inc. [28]. The board contains a Virtex-II Pro XC2VP30 FPGA with 30,816 logic cells, 136 18-bit multipliers, 2,448 Kb of block RAM, and two PowerPC processors. There are several connectors, including a serial RS-232 port for communication with a host personal computer. The board also features JTAG programming via an on-board USB2 port, as well as a DDR SDRAM DIMM slot that can accept up to 2 GB of RAM.

The embedded development tool used to design our prototype is the Xilinx Platform Studio (XPS) in the Xilinx Embedded Development Kit (EDK) [29]. The EDK makes it relatively simple to integrate user Intellectual Property (IP) cores as peripherals in an FPSoC. Hardware/software cosimulation can then be done to test the user IP.

In our prototype design, as shown in Figure 13, we employ a PowerPC hardcore embedded processor as our controller. The processor sends stimuli to the motion estimation IP core and reads the results back for comparison. The processor is connected to the other design modules via a 64-bit processor local bus (PLB).

The boot program memory is a 64 kb BRAM. It contains a bootloop program necessary to keep the processor in a known state after we load the hardware and before we load the software. The PLB connects to the user IP core through an IP interface (IPIF). This interface exposes several programmable interconnects. We use a slave-master FIFO attachment that is 64 bits wide and 512 positions deep. The status and control signals of the FIFO are available to the user logic block, which contains the logic for reading from and writing to the FIFO, together with the Verilog implementation of our architecture.


Table 12: Comparison with other architectures implemented on ASICs.

                           Chao et al. [11]   Miyakoshi et al. [12]   Lin [13]      Chen et al. [14]                              This work
Process (μm)               0.35               0.18                    0.18          0.18                                          0.13 (FPGA)
Voltage (V)                3.3                1.0                     1.8           1.3                                           1.5
Transistor count           301 K              1000 K                  546 K         708 K                                         388 K
Maximum frequency (MHz)    50                 13.5                    48.67         66                                            145.2
Video spec.                CIF 30 fps         CIF 30 fps              CIF 30 fps    CIF 30 fps                                    CIF 30 fps
Required frequency (MHz)   50                 13.5                    48.67         13.5                                          24.1
Algorithm                  Diamond search     Gradient descent        4SS           Single-iteration parallel VBS 4SS w/ 1 ref.   Hardware-oriented SUMH
Block size                 16×16 and 8×8      16×16 and 8×8           16×16         16×16 to 4×4                                  16×16 to 4×4
Power (mW)                 223.6              6.56                    8.46          2.13                                          25
Normalized power (mW)*     17.60              21.25                   8.46          4.08                                          69.02
Architecture               1D tree, no data   1D tree, no data        1D tree,      2D tree, level B                              1D tree, no data
                           reuse scheme       reuse scheme            level A reuse data reuse scheme                             reuse scheme
Can support HD 1920×1080p  No                 No                      No            No                                            Yes

*Normalized power (1.8 V, 0.18 μm) = Power × (0.18²/process²) × (1.8²/voltage²).

Figure 13: FPSoC prototype design of our architecture.

Table 13: Comparison with other FPSoC architectures.

                          Canals et al. [25]   This work
FPSoC FPGA                Virtex-4             Virtex-II Pro
Algorithm                 Full Search          Hardware-oriented SUMH
Video format              QCIF                 QCIF
Search range              ±16                  ±16
Number of slices          12.5 K               11.4 K
Memory utilization (Kb)   784                  398
Clock frequency (MHz)     100                  100

During operation, the PowerPC processor writes input stimuli to the FIFO and sets the status and control bits. The user logic reads the status and control signals and, when appropriate, reads data from the FIFO. The data passes into the IP core, and when the ME computation is done, the results are written back to the FIFO. The PowerPC reads the results and compares them with expected results to verify the accuracy of the IP. Intermediate results during the operation are sent to a terminal on the host personal computer via the RS-232 serial connection.
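For illustration, the host-side verification flow reduces to the loop below. fifo_write(), fifo_read(), status(), and set_control() are hypothetical stand-ins for the PLB IPIF FIFO accessors; no actual EDK driver API is implied.

/* Hypothetical host-side verification loop; not the real EDK API. */
extern void fifo_write(unsigned long long w);
extern unsigned long long fifo_read(void);
extern unsigned int status(void);
extern void set_control(unsigned int bits);
#define CTRL_START 0x1u
#define STAT_DONE  0x1u

static void run_me_test(const unsigned long long *stim, int n,
                        unsigned long long *res, int m)
{
    for (int i = 0; i < n; i++) fifo_write(stim[i]);  /* CMB + SW data    */
    set_control(CTRL_START);
    while (!(status() & STAT_DONE)) ;                 /* poll the IP core */
    for (int j = 0; j < m; j++) res[j] = fifo_read(); /* 41 MVs and SADs  */
}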

We target QCIF video for our prototype in order to compare our results with those in [25]; Table 13 shows this comparison. We see from Table 13 that our architecture consumes fewer FPGA resources and has a lower memory utilization. Again, we note that a direct comparison of the two architectures is complicated by the fact that different FPGAs were used in the two prototyping platforms. The work in [25] is based on a Virtex-4 FPGA, which uses 90-nm technology, while our work is based on a Virtex-II Pro FPGA, which uses 130-nm technology.

5. Conclusion

In this paper we have presented our low-power, FPSoC-based architecture for a fast ME algorithm in H.264/AVC. We described our adopted fast ME algorithm, a hardware-oriented SUMH algorithm, and showed that the modified SUMH has superior rate-distortion performance compared to several existing state-of-the-art fast ME algorithms. We also described our architecture for the hardware-oriented SUMH and showed that its FPGA-based implementation yields ASIC-like levels of performance in terms of speed, area, and power. Our results additionally showed that our architecture has the potential to support HD 1080p, unlike the other architectures we compared it with. Finally, we discussed our prototyping efforts and compared them with a similar prototyping effort; our implementation uses fewer FPGA resources.

In summary, therefore, the modified SUMH is more attractive than SUMH because it is hardware oriented. It is also more attractive than Full Search because, although Full Search is hardware oriented, it is much more complex than the modified SUMH and thus requires more hardware area and power to implement.

We therefore conclude that, for low-power handheld devices, the modified SUMH can be used instead of Full Search for ME in H.264/AVC without much penalty.

Acknowledgments

The authors acknowledge the support of Xilinx Inc., the Xilinx University Program, the Packard Foundation, and the Department of Electrical Engineering, Santa Clara University, California. The authors also thank the editor and reviewers of this journal for their useful comments.

References

[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.

[2] G. J. Sullivan, P. Topiwala, and A. Luthra, "The H.264/AVC advanced video coding standard: overview and introduction to the fidelity range extensions," in Proceedings of the 27th Conference on Applications of Digital Image Processing, vol. 5558 of Proceedings of SPIE, pp. 454–474, August 2004.

[3] H.-C. Lin, Y.-J. Wang, K.-T. Cheng, et al., "Algorithms and DSP implementation of H.264/AVC," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '06), pp. 742–749, Yokohama, Japan, January 2006.

[4] Z. Chen, P. Zhou, and Y. He, "Fast integer pel and fractional pel motion estimation for JVT," in Proceedings of the 6th Meeting of the Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Awaji Island, Japan, December 2002, JVT-F017.

[5] R. Li, B. Zeng, and M. L. Liou, "New three-step search algorithm for block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, no. 4, pp. 438–442, 1994.

[6] L.-M. Po and W.-C. Ma, "A novel four-step search algorithm for fast block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 3, pp. 313–317, 1996.

[7] J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim, "A novel unrestricted center-biased diamond search algorithm for block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 4, pp. 369–377, 1998.

[8] X. Yi, J. Zhang, N. Ling, and W. Shang, "Improved and simplified fast motion estimation for JM," in Proceedings of the 16th Meeting of the Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Poznan, Poland, July 2005, JVT-P021.doc.

[9] X. Yi and N. Ling, "Improved normalized partial distortion search with dual-halfway-stop for rapid block motion estimation," IEEE Transactions on Multimedia, vol. 9, no. 5, pp. 995–1003, 2007.

[10] C. De Vleeschouwer, T. Nilsson, K. Denolf, and J. Bormans, "Algorithmic and architectural co-design of a motion-estimation engine for low-power video devices," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 12, pp. 1093–1105, 2002.

[11] W.-M. Chao, C.-W. Hsu, Y.-C. Chang, and L.-G. Chen, "A novel hybrid motion estimator supporting diamond search and fast full search," in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS '02), vol. 2, pp. 492–495, Phoenix, Ariz, USA, May 2002.

[12] J. Miyakoshi, Y. Kuroda, M. Miyama, K. Imamura, H. Hashimoto, and M. Yoshimoto, "A sub-mW MPEG-4 motion estimation processor core for mobile video application," in Proceedings of the Custom Integrated Circuits Conference (CICC '03), pp. 181–184, 2003.

[13] S.-S. Lin, Low-power motion estimation processors for mobile video application, M.S. thesis, Graduate Institute of Electronic Engineering, National Taiwan University, Taipei, Taiwan, 2004.

[14] T.-C. Chen, Y.-H. Chen, S.-F. Tsai, S.-Y. Chien, and L.-G. Chen, "Fast algorithm and architecture design of low-power integer motion estimation for H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 5, pp. 568–576, 2007.

[15] L. Zhang and W. Gao, "Reusable architecture and complexity-controllable algorithm for the integer/fractional motion estimation of H.264," IEEE Transactions on Consumer Electronics, vol. 53, no. 2, pp. 749–756, 2007.

[16] C. A. Rahman and W. Badawy, "UMHexagonS algorithm based motion estimation architecture for H.264/AVC," in Proceedings of the 5th International Workshop on System-on-Chip for Real-Time Applications (IWSOC '05), pp. 207–210, Banff, Alberta, Canada, 2005.

[17] M.-S. Byeon, Y.-M. Shin, and Y.-B. Cho, "Hardware architecture for fast motion estimation in H.264/AVC video coding," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E89-A, no. 6, pp. 1744–1745, 2006.

[18] Y.-Y. Wang, Y.-T. Peng, and C.-J. Tsai, "VLSI architecture design of motion estimator and in-loop filter for MPEG-4 AVC/H.264 encoders," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '04), vol. 2, pp. 49–52, Vancouver, Canada, May 2004.

[19] C.-Y. Chen, C.-T. Huang, Y.-H. Chen, and L.-G. Chen, "Level C+ data reuse scheme for motion estimation with corresponding coding orders," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 4, pp. 553–558, 2006.

[20] S. Yalcin, H. F. Ates, and I. Hamzaoglu, "A high performance hardware architecture for an SAD reuse based hierarchical motion estimation algorithm for H.264 video coding," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '05), pp. 509–514, Tampere, Finland, August 2005.

[21] S.-J. Lee, C.-G. Kim, and S.-D. Kim, "A pipelined hardware architecture for motion estimation of H.264/AVC," in Proceedings of the 10th Asia-Pacific Conference on Advances in Computer Systems Architecture (ACSAC '05), vol. 3740 of Lecture Notes in Computer Science, pp. 79–89, Springer, Singapore, October 2005.

[22] C.-M. Ou, C.-F. Le, and W.-J. Hwang, "An efficient VLSI architecture for H.264 variable block size motion estimation," IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1291–1299, 2005.

[23] O. Ndili and T. Ogunfunmi, "A hardware oriented integer pel fast motion estimation algorithm in H.264/AVC," in Proceedings of the IEEE/ECSI/EURASIP Conference on Design and Architectures for Signal and Image Processing (DASIP '08), Bruxelles, Belgium, November 2008.

[24] H.264/AVC Reference Software JM 13.2, 2009, http://iphome.hhi.de/suehring/tml/download.

[25] J. A. Canals, M. A. Martínez, F. J. Ballester, and A. Mora, "New FPSoC-based architecture for efficient FSBM motion estimation processing in video standards," in Proceedings of the International Society for Optical Engineering, vol. 6590 of Proceedings of SPIE, p. 65901N, 2007.

[26] Mentor Graphics ModelSim SE User's Manual—Software Version 6.2d, 2009, http://www.model.com/support.

[27] Xilinx ISE 9.1 In-Depth Tutorial, 2009, http://download.xilinx.com/direct/ise9_tutorials/ise9tut.pdf.

[28] Xilinx Virtex-II Pro Development System, 2009, http://www.digilentinc.com/Products/Detail.cfm?Prod=XUPV2P.

[29] Xilinx Platform Studio and Embedded Development Kit, 2009, http://www.xilinx.com/ise/embedded/edk_pstudio.htm.


Hindawi Publishing Corporation, EURASIP Journal on Embedded Systems, Volume 2009, Article ID 162078, 10 pages, doi:10.1155/2009/162078

Research Article

FPGA Accelerator for Wavelet-Based Automated Global Image Registration

Baofeng Li, Yong Dou, Haifang Zhou, and Xingming Zhou

National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, China

Correspondence should be addressed to Baofeng Li, [email protected]

Received 14 February 2009; Accepted 30 June 2009

Recommended by Bertrand Granado

Wavelet-based automated global image registration (WAGIR) is fundamental to most remote sensing image processing algorithms and is extremely computation-intensive. With more and more algorithms migrating from ground computing to onboard computing, an efficient dedicated architecture for WAGIR is desired. In this paper, the BWAGIR architecture is proposed, based on a block resampling scheme. BWAGIR achieves significant performance by pipelining the computational logic, parallelizing the resampling process with the calculation of the correlation coefficient, and using parallel memory access. A proof-of-concept implementation with 1 BWAGIR processing unit performs at least 7.4X faster than the CL cluster system with 1 node, and at least 3.4X faster than the MPM massively parallel machine with 1 node. Further speedup can be achieved by parallelizing multiple BWAGIR units: the architecture with 5 units achieves a speedup of about 3X against the CL with 16 nodes and a speed comparable to the MPM with 30 nodes. More importantly, the BWAGIR architecture can be deployed onboard economically.

Copyright © 2009 Baofeng Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

With the rapid innovation of remote sensing technology, more and more remote sensing image processing algorithms are forced to be performed onboard instead of at the ground station, to meet the requirement of processing large volumes of remote sensing data in real time. Image registration [1, 2] is the basis of many image processing operations, such as image fusion, image mosaicking, and geographic navigation. Considering the computation-intensive and memory-intensive characteristics of remote sensing image registration and the limited computing power of onboard computers, implementing image registration efficiently and effectively with a dedicated architecture is of great significance.

In the past twenty years, FPGA technology has developed significantly. The capacity and performance of FPGA chips have increased greatly to accommodate many large-scale applications. Due to their excellent reconfigurability and convenient design flow, FPGAs have become the most popular choice for hardware designers implementing application-specific architectures. Therefore, implementing remote sensing image registration efficiently in an FPGA is precisely the point of this paper. Though Castro-Pareja et al. [3, 4] have proposed a fast automatic image registration (FAIR) architecture for mutual information-based 3D image registration in medical imaging applications, few works addressing hardware acceleration of remote sensing image registration have been reported.

Many approaches have been proposed for remote sensing image registration. As for hardware implementation, only automated algorithms are suitable, because onboard computing demands that the algorithms be accurate and robust and operate without manual intervention. The proposed automated remote sensing image registration algorithms can be classified into two categories: CP-based algorithms [5–12] and global algorithms [13–22]. In the former, matched control points (CPs) are extracted automatically from both images to determine the final mapping function. However, the problem is that it is difficult to automatically determine efficient CPs: the selected CPs need to be accurate, sufficient, and evenly distributed, and missing or spurious CPs make CP-based algorithms unreliable and unstable [23]. Hence, CP-based algorithms are not in our consideration.

Automated global registration, however, is an approach that does not rely on point-to-point matching. The final mapping function is computed globally over the images. Therefore, such algorithms are stable, robust, and easy to automate. One disadvantage of global registration is that it is computationally expensive. Fortunately, wavelet decomposition helps to relieve this situation because it provides a way to obtain the final result progressively. A wavelet-based automated global image registration (WAGIR) algorithm for remote sensing applications proposed by Le Moigne et al. [13–15] has been proved to be efficient and effective. In WAGIR, the lowest-resolution wavelet subbands are first registered with a rough accuracy and a wide search interval, and a local best result is obtained. Next, this result is refined repeatedly by iterative registrations on the higher-resolution subbands. The final result is obtained at the highest-resolution subbands, viz. the original images.

Many parallel schemes for WAGIR have been proposed in previous works, such as the parameter-parallel (PP) scheme, the image-parallel (IP) scheme, the hybrid-parallel (HP) scheme which merges PP and IP, and the group-parallel (GP) scheme [13, 24–27]. These are implemented targeting large, expensive supercomputers, cluster systems, or grid systems, which are impractical to deploy onboard. In this paper, we propose a block wavelet-based automated global image registration (BWAGIR) architecture based on a block resampling scheme. The architecture with 1 processing unit outperforms the CL cluster system with 1 node by at least 7.4X, and the MPM massively parallel machine with 1 node by at least 3.4X. And the BWAGIR with 5 units achieves a speedup of about 3X against the CL with 16 nodes and a speed comparable to the MPM with 30 nodes. More importantly, our work targets onboard computing.

The remainder of this paper is organized as follows. In Section 2, the traditional WAGIR algorithm is reviewed and analyzed based on the hierarchy architecture. The proposed block resampling scheme is detailed in Section 3, and the architecture of BWAGIR is presented in Section 4. Section 5 gives the proof-of-concept implementation and the experimental results with comparison to several related works. Finally, the paper is concluded in Section 6.

2. Wavelet-Based Automatic Global Image Registration Algorithm

Image registration is the process that determines the most accurate match between two images of the same scene or object. In the global registration process, one image is registered according to another known standard image. We refer to the former as the input image, the latter as the reference image, the best matching image as the registered image, and the image after each resampling process as the resampled image.

2.1. Review of the WAGIR Algorithm. WAGIR can be described by the pseudocode in Algorithm 1. Here we assume that the LL subbands form the feature space; 2D rotations and translations form the search space; the search strategy follows the multiresolution approach provided by wavelet decomposition; and the cross-correlation coefficient is adopted as the similarity metric. First, an N-level wavelet decomposition transforms the input image and the reference image, each of size M × M, into the sequences nLLi and nLLr, where n denotes the decomposition level. Then NLLi and NLLr, the subbands with the lowest resolution, are registered with an accuracy of δ2^N. A local best combination of rotation and translations (bestθ, bestX, bestY) is obtained and used as the search center for registering the next-level subbands, (N − 1)LLi and (N − 1)LLr, yielding another combination with an accuracy of δ2^(N−1). This process iterates until the overall best result, with the expected accuracy δ, is obtained after registering the original input image (0LLi) and reference image (0LLr). Finally, a resampling process is carried out to get the registered image.

At each level, the algorithm shown in Algorithm 2 is employed to register nLLi and nLLr. The result of the previous level (θC, XC, YC) is used as the search center. For each combination of rotation and translations, the algorithm shown in Algorithm 3 is performed to get a resampled image of nLLr. Then a correlation coefficient is calculated to measure the similarity between the resampled nLLr and nLLi. The combination corresponding to the maximal correlation coefficient is the best result of the current level. The resampling algorithm proceeds by sequentially selecting one registered-image location at a time, calculating the corresponding coordinate of the selected location in the reference image, accessing the neighboring 4 × 4 pixel window in the reference image, calculating the corresponding interpolation weights according to the computed coordinate, and finally calculating the pixel value of the selected location by the cubic convolution interpolation method. The correlation coefficient is calculated with (1):

C(A, B) = [ ΣiΣj (Aij × Bij) − (1/M²)(ΣiΣj Aij)(ΣiΣj Bij) ]
          / sqrt{ [ ΣiΣj A²ij − (1/M²)(ΣiΣj Aij)² ] × [ ΣiΣj B²ij − (1/M²)(ΣiΣj Bij)² ] },   (1)

where each sum runs over i, j = 0, …, M − 1.
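Equation (1) is the standard Pearson correlation coefficient; a direct C transcription, assuming 8-bit grayscale M × M images, is:

#include <math.h>

/* Direct C transcription of (1); A and B are M x M images. */
double correlation(const unsigned char *A, const unsigned char *B, int M)
{
    double sa = 0, sb = 0, saa = 0, sbb = 0, sab = 0;
    long n = (long)M * M;
    for (long k = 0; k < n; k++) {
        sa  += A[k];
        sb  += B[k];
        saa += (double)A[k] * A[k];
        sbb += (double)B[k] * B[k];
        sab += (double)A[k] * B[k];
    }
    double num = sab - sa * sb / n;
    double den = sqrt((saa - sa * sa / n) * (sbb - sb * sb / n));
    return num / den;
}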


Input: input image and reference image
Output: registered image
1 Initialize the registration process (wavelet level N; search scope: rotation angle (θscope_L, θscope_R), horizontal offset (Xscope_L, Xscope_R), and vertical offset (Yscope_L, Yscope_R));
2 Perform wavelet decomposition of the input image and the reference image;
3 bestθ = 0; bestX = 0; bestY = 0;
4 stepθ = δ2^N; stepX = δ2^N; stepY = δ2^N;
5 for (n = N; n ≥ 0; n−−) do
      (width, height) = (image_width/2^n, image_height/2^n);
      // register at the current wavelet level based on the results of the previous level
      Perform Register(nLLi, nLLr, bestθ, bestX, bestY, stepθ, stepX, stepY);
      (θscope_L, θscope_R) = (−stepθ, stepθ);
      (Xscope_L, Xscope_R) = (−stepX, stepX);
      (Yscope_L, Yscope_R) = (−stepY, stepY);
      stepθ /= 2; stepX /= 2; stepY /= 2;
      bestX ∗= 2; bestY ∗= 2; // the size of the next wavelet subband is twice the current one
6 Resample(input image, bestθ, bestX, bestY, registered image); // last resample to obtain the result image
7 Over.

Algorithm 1: Main WAGIR algorithm.

Register (the registering algorithm)
Input: nLLi, nLLr, θcenter, Xcenter, Ycenter, stepθ, stepX, stepY
Output: local bestθ, bestX, and bestY
1 (angle, x, y) = (θscope_L, Xscope_L, Yscope_L); // control variables
2 max_co = −1; // records the maximum correlation
// the registration processing
3 while (angle ≤ θscope_R) do
      while (x ≤ Xscope_R) do
          while (y ≤ Yscope_R) do
              (θC, XC, YC) = (θcenter + angle, Xcenter + x, Ycenter + y);
              // resample nLLr with θC, XC, YC to get the resampled image image_out
              Perform Resample(nLLr, image_out, θC, XC, YC);
              // compute the correlation between image_out and nLLi
              corre = Correlation(image_out, nLLi);
              if (corre > max_co) then
                  max_co = corre;
                  (bestθ, bestX, bestY) = (θC, XC, YC);
              y = y + stepY;
          x = x + stepX; y = Yscope_L;
      angle = angle + stepθ; x = Xscope_L;
4 Over.

Algorithm 2: The registering algorithm.

2.2. Analysis of the WAGIR Algorithm. All analyses are based on a common assumption of the hierarchy architecture shown in Figure 1. The off-chip external memory is used to store the large volume of image data, and the on-chip memory serves as a buffer to bridge the speed gap between the external memory and the accelerator.

Resample (the resampling algorithm)
Input: nLLr, transθ, transX, transY
Output: image_out
1 (w, h) = (width/2^n, height/2^n); // width and height of the input nLLr
// compute the cubic convolution weights for an (xLen × tab) × (yLen × tab) template
2 cubicTable(4, 4, tab);
3 for (ty = 0; ty < h; ty++) do
      for (tx = 0; tx < w; tx++) do
          // the inverse mapping function
          x = cos(transθ) ∗ (tx − transX − w/2) + sin(transθ) ∗ (ty − transY − h/2) + w/2;
          y = −sin(transθ) ∗ (tx − transX − w/2) + cos(transθ) ∗ (ty − transY − h/2) + h/2;
          (x_int, y_int) = ((int) x, (int) y);
          (x_fra, y_fra) = (x − x_int, y − y_int);
          (f_x, f_y) = ((int)(x_fra ∗ tab), (int)(y_fra ∗ tab));
          // read the corresponding xLen × yLen weights from cubicTable into ct
          ct = cubicTable + (f_y ∗ tab + f_x) ∗ mem_size;
          if ((0 < x_int < (w − 2)) && (0 < y_int < (h − 2))) then
              // read the corresponding xLen × yLen pixels from nLLr into st
              Read(nLLr, x_int − 1, y_int − 1, xLen, yLen, st);
              pixel_d = 0;
              for (ch = 0; ch < 4; ch++) do
                  for (cw = 0; cw < 4; cw++) do
                      pixel_d += (∗ct) ∗ (∗st); ct++; st++;
              if (pixel_d > 255) then pixel_d = 255;
              if (pixel_d < 0) then pixel_d = 0;
              ∗(image_out + ty ∗ w + tx) = (char) pixel_d;
4 Over.

Algorithm 3: The resampling algorithm.

Table 1: Runtime profiles from a software implementation of WAGIR (three-level wavelet decomposition, grayscale images; profiling platform: Intel Celeron(R) 1.7 GHz CPU, 2 × 256 MB DDR266 SDRAM, Microsoft VC 6.0, Windows XP Prof.).

Image size   Wavelet dec.   Resampling process   Correlation cal.
512 × 512    0.4            95.6                 0.0001
1K × 1K      0.6            94.2                 0.0001
2K × 2K      0.6            93.8                 0.0000
3K × 3K      0.7            93.7                 0.0000

In WAGIR, for each possible combination of rotation and translations, a resampling process is performed, and a correlation coefficient is calculated to decide which combination is the best transformation between the input image and the reference image. The runtime profiles from a software implementation of WAGIR listed in Table 1 show that the resampling process (excluding the time to compute the correlation coefficient) is the most time-consuming part. The calculation of the correlation coefficient is also essential, because each resampling process corresponds to one calculation of the correlation coefficient, though it consumes little execution time. For example, to register an input image and a reference image of size M × M over the search space [θL, θR] × [xL, xR] × [yL, yR], (θR − θL) × (xR − xL) × (yR − yL) resampling processes and as many calculations of the correlation coefficient are needed, and each resampling process requires M × M resampling operations. Though wavelet decomposition can relieve this situation, the computation requirement remains significant. Therefore, to accelerate WAGIR is essentially to accelerate the resampling process and the calculation of the correlation coefficient.
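As a worked example, assuming a unit search step over the ±16 search space used later in Section 5, the operation count is already large:

/* Worked example of the registration work for one full-resolution
   level, assuming a unit step over the +/-16 search space. */
#include <stdio.h>

int main(void)
{
    long long candidates = 32LL * 32 * 32;  /* (thetaR-thetaL)(xR-xL)(yR-yL) */
    long long M = 512;                      /* image side */
    printf("%lld resamplings, %lld cubic interpolations\n",
           candidates, candidates * M * M); /* 32768 and ~8.6e9 */
    return 0;
}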

In addition to the great computation requirement, WAGIR also has a great memory requirement. In each resampling process, for each location of the resampled image, a neighboring 4 × 4 pixel window in the reference image is needed. That means that 16M² memory accesses are required for each resampling process. Meanwhile, a total of 2M² accesses are also required for each calculation of the correlation coefficient. Considering the great number of resampling processes and correlation calculations, the total access amount is massive. Even worse, the whole reference image may be needed to compute one row of pixels of the resampled image when the rotation angle is ±π/4, as shown in Figure 2. This demands that the on-chip memory be capable of holding the whole image; if it is not, the amount of memory access grows significantly. Also, each resampled image should be buffered on-chip because it is needed in the calculation of the correlation coefficient. It is infeasible to provide such a large on-chip memory in a hardware implementation, because of the massive size of remote sensing images and the scarcity of on-chip memory resources. Therefore, a good memory scheduling strategy is imperative.

3. Block Resampling Scheme

To accommodate the great computation and memory requirements of WAGIR, a block resampling scheme is employed. The foundation is to produce the resampled image block by block, because the computations of different locations are completely independent of one another. The pseudocode in Algorithm 4 describes the block resampling scheme, in which the resampled image is computed sequentially in consecutive S × S subblocks.

Block_Resample (the block resampling algorithm)
Input: nLLr, transθ, transX, transY
Output: image_out
1 (w, h) = (width/2^n, height/2^n); // width and height of the input nLLr
// compute the cubic convolution weights for an (xLen × tab) × (yLen × tab) template
2 cubicTable(4, 4, tab);
3 for (s = 0; s < h/S; s++) do
      for (r = 0; r < w/S; r++) do
          for (ty = 0; ty < S; ty++) do
              for (tx = 0; tx < S; tx++) do
                  (px, py) = (r ∗ S + tx, s ∗ S + ty); // coordinates of the location within the whole image
                  // the inverse mapping function
                  x = cos(transθ) ∗ (px − transX − w/2) + sin(transθ) ∗ (py − transY − h/2) + w/2;
                  y = −sin(transθ) ∗ (px − transX − w/2) + cos(transθ) ∗ (py − transY − h/2) + h/2;
                  (x_int, y_int) = ((int) x, (int) y);
                  (x_fra, y_fra) = (x − x_int, y − y_int);
                  (f_x, f_y) = ((int)(x_fra ∗ tab), (int)(y_fra ∗ tab));
                  // read the corresponding xLen × yLen weights from cubicTable into ct
                  ct = cubicTable + (f_y ∗ tab + f_x) ∗ mem_size;
                  if ((0 < x_int < (w − 2)) && (0 < y_int < (h − 2))) then
                      // read the corresponding xLen × yLen pixels from nLLr into st
                      Read(nLLr, x_int − 1, y_int − 1, xLen, yLen, st);
                      pixel_d = 0;
                      for (ch = 0; ch < 4; ch++) do
                          for (cw = 0; cw < 4; cw++) do
                              pixel_d += (∗ct) ∗ (∗st); ct++; st++;
                      if (pixel_d > 255) then pixel_d = 255;
                      if (pixel_d < 0) then pixel_d = 0;
                      ∗(image_out + py ∗ w + px) = (char) pixel_d;
4 Over.

Algorithm 4: Block resampling algorithm.

Figure 1: Assumption of the hierarchy architecture (the accelerating architecture and on-chip memory reside in the FPGA, backed by off-chip external memory).

The reason for the great memory requirement is that the resampled image is generated row by row in the traditional resampling algorithm. This way of computation results in a great preloading scope of the reference image. According to the mapping function (2), the scope of reference image pixels required to compute one row of the resampled image is maximally [0, M] × [0, M], that is, the whole reference image. But the scope required to compute an S × S subblock is just [((1 − √2)/2)S, ((1 + √2)/2)S] × [((1 − √2)/2)S, ((1 + √2)/2)S]. Because S ≪ M, the preloading scope is decreased greatly. Accordingly, the size of the required on-chip memory is reduced significantly:

x = cos(transθ) ∗ (tx − transX − M/2) + sin(transθ) ∗ (ty − transY − M/2) + M/2,
y = −sin(transθ) ∗ (tx − transX − M/2) + cos(transθ) ∗ (ty − transY − M/2) + M/2.   (2)

Figure 2: Memory requirement of WAGIR.

Another benefit of the block resampling scheme is that the calculation of all pixels within one block needs only one preloading, so the amount of memory access is decreased. In the traditional resampling algorithm, the pixels of the reference image must be loaded from the external memory again and again if there is not enough on-chip memory to store the whole image. In the block resampling scheme, the block size is simply chosen according to the available on-chip memory.

4. The BWAGIR Architecture

As mentioned above, the resampling process and the calculation of the correlation coefficient account for the majority of the execution time. Therefore, the BWAGIR architecture aims to accelerate WAGIR by accelerating the resampling algorithm and the calculation of the corresponding correlation coefficient.

The BWAGIR architecture is detailed in Figure 3. The coordinate calculation module computes the coordinate of the pixel in the reference image corresponding to each location in the resampled image. The interpolation weights calculation module computes the 16 weights for the 4 × 4 interpolation window. The reference image RAM controller loads the neighboring 4 × 4 window. The resampled pixel calculation module computes the values of the resampled pixels. The input image RAM controller loads the input image pixels for the calculation of the correlation coefficient. The correlation calculation module computes the correlation coefficient. And FIFOs are placed to bridge the speed gaps among the modules mentioned above.

The proposed BWAGIR architecture optimizes the resampling process and the calculation of the correlation coefficient by parallelizing the resampling process with the corresponding correlation calculation, pipelining all calculation modules, and using parallel memory access.

4.1. Parallelizing Resampling and Calculation of Correlation. In a standard software implementation, the resampling process and the calculation of the correlation coefficient are performed sequentially: the correlation calculation starts only after all the pixels of the resampled image have been produced. This means that the resampled image must be written back into the external memory, or stored in extra on-chip memory, after resampling and then read back when calculating the correlation coefficient. The extra memory volume and memory accesses are evident.

In the BWAGIR architecture, we partition the correlation calculation into two steps.

(1) Calculate the sum of the pixels of the input image (ΣiΣj Aij), the sum of the pixels of the resampled image (ΣiΣj Bij), the sum of the squares of the pixels of the input image (ΣiΣj A²ij), the sum of the squares of the pixels of the resampled image (ΣiΣj B²ij), and the sum of the products of corresponding pixels of the two images (ΣiΣj (Aij × Bij)), where each sum runs over i, j = 0, …, M − 1.

(2) Calculate the final correlation coefficient from these five sums according to (1).

This partition avoids the extra memory volume and memory access. Once a pixel in the resampled image is produced, it is consumed by step 1 and then discarded. Once step 1 finishes the calculations for all pixels, the five sums are sent to step 2 to finalize the calculation of the correlation coefficient. Therefore, the resampling process runs in parallel with the calculation of the correlation coefficient.
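In C form, the partition looks as follows; this is a sketch of the dataflow, where step 1 is applied to every pixel as it is produced and step 2 finalizes the coefficient from the five sums:

#include <math.h>

/* Running sums for one (input image, resampled image) pair; must be
   zero-initialized before each search position is evaluated. */
typedef struct { double sa, sb, saa, sbb, sab; long n; } CorrAcc;

/* Step 1: consume one input pixel a and one resampled pixel b; the
   resampled pixel can be discarded immediately afterwards. */
static void corr_step1(CorrAcc *c, unsigned char a, unsigned char b)
{
    c->sa += a;  c->sb += b;
    c->saa += (double)a * a;  c->sbb += (double)b * b;
    c->sab += (double)a * b;  c->n++;
}

/* Step 2: finalize (1) from the five sums. */
static double corr_step2(const CorrAcc *c)
{
    double num = c->sab - c->sa * c->sb / c->n;
    return num / sqrt((c->saa - c->sa * c->sa / c->n) *
                      (c->sbb - c->sb * c->sb / c->n));
}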

4.2. Pipelining. All the calculation modules are pipelined to improve the system throughput and operating frequency. As shown in Figure 3, the BWAGIR is divided into four macrostages according to the processing flow. The first stage calculates the coordinate of the pixel in the reference image corresponding to each location of the resampled image and writes the integral and fractional components into the corresponding FIFOs. At the second stage, the reference image RAM controller reads the neighboring 4 × 4 reference image pixel window, and at the same time, the 16 interpolation weights are produced by the interpolation weights calculation module. Stage 3 calculates the value of each location in the resampled image by multiplying the pixels with their corresponding weights and adding the products together. Finally, the correlation calculation module computes the correlation coefficient by the means described in Section 4.1. With pipelining, the total time of performing one resampling process and the corresponding correlation calculation becomes equal to the product of the worst pipeline stage time and the number of pixels in the resampled image.


Figure 3: The BWAGIR architecture.

Table 2: Comparison of the registration time (milliseconds) with the CL cluster system.

Size        Scheme   CL: 1 node   2 nodes   5 nodes   15 nodes   16 nodes   BWAGIR: 1 unit   5 units
512 × 512   PP       103.11       51.80     20.82     7.64       6.86       8.7              1.8
            IP       102.98       61.00     38.54     27.50      27.44
            HP       103.22       51.60     20.83     7.94       8.82
            GP       103.22       51.79     20.85     7.25       6.80
1K × 1K     PP       345.93       173.77    69.86     25.62      23.02      39.4             7.9
            IP       345.94       187.43    88.72     44.66      43.31
            HP       345.90       172.22    69.84     25.01      24.56
            GP       345.95       172.32    69.86     24.21      22.90
3K × 3K     PP       2849.95      1445.13   575.63    213.55     192.41     385.7            75.5
            IP       2849.88      1440.52   626.68    231.41     218.20
            HP       2849.97      1442.60   575.62    207.36     191.83
            GP       2849.96      1439.75   576.02    204.50     191.87

(The BWAGIR timings apply to the whole size group, independent of the parallel scheme.)

4.3. Parallel Memory Access. Parallelizing the resampling process with the calculation of the correlation coefficient demands parallel access to the input image and the reference image. But the way of accessing the input image differs from that of accessing the reference image: the input image is accessed sequentially, while the reference image is accessed by 4 × 4 windows. Therefore, two external memories are used to store the input image and the reference image, respectively, and they are preloaded into the respective on-chip RAMs block by block.

Table 3: Comparison of the registration time (milliseconds) with the MPM parallel machine.

Size        Scheme   MPM: 1 node   5 nodes   15 nodes   16 nodes   30 nodes   BWAGIR: 1 unit   5 units
512 × 512   PP       39.052        7.917     3.116      2.635      1.714      8.7              1.8
            IP       39.051        10.414    5.350      5.022      4.189
            HP       39.050        7.917     2.850      2.916      1.564
            GP       39.055        7.917     2.750      2.633      1.450
1K × 1K     PP       145.625       29.317    11.318     9.567      6.184      39.4             7.9
            IP       145.629       32.188    12.841     12.142     8.683
            HP       145.633       29.067    10.117     9.816      5.415
            GP       145.633       29.183    10.082     9.665      5.271
3K × 3K     PP       1327.00       276.336   102.926    86.233     68.517     385.7            75.5
            IP       1326.24       292.485   105.821    100.466    75.243
            HP       1327.25       270.950   91.967     87.015     46.801
            GP       1328.00       267.150   88.350     87.717     45.516

(The BWAGIR timings apply to the whole size group, independent of the parallel scheme.)

As mentioned above, the performance of the pipeline is decided by the worst stage calculation time. The four pipeline stages differ in data source and operation. The worst case is the second stage, because a 4 × 4 neighborhood, that is, 16 pixels, is loaded from the on-chip reference image RAM. If these pixels were loaded one at a time, it would take at least 16 cycles to calculate each location in the resampled image, which would restrict the throughput of the pipeline significantly. As a rule, a multibank memory organization can settle this problem by distributing the sequential multiple accesses to different memory banks with separate ports. Because the 16 pixels are not consecutive, it is difficult to distribute them evenly into 16 banks. Therefore, we adopt a compromise strategy: the on-chip memory for the reference image is divided into 8 banks, each of which has three ports, one for writing and the other two for reading. This makes it convenient to write a 64-bit word composed of 8 consecutive 8-bit pixels into the on-chip memory in parallel, and to load 8 pixels (two lines of the 4 × 4 window) in parallel. Thereby, it takes only 2 cycles to load the 16 pixels within a window. Though this still cannot match the speed of the calculation modules (one result per cycle), the stage 2 calculation time is decreased by a factor of 8.
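The following C fragment is a behavioral model of the window fetch implied by this organization; the bank mapping is our assumption for illustration, not the actual RTL. Each call models one cycle in which the two read ports of the 8 banks deliver two rows of the 4 × 4 window:

#define NBANKS 8   /* one 8-bit pixel per bank per read port */

typedef struct { const unsigned char *pix; int stride; } RefRam;

/* One modeled cycle: rows r and r+1, columns c..c+3 (8 pixels).
   In hardware, pixel (x, y) would sit in bank ((y * stride + x) %
   NBANKS), and the two rows are served by the two read ports. */
static void read_two_rows(const RefRam *m, int r, int c,
                          unsigned char out[2][4])
{
    for (int k = 0; k < 4; k++) {
        out[0][k] = m->pix[r * m->stride + c + k];        /* read port A */
        out[1][k] = m->pix[(r + 1) * m->stride + c + k];  /* read port B */
    }
}
/* The full 4x4 window then takes 2 such cycles: rows (r, r+1) and
   (r+2, r+3). */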

4.4. Parallelizing Multiple BWAGIR Processing Units. The processing speed of the proposed architecture can be further improved by parallelizing multiple BWAGIR processing units. There are two ways to achieve this.

(i) Processing multiple blocks belonging to the same resampled image. This way multiplies the preloading scope and the on-chip memory volume. Another disadvantage is that the data is not utilized sufficiently, because each preloading supports the calculation of only one resampled image.

(ii) Processing multiple blocks belonging to different resampled images. This way enlarges the preloading scope only a little, because blocks at the same position of different resampled images have almost the same preloading scope. The on-chip memory volume is still multiplied, because parallel processing demands great data memory bandwidth, that is, each processor requires an independent data memory. But this form of parallelism decreases the memory access, because each preloading supports the calculation of multiple resampled images.

Therefore, the more economical way is to process multiple blocks that are at the same position of different resampled images.

5. Implementation and Experimental Results

As a proof of concept, the BWAGIR architecture was modeled in Verilog HDL, simulated with ModelSim SE 6.1d, synthesized with Quartus II 6.0, and implemented on an external prototype board with an Altera EP2S130F1020C3 FPGA. One BWAGIR unit occupies about 67% of the ALUTs (70557) and 35% of the memory bits (2345144) and can operate at a clock rate of 100 MHz. Two Micron MT16LSDT12864AG 1 GB PC133 SDRAMs are used as the external memories for the input image and the reference image, respectively, and the on-chip memories for the input image and the reference image are implemented with internal memory blocks. The prototype board is connected to a host computer with a USB cable. Only the registration component of WAGIR is executed on the board; the other components are all performed on the host. Table 2 compares the timings of the BWAGIR architecture with those of the CL machine, a cluster system with 16 nodes, each equipped with a Pentium4 1.7 GHz CPU and 512 MB of local storage, all connected by 100 Mb/s Ethernet. Table 3 compares the timings of the BWAGIR with those of the MPM machine, a massively parallel computer with MIMD architecture that has 32 processors, each with 1 GB of local storage; the speed of an MPM CPU is rated at 1.66 gigaflops/s, the network topology is a fat tree, and the point-to-point bandwidth is 1.2 Gb/s. The images are processed with three-level wavelet decomposition and registered within the search space θ ∈ (−16°, 16°), x ∈ (−16, 16), y ∈ (−16, 16).

The following can be concluded.

(i) The BWAGIR with 5 units can perform more than 5X faster than that with 1 unit, because parallelizing multiple processing units not only improves the execution speed but also reduces the amount of memory access.

(ii) The BWAGIR with 1 unit outperforms all the parallel schemes on the CL with 1 node by at least 7.4X, and is at least 3.4X faster than the MPM with 1 node, because the algorithm's memory access pattern cannot fully benefit from the traditional cache-based memory architectures present in most modern computers.

(iii) The BWAGIR with 5 units achieves a speedup of about 3X against the CL with 16 nodes, a speedup of greater than 1X against the MPM with 16 nodes, and a speed comparable to the MPM with 30 nodes. This is because the numerous communications between nodes cut down the expected performance improvement of the parallel schemes.

It should be noted that the timings of the BWAGIR with 1 unit were actually measured on the prototype board, while the timings of the BWAGIR with 5 units are simulation times, because our board cannot support parallelizing multiple units, given the limited capacity and number of FPGAs available on it.

6. Conclusion

The WAGIR algorithm for remote sensing applications is extremely computation-intensive and demands execution times on the order of minutes, even hours, on modern desktop computers. Therefore, a customized FPGA architecture is proposed in this paper to accommodate the great computational requirements of the algorithm and the trend of migration from ground computing to onboard computing.

To implement the algorithm efficiently in an FPGA, a block resampling scheme is adopted to relieve the great computation and memory requirements. Based on the block scheme, the proposed BWAGIR architecture derives its improvement from (1) pipelining all computational logic, (2) parallelizing the resampling process with the calculation of the correlation coefficient, and (3) parallel memory access. A practical implementation with two standard PC133 SDRAMs, operating at 100 MHz, outperforms the CL cluster system with 1 node by 7.4X and the MPM machine with 1 node by about 3.4X; this speedup is achieved using just one BWAGIR processing unit. For further improvement, multiple units can be parallelized to implement arrays of processing units, using VLSI or FPGAs, to perform distributed image registration. Compared with the CL with 16 nodes, the BWAGIR architecture with 5 units achieves about a 3X speedup, and it achieves a speed comparable to the MPM machine with 30 nodes. More importantly, our architecture can meet the requirement of onboard computing.

Acknowledgments

Our work is supported by the National Science Foundation of China under contracts no. 60633050 and no. 60621003 and the National High Technology Research and Development Program of China under contracts no. 2007AA01Z106 and no. 2007AA12Z147.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 716317, 15 pages
doi:10.1155/2009/716317

Research Article

A System for an Accurate 3D Reconstruction in Video Endoscopy Capsule

Anthony Kolar,1 Olivier Romain,1 Jade Ayoub,1 David Faura,1 Sylvain Viateur,1 Bertrand Granado,2 and Tarik Graba3

1 Departement SOC—LIP6, Universite P&M CURIE—Paris VI, Equipe SYEL, 4 place Jussieu, 75252 Paris, France
2 ETIS, CNRS/ENSEA/Universite de Cergy-Pontoise, 95000 Cergy, France
3 Electronique des systemes numeriques complexe, Telecom ParisTech, 46 rue Barrault, 75252 Paris, France

Correspondence should be addressed to Anthony Kolar, [email protected]

Received 15 March 2009; Revised 9 July 2009; Accepted 12 October 2009

Recommended by Ahmet T. Erdogan

For several years, gastroenterological examinations have been performed with wireless video capsules. Although the images make it possible to analyse some diseases, the diagnosis could be improved by the use of 3D imaging techniques implemented in the video capsule. The work presented here is related to Cyclope, an embedded active vision system that is able to provide both 3D information and texture in real time. The challenge is to realise this integrated sensor under the constraints on size, power consumption, and computational resources inherent to a video capsule. In this paper, we present the hardware and software development of a wireless multispectral vision sensor which allows a 3D reconstruction of a scene to be transmitted in real time. The multispectral acquisition grabs both texture and IR pattern images separately at a rate of at least 25 frames/s. The different Intellectual Properties designed allow specific algorithms to be computed in real time while maintaining computational accuracy. We present experimental results obtained with a large-scale demonstrator built on an SOPC prototyping board.

Copyright © 2009 Anthony Kolar et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Examination of the whole gastrointestinal tract represents a challenge for endoscopists due to its length and inaccessibility using natural orifices. Moreover, radiologic techniques are relatively insensitive for diminutive, flat, infiltrative, or inflammatory lesions of the small bowel. Since 1994, video capsules (VCEs) [1, 2] have been developed to allow direct examination of this inaccessible part of the gastrointestinal tract and to help doctors find the cause of symptoms such as stomach pain, Crohn's disease, diarrhoea, weight loss, rectal bleeding, and anaemia.

The Pillcam video capsule designed by the Given Imaging Company is the most popular of them. This autonomous embedded system acquires about 50,000 images of the gastrointestinal tract during more than twelve hours of an examination. The off-line processing of the images and their interpretation by the practitioner make it possible to determine the origin of the disease. However, a recently published benchmark [3] shows some limitations of this video capsule, such as the quality of the images and the inaccuracy of the estimated size of the polyps. Accuracy is a real need because the practitioner removes a polyp only if it exceeds a minimum size. In practice, the polyp size is estimated from the practitioner's experience, with an error that varies from one practitioner to another. One of the solutions could be to use 3D imaging techniques, either directly in the video capsule or on a remote computer.

This latter solution is actually used with the Pillcam capsule, using the 2–4 images that are taken per second and stored wirelessly in a recorder that is worn around the waist. 3D processing is performed off-line from an estimation of the displacement of the capsule. However, the speed of the video capsule is not constant: in the oesophagus it is 1.44 m/s, in the stomach it is almost null, and in the intestine it is 0.6 m/s. Consequently, by taking images at a constant frequency, certain areas of the transit will not be reconstructed. Moreover, the regular transmission of the images through the body consumes too much energy and limits the autonomy of the video capsule to 10 hours. Ideally, the quantity of information to be transmitted must be reduced to only the pertinent information, such as polyps or other 3D objects. The first development necessary for the delivery of such objects relies on the use of pattern recognition algorithms on 3D information inside the video capsule.

The introduction of 3D reconstruction techniques inside a video capsule requires the definition of a new system that takes into account the hard constraints of size, low power consumption, and processing time. The most common 3D reconstruction techniques are those based on passive or active stereoscopic vision methods, where image sensors are used to provide the information necessary to retrieve the depth. The passive method consists of taking at least two images of a scene from two different points of view. Unfortunately, using this method, only particular points, with high gradient or high texture, can be detected [4]. The active stereo-vision methods offer an alternative approach when processing time is critical. They consist in replacing one of the two cameras by a projection system which delivers a pattern composed of a set of structured rays. In this latter case, only an image of the deformation of the pattern by the scene is necessary to reconstruct a 3D image. Many implementations based on active stereo-vision have been realised in the past [5, 6] and provided significant results on desktop computers. Generally, these implementations have been developed to reconstruct large 3D objects such as buildings [7–14].

In our research work, we have focused on an integrated 3D active vision sensor: "Cyclope." The concept of this sensor was first described in [4]. In this article we focus on the presentation of our first prototype, which includes the instrumentation and processing blocks. This sensor performs a real-time 3D reconstruction while taking into account the size and power consumption constraints of embedded systems [15]. It can be used in wireless video capsules or wireless sensor networks. In the case of a video capsule, in order to be comfortable for the patient, the results could be stored in a recorder worn around the waist. It is based on a multispectral acquisition that facilitates the delivery of a textured 3D reconstruction in real time (25 images per second).

This paper is organised as follows. Section 2 briefly describes Cyclope and deals with the principles of the active stereo-vision system and the 3D reconstruction method. In Section 3 we present our original multispectral acquisition. In Section 4 we present the implementation of the optical correction developed to correct the lens distortion. Section 5 deals with the implementation of new thresholding and labelling methods. In Sections 6 and 7, we present the matching process used to give a 3D representation of the scene. Section 8 deals with wireless communication considerations. Finally, before the conclusion and perspectives of this work, we present, in Section 9, a first functional prototype and its performances, which attest the feasibility of this original approach.

2. Cyclope

2.1. Overview of the Architecture. Cyclope is an integrated wireless 3D vision system based on an active stereo-vision technique. It uses several different algorithms to increase accuracy and reduce processing time. For this purpose, the sensor is composed of three blocks (see Figure 1).

Figure 1: Cyclope diagram (instrumentation block: VCSEL projector and CMOS imager; processing block: FPGA and μP; RF block).

(i) Instrumentation block: it is composed of a CMOS camera and a structured light projector in the IR band.

(ii) Processing block: it integrates a microprocessor core and a reconfigurable array. The microprocessor is used for sequential processing. The reconfigurable array is used to implement parallel algorithms.

(iii) RF block: it is dedicated to OTA (Over The Air) communications.

The feasibility of Cyclope was studied through an implementation on an SOPC (System On Programmable Chip) target. The three parts will be realised in different technologies: CMOS for the image sensor and the processing units, GaAs for the pattern projector, and RF-CMOS for the communication unit. The development of such an integrated "SIP" (System In Package) is currently the best solution to overcome the technological constraints and realise a chip-scale package. This solution is used in several embedded sensors such as the "Human++" platform [16] or Smart Dust [17].

2.2. Principle of the 3D Reconstruction. The basic principle of 3D reconstruction is triangulation. Knowing the distance between two cameras (or between the various positions of the same camera) and defining the lines of sight, one passing through the center of the camera and the other through the object, we can find the object distance.

Active 3D reconstruction is a method that aims to increase the accuracy of the 3D reconstruction by projecting a structured pattern onto the scene. The matching is largely simplified because the points of interest in the image needed for the reconstruction are obtained by extraction of the pattern; this also has the effect of increasing the processing speed.

The setup of the active stereo-vision system is represented in Figure 2. The distance between the camera and the laser projector is fixed. The projection of the laser beams on a plane gives a matrix of IR spots.


Figure 2: Active stereo-vision system (laser projector and camera; the central ray and projected points P1…Pk in the O, x, y, z frame).

Figure 3: Epipolar projection (image plane, camera center OC, projector center CL, epipolar plane, point P and its projection p).

The 3D reconstruction is achieved through triangulation between the laser and the camera. Each point of the pattern projected on the scene represents the intersection of two lines (Figure 3):

(i) the line of sight, passing through the pattern point on the scene and its projection in the image plane,

(ii) the laser ray, starting from the projection center and passing through the chosen pattern point.

If we consider the active stereoscopic system as shown in Figure 3, where p is the projection of P in the image plane and e the projection of CL on the camera plane OC, the projection of the light ray supporting the dot on the image plane is a straight line. This line is an epipolar line [18–20].

To rapidly identify a pattern point in an image we can limit the search to the epipolar lines.

For Cyclope the pattern is a regular mesh of points. For each point (j, k) of the pattern we can find the corresponding epipolar line:

v = ajk · u + bjk, (1)

Figure 4: Spot image movement versus depth (a laser ray crosses two planes π1 and π2 at depths z1 and z2 and projects to image points p1 and p2; B is the stereoscopic base and f the focal length).

where (u, v) are the image coordinates and the parameters (ajk, bjk) are estimated through an off-line calibration process.

In addition to the epipolar lines, we can establish the relation between the position of a laser spot in the image and its distance to the stereoscopic system.

In Figure 4, we consider a laser ray projected on two different planes π1 and π2 located, respectively, at depths z1 and z2; the displacement d of the spot coordinates in the image is constrained to the epipolar line.

By considering the two triangles CPp1 and CPp2, we can express d as

d = B[(z1 − f)/z1 − (z2 − f)/z2] = B f (z1 − z2)/(z1 z2), (2)

where B is the stereoscopic base, f the focal length of the camera, and d the distance in pixels:

d = √((u1 − u2)² + (v1 − v2)²). (3)

Given the epipolar line we can express d as a function of only one image coordinate:

d = √(1 + a²) · (u1 − u2). (4)

From (2) and (4), we can express, for each pattern point (j, k), the depth as a hyperbolic function:

z = 1/(αjk · u + βjk), (5)

where the αjk and βjk parameters are also estimated during the off-line calibration of the system [21].

We compute the inverse of the depth z to simplify the implementation: only two operations are needed, an addition and a multiplication. The computation of the depth of each point is independent of the others, so all the laser spots can be processed separately, allowing the parallelisation of the architecture.
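To make this parallelism concrete, the following C sketch shows the per-spot computation that the hardware replicates; the array names and the use of floating point are illustrative assumptions (the actual IP uses fixed-point arithmetic):

    #include <stddef.h>

    /* Per-spot inverse depth from (5): 1/z = alpha*u + beta.
     * Each iteration is independent, so a hardware implementation
     * can instantiate one multiply-add unit per laser spot. */
    void compute_depths(const float *u,     /* spot abscissas         */
                        const float *alpha, /* calibrated slopes      */
                        const float *beta,  /* calibrated offsets     */
                        float *inv_z,       /* output inverse depths  */
                        size_t n_spots)
    {
        for (size_t i = 0; i < n_spots; ++i)
            inv_z[i] = alpha[i] * u[i] + beta[i]; /* one multiply, one add */
    }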

3. An Energetic Approach for Multispectral Acquisition

Figure 5: Acquisition and 3D reconstruction flow chart (multispectral image acquisition, distortion correction, thresholding, labeling, center detection, matching, 3D reconstruction, and wireless communication).

Figure 6: Multispectral image sensor (texture image in the visible band, 400–700 nm; pattern image in the near IR).

The main problem when designing 3D reconstruction processing for an integrated system is the limitation of the resources. However, we can obtain a good accuracy under these hard constraints by using the following method, shown in Figure 5:

(1) the multispectral acquisition, which discriminates between the pattern and the texture by an energetic method;

(2) the correction of the coordinate errors due to the optical lens distortion;

(3) the processing before the 3D reconstruction, such as thresholding, segmentation, labelling, and the computation of the laser spot centers;

(4) the computation of the matching and of the third dimension;

(5) the transmission of the data with a processor core and an RF module.

The spectral response of silicon cuts off near 1100 nm and covers the UV to near-infrared domains. This important characteristic allows a multispectral acquisition to be defined by grabbing the colour texture image in the visible band and the depth information in the near-infrared band. Cyclope uses this original acquisition method, which gives direct access to the depth information independently from the texture image processing (Figure 6).

The combination of the acquisition of the projected pattern in the infrared band, the acquisition of the texture in the visible band, and the mathematical model of the active 3D sensor makes it possible to restore the textured 3D representation of the scene. This acquisition needs to separate the texture and the 3D data. For this purpose we have developed a multispectral acquisition [15]. Generally, filters are used to cut the spectral response. Here we used an energetic method, which has the advantage of being generic across imagers.

Figure 7: 64 × 64 image sensor microphotograph.

To allow real-time acquisition of both pattern and texture, we have developed a first 64 × 64 pixel CMOS imager prototype in a 0.6 μm process, for a total surface of 20 mm² (Figure 7). This sensor has programmable light integration and shutter times to allow dynamic changes. It was designed to have a large response in the visible and near infrared. This first CMOS imager prototype, which is not the subject of this article, allowed the validation of our original energetic approach, but its small size needs to be increased to provide more information. So, in our demonstrator we have used a larger CCD sensor (CIF resolution, 352 × 288 pixels) to obtain normal-size images and validate the 3D processing architecture.

The projector periodically pulses an energetic IR pattern onto the scene. An image acquisition with a short integration time grabs the image of the pattern, against which the background texture appears negligible. A second acquisition with a longer integration time grabs the texture while the projector is off. Figure 8 shows the sequential scheduling of the image acquisitions. To reach a video rate of 25 images/s this acquisition sequence must be completed in less than 40 milliseconds. The global acquisition time is given in (6), where Trst is the reset time, Trd is the time needed to read the entire image, and TintVI and TintIR are, respectively, the integration times for the visible and IR images.

Figure 8: Acquisition sequence (laser pulse with short integration TintIR, read time Trd, then long integration TintVI and read; total under 40 ms).

Ttotal = 2 · Trst + 2 · Trd + TintVI + TintIR. (6)

The typical values are

Trst = 0.5 ms, Trd = 0.5 ms, TintVI = 15 ms, TintIR = 20 μs, (7)

which gives Ttotal ≈ 17 ms, well within the 40 ms budget.

4. Optical Distortion Correction

Generally, the lenses used in VCEs introduce large deformations in the acquired images because of their short focal length [22]. This distortion manifests itself as inadequate spatial relationships between pixels in the image and the corresponding points in the scene. Such a change in the shape of the captured object may have a critical influence in medical applications, where quantitative measurements in endoscopy depend on the position and orientation of the camera and on its model. The camera model used needs to be accurate. For this reason we first introduce the pinhole camera model and then the correction of the geometric distortion added to enhance it. For practical purposes two different methods are studied to implement this correction, and it is up to designers to choose their own model depending on the required accuracy level and computational cost.

Figure 9: Pinhole camera model. (Xw, Yw, Zw): world coordinates; (O, Xc, Yc, Zc): camera coordinates; (O′, u, v): image plane coordinates.

The pinhole camera model (see Figure 9) is based on the principle of linear projection, where each point in the object space is projected by a straight line through the projection center into the image plane. This model can be used only as an approximation of a real camera, which is actually not perfect and sustains a variety of aberrations [23]. So, the pinhole model is not valid when high accuracy is required, as in our intended applications (endoscopes, robotic surgery, etc.). In this case, a more comprehensive camera model must be used, taking into account the corrections for the systematically distorted image coordinates. As a result of several types of imperfections in the design and assembly of the lenses composing the camera's optical system, the real projection of the point P in the image plane takes into account the error between the real observed image coordinates and the corresponding ideal (nonobservable) image coordinates:

u′ = u + δu(u, v), v′ = v + δv(u, v), (8)

where (u, v) are the ideal nonobservable, distortion-free image coordinates, (u′, v′) are the corresponding real coordinates, and δu and δv are, respectively, the distortion along the u and v axes. Usually, the lens distortion consists of radial symmetric distortion, decentering distortion, affinity distortion, and nonorthogonal deformations. Several cases are presented in Figure 10.

The effective distortion can be modelled by

δu(u, v) = δur + δud + δup, δv(u, v) = δvr + δvd + δvp, (9)

where δur represents the radial distortion [24], δud the decentering distortion, and δup the thin prism distortion. Assuming that only the first- and second-order terms are sufficient to compensate the distortion, and that terms of order higher than three are negligible, we obtain a fifth-order polynomial camera model (expression (10)), where (u′i, v′i) are the distorted image coordinates in pixels and (ui, vi) are the true (undistorted) coordinates:

u′i = DuSu(k2ui⁵ + 2k2ui³vi² + k2uivi⁴ + k1ui³ + k1uivi² + 3p2ui² + 2p1uivi + p2vi² + ui) + u0,

v′i = Dv(k2ui⁴vi + 2k2ui²vi³ + k1ui²vi + k2vi⁵ + k1vi³ + p1ui² + 2p2uivi + 3p1vi² + vi) + v0. (10)


Figure 10: (a) The ideal undistorted grid. (b) Barrel distortion. (c) Pincushion distortion.

An approximation of the inverse model is given by (11):

ui = [u′i + u′i(a1ri² + a2ri⁴) + 2a3u′iv′i + a4(ri² + 2u′i²)] / [(a5ri² + a6u′i + a7v′i + a8)ri² + 1],

vi = [v′i + v′i(a1ri² + a2ri⁴) + 2a4u′iv′i + a3(ri² + 2v′i²)] / [(a5ri² + a6u′i + a7v′i + a8)ri² + 1], (11)

where

u′i = (ui − u0)/(DuSu), v′i = (vi − v0)/Dv, ri = √(u′i² + v′i²). (12)

The unknown parameters a1, . . . , a8 are solved using direct least mean-squares fitting [25] in the off-line calibration process.

4.1. Off-Line Lens Calibration. Many methods have been proposed to estimate the intrinsic camera and lens distortion parameters, and there are also methods that produce only a subset of the parameter estimates. We chose a traditional calibration method based on observing a planar checkerboard in front of our system at different poses and positions (see Figure 11) to solve the equations for the unknown parameters (11). The results of the calibration procedure are presented in Table 1.

4.2. Hardware Implementation. After the computation of the parameters of (11) through an off-line calibration process, we used them to correct the distortion of each frame. With the input frame captured by the camera denoted as the source image and the corrected output as the target image, the task of correcting the distorted source image can be defined as follows: for every pixel location in the target image, compute its corresponding pixel location in the source image. Two implementation techniques of distortion correction have been compared:

Direct Computation. Calculate the image coordinates by evaluating the polynomials to determine intensity values for each pixel.

Figure 11: Different checkerboard positions used for the calibration procedure.

Table 1: Calibration results.

Parameter Value Error

u0 (pixels) 178.04 1.28

v0 (pixels) 144.25 1.34

f DuSU (pixels) 444.99 1.21

f Dv (pixels) 486.39 1.37

a1 −0.3091 0.0098

a2 −0.0033 0.0031

a3 0.0004 0.0001

a4 0.0014 0.0004

a5 0.0021 0.0002

a6 0.0002 0.0001

a7 0.0024 0.0005

a8 0.0011 0.0002

Lookup Table. Calculate the image coordinates by evaluating the correction polynomials in advance and storing them in a lookup table which is referenced at run time. All parameters needed for LUT generation are known beforehand; therefore, for our system, the LUT is computed only once and off-line.

Figure 12: Block memory occupation (BRAMs, KB) versus image size (KB) for both the direct computation and LUT-based approaches.

Table 2: Area and clock characteristics of the two approaches.

Implementation Area (%) Clock (MHz)
Direct Computation 58 10
Look Up Table 6 24

However, since the source pixel location can be a real number, using it to compute the actual pixel values of the target image requires some form of pixel interpolation. For this purpose we have used the nearest-neighbour interpolation approach, which means that the pixel value closest to the predicted coordinates is assigned to the target coordinates. This choice is reasonable because it is a simple and fast method to compute, and visible image artefacts are not an issue for our system.
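As an illustration of the LUT-based correction, a software equivalent could be written as follows (the image size matches the CIF sensor of the demonstrator; the table name and layout are assumptions, not the authors' VHDL design):

    #include <stdint.h>

    #define W 352  /* CIF width  */
    #define H 288  /* CIF height */

    /* Precomputed off-line from the inverse model (11)-(12): for each
     * target pixel, the index of the nearest source pixel. */
    extern const uint32_t lut[W * H];

    /* Distortion correction: one LUT read and one image read per target
     * pixel (nearest-neighbour interpolation, no run-time polynomials). */
    void correct_frame(const uint8_t *src, uint8_t *dst)
    {
        for (uint32_t i = 0; i < W * H; ++i)
            dst[i] = src[lut[i]];
    }

The entire polynomial cost is paid once off-line, which is exactly the memory-for-logic trade-off visible in Figure 12 and Table 2.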

Performance results of these two techniques are presented in terms of (i) execution time and (ii) FPGA logic resource requirements.

The architectures proposed above have been described in VHDL in a fixed-point fashion, implemented on a Xilinx Virtex-II FPGA device, and simulated using an industry-reference simulator (ModelSim). The pixel values of both the input distorted and the output corrected images use an 8-bit integer word length. The coordinates use an 18-bit word length.

The results are presented in Figures 12 and 13 and in Table 2.

Figure 13: Execution time (μs) versus image size (KB) for both the direct computation and LUT-based approaches.

The execution time of the direct computation implementation is comparatively very slow. This is due to the fact that the direct computation approach consumes a much greater amount of logic resources than the lookup table approach. Moreover, the slow clock frequency (10 MHz) could be increased by splitting the complex arithmetic logic into several smaller stages. The significant difference between the two approaches is that the direct computation approach requires more computation time and arithmetic operations, while the LUT approach requires more memory accesses and more RAM block occupation. Regarding latency, both approaches can be executed within the real-time constraint of the video cadence (25 frames per second). Depending on the application, the best compromise between time and resources must be chosen by the user. For our application, arithmetic operations are intensively needed in later stages of the preprocessing block, while memory blocks are available; so we chose the LUT approach to benefit in both time and resources.

5. Thresholding and Labelling

After lens distortion correction, the projected laser spots must be extracted from the gray-level image to deliver a 3D representation of the scene. Laser spots appear in the image with variable sizes (depending on the absorption of the surface and the projection angle). At this level, a preprocessing block has been developed and implemented in hardware to perform an adaptive thresholding, which gives a binary image, and a labelling that classifies each laser spot so that its center can be computed later.

5.1. Thresholding Algorithm. Several methods exist, from a static threshold value defined by the user up to dynamic algorithms such as Otsu's method [26].

We have chosen to develop a new approach, less complex than Otsu's or other well-known dynamic methods, in order to reduce the processing time [27]. This simple method is described below (Figure 14):

(i) building the histogram of the grey-level image,

(ii) finding the first maximum, the Gaussian corresponding to the background, and computing its mean μ and standard deviation σ,

(iii) calculating the threshold value with (13):

Threshold = σ · α + μ, (13)

where α is an arbitrary constant. A parallel processing architecture has been designed to compute the threshold and to produce the binary image. Full features of this implementation are given in [28].

Figure 14: Thresholding method developed in Cyclope (the background forms a Gaussian of mean μ and standard deviation σ; the threshold is biased to μ + α × σ, beyond which the laser spots lie).
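As a software illustration of this thresholding (a minimal sketch for an 8-bit grey-level image; the ±32 grey-level peak window is an assumption, and the parallel hardware IP [28] is organised differently):

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Adaptive threshold as in (13): find the background peak of the
     * histogram, estimate its mean and standard deviation, and bias
     * the threshold by alpha * sigma. */
    uint8_t adaptive_threshold(const uint8_t *img, size_t n, float alpha)
    {
        uint32_t hist[256] = {0};
        for (size_t i = 0; i < n; ++i)
            hist[img[i]]++;

        /* The background is the dominant mode of the histogram. */
        int peak = 0;
        for (int g = 1; g < 256; ++g)
            if (hist[g] > hist[peak]) peak = g;

        /* Mean and sigma over a window around the peak. */
        int lo = peak - 32 < 0 ? 0 : peak - 32;
        int hi = peak + 32 > 255 ? 255 : peak + 32;
        double sum = 0.0, sum2 = 0.0, count = 0.0;
        for (int g = lo; g <= hi; ++g) {
            sum   += (double)g * hist[g];
            sum2  += (double)g * g * hist[g];
            count += hist[g];
        }
        double mu    = sum / count;
        double sigma = sqrt(sum2 / count - mu * mu);

        double t = mu + alpha * sigma;   /* Threshold = sigma*alpha + mu */
        return (uint8_t)(t > 255.0 ? 255.0 : t);
    }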

5.2. Labeling. After this first stage of extraction of the laser spots from the background, it is necessary to classify each laser spot in order to compute its center separately. Several methods have been developed in the past. We chose to use a classical two-pass connected-component labelling algorithm with 8-connectivity. We designed a specific optimized Intellectual Property in VHDL. This Intellectual Property uses fixed-point numbers.
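The classical two-pass algorithm can be sketched as follows (a minimal union-find version with assumed image dimensions and label bound; the VHDL IP uses fixed-point numbers and a hardware equivalence table):

    #include <stdint.h>

    #define W    352
    #define H    288
    #define MAXL 4096  /* assumed upper bound on provisional labels */

    static uint16_t parent[MAXL];

    static uint16_t find(uint16_t x)  /* union-find root, path halving */
    {
        while (parent[x] != x)
            x = parent[x] = parent[parent[x]];
        return x;
    }

    /* Two-pass connected-component labelling with 8-connectivity.
     * bin: binary image (0 = background, 1 = spot); lab: output labels. */
    void label_spots(const uint8_t *bin, uint16_t *lab)
    {
        uint16_t next = 1;
        parent[0] = 0;
        /* Pass 1: provisional labels from already-visited neighbours. */
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {
                int i = y * W + x;
                if (!bin[i]) { lab[i] = 0; continue; }
                static const int nb[4][2] = {{-1,-1},{0,-1},{1,-1},{-1,0}};
                uint16_t best = 0;
                for (int k = 0; k < 4; ++k) {
                    int nx = x + nb[k][0], ny = y + nb[k][1];
                    if (nx < 0 || nx >= W || ny < 0) continue;
                    uint16_t l = lab[ny * W + nx];
                    if (!l) continue;
                    if (!best) best = l;
                    else parent[find(best)] = find(l); /* merge classes */
                }
                if (!best && next < MAXL) { parent[next] = next; best = next++; }
                lab[i] = best;
            }
        /* Pass 2: replace each provisional label by its representative. */
        for (int i = 0; i < W * H; ++i)
            lab[i] = find(lab[i]);
    }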

6. Computation of Spots Centers

The thresholding and labeling processes applied to the captured image allow us to determine the area of each spot (number of pixels). The coordinates of the center of each spot can be calculated as follows:

ugI = (Σi∈I ui)/NI, vgI = (Σi∈I vi)/NI, (14)

where ugI and vgI are the abscissa and ordinate of the Ith spot center, ui and vi are the coordinates of the pixels constituting the spot, and NI is the number of pixels of the Ith spot (its area in pixels).

To obtain an accurate 3D reconstruction, we need to compute the spot centers with the highest possible precision without increasing the total computing time, so as to satisfy the real-time constraint. The hardest step in the center detection part is the division operation A/B in (14). Several methods exist to solve this problem.

6.1. Implementation of a Hardware Divider. The simplest method is the use of a hardware divider, but hardware dividers are computationally expensive and consume a considerable amount of resources. This is not acceptable for a real-time embedded system. Other techniques can be used to compute the centers of the laser spots while avoiding hardware dividers.

Figure 15: Smallest rectangle containing the active pixels.

6.2. Approximation Method. Some studies suggest approximation methods to avoid the implementation of hardware dividers. Such methods, like the one implemented in [29], replace the active pixels by the smallest rectangle containing the region and then replace the usual division by a simple shift (division by 2):

u∗gI = (Max(ui) + Min(ui))/2, v∗gI = (Max(vi) + Min(vi))/2. (15)

This approach is expressed in (15), where (ui, vi) are the active pixel coordinates and (u∗gI, v∗gI) are the approximated coordinates of the spot center.

The determination of the rectangle limits requires scanning the image twice, detecting at each scanning step the minimum and maximum pixel coordinates. For each spot, the coordinates of every pixel must be compared with the last registered minimum and maximum to assign new values to Um, UM, Vm, and VM (m: minimum; M: maximum). If Np is the average area of the spots (in pixels), the number of operations needed to calculate the center of each spot is about 4Np + 6; globally, Nop ≈ 25 ∗ N ∗ (4Np + 6) operations are needed to calculate the centers of N spots at a video cadence of 25 fps. Such an approximation is simple and easy to use but still requires considerable computation time. Besides, the error is not negligible: the mean error of this method is nearly 0.22 pixel, and the maximum error is more than 0.5 pixel [29]. Taking the spot of Figure 15 as an example of the inaccuracy of such a method, the real center position of these pixels is (4.47; 6.51), but when applying this approximation method the center position becomes (5; 6). This inaccuracy results in mismatching problems that affect the measurements when reconstructing the object.


Figure 16: 3D unit (parameter memory holding the (ai, bi) and (αi, βi) pairs, N parallel estimation and comparison stages, a coder, and the 1/z computation).

6.3. Our Method. The area of each spot (number of pixels) is always a positive integer, and its value is limited to a predetermined interval [Nmin, Nmax], where Nmin and Nmax are, respectively, the minimum and maximum areas of a laser spot in the image. The spot areas depend on the object illumination, the distance between the object and the camera, and the angle of view of the scene. Our method consists in memorising 1/N, where N represents the number of pixels of a spot and can take values in [1, Nlimit]; Nlimit represents the maximum considered size, in pixels, of a spot.

In this case we only need to compute a multiplication, as summarised here:

ugI = (u1 + u2 + · · · + uI) ∗ (1/NI), vgI = (v1 + v2 + · · · + vI) ∗ (1/NI). (16)

The implementation of such a filter is very easy, given that most DSP functions are provided in recent FPGAs. For example, the Virtex-II architecture [30] provides an 18 × 18 bit multiplier with a latency of about 4.87 ns at 205 MHz, optimised for high-speed operations. Additionally, its power consumption is lower compared to a slice implementation of an 18-bit by 18-bit multiplier [31]. For N luminous spots, the number of operations needed to compute the center coordinates is Nop ≈ 25 ∗ N ∗ Np, where Np is the average area of the spots. When implementing our approach on a Virtex II Pro FPGA (XC2VP30), it was clear that we gain in execution time and size. A comparison of the different implementation approaches is given in the next section.
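In software, the reciprocal-table method can be sketched like this (Nlimit and the floating-point format are illustrative assumptions; the IP performs the same accumulate-then-multiply with the 18 × 18 bit hardware multipliers):

    #include <stdint.h>

    #define NLIMIT 1024  /* assumed maximum spot area, in pixels */

    /* Reciprocal table 1/N, precomputed once off-line, so that the
     * centroid (16) needs only accumulations and one multiplication. */
    static float inv_n[NLIMIT + 1];

    void init_inv_table(void)
    {
        for (int n = 1; n <= NLIMIT; ++n)
            inv_n[n] = 1.0f / (float)n;
    }

    /* Centroid of one labelled spot: sum the pixel coordinates, then
     * multiply by the tabulated reciprocal of the area (no divider). */
    void spot_center(const uint16_t *us, const uint16_t *vs, int n_pixels,
                     float *ug, float *vg)
    {
        uint32_t su = 0, sv = 0;
        for (int i = 0; i < n_pixels; ++i) {
            su += us[i];
            sv += vs[i];
        }
        *ug = (float)su * inv_n[n_pixels];
        *vg = (float)sv * inv_n[n_pixels];
    }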

7. Matching Algorithm

The sets of parameters of the epipolar and depth models are used at run time to perform point matching (identifying the original position of a pattern point from its image) and to calculate the depth using the coordinates of each laser spot center.

For this purpose we have developed the parallel architecture shown in Figure 16, described in detail in [32].

Starting from the abscissa u of a point, we calculate its estimated ordinate v̂ assuming the point belongs to a given epipolar line. We then compare this estimate with the true ordinate v.

These operations are performed for all the epipolar lines simultaneously. After thresholding, the encoder returns the index of the corresponding epipolar line.

The next step is to calculate the z coordinate from the u coordinate and the appropriate depth model parameters (α, β).

These computation blocks are synchronous and pipelined, thus allowing high processing rates.

7.1. Estimation Block. In this block the estimated ordinate v̂ = a · u + b is calculated. The (a, b) parameters are loaded from memory.

7.2. Comparison Block. In this block the absolute value of the difference between the ordinate v and its estimate v̂ is calculated. This difference is then thresholded.

The thresholding avoids a resource-consuming sorting stage. The threshold was chosen a priori as half the minimum distance between two consecutive epipolar lines. It can be adjusted for each comparison block.

This block returns a "1" result if the distance is below the threshold.

7.3. Encoding Block. If the comparison blocks return a unique "1" result, the encoder returns the corresponding epipolar line index.

If no comparison block returns a "true" result, the point is irrelevant and considered as picture noise.


Figure 17: Wireless communication (Cyclope to XBee module, then XBee module to PC, using Di/Do data lines with /RTS and /CTS flow control).

If more than one comparison block returns "1", we consider that there is a correspondence error and a flag is set.

The selected index is then carried to the next stage, where the z coordinate is calculated. It allows the selection of the right parameters of the depth model.

As stated earlier, we compute 1/z rather than z in order to have a simpler computation unit. This computation block is then identical to the estimation block.
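A sequential software equivalent of this matching unit might read as follows (the table names are assumptions; in the hardware all comparisons run in parallel and the multiple-match case raises a flag):

    #include <math.h>

    #define N_LINES 49  /* one epipolar line per pattern point */

    /* Calibration tables, filled off-line (Sections 2.2 and 4.1). */
    extern const float a[N_LINES], b[N_LINES];        /* v = a*u + b         */
    extern const float alpha[N_LINES], beta[N_LINES]; /* 1/z = alpha*u + beta */
    extern const float thr[N_LINES]; /* half the minimum distance between
                                        consecutive epipolar lines          */

    /* Match a spot center (u, v) to an epipolar line and output 1/z.
     * Returns 0 on picture noise (no line) or correspondence error. */
    int match_spot(float u, float v, float *inv_z)
    {
        int hit = -1;
        for (int i = 0; i < N_LINES; ++i) {      /* parallel in hardware */
            float v_est = a[i] * u + b[i];       /* estimation block     */
            if (fabsf(v - v_est) < thr[i]) {     /* comparison block     */
                if (hit >= 0) return 0;          /* several "1" results  */
                hit = i;                         /* encoder keeps index  */
            }
        }
        if (hit < 0) return 0;                   /* picture noise        */
        *inv_z = alpha[hit] * u + beta[hit];     /* 1/z computation      */
        return 1;
    }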

8. Wireless Communication

Finally, after computation, the 3D coordinates of the laser dots, accompanied by the texture image, are sent to an external reader. Cyclope is therefore equipped with a wireless communication block which allows us to transmit the texture image and the 3D coordinates of the laser spot centers, and even to reconfigure the digital processing architecture remotely (an Over-The-Air FPGA). Pending the IEEE 802.15 Body Area Network standard [33], the frequency assigned to implanted-device RF communication is around 403 MHz and referred to as the MICS (Medical Implant Communication System) band, for essentially three reasons:

(i) a small antenna,

(ii) a minimum-loss environment, which allows the design of a low-power transmitter,

(iii) a free band that does not cause interference to other users of the electromagnetic radio spectrum [34].

In order to quickly give our prototype a wireless communication capability, we chose to use ZigBee modules at 2.45 GHz, available on the market, rather than MICS modules. We are aware that this latter frequency is not usable for communication between an implant and an external reader, due to the electromagnetic losses of the human body. Two XBee-PRO modules from the Digi Corporation have been used: one for the demonstrator and a second plugged into a PC host, where a human-machine interface has been designed to visualise in real time the textured 3D reconstruction of the scene.

Communication between the wireless module and the FPGA circuit is performed by a standard UART protocol; this principle is shown in Figure 17. To implement this communication we integrated a MicroBlaze softcore processor with UART functionality. The softcore recovers all the data stored in memory (texture and 3D coordinates) and sends them to the wireless module.
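For illustration, the transfer loop running on the softcore could be sketched as follows (the register addresses and status bit are hypothetical placeholders, not the actual MicroBlaze UART registers):

    #include <stdint.h>

    /* Hypothetical memory-mapped UART (placeholder addresses). */
    #define UART_TX     (*(volatile uint8_t  *)0x40600004u)
    #define UART_STATUS (*(volatile uint32_t *)0x40600008u)
    #define TX_FULL     0x8u

    /* Send one result frame (texture bytes followed by the 3D
     * coordinates of the spot centers) towards the XBee module. */
    static void uart_send(const uint8_t *buf, uint32_t len)
    {
        for (uint32_t i = 0; i < len; ++i) {
            while (UART_STATUS & TX_FULL)
                ;  /* busy-wait until the TX FIFO has room */
            UART_TX = buf[i];
        }
    }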

Figure 18: Demonstrator (instrumentation block: CCD and laser; processing block: FPGA; RF block: ZigBee).

9. Demonstrator, Testbench and Results

9.1. Experimental Demonstrator. To demonstrate the feasibility of our system, a large-scale demonstrator has been realised. It uses an FPGA prototyping board based on a Xilinx Virtex2Pro, a pulsed IR laser projector [35] coupled with a diffraction grating that generates a 49-dot pattern, and a CCD imager.

Figure 18 shows the experimental setup. It is composed of a standard 3 mm lens, the CCD camera with an external 8-bit ADC, an IR pattern projector, and a Virtex2Pro prototyping board.

The FPGA mainly hosts the computation unit but is also used to control the image acquisition, the laser synchronisation, the analog-to-digital conversion, and the image storage, and it displays the result through a VGA interface.

Figure 19 shows the principal parts of the control and storage architecture as implemented in the FPGA. Five parts have been designed:

(i) a global sequencer to control the entire process,

(ii) a reset and integration-time configuration unit,

(iii) a VGA synchronisation interface,

(iv) a dual-port memory to store the images and to allow asynchronous acquisition and display operations,

(v) a wireless communication module based on the ZigBee protocol.

A separate pulsed IR projector has been added to the system to demonstrate its functionality.


Figure 19: Implementation of the control and storage architecture (CMOS camera, ADC with time and ADC control, dual-port RAMs, VGA controller and display, pulsed pattern projector).

Figure 20: FPGA working frequency evolution (Fmax in MHz versus the number of parallel operations).

The computation unit was described in VHDL and implemented on a Xilinx Virtex-II Pro FPGA (XC2VP30) with 30,816 logic cells and 136 hardware multipliers [31]. The synthesis and placement were achieved for 49 parallel processing elements. We use 28% of the LUTs and 50 hardware multipliers, for a working frequency of 148 MHz.

9.2. Architecture Performance. To estimate the evolution of the architecture performance, we used a generic description and repeated the synthesis and placement for different pattern sizes (numbers of parallel operations). Figure 20 shows that in every case our architecture mapped on an FPGA can work at almost 90 MHz at least, and can thus meet the real-time constraint of 40 milliseconds.

Table 3: Performances of the distortion correction.

Slices 1795 (13%)

Latency 11.43 ms

Error < 0.01 pixels

Figure 21: (a) Checkerboard image before distortion correction. (b) Checkerboard image after correction.

9.3. Error Estimation of the Optical Correction. The implementation results of the distortion correction method are summarised in Table 3. For this table, we applied the correction model only to the active light spots. Figure 21 presents an image before and after our lens distortion correction.

Regarding size and latency, it is clear that the results are suitable for our application.

Comparing the method used to compute the spot centers with two other methods (see Table 4), it is clear that our approach has a higher accuracy and a smaller size than the approximation method. While it has nearly the same accuracy as the method using a hardware divider, it uses fewer resources.


Figure 22: Depth estimation error (%) versus depth (45–70 cm) for the distorted and corrected images, before and after applying the distortion correction and recomputing the centers.

Regarding latency, the results of all three approaches respect the real-time constraint of the video cadence (25 frames per second). Comparing many measurements of the depth estimation before and after the implementation of our improvements, the results indicate that the precision of the system increased, the residual error being reduced by about 33% (Figure 22).

These results were interpolated with a scale factor to estimate the lens error in the case of an integration inside a video capsule; the results are shown in Figure 23. This scaling was calculated with a distance of 1 cm between the laser projector and the imager. It is the maximal distance that can be considered for endoscopy, corresponding to the diameter of the PillCam video capsule. We can see that correcting the distortion produced by the lens increases the accuracy of our sensor.

9.4. Error Estimation of the 3D Reconstruction. In order to validate our reconstruction architecture, we have compared the results obtained with the synthesised IP (Table 5) and those obtained from a floating-point mathematical model which had already been validated by experimentation. As can be seen, the calculation error margin is relatively small in comparison with the distance variations, which shows that our approach of translating a complex mathematical model into digital processing for an embedded system is valid.

Table 6 shows the reconstruction error for different distances and sizes of the stereoscopic base. We can see that with a base of 5 mm we are able to obtain a 3D reconstruction with an error below 4% at a distance of 10 cm. This precision is perfectly sufficient in the context of human body exploration, and the integration of a stereoscopic base of such a size is relatively simple.

Figure 23: Depth estimation error (%) versus depth (5–9 cm) before and after applying the distortion correction and recomputing the centers, after scaling for integration.

Table 4: Centers computation performances.

Method Slices Latency (μs) Error (pixel)
Approximation 287 4.7 0.21
Hardware divider 1804 1.77 0.0078
Our approach 272 2.34 0.015

Table 5: Results validation.

Coordinate couple, abscissa/ordinate (pixel) Model results (meter) IP results (meter)
401/450 1.57044 1.57342
357/448 1.57329 1.57349
402/404 1.57223 1.57176
569/387 1.22065 1.21734
446/419 1.11946 1.11989
478/319 1.07410 1.07623
424/315 1.04655 1.04676
375/267 1.03283 1.03297
420/177 1.03316 1.03082

Table 6: Precision versus the size of the stereoscopic base (left: base of 0.5 cm; right: base of 1.5 cm).

Distance (cm) Error (%) | Distance (cm) Error (%)
5 1.8 | 5 0.61
10 3.54 | 10 1.21
50 15.52 | 50 5.77
100 26.87 | 100 10.91


Figure 24: Visualisation of the results by our application (texture, IR laser spots, and 3D VRML view).

9.5. Example of Reconstruction. We have used the calibration results to reconstruct the volume of an object (a 20 cm diameter cylinder). The pattern was projected on the scene and the snapshots were taken.

The pattern points were extracted and associated to laser beams using the epipolar constraint. The depth of each point was then calculated using the appropriate model. The texture image was mapped on the reconstructed object and rendered in a VRML player.

We have created an application written in C++ for visualising Cyclope's results (Figure 24). The application receives the texture information and the spatial positions of the barycenters of the 49 infrared laser spots from the wireless communication module. After receiving the results, it draws the texture and three binary maps representing the location of the 49 barycenters in a 3D coordinate system (XY, ZX, and ZY).

A recapitulation of the hardware requirements is presented in Table 7. We can observe that the design is small and, in terms of equivalent logic gates, could be integrated in a small-area chip like an IGLOO AGL1000 device from Actel. Such a device has a size of 10 × 10 mm², and its core could be integrated in a VCE, which has a diameter of around 1 cm. At this moment, we have not implemented the design on this platform; this is a feasibility study, but the first results show that this solution is valid considering the needed resources.

We also present an estimation of the energy consumption, realised with two tools; this estimation is visible in Table 8. The first tool is XPE (Xilinx Power Estimation) from Xilinx, used to evaluate the power consumption of a Virtex, and the second is the IGLOO power calculator from Actel, used to evaluate the power consumption of a low-power FPGA.

Table 7: Recapitulation of the performances.

Architecture CLB slices Latches LUT RAM
Camera 309 337 618 4
Optical correction∗ 92/94 8/8 176/190 32/56
Thresholding 107 192 214 1
Labelling 114 102 227 0
Matching 1932 3025 3864 0
Communication 170 157 277 3
Total used∗ 2323/2325 3821 1555/1569 40/64
Total free 13693 29060 27392 136

∗Direct computation/lookup table.

Table 8: Processing block power consumption estimation.

Device Power consumption Duration (1 battery) Duration (3 batteries)
Virtex 1133 mW 29 min 1 h 26 min
IGLOO 128.4 mW 4 hours 12 hours

These two tools use the processing frequency, the number of logic cells, the number of D flip-flops, and the amount of memory of the design to estimate the power consumption. To realise our estimation, we used the results summarised in Table 7. Our estimation is made with an activity rate of 50%, which is the worst case.

To validate the power consumption estimation in an embedded context, we consider a 3V-CR1220 battery (a 3-volt battery with a diameter of 1.2 cm and a thickness of 2 mm), which has a capacity of 180 mAh, that is to say an ideal energy of 540 mWh. This battery is fully compatible with a VCE like the Pillcam from Given Imaging.

As we can see, the integration of a Virtex in a VCE is impossible because its SRAM memory consumes too much energy. If we consider the IGLOO technology, based on flash memory, we can observe that its power consumption is compatible with a VCE: 540 mWh / 128.4 mW ≈ 4.2 hours, so this technology permits four hours of autonomy with only one battery, and twelve hours of autonomy if three 3V-CR1220 batteries are used in the VCE. This result is encouraging because at this time the mean duration of an examination is ten hours.

10. Conclusion and Perspectives

We have presented in this paper Cyclope, a sensor designed to be a 3D video capsule.

We have explained a method to acquire images at a 25-frame/s video rate while discriminating between the texture and the projected pattern. This method uses an energetic approach, a pulsed projector, and an original 64 × 64 CMOS image sensor with programmable integration time. Multiple images are taken with different integration times to obtain an image of the pattern that is more energetic than the background texture. Our CMOS imager validates this method.


We also presented a 3D reconstruction processing chain that allows a precise, real-time reconstruction. This processing, which is specifically designed for an integrated sensor, and its integration in an FPGA-like device have a power consumption compatible with a VCE examination.

The method was tested on a large-scale demonstrator using an FPGA prototyping board and a 352 × 288 pixel CCD sensor. The results show that it is possible to integrate a stereoscopic base designed for an integrated sensor while keeping a precision suitable for human body exploration.

The next steps of this work are the chip-level integration of both the image sensor and the pattern projector, and the evaluation of the power consumption of the pulsed laser projector considering the optical efficiency of the diffraction head.

The presented version of Cyclope is the first step toward the final goal of the project. The next goal is to realise real-time pattern recognition with processing such as support vector machines or neural networks. The final aim of Cyclope is to be a real smart sensor that can perform part of a diagnosis inside the body and thereby increase its reliability.

References

[1] G. Iddan, G. Meron, A. Glukhovsky, and P. Swain, "Wireless capsule endoscopy," Nature, vol. 405, no. 6785, pp. 417–418, 2000.

[2] J.-F. Rey, K. Kuznetsov, and E. Vazquez-Ballesteros, "Olympus capsule endoscope for small and large bowel exploration," Gastrointestinal Endoscopy, vol. 63, no. 5, p. AB176, 2006.

[3] M. Gay, et al., "La video capsule endoscopique: qu'en attendre?" CISMEF, http://www.churouen.fr/ssf/equip/capsules-videoendoscopiques.html.

[4] T. Graba, B. Granado, O. Romain, T. Ea, A. Pinna, and P. Garda, "Cyclope: an integrated real-time 3d image sensor," in Proceedings of the 19th International Conference on Design of Circuits and Integrated Systems, 2004.

[5] F. Marzani, Y. Voisin, L. L. Y. Voon, and A. Diou, "Active stereovision system: a fast and easy calibration method," in Proceedings of the 6th International Conference on Control, Automation, Robotics and Vision (ICARCV '00), 2000.

[6] W. Li, F. Boochs, F. Marzani, and Y. Voisin, "Iterative 3d surface reconstruction with adaptive pattern projection," in Proceedings of the 6th IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP '06), pp. 336–341, August 2006.

[7] P. Lavoie, D. Ionescu, and E. Petriu, "A high precision 3d object reconstruction method using a color coded grid and nurbs," in Proceedings of the International Conference on Image Analysis and Processing, 1999.

[8] Y. Oike, H. Shintaku, S. Takayama, M. Ikeda, and K. Asada, "Real-time and high resolution 3-d imaging system using light-section method and smart CMOS sensor," in Proceedings of the IEEE International Conference on Sensors (SENSORS '03), vol. 2, pp. 502–507, October 2003.

[9] A. Ullrich, N. Studnicka, J. Riegl, and S. Orlandini, "Long-range high-performance time-of-flight-based 3d imaging sensors," in Proceedings of the International Symposium on 3D Data Processing Visualization and Transmission, 2002.

[10] A. Mansouri, A. Lathuiliere, F. S. Marzani, Y. Voisin, and P. Gouton, "Toward a 3d multispectral scanner: an application to multimedia," IEEE Multimedia, vol. 14, no. 1, pp. 40–47, 2007.

[11] F. Bernardini and H. Rushmeier, "The 3d model acquisition pipeline," Computer Graphics Forum, vol. 21, no. 2, pp. 149–172, 2002.

[12] S. Zhang, "Recent progresses on real-time 3d shape measurement using digital fringe projection techniques," Optics and Lasers in Engineering, vol. 48, no. 2, pp. 149–158, 2010.

[13] F. W. Depiero and M. M. Triverdi, "3d computer vision using structured light: design, calibration, and implementation issues," Journal of Advances in Computers, pp. 243–278, 1996.

[14] E. E. Hemayed, M. T. Ahmed, and A. A. Farag, "CardEye: a 3d trinocular active vision system," in Proceedings of the IEEE Conference on Intelligent Transportation Systems (ITSC '00), pp. 398–403, Dearborn, Mich, USA, October 2000.

[15] A. Kolar, T. Graba, A. Pinna, O. Romain, B. Granado, and E. Belhaire, "Smart Bi-spectral image sensor for 3d vision," in Proceedings of the 6th IEEE Conference on SENSORS (IEEE SENSORS '07), pp. 577–580, Atlanta, Ga, USA, October 2007.

[16] B. Gyselinckx, C. Van Hoof, J. Ryckaert, R. F. Yazicioglu, P. Fiorini, and V. Leonov, "Human++: autonomous wireless sensors for body area networks," in Proceedings of the IEEE Custom Integrated Circuits Conference, pp. 12–18, 2005.

[17] B. Warneke, M. Last, B. Liebowitz, and K. S. J. Pister, "Smart dust: communicating with a cubic-millimeter computer," Computer, vol. 34, no. 1, pp. 44–51, 2001.

[18] R. Horaud and O. Monga, Vision par Ordinateur, chapter 5, Hermes, 1995.

[19] O. Faugeras, Three-Dimensional Computer Vision, a Geometric Viewpoint, MIT Press, Cambridge, Mass, USA, 1993.

[20] J. Batlle, E. Mouaddib, and J. Salvi, "Recent progress in coded structured light as a technique to solve the correspondence problem: a survey," Pattern Recognition, vol. 31, no. 7, pp. 963–982, 1998.

[21] S. Woo, A. Dipanda, F. Marzani, and Y. Voisin, "Determination of an optimal configuration for a direct correspondence in an active stereovision system," in Proceedings of the IASTED International Conference on Visualization, Imaging, and Image Processing, 2002.

[22] O.-Y. Mang, S.-W. Huang, Y.-L. Chen, H.-H. Lee, and P.-K. Weng, "Design of wide-angle lenses for wireless capsule endoscopes," Optical Engineering, vol. 46, October 2007.

[23] J. Heikkila and O. Silven, "A four-step camera calibration procedure with implicit image correction," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1106–1112, San Juan, Puerto Rico, USA, 1997.

[24] K. Hwang and M. G. Kang, "Correction of lens distortion using point correspondence," in Proceedings of the IEEE Region 10 Conference (TENCON '99), vol. 1, pp. 690–693, 1999.

[25] J. Heikkila, Accurate camera calibration and feature based 3-D reconstruction from monocular image sequences, Ph.D. dissertation, University of Oulu, Oulu, Finland, 1997.

[26] N. Otsu, "A threshold selection method from gray level histogram," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.

[27] J. N. Kapur, P. K. Sahoo, and A. K. C. Wong, "A new method for gray-level picture thresholding using the entropy of the histogram," Computer Vision, Graphics, & Image Processing, vol. 29, no. 3, pp. 273–285, 1985.

[28] D. Faura, T. Graba, S. Viateur, O. Romain, B. Granado, and P. Garda, "Seuillage dynamique temps reel dans un systeme embarque," in Proceedings of the 21eme Colloque du Groupe de Recherche et d'Etude du Traitement du Signal et des Images (GRETSI '07), 2007.

Page 159: 541420

EURASIP Journal on Embedded Systems 15

[29] T. Graba, Etude d’une architecture de traitement pour uncapteur integre de vision 3d, Ph.D. dissertation, UniversitePierre et Marie Curie, 2006.

[30] M. Adhiwiyogo, “Optimal pipelining of the I/O ports of thevirtex-II multiplier,” XAPP636, vol. 1.4, June 2004.

[31] Xilinx, “Virtex-II Pro and Virtex-II Pro Platform FPGA:Complete Data Sheet,” October 2005.

[32] A. Kolar, T. Graba, A. Pinna, O. Romain, B. Granado, and T.Ea, “A digital processing architecture for 3d reconstruction,”in Proceedings of the International Workshop on ComputerArchitecture for Machine Perception and Sensing (CAMPS ’06),pp. 172–176, Montreal, Canada, August 2006.

[33] ieee802, http://www.ieee802.org/15/pub/TG6.html.[34] M. R. Yuce, S. W. P. Ng, N. L. Myo, J. Y. Khan, and W. Liu,

“Wireless body sensor network using medical implant band,”Journal of Medical Systems, vol. 31, no. 6, pp. 467–474, 2007.

[35] Laser2000, http://www.laser2000.fr/index.php?id=368949&L=2.

Page 160: 541420

Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 826296, 16 pages
doi:10.1155/2009/826296

Research Article

Performance Evaluation of UML2-Modeled Embedded Streaming Applications with System-Level Simulation

Tero Arpinen, Erno Salminen, Timo D. Hamalainen, and Marko Hannikainen

Department of Computer Systems, Tampere University of Technology, P.O. Box 553, 33101 Tampere, Finland

Correspondence should be addressed to Tero Arpinen, [email protected]

Received 27 February 2009; Accepted 21 July 2009

Recommended by Bertrand Granado

This article presents an efficient method to capture an abstract performance model of streaming data real-time embedded systems (RTESs). Unified Modeling Language version 2 (UML2) is used for the performance modeling and as a front-end for a tool framework that enables simulation-based performance evaluation and design-space exploration. The adopted application metamodel in UML resembles the Kahn Process Network (KPN) model and is targeted at simulation-based performance evaluation. The application workload modeling is done using UML2 activity diagrams, and the platform is described with structural UML2 diagrams and model elements. These concepts are defined using a subset of the profile for Modeling and Analysis of Realtime and Embedded (MARTE) systems from OMG and custom stereotype extensions. The goal of the performance modeling and simulation is to achieve early estimates on task response times and processing element, memory, and on-chip network utilizations, among other information that is used for design-space exploration. As a case study, a video codec application on multiple processors is modeled, evaluated, and explored. In comparison to related work, this is the first proposal that defines a transformation between UML activity diagrams and streaming data application workload metamodels and successfully adopts it for RTES performance evaluation.

Copyright © 2009 Tero Arpinen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Multiprocessor System-on-Chip (SoC) offers a high-performance, yet energy-efficient and programmable platform for modern embedded devices. However, parallelism and the increasing complexity of applications necessitate efficient and automated design methods. Model-driven development (MDD) aims to shorten the design time using abstraction, gradual refinement, and automated analysis with transformation of models. The key idea is to utilize models to highlight certain aspects of the system (behavior, structure, timing, power consumption, etc.) without an implementation.

Unified Modeling Language version 2 (UML2) [1] is a standard language for MDD. In the embedded system domain, its adoption is seen as promising for several purposes: requirements specification, behavioral and architectural modeling, test bench generation, and IP integration [2]. It should be noted, however, that UML2 has also received criticism on its suitability for MDD [3, 4]. UML2 offers a rich set of diagrams for modeling as well as expansion and tailoring methods to derive domain-specific languages. For example, several UML profiles targeted at embedded system design have been developed [5–7].

SoC complexity requires efficient performance evaluation and design-space exploration methods. These methods are often utilized at the system level to make early design decisions. Such decisions include, for instance, choosing the number and type of processors and determining the mapping and scheduling of application tasks. Design-space exploration seeks to find the optimum solution for a given application (domain) and boundary constraints. The design space, that is, the number of possible system configurations, is practically always so large that it becomes intractable not only for manual design but also for brute-force optimization. Hence, efficient methods are needed, for example, optimization heuristics, tool frameworks, and models [8].

This article presents an efficient method to capture an abstract performance model of a streaming data real-time embedded system (RTES). Figure 1 presents the overall methodology used in this work.


Figure 1: The methodology used in this work. (Application workload modeling with UML2 activities and platform performance modeling with structural UML2 feed a system-level simulation in SystemC; the simulation results drive execution monitoring and design-space exploration over the models and simulation results.)

The goal of the performance modeling and simulation is to achieve early estimates of PE, memory, and on-chip network utilization, task response times, and other information that is used for design-space exploration. UML2 is used for performance model specification. The application workload modeling is carried out using UML2 activity diagrams. The platform is described with structural UML2 diagrams and model elements annotated with performance values.

Our focus is on modeling streaming data applications. It is characteristic of streaming applications that a long sequence of data items flows through a stable set of computation steps (tasks) with only occasional control messaging and branching. Each task waits for data items, processes them, and outputs the results to the next task. The adopted application metamodel has been formulated based on this assumption and resembles the Kahn Process Network (KPN) [9] model.

A proprietary UML2 profile for capturing the performance characteristics of an application and platform is defined. The profile definition is based on a well-defined metamodel and reuses suitable modeling concepts from the profile for Modeling and Analysis of Realtime and Embedded systems (MARTE) [5]. MARTE is a standard profile promoted by the Object Management Group (OMG), and it is a promising extension for general-purpose embedded system modeling. It is intended to replace the UML Profile for Schedulability, Performance and Time (SPT) [10]. MARTE is methodology-independent and offers a common set of standard notations and semantics for a designer to choose from while still allowing custom extensions. This means that the profile defined in this article is a specialized instance of the MARTE profile dedicated to our performance evaluation methodology.

It should be noted that the performance models defined in this work can be and have been used together with a custom UML profile for embedded systems, called TUT-Profile [7, 11]. However, this article illustrates the models using the concepts of MARTE, because the adoption of standards promotes commonly known notations and semantics between designers and interoperability between tools.

Further, the article presents how performance values can be specified in UML models with expressions using the MARTE Value Specification Language (VSL). This allows effective parameterization of the system performance model representation according to application-specific variables and reduces the amount of time-consuming and error-prone manual work.

Figure 2: Design Y-chart. (Application functions with their workloads and platform resources—processing, communication, and memory elements—are bound in the mapping phase, which binds application workloads on platform elements; the resulting functions on platform resources are the subject of performance analysis and simulations.)

The presented modeling methods are utilized in a tool framework targeted at simulation-based design-space exploration and performance evaluation. The exploration is based on collecting performance statistics from simulation to optimize the platform and mapping according to a predefined cost function.

An execution-monitoring tool provides visualization and monitoring of the system performance during simulation. As a case study, a video codec system is modeled with the presented modeling methods, and performance evaluation and exploration are carried out using the tool framework.

The rest of the article is organized as follows. Section 2 analyses the methods and concepts used in RTES performance evaluation. Section 3 presents the metamodel utilized in this work for system performance characterization. UML2 and MARTE for RTES modeling are discussed in Section 4. Section 5 presents the UML2 specification of the utilized performance metamodel. Section 6 presents our performance evaluation tool framework. The video codec case study is covered in Section 7. After a final discussion on our proposal in Section 8, Section 9 concludes the article.

2. Analysis of Methods and Concepts Used in RTES Performance Evaluation

In this section the methods and concepts used in RTES performance evaluation are covered. This comprises an introduction to the design Y-chart in RTES performance evaluation, the phases of a model-based RTES performance evaluation process, a discussion on modeling language and tool development, and a short introduction to RTES timing analysis concepts. Finally, the related work on UML in RTES performance evaluation is examined.

2.1. Design Y-Chart and RTES Modeling. The typical approach for RTES performance evaluation follows the design Y-chart [12] presented in Figure 2 by separating the application description from the underlying platform description. The two are bound in the mapping phase. This means that the communication and computation of application functionalities are committed onto certain platform resources.

There are several possible abstraction levels for describing the application and platform for performance evaluation.


One possibility is to utilize abstract specifications. This means that the application workload and the performance of the platform resources are represented symbolically without needing detailed executable descriptions.

Application workload is a quantity that expresses how much capacity is required from the underlying platform components to execute certain functionality. In model-based performance evaluation the workloads can be estimated based on, for example, standard specifications, prior experience from the application domain, or available processing capacity. Legacy application components, on the other hand, can be profiled, and the performance models of these components can be evaluated together with the models of components yet to be developed.

In addition to computational demands, the communication demands between application parts must be considered. In practice, the communication is realized as data messages transmitted between real-time operating system (RTOS) threads or between processing elements over an on-chip communication network. Shared buses and Network-on-Chip (NoC) links and routers perform scheduling for transmitted data packets in an analogous way to how PEs execute and schedule computational tasks. Moreover, inter-PE communication can alternatively be performed using a shared memory. The performance characteristics of memories as well as their utilization play a major role in the overall system performance. The impacts of computation, communication, and storage activities should all be considered in system-level analysis to enable successful performance evaluation of a modern SoC.

2.2. Model-Based RTES Performance Evaluation Process. An RTES performance evaluation process must follow disciplined steps to be effective. From the SoC designer's perspective, a generic performance evaluation process consists of the following steps (some of the concepts of this and the next subsection have been reused and modified from the work in [13]):

(1) selection of the evaluation techniques and tools,

(2) measuring, profiling, and estimating the workload characteristics of the application and determining the platform performance characteristics by benchmarking, estimation, and so forth,

(3) constructing the system performance model,

(4) measuring, executing, or simulating the system performance models,

(5) interpreting, validating, monitoring, and back-annotating the data received from the previous step.

The selection of the evaluation techniques and tools is the first and foremost step in the performance evaluation process. This phase includes considering the requirements of the performance analysis and the availability of tools. It determines the modeling methods used and the effort required to perform the evaluation. It also determines the abstraction level and accuracy used. All further steps in the process depend on this step.

The second step is performed if the system performance model requires initial data about application task workloads or platform performance. This is based on profiling, specifications, or estimation. The application as well as the platform may alternatively be described using executable behavioral models. In that case, such additional information may not be needed, as all performance data can be determined during system model execution.

The actual system model is constructed in the third step by a system architect according to the defined metamodel and model representation methods. The gathered initial performance data is annotated to the system model. The annotation of the profiling results can also be accelerated by combining the profiling and back-annotation with automation tools such as [14].

After system modeling, the actual analysis of the model is carried out. This may involve several model transformations, for example, from UML to SystemC. The analysis methods can be classified into dynamic and static methods [8]. Dynamic methods are based on executing the system model with simulations. Simulations can be categorized into cycle-accurate and system-level simulations. Cycle-accurate simulation means that the timing of the system behavior is defined with the precision of a single clock cycle. Cycle-accuracy guarantees that at any given clock cycle the state of the simulated system model is identical to the state of the real system. System-level simulation uses a higher abstraction level: the system is represented at the IP-block level, consisting of coarse-grained models of processing, memory, and communication elements, and the application functionality is presented by coarse-grained models such as interacting tasks.

Static (or analytic) methods are typically used in early design-space exploration to find different corner cases. Analytical models cannot take into consideration sporadic effects in the system behavior, such as aperiodic interrupts or other aperiodic external events. Static models are suited for performance evaluation when the deterministic behavior of the system is accurate enough for the analysis.

Static methods are faster and provide a significantly larger coverage of the design space than dynamic methods. However, static methods are less accurate, as they cannot take into account the dynamic performance aspects of a multiprocessor system. Furthermore, dynamic methods are better suited for spotting delayed task response times due to blocking of shared resources.

Analysing, measuring, and executing the system performance models usually produces a massive amount of data from the modeled system. The final step in the flow is to select, interpret, and exploit the relevant data. The selection and interpretation of the relevant data depend on the purpose of the analysis, which can be early design-space exploration, for example. In that case, the flow is usually iterative, so that the results are used to optimize the system models, after which the analysis is performed again for the modified models. In dynamic methods, an effective way of analysing the system behavior is to visualize the results of simulation in the form of graphs. This helps the designer to efficiently spot changes in system behavior over time.


2.3. Modeling Language and Tool Development. SoC designers typically utilize predefined modeling languages and tools to carry out the performance evaluation process. On the other hand, language and tool developers have their own steps to provide suitable evaluation techniques and tools for SoC designers. In general they are as follows:

(1) formulation of the metamodel,

(2) developing methods for model representation and capturing,

(3) developing analysis tools according to the selected modeling methods.

The formulation of the metamodel requires a very similar kind of consideration of the objectives of the performance analysis as the selection of techniques and tools by SoC designers. The created metamodel determines the effort required to perform the evaluation as well as the abstraction level and accuracy used. In particular, it defines whether the system performance model can be executed, simulated, or statically analysed.

The second step is to define how the model is captured by a designer. This phase includes the selection or definition of the modeling language (such as UML, SystemC, or a custom domain-specific language). The selection of notations also requires transformation rules defined between the elements of the metamodel and the elements of the selected description language. In the case of UML2, the metamodel concepts are mapped to UML2 metaclasses, stereotyped model elements, and diagrams.

We want to emphasize the importance of performing these first two steps exactly in this order. The definition of the metamodel should be performed independently of the utilized modeling language and with full concentration on the primary objectives of the analysis. The selection of the modeling language should not alter the metamodel nor bias its definition. Instead, the modeling language and notations should be tailored for the selected metamodel, for instance, by utilizing the extension mechanisms of UML2 or by defining a completely new domain-specific language. The reason for this is that model notations contribute only to presentational features. Model semantics truly determine whether the model is usable for the analysis. Nevertheless, presentational features determine the feasibility of the model for a human designer.

The final step is the development of the tools. To provide efficient evaluation techniques, the implementation of the tools should follow the created metamodel and its original objectives. This means that the original metamodel becomes the foundation of the internal metamodel of the tools. The system modeling language and tools are linked together with model transformations. These transformations are used to convert the notations of the system modeling language to the format understood by the tools, while the semantics of the model is maintained.

2.4. RTES Timing Analysis Concepts. A typical SoC contains heterogeneous processing elements executing complex application tasks in parallel. The timing analysis of such a system requires abstraction and parameterization of the key concerns related to the resulting performance.

Hansson et al. define concepts for RTES timing analysis [15]. In the following, a short introduction to these concepts is given.

Task execution time te is the time (in clock cycles or absolute time) in which a set of sequential operations is executed undisturbed on a processing element. It should be noted that the term task is here considered more generally as a sequence of operations or actions related to single-threaded execution, communication, or data storing, whereas the term thread is used to denote a typical schedulable object in an RTOS. Profiling the execution time does not consider background activities in the system, such as RTOS thread pre-emptions, interrupts, or delays caused by waiting for a blocked shared resource. The purpose of the execution time is to determine how much computing resource is required to execute the task. Task response time tr, on the other hand, is the actual time it takes from the beginning to the end of the task in the system. It accounts for all interference from other system parts and background activities.

Execution time and response time can be further classified into worst case (wc), best case (bc), and average case (ac) times. The worst case execution time twce is the worst possible time the task can take when not interfered with by other system activities. On the other hand, the worst case response time twcr is the worst possible time the task may take when considering the worst case scenario in which other system parts and activities interfere with its execution. In multimedia applications that require streaming data processing, the worst case and average case response times are usually the ones that need to be analysed. However, in some hard real-time systems, such as a car air bag controller, the best case response time (tbcr) may be as important as the twcr. The average case response time is usually not so significant. Jitter is a measure of time variability. For a single task, the jitter in execution time can be calculated as Δte = twce − tbce. Respectively, the jitter in response time can be calculated as Δtr = twcr − tbcr.
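As a small worked example with hypothetical values, a task with twce = 120 and tbce = 100 clock cycles has an execution time jitter of Δte = 120 − 100 = 20 cycles; the response time jitter is obtained in the same way from twcr and tbcr.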

It is assumed that the execution time is constant for a given task-PE pair. It should be noted that in practice the execution time of a function may vary depending on, for example, the processed data. For these kinds of functions the constant task execution time assumption is not valid. Instead, the different execution times of such functions should be modeled by selecting a suitable value to characterize them (e.g., worst or average case) or by defining separate tasks for the different execution scenarios. As opposed to the execution time, the response time varies dynamically depending on the surrounding system the task is executed on. The response time analysis must be repeated if

(1) the mapping of application tasks is changed,

(2) new functionalities (tasks) are added to the application,

(3) the underlying execution platform is modified,

(4) the environment (stimuli from outside) changes.

In contrast, a single task execution time does not have to be profiled again if the implementation of the task is not changed (e.g., due to optimization), assuming that the PE on which the profiling was carried out is not changed.


If the executing PE is changed and the profiling uses absolute time units, then reprofiling is needed. However, this can be avoided by utilizing PE-neutral parameters, such as the number of operations, to characterize the execution load of the task. Another possibility is to represent processing element performance using a relative speed factor as in [16].

In multiprocessor SoC performance evaluation, simulating the profiled or estimated execution times (or operation counts) of tasks on abstract HW resource models is an effective way of observing the combined effects of task execution times, mapping, scheduling, and HW platform parameters on the resulting task response times, response time jitters, and processing element utilizations.

Timing requirements of SoC functions are compared against estimated, simulated, or measured response times. It is typical that timing requirements are given as combined response times of several individual tasks. This is naturally completely dependent on the granularity used in identifying individual tasks. For instance, a single WLAN data transmission task could be decomposed into data processing, scheduling, and medium access tasks. Then, examining whether the timing requirement of a single data transmission is met requires examining the response times of the composite tasks in an additive manner.
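As an illustration with hypothetical numbers, if the data processing, scheduling, and medium access tasks of such a decomposed transmission have response times of 1.2 ms, 0.3 ms, and 0.5 ms, the transmission as a whole is checked against its requirement as 1.2 + 0.3 + 0.5 = 2.0 ms.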

2.5. On UML in Simulation-Based RTES Performance Evaluation. Related work includes several static and dynamic methods for the performance evaluation of parallel computer systems. A comprehensive survey of methods and tools used for design-space exploration is presented in [8]. Our focus is on dynamic methods, and some of the research closest to our work is examined in the following.

Erbas et al. [17] present a system-level modeling and simulation environment called Sesame, which aims at efficient design-space exploration of embedded multimedia system architectures. It uses a KPN for modeling the application performance with a high-level programming language. The code of each Kahn process is instrumented with annotations describing the application's computational actions, which allows the computational behavior of an application to be captured. The communication behavior of a process is represented by reading from and writing to FIFO channels. The architecture model simulates the performance consequences of the computation and communication events generated by an application model. The timing of application events is simulated by parameterizing each architecture model component with a table of operation latencies. The simulation provides performance estimates of the system under study together with statistical information such as the utilization of architecture model components. Their performance metamodel and approach have several similarities with ours. The biggest differences are in the abstraction level of HW communication modeling and in the visualization of the system models and performance results.

Balsamo and Marzolla [18] present how UML use case, activity, and deployment diagrams can be used to derive performance models based on multichain and multiclass Queuing Networks. The UML models are annotated according to the UML Profile for Schedulability, Performance and Time Specification [10]. This approach has been developed for SW architectures rather than for embedded systems. No specific tool framework is presented.

Kreku et al. [19] propose a method for simulation-based RTES performance evaluation. The method is based on capturing application workloads using UML2 state-machine descriptions. The platform model is constructed from SystemC component models that are instantiated from a library. Simulation is enabled with automatic C++ code generation from the UML2 description, which makes the application and platform models executable in a SystemC simulator. The platform description provides dedicated abstract services for the application to project its computational and communication loads on HW resources. These functions are invoked from the actions of the state-machines. The utilization of UML2 state-machines enables efficient capture of the control structures of the application. This is a clear benefit in comparison to plain data flow graphs. The platform services can be used to represent data processing and memory accesses. Their method is well suited for control-intensive applications, as UML state-machines are used as the basis of modeling. Our method targets modeling embedded streaming data applications with less modeling effort, using UML activity diagrams.

Madl et al. [20] present how distributed real-time embedded systems can be represented as discrete event systems and propose an automated method for the verification of dense time properties of such systems. The model of computation (MoC) is based on tasks connected with channels. Tasks are mapped onto machines that represent the computational resources of embedded HW.

Our performance evaluation method is based on an executable streaming data application workload model specified as UML activity diagrams and an abstract platform performance model specified in composite structure diagrams. In comparison to related work, this is the first proposal that defines a transformation between UML activity diagrams and streaming data application workload models and successfully adopts it for RTES performance evaluation.

3. Performance Metamodel for Streaming Data Embedded Systems

The foundation of the performance metamodel defined in this work is the earlier work on a Model of Computation (MoC) for architecture exploration described in [21]. We introduce storage tasks, storage elements, and timing constraints as new features. The metamodel definition is given using mathematical equations and set theory. Another alternative would be to utilize the Meta Object Facility (MOF) [22]. MOF is often used to define the metamodels from which UML profiles are derived, as the model elements and notations of MOF are a subset of UML model elements. Next, a detailed formulation of the performance metamodel is carried out.


3.1. Application Performance Metamodel. Application A is defined as a tuple

A = (T, Δ, E, TC), (1)

where T is a set of tasks, Δ is a set of channels, E is a set of external events (or timers), and TC is a set of timing constraints. Tasks are further categorized into sets of execution tasks Te and storage tasks Ts, so that

T = {Te ∪ Ts}. (2)

Channels connect tasks and carry tokens between them. A single channel δ ∈ Δ is defined as

δ = (τsrc, τend, Ebuf), (3)

where τsrc ∈ T is the task that emits tokens to the channel, τend ∈ T is the task that consumes tokens, and Ebuf is the set of buffered tokens in the channel. Tokens in channels represent the flow of control as well as the flow of data in the application. A token carries a certain amount of data from one task to another. This has two impacts: first, a load on the communication medium for the duration of the transfer; second, an execution load when the next task is triggered after reception. The latter enables data-amount-dependent dynamic variations in the execution of application tasks. Similar to the traditional KPN model, channels between tasks (or processes) are unidirectional, unbounded FIFO buffers, and tasks use a blocking read as a synchronization mechanism.

A task τ ∈ T is defined as

τ = (S, ec, F, Δ!, Δ?), (4)

where S ∈ {Run, Ready, Wait, Free} is the state of the task, ec ∈ {N+ ∪ {0}} is the execution counter that is incremented by one each time the task is fired, and F is a set of firing rules whose definition depends on the type of the task. Δ! is the set of incoming channels to the task and Δ? is the set of outgoing channels. The incoming channels of task τ are defined as

Δτ! = {δ ∈ Δ | τend = τ}, (5)

whereas the outgoing channels have the definition

Δτ? = {δ ∈ Δ | τsrc = τ}. (6)

Firing rule fc ∈ Fc for a computational task is a tuple

fc = (tc, Oint, Ofloat, Omem, Δout), (7)

where tc is a task trigger condition and Oint, Ofloat, and Omem represent the computational complexity of the task in terms of the amounts of integer, floating point, and memory operations required to be computed. The subset Δout ⊂ Δ? determines the set of outgoing channels to which tokens are transmitted when the task is fired. Firing rule fs ∈ Fs for a storage task is a tuple

fs = (tc, Ord, Owr, Δout), (8)

where Ord and Owr are the amounts of read and write operations associated with a single storage task. Correspondingly to an execution task, tc is the task trigger condition and Δout ⊂ Δ? is the set of outgoing channels. A task trigger condition is defined as

tc = (Δin, depend, Tec, φec), (9)

where Δin ⊂ Δτ! is the set of required incoming transitions (channels) to trigger the task τ and depend ∈ {Or, And} determines the dependency type on the incoming transitions. Tec is the execution count modulo period and φec is the execution count modulo phase. They can be used to restrict the firing of the task to certain execution count values, so that the task is fired if

ec mod φec = 0 when ec < Tec,
ec mod (Tec + φec) = 0 when ec ≥ Tec. (10)

3.2. External Events and Constraints. External events model the environment of the application feeding input data to the task graph, such as packet reception from a WLAN radio or image reception from an embedded camera. An external event e ∈ E is a tuple

e = (type, tper, δout), (11)

where type ∈ {Oneshot, Periodic} determines whether the event is fired once or periodically, tper is the absolute time or period at which the event is triggered, and δout is the channel into which the events are fed.

A path p is a finite sequence of consecutive tasks. Thus, if n ∈ {N+ ∪ {0}} is the total number of tasks in the path, then p is defined as an n-tuple

p = (x1, x2, x3, . . . , xn), ∀x : x ∈ {T ∪ Δ}. (12)

A timing constraint tc ∈ TC is defined as

tc = (p, treqwcr, treqbcr), (13)

in which p is a consecutive path of tasks and channels, and treqwcr and treqbcr are the required worst case and best case response times for p to be completed after the first element of p has been triggered.

3.3. Platform Performance Metamodel. The HW platform is a tuple

PHW = (C, L), (14)

in which C is a set of platform components and L is a set of communication links connecting the components. The components are further divided into sets of processing elements PE, storage elements SE, and a single communication element ce in such a manner that

C = (PE ∪ SE ∪ ce). (15)

The links L connect the processing and storage elements to the communication element ce, which carries out the required data exchange between the PEs and SEs.


Figure 3: Example performance model. (The figure shows five execution tasks τe0–τe4 and a single storage task τs0 connected by channels δ0–δ5 and fed by external events e0 and e1; the tasks are mapped onto the processing elements pe0–pe2 and the storage element se0 of a HW platform whose computation and communication layers are interconnected by the communication element ce.)

A processing element pe ∈ PE is defined as

pe = (fop, Pint, Pfloat, Pmem), (16)

in which fop is the operating frequency and Pint, Pfloat, and Pmem describe the performance indices of the PE in terms of executing integer, floating point, and memory operations, respectively. If a task has operational complexity O (of one of the three types) and the PE it is mapped on has the corresponding performance index P and frequency fop, then the task execution time can be calculated with

te = O / (P · fop). (17)

A storage element se ∈ SE is defined as

se = (fop, Prd, Pwr), (18)

in which Prd and Pwr are the performance indices for reading from and writing to the storage element. The time it takes to read from or write to the storage is calculated in the same manner as in (17).

The communication element ce has the definition

ce = (fop, Ptx), (19)

where Ptx is the performance index for transmitting data. If a token carries n bits of data over the communication element, then the time of the transfer can be calculated as

ttx = n / (Ptx · fop). (20)
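As a small worked illustration of (17) and (20), the following C++ helpers compute the two times; the function names are hypothetical, the example values are taken from the case study figures (56764 integer operations, a 150 MHz PE), and a performance index of 1.0 operations per cycle is assumed.

// Task execution time te = O / (P * fop) of (17).
double executionTime(double ops, double perfIndex, double freqHz) {
    return ops / (perfIndex * freqHz);
}

// Token transfer time ttx = n / (Ptx * fop) of (20).
double transferTime(double bits, double txPerfIndex, double freqHz) {
    return bits / (txPerfIndex * freqHz);
}

// Example: executionTime(56764, 1.0, 150e6) is roughly 3.78e-4 s, i.e., a
// task of 56764 integer operations takes about 0.38 ms on a 150 MHz PE
// that completes one integer operation per cycle.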

3.4. Metamodel for Functionality Mapping. The mapping M binds the application load characteristics (tasks and channels) to platform resources. It is defined as

M = {Me ∪ Ms}, (21)

where Me = (me1, me2, me3, . . . , men) is a set of mappings of execution tasks to processing elements and Ms = (ms1, ms2, ms3, . . . , msn) is a set of mappings of storage tasks to storage elements. In general, a mapping m ∈ M is defined as a 2-tuple (task, platform element). For instance, an execution task mapping is defined as

m = (τe, pe), τe ∈ Te ∧ pe ∈ PE. (22)

Each task is mapped onto exactly one platform element, and several tasks can be mapped onto a single platform element. Events are not mapped to any platform element. The mapping of channels onto the communication element is not explicitly modeled; instead, they are implicitly mapped onto the single communication element that interconnects the processing and storage elements.

3.5. Example Model. Figure 3 visualizes the primary concepts of our metamodel with a simple example. There are five execution tasks τe0–τe4 and a single storage task τs0 combined together with six channels δ0–δ5. Two external events e0 and e1 feed the task graph with tokens. The computation tasks are mapped (m0–m3) onto three PEs, and the single storage task is mapped (m4) onto the single storage element. All channels are implicitly mapped onto the single communication element, and all inter-PE transfers are conducted by it.

4. UML2 and the MARTE Profile

UML has traditionally been used for specifying software-intensive systems, but currently it is also seen as a promising language for developing embedded systems. Natively, UML2 lacks some of the key concepts that are crucial for embedded systems, such as a quantifiable notion of time, nonfunctional properties, an embedded execution platform, and the mapping of functionality. However, the language has extension mechanisms that can be used for tailoring it for desired domains. One such mechanism is to use profiles that add custom semantics to be used with the set of model elements offered by the language itself. Profiles are defined with stereotype extensions, tag definitions, and constraints. Stereotypes give new semantics to existing UML2 metaclasses. Tagged values are attributes of a stereotype that are used to further specify the stereotyped model element. Constraints limit the metamodel by defining how model elements can be used.

One model element can have multiple stereotypes. Consequently, it gets all the properties, tagged values, and constraints of those stereotypes. For example, a PE may have different stereotypes for defining its performance characteristics and its power consumption characteristics. Separation of concerns (one stereotype for one purpose) when defining profiles is recommended to keep the set of model elements concise for a designer.

4.1. Utilized MARTE Architecture. In this work, a subset of the MARTE profile is used as the foundation for creating our domain-specific modeling language for performance modeling.


Figure 4: Utilized subprofiles of the MARTE profile and extensions for performance evaluation. (From the MARTE foundations, the NFPs and Alloc subprofiles are used; from the design model, HRM; from the annexes, VSL and the MARTE model library. The application workload and platform performance subprofiles are custom extensions on the analysis model side.)

The concepts of the created performance evaluation metamodel are mapped to the stereotypes defined by MARTE. Thereafter, custom stereotypes with associated tag definitions are defined for the rest of the metamodel concepts.

Figure 4 presents the subprofiles of MARTE that are utilized in this work together with the additional subprofiles for our performance evaluation concepts. The complete profile architecture of MARTE can be found in [5]. From the MARTE foundations, the stereotypes of the profiles for nonfunctional properties (NFP) and allocation (Alloc) are used directly. The NFP profile is used for defining different measurement types for the custom stereotype extensions. The allocation subprofile contains suitable concepts for task mapping.

From the MARTE design model, the HW resource modeling (HRM) profile is adopted to identify and give semantics to different types of HW elements. It should be noted that the HRM profile has dependencies on other profiles in the foundations, such as the general resource modeling (GRM) profile, but these are not included in the figure, since their stereotypes are not directly adopted.

The MARTE analysis model contains predefined packages that are dedicated to generic quantitative analysis modeling (GQAM), schedulability analysis modeling (SAM), and performance analysis modeling (PAM). The MARTE profile specification defines that this analysis model can be extended for other domains as well, such as power consumption. We do not utilize the predefined analysis concepts but define our own extensions that implement the metamodel defined in Section 3. This is because the MARTE analysis packages have been defined according to their own metamodel, which differs from ours. Although there are some similarities in the modeling concepts, we define dedicated stereotype extensions to allow as straightforward a way of capturing the performance models as possible.

5. Performance Model Specification in UML2

The extension of the modeling capabilities for our performance metamodel is specified by refining the elements of UML and MARTE with additional stereotypes. These stereotypes specify the performance characteristics of the particular elements to which they are applied. The additional stereotypes are designed so that they can be used with other profiles similar to MARTE. The requirements for such a profile are that it supports embedded HW modeling and a functionality mapping mechanism. As mentioned, the additional stereotypes have also been used successfully with the TUT-Profile. The defined stereotypes are, however, dependent on the nonfunctional property data types and measurement units defined by the MARTE nonfunctional property and model library packages. These data types are used in the tag definitions.

5.1. Application Workload Model Presentation. UML2 activity diagrams have been selected as the view for application workload models. The reasons for this are:

(i) activity diagrams are a natural view for presenting control and data flow between the functional elements of the application,

(ii) activity diagrams have enough expression power to present the application task network of the workload model,

(iii) reuse of activity diagrams created for describing task-level behaviour becomes possible.

In the workload model, basic activities are used as the level of detail in the activity diagrams. A UML2 basic activity is presented as a graph of actions and edges connecting them. Here, actions correspond to tasks T and edges to channels Δ. Basic activities allow the modeling of control and data flow, but explicit forks and joins of control, as well as decisions and merges, are not supported [23]. Still, the expression power is adequate for our workload model.

Figure 5 presents the stereotype extensions for the application performance model. The workloads of tasks T are presented as action nodes. In practice, these actions refer to a certain UML2 behaviour, such as a state-machine, activity, or function, that is mapped onto HW platform elements.

The stereotypes ExecutionWorkload and StorageWorkload are applied to actions that represent execution tasks Te and storage tasks Ts. The tag definitions of these stereotypes define the other properties of the represented tasks, including trigger conditions, computational workload indices, and sent data tokens.


Figure 5: Stereotype extensions for the application workload model. (The stereotypes ExecutionWorkload, StorageWorkload, WorkloadEvent, ResponseTiming, and WorkloadModel extend the Action and Activity metaclasses. The TriggerCondition data type carries the tags inChannels, depend, ecModPhase, and ecModPeriod; DependKind enumerates OR and AND, and EventKind enumerates oneshot and periodic.)

The index of the tagged value lists represents an individual trigger condition and its related actions (operations to be calculated, data to be sent to the next tasks) when the trigger condition is satisfied.

Action nodes are connected together using activity edges. This notation is used in our model presentation to represent a channel δ ∈ Δ between two tasks. The direction of the data flow in the channel is the same as the direction of the activity edge. The names of the channels are directly referenced as strings in the trigger conditions as well as in the tagged values indicating outgoing channels, as in the example below.
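For illustration (the values are hypothetical, following the tag definitions of Figure 5 and the tagged-value notation of Figure 6), a task that fires whenever a token arrives on channel c4 and then sends one token to c5 could carry the tagged values {tc = {inChannels = "c4", depend = OR, ecModPeriod = 0, ecModPhase = 1}, outChannels = "c5"}.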

An external event is presented as an action node stereotyped as WorkloadEvent. Such an action always has a single outgoing channel that carries tokens to the task network. The top-level activity that defines a single complete workload model of the system is stereotyped as WorkloadModel.

Timing constraints are defined by applying the stereotype ResponseTiming to a single action or a complete activity and defining the response timing requirements in terms of worst and best case response times. The timing requirement for an activity is defined as the time it takes to execute the activity from its initial state to its exit state.

Figure 6 shows an example application workload model—our case study—in an activity diagram. There are ten execution tasks connected with edges that represent the channels between the tasks. Actions in the left column (excluding the workload event) are tasks of the encoder, whereas actions in the right column are tasks of the decoder. Tagged values indicating integer operations and send amounts are shown for each task; other tagged values have been left out of the figure for simplicity. The trigger conditions for PreProcessing and VLDecoding are defined so that they execute their operations in a loop. For example, the PreProcessing task fires output tokens Xres ∗ Yres/MBPixelSize times to the channels c2 and c11 when data arrives from the incoming channel c1. This amount corresponds to the number of macroblocks in a single frame. Consecutive processing of this task is triggered by the incoming data token from the loop channel c11. The number of loop iterations for a single frame is thus the same as the number of macroblocks in one frame (Xres ∗ Yres/MBPixelSize). The trigger conditions for the other tasks are defined so that they process their operations and send data to the next task when a data token arrives on their incoming channel. The send probability for all tasks and trigger conditions is 1.0. In this case the sent data amounts are defined as expressions depending on the macroblock size, the bits per pixel (BPP) value, and the image resolution. The operation counts are set as constant values fixed for the utilized macroblock size. There is also a single periodically triggered workload event that feeds the application workload network. The global parameters used in the expressions are defined in the upper right corner of the figure; the sketch below recomputes the derived quantities.
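As a cross-check of these expressions, the following self-contained C++ program recomputes the derived quantities of Figure 6 from its global parameters; the variable names mirror the $-parameters of the figure, while the program itself is only an illustration and not part of the tool flow.

#include <cstdio>

int main() {
    const int    Xres = 352, Yres = 240; // image size
    const int    BPP = 12;               // bits per pixel
    const int    MBPixelSize = 256;      // macroblock size in pixels
    const int    qp = 16;                // quantization parameter
    const double fr = 35.0;              // frame rate (frames/s)

    // Tokens fired by PreProcessing per frame: one per macroblock.
    const int macroblocksPerFrame = Xres * Yres / MBPixelSize;  // 330

    // Data carried by one macroblock token: MBPixelSize*BPP/8.
    const int mbTokenBytes = MBPixelSize * BPP / 8;             // 384 bytes

    // Data sent by VLC per frame: (Xres*Yres*BPP/8)/(qp*3).
    const int vlcBytes = (Xres * Yres * BPP / 8) / (qp * 3);    // 2640 bytes

    // Period of the VideoInput workload event: 1.0/fr.
    const double framePeriodS = 1.0 / fr;                       // ~0.0286 s

    std::printf("%d MBs/frame, %d B/MB token, %d B from VLC, %.4f s period\n",
                macroblocksPerFrame, mbTokenBytes, vlcBytes, framePeriodS);
    return 0;
}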

5.2. Platform Performance Model Presentation. The platform is modeled with stereotyped UML2 classes and class instances.


Figure 6: Example workload model in an activity diagram. (The figure shows ten ExecutionWorkload actions—PreProcessing, MotionEstimation, DCT, Quantization, and VLC of the encoder; VLDecoding, IDCT, Rescaling, MotionCompensation, and MBtoFrame of the decoder—connected by channels c1–c12 and fed by the periodic VideoInput workload event. Each action carries intOps and sendAmount tagged values, and the global parameters are $qp = 16, $fr = 35, $Xres = 352, $Yres = 240, $BPP = 12, and $MBPixelSize = 256.)

Another alternative would be to use stereotyped UML nodes and node instances. Nodes and devices in deployment diagrams are the native way in UML to model a coarse-grained HW architecture that serves as the target for SW artifacts. Memory and communication resource modeling is not natively supported by UML2. Therefore, the MARTE hardware resource modeling (HRM) package is utilized to classify the different types of HW elements.

The MARTE hardware resource modeling package offers several stereotypes for modeling an embedded HW platform. The complete hardware resource model is divided into logical and physical views. The logical view defines HW resources according to their functional properties, whereas the physical view defines their physical properties, such as area and power. The performance modeling does not require considering physical properties, and thus only the stereotypes related to the logical view are needed. Next, the stereotypes utilized from MARTE HRM to categorize the different HW elements are discussed in detail.

HW ComputingResource is a generic MARTE stereotype that is used to represent elements in the HW platform which can execute application functionality. It can be specialized to, for example, HW Processor to indicate its properties as a programmable computing resource. This stereotype or any of its inherited stereotypes is used to represent a processing element pe ∈ PE.


Figure 7: Stereotype extensions for HW platform performance. (The stereotypes PePerformance, with the tags intOpsPerCycle, floatOpsPerCycle, memOpsPerCycle, and opFreq; CommPerformance, with txOpsPerCycle and opFreq; and MemPerformance, with rdOpsPerCycle, wrOpsPerCycle, and opFreq, all extend the Element metaclass.)

Figure 8: Execution platform performance model. (Three hwProcessor instances—cpu1: ARM9 at 150 MHz, cpu2: ARM9 at 120 MHz, and cpu3: ARM9 at 120 MHz—carrying the PePerformance and ep_allocated stereotypes are connected through hibi ports to a hwBus instance bus: Hibi_segment.)

HW Memory is a generic MARTE stereotype for resources that are capable of storing data. This stereotype and its inherited stereotypes, such as HW RAM, are used to represent a storage element se ∈ SE.

Finally, the generic MARTE stereotype HW CommunicationResource and its inherited stereotypes, such as HW Bus, are used to represent the communication element ce.

The performance-related characteristics are given with the three additional stereotypes presented in Figure 7. PePerformance is applied to a processing resource, MemPerformance to a memory resource, and CommPerformance to a communication resource, respectively. The performance characteristics are given for the elements with tagged values of the stereotypes that define the performance indices and operating frequency of the particular elements.

Figure 8 presents an example platform model in a UML composite structure diagram with performance characteristics. In the figure, there are three instances of HW processors (UML parts) connected to a single bus segment with UML ports and connectors. The shown tagged values indicate the operating frequencies of the processors.

5.3. Mapping Model Presentation. The MARTE allocation package is used to model the mapping of application tasks onto platform resources. The MARTE allocation mechanism allows a hybrid allocation in which application behavioral elements are associated with structural platform resources. The hybrid allocation is performed with the two stereotypes ApplicationAllocationEnd and ExecutionPlatformAllocationEnd. In UML diagrams they are written as app_allocated and ep_allocated for conciseness. The application allocation end has a tagged value that describes the platform resources to which the particular application element is mapped. The execution platform allocation end identifies the platform resources onto which application elements can be mapped. A dependency stereotyped Allocated is used to bind application behaviour elements onto platform elements.

An example mapping with the MARTE allocation mechanism is shown in Figure 9. In the figure, the tasks defined in the workload model of Figure 6 are mapped onto the HW processors defined in the HW platform model of Figure 8.


Figure 9: Mapping with the MARTE allocation mechanism. (The ten app_allocated tasks of the workload model are bound with Allocated dependencies onto the ep_allocated processors cpu1, cpu2, and cpu3.)

5.4. Parameterizing Models with MARTE VSL Expressions. The MARTE Value Specification Language (VSL) has been developed to specify the values of constraints, properties, and stereotype attributes, particularly for nonfunctional properties. It is an extension to the value specification and DataType concepts provided by UML, and it can be used in any UML-based specification for extending the base expression infrastructure provided by UML. VSL addresses how to specify variables, constants, and expressions in textual form. It also deals with time values and assertions as well as how to specify composite values, such as collections, intervals, and tuples, in UML models.

In our approach the syntax of VSL is utilized to define expressions in the application workload models and platform performance models. It is an efficient way of parameterizing the workload models according to application-related values. The top-right corner of Figure 6 shows an example of using VSL syntax to parameterize the application workload models according to video quality metrics that are dependent on the application. In the example, the frame rate (fr) is set to 35 frames per second, and this constant variable is utilized to determine the time period of the VideoInput workload event with which a single image is fed to the process network. Further, the macroblock size in pixels (MBPixelSize) and the image size (Xres and Yres) are used to determine the data amounts transferred between tasks.

6. Tool Framework for Model-Driven SoC Performance Evaluation and Exploration

The presented performance evaluation models are used for the early analysis of data-intensive embedded systems. Figure 10 presents the tool framework in which the models are applied.

6.1. Performance Model Capture and System-Level Simulation. The flow begins by capturing the system performance model in UML2 using the presented model elements and profiles. This is followed by the model parsing phase, in which the models are transformed into an XML system model (XSM) [24, 25], the corresponding XML presentation of the UML2 performance models. The XSM is a common format for the tools to exchange information on the designed system, and it can be modified by the tools after its creation during the design-space exploration iterations.

Figure 10: Tool framework for performance evaluation and exploration. (The model parser transforms the UML2 performance model into the XML system model, which is simulated in SystemC with the transaction generator; the performance results feed the execution monitor and the design-space exploration tool, and the back-annotator propagates the exploration results back to the UML2 models.)

After model creation, the XSM file is fed to the simulator. The simulator is divided into two parts: computation and communication. The computation part is in practice realized with a configurable transaction generator (TG) [21]. It simulates the execution and scheduling of tasks on processing and memory elements. It also feeds the underlying communication part with the data tokens transmitted between tasks that are mapped onto different platform elements. The abstraction level of the computation part is the same as that of the metamodel defined in Section 3.

Due to the high abstraction level of the computation part, the executed tasks do not contain any specific functionality; they only reserve the processing or memory element and block it from other tasks for a certain amount of time. For execution tasks, for example, this time is derived with (17).

The computation part (TG) is configured automatically based on the abstract task, processing, and storage resource models defined in UML. The configuration is based on generating corresponding SystemC code containing the same tasks, processing elements, and memory elements. This is done by instantiating generic task and HW element SystemC components with the parameters (operation counts, performance indices, etc.) defined in the UML models, as illustrated by the sketch below.
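As an illustration of the generated computation part, the following SystemC-flavored sketch shows what a generic execution task component might look like; the module and member names are hypothetical, and the scheduling between several tasks mapped on the same PE is omitted for brevity.

#include <systemc.h>

// Hypothetical generic execution task: when triggered, it occupies its
// host PE for te = O / (P * fop) seconds as in (17), then emits a token.
SC_MODULE(ExecutionTask) {
    sc_event trigger;   // notified when the trigger condition is satisfied
    sc_event tokenOut;  // notifies the consumer task in the graph
    double   intOps;    // O: integer operation count of the task
    double   perfIndex; // P: performance index of the mapped PE
    double   freqHz;    // fop: operating frequency of the mapped PE

    void run() {
        for (;;) {
            wait(trigger);                 // blocking read semantics
            double te = intOps / (perfIndex * freqHz);
            wait(sc_time(te, SC_SEC));     // reserve the PE for te
            tokenOut.notify(SC_ZERO_TIME); // pass a token onwards
        }
    }

    SC_CTOR(ExecutionTask) : intOps(0.0), perfIndex(1.0), freqHz(100e6) {
        SC_THREAD(run);
    }
};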

The computation and communication parts are interfaced with Open Core Protocol (OCP) [26] TL2-compatible interfaces.


Table 1: Summary of collected and monitored performance statistics.

Application specific: for example, frame rate, radio throughput.

Application, task communication: signals in/out; avg./tot. communication cycles; communication % of execution time; intra-/inter-PE communication bytes and cycles; communication cycles/byte.

Application, task general: execution count; avg./tot. execution cycles; execution % of thread/service total; signal queue; execution latency; response time.

Mapping: task to thread/PE.

Platform, PE: utilization; inter-PE communication bytes; avg./tot. execution cycles.

Platform, network: utilization; efficiency.

This means that the communication part can be changed to any SystemC-based network model that implements OCP TL2 compatible interfaces for the interconnected elements. This allows simulation of low-abstraction-level models of communication (such as NoCs) with high-abstraction-level models of computation. Currently, the simple performance model presented earlier for the communication element is not used in our framework. Instead, a more accurate SystemC-defined TLM model is used for the communication part in simulations.

6.2. Execution Monitoring. After simulation the simulator tool produces a performance result file. It is a detailed description of events of particular interest during simulation. This file can be used as an input to the Execution Monitor [27] program, which can visualize the simulation in a repeatable manner. The collected and monitored performance statistics are summarized in Table 1. Monitoring the simulation is efficient for spotting trends, correlations, and anomalies in system performance over time. In addition, it is efficient for understanding dynamic effects such as varying delays (jitter) and race conditions due to contention and scheduling.

Performance bottlenecks can be detected by observing the number of tokens in signal queues and the utilization of PEs. If the number of tokens in the incoming channel of a task keeps increasing, it is usually an indication of that task being the bottleneck in a chain of several tasks. On the other hand, a bottleneck can be located when a single processor has a considerably higher utilization than the other collaborating processors.
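As a minimal sketch of the first heuristic, a monitored queue trace could be tested for growth as follows; the function name and the majority-growth criterion are assumptions for illustration:

#include <vector>

// Flags a task as a likely bottleneck when its incoming signal queue
// grows in most of the sampled intervals of the monitored trace.
bool isLikelyBottleneck(const std::vector<int>& queueSamples) {
    if (queueSamples.size() < 2) return false;
    int rises = 0;
    for (size_t i = 1; i < queueSamples.size(); ++i)
        if (queueSamples[i] > queueSamples[i - 1]) ++rises;
    return 2 * rises > static_cast<int>(queueSamples.size()) - 1;
}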

In practice, the modeled response time requirements are validated by observing the maximum response time of a task in different execution scenarios. Meeting throughput requirements can be observed in a similar manner.

Figure 11 presents the control view of the execution monitor tool. In the figure, the control view shows a system consisting of ten tasks mapped onto three processors. Each processor column consists of the current task mapping on top and an optional graph on the bottom. The graph can present, for example, processor utilization, as in the figure.

6.3. Design-Space Exploration. After simulation and performance monitoring, the performance simulation results and the XSM are fed to the design-space exploration tool, which tries to optimize the platform parameters and the task mapping so that a user-defined cost function is minimized. The cost function can contain several nonfunctional properties such as power, frequency, area, or the response time of an individual task. The design-space exploration tool supports several mapping heuristics: simulated annealing, group migration, a hybrid of the previous two [28], optimal subset mapping [29], genetic algorithm, and random. The design-space exploration cycle continues by repeating the simulation after each remapping or modification of the execution platform.
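For illustration, a user-defined cost function of this kind could be a simple weighted sum; the property names and weights below are assumptions, not the tool's actual interface:

// Nonfunctional properties of one explored system configuration.
struct SystemMetrics {
    double power_mW;
    double area_mm2;
    double freq_MHz;
    double responseTime_us; // response time of a selected task
};

// Weighted cost to be minimized by the exploration tool; smaller is better.
double cost(const SystemMetrics& m) {
    const double wPower = 1.0, wArea = 2.0, wFreq = 0.5, wResp = 4.0;
    return wPower * m.power_mW + wArea * m.area_mm2
         + wFreq * m.freq_MHz + wResp * m.responseTime_us;
}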

After the design-space exploration cycle ends, the optimized system description is again written to the XSM file. The back-annotator tool is used to change the UML2 models according to the results of the design-space exploration (updated platform and mapping).

6.4. Governing the Tool Flow Execution. The execution of the design flow is governed by a customizable Java-based tool for configuring and executing SoC design flows, called the Koski Graphical User Interface. The idea of this tool is that the user selects the tools of the flow to be executed from a library of tools. New tools can be imported to the library in a plug-and-play fashion. Each tool includes a section of XML which specifies the input and output tokens (files and parameters) of that particular tool. Parameters of individual tools can be set via the GUI. For example, platform constraints such as the maximum and minimum number of PEs and the cost function of the design-space exploration tool are this kind of parameter. Due to its flexibility, this tool has proven very effective for researching and evaluating different methodologies and tool flow configurations.

7. Case Study: Performance Evaluation and Exploration of a Video Codec on a Multiprocessor SoC

This section presents a case study that illustrates the applicability of the modeling methods and tool framework in practice. The application is a video codec on a multiprocessor platform. We used an approach in which new functionality representing a web client was modeled and added to the existing video codec system of Figure 6, and the system was simulated and optimized based on the monitored information.

7.1. Profiling and Modeling. All the functions were modeled by their workload and simulated in SystemC using the TG.



Figure 11: Control view in execution monitor.

The workload model of the video codec was originally profiled from a real FPGA execution trace, whereas the model of the web client was only a single task with an early estimate of its behavior.

The performance requirement of the video codec was set to 35 frames per second (FPS). Thus, an external event representing the camera triggered at a 35 Hz frequency. The HW platform consisted of three processors connected through a shared bus. The operating frequencies of the processors were set to 150 MHz, 120 MHz, and 120 MHz. The frequency of the bus was set to 100 MHz.

7.2. Simulating and Monitoring. When the original system was simulated, it was observed that it met the FPS requirement. Next, functionality for the web client was added to run in parallel with the video codec. The web client was mapped to cpu1 (see Figure 11) because it was observed that the utilization of cpu1 was the lowest in the original system. Simulations indicated that the performance of the video codec decreased to 14 FPS. In addition, cpu1 became fully utilized at all times, whereas the utilizations of the other two processors decreased. This indicated a clear bottleneck on cpu1, as it was not able to forward processed data fast enough to the other processors. This could also be observed from the signal queues of the tasks mapped onto cpu1. The environment produced raw frames so fast that they started accumulating at cpu1.

Thereafter, a remapping of the application tasks was performed, since the workload of the processors was clearly imbalanced. The mapping was done manually so that all the encoder tasks were mapped to cpu1, the decoder tasks to cpu2, and the web client functionality was isolated to cpu3. During the simulation it was observed that this improved the FPS to 22.

Because the manual mapping did not result in the required performance, the next phase was automatic exploration of the task mapping. The resulting mapping was nonobvious, because the tasks of the encoder and decoder were distributed among all the processors. Hence, it is unlikely that we would have arrived at it with manual mapping.

The system became more balanced and the video codec performance increased to 30 FPS, but it still did not meet the required 35 FPS. Cpu1 was still the bottleneck, and the signal queues of the tasks mapped to it kept increasing. However, they were not increasing as fast as with the unoptimized mapping, as presented in Figure 12. Figure 12(a) illustrates the queue before the mapping exploration and Figure 12(b) after the exploration. The signal queues are shown for the time frame of 50 to 100 ms, and the scale of the y-axis is 0–150 signals.

Finally, automated exploration was performed for the operating frequencies of the processors. The result of the exploration was that the frequency of cpu1 was increased by 40 MHz to 190 MHz, and the frequencies of the other two processors were increased by 20 MHz to 140 MHz. The simulation of this system model showed that the FPS requirement should be met, and the tasks could process all the signals they received.

8. Discussion

In early performance evaluation, the key issue is the tradeoff between the accuracy and the development time of the model. The best accuracy is achieved from cycle-accurate simulations or from an actual implementation. However, constructing a cycle-accurate model or integrating the system is very time consuming in comparison to using system-level models and simulations. Thus, utilization of abstract system-level models allows the designer to explore the design space more efficiently. The actual simulation time is also shorter in system-level simulations than in cycle-accurate simulations.



Figure 12: Signal queues for task VLC before and after mapping exploration.

In this work we concentrate on reducing the effort of specifying and managing the performance models for system-level simulations. This has been done by utilizing graphical UML2 models. As a result, the readability of the models is improved in comparison to a textual presentation. The case study showed that the system model is easy to construct, interpret, and modify with the presented UML model elements. The case study models were constructed in a few hours. Profiling and estimating operation counts for workload tasks can be considered time-consuming and hard. In our case, it was done by profiling a similar application executing on an FPGA.

MARTE VSL was found useful for defining expressions. It significantly simplified modifying the models with different application-specific parameters in comparison to using constant values.

In an earlier study [30] the average error in frame rate was 4.3%. This article uses the same metamodel. Hence, it can be concluded that our method offers designer-friendly, rapid, yet rather accurate performance evaluation for RTES.

9. Conclusions and Future Work

This article presented an efficient method to model and evaluate the performance of streaming-data embedded systems with UML2 and system-level simulations. The modeling methods were successfully utilized in a tool framework for early performance evaluation and design-space exploration. The case study showed that UML2, the presented modeling methods, and the utilized performance evaluation tools form a designer-friendly, rapid, yet rather accurate way of modeling and evaluating RTES performance before actual implementation. Future work consists of taking into account the impact of the SW platform in the RTES performance metamodel. This includes the workload of SW platform services (such as file access and memory allocation) as well as the scheduling of tasks with different policies.

References

[1] Object Management Group (OMG), "Unified Modeling Language (UML) Superstructure," V2.1.2, November 2007.

[2] G. Martin and W. Mueller, Eds., UML for SOC Design, Springer, 2005.

[3] K. Berkenkotter, "Using UML 2.0 in real-time development: a critical review," in International Workshop on SVERTS: Specification and Validation of UML Models for Real Time and Embedded Systems, October 2003.

[4] R. B. France, S. Ghosh, T. Dinh-Trong, and A. Solberg, "Model-driven development using UML 2.0: promises and pitfalls," IEEE Computer, vol. 39, no. 2, pp. 59–66, 2006.

[5] Object Management Group (OMG), "A UML profile for MARTE, beta 1 specification," August 2007.

[6] Object Management Group (OMG), "OMG Systems Modeling Language (SysML) specification," September 2007.

[7] P. Kukkala, J. Riihimaki, M. Hannikainen, T. D. Hamalainen, and K. Kronlof, "UML 2.0 profile for embedded system design," in Proceedings of the Conference on Design, Automation and Test in Europe (DATE '05), vol. 2, pp. 710–715, March 2005.

[8] M. Gries, "Methods for evaluating and covering the design space during early design development," Integration, the VLSI Journal, vol. 38, no. 2, pp. 131–183, 2004.

[9] G. Kahn, "The semantics of a simple language for parallel programming," in Proceedings of the IFIP Congress on Information Processing, August 1974.

[10] Object Management Group (OMG), "UML profile for schedulability, performance, and time specification (Version 1.1)," January 2005.

[11] T. Arpinen, M. Setala, P. Kukkala, et al., "Modeling embedded software platforms with a UML profile," in Proceedings of the Forum on Specification & Design Languages (FDL '07), Barcelona, Spain, April 2007.

[12] K. Keutzer, S. Malik, R. Newton, et al., "System-level design: orthogonalization of concerns and platform-based design," IEEE Transactions on Computer-Aided Design, vol. 19, no. 12, pp. 1523–1543, 2000.

[13] G. Kotsis, Workload Modeling for Parallel Processing Systems, Ph.D. thesis, University of Vienna, Vienna, Austria, 1995.

[14] P. Kukkala, M. Hannikainen, and T. D. Hamalainen, "Performance modeling and reporting for the UML 2.0 design of embedded systems," in Proceedings of the International Symposium on System-on-Chip, pp. 50–53, November 2005.

[15] H. Hansson, M. Nolin, and T. Nolte, "Real-time in embedded systems," in Embedded Systems Handbook, chapter 2, CRC Press Taylor & Francis, 2004.

[16] F. Boutekkouk, S. Bilavarn, M. Auguin, and M. Benmohammed, "UML profile for estimating application worst case execution time on system-on-chip," in Proceedings of the International Symposium on System-on-Chip, pp. 1–6, November 2008.

[17] C. Erbas, A. D. Pimentel, M. Thompson, and S. Polstra, "A framework for system-level modeling and simulation of embedded systems architectures," EURASIP Journal on Embedded Systems, vol. 2007, Article ID 82123, 11 pages, 2007.

[18] S. Balsamo and M. Marzolla, "Performance evaluation of UML software architectures with multiclass queueing network models," in Proceedings of the 5th International Workshop on Software and Performance (WOSP '05), pp. 37–42, July 2005.

[19] J. Kreku, M. Hoppari, T. Kestila, et al., "Combining UML2 application and SystemC platform modelling for performance evaluation of real-time embedded systems," EURASIP Journal on Embedded Systems, 2008.

[20] G. Madl, N. Dutt, and S. Abdelwahed, "Performance estimation of distributed real-time embedded systems by discrete event simulations," in Proceedings of the 7th ACM & IEEE International Conference on Embedded Software (EMSOFT '07), pp. 183–192, 2007.

[21] T. Kangas, Methods and Implementations for Automated System on Chip Architecture Exploration, Ph.D. thesis, Tampere University of Technology, 2006.

[22] Object Management Group (OMG), "Meta Object Facility (MOF) specification (version 1.4)," April 2002.

[23] Object Management Group (OMG), "Unified Modeling Language (UML) superstructure specification," V2.1.2, November 2007.

[24] T. Kangas, J. Salminen, E. Kuusilinna, et al., "UML-based multiprocessor SoC design framework," ACM TECS, vol. 5, no. 2, pp. 281–320, 2006.

[25] E. Salminen, C. Grecu, T. D. Hamalainen, and A. Ivanov, "Network-on-chip benchmarking specifications part I: application modeling and hardware description," v1.0, OCP-IP, April 2008.

[26] "Open Core Protocol International Partnership (OCP-IP)," OCP specification 2.2, May 2008, http://www.ocpip.org.

[27] K. Holma, T. Arpinen, E. Salminen, M. Hannikainen, and T. D. Hamalainen, "Real-time execution monitoring on multiprocessor system-on-chip," in Proceedings of the International Symposium on System-on-Chip (SOC '08), pp. 1–6, November 2008.

[28] H. Orsila, T. Kangas, M. Hannikainen, and T. D. Hamalainen, "Hybrid algorithm for mapping static task graphs on multiprocessor SoCs," in Proceedings of the International Symposium on System-on-Chip, pp. 146–150, November 2005.

[29] H. Orsila, E. Salminen, M. Hannikainen, and T. D. Hamalainen, "Optimal subset mapping and convergence evaluation of mapping algorithms for distributing task graphs on multiprocessor SoC," in Proceedings of the International Symposium on System-on-Chip, November 2007.

[30] K. Holma, M. Setala, E. Salminen, M. Hannikainen, and T. D. Hamalainen, "Evaluating the model accuracy in automated design space exploration," Microprocessors and Microsystems, vol. 32, no. 5-6, pp. 321–329, 2008.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 235032, 12 pages
doi:10.1155/2009/235032

Research Article

Cascade Boosting-Based Object Detection from High-Level Description to Hardware Implementation

K. Khattab, J. Dubois, and J. Miteran

Le2i UMR CNRS 5158, Aile des Sciences de l’Ingenieur, Universite de Bourgogne, BP 47870, 21078 Dijon Cedex, France

Correspondence should be addressed to J. Dubois, [email protected]

Received 28 February 2009; Accepted 30 June 2009

Recommended by Bertrand Granado

Object detection forms the first step of a larger setup for a wide variety of computer vision applications. The focus of this paper is the implementation of a real-time embedded object detection system while relying on a high-level description language such as SystemC. Boosting-based object detection algorithms are considered the fastest accurate object detection algorithms today. However, the implementation of a real-time solution for such algorithms is still a challenge. A new parallel implementation, which exploits the parallelism and the pipelining in these algorithms, is proposed. We show that using a SystemC description model paired with a mainstream automatic synthesis tool can lead to an efficient embedded implementation. We also discuss some of the tradeoffs and considerations needed for this implementation to be effective. This implementation proves capable of achieving 42 fps for 320 × 240 images as well as bringing regularity to the processing time.

Copyright © 2009 K. Khattab et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Object detection is the task of locating an object in an image despite considerable variations in lighting, background, and object appearance. The ability to detect objects in a scene is critical in our everyday life activities, and lately it has gathered an increasing amount of attention.

Motivated by a very active area of vision research, most object detection methods focus on detecting frontal faces (Figure 1). Face detection is considered an important subtask in many computer vision application areas such as security, surveillance, and content-based image retrieval. Boosting-based methods have led to the state-of-the-art detection systems. They were first introduced by Viola and Jones as a successful application of Adaboost [1] for face detection. Then Li et al. extended this work to multiview faces, using improved variant boosting algorithms [2, 3]. The same methods are also used to detect a plethora of other objects, such as vehicles, bikes, and pedestrians. Overall these methods have proved to be accurate and time efficient.

Moreover, this family of detectors relies upon several classifiers trained by a boosting algorithm [4–8]. These algorithms help to build a linear combination of weak classifiers (often a single threshold), capable of real-time face detection with high detection rates. Such a technique can be divided into two phases: training and detection (through the cascade). While the training phase can be done offline and might take several days of processing, the final cascade detector should enable real-time processing. The goal is to run through a given image in order to find all the faces regardless of their scales and locations. Therefore, the image can be seen as a set of subwindows that have to be evaluated by the detector, which selects those containing faces.

Most of the solutions deployed today are software running on general-purpose processors. Furthermore, with the development of faster camera sensors, which allow higher image resolutions at higher frame rates, these software solutions do not always work in real time. Accelerating boosting-based detection can be considered a key issue in pattern recognition, much as motion estimation is for MPEG-4.

Seeking some improvement over the software, several attempts were made to implement object/face detection on multi-FPGA boards and multiprocessor platforms using programmable hardware [9–14], only to fall short in frame rate and/or accuracy.


The first contribution of this paper is a new structure that exploits the intrinsic parallelism of a boosting-based object detection algorithm.

As a second contribution, this paper shows that a hardware implementation is possible using high-level SystemC description models. SystemC enables PC simulation, which allows simple and fast testing, and leaves our structure open to any kind of hardware or software implementation, since SystemC is independent of all platforms. Mainstream synthesis tools, such as SystemCrafter [15], are capable of generating automatic RTL VHDL out of SystemC models, though there is a list of restrictions and constraints. The simulation of the SystemC models has highlighted the critical parts of the structure. Multiple refinements were made to obtain a precise, compile-ready description. Therefore, multiple synthesis results are shown. Note that our fastest implementation is capable of achieving 42 frames per second for 320 × 240 images running at a 123 MHz frequency.

The paper is structured as follows. In Section 2 the boosting-based object detectors are reviewed, focusing on accelerating the detection phase only. In Section 3 a sequential implementation of the detector is given, showing its real-time estimation and drawbacks. A new parallel structure is proposed in Section 4; its benefits in masking the irregularity of the detector and in speeding up the detection are also discussed. In Section 5 a SystemC modelling of the proposed architecture is shown using various abstraction levels. Finally, the firmware implementation details as well as the experimental results are presented in Section 6.

2. Review of Boosting-Based Object Detectors

Object detection is defined as the identification and localization of all image regions that contain a specific object regardless of the object's position and size, in an uncontrolled background and lighting. It is more difficult than object localization, where the number of objects and their size are already known. The object can be anything from a vehicle, a human face (Figure 1), a human hand, a pedestrian, and so forth. The majority of the boosting-based object detection works to date have primarily focused on face detection, since it is very useful for a large array of applications. Moreover, this task is much trickier than other object detection tasks, due to the typical variations of hair style, facial hair, glasses, and other adornments. However, a lot of previous works have proved that the same family of detectors can be used for different types of objects, such as hands, pedestrians [4, 10], and vehicles. Most of these works achieved high detection accuracies; of course, a learning phase was essential for each case.

2.1. Theory of Boosting-Based Object Detectors

2.1.1. Cascade Detection. The structure of the cascade detector (introduced in face detection by Viola and Jones [1]) is that of a degenerated decision tree. It is constituted of successively more complex stages of classifiers (Figure 2).

Figure 1: Example of face detection.


Figure 2: Cascade detector.


Figure 3: Rectangle features.

The objective is to increase the speed of the detector by focusing on the promising zones of the image. The first stage of the cascade looks over these promising zones and indicates which subwindows should be evaluated by the next stage. If a subwindow is labeled by the current classifier as nonface, it is rejected and the decision upon it is terminated. Otherwise it has to be evaluated by the next classifier. When a sub-window survives all the stages of the cascade, it is labeled as a face. The classifier complexity therefore increases dramatically with each stage, but the number of sub-windows to be evaluated decreases even more tremendously. Over the cascade the overall detection rate should remain high while the false positive rate should decrease aggressively.

2.1.2. Features. To achieve a fast and robust implementation, boosting-based face detection algorithms use rectangle Haar-like features (shown in Figure 3) introduced by [16]: two-rectangle features (A and B), three-rectangle features (C), and four-rectangle features (D). They operate on grayscale images, and their decisions depend on the threshold difference between the sum of the luminance of the white region(s) and the sum of the luminance of the gray region(s).

Using a particular representation of the image, the so-called Integral Image (II), it is possible to compute the features very rapidly.



Figure 4: The sum of pixels within rectangle D can be calculated using 4 array references: SD = II[P4] − (II[P3] + II[P2] − II[P1]).

The II is constructed from the initial image by simply taking the sum of the luminance values above and to the left of each pixel in the image:

ii(x, y) = \sum_{x' < x,\; y' < y} i(x', y'),    (1)

where ii(x, y) is the integral image and i(x, y) is the original image pixel's value. Using the integral image, any sum of luminance within a rectangle can be calculated from II using four array references (Figure 4). After the II computation, the evaluation of each feature requires 6, 8, or 9 array references depending on its type.
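A compact C++ sketch of (1) and the four-reference rectangle sum of Figure 4 is given below; it is illustrative only, with a one-pixel border of zeros added so that rectangle sums follow the Figure 4 pattern directly:

#include <cstdint>
#include <vector>

// Build the integral image of (1); ii has one extra row and column of
// zeros, so each entry holds the sum of all pixels strictly above-left.
std::vector<uint32_t> integralImage(const std::vector<uint8_t>& img,
                                    int w, int h) {
    std::vector<uint32_t> ii((w + 1) * (h + 1), 0);
    for (int y = 1; y <= h; ++y)
        for (int x = 1; x <= w; ++x)
            ii[y * (w + 1) + x] = img[(y - 1) * w + (x - 1)]
                                + ii[(y - 1) * (w + 1) + x]
                                + ii[y * (w + 1) + x - 1]
                                - ii[(y - 1) * (w + 1) + x - 1];
    return ii;
}

// Sum of luminance inside a rw x rh rectangle at (x, y), using the four
// array references of Figure 4: SD = II[P4] - (II[P3] + II[P2] - II[P1]).
uint32_t rectSum(const std::vector<uint32_t>& ii, int w,
                 int x, int y, int rw, int rh) {
    const int s = w + 1; // row stride of the integral image
    return ii[(y + rh) * s + (x + rw)]          // II[P4]
         - (ii[(y + rh) * s + x]                // II[P3]
          + ii[y * s + (x + rw)]                // II[P2]
          - ii[y * s + x]);                     // II[P1]
}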

However, assuming a 24 × 24 pixel sub-window size, the over-complete feature set of all possible features computed in this window is 45,396 [1]: it is clear that a feature selection is necessary in order to keep real-time computation time compatibility. This is one of the roles of the boosting training step.

2.1.3. Weak Classifiers and Boosting Training. A weak classifier h_j(x) consists of a feature f_j, a threshold θ_j, and a parity p_j indicating the direction of the inequality sign:

h_j(x) = \begin{cases} 1, & \text{if } p_j f_j(x) < p_j \theta_j, \\ 0, & \text{otherwise.} \end{cases}    (2)

Boosting algorithms (Adaboost and variants) are able to construct a strong classifier as a linear combination of weak classifiers (here a single threshold) chosen from a given, finite or infinite, set, as shown in (3):

h(x) = \begin{cases} 1, & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) > \theta, \\ 0, & \text{otherwise,} \end{cases}    (3)

where θ is the stage threshold, α_t is the weak classifier's weight, and T is the total number of weak classifiers (features).
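The decision of one stage, combining (2) and (3), can be sketched as follows; the type names are assumptions, and feature evaluation is abstracted into precomputed values for brevity:

#include <vector>

struct WeakClassifier {
    double alpha;  // weight of the weak classifier in (3)
    double theta;  // feature threshold of (2)
    int parity;    // +1 or -1, direction of the inequality in (2)
};

// Stage decision on one sub-window: sum the weights of the weak
// classifiers that fire, then compare against the stage threshold.
bool stagePasses(const std::vector<WeakClassifier>& weak,
                 const std::vector<double>& featureValues,
                 double stageThreshold) {
    double sum = 0.0;
    for (size_t t = 0; t < weak.size(); ++t)
        if (weak[t].parity * featureValues[t]
                < weak[t].parity * weak[t].theta)
            sum += weak[t].alpha; // h_t(x) = 1
    return sum > stageThreshold;  // strong classifier of (3)
}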

This linear combination is trained in cascade in order to obtain better results.

A variant of Adaboost is used for learning object detection; it performs two important tasks: feature selection from the features defined above and constructing classifiers using the selected features.

The result of the training step is a set of parameters (array references for features, constant coefficients of the linear combination of classifiers, and threshold values selected by Adaboost). This set of feature parameters can be stored easily in a small local memory.

2.2. Previous Implementations. The state-of-the-art initial prototype of this method, also known as the Viola-Jones algorithm, was a software implementation based on classifiers trained using Adaboost. The first implementation showed good potential by achieving good results in terms of speed and accuracy; the prototype could achieve 15 frames per second on a desktop computer for 320 × 240 images. Such an implementation on general-purpose processors offers a great deal of flexibility, and it can be optimized with little time and cost, thanks to the wide variety of well-established design tools for software development. However, such an implementation can occupy all the CPU computational power for this task alone; nevertheless, face/object detection is considered a prerequisite step for main applications such as biometrics, content-based image retrieval systems, surveillance, and autonavigation. Therefore, there is more and more interest in exploring implementations of accurate and efficient object detection on low-cost embedded technologies. The most common target technologies are embedded microprocessors such as DSPs, pure hardware systems such as ASICs, and configurable hardware such as FPGAs.

Many tradeoffs can be mentioned when trying to compare these technologies. For instance, the use of an embedded processor can increase the level of parallelism of the application, but it comes with high power consumption, all while limiting the solution to run on a dedicated processor.

Using an ASIC can yield better frequency performance coupled with a high level of parallelism and low power consumption. Yet, in addition to the loss of flexibility, using this technology requires a large amount of development, optimization, and implementation time, which elevates the cost and risk of the implementation.

FPGAs can have a slightly better performance/cost tradeoff than the previous two, since they permit a high level of parallelism coupled with some design flexibility. However, some restrictions in design space, costly RAM connections, as well as a lower frequency compared to ASICs, can rule out their use for some memory-heavy applications.

To our knowledge, few attempts were made to implement boosting-based face detection on embedded platforms; even fewer attempted such an implementation for other object types, for example, whole body detection [10].

Nevertheless, these proposed architectures were configurable hardware-based implementations, and most of them could not achieve a high detection frame rate while keeping the detection rate close to that of the original implementation. For instance, in order to achieve 15 frames per second for 120 × 120 images, Wei et al. [11] chose to increase the enlargement scale factor from 1.25 to 2. However, such a maneuver lowers the detection rate dramatically.

Theocharides et al. [12] have proposed a parallel architecture taking advantage of a grid array processor. This array processor is used as memory to store the computation data and as a data transfer unit, to aid in accessing the integral image in parallel.


This implementation can achieve 52 frames per second at a 500 MHz frequency. However, details about the image resolution were not mentioned.

Another complex control scheme to meet hard real-time deadlines is proposed in [13]. It introduces a new hardware pipeline design for Haar-like feature calculation and a system design exploiting several levels of parallelism. But it sacrifices the detection rate, and it is better suited for portrait pictures.

More recently, an implementation with an NoC (Network-on-Chip) architecture was proposed in [14], using some of the same elements as [12]; this implementation achieves 40 frames per second for 320 × 240 images. However, its detection rate of 70% was well below that of the software implementation (82% to 92%), due to the use of only 44 features (instead of about 4600 in [1]).

3. Sequential Implementation

In the software implementation, the strategy used consists of processing one sub-window at a time. The processing of the next sub-window is not triggered until a final decision is taken upon the previous one, that is, after going through a set of features as a programmable list of coordinate rectangles.

In attempting to implement such a cascade algorithm, each stage is investigated alone. For instance, the first stage classifier should be separated from the rest, since it requires processing all the possible subwindows in an image, while each of the others relies on the result of the previous stage and evaluates only the subwindows that passed through.

3.1. First Classification Stage. As mentioned earlier, this classifier must run over the whole image and reject the subwindows that do not fit the criteria (no face in the window). The detector is scanned across locations and scales, and subsequent locations are obtained by shifting the window some number of pixels k. Only positive results trigger the next classifier.

The addresses of the positive sub-windows are stored in a memory, so that the next classifier can evaluate them, and only them, in the next stage. Figure 5 shows the structure of such a classifier. The processing time of this first stage is stable and independent of the image content; the algorithm here is regular. The classifier complexity at this stage is usually very low (only one or two features are considered; the decision is made of one or two comparisons, two multiplications, and one addition).

3.2. Remaining Classification Stages. The next classification stages, shown in Figure 6, do not need to evaluate the whole image. Each classifier should examine only the positive results given by the previous stage, by reading their addresses in the memory, and then take a decision upon each one (reject or pass to the next classifier stage).

Each remaining classifier is expected to reject the majority of sub-windows and keep the rest to be evaluated later in the cascade. As a result, the processing time depends largely on the number of positive sub-windows resulting from the previous stage. Moreover, the classifier complexity (number of comparisons, multiplications, and additions) increases with the stage level.

Figure 5: First cascade stage.

Figure 6: nth stage classifier.

Figure 7: Sequential implementation.

3.3. Full Sequential Implementation. The full sequential implementation of this cascade is proposed in Figure 7.

For a 320 × 240 image, scanned at 11 scales with a scaling factor of 1.25 and a step of 1.5, the total number of sub-windows to be investigated is 105,963. Based on tests done in [1], an average of 10 features is evaluated per sub-window. As a result, the estimated number of decisions made over the cascade, for a 320 × 240 image, is 1.3 million on average, and thereafter around 10 million memory accesses (since each decision needs 6, 8, or 9 array references to calculate the feature in play). Note that the computation time of the decision (a linear combination of constants) as well as the time needed to build the integral image is negligible compared to the overall memory access time.


Considering the speed of the memory to be 10 nanoseconds per access (100 MHz), the time needed to process a full image is around 100 milliseconds (about 10 images per second). However, this rate can vary with the image's content. Nevertheless, this work has been performed several times [1–3] using a standard PC, and the obtained processing rate is around 10 images/s; the implementation is still not well suited for embedded applications and does not use parallelism.

4. Possible Parallelism

As shown in Section 3, the boosting-based face detector has a few drawbacks: first, the implementation still needs to be accelerated in order to achieve real-time detection, and second, the processing time for an image depends on its content.

4.1. Algorithm Analysis and Parallel Model. A degraded cascade of 10 stages is presented in [1]. It contains fewer than 400 features and achieves a detection rate between 75% and 85%. Another highly used and more complex cascade can be found in OpenCV [17] (discussed later in Section 5.2). This cascade includes more than 2000 features spread over 22 stages and achieves higher detection rates than the degraded version (between 80% and 92%) with fewer false positive detections. Analyzing those two cascades, one notices that about 35% of the memory accesses take place in each of the first two classifiers, while 30% take place in all the remaining stages, which leads us to suggest a new structure (shown in Figure 8) of 3 parallel blocks that work simultaneously: in the first two blocks we implement, respectively, the first and second stage classifiers, and a final block is assigned to run over all remaining stages sequentially.

Unlike the state-of-the-art software implementation, the proposed structure runs each stage as a standalone block. Nevertheless, some intermediate memories between the stages must be added in order to store the addresses of the positively labeled windows.

The new structure proposed above can increase the speed of the detector under one condition: since the computation complexity is relatively small and the processing time depends heavily on the memory accesses, an integral image memory should be available to each block in order to benefit from three simultaneous memory accesses. Figure 8 shows the proposed parallel structure. At the end of every full image processing cycle, the positive results from Block1 trigger the evaluation of Block2. The positive results from Block2 trigger the evaluation of Block3. And the positive results from Block3 are labeled as faces. It should be noted that the blocks cannot process the same image simultaneously; that is, if at a given moment Block2 is working on the current image (I1), then Block1 should be working on the next image (I2) and Block3 should be working on the previous image (I0). As mentioned in Section 3, the first classifier stage is slightly different from the others, since it should evaluate the whole image. Hence, a "shift-and-scale" model is needed. The positive results are stored in a memory (mem.1) and copied to another memory (mem.2) in order to be used in the second stage. The positive results of the second stage are stored in a memory (mem.3, duplicated in mem.4) in order to be used in the final block.

The final block is similar to the second, but it is designed to implement all the remaining stages. Once the processing of mem.4 is finished, Block3 works the same way as in the sequential implementation: the block runs back and forth through all the remaining stages, to finally give the addresses of the detected faces.

This can be translated into the model shown in Figure 9. A copy of the integral image is available to each block, and three pairs of logical memories work in ping-pong to accelerate the processing.

The given parallel model ought to run at the speed of its slowest block. As mentioned earlier, the first stage of the cascade requires more memory accesses, and therefore more processing time, than the second stage alone or all the remaining stages together. In the first classifier stage, all 105,963 sub-windows should be inspected using four features with eight array references each. Therefore, it requires about 3.4 million memory accesses per image. Using the same type of memory as in Section 3.3, an image needs roughly 34 milliseconds (29 images per second) of processing time.

4.2. Parallel Model Discussion. Normally the proposed structure should stay the same even if the cascade structure changes, since most boosting cascade structures have the same properties, at least as far as the first two cascade stages are concerned.

One of the major issues surrounding boosting-based detection algorithms (especially when applied to object detection in an unconstrained scene) is the inconsistent and unpredictable processing time; for example, a white image will always take little processing time, since no sub-window should be capable of passing the first stage of the cascade. In contrast, an image of a thumbnail gallery will take much more time.

This structure not only gives a gain in speed; the first stage also happens to be the only regular one in the cascade, with a fixed processing time per image. This means that we can mask the irregular part of the algorithm by fixing the detector's overall processing time.

As a result, the whole system will not work at 3 times the speed of the average sequential implementation, but a little less. However, theoretically both models should run at the same speed when encountering a homogeneous image (e.g., a white or black image). Further work in Section 5 will show that the embedded implementation can benefit from some system tweaks (pipelining and parallelism) within the computation that make the architecture even faster.

Due to the masking phenomenon in the parallel implementation, decreasing the number of weak classifiers can accelerate the implementation, but only if the first stage of the cascade is accelerated.

For this structure to be implemented effectively, its constraints must be taken into consideration. The memory, for instance, can be the most greedy and critical part; the model requires multiple memory accesses to be done simultaneously.


Figure 8: Parallel structure. (Block 1: 100% of sub-windows, 35% of total memory accesses, 4 to 8 features. Block 2: <50% of sub-windows, 35% of total memory accesses, 8 to 20 features. Block 3: <15% of sub-windows, 30% of total memory accesses, up to 2000 features.)

Figure 9: Data flow.

It is obvious that a generic architecture (a processor, a global memory, and a cache) will not be enough to manage up to seven simultaneous memory accesses on top of the processing without crashing its performance.

5. Architecture Definition: Modelling Using SystemC

Flexibility and target architecture are two major criteria for any implementation. First, a decision was taken to build our implementation using a high-level description model/language. Modelling at a high level of description leads to quicker simulation, better bandwidth estimation, and better functional validation, and above all it helps delay the system orientation and thereby the choice of hardware target.

5.1. SystemC Description. C++ implements object orientation on the C language. Many hardware engineers may consider that the principles of object orientation are fairly remote from the creation of hardware components. Nevertheless, object orientation was created from design techniques used in hardware designs. Data abstraction is the central aspect of object orientation, and it can be found in everyday hardware designs with the use of publicly visible "ports" and private "internal signals". Moreover, component instantiation found in hardware designs is almost identical to the principle of "composition" used in C++ for creating hierarchical designs. Hardware components can be modelled in C++, and to some extent, the mechanisms used are similar to those used in HDLs.


Figure 10: SystemC architecture implementation.

Additionally, C++ provides inheritance as a way to complement the composition mechanism and promote design reuse.

Nonetheless, C++ does not support concurrency, which is an essential aspect of systems modelling. Furthermore, timing and propagation delays cannot easily be expressed in C++.

SystemC [18] is a relatively new modeling language based on C++ for system-level design. It has been developed as a standardized modeling language for systems containing both hardware and software components.

The SystemC class library provides the necessary constructs to model system architecture, including reactive behaviour, scheduling policy, and hardware-like timing, none of which are available in standalone C/C++.

There are multiple advantages of using SystemC over classic hardware description languages such as VHDL and Verilog: flexibility, simplicity, simulation speed, and above all portability, to name a few.
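As a minimal illustration of these constructs (unrelated to our design), a SystemC module with public ports, a private internal signal, and a concurrent clocked process looks like this:

#include <systemc.h>

SC_MODULE(Accumulator) {
    sc_in<bool> clk;              // publicly visible ports
    sc_in<sc_uint<8> > din;
    sc_out<sc_uint<16> > sum;

    sc_signal<sc_uint<16> > acc;  // private internal signal

    SC_CTOR(Accumulator) {
        SC_METHOD(step);
        sensitive << clk.pos();   // runs concurrently with other processes
    }

    void step() {
        sc_uint<16> next = acc.read() + din.read();
        acc.write(next);          // accumulate the input each rising edge
        sum.write(next);
    }
};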

5.2. SystemC Implementation for Functional Validation and Verification. The SystemC approach consists of a progressive refinement of specifications. Therefore, a first implementation was done using an abstract high-level timed functional representation.

In this implementation, we used the proposed parallel structure discussed in Section 4.

This modeling consists of high-level SystemC modules (TLM) communicating with each other using channels, signals, or even memory-block modules written in SystemC (Figure 10). Scheduling and timing were used but were not explored for hardware-like purposes. The data types used in this modelling are strictly C++ data types.

As for the cascade/classifiers, we chose to use the database found in the Open Computer Vision Library [17] (OpenCV). OpenCV provides the most used trained cascade/classifier datasets and face-detection software (Haar-Detector) today for the standard prototype of the Viola-Jones algorithm. The particular classifiers used in this library are those trained for a base detection window of 24 × 24 pixels, using Adaboost.

Figure 11: SystemC functional validation flow.

These classifiers were created and trained by Lienhart et al. [19] for the detection of upright frontal faces. The detection rate of these classifiers is between 80% and 92%, depending on the image database.

The output of our implementation is the addresses of the sub-windows which contain, according to the detector, an object of the particular type (a face in our case). Functional validation is done by simulation (Figure 11). Multiple tests were then done, including visual comparisons on a dataset of images, visual simulation signals, and other tests that consist of comparing the response of each classifier with its correspondent implemented in OpenCV's Haar-Detector software. All of these tests indicate that we were able to achieve the same detection rate as the software provided by OpenCV. The images used in these tests were taken from the CMU+MIT face databases [20].

The choice of working with faces, instead of other object types, helps the comparison with other recent works. However, using this structure for other object-type detection is very feasible, on the condition of having a trained dataset of classifiers for the specific object. This can be considered a simple task, since OpenCV also provides the training software for the cascade detector. Moreover, classifiers from other variants of boosting can be implemented easily, since the structure is written in a high-level language. As a result, changing the boosting variant is considered a minor modification, since the architecture of the cascade detector stays intact.

5.3. Modelling for Embedded Implementation. While the previous SystemC modelling is very useful for functional validation, more optimization should be carried out in order to achieve a hardware implementation. Indeed, the SystemC standard is a system-level modelling environment which allows the design of systems at various abstraction levels. The design cycle starts with an abstract high-level untimed or timed functional representation that is refined to a bus-cycle-accurate model and then to an RTL (Register Transfer Level) hardware model. SystemC provides several data types, in addition to those of C++. However, these data types are mostly adapted for hardware specification.


Figure 12: The global architecture in SystemC modules.

Besides, a SystemC hardware model can be synthesizable for various target technologies. Numerous behavioural synthesis tools are available on the market for SystemC (e.g., Synopsys Cocentric compiler, Mentor Catapult, SystemCrafter, and AutoESL). It should be noted that, for all those available tools, it is necessary to refine the initial simulatable SystemC description in order to synthesize it into hardware. The reason behind this is that the SystemC language is a superset of C++ designed for simulation.

Therefore, a new, improved, and foremost more refined "cycle-accurate RTL model" version of the design implementation was created.

Our design is split into compilation units, each of which can be compiled separately. Alternatively, it is possible to use several tools for different parts of the design, or even to use the partitioning in order to explore most of the possible parallelism and pipelining for a more efficient hardware implementation. Eventually, the main block modules of the design were split into groups of small modules that work in parallel and/or in pipeline. For instance, the module BLOCK1 contains three compilation units (modules): a "Decision" module, which contains the first stage's classifiers and is used for computation and decision on each sub-window; a "Shift-and-Scale" module, used for shifting and scaling the window in order to obtain all subsequent locations; and finally a "Memory-Ctrl" module, which manages the intermediate memory accesses.

As a result, the SystemC model is composed of 11 modules (Figure 12): three for BLOCK1, two for BLOCK2, two for BLOCK3, one for the integral image transformation, two for the SRAM simulation, and one for the SDRAM intermediate memory (discussed later in this section).

Other major refinements were done: divisions were simplified into power-of-two divisions, the dataflow model was further refined to a SystemC/C++ combination of finite state machines and data paths, loops were exploited, and timing and scheduling were taken into consideration. Note that, in most cases, parallelism and pipelining were forced manually. On the other hand, not all the modules were heavily refined; for example, the two SRAM modules were used in order to simulate a physical memory, which will never be synthesized no matter what the target platform is.
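One such refinement, sketched here under the assumption that divisors were rounded to the nearest power of two, replaces a costly divider with a plain shift:

// Illustrative power-of-two division refinement: the shift synthesizes
// to simple wiring, whereas a generic divider is large and slow.
inline unsigned scaleDown(unsigned value) {
    // return value / 20;  // hypothetical original divisor
    return value >> 5;     // refined: divide by 32, a power of two
}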

5.4. Intermediate Memory. One of the drawbacks of the proposed parallel structure (given in Section 4) is the use of additional intermediate memories (unnecessary in the software implementation). Logically, an interblock memory unit is formed out of two memories working in ping-pong.

A stored address should hold the position of a particular sub-window and its scale; there is no need for two-dimensional positioning, since the integral image is created as a monodimensional table for better RAM storage.

For a 320 × 240 image and an initial mask size of 24 × 24, a word of 20 bits is enough to store the concatenation of the position and the scale of each sub-window.
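One plausible split of this 20-bit word, given only as an assumption for illustration (the paper states only that 20 bits suffice), uses 16 bits for a position index and 4 bits for the 11 scales:

#include <cstdint>

constexpr unsigned POS_BITS = 16;

// Pack and unpack a sub-window address: position index in the low bits,
// scale index in the high bits. Field widths are assumed, not specified.
uint32_t packAddress(uint32_t posIndex, uint32_t scaleIndex) {
    return (scaleIndex << POS_BITS) | (posIndex & 0xFFFFu);
}
uint32_t posOf(uint32_t word)   { return word & 0xFFFFu; }
uint32_t scaleOf(uint32_t word) { return word >> POS_BITS; }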

As for the capacity of the memories, a worst-case scenario occurs when half of the possible sub-windows manage to pass through the first block. That leads to around 50,000 (50% of the sub-windows) addresses to store. Using the same logic on the next block, the total number of addresses to store should not exceed 75,000. Eventually, a combined memory capacity of less than 192 Kbytes is needed.

Moreover, the simulation of our SystemC model shows that even when facing a case of consecutive positive decisions for a series of sub-windows, accesses to those memories will not occur more than once every 28 cycles (in the case of mem.1 and mem.2), or once every 64 cycles (in the case of mem.3 and mem.4).

Due to these facts, we propose a timesharing system (shown in Figure 13) using four memory banks, working as a FIFO block, with only one physical memory. A typical hardware implementation of a 192-Kbyte SDRAM or DDRAM memory, running at a frequency at least 4 times that of the FIFO banks, is necessary to replace the four logical memories.

SystemC simulation shows that 4 Kbits is enough for each memory bank. The FIFOs are easily added using SystemC's own predefined sc_fifo module.
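A minimal sketch of the four banks using sc_fifo might read as follows; the module name and the depth of about 200 words (4 Kbits of 20-bit addresses) are assumptions based on the figures above:

#include <systemc.h>

SC_MODULE(InterBlockMemory) {
    // Four logical banks of Figure 13, each holding roughly 4 Kbits
    // of 20-bit sub-window addresses.
    sc_fifo<sc_uint<20> > bank1, bank2, bank3, bank4;

    SC_CTOR(InterBlockMemory)
        : bank1(200), bank2(200), bank3(200), bank4(200) {}
};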

6. Hardware Implementation and Experimental Performances

6.1. Hardware Implementation. A SystemC hardware model can be synthesizable for various target technologies. However, no synthesizer is capable of producing efficient hardware from a SystemC program written for simulation. An automatic synthesis tool can produce fast and efficient hardware only if the entry code accommodates certain difficult requirements, such as using hardware-like development methods. Therefore, the synthesis results depend heavily on the entry code, the tool itself, and the different levels of refinement done. Figure 14 shows the two different kinds of refinements needed to achieve a successful and fast implementation using a high-level description language.


Figure 13: Intermediate memories structure.

Figure 14: SystemC to hardware implementation development flow.

The first type of refinement is the one required by the tool itself. Without it, the tool is not capable of compiling the SystemC code to the RTL level. Even so, those refinements do not lead directly to a good, proven implementation. Another type of refinement should take place in order to optimize the size, the speed, and sometimes (depending on the used tool) the power consumption.

For our design, several refinement versions were done on different modules depending on their initial speed and usability.

The SystemC scheduler uses the same behavior for software simulation as for hardware simulation. This works to our advantage, since it gives the possibility of choosing which of the modules are to be synthesized, while the rest serves as a SystemC test bench for the design.

Our synthesis phase was performed using an automatic tool named SystemCrafter, which is a SystemC synthesis tool that targets Xilinx FPGAs.

Table 1: The synthesis results of the component implementations.

Integral Image:
Number of occupied slices: 913 / 10752 (8%)
Number of slice flip-flops: 300 / 21504 (1%)
Number of 4-input LUTs: 1761 / 21504 (8%)
Number of DSPs: 2 / 48 (4%)
Maximum frequency: 129 MHz

BLOCK 1:
Number of occupied slices: 1281 / 10752 (12%)
Number of slice flip-flops: 626 / 21504 (3%)
Number of 4-input LUTs: 2360 / 21504 (11%)
Number of DSPs: 1 / 48 (2%)
Maximum frequency: 47 MHz

BLOCK 2:
Number of occupied slices: 3624 / 10752 (34%)
Number of slice flip-flops: 801 / 21504 (4%)
Number of 4-input LUTs: 7042 / 21504 (33%)
Number of DSPs: 3 / 48 (6%)
Maximum frequency: 42 MHz

It should be noted that the SystemC entry code used can be described as VHDL-like synchronous and pipelined C code (bit accurate): most parallelism and pipelining within the design were implemented manually using different processes, threads, and state machines. SystemC data types were used in order to minimize the implementation size. Loops were exploited, and timing as well as variable lengths was always a big factor.

Using SystemCrafter, multiple VHDL components are generated and can easily be added or merged into/with other VHDL components (notably the FIFO modules). As for the testbench set, the description was kept in high-level abstraction SystemC for faster prototyping and simulation.

Basically, our implementation brings together three major components: the integral image module, the first stage decision module, and the second stage decision module (Block 3 of the structure is yet to be implemented). Other components such as the memory controllers and FIFO modules are also implemented but are trifling when compared to the big three.

Each of these components was implemented separately in order to analyze its performance. In each case, multiple graphic simulations were carried out to verify that the outputs of both descriptions (SystemC's and VHDL's) are identical.

6.2. Performances. The Xilinx Virtex-4 XC4VLX25 was selected as the target FPGA. The VHDL model was back-annotated using the Xilinx ISE. The synthesis results of the design implementation for each of the components are given in Table 1.

The synthesis results for the whole design (BLOCK1, BLOCK2, and the integral image combined) are given in Table 2.


Table 2: Synthesis results of the entire design implementation.

  Number of occupied slices:  5941 / 10752 (55%)
  Number of slice flip-flops: 1738 / 21504  (8%)
  Number of 4-input LUTs:    11418 / 21504 (53%)
  Number of DSPs:                6 / 48    (13%)
  Maximum frequency:           42 MHz

Table 3: Synthesis results of the decision module implementations.

BLOCK 1 Decision
  Number of occupied slices:  1281 / 10752 (12%)
  Number of slice flip-flops:  626 / 21504  (3%)
  Number of 4-input LUTs:     2360 / 21504 (11%)
  Number of DSPs:                1 / 48     (2%)
  Maximum frequency:           47 MHz

BLOCK 2 Decision
  Number of occupied slices:  3624 / 10752 (34%)
  Number of slice flip-flops:  801 / 21504  (4%)
  Number of 4-input LUTs:     7042 / 21504 (33%)
  Number of DSPs:                3 / 48     (6%)
  Maximum frequency:           42 MHz

The clock rate of the design cannot exceed that of its slowest component (BLOCK2), so the design is capable of running at a frequency of 42 MHz. In the first block, a decision is taken on a sub-window every 28 clock cycles. Hence, this system is capable of achieving only up to 15 frames per second on 320 × 240 images.
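These figures are mutually consistent (our arithmetic; the per-frame sub-window count N_sw is inferred here, not reported): with one decision every 28 cycles, the frame rate scales as f_clk / (28 × N_sw), so 42 × 10^6 / (28 × 15) gives N_sw of roughly 10^5 sub-windows per 320 × 240 frame, and the 123 MHz implementation reached later then predicts 123 × 10^6 / (28 × 10^5), about 44, in line with the 42 frames per second reported below.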

Accelerating BLOCK1 and BLOCK2 is essential in order to achieve a higher detection speed. BLOCK1 includes three important modules: the "Decision" module, the "Shift-and-Scale" module, and the "Memory ctrl" module; BLOCK2 includes only a "Decision" module and a "Memory ctrl" module. The decision modules, however, use division and multiplication operators, which are costly in clock frequency. Therefore, the "Decision" module of each of these two components was synthesized alone, and their synthesis results are shown in Table 3.

As expected, the "Decision" modules in both BLOCK1 and BLOCK2 are what holds the implementation at a low frequency.

Analyzing the automatically generated VHDL code shows that, despite all the refinement already done, the SystemCrafter synthesis tool still produces much more complex RTL code than is essentially needed. In particular, when arrays are used in loops, the tool creates a register for each value and then wires it into all possible outputs. Things get worse when trying to update all the array elements within one clock cycle, a scenario which occurs regularly in our design, for example, when updating classifier parameters after a shift or a scale. Simulation tests proved that these manipulations can greatly slow down the design frequency.

Table 4: Synthesis results for the new, improved decision modules.

BLOCK 1 Decision
  Number of occupied slices:   713 / 10752  (7%)
  Number of slice flip-flops:  293 / 21504  (1%)
  Number of 4-input LUTs:     1091 / 21504  (5%)
  Number of DSPs:                1 / 48     (2%)
  Maximum frequency:          127 MHz

BLOCK 2 Decision
  Number of occupied slices:  2582 / 10752 (24%)
  Number of slice flip-flops:  411 / 21504  (2%)
  Number of 4-input LUTs:     5082 / 21504 (24%)
  Number of DSPs:                3 / 48     (6%)
  Maximum frequency:          123 MHz

Table 5: Synthesis results of the refined implementation of the entire design.

  Number of occupied slices:  4611 / 10752 (43%)
  Number of slice flip-flops: 1069 / 21504  (5%)
  Number of 4-input LUTs:     8527 / 21504 (40%)
  Number of DSPs:                6 / 48    (13%)
  Maximum frequency:          123 MHz

Therefore, further refinement was applied to the "Decision" SystemC modules. For instance, array updates were split across clock cycles, updating a single array element per cycle in such a way that no additional clock cycles are lost.

The synthesis results for the new, more refined decision modules are shown in Table 4. The refinements yield a faster, lighter, and more efficient implementation of the two modules. A new full-system implementation was made by inserting the new "Decision" modules; its results and performance are shown in Table 5. The FPGA can now operate at a clock speed of 123 MHz. Using the same reasoning as before, a decision is taken on a sub-window every 28 clock cycles; therefore the new design can achieve up to 42 frames per second on 320 × 240 images.

The simulation tests used in Section 5.2 for the functional validation of the SystemC code were repeated on the VHDL code mixed with a high-level test bench (the same SystemC test bench used for the SystemC validation model). The outputs of the VHDL code were compared to the outputs of OpenCV's implementation after the first two classification stages. These tests prove that we achieve the same detection results as the software provided by OpenCV. The design could run at an even faster pace if further refinements and hardware considerations were applied. However, it should be noted that different SystemC synthesis tools can yield different results; after all, the amount and effectiveness of the refinements depend largely on the tool itself.


Further optimization is possible by replacing some of the VHDL code autogenerated by the Crafter with manually optimized code.

7. Conclusion

In this paper, we proposed a new architecture for an embedded real-time object and face detector based on the fast and robust family of methods initiated by Viola and Jones [1].

First, we built a sequential structure model, which turned out to be irregular in its processing time. We estimate that a sequential implementation of a degraded cascade detector can achieve on average 10 frames per second.

Then a new parallel structure model was introduced. This structure proved to be at least 2.9 times faster than the sequential one and provides regular processing time.

The design was validated using SystemC. Simulation and hardware synthesis show that such an algorithm fits easily into an FPGA chip while achieving state-of-the-art performance in both frame rate and accuracy.

The hardware target used for the validation is an FPGA-based board connected to a PC through a USB 2.0 port. The use of a SystemC description enables the design to be easily retargeted to different technologies. The implementation of our SystemC model on a Xilinx Virtex-4 achieves a theoretical detection rate of 42 frames per second on 320 × 240 images.

We showed that a SystemC description is not only valuable for exploring and validating a complex architecture; it can also be very useful for detecting bottlenecks in the dataflow and for accelerating the architecture by exploiting parallelism and pipelining. Eventually, thanks to synthesis tools, it can lead to an embedded implementation that achieves state-of-the-art performance. More importantly, it helps develop a flexible design that can be migrated to a wide variety of technologies.

However, experiments have shown that refinements made to the SystemC entry code add up to substantial reductions in size and total execution time, even though the extent and effectiveness of these optimizations are largely attributable to the SystemC synthesis tool itself and to the designer's hardware knowledge and experience. Therefore, one intriguing perspective is the exploration of this design with other tools for comparison purposes.

Accelerating the first stage leads directly to a whole-system acceleration. In the future, our description could be used as part of a more complex process integrated in an SoC. We are currently exploring the possibility of a hardware/software solution by prototyping a platform based on a Wildcard [21]. Recently, we had successful experiences implementing similar solutions to accelerate Fourier descriptors for object recognition using SVM [22] and motion estimation for MPEG-4 coding [23]. For example, the integral image block as well as the first and second stages can be executed in hardware on the Wildcard, while the rest can be implemented in software on a dual-core processor.

References

[1] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '01), vol. 1, pp. 511–518, 2001.

[2] S. Li, L. Zhu, Z. Q. Zhang, A. Blake, H. J. Zhang, and H. Shum, "Statistical learning of multi-view face detection," in Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, May 2002.

[3] J. Sochman and J. Matas, "AdaBoost with totally corrective updates for fast face detection," in Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 445–450, 2004.

[4] P. Viola, M. J. Jones, and D. Snow, "Detecting pedestrians using patterns of motion and appearance," in Proceedings of the IEEE International Conference on Computer Vision (ICCV '03), vol. 2, pp. 734–741, October 2003.

[5] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Proceedings of the European Conference on Computational Learning Theory (EuroCOLT '95), pp. 23–37, 1995.

[6] J. Kivinen and M. K. Warmuth, "Boosting as entropy projection," in Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99), pp. 134–144, ACM, Santa Cruz, Calif, USA, July 1999.

[7] P. Pudil, J. Novovicova, and J. Kittler, "Floating search methods in feature selection," Pattern Recognition Letters, vol. 15, no. 11, pp. 1119–1125, 1994.

[8] J. Sochman and J. Matas, "WaldBoost: learning for time constrained sequential detection," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), vol. 2, San Diego, Calif, USA, June 2005.

[9] M. Reuvers, Face detection on the INCA+ system, M.S. thesis, University of Amsterdam, 2004.

[10] V. Nair, P. O. Laprise, and J. J. Clark, "An FPGA-based people detection system," EURASIP Journal on Applied Signal Processing, no. 7, pp. 1047–1061, 2007.

[11] Y. Wei, X. Bing, and C. Chareonsak, "FPGA implementation of AdaBoost algorithm for detection of face biometrics," in Proceedings of the IEEE International Workshop on Biomedical Circuits and Systems, 2004.

[12] T. Theocharides, N. Vijaykrishnan, and M. J. Irwin, "A parallel architecture for hardware face detection," in Proceedings of the IEEE Computer Society Annual Symposium on Emerging Technologies and Architectures (VLSI '06), 2006.

[13] M. Yang, Y. Wu, J. Crenshaw, B. Augustine, and R. Mareachen, "Face detection for automatic exposure control in handheld camera," in Proceedings of the 4th IEEE International Conference on Computer Vision Systems (ICVS '06), 2006.

[14] H.-C. Lai, R. Marculescu, M. Savvides, and T. Chen, "Communication-aware face detection using NoC architecture," in Proceedings of the 6th International Conference on Computer Vision Systems (ICVS '08), vol. 5008 of Lecture Notes in Computer Science, pp. 181–189, 2008.

[15] SystemCrafter, http://www.systemcrafter.com/.

[16] C. Papageorgiou, M. Oren, and T. Poggio, "A general framework for object detection," in Proceedings of the International Conference on Computer Vision, 1998.

[17] Open Source Computer Vision Library, February 2009,http://sourceforge.net/projects/opencvlibrary/.

[18] S. Swan, An Introduction to System Level Modeling in SystemC 2.0, Cadence Design Systems, Inc., 2001.


[19] R. Lienhart, A. Kuranov, and V. Pisarevsky, "Empirical analysis of detection cascades of boosted classifiers for rapid object detection," in Proceedings of the 25th Pattern Recognition Symposium (DAGM '03), pp. 297–304, 2003.

[20] H. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 22–38, 1998.

[21] Annapolis Microsystems Inc., Annapolis WILDCARD System Reference Manual, Revision 2.6, 2003, http://www.annapmicro.com/.

[22] F. Smach, J. Miteran, M. Atri, J. Dubois, M. Abid, and J.-P. Gauthier, "An FPGA-based accelerator for Fourier descriptors computing for color object recognition using SVM," Journal of Real-Time Image Processing, vol. 2, no. 4, pp. 249–258, 2007.

[23] J. Dubois, M. Mattavelli, L. Pierrefeu, and J. Miteran, "Configurable motion-estimation hardware accelerator module for the MPEG-4 reference hardware description platform," in Proceedings of the International Conference on Image Processing (ICIP '05), vol. 3, Genova, Italy, 2005.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 479281, 16 pages
doi:10.1155/2009/479281

Research Article

Very Low-Memory Wavelet Compression Architecture Using Strip-Based Processing for Implementation in Wireless Sensor Networks

Li Wern Chew, Wai Chong Chia, Li-minn Ang, and Kah Phooi Seng

Department of Electrical and Electronic Engineering, The University of Nottingham, 43500 Selangor, Malaysia

Correspondence should be addressed to Li Wern Chew, [email protected]

Received 4 March 2009; Accepted 9 September 2009

Recommended by Bertrand Granado

This paper presents a very low-memory wavelet compression architecture for implementation in severely constrained hardware environments such as wireless sensor networks (WSNs). The approach employs a strip-based processing technique where an image is partitioned into strips and each strip is encoded separately. To further reduce the memory requirements, the wavelet compression uses a modified set-partitioning in hierarchical trees (SPIHT) algorithm based on a degree-0 zerotree coding scheme to give high compression performance without the need for adaptive arithmetic coding, which would require additional storage for multiple coding tables. A new one-dimension (1D) addressing method is proposed to store the wavelet coefficients in the strip buffer for ease of coding. A softcore microprocessor-based hardware implementation on a field programmable gate array (FPGA) is presented for verifying the strip-based wavelet compression architecture, and software simulations are presented to verify the performance of the degree-0 zerotree coding scheme.

Copyright © 2009 Li Wern Chew et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

The capability of having multiple sensing devices that communicate over a wireless channel and perform data processing and computation at the sensor nodes has brought wireless sensor networks (WSNs) into a wide range of applications such as environmental monitoring, habitat studies, object tracking, video surveillance, satellite imaging, as well as military applications [1–4]. For applications such as object tracking and video surveillance, it is desirable to compress the image data captured by the sensor nodes before transmission because of limitations in power supply, memory storage, and transmission bandwidth in the WSN [1, 2]. For image compression in WSNs, it is desirable to maintain a high compression ratio while, at the same time, providing a low-memory and low-complexity implementation of the image coder.

Among the many image compression algorithms, wavelet-based image compression based on set-partitioning in hierarchical trees (SPIHT) [5] is a powerful, efficient, and yet computationally simple image compression algorithm. It

provides better performance than the embedded zerotree wavelet (EZW) algorithm [6]. Although the embedded block coding with optimized truncation (EBCOT) algorithm [7], which was adopted in the Joint Photographic Experts Group 2000 (JPEG 2000) standard, provides higher compression efficiency than SPIHT, its multilayer coding procedures are very complex and computationally intensive. Also, the need for multiple coding tables for adaptive arithmetic coding requires extra memory allocation, which makes the hardware implementation of the coder more complex and expensive [7–9]. Thus, from the viewpoint of hardware implementation, SPIHT is preferred over EBCOT coding.

In the traditional SPIHT coding, the full wavelet-transformed image has to be stored, because all the zerotrees are scanned in each pass for different magnitude intervals during the set-partitioning operation [5, 10]. The memory needed to store these wavelet coefficients increases as the image resolution increases. This in turn increases the cost of hardware image coders, as a large internal or external memory bank is needed. This issue is also faced in the implementation


of on-board satellite image coders, where the available memory space is limited due to power constraints [11].

By adopting the modifications in the implementation of the discrete wavelet transform (DWT) whereby the wavelet transformation can be carried out without the need for a full image transformation [12], SPIHT image coding can also be performed on a portion of the wavelet subbands. The strip-based image coding technique proposed in [10], which sequentially performs SPIHT coding on a portion of an image, has contributed significant improvements to low-memory implementations of image compression. In strip-based coding, a few image lines acquired in raster-scan format are first wavelet transformed. The computed wavelet coefficients are then stored in a strip buffer for SPIHT coding. At the end of the image coding, the strip buffer is released and is ready for the next set of data lines.

In this paper, a hardware architecture for strip-based image compression using the SPIHT algorithm is presented. The lifting-based 5/3 DWT, which supports a lossless transformation, is used in our proposed work. The wavelet coefficients output by the DWT module are stored in a strip buffer at predefined locations using a new one-dimension (1D) addressing method for SPIHT coding. In addition, a proposed modification of the traditional SPIHT algorithm is presented. In order to improve the coding performance, a degree-0 zerotree coding methodology is applied during the implementation of SPIHT coding. To facilitate the hardware implementation, the proposed SPIHT coding eliminates the use of lists in its set-partitioning approach and is implemented in two passes. The proposed modification reduces both the memory requirement and the complexity of the hardware coder. Besides this, the fast zerotree identifying technique proposed in [13] is also incorporated in our proposed work. Since the recursion of descendant information checking is no longer needed, the processing time of SPIHT coding is significantly reduced.

The remaining sections of this paper are organized as follows. Section 2 presents an overview of strip-based coding, followed by a discussion of the lifting-based 5/3 DWT and its hardware implementation in Section 3. Section 4 presents the proposed 1D addressing method for DWT coefficients in a strip buffer and the new spatial orientation tree structures incorporated into our proposed strip-based coding. Next, the proposed modifications to traditional SPIHT coding to improve its compression efficiency, and the hardware implementation of the proposed algorithm in strip-based coding, are presented in Section 5. The proposed work is implemented using our own microprocessor without interlocked pipeline stages (MIPS) processor on a Xilinx Spartan III field programmable gate array (FPGA) device, and the results of software simulations are discussed in Section 6. Finally, Section 7 concludes this paper.

2. Strip-Based Image Coding

Traditional wavelet-based image compression techniques first apply a DWT to a full-resolution image. The computed N-scale decomposition wavelet coefficients that provide a

Figure 1: SPIHT coding is carried out on the part of the wavelet coefficients of the full-image wavelet transform that is stored in the strip buffer.

compact multiresolution representation of the image are then obtained and stored in a memory bank. Entropy coding is subsequently carried out to achieve compression. This type of coding technique, which requires the whole image to be stored in memory, is not suitable for processing large images, especially in a hardware-constrained environment where a limited amount of memory is available.

A low-memory wavelet transform that computes the wavelet subband coefficients on a line-by-line basis has been proposed in [12]. This method reduces the amount of memory required for the wavelet transform process. While the wavelet transformation of the image data can be processed in a line-based manner, the computed full-resolution wavelet coefficients still have to be stored for SPIHT coding, because all the zerotrees are scanned in each pass for different magnitude intervals [5, 10].

The strip-based coding technique proposed in [10], which adopts the line-based wavelet transform [12], has resulted in great improvements in low-memory implementations of SPIHT compression. In strip-based coding, SPIHT coding is sequentially performed on a few lines of wavelet coefficients stored in a strip buffer, as shown in Figure 1. Once the coding is done for a strip buffer, the buffer is released and is ready for the next set of data lines. Since only a portion of the full wavelet decomposition subband is encoded at a time, there is no need to wait for the full transformation of the image; coding can be performed once a strip is fully buffered. This enables the coding to be carried out rapidly and also significantly reduces the memory storage needed for SPIHT coding.

Figure 2 shows the block diagram of the strip-based image compression presented in this paper. A few lines of image data are first loaded into the DWT module (DWT_MODULE) for wavelet transformation. The wavelet coefficients are computed and then stored in a strip buffer (STRIP_BUFFER) for SPIHT encoding (SPIHT_ENCODE). At the end of encoding, the generated bit-stream is transmitted as the output. In the next few sections, the detailed function and hardware architecture of each of these blocks are presented.

3. Discrete Wavelet Transform

The DWT is the mathematical core of the wavelet-based image compression scheme. A traditional two-dimension (2D)


Figure 2: Block diagram of the proposed strip-based image compression: 16 × 512 strips of the 512 × 512 original image pass through the DWT_MODULE into the STRIP_BUFFER, and SPIHT_ENCODE outputs the bit-stream.

Figure 3: 2D wavelet decomposition: (a) one-scale DWT decomposition into the LL, HL, LH, and HH subbands; (b) three-scale DWT decomposition, where a two-scale decomposition is further carried out on the LL subband.

DWT first performs row filtering on an image, followed by column filtering. This gives rise to four wavelet subband decompositions, as shown in Figure 3(a). The Low-Low (LL) subband contains the low-frequency content of the image in both the horizontal and the vertical dimensions. The High-Low (HL) subband contains the high-frequency content in the horizontal and the low-frequency content in the vertical dimension. The Low-High (LH) subband contains the low-frequency content in the horizontal and the high-frequency content in the vertical dimension. Finally, the High-High (HH) subband contains the high-frequency content in both the horizontal and the vertical dimensions.

Each of the wavelet coefficients in the LL, HL, LH, and HH subbands represents a spatial area corresponding to approximately a 2 × 2 area of the original image [6]. For an N-scale DWT decomposition, the coarsest subband LL is further decomposed. Figure 3(b) shows the subbands obtained for a three-scale wavelet decomposition. As a result, each coefficient at a coarser scale represents a larger spatial area of the image but a narrower band of frequencies [6].

The two approaches used to perform the DWT are the convolution-based filter bank method and the lifting-based filtering method. Of the two, the lifting-based DWT is preferred over the convolution-based DWT for hardware implementation due to its simple and fast

lifting process. Besides this, it also requires a less complicated inverse wavelet transform [14–18].

3.1. Lifting-Based 5/3 DWT. The reversible Le Gall 5/3 filter is selected in our proposed work since it provides a lossless transformation. In the implementation of the lifting-based 5/3 DWT [18–20], three computation operations are needed: addition, subtraction, and shift. As shown in Figure 4, the lifting process is built on split, prediction, and updating steps. The input sequence X[n] is first split into odd and even components for the horizontal filtering process. In the prediction phase, high-pass filtering is applied to the input signal, which results in the generation of the detail coefficient H[n]. In the updating phase, low-pass filtering is applied to the input signal, which leads to the generation of the approximation coefficient L[n]. Likewise, for the vertical filtering, the split, prediction, and updating processes are repeated for both the H[n] and L[n] coefficients. Equations (1) and (2) give the lifting implementation of the 5/3 DWT filter used in JPEG 2000 [19]:

H[2n + 1] = X[2n + 1] − ⌊(X[2n] + X[2n + 2]) / 2⌋,  (1)

L[2n] = X[2n] + ⌊(H[2n − 1] + H[2n + 1] + 2) / 4⌋.  (2)
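As a concrete illustration, the following is a minimal C++ sketch (our own, not the paper's code) of a 1-D forward lifting pass implementing (1) and (2) with integer-only arithmetic, using the reflection-based symmetric extension described in Section 3.3 at the strip edges; it assumes an even-length input and an arithmetic right shift for the floor operations.

```
#include <vector>

// One 1-D lifting pass of the Le Gall 5/3 filter, per (1) and (2).
// x: input samples; H: detail (high-pass); L: approximation (low-pass).
void lift53_forward(const std::vector<int>& x,
                    std::vector<int>& H, std::vector<int>& L) {
    int n = (int)x.size();                 // assumed even
    H.assign(n / 2, 0);
    L.assign(n / 2, 0);
    // Prediction step, (1): H[i] pairs with odd sample x[2i+1]
    for (int i = 0; i < n / 2; ++i) {
        int left  = x[2 * i];
        int right = (2 * i + 2 < n) ? x[2 * i + 2]
                                    : x[2 * i];   // reflect at right edge
        H[i] = x[2 * i + 1] - ((left + right) >> 1);
    }
    // Update step, (2): L[i] pairs with even sample x[2i]
    for (int i = 0; i < n / 2; ++i) {
        int hl = (i > 0) ? H[i - 1] : H[0];       // reflect at left edge
        L[i] = x[2 * i] + ((hl + H[i] + 2) >> 2);
    }
}
```

Row filtering applies this pass to each image row; column filtering then repeats it on the H and L outputs, yielding the HH, HL, LH, and LL subbands.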


Figure 4: Implementation of the lifting-based 5/3 DWT filter: the input X[n] is split into odd and even samples; prediction (high-pass filtering, built from a subtractor, an adder, and a >>1 shifter) produces H[2n + 1], and updating (low-pass filtering, built from adders and a >>2 shifter) produces L[2n]; the same split, predict, and update stages are repeated in column filtering to give the LL, HL, LH, and HH subbands.

Figure 5: Architecture of the DWT_MODULE: image pixels pass through high-pass (odd) and low-pass (even) row filters into the TEMP_BUFFER, then through the column filters; a new address calculating unit maps each original pixel location to its location in the STRIP_BUFFER, which holds the HH1/HL1/LH1, HH2/HL2/LH2, ..., LLn coefficients.

3.2. Architecture for DWT_MODULE. In our proposed work, a four-scale DWT decomposition is applied to an image of size 512 × 512 pixels. For a four-scale DWT decomposition, the number of lines generated at the first scale when one line is generated at the fourth scale is equal to eight. This means that the lowest memory requirement we can achieve at each subband with a four-scale wavelet decomposition is eight lines. Since each wavelet coefficient represents a spatial area corresponding to approximately a 2 × 2 area of the original image, the number of image rows that needs to be fed into the DWT_MODULE is equal to 16. Equation (3) shows the relationship between the number of image rows needed for strip-based coding, R_image, and the level of DWT decomposition to be performed, N:

R_image = 2^N. (3)

Figure 5 shows our proposed architecture for the DWT_MODULE. In the initial stage (where N = 1), image data are read into the DWT_MODULE in row-by-row order from the external memory where the image data are kept. Row filtering is then performed on the image row, and the filtered coefficients are stored in a temporary buffer (TEMP_BUFFER). As soon as four lines of row-filtered coefficients are available, column filtering is carried out. The size of TEMP_BUFFER is four lines multiplied by the width of the image. The filtered DWT coefficients HH, HL, LH, and LL are then stored in the STRIP_BUFFER. For an N-scale DWT decomposition where N > 1, the LL coefficients generated at stage (N − 1) are loaded from the STRIP_BUFFER back into the TEMP_BUFFER, and an N-scale DWT decomposition is performed on these LL coefficients. Similarly, the DWT coefficients generated at


Figure 6: Symmetric extension in strip-based coding: (a) full image, where H[15] = X[15] − ⌊(X[14] + X[16])/2⌋; (b) strip image, where symmetric extension using reflection gives H[15] = X[15] − ⌊(X[14] + X[14])/2⌋.

the N-th level are then stored back into the STRIP_BUFFER. The location at which each wavelet coefficient is stored in the STRIP_BUFFER is provided by the new address calculation unit, which will be discussed in Section 4.

3.3. Symmetric Extension in Strip-Based Coding. From (1) and (2), it can be seen that to calculate the wavelet coefficient at position (2n + 1), the coefficients at (2n) and (2n + 2) are also needed. For example, to perform column filtering at image row 15, image rows 14 and 16 are needed, as shown in Figure 6(a). However, in our proposed strip-based coding, only a strip image of 16 rows is available at a time. Thus, during the strip-based 5/3 transformation, a symmetric extension using the reflection method is applied at the edges of the strip image data, as shown in Figure 6(b). Compared to the traditional 2D DWT, which performs the wavelet transformation on a full image, the wavelet coefficients output from the DWT_MODULE are expected to differ slightly because of the symmetric extension. Analysis from our study shows that the percentage error in the wavelet coefficient values is not significant, since only an average difference of 0.81% is observed.

It should be noted that strip-based filtering can also support the traditional full DWT if the number of image lines for strip-based filtering is increased from 16 to 24. This is because each wavelet coefficient at scale N requires one extra line of wavelet coefficients at scale N − 1 for the 5/3 DWT; thus, for a four-scale DWT, a total of eight additional image lines is required. This approach is applied in the strip-based coding proposed in [10], which uses the line-based DWT implementation proposed in [12]. However, in order to achieve a low-memory implementation of the image coder, the work described in this paper applies the reflection method for symmetric extension in the DWT implementation.

4. Architecture for STRIP_BUFFER

The wavelet coefficients generated by the DWT_MODULE are stored in the STRIP_BUFFER for SPIHT coding. The number of memory lines needed in the STRIP_BUFFER is equal to two times the lowest memory requirement achievable at each subband. Therefore, the size of the strip buffer is equal to the number of image rows needed for strip-based coding multiplied by the number of pixels in each row. Equation (4) gives the size of the strip buffer:

Size of strip buffer = R_image × Width of image. (4)
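Plugging in the values used in this paper (our arithmetic): with R_image = 16 rows and a width of 512 pixels, the strip buffer holds 16 × 512 = 8192 coefficients, compared with the 512 × 512 = 262,144 coefficients a full-frame buffer would require, a 32-fold reduction in coefficient storage.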

4.1. Memory Allocation of DWT Coefficients in the STRIP_BUFFER. To facilitate the SPIHT coding, the DWT coefficients obtained from the DWT_MODULE are stored in the strip buffer at predetermined locations. Figure 7 shows a memory allocation example of the DWT coefficients in the STRIP_BUFFER. The parent-children relationship of the SPIHT spatial orientation tree (SOT) structure, using the example of an 8 × 8 three-scale DWT-decomposed image, is shown in Figure 7(a). For hardware implementation, the 2D data need to be stored in a 1D format, as shown in Figure 7(b). The researchers in [9] introduced an addressing method to rearrange the DWT coefficients in a 1D array format for practical implementation. However, their method works only on the pyramidal structure of DWT decomposition shown in Figure 7(a).

In our proposed work, the initial collection of DWT coefficients is a mixture of LL, HL, LH, and HH components, as shown in Figures 7(c)–7(e). In addition, to simplify the proposed modified SPIHT coding explained in the next section, it is preferred that the DWT coefficients in the strip buffer be stored at predetermined locations, as shown in Figure 7(b). For these two reasons, a new address calculating unit is needed in the DWT_MODULE.

Table 1 records the predefined rules used to calculate the new addresses of the DWT coefficients in the STRIP_BUFFER. The DWT coefficients in the STRIP_BUFFER are arranged in such a manner that each parent node has its four direct offspring in consecutive order. Besides this, it can be seen from Table 1 that the proposed new address calculation circuit requires only address rewiring and therefore does not increase the hardware complexity of our proposed implementation.
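The "rewiring" nature of the mapping can be sketched in a few lines of C++ (our own illustration; perm below is a made-up permutation, while the actual per-scale bit orders are those of Table 1): the new STRIP_BUFFER address is formed purely by permuting the bits of the incoming address, so the unit needs no arithmetic logic, only wires.

```
#include <cstdint>

// perm[k] gives the source bit index that lands in bit k of the new
// address; in hardware this is literally one wire per bit.
uint32_t rewire_address(uint32_t addr, const int perm[], int nbits) {
    uint32_t out = 0;
    for (int k = 0; k < nbits; ++k)
        out |= ((addr >> perm[k]) & 1u) << k;
    return out;
}
```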


Figure 7: Memory allocation of DWT coefficients in the STRIP_BUFFER: (a) 2-D DWT arrangement (scale 3) with the SPIHT parent-children relationship; (b) final 1-D DWT arrangement in the STRIP_BUFFER, where each parent is followed by its four direct offspring; (c1)-(c2), (d1)-(d2), (e1)-(e2) the 2-D and corresponding 1-D DWT arrangements at scales 1, 2, and 3.

4.2. New Spatial Orientation Tree Structure. In our proposed strip-based coding, a four-scale DWT decomposition is performed on a strip image of 16 rows. Thus, at the highest LL, HL, LH, and HH subbands, a single line of 32 DWT coefficients is available per subband.

Since each node in the original SPIHT SOT has 2 × 2 adjacent pixels of the same spatial orientation as its descendants, the traditional SPIHT SOT is not suitable for our proposed work. The strip-based SPIHT algorithm proposed in [10] is implemented with zerotree roots starting from the HL, LH, and HH subbands. Although this method could be used in our proposed work, a lower performance of the strip-based SPIHT coding would be expected: when the number of SOTs is increased, many encoding bits are wasted, especially at low bit-rates, where most of the coefficients are zeros [10, 21].

In [21], new SOT structures were introduced which, for certain subbands, take the next four pixels of the same row as a node's children. The proposed new tree structures are named

SOT-B, SOT-C, SOT-D, and so on, depending on the number of scales over which the parent-children relationship is changed. In that work, the virtual SPIHT (VSPIHT) algorithm [22] is applied in conjunction with the new SOTs. In VSPIHT coding, the real LL subband is replaced with zero-valued coefficients and is further virtually decomposed by V levels. The LL coefficients are then scalar quantized.

In the work presented in this paper, the SOT-C structure proposed in [21] is applied together with a two-level virtual decomposition of the LL subband. However, instead of replacing the LL coefficients with zero values, our proposed work treats these coefficients as the virtual HL, LH, and HH coefficients, as shown in Figure 8. The total number of root nodes during the initialization stage is equal to eight: two roots without descendants and two roots for each of the HL, LH, and HH subbands. With the modified SOT, a longer tree structure is obtained, which means that fewer zerotrees need to be coded at every pass. As a result, the number of bits generated during


Table 1: Predefined rules to calculate the new addresses of DWT coefficients in a STRIP_BUFFER of size 16 × 512 pixels (address bits listed from MSB to LSB).

N = 1:
  Initial address of image pixel:  A12 A11 A10 A9 A8 A7 A6 A5 A4 A3 A2 A1 A0
  New address in STRIP_BUFFER:     A9 A0 A12 A11 A8 A7 A6 A5 A4 A3 A2 A10 A1
  Equivalent equation: (A9∗4096) + (A0∗2048) + (A12∗1024) + (A11∗512) + (A8∗256) + (A7∗128) + (A6∗64) + (A5∗32) + (A4∗16) + (A3∗8) + (A2∗4) + (A10∗2) + (A1∗1)

N = 2:
  Initial address of LL pixel:     A10 A9 A8 A7 A6 A5 A4 A3 A2 A1 A0
  New address in STRIP_BUFFER:     A8 A0 A10 A7 A6 A5 A4 A3 A2 A9 A1
  Equivalent equation: (A8∗1024) + (A0∗512) + (A10∗256) + (A7∗128) + (A6∗64) + (A5∗32) + (A4∗16) + (A3∗8) + (A2∗4) + (A9∗2) + (A1∗1)

N = 3:
  Initial address of LL pixel:     A8 A7 A6 A5 A4 A3 A2 A1 A0
  New address in STRIP_BUFFER:     A7 A0 A6 A5 A4 A3 A2 A8 A1
  Equivalent equation: (A7∗256) + (A0∗128) + (A6∗64) + (A5∗32) + (A4∗16) + (A3∗8) + (A2∗4) + (A8∗2) + (A1∗1)

N = 4:
  Initial address of LL pixel:     A6 A5 A4 A3 A2 A1 A0
  New address in STRIP_BUFFER:     A6 A0 A5 A4 A3 A2 A1
  Equivalent equation: (A6∗64) + (A0∗32) + (A5∗16) + (A4∗8) + (A3∗4) + (A2∗2) + (A1∗1)

Figure 8: Proposed new spatial orientation tree structures: a two-scale virtual decomposition is applied on the LL subband (virtual LH, HL, and HH subbands plus roots without descendants), above the LH, HL, and HH subbands of the four-scale DWT decomposition; each parent node takes 1 × 4 direct offspring within the same subband (coefficient indices 0000–8191).

the early stage of the sorting pass is significantly reduced[21, 22].

5. Set-Partitioning in Hierarchical Trees

In SPIHT coding, three sets of coordinates are encoded [5]: the Type H set, which holds the coordinates of all SOT roots; the Type A set, which holds the coordinates of all descendants of node (i, j); and the Type B set, which holds the coordinates of all grand descendants of node (i, j). The order of the subsets tested for significance is stored in three ordered lists: (i) the list of significant pixels (LSP), (ii) the list of insignificant pixels (LIP), and (iii) the list of insignificant sets (LIS). The LSP and LIP contain the coordinates of individual pixels, whereas the LIS contains either Type A or Type B sets.

SPIHT encoding starts with an initial threshold T0, normally a power of two determined by K, the number of bits needed to represent the largest coefficient found in the wavelet-transformed image. The LSP is set as an empty list, and all the nodes in the highest subband are put into the LIP. The root nodes with descendants are put into the LIS. A coefficient or set is encoded as significant if its value is larger than or equal to the threshold T, or as insignificant if its value is smaller than T. Two encoding passes, the sorting pass and the refinement pass, are performed in the SPIHT coder.

During the sorting pass, a significance test is performed on the coefficients in the order in which they are stored in the LIP. Elements in the LIP found to be significant with respect to the threshold are moved to the


Figure 9: The two combinations in the modified SPIHT algorithm: (a) Combination 1, DESC(i, j) = 1 and GDESC(i, j) = 1, where both SIG(k, l) and DESC(k, l) are tested for (k, l) ∈ O(i, j); (b) Combination 2, DESC(i, j) = 1 and GDESC(i, j) = 0, where only SIG(k, l) is tested.

LSP. A significance test is then performed on the sets in the LIS. Here, if a set in the LIS is found to be significant, it is removed from the list and partitioned into four single elements and a new subset. This new subset is added back to the LIS, and the four elements are tested and moved to the LSP or the LIP depending on whether they are significant or insignificant with respect to the threshold.

Refinement is then carried out on every coefficient in the LSP except those just added during the current sorting pass. Each coefficient in the list is refined by an additional bit of precision. Finally, the threshold is halved, and SPIHT coding is repeated until all the wavelet coefficients are coded or until the target rate is met. This coding methodology, carried out under a sequence of thresholds T0, T1, T2, T3, ..., T(K−1) with Ti = Ti−1/2, is referred to as bit-plane encoding.

From the study of SPIHT coding, it can be seen that, besides the individual tree nodes, SPIHT also performs significance tests on both degree-1 and degree-2 zerotrees. While this improves the coding performance by providing more levels of descendant information for each coefficient tested, compared to EZW, which only performs significance tests on individual tree nodes and the degree-0 zerotree, the development of SPIHT coding neglects the coding of the degree-0 zerotree.

Analysis from our study involving degree-0 to degree-2 zerotree coding found that the coding of the degree-0 zerotree, which was removed during the development of SPIHT coding, is important and can lead to a significant improvement in zerotree coding efficiency. Thus, in the next subsection, a proposed modification of the SPIHT algorithm which reintroduces the degree-0 zerotree coding methodology is presented. It should be noted that in our proposed modified SPIHT coding, the significance tests performed on individual tree nodes, Type A sets, and Type B sets are referred to as SIG, DESC, and GDESC, respectively.

5.1. Proposed SPIHT-ZTR Coding. In the traditional SPIHT coding of the sets in the LIS, the significance test is first performed on the Type A set. If the Type A set is found to be significant, that is, DESC(i, j) = 1, its 2 × 2 offspring (k, l) ∈ O(i, j) are tested for significance and are moved to the LSP or the LIP, depending on whether they are significant,

that is, SIG(k, l) = 1, or insignificant, that is, SIG(k, l) = 0, with respect to the threshold. Node (i, j) is then added back to the LIS as a Type B set. Subsequently, if the Type B set is found to be significant, that is, GDESC(i, j) = 1, the set is removed from the list and partitioned into four new Type A subsets, which are added back to the LIS. Here, we propose a modification of the order in

which the DESC and GDESC bits are sent. In the modified SPIHT algorithm, the GDESC(i, j) bit is sent immediately when DESC(i, j) is found to be significant. As shown in Figure 9, when DESC(i, j) = 1, four SIG(k, l) bits need to be sent; however, whether the DESC(k, l) bits need to be sent depends on the result of GDESC(i, j). Thus, there are two possible combinations: Combination 1, DESC(i, j) = 1 and GDESC(i, j) = 1; and Combination 2, DESC(i, j) = 1 and GDESC(i, j) = 0.

Combination 1: DESC(i, j) = 1 and GDESC(i, j) = 1. When the significance test result of GDESC(i, j) equals 1, it indicates that there must be at least one grand descendant node under (i, j) that is significant with respect to the current threshold T. Thus, in order to locate the significant node or nodes, four DESC(k, l) bits need to be sent in addition to the four SIG(k, l) bits, where (k, l) ∈ O(i, j). Table 2 shows the results of an analysis, carried out on six standard test images, of the percentage of occurrence of the possible outcomes of the SIG(k, l) and DESC(k, l) bits.

As shown in Table 2, the percentage of occurrence of the outcome SIG = 0 and DESC = 0 is much higher than that of the other three outcomes. Thus, in our proposed modified SPIHT coding, the Huffman coding concept is applied to code these four possible outcomes of the SIG and DESC bits. By allocating fewer bits to the most likely outcome, SIG = 0 and DESC = 0, an improvement in the coding gain of SPIHT is expected. It should be noted that the outcome SIG = 0 and DESC = 0 is equivalent to the significance test of a zerotree root (ZTR) in the EZW algorithm. Therefore, by encoding the root node and descendants of an SOT using a single symbol, the degree-0 zerotree coding methodology is reintroduced into our proposed modified SPIHT coding, which for convenience is termed the SPIHT-ZTR coding scheme.
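The following C++ sketch (our own illustration; only the four prefix codes come from Table 2, and emit_bits() is a hypothetical bit-stream writer) shows how the Combination 1 symbols could be emitted, with the single bit "0" reserved for the most frequent ZTR outcome.

```
#include <cstdio>

// Hypothetical bit-stream writer: prints bits MSB-first for illustration.
static void emit_bits(unsigned bits, int nbits) {
    for (int b = nbits - 1; b >= 0; --b)
        std::putchar(((bits >> b) & 1u) ? '1' : '0');
}

// Combination 1 (Table 2): joint SIG/DESC outcome of one offspring,
// coded with the prefix code "0", "10", "110", "111".
void code_combination1(int sig, int desc) {
    if (sig == 0 && desc == 0)      emit_bits(0x0, 1);  // "0"   (ZTR)
    else if (sig == 0 && desc == 1) emit_bits(0x2, 2);  // "10"  (IZ)
    else if (sig == 1 && desc == 0) emit_bits(0x6, 3);  // "110" (POS/NEG)
    else                            emit_bits(0x7, 3);  // "111" (POS/NEG)
}
```

Since the ZTR outcome occurs over 42% of the time on all six test images, its 1-bit code keeps the expected symbol length well under the 2 bits of the uncoded case.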


Table 2: The percentage (%) of occurrence of the possible outcomes of the SIG(k, l) and DESC(k, l) bits for various standard gray-scale test images of size 512 × 512 pixels under Combination 1: DESC(i, j) = 1 and GDESC(i, j) = 1. Node (i, j) is the root node and (k, l) is an offspring of (i, j).

Test image   SIG=0, DESC=0   SIG=0, DESC=1   SIG=1, DESC=0   SIG=1, DESC=1
Lenna            42.60           32.67           11.49           13.24
Barbara          42.14           35.47           10.70           11.69
Goldhill         44.76           28.13           14.07           13.04
Peppers          44.39           34.49            9.41           11.71
Airplane         44.01           25.22           16.51           14.26
Baboon           42.71           28.30           14.97           14.02

Equivalent symbol in EZW:              ZTR     IZ     POS/NEG   POS/NEG
Bits assignment in the proposed work:  "0"     "10"   "110"     "111"

Table 3: The percentage (%) of occurrence of the possible outcomes of ABCD for various standard gray-scale test images of size 512 × 512 pixels under Combination 2: DESC(i, j) = 1 and GDESC(i, j) = 0. ABCD refers to the significance of the four offspring of node (i, j).

ABCD    Lenna  Barbara Goldhill Peppers Airplane Baboon   Bits assignment
0001    15.40   14.66   15.25    15.15   15.27   14.70    "1" + "00"
0010    14.87   14.21   14.41    14.76   15.84   14.67    "1" + "01"
0100    14.79   13.66   15.72    15.23   15.96   14.78    "1" + "10"
1000    15.21   13.96   14.83    15.02   15.70   15.26    "1" + "11"
0011     4.81    5.93    5.21     5.20    5.34    5.48    "0" + "0011"
0101     5.48    5.51    5.38     4.98    4.92    4.95    "0" + "0101"
0110     4.60    4.41    4.25     4.24    3.96    4.54    "0" + "0110"
1001     4.34    4.38    4.15     4.39    3.96    4.56    "0" + "1001"
1010     5.33    5.58    5.12     5.06    5.21    4.86    "0" + "1010"
1100     4.84    5.24    5.32     5.37    5.26    5.25    "0" + "1100"
0111     2.27    2.69    2.34     2.31    1.86    2.36    "0" + "0111"
1011     2.26    2.51    2.12     2.37    1.85    2.31    "0" + "1011"
1101     2.16    2.56    2.21     2.20    1.95    2.47    "0" + "1101"
1110     2.28    2.43    2.37     2.32    1.84    2.40    "0" + "1110"
1111     1.36    2.27    1.32     1.40    1.08    1.41    "0" + "1111"

Combination 2: DESC(i, j) = 1 and GDESC(i, j) = 0. When DESC(i, j) = 1 and GDESC(i, j) = 0, the SOT is a degree-2 zerotree in which all the grand descendant nodes under (i, j) are insignificant, while at least one of the four offspring of node (i, j) is significant. In this situation, four SIG(k, l) bits, where (k, l) ∈ O(i, j), need to be sent. Let the significance of the four offspring of node (i, j) be referred to as "ABCD". A total of 15 possible outcomes of ABCD can be obtained, as shown in Table 3. The percentage of occurrence of the possible outcomes of ABCD was determined for various standard test images, and the results are recorded in Table 3.

From Table 3, it can be seen that the first four ABCD outcomes, "0001," "0010," "0100," and "1000," occur more frequently than the other 11 possible outcomes. As in Combination 1, the Huffman coding concept is applied to encode the outcomes of ABCD. The output bit assignment for each of the 15 possible outcomes of ABCD is shown in Table 3. Since fewer bits are needed to

encode the most likely outcomes of ABCD, that is, "0001," "0010," "0100," and "1000," an improved performance of the SPIHT coding is anticipated.
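Continuing the sketch above (again our own illustration, built on the prefix assignment reconstructed in Table 3, which should be treated as an assumption), the Combination 2 symbols could be emitted as follows: the four single-offspring outcomes get a 3-bit code ("1" plus a 2-bit index), while every other ABCD pattern is sent as "0" followed by the raw four bits.

```
void emit_bits(unsigned bits, int nbits);    // as in the earlier sketch

// Combination 2 (Table 3): abcd holds the four offspring significance
// bits with A as the most significant bit.
void code_combination2(unsigned abcd) {
    switch (abcd) {
        case 0x1: emit_bits(0x4, 3); break;  // 0001 -> "1" + "00"
        case 0x2: emit_bits(0x5, 3); break;  // 0010 -> "1" + "01"
        case 0x4: emit_bits(0x6, 3); break;  // 0100 -> "1" + "10"
        case 0x8: emit_bits(0x7, 3); break;  // 1000 -> "1" + "11"
        default:  emit_bits(abcd, 5); break; // "0" + ABCD (abcd < 16)
    }
}
```

With roughly 60% of the outcomes falling in the four 3-bit cases (Table 3), the expected symbol length is about 0.6 × 3 + 0.4 × 5 = 3.8 bits, slightly better than sending the four SIG bits uncoded.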

It should be noted that in both Combinations 1 and 2, all the wavelet coefficients found to be insignificant are added to the LIP, and those found to be significant are added to the LSP. The sign bits of the significant coefficients are also output to the decoder.

5.2. Listless SPIHT-ZTR for Strip-Based Implementation. Although the proposed SPIHT-ZTR coding is expected to provide efficient compression performance, its implementation in a hardware-constrained environment is difficult. One of the major difficulties is the use of three lists to store the coordinates of individual coefficients and subset trees during the set-partitioning operation. The use of these lists increases the complexity and implementation cost of the coder, since memory management is required and a large amount of storage is needed


Figure 10: Listless SPIHT-ZTR for strip-based implementation: (a) sorting pass and refinement pass, driven by the SIG_PREV, DESC_PREV, and GDESC_PREV maps, with the threshold halved after each pass; (b) Combination 1, DESC(i, j) = 1 and GDESC(i, j) = 1, emitting "0", "10", "110", or "111" according to the SIG/DESC outcome; (c) Combination 2, DESC(i, j) = 1 and GDESC(i, j) = 0, emitting the bit assignment of Table 3 for the four direct offspring.

to maintain these lists [23, 24]. In this subsection, a listless SPIHT-ZTR coding for strip-based implementation is proposed. The proposed algorithm not only has all the advantages of a listless coder but is also developed specifically for the low-memory strip-based implementation of SPIHT coding. The flow chart of the proposed algorithm is shown in Figure 10.

In our proposed listless SPIHT-ZTR algorithm, three significance maps, known as SIG_PREV, DESC_PREV, and GDESC_PREV, are used to store the significance of the coefficient, of its descendants, and of its grand descendants, respectively. The SIG_PREV information is stored in a one-bit array whose size equals that of the strip buffer. In comparison, the array size of DESC_PREV is only a quarter of that of SIG_PREV, since the leaf nodes have no descendants, and the array size of GDESC_PREV is only one-sixteenth of that of SIG_PREV, since the nodes in the lowest two scales have no grand descendants.
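The storage cost can be made concrete with a small sketch (our own; the paper gives no code, and packing one flag bit per entry is our assumption) sizing the three maps for a 16 × 512 strip:

```
#include <bitset>

const int STRIP = 16 * 512;                  // coefficients per strip

std::bitset<STRIP>      SIG_PREV;            // 1 bit per coefficient
std::bitset<STRIP / 4>  DESC_PREV;           // leaf nodes have no descendants
std::bitset<STRIP / 16> GDESC_PREV;          // lowest two scales have no
                                             // grand descendants

void start_new_strip() {
    // cleared and released for the next image strip after coding
    SIG_PREV.reset();
    DESC_PREV.reset();
    GDESC_PREV.reset();
}
```

In total this is 8192 + 2048 + 512 = 10,752 flag bits per strip, far below the image-sized flag arrays of LZC mentioned next.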

In listless SPIHT-ZTR coding, the memory needed to store the significance information during entropy coding is very small compared to SPIHT and the listless zerotree coder (LZC) [24]. In SPIHT, three lists are used; in LZC, the significance flags FC and FD are equal in size to the image and to a quarter of the image, respectively. In our proposed coding scheme, the significance map storage is cleared and released for coding of the next image strip after coding is done for each strip.

It should be noted that the peak signal-to-noise ratio (PSNR) performance of our proposed listless SPIHT-ZTR coding is the same as that of the original SPIHT algorithm at the end of every bit-plane. The number of significant pixels of both algorithms after every bit-plane is

Figure 11: Architecture for SPIHT_ENCODE: significance data collection (upward scanning) fills the DESC_BUFFER and GDESC_BUFFER and the SIG_PREV, DESC_PREV, and GDESC_PREV maps; SPIHT-ZTR coding (downward scanning) then consumes them, with the threshold halved after each pass.

exactly the same, except that the sequence in which the bits are produced is different.

Similar to other listless coders, the sorting and refinement passes of the traditional SPIHT algorithm are merged into one single pass in the proposed listless SPIHT-ZTR algorithm. This makes the control flow of our proposed coding simple and easy to implement in hardware [23, 24].

5.3. Architecture for SPIHT_ENCODE. Figure 11 shows our proposed SPIHT_ENCODE architecture. Since the wavelet


Figure 12: Significance information for each coefficient at each bit-plane is determined and stored in buffers as the SOT is scanned from the bottom to the top: the SIG bits come from the STRIP_BUFFER, and the DESC (DESC_BUFFER) and GDESC (GDESC_BUFFER) bits are formed by OR-ing over the four offspring, in parallel for every bit-plane from the MSB downwards.

coefficients in the STRIP_BUFFER are arranged in a pyramidal structure where parent nodes are always above their descendant nodes, the proposed listless SPIHT-ZTR coding is implemented using a one-pass upward scanning and a one/multipass downward scanning methodology, as explained below.

One-Pass Upward Scanning (Significance Data Collection). This scan starts from the leaf nodes and moves up to the roots, that is, from the bottom to the top of the STRIP_BUFFER. While the SOT is being scanned, the DESC and GDESC significance information for each coefficient at each bit-plane is determined and stored in the temporary buffers DESC_BUFFER and GDESC_BUFFER.

This significance data collection process is carried out in parallel for all bit-planes, as shown in Figure 12. The SIG information is obtained directly from the STRIP_BUFFER, whereas the DESC and GDESC information for a coefficient is obtained by OR-ing the SIG and DESC results of its four offspring, respectively. It should be noted that the proposed significance data collection process is analogous to the fast zerotree identifying technique proposed in [13]. With all the significance information precomputed and stored, the encoding process is fast, since the significance information can be read directly from the buffers during the SPIHT-ZTR coding.
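A compact way to picture this collection step is the following C++ sketch (our own; the toy complete-quadtree indexing is illustrative and not the paper's 1-D strip layout): each coefficient's magnitude bits act as its per-bit-plane SIG vector, so a single OR handles all bit-planes at once during one bottom-up pass.

```
#include <cstdint>

enum { NODES = 16 * 512 };
uint16_t mag[NODES];    // magnitudes: bit k is SIG at bit-plane k
uint16_t DESC[NODES];   // bit k: any descendant significant at plane k
uint16_t GDESC[NODES];  // bit k: any grand descendant significant at plane k

static bool is_leaf(int n)          { return 4 * n + 4 >= NODES; }
static int  offspring(int n, int c) { return 4 * n + 1 + c; }  // toy quadtree

void collect(int n) {                   // post-order: children before parent
    DESC[n] = GDESC[n] = 0;
    if (is_leaf(n)) return;
    for (int c = 0; c < 4; ++c) {
        int k = offspring(n, c);
        collect(k);
        DESC[n]  |= mag[k] | DESC[k];   // offspring SIG or deeper significance
        GDESC[n] |= DESC[k];            // significance strictly below offspring
    }
}
```

Calling collect() on each root fills the DESC/GDESC information for every bit-plane in a single upward sweep, which is exactly why no recursive descendant check is needed during the downward coding pass.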

One/Multipass Downward Scanning (Listless SPIHT-ZTR Coding). SPIHT-ZTR coding, as described in Figure 10, is performed on the DWT coefficients stored in the STRIP_BUFFER. As in traditional SPIHT coding, a bit-plane coding methodology can be applied here. Although


Figure 13: Architecture of our modified MIPS processor: a four-stage pipeline (IF/ID, ID/EX, EX/MEM) with instruction memory, register file, sign extension, ALU with ALU control, a forwarding register FWD fed by an AND of the register comparisons, a branch comparator, data memory, and the RegDst, ALUSrc, MemRead, MemWrite, MemtoReg, and RegWrite control signals.

a fully embedded bit-stream cannot be obtained, because only a portion of the image is encoded at a time, the proposed strip-based image compression scheme has a partially embedded property. Each SOT in the strip buffer is encoded in order of importance, that is, coefficients with higher magnitudes are encoded first. This allows region-of-interest (ROI) coding, since a higher number of encoding passes can be set for a strip that contains the targeted part of the image.

On the other hand, non-embedded SPIHT-ZTR coding can be performed using the one-pass downward scanning methodology. Here, instead of scanning the SOT for different magnitude intervals, each coefficient in the tree is scanned from its most significant bit (MSB) to its least significant bit (LSB). Since all the significance information needed for all bit-planes is stored during the upward scanning process, full bit-plane encoding can be carried out on one coefficient followed by the next.

Not only does the proposed listless SPIHT-ZTR coding require less memory and reduce the complexity of the coder implementation by eliminating the use of lists, but the upward-downward scanning methodology also simplifies the encoding process and allows a faster coding speed.

6. Microprocessor-Based Implementation and Simulation Results

The proposed strip-based SPIHT-ZTR architecture was implemented using a softcore microprocessor-based approach on a Xilinx Spartan III FPGA device. A customized implementation of the MIPS processor architecture [25] was adopted. Figure 13 shows the architecture of our proposed MIPS processor, which is a modified version of the MIPS architecture presented in [25], simplified in order to facilitate the implementation of strip-based image compression.

First, a simplified forwarding unit is incorporated into our MIPS architecture. This unit allows the output of the arithmetic logic unit (ALU) to be fed back to the ALU itself for computation. The data forwarding operation is controlled by the result derived from the AND operation, which is stored in the register FWD. Instead of having to detect data hazards as in the traditional MIPS architecture, a specific register number (register 31) is used to inform the processor to use the data directly from the previous ALU operation. Next, the MIPS architecture is reduced from its original five-stage pipeline implementation to a four-stage pipeline implementation. This is achieved by shifting the data memory unit and the branch instruction unit one stage forward.

In the traditional MIPS indexed addressing method, an offset value is added to a pointer address to form a new memory address. For example, the instruction “lw $t2, 4($t0)” will load the word at memory address ($t0+4) into register $t2. The value “4” gives an offset from the address stored in register $t0. In our MIPS implementation, the addressing method is simplified by removing the offset calculation, because most of the time the offset is equal to zero. For example, to


Table 4: MIPS machine language.

Category            Instruction           Format   Example               Meaning
Arithmetic          Add                   R        add $s1, $s2, $s3     $s3 = $s1 + $s2
                    Subtract              R        sub $s1, $s2, $s3     $s3 = $s2 − $s1
                    Add Immediate         I        addi $s1, $s2, 100    $s2 = $s1 + 100
Data Transfer       Load Word             I        lw $s1, $s2, X        $s2 = Memory[$s1]
                    Store Word            I        sw $s1, $s2, X        Memory[$s1] = $s2
Logical             And                   R        and $s1, $s2, $s3     $s3 = $s1 & $s2
                    Or                    R        or $s1, $s2, $s3      $s3 = $s1 | $s2
                    Shift Left Logical    R        sll $s1, X, $s3       $s3 = $s1 << 1
                    Shift Right Logical   R        srl $s1, X, $s3       $s3 = $s1 >> 1
Conditional Branch  Branch on Equal       I        beq $s1, $s2, B       If ($s1 = $s2) go to B
                    Branch on Not Equal   I        bne $s1, $s2, B       If ($s1 != $s2) go to B
                    Set on Less Than      R        slt $s1, $s2, $s3     If ($s2 > $s1) $s3 = 1; else $s3 = 0
DWT                 Add Shift             R        as1 $s1, $s2, $s3     $s3 = ($s1 + $s2) / 2
                    Add Shift Shift       R        as2 $s1, $s2, $s3     $s3 = ($s1 + $s2 + 2) / 4
                    DWT-1                 R        dwt1 $s1, X, $s3      $s3 = NewAddressCalculation($s1)
                    DWT-2                 R        dwt2 $s1, X, $s3      $s3 = NewAddressCalculation($s1)
                    DWT-3                 R        dwt3 $s1, X, $s3      $s3 = NewAddressCalculation($s1)
                    DWT-4                 R        dwt4 $s1, X, $s3      $s3 = NewAddressCalculation($s1)

$s1, $s2, $s3: registers; X: not used.

access the data stored in location ($t0+4), the address is first obtained by adding “4” to the content of register $t0. Then, an indirect addressing method “lw $t2, $t0” is used to load the word at the memory address contained in $t0 into $t2. The register $t0 contains the new address ($t0+4), which is available directly from the output of the ALU or from the ID/EX pipeline register. Hence, the data memory unit can be shifted one stage forward in the proposed MIPS architecture. This allows the data forwarding hardware to be simplified.

The branch instruction unit is also shifted one stage forward in our modified MIPS processor in order to reduce the number of stall instructions that are required after a branch instruction. In addition, our MIPS architecture also supports both the “branch not equal” and “branch equal” instructions. By incorporating a comparator followed by an XOR operation, “branch not equal” or “branch equal” is selected based on the result stored in register CMP.

Table 4 shows the MIPS instruction set used in our strip-based image processing implementation. As can be seen, a few instructions are added for the DWT implementation besides the standard instructions given in [25]. The as1 and as2 instructions are used to speed up the processing of the DWT, whereas the dwt1 to dwt4 instructions are used to calculate the new memory addresses of the wavelet coefficients in the strip-buffer. Table 5 shows the device utilization summary for the proposed strip-based coding implementation. The implementation uses 2366 slices, which is approximately 17% of the Xilinx Spartan III FPGA. The numbers of MIPS instructions needed for the DWT MODULE and SPIHT MODULE are 261 and 626, respectively.
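For illustration, the Table 4 semantics of as1 and as2 map directly onto the predict and update steps of the 5/3 lifting scheme; the sketch below makes this mapping explicit and assumes symmetric boundary extension, which the paper does not spell out.

def as1(a, b):            # custom instruction: $s3 = ($s1 + $s2) / 2
    return (a + b) >> 1   # floor division, as in the 5/3 predict step

def as2(a, b):            # custom instruction: $s3 = ($s1 + $s2 + 2) / 4
    return (a + b + 2) >> 2

def lifting_53(x):
    """One 1D level of the 5/3 DWT on an even-length integer signal x."""
    n = len(x)
    # predict: detail d[i] = x[2i+1] - floor((x[2i] + x[2i+2]) / 2)
    d = [x[2*i + 1] - as1(x[2*i], x[min(2*i + 2, n - 2)])
         for i in range(n // 2)]
    # update: approximation s[i] = x[2i] + floor((d[i-1] + d[i] + 2) / 4)
    s = [x[2*i] + as2(d[max(i - 1, 0)], d[i])
         for i in range(n // 2)]
    return s, d   # low-pass and high-pass bands

Each predict and update step then costs a single as1 or as2 instruction plus one subtraction or addition, which is where the reported DWT speed-up would come from.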

Table 5: Device utilization summary for the strip-based SPIHT-ZTR architecture implementation.

Selected device:             Xilinx Spartan III 3S1500L-4 FPGA
Number of occupied slices:   2366 out of 13312 (17%)
Number of slice flip-flops:  1272 out of 26624 (4%)
Number of 4-input LUTs:      3416 out of 26624 (12%)

Software simulations using MATLAB were carried out to evaluate the performance of our proposed strip-based image coding using the SPIHT-ZTR algorithm. The simulations were conducted using the 5/3 DWT filter. All standard grey-scale test images used are of size 512 × 512 pixels. In our proposed work, a four-scale DWT decomposition and a five-scale SOT decomposition were performed using the proposed SPIHT-ZTR coding with an SOT-C structure. The performance of the proposed coding scheme was compared with the traditional SPIHT coding. Both the binary-uncoded (SPIHT-BU) and arithmetic-coded (SPIHT-AC) SPIHT coding were also implemented with a four-scale DWT and a five-scale SOT decomposition using the traditional 2 × 2 SOT structure.

Table 6 shows the PSNR at various bit-rates (bpp) for the test images Lenna, Barbara, Goldhill, Peppers, Airplane, and Baboon. Figure 14 shows the performance comparison plot for SPIHT-AC, SPIHT-BU, and SPIHT-ZTR in terms of average PSNR versus the average number of bits sent. From the simulation results shown in Table 6, it can be seen


Table 6: Performance of the proposed strip-based image coder using SPIHT-ZTR coding and SOT-C structure compared to the traditional binary-uncoded (SPIHT-BU) and arithmetic-coded (SPIHT-AC) SPIHT coding in terms of peak signal-to-noise ratio (dB) versus bit-rate (bpp) for various grey-scale test images of size 512 × 512 pixels.

Peak signal-to-noise ratio, PSNR (dB); columns: bit-rate (bpp), SPIHT-AC, SPIHT-ZTR, SPIHT-BU.

Lenna
0.25   33.35   32.98   32.91
0.50   36.56   36.17   36.07
0.80   38.74   38.46   38.34
1.00   39.75   39.49   39.31

Barbara
0.25   26.50   26.16   26.14
0.50   30.01   29.65   29.60
0.80   33.35   32.95   32.86
1.00   34.99   34.46   34.29

Goldhill
0.25   30.30   29.84   29.91
0.50   32.82   32.33   32.33
0.80   34.90   34.62   34.41
1.00   36.29   35.77   35.66

Peppers
0.25   34.42   34.04   33.99
0.50   36.87   36.50   36.48
0.80   38.35   38.12   37.95
1.00   39.12   38.85   38.71

Airplane
0.25   33.35   32.93   32.78
0.50   37.31   36.81   36.68
0.80   40.45   40.01   39.83
1.00   42.01   41.40   41.26

Baboon
0.25   24.20   24.03   23.88
0.50   26.49   25.89   25.95
0.80   28.56   28.25   28.07
1.00   30.02   29.48   29.38

Table 7: Memory requirements for the strip-based implementation of the traditional SPIHT coding using the original 2 × 2 SOT structure and our proposed SPIHT-ZTR using SOT-C.

Coding scheme                     DWT scale   SOT scale   Min. memory lines per subband (DWT/SOT)   Spatial orientation tree (SOT) structure
SPIHT-BU / SPIHT-AC [5]           4           5           8/32                                      Original 2 × 2 structure with roots at LL subbands
Strip-based SPIHT [10]            4           5           8/8                                       Roots start from highest LH, HL and HH subbands
Proposed strip-based SPIHT-ZTR    4           4           8/8                                       SOT-C with roots at LL subbands


Figure 14: Performance comparison of SPIHT-AC, SPIHT-BU and SPIHT-ZTR in terms of peak signal-to-noise ratio (PSNR) versus the number of bits sent (Kbits). (The comparison plots show average PSNR values and the average number of bits sent over all six test images.)

that our proposed SPIHT-ZTR performs better than SPIHT-BU. An average PSNR improvement of 0.14 dB is obtained at 1.00 bpp using the proposed coding scheme. This is because the number of bits required to encode the image at each bit-plane is smaller in SPIHT-ZTR than in SPIHT-BU. In comparison with SPIHT-AC, although SPIHT-ZTR gives a slightly lower PSNR performance, its implementation is much less complex since there is no arithmetic coding in SPIHT-ZTR.

Table 7 compares the memory requirements of the strip-based implementation of our proposed SPIHT-ZTR with those of [5] and [10]. It should be noted that in the traditional SPIHT [5] coding, a six-scale DWT decomposition and a seven-scale SOT decomposition were originally applied to an image of size 512 × 512 pixels. However, for our comparison to be meaningful, the memory requirements recorded here all involve a four-scale DWT and a five-scale SOT decomposition. From Table 7, it can be seen that our proposed strip-based


SPIHT-ZTR using SOT-C reduces the memory requirement by 75% compared to the traditional SPIHT using the original 2 × 2 SOT structure. Even though the strip-based SPIHT coder proposed in [10] requires the same number of memory lines as our proposed work, it suffers a significant degradation in performance, since the number of zerotrees to be coded is increased. This has been shown in [10, 21].

Lastly, we have also verified that the output of our proposed hardware strip-based coder is consistent with the software simulation results.

7. Conclusion

The proposed architecture for strip-based image coding using the SPIHT-ZTR algorithm considerably reduces the complexity of the hardware implementation and requires far less memory for processing and buffering than the traditional SPIHT coding, making it suitable for severely constrained hardware environments such as WSNs. Using the proposed new 1D addressing method, wavelet coefficients generated from the DWT module are organized into predetermined locations in the strip-buffer. This simplifies the implementation of SPIHT-ZTR coding, since the coding can now be performed in two passes. Besides this, the proposed modification of the SPIHT algorithm, reintroducing degree-0 zerotree coding, results in a significant improvement in compression efficiency. The proposed architecture was successfully implemented on our designed MIPS processor and the results have been verified through simulations using MATLAB.

References

[1] I. F. Akyildiz, T. Melodia, and K. R. Chowdhury, “Wireless multimedia sensor networks: a survey,” IEEE Wireless Communications, vol. 14, no. 6, pp. 32–39, 2007.

[2] A. Mainwaring, J. Polastre, R. Szewczyk, D. Culler, and J. Anderson, “Wireless sensor networks for habitat monitoring,” in Proceedings of the ACM International Workshop on Wireless Sensor Networks and Applications (WSNA ’02), pp. 88–97, Atlanta, Ga, USA, September 2002.

[3] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “A survey on sensor networks,” IEEE Communications Magazine, vol. 40, no. 8, pp. 102–105, 2002.

[4] E. Magli, M. Mancin, and L. Merello, “Low-complexity video compression for wireless sensor networks,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME ’03), vol. 3, pp. 585–588, Baltimore, Md, USA, July 2003.

[5] A. Said and W. A. Pearlman, “A new, fast, and efficient image codec based on set partitioning in hierarchical trees,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 3, pp. 243–250, 1996.

[6] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3445–3462, 1993.

[7] D. Taubman, “High performance scalable image compression with EBCOT,” IEEE Transactions on Image Processing, vol. 9, no. 7, pp. 1158–1170, 2000.

[8] W.-B. Huang, W. Y. Su, and Y.-H. Kuo, “VLSI implementation of a modified efficient SPIHT encoder,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 89, no. 12, pp. 3613–3622, 2006.

[9] J. Jyotheswar and S. Mahapatra, “Efficient FPGA implementation of DWT and modified SPIHT for lossless image compression,” Journal of Systems Architecture, vol. 53, no. 7, pp. 369–378, 2007.

[10] R. K. Bhattar, K. R. Ramakrishnan, and K. S. Dasgupta, “Strip based coding for large images using wavelets,” Signal Processing, vol. 17, no. 6, pp. 441–456, 2002.

[11] C. Parisot, M. Antonini, M. Barlaud, C. Lambert-Nebout, C. Latry, and G. Moury, “On board strip-based wavelet image coding for future space remote sensing missions,” in Proceedings of the International Geoscience and Remote Sensing Symposium (IGARSS ’00), vol. 6, pp. 2651–2653, Honolulu, HI, USA, July 2000.

[12] C. Chrysafis and A. Ortega, “Line-based, reduced memory, wavelet image compression,” IEEE Transactions on Image Processing, vol. 9, no. 3, pp. 378–389, 2000.

[13] J. M. Shapiro, “A fast technique for identifying zerotrees in the EZW algorithm,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’96), vol. 3, pp. 1455–1458, 1996.

[14] A. Jensen and A. la Cour-Harbo, Ripples in Mathematics: The Discrete Wavelet Transform, Springer, Berlin, Germany, 2000.

[15] G. Strang and T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley, Mass, USA, 2nd edition, 1996.

[16] M. Weeks, Digital Signal Processing Using MATLAB and Wavelets, Infinity Science Press, Sudbury, Mass, USA, 2007.

[17] W. Sweldens, “The lifting scheme: a custom-design construction of biorthogonal wavelets,” Applied and Computational Harmonic Analysis, vol. 3, no. 2, pp. 186–200, 1996.

[18] K.-C. B. Tan and T. Arslan, “Shift-accumulator ALU centric JPEG2000 5/3 lifting based discrete wavelet transform architecture,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS ’03), vol. 5, pp. V161–V164, 2003.

[19] T. Acharya and P.-S. Tsai, JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures, Wiley-Interscience, New York, NY, USA, 2004.

[20] M. E. Angelopoulou, K. Masselos, P. Y. K. Cheung, and Y. Andreopoulos, “Implementation and comparison of the 5/3 lifting 2D discrete wavelet transform computation schedules on FPGAs,” Journal of Signal Processing Systems, vol. 51, no. 1, pp. 3–21, 2008.

[21] L. W. Chew, L.-M. Ang, and K. P. Seng, “New virtual SPIHT tree structures for very low memory strip-based image compression,” IEEE Signal Processing Letters, vol. 15, pp. 389–392, 2008.

[22] E. Khan and M. Ghanbari, “Very low bit rate video coding using virtual SPIHT,” Electronics Letters, vol. 37, no. 1, pp. 40–42, 2001.

[23] F. W. Wheeler and W. A. Pearlman, “SPIHT image compression without lists,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’00), vol. 4, pp. 2047–2050, Istanbul, Turkey, June 2000.

[24] W.-K. Lin and N. Burgess, “Listless zerotree coding for color images,” in Proceedings of the 32nd Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 231–235, Monterey, Calif, USA, November 1998.

[25] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 1998.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 725438, 7 pages
doi:10.1155/2009/725438

Research Article

Data Cache-Energy and Throughput Models: Design Exploration for Embedded Processors

Muhammad Yasir Qadri and Klaus D. McDonald-Maier

School of Computer Science and Electronic Engineering, University of Essex, Colchester CO4 3SQ, UK

Correspondence should be addressed to Muhammad Yasir Qadri, [email protected]

Received 25 March 2009; Revised 19 June 2009; Accepted 15 October 2009

Recommended by Bertrand Granado

Most modern 16-bit and 32-bit embedded processors contain cache memories to further increase the instruction throughput of the device. Embedded processors that contain cache memories open an opportunity for the low-power research community to model the impact of cache energy consumption and throughput gains. Mathematical models for optimal cache memory configuration have been proposed in the past, but most of them are too complex to be adapted for modern applications like run-time cache reconfiguration. This paper improves and validates previously proposed energy and throughput models for a data cache, which could be used for overhead analysis for various cache types with a relatively small number of inputs. These models analyze the energy and throughput of a data cache on a per-application basis, thus providing the hardware and software designer with the feedback vital to tune the cache or application for a given energy budget. The models are suitable for use at design time in the cache optimization process for embedded processors, considering time and energy overhead, or could be employed at runtime for reconfigurable architectures.

Copyright © 2009 M. Y. Qadri and K. D. McDonald-Maier. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

The popularity of embedded processors can be judged by the fact that more than 10 billion embedded processors were shipped in 2008, and this is expected to reach 10.76 billion units in 2009 [1]. In the embedded market the number of 32-bit processors shipped has significantly surpassed that of 8-bit processors [2]. Modern 16-bit and 32-bit embedded processors increasingly contain cache memories to further instruction throughput and performance of the device. The recent drive towards low-power processing has challenged designers and researchers to optimize every component of the processor. However, optimization for energy usually comes with some sacrifice in throughput, which may result in only a minor overall gain.

Figure 1 shows the operation of a typical battery-powered embedded system. Normally, in such devices, the processor is placed in active mode only when required; otherwise it remains in a sleep mode. An overall power saving (increased throughput-to-energy ratio) can be achieved by increasing the throughput (i.e., lowering the duty cycle), decreasing the peak energy consumption, or lowering the sleep-mode energy consumption. This clearly shows the interdependence of energy and throughput for overall power saving. Keeping this in mind, a simplified approach is proposed, based on energy and throughput models, to analyze the impact of a cache structure in an embedded processor on a per-application basis; this exemplifies the use of the models for design space exploration and software optimization.
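As a back-of-envelope illustration of this trade-off (the numbers are assumptions, not taken from the paper), the average power of a duty-cycled system is the duty-weighted mix of its active and sleep power:

P_ACTIVE = 60e-3   # W, assumed power while processing
P_SLEEP  = 50e-6   # W, assumed sleep-mode power
duty     = 0.02    # assumed fraction of time spent in active mode

P_avg = duty * P_ACTIVE + (1 - duty) * P_SLEEP
print(f"average power = {P_avg * 1e3:.3f} mW")
# Higher throughput shortens the active window (smaller duty), which can
# lower P_avg even if peak power rises: the energy/throughput coupling
# that the models below quantify.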

The remainder of this paper is divided into five sections. In the following two sections related work is discussed and the energy and throughput models are introduced. In the fourth section the experimental environment and results are discussed, the fifth section describes an example application of the mathematical models, and the final section forms the conclusion.

2. Related Work

The cache energy consumption and throughput models have been the focus of research for some time. Shiue and



Figure 1: Power consumption of a typical battery-powered processor (adapted from [3]).

Chakrabarti [4] present an algorithm to find the optimum cache configuration based on cache size, the number of processor cycles, and the energy consumption. Their work is an extension of the work of Panda et al. [5, 6] on data cache sizing and memory exploration. The energy model by Shiue and Chakrabarti, though highly accurate, requires a wide range of inputs, such as the number of bit switches on the address bus per instruction, the number of bit switches on the data bus per instruction, the number of memory cells in a word line and in a bit line, and so forth, which may not be known to the model user in advance. Another example of a detailed cache energy model was presented by Kamble and Ghose [7]. These analytical models for conventional caches were found to be accurate to within 2% error. However, they over-predict the power dissipation of low-power caches by as much as 30%. The low-power cache designs used by Kamble and Ghose incorporated block buffering, data RAM subbanking, and bus invert coding for evaluating the models. The relative error in the models increased greatly when sub-banking and block buffering were applied simultaneously. The major difference between the approach used by Kamble and Ghose [7] and the one discussed in this paper is that the former incorporated bit-level models to evaluate the energy consumption, which are in some cases inaccurate, as the error in output address power was found (by Kamble and Ghose) to be in the order of 200%, due to the fact that data and instruction access addresses exhibit strong locality. The approach presented here uses a standard cache modelling tool, CACTI [8], for measuring bit-level power consumption in cache structures and provides a holistic approach for energy and throughput on an application basis. In fact, the accuracy of these models is independent of any particular cache configuration, as standard cache energy and timing tools are used to provide cache-specific data. This approach is discussed in detail in Section 4.

Simunic et al. [9] presented mathematical models for energy estimation in embedded systems. The per-cycle energy model presented in their work comprises the energy components of the processor, memory, interconnects and pins, DC-to-DC converters, and level two (L2) cache. The model was validated using an ARM simulator [10] and the SmartBadge [11] prototype based on the ARM-1100 processor, and was found to be within 5% of the hardware measurements for the same operating frequency. The models presented in their work holistically analyze embedded system power and do not estimate the energy consumption of individual components of a processor, that is, level one (L1) cache, on-chip memory, pipeline, and so forth. In work by Li and Henkel [12] a detailed full-system energy model comprising cache, main memory, and software energy components was presented. Their work includes the description of a framework to assess and optimize the energy dissipation of embedded systems. Tiwari et al. [13] presented an instruction-level energy model estimating the energy consumed in individual pipeline stages. The same methodology was applied in [14] by the authors to observe the effects of cache enabling and disabling.

Wada et al. [15] presented a comprehensive circuit-level access time model for on-chip cache memory. Compared with SPICE results, the model gives 20% error for a cache memory with an 8-nanosecond access time. Taha and Wills [16] presented an instruction throughput model for superscalar processors. The main parameters of the model are the superscalar width of the processor, pipeline depth, instruction fetch method, branch predictor, cache size and latency, and so forth. The model results in errors of up to 5.5% compared to the SimpleScalar out-of-order simulator [17]. CACTI (cache access and cycle time model) [8] is an open-source modelling tool based on such detailed models that provides thorough, near-accurate memory access time and energy estimates. However, it is not a trace-driven simulator, and so the energy consumption resulting from the number of hits or misses of a particular application is not accounted for.

Apart from the mathematical models, substantial work has been done on cache miss rate prediction and minimization. Ding and Zhong [18] have presented a framework for data locality prediction, which can be used to profile a code to reduce its miss rate. The framework is based on approximate analysis of reuse distance, pattern recognition, and distance-based sampling. Their results show an average of 94% accuracy when tested on a number of integer and floating-point programs from SPEC and other benchmark suites. Extending their work, Zhong et al. [19] introduce an interactive visualization tool that uses a three-dimensional plot to show miss rate changes across program data sizes and cache sizes. Another very useful tool named RDVIS, a further extension of the work previously stated, was presented by Beyls et al. in [20, 21]. Based on cluster analysis of basic block vectors, the tool gives hints on particular code segments for further optimization. This in effect provides valuable feedback to the programmer to improve the temporal locality of the data and increase the hit rate for a cache configuration.

The following section presents the proposed cache energy and throughput models, which can be used to obtain an early cache overhead estimate based on a limited set of input data. These models are an extension of the models previously proposed by Qadri and Maier in [22, 23].

3. The D-Cache Energy and Throughput Models

The cache energy and throughput models given below strive to provide a complete application-based analysis. As a result they could facilitate the tuning of a cache and an application


Table 1: Simulation platform parameters.

Parameter                                         Value
Processor                                         PowerPC440GP
Execution mode                                    Turbo
Clock frequency (Hz)                              1.00E+08
Time (s)                                          1.00E−08
CPI                                               1
Technology                                        0.18 um
Vdc (V)                                           1.8
Logic supply (V)                                  3.3
DDR SDRAM (V)                                     2.5
VDD (1.8 V) active operating current IDD (A)      9.15E−01
OVDD (3.3 V) active operating current IODD (A)    1.25E−01
Energy per cycle (J)                              1.65E−08
Idle mode energy (J)                              4.12E−09

Table 2: Cache simulator data (CACTI).

Parameter                  Value
Cache size                 32 Kbytes
Block size                 256 bytes
R/W ports                  0
Read ports                 1
Write ports                1
Access time (s)            1.44E−09
Cycle time (s)             7.38E−10
Read energy (J)            2.24E−10
Write energy (J)           3.89E−11
Leakage read power (W)     2.96E−04
Leakage write power (W)    2.82E−04

according to a given power budget. The models presented in this paper are an improved extension of the energy and throughput models for a data cache previously presented by the authors in [22, 23]. The major improvements in the model are as follows. (1) The leakage energy (E_leak) is now indicated for the entire processor rather than simply the cache on its own; the energy model covers the per-cycle energy consumption of the processor, and the leakage energy statistics of the processor in the data sheet cover the cache and all peripherals of the chip. (2) The miss rate in E_read and E_write has been changed to read_mr (read miss rate) and write_mr (write miss rate), as compared to the total miss rate (r_miss) that was employed previously. This was done because the read energy and write energy components correspond to the respective miss rate contributions of the cache. (3) In the throughput model stated in [23] a term t_mem (time saved from memory operations) was subtracted from the total throughput of the system, which was later found to be inaccurate. The overall time taken to execute an instruction, denoted T_total, is the measure of the total time taken by the processor to run an application using the cache; the time saved from memory-only operations is already accounted for in T_total. However, a new term t_ins was introduced to incorporate the time taken for the execution of cache access instructions.

3.1. Energy Model. If E_read and E_write are the energy consumed by cache read and write accesses, E_leak the leakage energy of the processor, E_c→m the energy consumed by cache-to-memory accesses, E_mp the energy miss penalty, and E_misc the energy consumed by the instructions which do not require data memory access, then the total energy consumption of the code, E_total, in Joules (J) can be defined as

$$E_{total} = E_{read} + E_{write} + E_{c \to m} + E_{mp} + E_{leak} + E_{misc}. \tag{1}$$

Further defining the individual components,

$$E_{read} = n_{read} \cdot E_{dyn.read} \cdot \left[ 1 + \frac{read_{mr}}{100} \right],$$

$$E_{write} = n_{write} \cdot E_{dyn.write} \cdot \left[ 1 + \frac{write_{mr}}{100} \right],$$

$$E_{c \to m} = E_{m} \cdot (n_{read} + n_{write}) \cdot \left[ 1 + \frac{total_{mr}}{100} \right],$$

$$E_{mp} = E_{idle} \cdot (n_{read} + n_{write}) \cdot \left[ \frac{P_{miss} \cdot total_{mr}}{100} \right], \tag{2}$$

where n_read is the number of read accesses, n_write the number of write accesses, E_dyn.read the total dynamic read energy for all banks, E_dyn.write the total dynamic write energy for all banks, E_m the energy consumed per memory access, E_idle the per-cycle idle mode energy consumption of the processor, read_mr, write_mr, and total_mr the read, write, and total miss ratios (in percent), and P_miss the miss penalty (in number of stall cycles).

The idle mode leakage energy of the processor, E_leak, can be calculated as

$$E_{leak} = P_{leak} \cdot t_{idle}, \tag{3}$$

where t_idle is the total time in seconds for which the processor was idle.
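Transcribed directly into Python, equations (1)-(3) read as the following sketch; miss rates are percentages as in the text, energies are in joules.

def cache_energy(n_read, n_write, read_mr, write_mr, total_mr,
                 E_dyn_read, E_dyn_write, E_m, E_idle, P_miss,
                 P_leak, t_idle, E_misc=0.0):
    """Total code energy E_total per equations (1)-(3).
    e.g. E_dyn_read=2.24e-10 J and E_dyn_write=3.89e-11 J from Table 2,
    E_idle=4.12e-9 J from Table 1."""
    E_read  = n_read  * E_dyn_read  * (1 + read_mr  / 100)
    E_write = n_write * E_dyn_write * (1 + write_mr / 100)
    E_c2m   = E_m    * (n_read + n_write) * (1 + total_mr / 100)   # cache-to-memory
    E_mp    = E_idle * (n_read + n_write) * (P_miss * total_mr / 100)  # miss penalty
    E_leak  = P_leak * t_idle                                      # equation (3)
    return E_read + E_write + E_c2m + E_mp + E_leak + E_misc       # equation (1)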

3.2. Throughput Model. Due to the concurrent nature of cache-to-memory access time and cache access time, their overlap can be assumed. If t_cache is the time taken for cache operations, t_ins the time taken in the execution of cache access instructions (s), t_mp the time miss penalty, and t_misc the time taken while executing other instructions which do not require data memory access, then the total time taken by an application with a data cache can be estimated as

$$T_{total} = t_{cache} + t_{ins} + t_{mp} + t_{misc}. \tag{4}$$



Figure 2: Energy consumption for write-through cache.


Figure 3: Energy consumption for write-back cache.

Furthermore,

$$t_{cache} = t_{c} \cdot (n_{read} + n_{write}) \cdot \left[ 1 + \frac{total_{mr}}{100} \right],$$

$$t_{ins} = \left( t_{cycle} - t_{c} \right) \cdot (n_{read} + n_{write}),$$

$$t_{mp} = t_{cycle} \cdot (n_{read} + n_{write}) \cdot \left[ \frac{P_{miss} \cdot total_{mr}}{100} \right], \tag{5}$$

where t_c is the time taken per cache access and t_cycle is the processor cycle time in seconds (s).
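The throughput model of (4)-(5) transcribes in the same way (again a sketch, with times in seconds):

def cache_time(n_read, n_write, total_mr, t_c, t_cycle, P_miss, t_misc=0.0):
    """Total application time T_total per equations (4)-(5);
    total_mr is a percentage, P_miss a number of stall cycles."""
    accesses = n_read + n_write
    t_cache = t_c * accesses * (1 + total_mr / 100)      # cache operations
    t_ins   = (t_cycle - t_c) * accesses                 # cache access instructions
    t_mp    = t_cycle * accesses * (P_miss * total_mr / 100)  # miss penalty
    return t_cache + t_ins + t_mp + t_misc               # equation (4)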

4. The Experimental Environment and Results

To analyze and validate the aforementioned models, SIMICS [25], a full system simulator, was used. An IBM/AMCC PPC440GP [26] evaluation board model was used as the target platform and a MontaVista Linux 2.1 kernel was used as the target application to evaluate the models. A generic 32-bit data cache was included in the processor model, and results were analyzed by varying associativity, write policy, and replacement policy. The cache read and write miss penalty was fixed at 5 cycles. The processor input parameters are defined in Table 1.

As SIMICS could only provide timing information for the model, processor power consumption data such as idle mode energy (E_idle) and leakage power (P_leak) were taken from


Figure 4: Throughput for write-through cache.


Figure 5: Throughput for write-back cache.


Figure 6: Simulated and predicted energy consumption, varying cache size and block size (see Table 3).

the PPC440GP datasheet [26], and cache energy and timing parameters such as the dynamic read and write energy per cache access (E_dyn.read, E_dyn.write) and the cache access time (t_c) were taken from the CACTI [8] cache simulator (see Table 2). For other parameters such as the number of memory reads/writes and the read/write/total miss rates (n_read, n_write, read_mr, write_mr, total_mr), the SIMICS cache profiler statistics were used. The cache-to-memory access energy (E_m) was assumed to be half the per-cycle energy consumption of the processor. The


Table 3: Iteration definition for varying Block Size and Cache Size.

Block size \ Cache size   1 KB   2 KB   4 KB   8 KB   16 KB   32 KB
64 bytes                    1      2      3      4      5       6
128 bytes                   7      8      9     10     11      12
256 bytes                  13     14     15     16     17      18
512 bytes                  19     20     21     22     23      24

Table 4: Cache Simulator Data for various Iterations.

CACTI data

Iteration  Associativity  Block size (bytes)  Number of lines  Cache size (bytes)  Access time (ns)  Cycle time (ns)  Read energy (nJ)  Write energy (nJ)

1 0 64 16 1024 2.15 0.7782 0.160524 0.0918

2 0 128 8 1024 2.47 1.182 0.126 0.0695

3 0 256 4 1024 3.639 2.394 0.135 0.063

4 0 512 2 1024 8.185 6.955 0.171 0.068

5 0 64 32 2048 2.368 0.818 0.265 0.142

6 0 128 16 2048 2.58 1.206 0.186 0.095

7 0 256 8 2048 3.706 2.42 0.183 0.0755

8 0 512 4 2048 8.23 6.975 0.213 0.075

9 0 64 64 4096 2.2055 0.778 0.593 0.404

10 0 128 32 4096 2.802 1.25 0.307 0.145

11 0 256 16 4096 3.84 2.46 0.28 0.1

12 0 512 8 4096 8.316 7.016 0.298 0.087

13 0 64 128 8192 2.422 0.8175 0.96 0.5988

14 0 128 64 8192 2.633 1.206 0.619 0.407

15 0 256 32 8192 4.085 2.529 0.474 0.151

16 0 512 16 8192 8.48 7.09176 0.468 0.1125

17 0 64 256 16384 2.85 0.88 1.7 0.988

18 0 128 128 16384 2.8559 1.251 1.0049 0.602

19 0 256 64 16384 3.888 2.4557 0.834 0.413

20 0 512 32 16384 8.533 7.092 0.77 0.254

21 0 64 512 32768 3.783 0.985 3.177 1.7661

22 0 128 256 32768 3.3 1.33 1.776 0.991

23 0 256 128 32768 4.14 2.53 1.413 0.608

24 0 512 64 32768 8.534 7.092 1.263 0.4247

simulated energy consumption was obtained by multiplying the per-cycle energy consumption, as per the datasheet specification, by the number of cycles executed in the target application.

The results for the energy and timing models are presented in Figures 2, 3, 4, and 5. From the graphs, it can be inferred that the average error of the energy model for the given parameters is approximately 5% and that of the timing model is approximately 4.8%. This is also reinforced by the specific results for the benchmark applications, that is, BasicMath, QuickSort, and CRC32 from the MiBench benchmark suite [27], obtained while varying cache size and block size using a direct-mapped cache, shown in Figures 6 and 7. The definition of each iteration for the various cache and block sizes is given in Table 3, and the cache simulator data are given in Table 4.

5. Design Space Exploration

The validation of the models opens an opportunity to employ them in a variety of applications. One such application could be a design exploration to find the optimal cache



Figure 7: Simulated and predicted throughput, varying cache size and block size (see Table 3).


Figure 8: Proposed design cycle for optimization of cache and application code.

configuration for a set energy budget or timing requirement. A typical approach for design exploration to identify the optimal cache configuration and code profile is shown in Figure 8. At first, miss rate prediction is carried out on the compiled code and preliminary cache parameters. Then several iterations may be performed to fine-tune the software to reduce miss rates. Subsequently, the tuned software goes through the profiling step. The information from the cache modeller and the code profiler is then fed to the energy and throughput models. If the given energy budget along with the throughput requirements is not satisfied, the cache parameters are changed and the same procedure is repeated. This strategy can be adopted at design time to optimize the cache configuration and decrease the miss rate of a particular application code.
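The loop below renders this flow schematically in Python, reusing the cache_energy and cache_time sketches given earlier; the per-configuration figures are assumed to be precomputed by the profiler and the cache modeller, so the names here are placeholders rather than a real tool API.

def explore(candidates, E_budget, T_budget):
    """candidates: dict mapping a configuration label to the keyword
    parameters of the model sketches above (counts and miss rates from
    the code profiler, energies and latencies from e.g. CACTI)."""
    for label, p in candidates.items():
        E = cache_energy(p["n_read"], p["n_write"], p["read_mr"],
                         p["write_mr"], p["total_mr"], p["E_dyn_read"],
                         p["E_dyn_write"], p["E_m"], p["E_idle"],
                         p["P_miss"], p["P_leak"], p["t_idle"])
        T = cache_time(p["n_read"], p["n_write"], p["total_mr"],
                       p["t_c"], p["t_cycle"], p["P_miss"])
        if E <= E_budget and T <= T_budget:   # requirements fulfilled?
            return label                      # cache configuration retained
    return None   # no candidate fits: change cache parameters and repeat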

6. Conclusion

In this paper straightforward mathematical models were presented with a typical accuracy of 5% when compared to SIMICS timing results and the per-cycle energy consumption of the PPC440GP processor. The model-based approach presented here is therefore a valid tool to predict the processor's performance with sufficient accuracy, which would clearly facilitate executing these models in a system in order to adapt its own configuration during the actual operation of the processor. Furthermore, an example application for design exploration was discussed that could facilitate the identification of an optimal cache configuration and code profile for a target application. In future work the presented models are to be analyzed for multicore processors and further extended to incorporate multilevel cache systems.

Acknowledgment

The authors would like to thank the anonymous reviewers for their very insightful feedback on earlier versions of this manuscript.

References

[1] “Embedded processors top 10 billion units in 2008,” VDC Research, 2009.

[2] “MIPS charges into 32bit MCU fray,” EETimes Asia, 2007.

[3] A. M. Holberg and A. Saetre, “Innovative techniques for extremely low power consumption with 8-bit microcontrollers,” White Paper 7903A-AVR-2006/02, Atmel Corporation, San Jose, Calif, USA, 2006.

[4] W.-T. Shiue and C. Chakrabarti, “Memory exploration for low power, embedded systems,” in Proceedings of the 36th Annual ACM/IEEE Design Automation Conference, pp. 140–145, New Orleans, La, USA, 1999.

[5] P. R. Panda, N. D. Dutt, and A. Nicolau, “Architectural exploration and optimization of local memory in embedded systems,” in Proceedings of the 10th International Symposium on System Synthesis, pp. 90–97, Antwerp, Belgium, 1997.

[6] P. R. Panda, N. D. Dutt, and A. Nicolau, “Data cache sizing for embedded processor applications,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 925–926, Paris, France, 1998.

[7] M. B. Kamble and K. Ghose, “Analytical energy dissipation models for low power caches,” in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 143–148, Monterey, Calif, USA, August 1997.

[8] D. Tarjan, S. Thoziyoor, and N. P. Jouppi, “CACTI 4.0,” Tech. Rep., HP Laboratories, Palo Alto, Calif, USA, 2006.

[9] T. Simunic, L. Benini, and G. De Micheli, “Cycle-accurate simulation of energy consumption in embedded systems,” in Proceedings of the 36th Annual ACM/IEEE Design Automation Conference, pp. 867–872, New Orleans, La, USA, 1999.

[10] ARM Software Development Toolkit Version 2.11, Advanced RISC Machines Ltd (ARM), 1996.

[11] G. Q. Maguire, M. T. Smith, and H. W. P. Beadle, “SmartBadges: a wearable computer and communication system,” in Proceedings of the 6th International Workshop on Hardware/Software Codesign (CODES/CASHE ’98), Seattle, Wash, USA, March 1998.

[12] Y. Li and J. Henkel, “A framework for estimating and minimizing energy dissipation of embedded HW/SW systems,” in Proceedings of the 35th Annual Design Automation Conference, pp. 188–193, San Francisco, Calif, USA, 1998.

[13] V. Tiwari, S. Malik, and A. Wolfe, “Power analysis of embedded software: a first step towards software power minimization,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 4, pp. 437–445, 1994.

[14] V. Tiwari and M. T.-C. Lee, “Power analysis of a 32-bit embedded microcontroller,” in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC ’95), pp. 141–148, Chiba, Japan, August-September 1995.

[15] T. Wada, S. Rajan, and S. A. Przybylski, “An analytical access time model for on-chip cache memories,” IEEE Journal of Solid-State Circuits, vol. 27, no. 8, pp. 1147–1156, 1992.

[16] T. M. Taha and D. S. Wills, “An instruction throughput model of superscalar processors,” IEEE Transactions on Computers, vol. 57, no. 3, pp. 389–403, 2008.

[17] T. Austin, E. Larson, and D. Ernst, “SimpleScalar: an infrastructure for computer system modeling,” Computer, vol. 35, no. 2, pp. 59–67, 2002.

[18] C. Ding and Y. Zhong, “Predicting whole-program locality through reuse distance analysis,” ACM SIGPLAN Notices, vol. 38, no. 5, pp. 245–257, 2003.

[19] Y. Zhong, S. G. Dropsho, X. Shen, A. Studer, and C. Ding, “Miss rate prediction across program inputs and cache configurations,” IEEE Transactions on Computers, vol. 56, no. 3, pp. 328–343, 2007.

[20] K. Beyls and E. H. D’Hollander, “Platform-independent cache optimization by pinpointing low-locality reuse,” in Proceedings of the 4th International Conference on Computational Science (ICCS ’04), vol. 3038 of Lecture Notes in Computer Science, pp. 448–455, Springer, May 2004.

[21] K. Beyls, E. H. D’Hollander, and F. Vandeputte, “RDVIS: a tool that visualizes the causes of low locality and hints program optimizations,” in Proceedings of the 5th International Conference on Computational Science (ICCS ’05), vol. 3515 of Lecture Notes in Computer Science, pp. 166–173, Springer, Atlanta, Ga, USA, May 2005.

[22] M. Y. Qadri and K. D. M. Maier, “Towards increased power efficiency in low end embedded processors: can cache help?” in Proceedings of the 4th UK Embedded Forum, Southampton, UK, 2008.

[23] M. Y. Qadri and K. D. M. Maier, “Data cache-energy and throughput models: a design exploration for overhead analysis,” in Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP ’08), Brussels, Belgium, 2008.

[24] M. Y. Qadri, H. S. Gujarathi, and K. D. M. Maier, “Low power processor architectures and contemporary techniques for power optimization—a review,” Journal of Computers, vol. 4, no. 10, pp. 927–942, 2009.

[25] P. S. Magnusson, M. Christensson, J. Eskilson, et al., “Simics: a full system simulation platform,” Computer, vol. 35, no. 2, pp. 50–58, 2002.

[26] “PowerPC440GP datasheet,” AMCC, 2009.

[27] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, “MiBench: a free, commercially representative embedded benchmark suite,” in Proceedings of the IEEE International Workshop on Workload Characterization (WWC ’01), pp. 3–14, IEEE Computer Society, Austin, Tex, USA, December 2001.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 737689, 15 pages
doi:10.1155/2009/737689

Research Article

Hardware Architecture for Pattern Recognition in Gamma-Ray Experiment

Sonia Khatchadourian,1 Jean-Christophe Prevotet,2 and Lounis Kessal1

1 ETIS, CNRS UMR 8051, ENSEA, University of Cergy-Pontoise, 6, Avenue du Ponceau, 95014 Cergy-Pontoise, France
2 IETR, CNRS UMR 6164, INSA de Rennes, 20, Avenue des buttes de Coesmes, 35043 Rennes, France

Correspondence should be addressed to Sonia Khatchadourian, [email protected]

Received 19 March 2009; Accepted 21 July 2009

Recommended by Ahmet T. Erdogan

The HESS project has been running successfully for seven years. In order to take into account the sensitivity increase of the entire project in its second phase, a new trigger scheme is proposed. This trigger is based on a neural system that extracts the interesting features of the incoming images and rejects the background more efficiently than classical solutions. In this article, we present the basic principles of the algorithms as well as their hardware implementation in FPGAs (Field Programmable Gate Arrays).

Copyright © 2009 Sonia Khatchadourian et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

For many years, the study of gamma photons has led scientists to understand more deeply the complex processes that occur in the Universe, for example, remnants of supernova explosions, cosmic-ray interactions with interstellar gas, and so forth. In the 1960s, it finally became possible to develop efficient measuring instruments to detect gamma-ray emissions, thus enabling the validation of the theoretical concepts. Most of these instruments were built in order to identify the direction of gamma rays. Since gamma photons are not deflected by interstellar magnetic fields, it becomes possible to determine the position of the source accurately. In this context, Imaging Atmospheric Cherenkov Telescopes constitute the most sensitive technique for the observation of high-energy gamma rays. Such telescopes provide a large effective collection area and achieve excellent angular and energy resolution for detailed studies of cosmic objects. The technique relies upon the Cherenkov light produced by the secondary particles once the gamma ray interacts with the atmosphere at about 10 km of altitude. The result is a shower of secondary particles that may also interact with the atmosphere, producing other particles according to well-known physical rules. By detecting shower particles (electrons, muons, protons), it is then possible to reconstruct the initial event and determine the precise location of a source within the Universe.

In order to determine the nature of the shower, it is important to analyze its composition, that is, to determine the types of particles that have been produced during the interaction with the atmosphere. This is performed by studying the different images that are collected by the telescopes and that are generally representative of the particle type. For example, gamma-ray showers usually have thin, high-density structures. On the other hand, proton showers are quite broad with low density.

The major problem in these experiments is that the number of images to be collected is generally huge and the complete storage of all events is impossible. This is mainly due to the fact that data-storage capacity is limited and that it is impossible to keep track of all incoming images for off-line analysis.

In order to circumvent this issue, a trigger system is often used to select the events that are interesting (from a physicist’s point of view). This processing must be performed in real time and is very tightly constrained in terms of latency, since it must be compatible with the data acquisition rate of the cameras. The role of such a triggering system is to rapidly decide whether an event is to be recorded for further studies or rejected by the system.


The organization of this paper is as follows: the context of our work is presented in Section 2. Section 3 describes the algorithms that are envisaged in order to build a new trigger system. Considerations on hardware implementations are then provided in Section 4 and Section 5 describes the results in terms of timing and resource usage.

2. The HESS Project

The High-Energy Stereoscopic System (HESS) is a system of imaging Cherenkov telescopes that strives to investigate cosmic gamma rays in the 100 GeV to 100 TeV energy range [1]. It is located in Namibia at an altitude of 1800 m where the optical quality is excellent. Phase-I of this project went into operation in Summer 2002 and consists of four Large Cherenkov Telescopes (LCT), each of 107 m² mirror area, in order to provide good stereoscopic viewing of the air showers. The telescopes are arranged on a square of 120 m sides, enabling the collection area to be optimized.

The cameras of the four telescopes serve to capture and record the Cherenkov images of air showers. They have excellent resolution since the pixel size is very small: each camera is equipped with 960 photomultiplier tubes (PMTs) that are assimilated to pixels.

An efficient trigger scheme has also been designed in order to reject background such as the light of the night sky that interferes with measurements. The next sections describe both phases of the project in terms of triggering issues.

2.1. Phase-I. The trigger system of the HESS Phase-I project is devised in order to make use of the stereoscopic approach: simultaneous observation of interesting images is required in order to store a specific event [2]. This coincidence requirement reduces the rate of background events, that is, events that may be assimilated to night-sky noise. It is composed of two separate levels (L1 and the central trigger).

At the first level, a basic threshold is applied to the signals collected by the camera. A trigger occurs if the signals in M pixels within a 64-pixel sector of the camera exceed a value of N photoelectrons. This makes it possible to get rid of isolated pixels and thus to eliminate the noise. The pixel signals are sampled using 1 GHz Analogue Ring Samplers (ARSs) [3] with a ring buffer depth of 128 cells. Following a camera trigger, the ring buffer is stopped and its content is digitized, summed, and written into an FPGA buffer. After read-out, the camera is ready for the next event, and further processing may be performed, including the transmission of data via optical cable to the PC processor farm located in the control building.

The Central Trigger System (CTS) implements the coincidence between the four telescopes. It identifies the status of the telescopes and writes this information as well as an absolute time (measured by a GPS) into a FIFO (First-In First-Out) memory for each system coincidence. Once the data have been written into this FIFO, the CTS is ready to process new incoming events, about 330 nanoseconds after the coincidence occurred. The FIFO memory has a depth of


Figure 1: Schematic of the HESS Trigger System.


Figure 2: Schematic of the VLCT Trigger System.

16000 events and is read out asynchronously. A schematic illustration of the HESS trigger system is depicted in Figure 1.

2.2. Phase-II. Since its inception in 2002, the HESS project keeps on delivering very significant results. In this very promising context, researchers of the collaboration have decided to improve the initial project by adding a new Very Large Central Telescope (VLCT) in the middle of the four existing ones. This new telescope should permit an increase in the sensitivity of the global system as well as improved resolution for high-energy particles. It is composed of 2048 pixels which represent the energy of the incident event.

With the new approach, the quantity of data to be collected would drastically increase, and it becomes necessary to build a new trigger system in order to be compatible with the new requirements of the project.

One of the most challenging objectives of the HESS project is to detect particles whose energy is below 50 GeV. In this energy range, it is not conceivable to use all telescopes (since the smallest ones cannot trigger), and only the fifth telescope may be used, in a monoscopic mode.

The structure of the new triggering system is depicted in Figure 2. Data coming from the VLCT camera consist of 2048 pixel values which are first stored in a Serial Analog Memory (SAM). In parallel, data are also sent to a level 1 trigger (L1) whose structure is described in Section 2.1. The L1 trigger applies a basic analog threshold on the pixel values and generates a binary signal indicating whether an event has to be kept (L1accept) or rejected (L1reject). In the case where an event is accepted, the entire image is converted into digital patterns. These data are stored in FIFO memories until a L2accept/L2reject signal coming from a second level trigger (L2) is generated.



Figure 3: Gamma (1–4), muon (5-6), and proton (7-8) images of different energies.

In parallel, data are sent to the PreL2 stage, which thresholds the incoming pixels according to 3 energy levels. Each pixel value is coded into 2 bits corresponding to 3 states of energy. These images are then sent to the L2 trigger. L1 and L2 trigger decisions are expected at average rates of 100 kHz and 3.5 kHz, respectively. Examples of simulated images are depicted in Figure 3.
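For illustration, the PreL2 quantization can be sketched as follows; the two threshold values are placeholders, since the text only states that each pixel is coded on 2 bits into 3 energy states.

def prel2(pixels, t_low=4.0, t_high=7.0):
    """Map each pixel value (photoelectrons) to a 2-bit code in {0, 1, 2};
    t_low and t_high are assumed thresholds for the 3 energy states."""
    return [0 if p < t_low else 1 if p < t_high else 2 for p in pixels]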

3. The HESS2 L2 Triggering System

In order to cope with the new performance of the HESS Phase-II system, an efficient L2 trigger scheme is currently being built. Like all triggers, it aims to provide a decision


Figure 4: Overview of Hillas moments.

regarding the interest of a particular event. In this context, two parallel studies have been led in order to identify the best algorithms to implement at that level. The first study relied on the Hillas parameters, which are seen as a classical solution in astrophysics pattern recognition. The second study that has been envisaged is to use pattern recognition tools such as neural networks associated with an intelligent preprocessing. Both approaches are described in the next sections.

3.1. The First Approach

3.1.1. Hillas Parameters. The Hillas parametrization was introduced in [4]. The retained method consists in isolating image descriptors which are based on image shape parameters such as length (L) and width (W) as well as an angle (α). The α angle represents the angle of the image with the direction of the emitting source location (see Figure 4). This approach globally considers that gamma signatures are mainly elliptical in shape whereas other particles’ signatures are more irregular. This assumption often holds in practice. Nevertheless, signatures strongly depend on the distance between the impact point of the ray shower and the telescope. This may lead to various types of images for the same event nature and constitutes a real challenge for identification (see Figure 3).

3.1.2. The Classifier. In this first approach, the classifier consists in applying thresholds to the Hillas parameters (or a combination of these parameters) computed on the incoming images in order to distinguish gamma signatures among all collected images. One of the best parameters that has been identified as a good discriminator is the Center of Gravity (CoG). This parameter represents the center of gravity of all illuminated pixels within the ellipse.

In this case, the recognition of particles is performed according to the following rules:

(a) if CoG < t, then the event is recognized as a gamma particle;

(b) if CoG ≥ t and α < 20 deg, then the event is recognized as a gamma particle;

(c) otherwise, the event is rejected.

t is a parameter which is adjusted according to the dataset.
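Written out in Python, the rule reads as below (a sketch: t and the 20-degree cut come from the text, while CoG and alpha are assumed to be precomputed Hillas parameters).

def is_gamma(cog, alpha_deg, t):
    """Threshold classifier over the Hillas parameters."""
    if cog < t:              # rule (a)
        return True
    if alpha_deg < 20.0:     # rule (b): CoG >= t but small alpha
        return True
    return False             # rule (c): event rejected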


The major drawback of such an approach is that the considered thresholds are constant values. Thus, a lack of flexibility is to be deplored. For example, it does not allow the various conditions of the experiment, which may have a significant impact on the shape of signatures, to be taken into consideration.

3.2. Intelligent Preprocessing. The second studied approach aims to make use of algorithms that have already brought significant results in terms of pattern recognition. Neural networks are good candidates because they are a powerful computational model. On the other hand, their inherent parallelism makes them suitable for a hardware implementation. Although used in different fields of physics, algorithms based on neural networks have successfully been implemented and have already proved their efficiency [5, 6]. Typical applications include particle recognition in tracking systems, event classification problems, off-line reconstruction of events, and online triggering in High-Energy Physics.

From the assumption that neural networks may be useful in such experiments, we have proposed a new Level 2 (L2) trigger system enabling rather complex processing to be implemented on the incoming images. The major issue with neural networks resides in the learning phase, which strives to identify optimal parameters (weights) in order to solve the given problem. This is true when considering supervised learning, in which representative patterns have to be iteratively presented to the network in a first learning phase until the global error has reached a predefined value.

One of the most important drawbacks of this type of algorithm is that the number of weights strongly depends on the dimensionality of the problem, which is often unknown in practice. This implies finding the optimal structure of the network (number of neurons, number of layers) in order to solve the problem.

Moreover, the curse of dimensionality [7] constitutes another challenge when dealing with neural networks. It relates the size of the network to the number of examples to provide: the number of training examples required grows exponentially with the size of the network, so that for large networks it becomes prohibitive in practice.

In order to reduce the size of the network, it is possible to simplify its task, that is, to reduce the dimensionality of the problem. In this case, a preprocessing step aims at finding correlations in the data and at applying basic transformations in order to ease the resolution. In this study, we advise using an "intelligent" preprocessing based on the extraction of the intrinsic features of the incoming images.

The structure of the proposed L2 trigger is depicted in Figure 5. It is divided into three stages. A rejection step aims to eliminate isolated pixels and small images that cannot be processed by the system. A second step consists in applying a preprocessing to the incoming data. Finally, the classifier takes the decision according to the nature of the event to identify. These different steps are described in the following sections.

Figure 5: Schematic of the HESS Phase-II Trigger System.

3.2.1. The Rejection Step. The rejection step has two significant roles. First, it aims to remove isolated pixels that are typically due to background. These pixels are eliminated by applying a filtering mask on the entire image in order to keep only the relevant information, that is, clusters of pixels. This consists in testing the neighborhood of each pixel of the image. As the image has a hexagonal mesh grid, a hexagonal neighborhood is used: the direct neighborhood of each pixel of the image is tested, and if none of its 6 neighbors is activated, the corresponding central pixel is considered isolated and is deactivated. Second, the rejection step eliminates events that cannot be distinguished by the classifier: very small images (<4 pixels) are discarded since they contain too little information to be deciphered.
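For illustration, a minimal software model of this filtering could look as follows; the odd-row shift of the hexagonal indexing is our assumption, since the camera geometry is not specified at this level of detail:

    def reject_step(active):
        """Keep only pixels with at least one active hexagonal neighbour,
        then discard events with fewer than 4 surviving pixels.
        `active` is a set of (row, col) indices; odd rows are assumed
        to be shifted half a pixel to the right."""
        EVEN = ((-1, -1), (-1, 0), (0, -1), (0, 1), (1, -1), (1, 0))
        ODD = ((-1, 0), (-1, 1), (0, -1), (0, 1), (1, 0), (1, 1))
        kept = set()
        for r, c in active:
            neigh = ODD if r % 2 else EVEN
            if any((r + dr, c + dc) in active for dr, dc in neigh):
                kept.add((r, c))
        return kept if len(kept) >= 4 else None   # None: event rejected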

3.2.2. The Preprocessing Step. The envisaged system is based on a preprocessing step whose role consists in applying basic transformations to incoming images in order to isolate the main characteristics of a given image. The most important role of the preprocessing is to guarantee invariance in orientation (rotation and translation) of the incoming images. Since the signature of a particle within the image depends on the impact point of the incident particle, the image may consist of a series of pixels located anywhere on the telescope's camera. Without a preprocessing stage providing orientation invariance, the 2048 inputs of the classifier would completely differ from one image to another even though the basic shape of the particle signature would remain the same.

The retained preprocessing is based on the use of Zernike moments. These moments are mainly used in shape reconstruction [8] and can easily be made invariant to changes in object orientation. They are defined as a set of orthogonal functions based on complex polynomials originally introduced in [9]. Zernike polynomials can be expressed as

V_{pq}(r, \theta) = R_{pq}(r)\, e^{iq\theta},   (1)

where i = \sqrt{-1}; p is a nonnegative integer; q is an integer such that p - |q| is even and |q| \le p; r is the length of the vector from the origin to the point (x, y), with r \le 1, that is, r = \sqrt{x^2 + y^2}/r_{max} where r_{max} = \max\sqrt{x^2 + y^2}; and \theta is the angle between the x axis and the vector extending from the origin to the point (x, y).

R_{pq}(r) is the Zernike radial polynomial, defined as

R_{pq}(r) = \sum_{k=q,\ |p-k|\ \mathrm{even}}^{p} B_{pqk}\, r^k   (2)

with

B_{pqk} = (-1)^{(p-k)/2} \frac{((p+k)/2)!}{((p-k)/2)!\ ((k+|q|)/2)!\ ((k-|q|)/2)!}.

Zernike moments Z_{pq} are expressed according to

Z_{pq} = \frac{p+1}{\pi} \sum_x \sum_y I(x, y)\, V^{*}_{pq}(r, \theta),   (3)

where I(x, y) refers to the value of the pixel of coordinates (x, y). The rotation invariance property of Zernike moments is due to the intrinsic nature of such moments. In order to guarantee translation invariance as well, it is necessary to align the center of the object with the center of the unit circle. This may be performed by replacing the coordinates x and y of each processed point by x - x_0 and y - y_0, where x_0 and y_0 refer to the center of the signature and may be obtained by

x_0 = \frac{\sum_X \sum_Y x\, I(x, y)}{\sum_X \sum_Y I(x, y)}, \qquad y_0 = \frac{\sum_X \sum_Y y\, I(x, y)}{\sum_X \sum_Y I(x, y)}.   (4)

In this case, r is expressed as follows:

r = \frac{\sqrt{(x - x_0)^2 + (y - y_0)^2}}{r_{max}},   (5)

where r_{max} = \max\sqrt{(x - x_0)^2 + (y - y_0)^2}.

In the context of our application, it has been found that considering the first 8-order Zernike moments is sufficient to obtain the best performance. This implies computing 25 polynomials for each pixel of an image. These values are then accumulated in order to obtain 25 real values corresponding to all the Zernike moments of the image. The modulus of each moment is then computed, and a normalization step scales the obtained values within the -1 to 1 range. These values are then provided to the neural network.
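A direct, non-optimized software sketch of (1)-(5) is given below; it is only meant to make the definitions concrete, the function names are ours, and the hardware implementation of Section 4 proceeds differently, via accumulation moments:

    import math

    def b_coeff(p, q, k):
        """B_pqk from (2)."""
        return ((-1) ** ((p - k) // 2) * math.factorial((p + k) // 2) /
                (math.factorial((p - k) // 2) *
                 math.factorial((k + abs(q)) // 2) *
                 math.factorial((k - abs(q)) // 2)))

    def zernike_moment(pixels, p, q):
        """Direct evaluation of (3)-(5); `pixels` is a list of (x, y, I)."""
        total = sum(v for _, _, v in pixels)
        x0 = sum(x * v for x, y, v in pixels) / total   # centroid, (4)
        y0 = sum(y * v for x, y, v in pixels) / total
        rmax = max(math.hypot(x - x0, y - y0) for x, y, _ in pixels) or 1.0
        z = 0j
        for x, y, v in pixels:
            r = math.hypot(x - x0, y - y0) / rmax       # normalized radius, (5)
            theta = math.atan2(y - y0, x - x0)
            radial = sum(b_coeff(p, q, k) * r ** k
                         for k in range(q, p + 1) if (p - k) % 2 == 0)
            # V*_pq(r, theta) = R_pq(r) * exp(-i q theta)
            z += v * radial * complex(math.cos(q * theta), -math.sin(q * theta))
        return (p + 1) / math.pi * z

    # The first 8-order moments: all (p, q) with q <= p and p - q even
    orders = [(p, q) for p in range(9) for q in range(p + 1) if (p - q) % 2 == 0]
    assert len(orders) == 25   # the 25 values fed to the neural network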

3.2.3. The Neural Classifier. The envisaged classifier is a feed-forward neural network named Multilayer Perceptron (MLP) with one hidden layer and three outputs. The general structure of such a network is depicted in Figure 6.

According to the nature of the network, the value of output k may be computed according to (6):

y_k = g\left( \sum_{j=0}^{M} w^{(2)}_{kj}\, g\left( \sum_{i=0}^{d} w^{(1)}_{ji}\, x_i \right) \right).   (6)

In (6), y_k represents the output; w^{(2)}_{kj} and w^{(1)}_{ji}, respectively, represent the weights connecting the output layer to the hidden layer and the weights connecting the hidden layer to the input nodes. M is the number of neurons in the hidden layer and d is the number of inputs. x_i denotes the value of an input node and g is an activation function. In our case, the nonlinear function g has been used:

g(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}.   (7)

Figure 6: Structure of a 2-layer perceptron.

In the considered application, the output layer is composed of three neurons (whose values range between -1 and 1) corresponding to the type of particle to identify. Each output refers to a gamma, proton, and muon particle, respectively. If the value of an output neuron is positive, it may be assumed that the corresponding particle has been identified by the network. In case more than one output neuron is activated, the one with the maximum value is taken into account.

The learning phase has been performed off-line on a set of 4500 patterns computed on simulated images, as the HESS2 telescope is not yet installed. The simulated images are generated by means of a series of Monte Carlo simulations. These patterns covered all ranges of energies and types of particles; 1500 patterns were considered for each class of particles. A previous study had determined the reliability of the patterns in order to retain the most representative patterns that may be collected by the telescope.

A classical backpropagation algorithm has been run off-line in order to obtain the optimal values of the weights. The training has been performed simultaneously on two sets of patterns (a learning and a testing set). Once the error on the testing set reached its minimum, the training was stopped, ensuring that the weights had an optimal value.

The size of the input layer was determined according to the type of preprocessing that was envisaged. In the case of a Zernike preprocessing, this number has been set to 25 since it corresponds to the number of outputs furnished by the preprocessing step.

The number of hidden nodes (in the hidden layer) has been evaluated according to the results obtained on a specific validation set of patterns. This precaution ensures that the neural network is able to generalize on new data (i.e., that it has not simply memorized the training set).

3.3. Simulated Performances. The best performances that have been obtained are summarized in Table 1. They correspond to a trigger with a preprocessing based on the first 25 Zernike moments. Other results concerning different preprocessings have also been described in [10].


Table 1: Performances according to both approaches.

                   Gamma   Muon   Proton
Hillas approach     60%     56%    37%
Neural approach     95%     58%    41%

According to Table 1, the neural solution provides a significant improvement in classification compared to the classical method. This improvement comes from the larger dimensionality of the problem being taken into account: whereas the Hillas processing takes only five parameters into consideration, the number of inputs in the case of a neural preprocessing is 25. Moreover, while the Hillas approach only consists in applying hard "cuts" on predefined parameters, the neural approach is more flexible and provides nonlinear decision boundaries. It may be assumed that the considered neural network is capable of extracting the relevant information and of discriminating efficiently between all images. The major drawback of the neural approach is its relative complexity in terms of computation and hardware implementation. Although the Hillas algorithms may be implemented in software, it is impossible to implement both the neural network and the preprocessing step in the same manner. In this context, dedicated circuits have to be designed in order to comply with the strong timing constraints imposed by the entire system. In our case, an L2 decision has to be taken at a rate of 3.5 kHz, which corresponds to a timing constraint of 285 microseconds.

4. Hardware Implementation

The complete L2 trigger system is currently being built, making intensive use of reconfigurable technology. Components such as FPGAs constitute an attractive alternative to classical circuits such as ASICs (Application Specific Integrated Circuits). This type of reconfigurable circuit keeps improving in terms of speed and logic resources and is increasingly considered for deeply constrained applications.

4.1. Hardware Implementation of Zernike Moments. Although very efficient, Zernike moments are known for their computational complexity. Many solutions have been proposed for the fast computation of Zernike moments. Some algorithms are based on recursivity [11], on the reuse of previous parts of the computation [12], or on moment generators [13].

Since using a moment generator allows a reduction of the number of operations, we have decided to follow this approach, that is, to compute Zernike moments from accumulation moments.

Figure 7: Image topology in the L2-trigger of the HESS Phase-II project.

4.1.1. Zernike Moments via Accumulation Moments. The mechanism of a moment generator [14] can be summarized by the expression of the geometric moments with respect to the point of coordinates (N_x, N_y) from the accumulation moments:

m^{N_x, N_y}_{p,q} = \sum_{x=0}^{N_x} \sum_{y=0}^{N_y} (N_x - x)^p (N_y - y)^q I(x, y) = \sum_{e=0}^{p} \sum_{f=0}^{q} S(p, e)\, S(q, f)\, \psi_{e,f}   (8)

with \psi_{e,f} being the accumulation moment of order (e, f), I(x, y) being the pixel values of the image, and

S = \begin{pmatrix}
 1 &   0 &    0 &    0 &    0 &   0 & \cdots \\
-1 &   1 &    0 &    0 &    0 &   0 & \\
 1 &  -3 &    2 &    0 &    0 &   0 & \\
-1 &   7 &  -12 &    6 &    0 &   0 & \\
 1 & -15 &   50 &  -60 &   24 &   0 & \\
-1 &  31 & -180 &  390 & -360 & 120 & \\
\vdots & & & & & & \ddots
\end{pmatrix}.   (9)

According to (8), it is important to note that geometric moments may be expressed as a function of accumulation moments. In the context of our application, (10) to (23) demonstrate how to calculate the Zernike moments from the geometric moments and thus from accumulation moments.
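To make the link in (8) concrete, the following sketch models the cascade of accumulators in closed form: an e-fold running sum over x (and f-fold over y) weights pixel (x, y) by C(N_x - x + e, e) C(N_y - y + f, f), and the matrix S of (9) then recovers the geometric moments. The function names and the small self-test are ours:

    import math, random

    # Matrix S from (9); row p satisfies (N-x)^p = sum_e S[p][e]*C(N-x+e, e)
    S = [[ 1,   0,    0,    0,    0,   0],
         [-1,   1,    0,    0,    0,   0],
         [ 1,  -3,    2,    0,    0,   0],
         [-1,   7,  -12,    6,    0,   0],
         [ 1, -15,   50,  -60,   24,   0],
         [-1,  31, -180,  390, -360, 120]]

    def accumulation_moments(img, nx, ny, order):
        """psi_{e,f}: closed form of a cascade of accumulators over x and y."""
        return [[sum(math.comb(nx - x + e, e) * math.comb(ny - y + f, f) * img[x][y]
                     for x in range(nx + 1) for y in range(ny + 1))
                 for f in range(order + 1)] for e in range(order + 1)]

    def geometric_moment(psi, p, q):
        """Equation (8): m_{p,q} = sum_{e,f} S(p,e) S(q,f) psi_{e,f}."""
        return sum(S[p][e] * S[q][f] * psi[e][f]
                   for e in range(p + 1) for q_f in [q] for f in range(q_f + 1))

    # Check against the direct definition of m_{p,q} on a random image
    nx, ny = 7, 5
    img = [[random.randint(0, 255) for _ in range(ny + 1)] for _ in range(nx + 1)]
    psi = accumulation_moments(img, nx, ny, 5)
    for p in range(6):
        for q in range(6):
            direct = sum((nx - x) ** p * (ny - y) ** q * img[x][y]
                         for x in range(nx + 1) for y in range(ny + 1))
            assert geometric_moment(psi, p, q) == direct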

Note that, in the particular case of HESS Phase-II, one of the issues is the image topology, which consists of a hexagonal grid with empty corners (see Figure 7). Since Zernike moments are continuous, they are particularly suitable for this type of image. The following equations aim to express the Zernike moments from the accumulation moments in the particular context of HESS Phase-II.

We have seen that Zernike moments may be expressed as follows:

Z_{pq} = \frac{p+1}{\pi} \sum_x \sum_y I(x, y) \sum_{k=q,\ p-k\ \mathrm{even}}^{p} B_{pqk}\, r^k e^{iq\theta}.   (10)

The expressions of r and e^{iq\theta} can be rewritten as

r = \frac{\sqrt{(x - x_0)^2 + (y - y_0)^2}}{r_{max}},   (11)

where (x_0, y_0) are the coordinates of the image center computed as explained in (4), and

e^{iq\theta} = \frac{((x - x_0) + i(y - y_0))^q}{((x - x_0)^2 + (y - y_0)^2)^{q/2}}.   (12)

In order to simplify the equations, we note X = x - x_0 and Y = y - y_0. In this case,

Z_{pq} = \frac{p+1}{\pi} \sum_x \sum_y I(x, y) \sum_{k=q,\ p-k\ \mathrm{even}}^{p} \frac{B_{pqk}}{r^k_{max}} (X^2 + Y^2)^{(k-q)/2} (X + iY)^q.   (13)

According to the binomial theorem, the expansion of a given polynomial can be expressed as follows:

(a + b)^n = \sum_{m=0}^{n} C^m_n\, a^{n-m} b^m.   (14)

It is then possible to expand the following expressions:

(X^2 + Y^2)^{(k-q)/2} = \sum_{\xi=0}^{(k-q)/2} C^{\xi}_{(k-q)/2}\, X^{k-q-2\xi} Y^{2\xi},

(X + iY)^q = \sum_{\zeta=0}^{q} C^{\zeta}_q\, i^{\zeta} X^{q-\zeta} Y^{\zeta}.   (15)

Thus, (13) can be reformulated as follows:

Z_{pq} = \frac{p+1}{\pi} \sum_{k=q,\ p-k\ \mathrm{even}}^{p} \frac{B_{pqk}}{r^k_{max}} \sum_{\zeta=0}^{q} \sum_{\xi=0}^{(k-q)/2} i^{\zeta} C^{\zeta}_q C^{\xi}_{(k-q)/2} \sum_x \sum_y X^{k-\zeta-2\xi} Y^{2\xi+\zeta} I(x, y)

     = \frac{p+1}{\pi} \sum_{k=q,\ p-k\ \mathrm{even}}^{p} (-1)^k \frac{B_{pqk}}{r^k_{max}} \sum_{\zeta=0}^{q} \sum_{\xi=0}^{(k-q)/2} i^{\zeta} C^{\zeta}_q C^{\xi}_{(k-q)/2} \sum_x \sum_y (x_0 - x)^{k-\zeta-2\xi} (y_0 - y)^{2\xi+\zeta} I(x, y).   (16)

The next step consists in introducing the last point (N_x, N_y) into the equation of the Zernike moments with respect to the center of the image:

Z_{pq} = \frac{p+1}{\pi} \sum_{k=q,\ p-k\ \mathrm{even}}^{p} (-1)^k \frac{B_{pqk}}{r^k_{max}} \sum_{\zeta=0}^{q} \sum_{\xi=0}^{(k-q)/2} i^{\zeta} C^{\zeta}_q C^{\xi}_{(k-q)/2} \sum_x \sum_y (x_0 - N_x + N_x - x)^{k-\zeta-2\xi} (y_0 + N_y - N_y - y)^{2\xi+\zeta} I(x, y)

     = \frac{p+1}{\pi} \sum_{k=q,\ p-k\ \mathrm{even}}^{p} (-1)^k \frac{B_{pqk}}{r^k_{max}} \sum_{\zeta=0}^{q} \sum_{\xi=0}^{(k-q)/2} i^{\zeta} C^{\zeta}_q C^{\xi}_{(k-q)/2} \sum_{a=0}^{k-\zeta-2\xi} C^a_{k-\zeta-2\xi}\, X_c^{k-\zeta-2\xi-a} \sum_{b=0}^{2\xi+\zeta} C^b_{2\xi+\zeta}\, Y_c^{2\xi+\zeta-b} \sum_x \sum_y (N_x - x)^a (-N_y - y)^b I(x, y),   (17)

where X_c = x_0 - N_x and Y_c = y_0 + N_y.

Since the coordinates of the pixels in the image are expressed as real numbers, we need to express these coordinates with integers in order to formulate the Zernike moments as a function of the geometric moments. As can be seen in Figure 7, the even rows have to be distinguished from the odd rows. Therefore, the x coordinate is expressed in two different ways according to the type of row (even or odd).

x and y may be expressed as

x = \begin{cases} (x_d - 0.5)\, p_x + \mathrm{offset}_x, & \text{if } y_d \% 2 = 1,\\ (x_d - 1)\, p_x + \mathrm{offset}_x, & \text{if } y_d \% 2 = 0, \end{cases} \qquad y = (1 - y_d)\, p_y + \mathrm{offset}_y,   (18)

where x_d and y_d are positive integers such that x_d = 1, ..., X_d and y_d = 1, ..., Y_d, with X_d = 48 corresponding to the number of columns and Y_d = 52 corresponding to the number of rows. p_x (resp., p_y) is the distance between two adjacent columns (resp., rows), and offset_x and offset_y correspond to the new position of the origin of the image in the upper left corner.
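A literal transcription of (18), with illustrative parameter names:

    def pixel_position(xd, yd, px, py, offset_x, offset_y):
        """Physical (x, y) of pixel (xd, yd) per (18); odd rows are offset
        by half a column pitch (xd = 1..48, yd = 1..52 for this camera)."""
        if yd % 2 == 1:                      # odd row
            x = (xd - 0.5) * px + offset_x
        else:                                # even row
            x = (xd - 1.0) * px + offset_x
        y = (1 - yd) * py + offset_y
        return x, y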


In the following equations, since the first part of the expressions does not change, only the second part is developed:

\sum_x \sum_y (N_x - x)^a (-N_y - y)^b I(x, y)
 = \sum_{x_d=1}^{X_d} \sum_{y_d=1,\ y_d\%2=1}^{Y_d} \big(N_x - ((x_d - 0.5) p_x + \mathrm{offset}_x)\big)^a \big(-N_y - ((1 - y_d) p_y + \mathrm{offset}_y)\big)^b I(x_d, y_d)
 + \sum_{x_d=1}^{X_d} \sum_{y_d=1,\ y_d\%2=0}^{Y_d} \big(N_x - ((x_d - 1) p_x + \mathrm{offset}_x)\big)^a \big(-N_y - ((1 - y_d) p_y + \mathrm{offset}_y)\big)^b I(x_d, y_d).   (19)

Noticing that N_x = p_x X_d and N_y = p_y Y_d,

\sum_x \sum_y (N_x - x)^a (-N_y - y)^b I(x, y)
 = \sum_{x_d=1}^{X_d} \sum_{y_d=1,\ y_d\%2=1}^{Y_d} \big((X_d - x_d) p_x + 0.5 p_x - \mathrm{offset}_x\big)^a \big((-Y_d + y_d) p_y - p_y - \mathrm{offset}_y\big)^b I(x_d, y_d)
 + \sum_{x_d=1}^{X_d} \sum_{y_d=1,\ y_d\%2=0}^{Y_d} \big((X_d - x_d) p_x + p_x - \mathrm{offset}_x\big)^a \big((-Y_d + y_d) p_y - p_y - \mathrm{offset}_y\big)^b I(x_d, y_d).   (20)

In this case, we propose to separate the even and odd parts of the image: let y_d = 2 y_{ed} if y_d \% 2 = 0 and y_d = 2 y_{od} - 1 if y_d \% 2 = 1. The sums on y_d are then bounded by Y_d/2 and

\sum_x \sum_y (N_x - x)^a (-N_y - y)^b I(x, y)
 = \sum_{x_d=1}^{X_d} \sum_{y_{od}=1}^{Y_d/2} \big((X_d - x_d) p_x + 0.5 p_x - \mathrm{offset}_x\big)^a \Big(-2 p_y \Big(\frac{Y_d}{2} - y_{od}\Big) - 2 p_y - \mathrm{offset}_y\Big)^b I(x_d, y_{od})
 + \sum_{x_d=1}^{X_d} \sum_{y_{ed}=1}^{Y_d/2} \big((X_d - x_d) p_x + p_x - \mathrm{offset}_x\big)^a \Big(-2 p_y \Big(\frac{Y_d}{2} - y_{ed}\Big) - p_y - \mathrm{offset}_y\Big)^b I(x_d, y_{ed})   (21)

 = \sum_{x_d=1}^{X_d} \sum_{y_{od}=1}^{Y_d/2} I(x_d, y_{od}) \sum_{c=0}^{a} C^c_a (0.5 p_x - \mathrm{offset}_x)^{a-c} p_x^c (X_d - x_d)^c \sum_{d=0}^{b} C^d_b (-2 p_y - \mathrm{offset}_y)^{b-d} (-2 p_y)^d \Big(\frac{Y_d}{2} - y_{od}\Big)^d
 + \sum_{x_d=1}^{X_d} \sum_{y_{ed}=1}^{Y_d/2} I(x_d, y_{ed}) \sum_{c=0}^{a} C^c_a (p_x - \mathrm{offset}_x)^{a-c} p_x^c (X_d - x_d)^c \sum_{d=0}^{b} C^d_b (-p_y - \mathrm{offset}_y)^{b-d} (-2 p_y)^d \Big(\frac{Y_d}{2} - y_{ed}\Big)^d   (22)

 = \sum_{c=0}^{a} \sum_{d=0}^{b} C^c_a M^{a-c}_{odd} N^c\, C^d_b O^{b-d}_{odd} P^d \sum_{x_d=1}^{X_d} \sum_{y_{od}=1}^{Y_d/2} (X_d - x_d)^c \Big(\frac{Y_d}{2} - y_{od}\Big)^d I(x_d, y_{od})
 + \sum_{c=0}^{a} \sum_{d=0}^{b} C^c_a M^{a-c}_{even} N^c\, C^d_b O^{b-d}_{even} P^d \sum_{x_d=1}^{X_d} \sum_{y_{ed}=1}^{Y_d/2} (X_d - x_d)^c \Big(\frac{Y_d}{2} - y_{ed}\Big)^d I(x_d, y_{ed}),   (23)

where M_{odd} = 0.5 p_x - \mathrm{offset}_x, M_{even} = p_x - \mathrm{offset}_x, N = p_x, O_{odd} = -2 p_y - \mathrm{offset}_y, O_{even} = -p_y - \mathrm{offset}_y, and P = -2 p_y.

Equation (23) shows that the Zernike moments can be computed from the geometric moments. We consider two accumulation grids: the first computes the accumulation moments on the odd lines of the image and the second on the even lines. Since the computation is divided into two different parts, the image has to be arranged in two components: the odd component of the image and the even one. Therefore, according to (8), the analogy gives an expression of the Zernike moments as a function of \psi_{odd} (accumulation moments computed from the odd component of the image) and \psi_{even} (accumulation moments computed from the even component of the image) by setting N_x = X_d and N_y = Y_d/2:

\sum_{x_d=1}^{X_d} \sum_{y_{od}=1}^{Y_d/2} (X_d - x_d)^c \Big(\frac{Y_d}{2} - y_{od}\Big)^d I(x_d, y_{od}) = \sum_{e=0}^{c} \sum_{f=0}^{d} S(c, e)\, S(d, f)\, \psi_{odd,e,f},

\sum_{x_d=1}^{X_d} \sum_{y_{ed}=1}^{Y_d/2} (X_d - x_d)^c \Big(\frac{Y_d}{2} - y_{ed}\Big)^d I(x_d, y_{ed}) = \sum_{e=0}^{c} \sum_{f=0}^{d} S(c, e)\, S(d, f)\, \psi_{even,e,f}.   (24)

By injecting (24) into (23), the Zernike moments are reformulated as follows:

Z_{pq} = \frac{p+1}{\pi} \sum_{k=q,\ p-k\ \mathrm{even}}^{p} \sum_{\zeta=0}^{q} \sum_{\xi=0}^{(k-q)/2} i^{\zeta} (-1)^k C^{\zeta}_q C^{\xi}_{(k-q)/2} \frac{B_{pqk}}{r^k_{max}} \sum_{a=0}^{k-\zeta-2\xi} C^a_{k-\zeta-2\xi}\, X_c^{k-\zeta-2\xi-a} \sum_{b=0}^{2\xi+\zeta} C^b_{2\xi+\zeta}\, Y_c^{2\xi+\zeta-b}
  \times \Big[ \sum_{c=0}^{a} \sum_{d=0}^{b} C^c_a M^{a-c}_{odd} N^c\, C^d_b O^{b-d}_{odd} P^d \sum_{e=0}^{c} \sum_{f=0}^{d} S(c, e)\, S(d, f)\, \psi_{odd,e,f}
  + \sum_{c=0}^{a} \sum_{d=0}^{b} C^c_a M^{a-c}_{even} N^c\, C^d_b O^{b-d}_{even} P^d \sum_{e=0}^{c} \sum_{f=0}^{d} S(c, e)\, S(d, f)\, \psi_{even,e,f} \Big].   (25)

In an analogous way, the coordinates of the center of the image (x_0, y_0) can be computed from the accumulation moments:

x_0 = \frac{m_{01}}{m_{00}}, \qquad y_0 = \frac{m_{10}}{m_{00}},   (26)

where

m_{00} = \psi_{odd,0,0} + \psi_{even,0,0},

m_{01} = (-1) \times \sum_{b=0}^{1} C^b_1 N_y^{1-b} \Big[ \sum_{d=0}^{b} C^d_b O^{b-d}_{odd} P^d \sum_{f=0}^{d} S(d, f)\, \psi_{odd,0,f} + \sum_{d=0}^{b} C^d_b O^{b-d}_{even} P^d \sum_{f=0}^{d} S(d, f)\, \psi_{even,0,f} \Big]
      = (-1) \times \big[ N_y (\psi_{odd,0,0} + \psi_{even,0,0}) + \big( O_{odd}\, \psi_{odd,0,0} + O_{even}\, \psi_{even,0,0} + P\, S(1,0) (\psi_{odd,0,0} + \psi_{even,0,0}) + P\, S(1,1) (\psi_{odd,0,1} + \psi_{even,0,1}) \big) \big],

m_{10} = (-1) \times \sum_{a=0}^{1} C^a_1 (-N_x)^{1-a} \Big[ \sum_{c=0}^{a} C^c_a O^{a-c}_{odd} P^c \sum_{e=0}^{c} S(c, e)\, \psi_{odd,e,0} + \sum_{c=0}^{a} C^c_a O^{a-c}_{even} P^c \sum_{e=0}^{c} S(c, e)\, \psi_{even,e,0} \Big]
      = (-1) \times \big[ (-N_x) (\psi_{odd,0,0} + \psi_{even,0,0}) + \big( O_{odd}\, \psi_{odd,0,0} + O_{even}\, \psi_{even,0,0} + P\, S(1,0) (\psi_{odd,0,0} + \psi_{even,0,0}) + P\, S(1,1) (\psi_{odd,1,0} + \psi_{even,1,0}) \big) \big].   (27)

It follows that

x_0 = -N_y - P\, S(1,0) - \frac{O_{odd}\, \psi_{odd,0,0} + O_{even}\, \psi_{even,0,0}}{\psi_{odd,0,0} + \psi_{even,0,0}} - \frac{P\, S(1,1) (\psi_{odd,0,1} + \psi_{even,0,1})}{\psi_{odd,0,0} + \psi_{even,0,0}},

y_0 = N_x - P\, S(1,0) - \frac{O_{odd}\, \psi_{odd,0,0} + O_{even}\, \psi_{even,0,0}}{\psi_{odd,0,0} + \psi_{even,0,0}} - \frac{P\, S(1,1) (\psi_{odd,1,0} + \psi_{even,1,0})}{\psi_{odd,0,0} + \psi_{even,0,0}}.   (28)

We have thus developed an algorithm enabling the computation of Zernike moments via the moment generator, using the accumulation moments. A first advantage of this algorithm is that it can be applied to images with particular topologies, as long as their mesh grid is regular or semiregular, through the use of a second accumulation grid. A second advantage is the simplicity of its implementation, on an FPGA for instance: the core of the algorithm relies on the accumulation moments, which are easily computed by means of a simple accumulation grid.

4.1.2. Architecture Description. To make the exploitation of (25) easier, we need to reorder the terms to obtain an expression of the Zernike moments of the form

Z_{pq} = \sum_{e} \sum_{f} \Gamma^{p,q}_{odd,e,f}\, \psi_{odd,e,f} + \Gamma^{p,q}_{even,e,f}\, \psi_{even,e,f},   (29)


where

\Gamma^{p,q}_{odd,e,f} = \frac{p+1}{\pi} \sum_{k=q}^{p} t_{k,e}\, t_{k,f} \sum_{\zeta=0}^{q} \sum_{\xi=0}^{(k-q)/2} \alpha^k_{e,\zeta,\xi}\, \beta^k_{f,\zeta,\xi}\, i^{\zeta} (-1)^k \frac{B_{pqk}}{r^k_{max}} C^{\zeta}_q C^{\xi}_{(k-q)/2}
  \times \sum_{a=0}^{k-\zeta-2\xi} \sum_{b=0}^{2\xi+\zeta} t_{a,e}\, t_{b,f}\, C^a_{k-\zeta-2\xi} C^b_{2\xi+\zeta}\, X_c^{k-\zeta-2\xi-a} Y_c^{2\xi+\zeta-b} \sum_{c=0}^{a} \sum_{d=0}^{b} C^c_a C^d_b M^{a-c}_{odd} N^c O^{b-d}_{odd} P^d\, S(c, e)\, S(d, f),

\Gamma^{p,q}_{even,e,f} = \frac{p+1}{\pi} \sum_{k=q}^{p} t_{k,e}\, t_{k,f} \sum_{\zeta=0}^{q} \sum_{\xi=0}^{(k-q)/2} \alpha^k_{e,\zeta,\xi}\, \beta^k_{f,\zeta,\xi}\, i^{\zeta} (-1)^k \frac{B_{pqk}}{r^k_{max}} C^{\zeta}_q C^{\xi}_{(k-q)/2}
  \times \sum_{a=0}^{k-\zeta-2\xi} \sum_{b=0}^{2\xi+\zeta} t_{a,e}\, t_{b,f}\, C^a_{k-\zeta-2\xi} C^b_{2\xi+\zeta}\, X_c^{k-\zeta-2\xi-a} Y_c^{2\xi+\zeta-b} \sum_{c=0}^{a} \sum_{d=0}^{b} C^c_a C^d_b M^{a-c}_{even} N^c O^{b-d}_{even} P^d\, S(c, e)\, S(d, f)   (30)

with

t_{g,h} = 1 if h \le g, 0 otherwise,
\alpha^k_{e,\zeta,\xi} = 1 if e \le k - \zeta - 2\xi, 0 otherwise,
\beta^k_{f,\zeta,\xi} = 1 if f \le 2\xi + \zeta, 0 otherwise.   (31)

The general scheme of the Zernike moment architecture (see Figure 8) can be described as follows. (i) The image is first divided into two parts: the odd component, which only contains the odd rows of the image, and the even component. (ii) The accumulation moments are computed in parallel according to two accumulation grids. (iii) On the one hand, the accumulation moments of orders (0, 0), (0, 1), and (1, 0) reach the block which computes X_c^{k-\zeta-2\xi-a}, Y_c^{2\xi+\zeta-b}, and r_{max}^{-k}; on the other hand, the accumulation moments are delivered to the Zernike computation block, waiting for the completion of these computations. (iv) As soon as X_c^{k-\zeta-2\xi-a}, Y_c^{2\xi+\zeta-b}, and r_{max}^{-k} are computed, the coefficients \Gamma^{p,q}_{odd,e,f} and \Gamma^{p,q}_{even,e,f} can be computed. (v) The coefficients are transmitted to the final computation block in order to evaluate the Zernike moments according to (29). Their modulus is then computed.

Figure 8: Zernike architecture general scheme.

The scheme of an accumulation grid of width 4 is given in Figure 9; it consists of a simple series of accumulators. They are arranged in such a way that the accumulation is first computed on each row via an accumulation row (row of \psi_m) and then performed on the columns (set of \psi_{mn}). As soon as a row ends in a given accumulator \psi_m, the result of this accumulator is passed to its corresponding first column accumulator, and \psi_{m0} and \psi_m are cleared. At the same time, all the corresponding column accumulators transmit their accumulation to the next one.

The registers used between the column accumulators are synchronized at the end of each row, so their clock enable depends on the image topology. In our case, corners have been filled with zeros before dividing the image. Therefore, the size of each image component is X_d × Y_d/2. In this case, the accumulation moment \psi_{e,f} is computed in X_d × (Y_d/2 + f) + e clock cycles from the moment when the first pixel arrives into the accumulation grid.

One major point of the Zernike moments implementation is the computation of the coefficients. The main issue of this computation lies in the trade-off between the number of coefficients stored in the chip and the number of operations needed to compute these coefficients. Table 2 shows the number of operations that are necessary for the computation of the coefficients for Zernike moments up to order 8. Configuration 1 corresponds to the case where B_{pqk} (55 values), C^k_p (45 values), the matrix S (45 values), and the M^p_{odd}, M^p_{even}, N^p, O^p_{odd}, O^p_{even}, and P^p (9 × 6 = 54 values) are stored, that is, 199 values. The second configuration corresponds to the storage of the results of the operations (-1)^k (B_{pqk}/r^k_{max}) C^{\zeta}_q C^{\xi}_{(k-q)/2}, C^a_{k-\zeta-2\xi} C^b_{2\xi+\zeta}, C^c_a M^{a-c}_{odd} N^c, C^c_a M^{a-c}_{even} N^c, C^d_b O^{b-d}_{odd} P^d, C^d_b O^{b-d}_{even} P^d, and S(c, e) S(d, f). Without any optimisation of the storage of these values, the occupied memory would be huge (4564 values), but by exploiting the redundancy within each group and storing the value 1 only once, the number of stored values may be reduced to 1203. Note that even though the number of values to store increases, the number of multiplications is divided by two compared to the first configuration.

Figure 10 shows the envisaged computation of the Zernike coefficients taking into account the second configuration. The control block deals with the bounds of the sums. The look-up tables T1, T2, T3, T4, T5, T6, and T7 correspond, respectively, to C^c_a M^{a-c}_{odd} N^c, C^d_b O^{b-d}_{odd} P^d, C^c_a M^{a-c}_{even} N^c, C^d_b O^{b-d}_{even} P^d, S(c, e) S(d, f), C^a_{k-\zeta-2\xi} C^b_{2\xi+\zeta}, and (-1)^k (B_{pqk}/r^k_{max}) C^{\zeta}_q C^{\xi}_{(k-q)/2}. T1-2 (resp., T3-4) means that T1 and T2 (resp., T3 and T4) are first read and then the products between the read values are computed.

Figure 9: Example of an accumulation grid of width 4.

Figure 10: Computation of the Zernike coefficients.


Table 2: Number of operations executed to compute the \Gamma^{p,q}_{e,f} coefficients.

 p   Nb. accumulations   Nb. multiplications (config. 1)   (config. 2)
 0           2                      25                          14
 1          16                     183                         102
 2         101                    1225                         672
 3         349                    5543                        3000
 4        1311                   22987                       12266
 5        4267                   77637                       41010
 6       13642                  241767                      126592
 7       38860                  660481                      343560
 8      104663                 1692910                      875720


A Zernike computation block aims to compute the modulus of the Zernike moments from the accumulation moments that are provided by the grids and from the module that furnishes the coefficients (see Figure 8). This block consists in summing the different coefficients and in computing the modulus of each moment. In order to reduce the amount of logic resources required, the computation of the square root is simplified according to the following approximation:

\sqrt{x^2 + y^2} \approx \max\left( |x|,\ |y|,\ \frac{3}{4}\big(|x| + |y|\big) \right).   (32)

This approximation is often utilized in image processingand does not impact significantly the final results.
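In software form, the approximation reads as follows (the function name is ours):

    def magnitude_approx(re, im):
        """Equation (32): |Z| ~ max(|re|, |im|, 3/4*(|re|+|im|)); on FPGA
        the 3/4 factor costs only a shift and an add."""
        a, b = abs(re), abs(im)
        return max(a, b, 0.75 * (a + b))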

4.1.3. FPGA Implementation of the Zernike Moments' Computation. In order to compute the Zernike moments from the accumulation moments, we propose an original architecture, presented in Figure 10. This architecture is very regular, which simplifies the implementation on an FPGA target. Furthermore, the required hardware is simple to design for both the moments' accumulation and the moments' computation. In fact, the computations are based on a multiplier and an adder. These constitute the MAC (Multiply-ACcumulate) operator and are widely available in current FPGA devices. In order to improve performance, MAC operators are integrated in some FPGA devices as hardwired components, like the DSP48 in the Xilinx Virtex4.

Two implementation approaches are possible in whicheither hardware or time optimization is considered.

Hardware Optimization. This approach partially reproduces the temporal model of processors. The computations are performed iteratively and the coefficients are read from the tables sequentially. The results can be temporarily stored in paged memory rather than in registers. In this approach, the total number of iterations is directly proportional to the order of the desired moment and remains relatively small (a few thousand only). Figure 11 depicts one of the two variants of realization: with or without pipelined computation. The pipelined organization allows the calculation frequency of the iterations to be increased.

Figure 11: Using a MAC operator in a data flow architecture.

Figure 12: Reduction of the calculation resources by reusing hardwired operators.

Time Optimization. In this case, we consider that the amount of computation hardware is sufficient. Therefore, the architecture includes all the necessary pipelined operators, as suggested in Figure 10. The intermediate results are stored in registers. This solution also offers the possibility of reducing the number of operators by reusing the same hardware resources, as shown in Figure 12.

Figure 13 describes the hardware implementation of the Zernike computation block. Its main objective is to generate the different Zernike moments from the accumulation moments calculated with the accumulation grids and from the coefficients computation module. It mainly consists of MAC blocks and of a module destined to compute the square root of the modulus according to (32). Only 75 slices are required to implement the entire block.

4.2. Hardware Implementation of the Neural Network. The parallel nature of neural networks makes them very suitable for hardware implementation. Several studies have been performed so far allowing complex configurations to be implemented in reconfigurable circuits [15, 16].

Figure 13: Computation of Zernike moments from the accumulation moments.

The proposed architecture strives to reduce the amount of logic utilized. This is mainly because the neural network has to be implemented together with its associated complex preprocessing, which may require a lot of resources. An example of such an architecture is presented in Figure 14.

In this example, the neural architecture implements a 5-input MLP with 7 hidden nodes and 3 outputs. These parameters are easily modifiable since the proposed circuit is scalable.

Input data are accepted sequentially and applied to the series of multipliers. A_j corresponds to the jth input of the present set whereas B_j corresponds to the jth input of the next set. Data arrive at each clock cycle.

At each clock cycle, at any particular level of adder, apart from the addition operation between the multiplier output and the sum from the previous level, the multiplication operation of the next set of inputs at the adjacent multiplier is also performed simultaneously. The sum thus ripples and accumulates through the central adders (48 bits) until it is fed to a barrel shifter that translates the data into a 16-bit address. The obtained sum addresses a sigmoid block memory (SIGMOID0) containing 65536 values of 18 bits.

This block feeds the outputs of the hidden layer sequentially to three MAC units for the output layer calculation.

Finally, a multiplexer distributes serially the results of the output layer to another sigmoid block memory (SIGMOID1). After a study on data representation, it has been decided to code the incoming data in 18 bits. Weights are stored in ROMs (Read-Only Memories) containing 256 words of 18 bits. The control of the entire circuit is performed by a simple state machine that organizes the sequence of computations and the memory management.

The number of multipliers required for the network is I + O, where I is the number of inputs and O is the number of outputs. Considering that the number of hidden nodes may be large compared to the number of inputs and outputs, the adopted solution keeps the number of multipliers low, which is a significant saving. In this context, it is also important to note that the design is very easily scalable to accommodate more hidden, input, or output nodes. For example, adding a hidden node does not impact the number of resources but requires an additional cycle of computation. Adding an input may be accommodated by the addition of another ROM, multiplier, and adder set to the series of adders at the centre (part HL of the figure). Moreover, the addition of an output node can be fulfilled by adding another ROM, MAC unit, and sigmoid block to part OL of the figure.


Figure 14: Example of the hardware implementation of a basic neural network.

Table 3: Summary of occupied resources.

Module                 Used logic slices/   Used DSP blocks/   Used memory bits/
                       available            available          available (Kb)
Accumulation moments   1786/49152           0/96               0/4320
Zernike moments        75/49152             60/96              21/4320
Neural network         477/49152            28/96              2808/4320
Total                  2338/49152 (4%)      88/96 (92%)        2829/4320 (65.5%)


Another advantage of the architecture is that a single activation function (sigmoid block) is required to compute the complete hidden layer. This block consists of a Look-up Table (LUT) that stores 65536 values of the function.

In general, the time required to obtain the outputs after the arrival of the first input is I + H + 6 cycles, where I is the number of inputs and H is the number of hidden units. In every cycle, I + O multiplications are performed (O is the number of output units).

5. Performances

The complete architecture (preprocessing + neural network) has been implemented in a Xilinx Virtex4 (xc4lx100) FPGA, which is the part that has been retained for the trigger implementation. This type of reconfigurable circuit exhibits a lot of dedicated resources, such as memory blocks or DSP blocks, that allow a MAC to be computed very efficiently.

5.1. Resources Requirements. The resources that are required to implement the global L2 trigger are given in Table 3. The accumulation grid is essentially realized with logic resources; no DSP block is utilized at this level. The computation of Zernike moments from the accumulation moments makes intensive use of parallelism: five computation stages enable the 25 Zernike moments to be computed very rapidly and make use of 60 DSP blocks.

Concerning the hardware implementation of the neural network, it is important to notice that, independently of the configuration, the amount of used logic resources is very low. Nevertheless, an important usage of memory blocks, destined to store the values of the sigmoid functions, may be deplored. This issue may be circumvented in cases where memory resources become critical: a modified activation function sgn(x) × (1 − 2^{−|x|}) could be used [17]. It has a shape quite similar to the sigmoid function and is very easy to implement in hardware with just a number of shifts and adds. This function can be evaluated with an error of less than 4.3%.
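A quick comparison of this shift-and-add activation with the hyperbolic tangent of (7) can be sketched as follows; the comparison script is ours, and only the function itself comes from [17]:

    import math

    def sigmoid_shift_add(x):
        """sgn(x) * (1 - 2^-|x|): needs only shifts and adds in fixed point."""
        sign = 1.0 if x >= 0 else -1.0
        return sign * (1.0 - 2.0 ** (-abs(x)))

    for x in (0.25, 1.0, 2.0, 4.0):
        print(f"x={x}: tanh={math.tanh(x):.4f} approx={sigmoid_shift_add(x):.4f}")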

According to Table 3, it is clear that the entire system fits in an FPGA without consuming too much logic (4%). Moreover, the complete architecture has been devised to take full benefit of the intrinsic dedicated resources of the FPGA, that is, the DSP and memory blocks.

5.2. Timing. The computation time of the complete trigger is summarized in Table 4. The timing constraints imposed by the HESS system have been met, since the mean decision frequency is fixed at 3.5 kHz, that is, 285 microseconds. The global latency of the proposed L2 trigger is 115.3 microseconds, which makes it possible to envisage further improvements.

Table 4: Timing performances.

Module                 Processing time (μs)
Accumulation moments   13.5
Zernike moments        101.4
Neural network         0.4
Total                  115.3

It is important to note that most of the computation time is monopolized by the computation of Zernike moments from the accumulation moments. This is mainly due to the fact that the number of accumulations to perform is huge (104663 accumulations for order 8) and that these computations are performed iteratively. Even though the architecture is parallelized in five stages, the number of iterations remains high (≈30 000). Work is currently under way to optimize the computations in this block for further improvements.

The maximum clock frequency has been estimated at 120 MHz, and at 366 MHz for the DSP blocks.

6. Conclusion

In this article, we have presented an original solution that may be seen as an intelligent way of triggering data in the HESS Phase-II experiment. The system relies on the utilization of image processing algorithms in order to increase the trigger efficiency. The hardware implementation has represented a challenge because of the relatively strong timing constraints (285 microseconds to process all algorithms). This problem has been circumvented by taking advantage of the nature of the algorithms. All these concepts are implemented making intensive use of FPGA circuits, which are interesting for several reasons. First, the current advances in reconfigurable technology make FPGAs an attractive alternative compared to very powerful circuits such as ASICs. Moreover, their relatively small cost permits a prototype design to be rapidly implemented without major developmental constraints. The reconfigurability also constitutes a major point: it allows the whole system to be configured according to the application needs, enabling flexibility and adaptivity. For example, in the context of the HESS project, it may be conceivable to reconfigure the chip according to the surrounding noise or to deal with specific experimental conditions.

References

[1] J. A. Hinton, "The status of the HESS project," New Astronomy Reviews, vol. 48, no. 5-6, pp. 331–337, 2004.

[2] S. Funk, G. Hermann, J. Hinton, et al., "The trigger system of the HESS telescope array," Astroparticle Physics, vol. 22, no. 3-4, pp. 285–296, 2004.

[3] E. Delagnes, Y. Degerli, P. Goret, P. Nayman, F. Toussenel, and P. Vincent, "SAM: a new GHz sampling ASIC for the HESS-II front-end electronics," Nuclear Instruments and Methods in Physics Research, vol. 567, no. 1, pp. 21–26, 2006.

[4] A. M. Hillas, "Cerenkov light images of EAS produced by primary gamma rays and by nuclei," in Proceedings of the 19th International Cosmic Ray Conference (ICRC '85), San Diego, Calif, USA, August 1985.

[5] B. Denby, "Neural networks in high energy physics: a ten year perspective," Computer Physics Communications, vol. 119, no. 2, pp. 219–231, 1999.

[6] C. Kiesling, B. Denby, J. Fent, et al., "The H1 neural network trigger project," in Advanced Computing and Analysis Techniques in Physics Research, vol. 583, pp. 36–44, 2001.

[7] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, 1995.

[8] M. R. Teague, "Image analysis via the general theory of moments," Journal of the Optical Society of America, vol. 70, pp. 920–930, 1979.

[9] F. Zernike, "Beugungstheorie des Schneidenverfahrens und seiner verbesserten Form, der Phasenkontrastmethode," Physica, vol. 1, pp. 689–704, 1934.

[10] S. Khatchadourian, J.-C. Prevotet, and L. Kessal, "A neural solution for the level 2 trigger in gamma ray astronomy," in Proceedings of the 11th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT '07), Proceedings of Science, Nikhef, Amsterdam, The Netherlands, April 2007.

[11] S. O. Belkasim, M. Ahmadi, and M. Shridhar, "Efficient algorithm for fast computation of Zernike moments," Journal of the Franklin Institute, vol. 333, pp. 577–581, 1996.

[12] E. C. Kintner, "On the mathematical properties of the Zernike polynomials," Journal of Modern Optics, vol. 23, pp. 679–680, 1976.

[13] L. Kotoulas and I. Andreadis, "Real-time computation of Zernike moments," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 6, pp. 801–809, 2005.

[14] M. Hatamian, "A real-time two-dimensional moment generating algorithm and its single chip implementation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, pp. 546–553, 1986.

[15] J.-C. Prevotet, B. Denby, P. Garda, B. Granado, and C. Kiesling, "Moving NN triggers to level-1 at LHC rates," Nuclear Instruments and Methods in Physics Research A, vol. 502, no. 2-3, pp. 511–512, 2003.

[16] A. R. Omondi and J. C. Rajapakse, FPGA Implementations of Neural Networks, Springer, 2006.

[17] M. Skrbek, "Fast neural network implementation," Neural Network World, vol. 9, pp. 375–391, 1999.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 542035, 15 pages
doi:10.1155/2009/542035

Research Article

Evaluation and Design Space Exploration of a Time-Division Multiplexed NoC on FPGA for Image Analysis Applications

Linlin Zhang,1,2,3 Virginie Fresse,1,2,3 Mohammed Khalid,4 Dominique Houzet,5 and Anne-Claire Legrand1,2,3

1 Universite de Lyon, 42023 Saint-Etienne, France
2 CNRS, UMR 5516, Laboratoire Hubert Curien, 42000 Saint-Etienne, France
3 Universite de Saint-Etienne, Jean-Monnet, 42000 Saint-Etienne, France
4 RCIM, Department of Electrical & Computer Engineering, University of Windsor, Windsor, ON, Canada N9B 3P4
5 GIPSA-Lab, Grenoble, France

Correspondence should be addressed to Linlin Zhang, [email protected]

Received 1 March 2009; Revised 17 July 2009; Accepted 17 November 2009

Recommended by Ahmet T. Erdogan

The aim of this paper is to present an adaptable Fat Tree NoC architecture for Field Programmable Gate Array (FPGA) designed for image analysis applications. A traditional Network on Chip (NoC) is not optimal for dataflow applications with large amounts of data. On the opposite side, point-to-point communications are designed from the algorithm requirements but they are expensive in terms of resources and wires. We propose a dedicated communication architecture for image analysis algorithms. This communication mechanism is a generic NoC infrastructure dedicated to dataflow image processing applications, mixing circuit-switching and packet-switching communications. The complete architecture integrates two dedicated communication architectures and reusable IP blocks. Communications are based on the NoC concept to support the high bandwidth required for a large number and variety of data. For data communication inside the architecture, an efficient time-division multiplexed (TDM) architecture is proposed. This NoC uses a Fat Tree (FT) topology with Virtual Channels (VCs) and flit packet-switching with fixed routes. Two versions of the NoC are presented in this paper. The results of their implementations and of their Design Space Exploration (DSE) on an Altera Stratix II are analyzed, compared with a point-to-point communication scheme, and illustrated with a multispectral image application. Results show that a point-to-point communication scheme is not efficient for large amounts of multispectral image data communications. The NoC architecture uses only 10% of the memory blocks required for a point-to-point architecture but seven times more logic elements. This resource allocation is better adapted to image analysis algorithms, as memory elements are a critical point in embedded architectures. An FT NoC-based communication scheme for data transfers provides a more appropriate solution for resource allocation.

Copyright © 2009 Linlin Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Image analysis applications consist of extracting relevant parameters from one or several images or data. Embedded systems for real-time image analysis allow computers to take appropriate actions for processing images under hard real-time constraints and often in harsh environments. Current image analysis algorithms are resource intensive, so traditional PC- or DSP-based systems are unsuitable as they cannot achieve the required high performance.

An increase in chip density following Moore's law allows the implementation of ever larger systems on a single chip.

Known as systems on chip (SoC), these systems usually contain several CPUs, memories, and custom hardware modules. Such SoCs can also be implemented on FPGAs. For embedded real-time image processing algorithms, FPGA devices are widely used because they can achieve high-speed performance in a relatively small footprint with low power compared to GPU architectures [1]. Modern FPGAs integrate many heterogeneous resources on one single chip. The resources on an FPGA continue to increase at such a rate that a single FPGA is capable of handling all processing operations, including the acquisition part. That means that incoming data from the sensor or any other acquisition device are directly processed by the FPGA. No other external resources are required for many applications (some algorithms might use more than one FPGA). Today, many designers of such systems choose to build their designs on Intellectual Property (IP) cores connected to traditional buses. Most IP cores are already predesigned and pretested, and they can be immediately reused [2–4]. Without reinventing the wheel, the existing IPs and buses are directly used and mapped to build the dedicated architecture. Although the benefits of using existing IPs are substantial, buses are now replaced by NoC communication architectures for a more systematic, predictable, and reliable architecture design. Network on Chip architectures are classified according to their switching technique and topology. Few NoC architectures for FPGA have been proposed in the literature. Packet switching with wormhole is used by the Hermes [5], IMEC [6], SoCIN [7], and Extended Mesh [8] NoCs. PNoC [9] and RMBoC [10] use only circuit switching, whereas the NoC of Lee [11] uses packet switching. As for topology, Hermes uses a 2D mesh, the NoC from IMEC uses a 2D torus, and SoCIN/RASoC can use a 2D mesh or a torus. RMBoC from [12] has a 1D or 2D mesh topology. An extended mesh is used for the Extended Mesh NoC. HIBI uses a hierarchical bus. PNoC and the NoC from Lee have a custom topology.

Existing NoC architectures for FPGA are not adapted to image analysis algorithms, as the amount of input data is high compared to the results and commands. A dedicated and optimized communication architecture is required and is most of the time designed from the algorithm requirements.

The Design Space Exploration (DSE) of an adaptable architecture for image analysis applications on FPGA with IP designs remains a difficult task. It is hard to predict the number and the type of the required IPs and buses from a set of existing IPs in a library.

In this paper we present an adaptable communication architecture dedicated to image analysis applications. The architecture is based on a set of locally synchronous modules. The communication architecture is a double NoC architecture: one NoC structure dedicated to commands and results, the other one dedicated to internal data transfers. The data communication structure is designed to be adapted to the application requirements (number of tasks, required connections, size of transmitted data). Proposing an NoC paradigm helps the dimensioning and exploration of the communication between IPs as well as their integration in the final system.

The paper is organised into 5 further sections. Section 2 presents the global image analysis architecture and focuses on the data flow; special communication units are set up to satisfy the application constraints. Section 3 presents two versions of the NoC for the data flow, which are built on these basic communication units; the NoC architectures are fully parameterized. Section 4 presents one image analysis application, a multispectral image authentication, and a DSE method is used to find the best parameters for the NoC architecture according to the application. Section 5 gives the conclusion and perspectives.

2. Architecture Dedicated to Image Analysis Algorithms

This architecture is designed for most image analysis applications. Characteristics of such applications are used to propose a parameterized and adaptable architecture for FPGA.

2.1. Characteristics of Image Analysis Algorithms. Image analysis consists of extracting some relevant parameters from one or several images. Examples of image analysis are object segmentation, feature extraction, motion detection, object tracking, and so forth [13, 14]. Any image analysis application requires four types of operations:

(i) acquisition operations,

(ii) storage operations,

(iii) processing operations,

(iv) control operations.

A characteristic of image analysis applications is the unbalanced data flow between the input and the output. The input data flow corresponds to a high number of pixels (images) whereas the output data flow represents little data information (selective results). From these unbalanced flows, two different communication topologies can be defined, each one being adapted to the speed and flow of data.

2.2. An Adaptable Structure for Image Analysis Algorithms. The architecture presented here is designed from the characteristics of image analysis applications. The structure of the architecture contains four types of modules, each one corresponding to one of the four types of operations. All these modules are designed as several VHDL Intellectual Property (IP) nodes. They are presented in detail in [13].

(i) The Acquisition Module produces the data that are processed by the system. The number of acquisition modules depends on the application and on the number of types of required external interfaces.

(ii) The Storage Module stores incoming images or any other data inside the architecture. Writing and reading cycles are supervised by the control module. Whenever possible, memory banks are FPGA-embedded memories.

(iii) The Processing Module contains the logic that is required to execute one task of the algorithm. The number of processing modules depends on the number of tasks of the application. Moreover, more than one identical processing module can be used in parallel to improve timing performances.

The number of these modules is only limited by the size of the target FPGA. The control of the system is not distributed over all modules but is fully centralized in a single control module.


Figure 1: The proposed adaptable architecture dedicated to image analysis applications.

(iv) The Control Module performs decisions and scheduling of operations and sends commands to the other modules. All the commands are sent from this module to the other modules. In the same way, this module receives result data from the processing modules.

The immediate reuse of all modules is possible as all modules are designed with an identical structure and interface, given in Figure 1.

To run each node at its best frequency, the Globally Asynchronous Locally Synchronous (GALS) concept is used in this architecture. The frequencies for each type of node in the architecture depend on the system requirements and on the tasks of the application.

2.3. Structure of Modules/NoC for Command and Results. The modular principle of the architecture can be seen at different levels: one type of operation is implemented by means of a module (acquisition, storage, processing, etc.). Each module includes units that carry out a function (decoding, control, correlation, data interface, etc.), and these units are built from basic blocks (memory, comparator, etc.). Some units can be found inside different modules. Figure 2 depicts all levels inside a module.

Each module is designed in a synchronous way, having its own frequency. Communications between modules are asynchronous via a wrapper and use a single-rail data path 4-phase handshake. Two serial flip-flops are used between independent clock domains to reduce metastability [15, 16]. The wrapper includes two independent units: one receives frames from the previous module while the other sends frames to the following module at the same time.

An NoC is characterized by its topology, routing protocol, and flow control. The communication architecture is made of an NoC for command and results and another NoC for internal data. Topology, flow control, and type of packets differ according to the targeted NoC.

2.4. NoC for Command and Results. Because the command flow and the final results are significantly smaller than the incoming data flow, they use an NoC architecture which is linked to the IP wrappers. The topology for this communication is a ring using a circuit switching technique with 8-bit flits. Through the communication ring, the control module sends 4-flit packets: one header flit and 3 other flits containing command flits or empty flits. The control module sends packets to any other module; packets are command packets or empty packets. Command packets sent by the control module to any other module contain instructions to execute. Empty packets can be used by any module to send results or any other information back to the control module.

2.5. Communication Architecture for Data Transfers. The NoC dedicated to data uses a Fat Tree topology which can be customized according to the required communication of the application. Here we use flit packet-switching/wormhole routing with fixed routes and virtual channels. Flow control deals with the allocation of channel and buffer resources to a packet/data. For image analysis applications, the specifications for the design of our NoC dedicated to data are the following.

(i) Several types of data with different lengths at the inputs. The size of the data must be parameterized to support any algorithm characteristic.

(ii) Several output nodes; this number is defined according to the application requirements.

(iii) Node/module frequencies are different.

According to the algorithms implemented, several data from any input module can be sent to any output module at any time.

In the following sections, we assume that the architecture contains four input modules (the memory modules) connected to four output modules (the processing modules). This configuration will be used for the multispectral image application illustrating the design space exploration in the following sections.

2.5.1. The Topology. The topology chosen is a Fat Tree (FT) topology, as depicted in Figure 3, as it can be adapted to the algorithm requirements. Custom routers are used to interconnect modules in this topology.

Figure 2: The generic structure of modules with the asynchronous wrapper for result and command.

Figure 3: FT topology for the TDM NoC.

2.5.2. Virtual Channel (VC) Flow Control. VC flow control is a well-known technique. A VC consists of a buffer that can hold one or more flits of a packet and associated state information. Several virtual channels may share the bandwidth of a single physical channel [17]. It allows the size of the router's buffers, a significant source of area and energy overhead [18, 19], to be minimized, while providing flexibility and good channel use.

During the operation of the router, a VC can be in one of the following states: idle, busy, empty, or ready.

Virtual channels are implemented using bi-synchronous FIFOs.

2.5.3. Packet/Flit Structure. Table 1 shows the structure of the packet/flit used for the data transfers. The packet uses an 8-bit header flit, an 8-bit tail flit, and several 8-bit flits for the data. In the header flit, Id is the identifier of the data type, P is the output port number corresponding to the target PN, and Int_l (integer length) gives the position of the fixed point in the data.

The tail flit is a constant "FF". One packet can be separated into several flow control units (flits). The data structure is dynamic in order to adapt to different types of data. The length of packets and data, the number and size of flits, and the depth of the VCs are all parameterized. The size of flits can be 8, 16, 32, or 64 bits, but we keep a header and tail of 8 bits, extended to the flit size.

Packet switching with wormhole routing and fixed routing paths is used; each packet contains the target address information as well as the data, with Best Effort (BE) traffic.
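As an illustration of this parameterization, here is a minimal Python sketch that cuts one data word into 8-bit flits and wraps it with the header and the constant 0xFF tail. The bit positions of the id, p, and Int_l fields (2 + 3 + 3 bits) are assumptions for illustration, loosely suggested by the NA signal widths shown in Figure 5.

```python
# Minimal sketch of the packet/flit structure of Section 2.5.3: an 8-bit
# header (id, p, Int_l fields), N 8-bit data flits, and the constant
# 0xFF tail flit. The exact header bit layout is assumed, not specified.

def packetize(data: int, nbits: int, ident: int, port: int, int_len: int) -> list[int]:
    """Cut one data word into 8-bit flits and wrap it with header and tail."""
    assert nbits % 8 == 0
    header = ((ident & 0x3) << 6) | ((port & 0x7) << 3) | (int_len & 0x7)
    flits = [(data >> shift) & 0xFF for shift in range(nbits - 8, -8, -8)]
    return [header] + flits + [0xFF]

# A 48-bit "Org" value (id "01") routed to processing node 2:
pkt = packetize(0x123456789ABC, 48, ident=0b01, port=2, int_len=0)
print([hex(f) for f in pkt])   # header, 6 data flits, tail 0xFF
```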

2.5.4. The Switch Structure. This NoC is based on routers built from three blocks. The first block, called the Central Coordination Node (CCN), performs the coordination of the system. The second block is the Arbitration Unit (AU), which detects the states of the data paths. The last one is a mux (TDM-NA) that also formats the data. The switch structure is shown in Figure 4.

Figure 4: Switch structure (input FIFOs feeding the TDM-NA, with the AU and the CCN controlling the data paths).

The CCN manages the resources of the system and maps all new incoming messages to the target channel. The switch is based on a mux (crossbar) from several inputs to several outputs. All the inputs are multiplexed using Time Division Multiplexing (TDM).

For a higher throughput, more than one switch can be implemented in the communication architecture.

The AU is a Round Robin Arbiter (RRA) [20, 21] which detects the states of all the VCs at the outputs. It determines on a cycle-by-cycle basis which VC may advance. When the AU receives the destination information of the flit (P_enc), it detects the states of the available paths connected to the target output. This routing condition information is sent back to the CCN so that the CCN can perform the mapping of the communication.

Table 1: Data structure for the packets.

Header (8 bits): id | p | reserved
Data (N bits): 1st flit of data ... Nth flit of data
Tail (8 bits): constant 0xFF

Table 2: The 24-bit packet data structure for Version 1.

Header (8 bits, bits 23-16): id | p | unused
Data (16 bits, bits 15-0): 1st flit (bits 15-8) | 2nd flit (bits 7-0)
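The arbitration policy can be sketched as follows: a minimal round-robin arbiter that, on each cycle, grants the next requesting VC after the previous winner. This models the RRA policy only, not the paper's hardware.

```python
# Minimal sketch of the AU's round-robin arbitration over the output VCs:
# the search for the next grant starts just after the last winner, so
# every requesting VC is served in turn, cycle by cycle.

class RoundRobinArbiter:
    def __init__(self, n_vcs: int):
        self.n = n_vcs
        self.last = self.n - 1                 # first search starts at VC 0

    def grant(self, requests: list[bool]) -> int | None:
        """Return the index of the granted VC, or None if no VC requests."""
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None

rra = RoundRobinArbiter(4)
print(rra.grant([True, False, True, False]))   # 0
print(rra.grant([True, False, True, False]))   # 2
print(rra.grant([True, False, True, False]))   # 0 again: fair rotation
```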

2.5.5. The Structure of TDM-NA. The TDM-NA is a set of a MUX and a Network Adapter (NA). One specific NA is proposed in Figure 5. The Network Adapter adapts any data before it is sent to the communication architecture. The Network Adapter contains 5 blocks.

(i) Adaptor_type verifies the type of the data and defines the required number of flits (and the number of clock cycles used to construct the flits).

(ii) Adaptor_tdm performs the time division multiplexing for the process of cutting one packet into several flits.

(iii) Adaptor_pack adds header and tail flits to the initial data.

(iv) Fifo_NA stores the completed packet.

(v) Adaptor_flit cuts the whole packet into several 8-bit flits.

Adaptor_flit runs at a higher clock frequency in this NA architecture because it needs time to cut one packet into several flits. For different lengths of data, Adaptor_tdm generates different frequencies, which depend on the number of flits going out for one completed packet.

3. Two Versions of the TDM Parameterized NoC

Two versions are proposed and presented in this paper. In version 1, data are transferred as packets using a packet-switching technique and a fixed link size. In version 2, data are transferred as flits using a wormhole technique and a reduced link size. The first version uses one main switch and 2 virtual channels on the outputs. The second version contains 2 main switches in parallel, with 2 virtual channels on the inputs and on the outputs. Both versions have four memory modules as input modules and four processing modules as output modules, and both are designed in VHDL.

3.1. Version 1 with ONE Main Switch. Version 1 is a TDM FT NoC containing one main switch and 2 VCs (2 channels for each output), as shown in Figure 6.

The data are sent as 24-bit packets. The width of the VCs in version 1 is 24 bits. The simplified data structure of version 1 is shown in Table 2.

3.2. Version 2 with TWO Main Switches. Another switch is added to the architecture to increase the throughput. The structure of this switch is identical to the one presented in the previous section. These two main switches operate in parallel, as depicted in Figure 7.

The width of all the VCs in this version depends on the algorithm characteristics.

3.3. NoC Parameters for DSE. The proposed NoC is flexible and adaptable to the requirements of image analysis applications. Parameters from the communication architecture are specified for the Design Space Exploration (DSE).

The parameters are the following.

(i) Number of switches: one main switch for version 1 and two main switches for version 2.

(ii) Size of VCs: it corresponds to the different sizes of the different types of data transferred.

(iii) Depth of the FIFOs in the VCs: limited by the maximum storage resources of the FPGA.

Several synthesis tools are used for the architecture implementation and DSE, as they give different resource allocations on the FPGA.

4. Experiments and Results

The sizes of the data, FIFOs, and virtual channels are extracted from the implemented algorithm. A multispectral image algorithm for image authentication is used here to validate the communication architecture.

Figure 5: Data structure for the defined types of data (the NA blocks Adaptor_type, Adaptor_tdm, Adaptor_pack, Fifo_NA, and Adaptor_flit and their connections).

Figure 6: The structure of Version 1 (virtual channels/packet switching; 4 inputs, 4 destinations, 2 channels per output).

4.1. Multispectral Image Authentication. Multispectral image analysis (Figure 8) has been used in space-based image identification since the 1970s [22–26]. This technology captures light over a wide range of frequencies, allowing the extraction of additional information that the human eye, with its red, green, and blue receptors, fails to capture. Art authentication is a common application, widely used in museums; in this field, an embedded authentication system is required.

Multispectral images are optically acquired in more than one spectral or wavelength interval. Each individual image usually covers the same physical area and scale but a different spectral band. Other applications are presented in [27, 28].

The aim of the multispectral image correlation is to compare two spectral images.

(i) Original image (OI): its spectrum is saved in the Storage Module as the reference data.

(ii) Compared images (CIs): their spectra are acquired by a multispectral camera.

For the art authentication process, the OI holds the information of the true picture, and the CIs are the other "similar" candidates. With the comparison process of the authentication (Figure 9), the true picture can be found among the false ones by calculating the distance of the multispectral image data. For this process, certain algorithms require high precision operations, which imply large amounts of different types of data (e.g., floating-point, fixed-point, integer, BCD encoding) and complex functions (e.g., square root or other nonlinear functions). Several spectral projections and distance algorithms can be used in the multispectral authentication.


The process can be detailed as follows.

(i) First, the original data received from the multispectral camera are the spectral values of every pixel in the image. The whole image is separated into several significant image regions. The values of these regions are transformed into average color values using different window sizes (e.g., 8 × 8 pixels as the smallest window, 64 × 64 pixels as the biggest window).

(ii) After this process, a color projection (e.g., RGB, L*a*b*, XYZ) transforms the average color values into color space values. An example of RGB color projection is shown in (1):

R_i = \sum_{\lambda=380}^{780} S(\lambda) \times R_c(\lambda),
G_i = \sum_{\lambda=380}^{780} S(\lambda) \times G_c(\lambda),
B_i = \sum_{\lambda=380}^{780} S(\lambda) \times B_c(\lambda).    (1)

where R_c, G_c, and B_c are the coefficients of the red, green, and blue color space, and S(λ) represents the spectral value of the image at each scanned wavelength λ. The multispectral camera used can scan one picture from 380 nm to 780 nm with a precision of 0.5 nm, so the number of spectral values N can vary from 3 to 800. R_i, G_i, and B_i are the RGB values of the processed image. (A short code sketch of steps (ii)-(v) is given after this list.)

(iii) These color image data go through the comparison process of the authentication. Color distance is simply the basic Euclidean geometric distance. For example, for the RGB color space, the calculated distance is shown in (2):

\Delta E_{RGB} = \sqrt{(R_1 - R_2)^2 + (G_1 - G_2)^2 + (B_1 - B_2)^2}.    (2)

If the true picture can be found among the false ones by calculating the color distance, the process is finished; otherwise it goes on to the next step.

(iv) Several multispectral algorithms (e.g., GFC, Mv) are used to calculate the multispectral distance with the original multispectral image data. Certain algorithms require high precision operations, which imply large amounts of floating-point data and complex functions (e.g., square root or other nonlinear functions) in this process.

(v) After comparing all the significant regions of the image, a similarity ratio (R_{s/d}) is calculated as shown in (3), where N_s represents the number of similar regions and N_d the number of dissimilar regions:

R_{s/d} = N_s / N_d.    (3)

Different thresholds are defined to give the final authentication result for the different required precisions, finding the image most alike the original one. One of these algorithms is presented in [29].
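As a software illustration of steps (ii)-(v), the sketch below implements the projection (1), the distance (2), and the ratio (3). The colour-matching coefficients and the spectra are toy placeholder values, not real calibration data.

```python
# Minimal sketch of steps (ii)-(v): project a sampled spectrum S(lambda)
# onto RGB with (1), compare two projections with the distance (2), and
# form the similarity ratio (3). Coefficients below are placeholders.
import math

def rgb_projection(S, Rc, Gc, Bc):
    """Equation (1): weighted sums of the spectrum over the sampled wavelengths."""
    R = sum(s * rc for s, rc in zip(S, Rc))
    G = sum(s * gc for s, gc in zip(S, Gc))
    B = sum(s * bc for s, bc in zip(S, Bc))
    return R, G, B

def delta_e_rgb(c1, c2):
    """Equation (2): Euclidean distance between two RGB triples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def similitude_ratio(n_similar, n_dissimilar):
    """Equation (3): ratio of similar to dissimilar regions."""
    return n_similar / n_dissimilar

# Toy spectra sampled at N = 5 wavelengths (N can vary from 3 to 800):
S_oi, S_ci = [0.2, 0.4, 0.9, 0.5, 0.1], [0.2, 0.5, 0.8, 0.5, 0.2]
coeffs = ([0.1] * 5, [0.3] * 5, [0.6] * 5)
d = delta_e_rgb(rgb_projection(S_oi, *coeffs), rgb_projection(S_ci, *coeffs))
print(d <= 0.1, similitude_ratio(12, 3))   # threshold test, ratio = 4.0
```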

The calculations are based on the spatial and spectral data, which makes memory access a bottleneck in the communication. From the algorithm given in Figure 9, the characteristics are

(i) number of regions for every wavelength = 2000,

(ii) number of wavelengths = 992,

(iii) size of the window for the average processing = 2 × 2, 4 × 4, 8 × 8, 16 × 16, 32 × 32,

(iv) number of tasks: 4 (color projection, color distance, multispectral projection, multispectral distance). The multispectral authentication task is executed by the control module. In this example, there is no task parallelism. Sizes of data are 72 bits, 64 bits, 64 bits, and 24 bits, as shown in Table 3,

(v) number of modules: 4 processing modules, 4 storage modules, 1 acquisition module, 1 control module,

(vi) bandwidth of the multispectral camera: 300 MB/s. The NoC architecture is dimensioned to process and exchange data at least at the same rate in order to achieve real time.

For the NoC architecture, four types of data are defined by analyzing the multispectral image algorithms. Each data type has an identification number (id).

(i) Coef: coefficient data, meaning the normalized values of the difference color space vector (56-bit, id "00").

(ii) Org: original image data, which are stored in the SN (48-bit, id "01").

(iii) Com: compared image data, which are acquired by the multispectral camera and received from the NA (48-bit, id "10").

(iv) Res: result of the authentication process (8-bit, id "11").

4.2. Resources of Modules in the Architecture. This parameterized TDM architecture was designed in VHDL. Table 4 shows the resources of the modules in the architecture. The FPGA is the Altera Stratix II EP2S15F484C3, which has 6240 ALMs/logic cells. The number of resources dedicated to all the modules represents around 14% of the total logic cells. Whatever the communication architecture, all these modules remain unchanged with the same number of resources.

Figure 7: The structure of Version 2 with 2 main switches in parallel.

Table 3: Type and size of data for the multispectral algorithm.

(a) COEF: header 8 bits (id, p, unused) + coefficient data 56 bits (7 flits) + tail 8 bits (constant 0xFF) = 72 bits.
(b) ORG/COM: header 8 bits (id, p, unused) + original/compared data 48 bits (6 flits) + tail 8 bits (constant 0xFF) = 64 bits.
(c) RES: header 8 bits (id, p, unused) + result data 8 bits (value 0-255) + tail 8 bits (constant 0xFF) = 24 bits.

Figure 8: Multispectral images (panels: RGB image, real versus artificial reflectance spectra, and images at wavelengths 500 nm and 775 nm). Multispectral images add information compared to a color image; in this example, the artificial leaf can be distinguished from the real ones at 775 nm.

Table 4: Resources for the nodes in the GALS architecture (on Stratix II 2S60).

Node | Frequency (MHz) | Logic cells | Registers | Memory bits
Control | 150 | 278 | 265 | 32
Acquisition | 76.923 | 315 | 226 | 2
Storage | 100 | 280 | 424 | 320000
Processing | 50 | depending on the algorithms

4.3. The Point-to-Point Communication Architecture Dedicated to Multispectral Image Algorithms. A classical point-to-point communication architecture is designed for the algorithm requirements presented previously and is shown in Figure 10. This traditional structure is used to compare some significant results obtained by the proposed NoC. In the global communication architecture, any input data can be transmitted to any processing module. 72-bit muxes are inserted here between the FIFOs and the processing modules. This point-to-point communication uses input FIFOs having the size of the data used. Their bandwidth is thus not tuned to fit the bandwidth of the input streams. For the three versions studied here, the input FIFO bandwidth is higher than the specifications of multispectral cameras. If that were not the case, the input FIFO size could be increased to respect the constraint.

Figure 9: General comparison process of the authentication. R: result of each step of calculation; P: precision of each multispectral distance.

Figure 10: The point-to-point communication dedicated to the multispectral authentication algorithm.

4.4. Implementation of the Communication Architecture for Data Transfers. The point-to-point architecture and both versions of the NoC are designed in VHDL. The FPGA used for the implementation is an Altera Stratix II EP2S180F1508C3. Two sizes of data are used for version 1 of the NoC, 48 bits and 56 bits. These sizes are similar to the sizes of data for the point-to-point communication. Implementation results are given in Table 5.

Concerning latency, version 2 uses 8-bit flits as the transmission unit; thus the NA needs 8 cycles to cut a 64-bit data packet into flits, plus 1 cycle for the header. The latency of the NoC itself is 1 cycle for storing in the first FIFO, 1 cycle for crossing the main switch, and 1 cycle for storing in the second FIFO, that is, 3 cycles of latency due to the NoC. Compared to the point-to-point communication, we pay the packet serialization latency to gain much better flexibility.
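These cycle counts can be captured in a small back-of-the-envelope model. This simply restates the serialization and NoC latencies from the text; it ignores contention and clock-domain crossing costs.

```python
# Minimal latency model for Version 2: serialization in the NA (1 header
# cycle + one cycle per 8-bit flit) plus 3 cycles inside the NoC
# (first FIFO, switch crossing, second FIFO).

def v2_transfer_cycles(data_bits: int, flit_bits: int = 8) -> int:
    serialization = 1 + data_bits // flit_bits   # header + data flits
    noc = 3                                      # FIFO + switch + FIFO
    return serialization + noc

print(v2_transfer_cycles(64))   # 1 + 8 + 3 = 12 cycles for a 64-bit data packet
```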

Concerning area resources, as depicted in Table 5, the point-to-point communication needs fewer ALUTs but over 7 times more memory blocks. The switch requires 4 times more logic (ALUTs) than the point-to-point architecture (the other ALUTs are for the FIFOs of versions 1 and 2, which use more registers than memory blocks to implement FIFOs). One reason is the structure of the switch, which is more complex than the muxes used in the point-to-point architecture. If we compare just the switch size with a simple classical NoC like Hermes, we obtain similar sizes for a switch based on a 4 × 4 crossbar, but a full NoC linking 4 memory nodes to 4 processing nodes would require at least 8 switches, that is, almost 8 times more area and from 2 to 5 times more latency to cross the switches when there is no contention, and even more with contention. The advantage of a classical NoC approach is to allow any communication. This is of no use here, as our four memories do not communicate with one another. We have here an oriented dataflow application with specific communications. Our dataflow NoC has the advantages of a NoC (systematic design, predictability, and so on) and the advantages of point-to-point communications (low latency and optimized, well-sized links), to obtain the best performance/cost tradeoff and to use fewer memory blocks, which is important for image algorithms storing huge quantities of data inside the chip.

Also, the number of pins for the point-to-point communication is significantly higher compared to both NoC versions, even with a simple communication pattern. This indicates that the point-to-point communication requires many more wires inside the chip when the complete system is implemented, which can be a limitation for complex multiprocessor SoCs. Furthermore, the frequency of the point-to-point communication is somewhat lower than that of the NoC versions.

Resource allocations show the benefits of using a NoC architecture for image analysis algorithms compared to traditional point-to-point architectures. The following implementations focus on the NoC architectures. Both versions are analyzed and explored when implemented on FPGA.

The number of resources for version 1 is less than for version 2, but with less bandwidth (a single switch compared to 2 switches for version 2). The choice of one version of the NoC is made from the tradeoff between timing and resources. Optimizing the number of resources leads to the choice of version 1, whereas version 2 is adapted to higher bandwidth requirements.

4.5. DSE of the Proposed NoC. Knowledge of the design space of architectures is important for the designer to make design tradeoffs. Finding the architecture most adapted to the algorithm requirements requires a Design Space Exploration (DSE) of the NoC architectures. The exploration presents the relationships between input parameter values and performance results in order to make tradeoffs in the implemented architecture. In DSE, the parameter values guide the designer to make choices and tradeoffs while satisfying the required performance, without requiring implementation runs.

(i) Input parameters:

(a) number of switches,
(b) number and width of VCs,
(c) depth of FIFOs/VCs.

(ii) Performances:

(a) logic utilization,
(b) ALUTs,
(c) registers,
(d) latency,
(e) frequency.

The input parameters are explored to see their effect on the performances. Performances are focused on the resources first. The purpose of DSE is to find the most appropriate parameter values to obtain the most suitable design. Hence, it needs to find the inverse transformation from the performance back to the parameters, as represented by the "light bulb" in Figure 11. The Y-chart environment is realized by means of retargetable simulation. It links the parameter values used to define an architecture instance to performance numbers [30–32].

The depth of the FIFOs is 32 for both versions. The width is 24 bits for Version 1 and 8 bits for Version 2. Note that in Version 2, FIFOs do not only exist in the VCs, but also in the NA at the input of the NoC (shown in Figure 5). The FPGA is an Altera Stratix II EP2S15F484C3, implemented with Quartus II 7.2, with the DK Design Tool used as the synthesis tool.

4.5.1. Results of Version 1 (Parameter: Depth of the FIFOs/VCs, Performance: Device Utilization). Figure 12 presents the DSE of the proposed communication architecture. The width of the Version 1 FIFOs is 24 bits. The depth of a FIFO is the number of packets stored in the FIFO. This version corresponds to the case where all the data have the same lengths.

Figure 12 shows that as the depth of the VCs increases, the device utilization increases almost linearly. At the maximum depth of 89, there are not enough Logic Array Blocks (LABs) for the architecture implementation. The growth in memory blocks is twice as large as the growth in total registers.

Table 5: Comparison of the resources: point-to-point versus NoC.

Resources | Point-to-point | Version 1 (48-bit) | Version 1 (56-bit) | Version 2
Logic utilization | 1% | 2% | 3% | 3%
Combinational ALUTs | 305 | 1842 | 2118 | 2521
Dedicated logic registers | 1425 | 2347 | 2739 | 4217
Total pins | 512 | 344 | 408 | 230
Total block memory bits | 29568 | 3384 | 3960 | 8652
Frequency (MHz) | 165.73 | 264.34 | 282.41 | 292.31

Figure 11: The inverse transformation from the performance back to the parameters (the architecture model in VHDL and the application model are mapped and evaluated by retargetable simulation to produce performance numbers).

Figure 12: The device utilization summary of Version 1 on Altera Stratix II (logic utilization, combinational ALUTs, dedicated logic registers, and block memory bits versus depth of FIFOs/VCs).

When the DSE reaches the maximum depth of the FIFOs/VCs, the utilization of ALUTs is 80%, but only 5% of the block memory is used; that is, the synthesis tool does not target memory blocks here for the implementation of the FIFOs.

4.5.2. Results of Version 1 (Parameter: Width of FIFOs/VCs, Performance: Device Utilization). In Figure 13, the DSE uses two depths for the FIFOs: an 8-data depth (depicted as a solid line) and a 32-data depth (depicted as a dotted line). As the data can be parameterized in flits in the NoC, the size of packets ranges from 20 bits to 44 bits. For the data width, we go from 8-bit data (the minimal size for the defined data in the multispectral image analysis architecture), corresponding to 20-bit packets (adding the 12-bit header/tail), to 32-bit data, corresponding to 44-bit packets. The X-axes in Figure 13 represent the width of the FIFOs/VCs, which is the length of the packets in the transmission (we consider here size of packet = width of FIFOs).

Results show that the number of resources depends on the width of the FIFOs. The limiting parameters in the size of the FIFOs are the amounts of logic and registers. With a depth of 32, 20% of the registers and 40% of the logic are used. The use of logic grows more significantly with a depth of 32. All required resources can be found from a linear equation extracted from the figure, so resource predictions can be made without requiring any implementation. The same comment applies to memory blocks.

4.5.3. Results of Version 2 (Parameter: Depth of the FIFOs/VCs, Performance: Device Utilization). To solve the problem of the fixed width of Version 1, Version 2 uses the flit method. Version 2 supports different lengths of transmitted data, which present different widths at each input of the NoC. The total number of bits transmitted per data set is 224 bits (72 bits + 64 bits + 64 bits + 24 bits).

Figure 14 shows the resource utilization summary on Stratix II for Version 2, which has characteristics similar to those of Version 1. In the data transmission of Version 2, all the data are divided into 8-bit flits by the NA, which reduces the overall width of the FIFOs in the VCs.

Figure 13: The device utilization summary with fixed depths of FIFOs/VCs but different widths, on Altera Stratix II, for Version 1 (combinational ALUTs, dedicated logic registers, and total memory bits for depths 8 and 32).

Figure 14: The device utilization summary on Altera Stratix II for Version 2 (logic utilization, combinational ALUTs, dedicated logic registers, and block memory bits versus depth of FIFOs/VCs).

Comparing these two versions, Version 1 has fixed widths, which is suitable for data having the same size. The structure of Version 1 is simpler, requires fewer resources, and has better latency. But when transmitting data of widely different lengths/sizes, Version 2 is better than Version 1 because it can adapt precisely to the data sizes to obtain an optimal solution.

4.5.4. Results of Version 2 (Parameter: Synthesis Tool, Performance: Device Utilization). Two different synthesis tools have been chosen, DK Design Suite V5.0 SP5 from Mentor Graphics [33] and Synopsys Design Compiler [34], to analyze the impact of synthesis tools on the DSE. From a single VHDL description corresponding to Version 2, these two tools gave quite different synthesis results on the same Altera Stratix II EP2S15F484C3 (6240 ALMs), as depicted in Figure 15.

These two tools were chosen as the default synthesis tools in Quartus II 7.1 and 7.2. In Figure 15, all the lines with markers ("o", "x", etc.) present the synthesis results produced by DK Design Suite, and all the lines without markers are the results synthesized by Design Compiler. The maximum depths synthesized are 63 for Design Compiler and 120 for DK Design Suite. For the memory block utilization, the two tools behave in the same way, but for the other implementation factors, the results are quite different. At certain depths of the FIFOs/VCs, the resource utilizations are similar (e.g., 14, 30, 62). However, at most other depths, the results of DK Design Suite are better than those of Design Compiler, meaning it takes fewer resources on the FPGA. The way resource utilization increases is quite different as well: with the Design Compiler synthesis tool, device utilization increases in several "steps", whereas with DK Design Suite it increases more smoothly and linearly. In this case, synthesis with DK Design Suite makes resource utilization prediction much easier than with Design Compiler.

Figure 15: The device utilization summary on Altera Stratix II for Version 2 with different synthesis tools: Design Compiler and DK Design Suite.

5. Conclusion and Perspectives

The presented architecture is a parameterized architecture dedicated to image analysis applications on FPGA. All flows and data are analyzed to propose two generic NoC architectures: a ring for results and commands and a dedicated FT NoC for data. Using both communication architectures, the designer inserts several modules; the number and type of modules depend on the algorithm requirements. The proposed NoC for data transfer is, more precisely, a parameterized TDM architecture which is fast, flexible, and adaptable to the type and size of data used by the given image analysis application. This NoC uses a Fat Tree topology with VC packet switching and parameterized flits.

According to the implementation constraints on area and speed, the designer chooses one version and can adapt the communication to optimize the area/bandwidth/latency tradeoff. Adaptation consists in adding several switches in parallel or in series and in sizing the data (and flits), FIFOs, and virtual channels for each switch. Without any implementation, the designer can predict the resources used and required. This generic Fat Tree topology allows us to generate and systematically explore a communication infrastructure in order to design efficiently any dataflow image analysis application.

Future work will focus on automating the exploration of the complete architecture and on analyzing the algorithm/architecture matching according to the different required data. From the evaluation of the NoC exploration, an automated tool can predict the most appropriate communication architecture for data transfer and the required resources. Power analysis will be added to complete the Design Space Exploration of the NoC architecture. Power, area, and latency/bandwidth are the values which will guide the exploration process.

References

[1] P. Taylor, "Nvidia opens mobile GPU kimono: slideware shows higher performance, lower TDPs," The Inquirer, June 2009.

[2] Z. Yuhong, H. Lenian, X. Zhihan, Y. Xiaolang, and W. Leyu, "A system verification environment for mixed-signal SOC design based on IP bus," in Proceedings of the 5th International Conference on ASIC, vol. 1, pp. 278–281, 2003.

[3] U. Farooq, M. Saleem, and H. Jamal, "Parameterized FIR filtering IP cores for reusable SoC design," in Proceedings of the 3rd International Conference on Information Technology: New Generations (ITNG '06), pp. 554–559, 2006.

[4] S. H. Chang and S. D. Kim, "Reuse-based methodology in developing system-on-chip (SoC)," in Proceedings of the 4th International Conference on Software Engineering Research, Management and Applications (SERA '06), pp. 125–131, Seattle, Wash, USA, August 2006.

[5] F. Moraes, N. Calazans, A. Mello, L. Moller, and L. Ost, "HERMES: an infrastructure for low area overhead packet-switching networks on chip," Integration, the VLSI Journal, vol. 38, no. 1, pp. 69–93, 2004.

[6] T. Marescaux, A. Bartic, D. Verkest, S. Vernalde, and R. Lauwereins, "Interconnection networks enable fine-grain dynamic multi-tasking on FPGAs," in Proceedings of the 12th International Conference on Field-Programmable Logic and Applications (FPL '02), pp. 795–805, 2002.

[7] C. A. Zeferino and A. A. Susin, "SoCIN: a parametric and scalable network-on-chip," in Proceedings of the 16th Symposium on Integrated Circuits and Systems Design (SBCCI '03), pp. 169–174, 2003.

[8] E. Salminen, A. Kulmala, and T. D. Hamalainen, "HIBI-based multiprocessor SoC on FPGA," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 3351–3354, 2005.

[9] C. Hilton and B. Nelson, "PNoC: a flexible circuit-switched NoC for FPGA-based systems," IEE Proceedings: Computers and Digital Techniques, vol. 153, no. 3, pp. 181–188, 2006.

[10] C. Bobda and A. Ahmadinia, "Dynamic interconnection of reconfigurable modules on reconfigurable devices," IEEE Design & Test of Computers, vol. 22, no. 5, pp. 443–451, 2005.

[11] H. G. Lee, U. Y. Ogras, R. Marculescu, and N. Chang, "Design space exploration and prototyping for on-chip multimedia applications," in Proceedings of the 43rd Design Automation Conference, pp. 137–142, 2006.

[12] A. Ahmadinia, C. Bobda, J. Ding, et al., "A practical approach for circuit routing on dynamic reconfigurable devices," in Proceedings of the 16th International Workshop on Rapid System Prototyping (RSP '05), pp. 84–90, 2005.

[13] V. Fresse, A. Aubert, and N. Bochard, "A predictive NoC architecture for vision systems dedicated to image analysis," EURASIP Journal on Embedded Systems, vol. 2007, Article ID 97929, 13 pages, 2007.

[14] G. Schelle and D. Grunwald, "Exploring FPGA network on chip implementations across various application and network loads," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '08), pp. 41–46, September 2008.

[15] P. Wagener, "Metastability—a designer's viewpoint," in Proceedings of the 3rd Annual IEEE ASIC Seminar and Exhibit, pp. 14/7.1–14/7.5, 1990.

[16] E. Brunvand, "Implementing self-timed systems with FPGAs," in FPGAs, W. Moore and W. Luk, Eds., pp. 312–323, Abingdon EE&CS Books, Abingdon, UK, 1991.

[17] W. J. Dally, "Virtual-channel flow control," IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 2, pp. 194–205, 1992.

[18] E. Rijpkema, K. Goossens, A. Radulescu, et al., "Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip," IEE Proceedings: Computers and Digital Techniques, vol. 150, no. 5, pp. 294–302, 2003.

[19] H.-S. Wang, L.-S. Peh, and S. Malik, "A power model for routers: modeling Alpha 21364 and InfiniBand routers," in Proceedings of the 10th High Performance Interconnects, pp. 21–27, 2002.

[20] P. Gupta and N. McKeown, "Designing and implementing a fast crossbar scheduler," IEEE Micro, vol. 19, no. 1, pp. 20–28, 1999.

[21] W. J. Dally and B. Towles, "Route packets, not wires: on-chip interconnection networks," in Proceedings of the Design Automation Conference (DAC '01), pp. 684–689, 2001.

[22] F. Koning and W. Praefcke, "Multispectral image encoding," in Proceedings of the International Conference on Image Processing (ICIP '99), vol. 3, pp. 45–49, October 1999.

[23] A. Kaarna, P. Zemcik, H. Kalviainen, and J. Parkkinen, "Multispectral image compression," in Proceedings of the 14th International Conference on Pattern Recognition, vol. 2, pp. 1264–1267, August 1998.

[24] D. Tretter and C. A. Bouman, "Optimal transforms for multispectral and multilayer image coding," IEEE Transactions on Image Processing, vol. 4, no. 3, pp. 296–308, 1995.

[25] P. Zemcik, M. Frydrych, H. Kalviainen, P. Toivanen, and J. Voracek, "Multispectral image colour encoding," in Proceedings of the 15th International Conference on Pattern Recognition, vol. 3, pp. 605–608, September 2000.

[26] A. Manduca, "Multispectral image visualization with nonlinear projections," IEEE Transactions on Image Processing, vol. 5, no. 10, pp. 1486–1490, 1996.

[27] D. Tzeng, Spectral-based color separation algorithm development for multiple-ink color reproduction, Ph.D. thesis, R.I.T., Rochester, NY, USA, 1999.

[28] E. A. Day, The effects of multi-channel spectrum imaging on perceived spatial image quality and color reproduction accuracy, M.S. thesis, R.I.T., Rochester, NY, USA, 2003.

[29] L. Zhang, A.-C. Legrand, V. Fresse, and V. Fischer, "Adaptive FPGA NoC-based architecture for multispectral image correlation," in Proceedings of the 4th European Conference on Colour in Graphics, Imaging, and Vision and the 10th International Symposium on Multispectral Colour Science (CGIV/MCS '08), pp. 451–456, Barcelona, Spain, June 2008.

[30] A. C. J. Kienhuis and E. F. Deprettere, Design space exploration of stream-based dataflow architectures: methods and tools, Ph.D. thesis, Delft University of Technology, Delft, The Netherlands, 1999.

[31] H. P. Peixoto and M. F. Jacome, "Algorithm and architecture-level design space exploration using hierarchical data flows," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 272–282, July 1997.

[32] V. Krishnan and S. Katkoori, "A genetic algorithm for the design space exploration of datapaths during high-level synthesis," IEEE Transactions on Evolutionary Computation, vol. 10, no. 3, pp. 213–229, 2006.

[33] Mentor Graphics, "DK Design Suite Tool," http://www.agilityds.com/products/c_based_products/dk_design_suite/.

[34] "RTL-to-Gates Synthesis using Synopsys Design Compiler," http://csg.csail.mit.edu/6.375/6_375_2008_www/handouts/tutorials/tut4-dc.pdf.

Hindawi Publishing Corporation, EURASIP Journal on Embedded Systems, Volume 2009, Article ID 318654, 19 pages, doi:10.1155/2009/318654

Research Article

Efficient Processing of a Rainfall Simulation Watershed on an FPGA-Based Architecture with Fast Access to Neighbourhood Pixels

Lee Seng Yeong, Christopher Wing Hong Ngau, Li-Minn Ang, and Kah Phooi Seng

School of Electrical and Electronics Engineering, The University of Nottingham, 43500 Selangor, Malaysia

Correspondence should be addressed to Lee Seng Yeong, [email protected]

Received 15 March 2009; Accepted 9 August 2009

Recommended by Ahmet T. Erdogan

This paper describes a hardware architecture to implement the watershed algorithm using rainfall simulation. The speed of the architecture is increased by utilizing a multiple memory bank approach to allow parallel access to the neighbourhood pixel values. In a single read cycle, the architecture is able to obtain all five values of the centre and four neighbours for a 4-connectivity watershed transform. The storage requirement of the multiple bank implementation is the same as a single bank implementation by using a graph-based memory bank addressing scheme. The proposed rainfall watershed architecture consists of two parts. The first part performs the arrowing operation and the second part assigns each pixel to its associated catchment basin. The paper describes the architecture datapath and control logic in detail and concludes with an implementation on a Xilinx Spartan-3 FPGA.

Copyright © 2009 Lee Seng Yeong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Image segmentation is often used as one of the main stages in object-based image processing. For example, it is often used as a preceding stage in object classification [1–3] and object-based image compression [4–6]. In both these examples, image segmentation precedes the classification or compression stage and is used to obtain object boundaries. This leads to an important reason for using the watershed transform for segmentation, as it results in the detection of closed boundary regions. In contrast, boundary-based methods such as edge detection detect places where there is a difference in intensity. The disadvantage of this method is that there may be gaps in the boundary where the gradient intensity is weak. By using a gradient image as input into the watershed transform, qualities of both the region-based and boundary-based methods can be obtained.

This paper describes a watershed transform implemented on an FPGA for image segmentation. The watershed algorithm chosen for implementation is based on the rainfall simulation method described in [7–9]. There is an implementation of a rainfall-based watershed algorithm on hardware proposed in [10], using a combination of a DSP and an FPGA. Unfortunately, the authors do not give much detail on the hardware part and their architecture. Other sources have implemented a watershed transform on reconfigurable hardware based on the immersion watershed techniques [11, 12]. There are two advantages of using a rainfall-based watershed algorithm over the immersion-based techniques. The first advantage is that the watershed lines are formed in-between the pixels (zero-width watershed). The second advantage is that every pixel belongs to a segmented region. In immersion-based watershed techniques, the pixels themselves form the watershed lines. A common problem that arises from this is that these watershed lines may have a width greater than one pixel (i.e., the minimum resolution in an image). Also, pixels that form part of the watershed line do not belong to a region. Other than leading to inaccuracies in the image segmentation, this also slows down the region merging process that usually follows the calculation of the watershed transform. Other researchers have proposed using a hill-climbing technique for their watershed architecture [13]. This technique is similar to rainfall simulation except that it starts from the minima and climbs by the steepest slope. With suitable modifications, the techniques proposed in this paper can also be applied for implementing a hill-climbing watershed transform.

This paper describes a hardware architecture to implement the watershed algorithm using rainfall simulation. The speed of the architecture is increased by utilizing a multiple memory bank approach to allow parallel access to the neighbourhood pixel values. This approach has the advantage of allowing the centre and neighbouring pixel values to be obtained in a single clock cycle without the need for storing multiple copies of the pixel values. Compared to the memory architecture proposed in [14], our proposed architecture is able to obtain all five values required for the watershed transform in a single read cycle. The method described in [14] requires two read cycles, one read cycle for the centre pixel value using the Centre Access Module (CAM) and another read cycle for the neighbouring pixels using the Neighbourhood Access Module (NAM).

The paper is structured as follows. Section 2 describes the implemented watershed algorithm. Section 3 describes a multiple bank memory storage method based on graph analysis. This is used in the watershed architecture to increase processing speed by allowing multiple values (i.e., the centre and neighbouring values) to be read in a single clock cycle. This multiple bank storage method has the same memory requirement as methods which store the pixel values in a single bank. The watershed architecture is described in two parts, each with its respective examples. The parts are split up based on their functions in the watershed transform, as shown in Figure 1. Section 4 describes the first part of the architecture, called "Architecture-Arrowing", which is followed by an example of its operation in Section 5. Similarly, Section 6 describes the second part of the architecture, called "Architecture-Labelling", which is followed by an example of its operation in Section 7. Section 8 describes the synthesis and implementation on a Xilinx Spartan-3 FPGA. Section 9 summarizes this paper.

2. The Watershed Algorithm Based on Rainfall Simulation

The watershed transformation is based on visualizing an image in three dimensions: two spatial coordinates versus grey levels. The watershed transform used is based on the rainfall simulation method proposed in [7]. This method simulates how falling rain water flows from higher level regions called peaks to lower level regions called valleys. The rain drops that fall over a point will flow along the path of the steepest descent until reaching a minimum point.

The general processes involved in calculating the watershed transform are shown in Figure 1. Generally, a gradient image is used as input to the watershed algorithm. By using a gradient image, the catchment basins should correspond to the homogeneous grey level regions of the image. A common problem with the watershed transform is that it tends to oversegment the image due to noise or local irregularities in the gradient image. This can be corrected using a region merging algorithm or by preprocessing the image prior to the application of the watershed transform.

Figure 1: General preprocessing (gradient image/edge detection) and postprocessing (region merging) steps involved when using the watershed, together with the two main steps of the watershed transform: first, find the direction of the steepest descending path and label each pixel to point in that direction (arrowing); then, using the direction labels, relabel the pixels to match the label of their corresponding catchment basin (labelling).

Figure 2: The steepest descending path direction priority and the naming convention used to label the direction of the steepest descending path. (a) shows the criterion used to determine the order of the steepest descending path when there is more than one possible path, that is, when the pixel has two or more lower neighbours with equivalent values. Paths are numbered in increasing priority from the left, moving in a clockwise direction towards the right and the bottom, from the highest priority, labelled 1, to the lowest priority, labelled 4. (b) shows the labels (−1 to −4) used to indicate the direction of the steepest descent path; the labels correspond to the direction of the arrows.

The watershed transform starts by labelling each input pixel to indicate the direction of the steepest descent. In other words, each pixel points to its neighbour with the smallest value. There are two neighbour connectivity approaches that can be used. The first approach, called 8-connectivity, considers all eight neighbours surrounding the pixel; the second approach, called 4-connectivity, only considers the neighbours to its immediate north, south, east, and west. In this paper, we use the 4-connectivity approach. The direction labels are chosen to be negative values from −1 to −4 so that they do not overlap with the catchment basin labels, which start from 1. These direction labels are shown in Figure 2. There are four possible direction labels for each pixel, one for each neighbour in the vertical and horizontal directions. This process of finding the steepest descending path is repeated for all pixels so that every pixel points in the direction of steepest descent. If a pixel, or a connected group of pixels with the same value, has no neighbour with a lower value, it becomes a regional minimum. Following the steepest descending paths for each pixel leads to a minimum (or regional minima). All pixels along the steepest descending path are assigned the label of that minimum to form a catchment basin. Catchment basins are formed by the minimum and all pixels leading to it. Using this method, the region boundary lines are formed by the edges of the pixels that separate the different catchment basins.

Figure 3: Various arrowing conditions that occur. A nonplateau pixel with at least one lower neighbour is labelled towards its lowest neighbour; a pixel with no lower and no similar-valued neighbour is labelled as a minimum. A plateau is a group of connected pixels with the same value; its pixels are classified as edge pixels (having a lower neighbour) or inner pixels, the edges are labelled towards their lower neighbours, the inner pixels are classified iteratively, and a plateau whose pixels are all of lesser value than their neighbours is labelled as minima.
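To make the arrowing step concrete for non-plateau pixels, here is a minimal Python sketch: each pixel gets the direction label of its lowest 4-neighbour, with ties resolved by the priority order of Figure 2. The exact mapping of labels −1 to −4 onto compass directions is our assumption for illustration; plateau handling is treated separately below.

```python
# Direction table: (dy, dx, label) for west, north, east, south in
# decreasing priority. Mapping labels -1..-4 to these compass
# directions is an assumption for illustration.
N4 = [(0, -1, -1), (-1, 0, -2), (0, 1, -3), (1, 0, -4)]

def arrow(img):
    """Label each pixel with the direction of its lowest strict neighbour."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best_val, best_lab = img[y][x], 0      # 0 marks "no lower neighbour"
            for dy, dx, lab in N4:                 # scan in priority order
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and img[ny][nx] < best_val:
                    best_val, best_lab = img[ny][nx], lab
            out[y][x] = best_lab
    return out

print(arrow([[3, 2, 3],
             [2, 1, 2],
             [3, 2, 3]]))   # the centre pixel keeps 0: it is a minimum
```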

The earlier description assumed that there will always be only one lower-valued neighbour or none at all. However, this is often not the case. There are two other conditions that can occur during the pixel labelling operation: (1) when there is more than one steepest descending path because two or more lowest-valued neighbours have the same value, and (2) when the current pixel value is the same as any of its neighbours. The second condition is called a plateau condition and increases the complexity of determining the steepest descending path.

These two conditions are handled as follows.

(1) If a pixel has more than one steepest descending path, the steepest descending path is simply selected based on a predefined priority criterion. In the proposed algorithm, the highest priority is given to the path going up from the left, and the priority decreases as we move to the right and down. The order of priority is shown in Figure 2.

(2) If the image has regions where the pixels have the same value and are not a regional minimum, they are called nonminima plateaus. A nonminima plateau is a group of pixels which can be divided into two groups.

(i) Descending edge pixels of the plateau. This group consists of every pixel in the plateau which has a neighbour with a lower value. These pixels are simply labelled with the direction to their lower-valued neighbour.

(ii) Inner pixels. This group consists of every pixel whose neighbours all have values equal to or higher than its own.

Figure 3 shows a summary of the various arrowing conditions that may occur. Normally, the geodesic distances from the inner points to the descending edge are determined to obtain the shortest path. In our watershed transform, this step has been simplified by eliminating the need to explicitly calculate and store the geodesic distance. The method used can be thought of as a shrinking plateau. Once the edges of a plateau have been labelled with the direction of the steepest descent, the inner pixels neighbouring these edge pixels will point to those edges. These edges are then "stripped" and the neighbouring inner pixels become the new edges. This is performed until all the pixels in the plateau have been labelled with the path of steepest descent (see Section 4.7 for more information).
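A minimal sketch of this shrinking-plateau idea, assuming the plateau pixels and its already-arrowed edge pixels are known: a breadth-first pass labels each inner pixel to point at a previously labelled neighbour, layer by layer, which implicitly follows the geodesic distance without storing it.

```python
# Minimal sketch of the "shrinking plateau": starting from the
# descending-edge pixels, inner pixels are labelled layer by layer to
# point at an already-labelled neighbour. For clarity, directions are
# stored as the coordinates of the pixel pointed to.
from collections import deque

def shrink_plateau(plateau, edges):
    """plateau: set of (y, x) pixels; edges: pixels already labelled."""
    pointer = {}                       # inner pixel -> neighbour it points to
    frontier = deque(edges)
    remaining = set(plateau) - set(edges)
    while frontier:
        y, x = frontier.popleft()
        for ny, nx in ((y, x - 1), (y - 1, x), (y, x + 1), (y + 1, x)):
            if (ny, nx) in remaining:  # inner pixel next to the current edge
                pointer[(ny, nx)] = (y, x)
                remaining.remove((ny, nx))
                frontier.append((ny, nx))
    return pointer

plateau = {(0, 0), (0, 1), (0, 2)}
print(shrink_plateau(plateau, edges=[(0, 0)]))
# {(0, 1): (0, 0), (0, 2): (0, 1)}
```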

Figure 4: Example of a four-connectivity watershed performed on 8 × 8 sample data. (a) shows the original gradient image values. (b) shows the direction of the steepest descending path for each pixel, with minima highlighted by circles. (c) shows the pixels with the steepest descending paths and minima labelled; the labels used for the direction of the steepest descending path are shown on the right side of the figure. (d) shows the 8 × 8 data fully labelled: the pixels have been assigned the label of their respective minima, forming catchment basins.

The final step, once all the pixels have been labelled with the direction of steepest descent, is to assign them labels that correspond to the label of their respective minimum/minima. This is done by scanning each pixel and following the path indicated by each pixel to the next pixel. This is performed repeatedly until a minimum is reached. All the pixels on the path are then assigned the label of that minimum. An example of all the algorithm steps is shown in Figure 4. The operational flowchart of the watershed algorithm is shown in Figure 5.
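A minimal sketch of this labelling step, representing the arrowing result as a points-to map and the minima as positive basin labels, mirroring the −1 to −4 versus ≥ 1 label convention:

```python
# Minimal sketch of the labelling step: follow each pixel's direction
# pointer downhill until a labelled pixel (a minimum or an already
# labelled pixel) is reached, then back-assign that basin label to
# every pixel on the path.

def label_basins(points_to, minima_labels):
    """points_to: pixel -> next pixel downhill; minima_labels: pixel -> basin id."""
    labels = dict(minima_labels)
    for start in points_to:
        path, p = [], start
        while p not in labels:         # walk downhill until a labelled pixel
            path.append(p)
            p = points_to[p]
        for q in path:                 # assign the basin label along the path
            labels[q] = labels[p]
    return labels

points_to = {(0, 0): (0, 1), (0, 1): (1, 1), (1, 0): (1, 1)}
print(label_basins(points_to, {(1, 1): 1}))
# every pixel ends up in basin 1
```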

3. Graph-Based Memory Implementation

Before going into the details of our architecture, we will discuss a multiple bank memory storage scheme based on graph analysis. This is used to speed up operations by allowing all five pixel values required for the watershed transform to be read in a single clock cycle, with the same memory storage requirement as a single bank implementation. A similar method has been proposed in [14]. However, their method requires twice the number of read cycles compared to our proposed method: two read cycles, one to obtain the centre value and another to obtain the neighbourhood values. This effectively doubles the number of clock cycles required for reading the pixel values.

Figure 5: Watershed algorithm flowchart.

To understand why this is important, recall that one of the main procedures of the watershed transform is to find the path of the steepest descent. This requires the values of the current and neighbouring pixels. Traditionally, these values can be obtained using

(1) sequential reads: a single memory bank the size of the image is read five times, requiring five clock cycles,

(2) parallel reads: five replicated memory banks, each the size of the image, are read in parallel. This requires five times the memory needed to store a single image, but all required values can be obtained in a single clock cycle.

Using this multiple bank method, we can obtain the speed advantage of the parallel read with the nonreplicating storage required by the sequential reading method. The advantages of using this multiple bank method are to

(1) reduce the memory space required for storing the image by up to five times,

(2) obtain all values for the current pixel and its neighbours in a single read cycle, eliminating the need for a five clock cycle read.

This multiple bank memory storage stores the image in separate memory banks. This is not a straightforward division of the image pixels by the number of memory banks; a special arrangement is required that does not overlap and that supports access to five banks simultaneously to obtain the five pixel values (Centre, East, North, South, West). The problem now is to

(1) determine the number of banks required to store the image,

(2) fill the banks with the image data,

(3) access the data in these banks.

All of these steps are addressed in the following sections, in the order listed above.

Page 246: 541420

6 EURASIP Journal on Embedded Systems

Figure 6: N4 connectivity graph; two subgraphs are combined to produce an 8-bank structure allowing five values to be obtained concurrently. (a) The neighbourhood graph for 4-neighbour connectivity: each pixel is represented by a vertex (node), and two distinct subgraphs arise, with every vertex within a subgraph fully connected (via edges) to all its neighbours; note that no vertex is connected to any of its four immediate neighbours within the same subgraph. (b) The combined subgraph with nonoverlapping labels; the nonoverlapping nature allows concurrent access to the centre pixel value and its associated neighbours. Each number is colour coded and corresponds to a single bank, and the complete image is stored in eight different banks.

3.1. Determining How Many Banks Are Needed. This section describes how the number of banks needed to allow simultaneous access is determined. This depends on (1) the neighbourhood connectivity and (2) the number of values to be obtained in one read cycle. Here, graph theory is used to determine the minimum number of data banks required to satisfy the following:

(1) none of the values that we want may come from the same bank;

(2) none of the image pixels are stored twice (i.e., no redundancy).

Satisfying these criteria results in the minimum number of banks required, with no additional memory needed compared to a standard single bank storage scheme.

Imagine every pixel in an image as a region, with a vertex (node) added to each pixel. For 4-neighbour connectivity (N4), the connectivity graph is shown in Figure 6. Determining the number of banks for parallel access can then be viewed as a graph colouring problem, whereby none of the concurrently accessed values may come from the same bank: we ensure that each node has neighbours of different colours, or in our case numbers, where each colour (or number) corresponds to a different bank. The same method can be applied to other connectivity schemes such as 8-neighbour connectivity.

In our implementation of 4-neighbourhood connectivity and five concurrent memory accesses (for five concurrent values), we require eight banks. The discussion and examples to follow use these implementation criteria.
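The colouring can be sanity-checked in software. The sketch below tiles the 4 × 4 bank pattern transcribed from Figure 6 (the row/column orientation of the transcription is an assumption) and asserts that every centre/W/N/E/S stencil falls into five distinct banks:

# The repeating 4 x 4 bank pattern from Figure 6; it tiles the image.
PATTERN = [
    [6, 2, 4, 0],
    [1, 5, 3, 7],
    [4, 0, 6, 2],
    [3, 7, 1, 5],
]

def bank(r, c):
    return PATTERN[r % 4][c % 4]

# One period of the pattern suffices, since it repeats every 4 pixels.
for r in range(4):
    for c in range(4):
        stencil = [(r, c), (r, c - 1), (r - 1, c), (r, c + 1), (r + 1, c)]
        banks = {bank(rr, cc) for rr, cc in stencil}
        assert len(banks) == 5, (r, c, banks)
print("every C/W/N/E/S stencil touches five distinct banks")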


Figure 7: Block diagram of graph-based memory storage and retrieval. (a) Using cardinal directions, CWNES are the centre, west, north, east, and south values, respectively; these correspond to the current pixel and its left, top, right, and bottom neighbour values. (b) Any filling order is possible; for any filling order, the bank and the address within the bank are determined by the same logic (see Figure 8). The figure traces a traditional raster scan from top left to bottom right, one pixel at a time, showing the resulting bank_select sequence, and marks pixel location (3, 3) as used in the addressing scheme example.

3.2. Filling the Banks. After determining how many banks are needed, we need to fill them. This is done by writing the individual values one at a time into their respective banks. During the determination of the number of required banks, a pattern emerges from the connectivity graph. An example of this pattern is highlighted with a detached bounding box in Figures 6 and 7.

The eight banks are filled one value at a time. This can be done in any order: the bank number and bank address are calculated using some logic, and the same logic is used to determine the bank and bank address during reading (see Section 3.3 for more details). For ease of explanation, we shall adopt a raster scan type of sequence. Using this convention, the order of filling is simply the order of the bank numbers as they appear from top-left to bottom-right. An example of this is shown in Figure 7.

The group of banks replicates itself every four pixels in either direction (i.e., right and down). Hence, to determine how many times the pattern is replicated, the image size is simply divided by sixteen; alternatively, either of its sides can be divided by four, since all images are square. This is important, as the addressing for filling the banks (and reading) holds true for square images whose sides are powers of two (i.e., 2², 2³, 2⁴). Image sizes which are not square are simply padded.
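The filling step itself is then a single raster scan, sketched below. Here each value is simply appended to its bank in arrival order, standing in for the in-bank address logic of Section 3.3; the pattern constant is the Figure 6 transcription used above.

PATTERN = [[6, 2, 4, 0], [1, 5, 3, 7], [4, 0, 6, 2], [3, 7, 1, 5]]

def fill_banks(image):
    # Raster scan: top-left to bottom-right, one pixel at a time.
    banks = [[] for _ in range(8)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            banks[PATTERN[r % 4][c % 4]].append(value)
    return banks

# For a 2^n x 2^n image, each bank receives exactly 1/8 of the pixels,
# so the total storage equals one copy of the image.
image = [[10 * r + c for c in range(8)] for r in range(8)]
assert all(len(b) == 8 for b in fill_banks(image))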

3.3. Accessing Data in the Banks. To access the data in this multiple bank scheme, we need to know (1) which bank the data is in and (2) the location within that bank. The addressing scheme is a simple one based on the pixel location. A hardware unit called the Address Processor (AP) handles the memory addressing. Given a pixel location, the AP calculates the address needed to retrieve that pixel value; this address tells us in which bank, and at which location within that bank, the pixel value is stored.

To understand how the AP works, consider a pixel coordinate consisting of a row and a column value, with the origin located at the upper left corner. These two values are represented in binary form, and the least significant bits of the column and row are used to determine the bank. The number of bits required to represent the bank number depends on the total number of banks in the multiple bank scheme; in our case of eight banks, three bits of the address are needed to determine in which bank the value for a particular pixel location is stored. These binary values go through some logic, shown in Figure 8, or in equation form:

B[2] = r[0]c[0]′ + c[0]r[0]′,
B[1] = r[1]′r[0]′c[1] + r[1]r[0]′c[0]′ + r[1]′r[0]c[0]′ + r[1]r[0]c[0],
B[0] = r[0],
(1)

where B[0 → 2] are the three bits that determine the bank number (from 0 → 7), r[0] and r[1] are the first two bits of the row value in binary, and c[0] and c[1] are the first two bits of the column value in binary.
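As a reference, the following is a direct Python transcription of equation (1) as printed, using bitwise operations with ^ 1 for complement; it reproduces the worked example of Figure 8, in which pixel location (3, 3) maps to bank 3.

def bank_select(r, c):
    # Least significant two bits of the row and column values.
    r0, r1 = r & 1, (r >> 1) & 1
    c0, c1 = c & 1, (c >> 1) & 1
    b2 = (r0 & (c0 ^ 1)) | (c0 & (r0 ^ 1))
    b1 = ((r1 ^ 1) & (r0 ^ 1) & c1) | (r1 & (r0 ^ 1) & (c0 ^ 1)) \
         | ((r1 ^ 1) & r0 & (c0 ^ 1)) | (r1 & r0 & c0)
    b0 = r0
    return (b2 << 2) | (b1 << 1) | b0

assert bank_select(3, 3) == 3  # the worked example of Figure 8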


Now that we have determined which bank the value is in, the remaining bits are used to determine the location of the value within that bank. An example is given in Figure 8(a).

For an image of y rows and x columns, the number of bits required for addressing is simply the number of bits required to store the largest row and column values in binary, that is, number of address bits = log2(x) + log2(y).

This addressing scheme is shown in Figure 8. (Note that the steps described here assume an image with a minimum size of 4 × 4, increasing in powers of 2.)

3.4. Sorting the Data from the Banks. After the five values are obtained from the banks, they need to be sorted according to the expected neighbour location output, to ensure that the value for a particular direction is sent to the right output position. This sorting is handled by another hardware unit called the Crossbar (CB). In addition, the CB tags invalid values arising from invalid neighbour conditions, which occur at the corners and edges of the image. This tagging is part of the output multiplexer control.

The complete structure for reading from the banks is shown in Figure 9. Five pixel locations are fed into the AP, which generates five addresses, for the centre and its four neighbours. These five addresses are fed into all eight banks, but only the address corresponding to the correct bank is chosen by add_sel_x, where x = 0 → 7. The addresses fed into the banks generate eight values; however, only five are chosen by the CB. These values are also sorted by the CB to ensure that the values corresponding to the centre pixel and each particular neighbour are output onto the correct data lines. The mux control, CB_sel_x, is driven by the same logic that selects add_sel_x.
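Behaviourally, one read cycle can be modelled as below. The banks are represented as dictionaries keyed by pixel coordinate, which sidesteps the in-bank address bits; the (row, column) stencil order and the predefined INVALID value of 255 (see Section 5) follow the text, while the function names are illustrative.

PATTERN = [[6, 2, 4, 0], [1, 5, 3, 7], [4, 0, 6, 2], [3, 7, 1, 5]]
INVALID = 255  # value output by the CB mux for invalid neighbours

def build_banks(image):
    banks = [dict() for _ in range(8)]
    for r, row in enumerate(image):
        for c, v in enumerate(row):
            banks[PATTERN[r % 4][c % 4]][(r, c)] = v
    return banks

def read_cwnes(banks, rows, cols, r, c):
    # (C, W, N, E, S) coordinates, as generated by the address processors.
    coords = [(r, c), (r, c - 1), (r - 1, c), (r, c + 1), (r + 1, c)]
    vals = []
    for rr, cc in coords:
        if 0 <= rr < rows and 0 <= cc < cols:
            # Five distinct banks, hence one concurrent read cycle.
            vals.append(banks[PATTERN[rr % 4][cc % 4]][(rr, cc)])
        else:
            vals.append(INVALID)  # edge/corner reads tagged by the CB
    return vals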

4. Arrowing Architecture

This section provides the details of the part of the architecture that performs the arrowing function of the algorithm, that is, how we get from Figure 4(a) to Figure 4(c) in hardware. As mentioned in the description of the algorithm, things are simple when every pixel has a lower neighbour and become more complicated under plateau conditions; similarly, the plateau condition complicates the architecture. Adding to this complexity is the fact that all neighbour values are obtained simultaneously: instead of processing one value at a time, we have to process five values, the centre and its four neighbours. The part of the architecture that performs the arrowing is shown in Figure 10.

When a pixel location is fed into the system, it enters the "Centre and Neighbour Coordinates" block. From this, the coordinates of the centre and its four neighbours are output and fed into the "Multibank Memory" block to obtain all the pixel values, and into the "Pixel Status" block to obtain the pixel status (PS).

Figure 8: The addressing scheme for the multiple bank graph-based memory storage. (a) Example of the location-to-address calculation for pixel location (3, 3): the row and column values are written in binary, the bank address logic of equation (1) gives the 3-bit bank number B[0 → 2] from the LSBs of the row and column values, and the remaining bits give the location within that bank, here address 2 of bank 3. The example follows the convention that the first pixel location is (0, 0) and that bank and address counts start from 0, that is, the first bank is 0 and the last bank is 7; similarly, the first address location is 0 and the last is 7. For 8 banks, 3 bits are needed to determine in which bank the data is located; for 4 and 16 banks, 2 and 4 bits are required, respectively.

Assuming the normal state, the input pixel will have a lower neighbour and no neighbours of the same value, that is, inner = 0 and plat = 0. The pixel will simply be arrowed to its smallest-valued neighbour, and the Pixel Status (PS) for that pixel will be changed from 0 → 6 (see Figure 19).

Figure 9: 8-bank memory architecture. The five pixel neighbour coordinates (c, r), (c − 1, r), (c, r − 1), (c + 1, r), and (c, r + 1) enter the address processors AP-C, AP-W, AP-N, AP-E, and AP-S; the resulting addresses are routed to banks B0–B7 by add_sel_0 → add_sel_7, and the crossbar (CB), controlled by CB_sel_0 → CB_sel_4, sorts the eight bank outputs onto the C, W, N, E, S data lines, tagging invalid values (inv).

However, if the pixel has a similar-valued neighbour, plat = 1 and plateau processing starts. Plateau processing begins by finding all of the current pixel's neighbours of similar value and writing them to Q1, which is predefined to be the first queue to be used. After writing to the queue, the PS of these pixels is changed from 0 → 1. This indicates which pixel locations have been written to the queue, avoiding duplicate entries. At the end of this process, all the pixel locations belonging to the plateau will have been written to Q1.

To keep track of the number of elements in Q1_WNES, two sets of memory counters are used: mc1 → mc4 in one set and mc6 → mc9 in the other. When writing to Q1_WNES, both sets of counters are incremented in parallel, but when reading from Q1_WNES to obtain the neighbouring plateau pixels, only mc1–4 is decremented while mc6–9 remains unchanged. This means that, at the end of the Stage 1 processing, mc1–4 = 0 and mc6–9 contain the count of the pixel locations held in Q1_WNES. This is needed to handle the case of a complete lower minima (i.e., a plateau consisting entirely of inner pixels). When this type of plateau is encountered, mc1–5 = 0, and Q1_WNES is read once again using mc6–9, this time not to obtain the same-valued neighbours but to label all the pixel locations within Q1_WNES with the value currently stored in the minima register. Otherwise, mc5 > 0 and values are read from Q1_C and subsequently from Q2_WNES and Q1_WNES, until all the locations in the plateau have been visited and classified. The plateau processing steps and the associated conditions are shown in Figure 11.
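A software analogue of this two-queue scheme is sketched below: after the plateau edge pixels have been arrowed, labelling proceeds inward one wave at a time, alternating between two queues, so that each inner pixel points along the shortest geodesic path to the edge. This sketches the control flow only; the hardware tracks queue occupancy with the memory counters mc1–10 rather than len(), and the function and argument names are illustrative.

from collections import deque

def arrow_plateau(plateau_edges, neighbours, arrows):
    # plateau_edges: already-arrowed edge pixels of one plateau.
    # neighbours(p): same-valued plateau neighbours of pixel p.
    # arrows: dict mapping each pixel to the pixel it points to.
    q1, q2 = deque(plateau_edges), deque()
    visited = set(plateau_edges)
    while q1:
        for p in q1:                    # wave k: geodesic distance k
            for n in neighbours(p):
                if n not in visited:
                    arrows[n] = p       # inner UP arrows to a labelled LP
                    visited.add(n)
                    q2.append(n)
        q1, q2 = q2, deque()            # swap queues for the next wave
    return arrows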

There are other parts which are not shown in the main diagram but warrant a discussion. These are

(1) memory counters—to determine the number of unprocessed elements in a queue,

(2) priority encoder—to determine the controls for Q1_sel and Q2_sel.

The rest of the architecture consists of a few main parts, shown in Figure 10:

(1) centre and neighbour coordinates—to obtain the centre and neighbour locations,

(2) multibank memory—to obtain the five required pixel values,

(3) smallest-valued neighbour—to determine which neighbour has the smallest value,

(4) plat/inner—to determine if the current pixel is part of a plateau and whether it is an edge or inner plateau pixel,

(5) arrowing—to determine the direction of the steepest descent, which is written to the "Arrow Memory",

(6) pixel status—to determine the status of the pixels, that is, whether they have been read before, put into a queue before, or have been labelled.

Figure 10: Watershed architecture based on rainfall simulation. Shown here is the arrowing architecture, which starts from pixel memory and ends with an arrow memory whose labels indicate the steepest descending paths.

The next subsections describe the parts listed above in the same order.

Figure 11: Stages of plateau processing and their various conditions. Stage 1 reads all similar-valued neighbouring pixels into Q1_WNES. Stage 2 reads from Q1_C, labels pixels, and writes their similar-valued neighbours to Q2_WNES; processing then alternates between reading Q2_WNES (labelling pixels and writing similar-valued neighbours to Q1_WNES) and reading Q1_WNES (writing to Q2_WNES) while mc6–9 > 0. If instead mc5 = 0 (a complete lower minima), all of Q1_WNES is read using mc6–9 and labelled with the value from the minima register (the inner arrowing stage). Notes: (1) in Stage 1, mc6–9 is used as a secondary counter for Q1_WNES; it increments with mc1–4 but does not decrement when mc1–4 is decremented. In Stage 2, if mc5 = 0, mc6–9 is used as the counter tracking the number of elements in Q1_WNES and is decremented when Q1_WNES is read; if mc5 > 0, mc6–9 is reset and resumes the role of memory counter for Q2_WNES. (2) Q1_C is only ever used once, during Stage 2 of the processing.

4.1. Memory Counter. The architecture is a tristate system whose state is determined by whether the queues Q1 and Q2 are empty or not, as shown in Figure 12. These states in turn determine the control of the main multiplexer, in_ctrl, which selects the data input into the system.

Figure 12: State diagram of the arrowing architecture. The in_ctrl values equal the state numbers (0, 1, 2); the transition conditions are products of E1 and E2, where E1 = 1 when Q1 is empty and E2 = 1 when Q2 is empty.

Figure 13: Memory counters for the C, W, N, E, and S queues. Each counter mc1 → mc10 is incremented by the corresponding write enable we_t1 → we_t10 and decremented by the corresponding Q1_sel or Q2_sel selection; the counters are used to determine the number of elements in the various queues for the directions Centre, West, North, East, and South.

To determine the initial queue states, Memory Counters (MCs) are used to keep track of how many elements are pending processing in each of the West, North, East, South, and Centre queues. There are five MCs for Q1 and another five for Q2, one counter for each of the queue directions. These MCs are named mc1–5 for Q1_W, Q1_N, Q1_E, Q1_S, and Q1_C, respectively, and similarly mc6–10 for Q2_W, Q2_N, Q2_E, Q2_S, and Q2_C, respectively. This is shown in Figure 13.

Table 1: Comparison of the number of clock cycles required for reading all five values and the memory requirements for the three different methods.

                 Sequential       Parallel         Graph-based
Clock cycles     5                1                1
Memory Req.      1x image size    5x image size    1x image size

The MCs increase by one count each time an element is written to a queue and decrease by one count each time an element is read from it. The increment is determined by tracking the write enables we_tx, where x = 1 → 10, while the decrement is determined by tracking the values of Q1_sel and Q2_sel.

A special case occurs during stage one of plateau processing, whereby mc6–9 is used to count the number of elements in Q1_W, Q1_N, Q1_E, and Q1_S, respectively. In this stage, mc6–9 is incremented when the queues are written to but is only decremented when Q1_WNES is read again in stage two, for complete lower minima labelling.

An MC consists primarily of a register and a multiplexer which selects between a (+1) increment and a (−1) decrement of the current register value; selecting between these two values and writing the result back to the register effectively counts up and down. The update of the MC register value is controlled by a write enable, the output of a 2-input XOR gate, which ensures that the MC register is updated only when exactly one of its inputs is active.

4.2. The Priority Encoder. The priority encoder determines the outputs Q1_sel and Q2_sel by comparing the outputs of the MCs to zero. It selects the output from the queues in the order in which they are stored, that is, from queue Qx_W to Qx_C, with x = 1 or 2. Together with the state of in_ctrl, Q1_sel and Q2_sel determine the data input into the system. The logic that determines the control bits for Q1_sel and Q2_sel is shown in Figure 14.
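Behaviourally, each encoder reduces to "first non-empty queue wins" in the fixed W, N, E, S, C order, as sketched below; the numeric encoding of the disable condition is an assumption, since only its purpose is stated.

DISABLE = 0b111  # assumed code for the "disable" condition

def q_sel(mc):
    # mc: element counts for the (W, N, E, S, C) queues of Q1 or Q2.
    for i, count in enumerate(mc):
        if count > 0:
            return i        # 0 = W, 1 = N, 2 = E, 3 = S, 4 = C
    return DISABLE          # all queues empty: keep the initial state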

4.3. Centre and Neighbour Coordinate. The centre and neighbourhood block determines the coordinates of the pixel's neighbours and passes through the centre coordinate. These coordinates are used to address the various queues and the multibank memory. The block performs an addition and a subtraction by one unit on both the row and column coordinates, rearranged and grouped into the respective outputs. The outputs of the block are five pixel locations, corresponding to the centre pixel location and the four neighbours West (W), North (N), East (E), and South (S). This is shown in Figure 15.

4.4. The Smallest-Valued Neighbour Block. This block determines the smallest-valued neighbour (SVN) and its position relative to the current pixel. This is used to determine whether the current pixel has a lower neighbour and to find the steepest descending path to the minima (arrowing).

Figure 14: The priority encoder. (a) shows the controls for Q1_sel and Q2_sel using the priority encoders; the outputs of the memory counters (compared to zero, giving inputs a–e from mc1–5 and f–j from mc6–10) determine the multiplexer controls of Q1_sel and Q2_sel. (b) shows the logic of the priority encoders used: Q1_sel[0] = a′ + abc′ + abcde′, Q1_sel[1] = ab′ + abc′, Q1_sel[2] = abcd′ + abcde′, and correspondingly Q2_sel[0] = f′ + fgh′ + fghij′, Q2_sel[1] = fg′ + fgh′, Q2_sel[2] = fghi′ + fghij′. There is a special "disable" condition for the multiplexers of Q1 and Q2, used so that Q1_sel and Q2_sel can have an initial condition and will not interfere with the memory counters.

Figure 15: Inside the Pixel Neighbour Coordinate block: ±1 adders on the row and column values produce the C, W, N, E, and S coordinates.

To determine the smallest-valued pixel, the values of the neighbours are compared two at a time, and the result of each comparator is used to select the smaller of the two values; the two winners are then compared once again, yielding the value of the smallest-valued neighbour. As for the direction of the SVN, the outputs of the three stages of comparison are compared against a truth table, as shown in Figure 16. This output is passed to the arrowing block to determine the direction of the steepest descent (when there is a lower neighbour).
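A compact software equivalent of the comparator tree is given below; the tie-breaking order (W over N, E over S, and the first pair over the second) is an assumption read off the multiplexer arrangement rather than stated in the text.

def smallest_valued_neighbour(w, n, e, s):
    v1, d1 = (w, 'W') if w <= n else (n, 'N')   # first comparator stage
    v2, d2 = (e, 'E') if e <= s else (s, 'S')   # second comparator stage
    return (v1, d1) if v1 <= v2 else (v2, d2)   # final comparator stage

assert smallest_valued_neighbour(9, 10, 255, 10) == (9, 'W')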

4.5. The Plateau-Inner Block. This block determines whether the current pixel is part of a plateau and, if so, which type of plateau pixel it is. The pixel type determines what is done with the pixel and its neighbours, that is, whether they are put back into a queue or not. Essentially, together with the Pixel Status, it helps decide whether a pixel or one of its neighbours should be put back into the queues for further processing. When the system is in State 0 (i.e., processing pixel locations from the PC), the block determines whether the current pixel is part of a plateau: the value of the current pixel is compared to all its neighbours, and if any one of the neighbours has a value similar to the current pixel, the pixel is part of a plateau and plat = 1. The respective similar-valued neighbours are put into the different queue locations based on sv_W, sv_N, sv_E, and sv_S and the value of the pixel status. The logic for this is shown in Figure 17(a).

In any other state, this block is used to determine whether the current pixel is an inner (i.e., equal to or smaller than all of its neighbours); if so, inner = 1. This is shown in Figure 17(b). Whether the pixel is an inner or not determines the arrowing part of the system: if it is an inner, it will point to the nearest edge.
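In software terms, the two tests of Figure 17 reduce to one equality check and one less-than-or-equal check per neighbour, as sketched below (masking of invalid neighbours at the image border is omitted for brevity):

def plat_inner(c, w, n, e, s):
    sv = [c == x for x in (w, n, e, s)]   # sv_W, sv_N, sv_E, sv_S
    lv = [c <= x for x in (w, n, e, s)]   # lv_W, lv_N, lv_E, lv_S
    return any(sv), all(lv)               # (plat, inner)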

Figure 16: Inside the Smallest Value Neighbour (SVN) block. (a) The smallest-valued neighbour is determined and selected using a set of comparators and multiplexers. (b) The location of the smallest-valued neighbour is determined by the selections of each multiplexer; this location information is used to determine the steepest descending path and is fed into the arrowing block.

Figure 17: Inside the Plateau-Inner Block. (a) Plateau detection: sv_W = 1 when C = Wvalue, sv_N = 1 when C = Nvalue, sv_E = 1 when C = Evalue, and sv_S = 1 when C = Svalue. (b) Inner detection: lv_W = 1 when C ≤ Wvalue, lv_N = 1 when C ≤ Nvalue, lv_E = 1 when C ≤ Evalue, and lv_S = 1 when C ≤ Svalue.

4.6. The Arrowing Block. This block determines the steepest descending path label to be written to the "Arrow Memory." The steepest path is calculated based on whether the pixel is an inner or not. When processing non-inner pixels, the arrowing block generates a direction output based on the location of the lowest neighbour obtained from the "Smallest Valued Neighbour" block. If the pixel is an inner, the arrow simply points to the nearest edge; when there is more than one possible path to the nearest edge, a priority encoder in the block selects the predefined direction of the highest priority. This is shown in Figure 18. When the system is in State 0, and in any other state where the pixel is not an inner, the arrowing block uses the information from the SVN block and passes it through directly to its own main multiplexer, selecting the appropriate value to be written into "Arrow Memory."

If the current pixel is found to be an inner, the arrowing direction is towards the highest-priority neighbour of the same value that has previously been labelled. This is possible because the plateau pixels are labelled from the edge pixels going inward, one pixel at a time, ensuring that the inners always point in the direction of the shortest geodesic distance.

4.7. Pixel Status. One of the most important parts of the system is the set of pixel status (PS) registers. Since seven states are used to flag a pixel, this register requires a 3-bit representation for each pixel location of the image; thus there are as many PS registers as there are pixels in the input image. In the system, the values from the PS help determine what processes a particular pixel location has gone through and whether it has been successfully labelled into the "Arrow Memory." The seven states and their transitions are shown in Figure 19. The states are as follows:

(i) 0: unvisited—nothing has been done to the pixel,

(ii) 1: queued: initial (all plateau pixel locations put into Q1),

(iii) 2: queued in Q2,

(iv) 3: queued in Q1,

(v) 4: completed when plat = 0,

(vi) 5: completed when plat = 1 and reading from Q2,

(vii) 6: completed when plat = 1 and reading from Q1.

To ease understanding of how the plateau conditions are handled and how the PS is used, we shall introduce the concepts of the "Unlabelled Pixel" (UP) and "Labelled Pixel" (LP). The UP is defined as the outermost pixel which has yet to be labelled. Using this definition, the arrowing procedure for the plateau pixels is to

(1) arrow to a lower-valued neighbour (applicable only if inner = 0),

(2) arrow to a neighbour with PS = 5 according to the predefined arrowing priority.

With reference to Figure 20, the PS is used to determine which neighbours of the UPs have not been put into the other queue, which are UPs of the same label, and which are LPs.

5. Example for the Arrowing Architecture

This example illustrates the states and various controls of the watershed architecture for the 8 × 8 sample data; it is the same sample data shown in Figures 6 and 7. A table with the various controls, status, and queues for the first 14 clock cycles is shown in Table 2.

Figure 18: Inside the arrowing block. PS_x are the values read from the pixel status registers for the centre (C) and the respective neighbours (W, N, E, S); sv_x are the "same value" conditions obtained from the plat/inner block, where x is one of the directions W, N, E, S. A priority encoder combines these conditions into the direction outputs dir[1:0] (dir[0] = a′b + a′c′, dir[1] = a′b′) that drive the main multiplexer selecting among the direction labels −1 → −4.

The initial condition for the system is as follows. The Program Counter (PC) starts with the first pixel and generates a (0, 0) output representing the first pixel in an (x, y) format.

With both the Q1 and Q2 queues empty, that is, mc1 → mc10 = 0, the system is in State 0. This sets in_ctrl = 0, which controls mux1 to select the PC value (in this case (0, 0)). This value is incremented on the next clock cycle.

Table 2: Example of conditions with reference to the system clock cycle. Queue 2 has been omitted because, by the 14th clock cycle, it has not been used. (The table lists, for each of the first 14 clock cycles, the in_ctrl and Qx_sel controls, the pixel location, the plat and inner flags, sv_x, the pixel status of the centre and its W/N/E/S neighbours, the write enables we_tx and we_am, the direction, the minima register, and the Queue 1 contents.)

Figure 19: The pixel status block is a set of 3-bit registers used to store the state of the various pixels. (The controls shown assume that Q1 is the first queue to be used.)

The First Few Steps. This PC coordinate is then fed into the Pixel Neighbour Coordinate block. The outputs of this block (the pixel locations) are (0, 0); (0, 1) → E; (1, 0) → W; (−1, 0) → INVALID; and (0, −1) → INVALID. The valid addresses are then used to obtain the current pixel value, 10 (C), and the neighbour values, 9 (W) and 10 (S). The invalid pixel locations are set to output an INVALID value through the CB mux; this value has been predefined to be 255.

The pixel locations are also used to determine address locations within the 3-bit pixel status registers. When read, the values are (0, 0) = 0, (0, 1) = 0, and (1, 0) = 0. The neighbours with similar value are put into the queue. In the example used, only the south neighbour has a similar value, and it is put into queue Q1_S. Next, the pixel status for (1, 0) is changed from 0 → 1. This tells the system that the coordinate (1, 0) has been put into the queue, avoiding an infinite loop once its similar-valued neighbour to the north, (0, 0), finds (1, 0) again. The current pixel location (0, 0), on the other hand, is written to Q1_C because it is a plateau pixel but not an inner (i.e., an edge) and is immediately arrowed. The status for this location (0, 0) is changed from 0 → 6. Q1_S will contain the pixel location (1, 0). This is read back into the system, and mc4 = 1 → 0 indicates Q1_S to be empty. The pixel location (1, 0) is arrowed and written into Q1_C. With mc1–4 = 0 and mc5 > 0, the pixel locations (0, 0) and (1, 0) are reread into the system, but nothing is performed because both their PSs equal 6 (i.e., completed).

Figure 20: An example of how Pixel Status is used in the system. PS values are shown in square brackets, starting from PS = 0. The entire plateau is first scanned by continuously feeding the unvisited neighbours back into the system; each visited pixel is flagged by changing PS = 0 → 1, and scanning stops when there are no more unvisited neighbours. During this initial scan, all plateau pixels with a lower neighbour are arrowed, put into Q1_C, and have their PS set to 6. Q1_C is then read, and all the neighbours of those pixel locations are put into Q2_WNES (PS = 1 → 2). When Q2_WNES is read, each inner unlabelled pixel (UP) arrows to a labelled pixel (LP), identifiable by PS = 6 (completed), and the neighbours are put into Q1_WNES (PS = 1 → 3). Reading alternates between the two queues in this way until there are no more neighbours to write into the other queue.

Figure 21: The watershed architecture: labelling. All memories have a built-in pixel-coordinate-to-memory-address decoder. Signal definitions: w_loc: memory write location; r_loc: memory read location; w_info: memory write data; we_pq: write enable for the path queue memory; we_label: write enable for the label memory and the pixel status memory; we_pc: write enable for the pixel coordinate incrementation; we_buf: write enable for the buffer (the value of the CBL is locked in the buffer and read from it until the "read queue" state is completed); mux: data input selection.

6. Labelling Architecture

This second part of the architecture describes how we get from Figure 4(c) to Figure 4(d) in hardware. Compared to the arrowing architecture, the labelling architecture is considerably simpler, as there are no parallel memory reads; everything runs in a fairly sequential manner. Part 2 of the architecture is shown in Figure 21.

The architecture for Part 2 is very similar to that of Part 1. Both are tristate systems whose state depends on the condition of the queues, and both use pixel state memory and queues for storing pixel locations. The difference is that the Part 2 architecture requires only a single queue and a single-bit pixel status register. The three states for the system are shown in Figure 22.

Figure 22: The three states in the labelling architecture. The system leaves the Normal state (mux = 1) for the Fill queue state (mux = 0) while PQ_counter > 0, moves to the Read queue state when b = 1 (a catchment basin is found), and returns to Normal when PQ_counter = 0.


Values are initially read in from the pixel coordinate register. Whether a pixel location has been processed before is checked against the pixel status (PS) register. If it has not been processed before (i.e., was never part of any steepest descending path), it is written to the Path Queue (PQ). Once PQ is not empty, the system processes the next pixel along the current steepest descending path, calculated by the "Reverse Arrowing Block" (RAB) using the current pixel location and the direction information obtained from the "Arrow Memory." This continues until a non-negative value is read from "Arrow Memory." This non-negative value is called the "Catchment Basin Label" (CBL). Reading a CBL tells the system that a minimum has been reached; all the pixel locations stored in PQ are then labelled with that CBL and written to "Label Memory." At the same time, the pixel status for the corresponding pixel locations is updated from 0 → 1. Now that PQ is empty, the next value is obtained from the pixel coordinate register.
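The loop below is a software analogue of this behaviour, refining the earlier labelling sketch with the PQ/PS mechanics: PS is a single bit per pixel, the CBL is held until the queue is drained, and already labelled pixels terminate later walks early. The function and argument names are illustrative; next_pixel plays the role of the Reverse Arrowing Block.

def label_image(arrow, next_pixel, coords):
    # arrow[p]: negative direction code, or a non-negative CBL at minima.
    # next_pixel(p, d): coordinate pointed to by direction d (the RAB).
    labels, ps = {}, {p: 0 for p in coords}
    for start in coords:
        pq, p = [], start
        while ps[p] == 0 and arrow[p] < 0:      # "fill queue" state
            pq.append(p)
            ps[p] = 1                           # PS 0 -> 1: processed
            p = next_pixel(p, arrow[p])
        # The walk stops at a minimum or at an already labelled pixel.
        cbl = arrow[p] if arrow[p] >= 0 else labels[p]  # buffered CBL
        if arrow[p] >= 0:
            labels[p], ps[p] = cbl, 1
        for q in pq:                            # "read queue" state
            labels[q] = cbl
    return labels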

6.1. The Reverse Arrowing Block. This block calculates the neighbour pixel location in the path of steepest descent, given the current location and arrowing label. In other words, it simply finds the location of the pixel pointed to by the current pixel.

The output of this block is a simple case of selecting the appropriate neighbouring coordinate. First, the neighbouring coordinates are calculated and fed into a 4-input multiplexer. Invalid neighbours are automatically ignored, as they will never be selected: the values in "Arrow Memory" only point to valid pixels, so no special consideration is required to handle these cases.

The bulk of the block's complexity lies in the control of the multiplexer. The control is determined by translating the value from the "Arrow Memory" into proper control logic. Using a bank of four comparators, the value from "Arrow Memory" is compared to the four possible valid direction labels (i.e., −4 → −1). For each of these values, only one of the comparators produces a positive outcome (see the truth table in Figure 23); any value outside the valid range is simply ignored.

The comparator output is then passed through some logic that produces a 2-bit output corresponding to the multiplexer control. If the value from "Arrow Memory" is −1, the control logic output will be (x = 0, y = 0), corresponding to the West neighbour location. Similarly, if the value from "Arrow Memory" is −2, −3, or −4, the control logic output will be (x = 0, y = 1), (x = 1, y = 0), or (x = 1, y = 1), corresponding to the North, East, or South neighbour locations, respectively. This is shown in Figure 23.
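The whole block therefore behaves as the small lookup sketched below, assuming (row, column) coordinates so that the Section 7 step, direction −3 taking (0, 0) to (0, 1), is reproduced:

def reverse_arrow(r, c, am):
    # am: arrow-memory value; exactly one comparator fires for -1..-4.
    return {
        -1: (r, c - 1),   # West
        -2: (r - 1, c),   # North
        -3: (r, c + 1),   # East
        -4: (r + 1, c),   # South
    }[am]

assert reverse_arrow(0, 0, -3) == (0, 1)  # the Section 7 example step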

Figure 23: Inside the reverse arrowing block. The value read from the arrow memory (am) is compared to the four valid direction labels −1 → −4; the comparator outputs a, b, c, and d drive the 2-bit multiplexer control (x, y) that selects the West (0, 0), North (0, 1), East (1, 0), or South (1, 1) neighbour coordinate.

7. Example for the Labelling Architecture

This example picks up where the previous example stopped. In the previous part, the resulting output was written to the "Arrow Memory." It contains the directions of the steepest descent (negative values from −1 → −4) and the numbered minima (positive values from 0 → total number of minima), as seen in Figure 4(c). In this part, we use the information stored in "Arrow Memory" to label each pixel with the label of its respective minimum. Once all the pixels associated with a minimum have been labelled accordingly, a catchment basin is formed.

The system starts in the normal state with the initial conditions PQ_counter = 0 and mux = 1. In the first clock cycle, the first pixel location (0, 0) is read from the pixel location register. Once it has been read in, the pixel location register increments to the next pixel location, (0, 1). The PS for the first location (0, 0) is 0. This asserts the write enable for the PQ, and the first location is written to the queue. At the same time, the location (0, 0) and the direction −3 obtained from "Arrow Memory" are used to find the next coordinate, (0, 1), in the steepest descending path.

Since PQ is not empty, the system enters the "Fill Queue" state and mux = 0. The next input into the system is the value from the reverse arrowing block, (0, 1), and since PS = 0, it is put into PQ. The next location processed is (0, 2). For (0, 2), PS = 0, so it is also written to PQ. However, for this location, the value obtained from "Arrow Memory" is 1. This is a CBL and is buffered for the next state. Once a non-negative value from "Arrow Memory" is read (i.e., b = 1), the system enters the "Read Queue" state. In this state, all the pixel locations stored in PQ are read one at a time, and the memory locations in "Label Memory" corresponding to these locations are written with the buffered CBL. At the same time, PS is updated from 0 → 1 to reflect the changes made to "Label Memory." This tells the system that the locations from PQ have been processed, so that they will not be rewritten when encountered again.


Table 3: Results of the implemented architecture on a Xilinx Spartan-3 FPGA (64 × 64 image size).

Arrowing
    Slice flip flops: 423 out of 26,624 (1%)
    Occupied slices: 2,658 out of 13,312 (19%)
Labelling
    Slice flip flops: 39 out of 26,624 (1%)
    Occupied slices: 37 out of 13,312 (1%)

With each read from PQ, PQ_counter is decremented. When PQ is empty, PQ_counter = 0 and the system returns to the normal state.

In the next clock cycle, (0, 1) is read from the pixel coordinate register. For (0, 1), PS = 1, so nothing is written to PQ and PQ_counter remains at 0. The same goes for (0, 2). When the coordinate (0, 3) is read from the pixel coordinate register, the whole process of filling PQ, reading from PQ, and writing to "Label Memory" starts again.

8. Synthesis and Implementation

The rainfall watershed architecture was designed in Handel-C and implemented on a Celoxica RC10 board containing a Xilinx Spartan-3 FPGA. Place and route were completed to obtain a bitstream, which was downloaded into the FPGA for testing. The watershed transform was computed by the FPGA architecture, and the arrowing and labelling results were verified to have the same values as software simulations in Matlab. The Spartan-3 FPGA contains a total of 13,312 slices. The implementation results of the architecture are given in Table 3 for an image size of 64 × 64 pixels. An image resolution of 64 × 64 required 2,658 and 37 slices for the arrowing and labelling architectures, respectively. This represents about 20% of the chip area on the Spartan-3 FPGA.

9. Summary

This paper proposed a fast method of implementing the watershed transform based on rainfall simulation with a multiple bank memory addressing scheme to allow parallel access to the centre and neighbourhood pixel values. In a single read cycle, the architecture is able to obtain all five values of the centre and four neighbours for a 4-connectivity watershed transform. This multiple bank memory has the same footprint as a single bank design. The datapath and control architecture for the arrowing and labelling hardware have been described in detail, and an implementation on a Xilinx Spartan-3 FPGA has been reported. The work can be extended to an 8-connectivity watershed transform by increasing the number of memory banks and working out the corresponding addressing. The multiple bank memory approach can also be applied to other watershed architectures such as those proposed in [10–13, 15].

References

[1] S. E. Hernandez and K. E. Barner, "Tactile imaging using watershed-based image segmentation," in Proceedings of the Annual Conference on Assistive Technologies (ASSETS '00), pp. 26–33, ACM, New York, NY, USA, 2000.

[2] M. Fussenegger, A. Opelt, A. Pinz, and P. Auer, "Object recognition using segmentation for feature detection," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 3, pp. 41–44, IEEE Computer Society, Washington, DC, USA, 2004.

[3] W. Zhang, H. Deng, T. G. Dietterich, and E. N. Mortensen, "A hierarchical object recognition system based on multi-scale principal curvature regions," in Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), vol. 1, pp. 778–782, IEEE Computer Society, Washington, DC, USA, 2006.

[4] M. S. Schmalz, "Recent advances in object-based image compression," in Proceedings of the Data Compression Conference (DCC '05), p. 478, March 2005.

[5] S. Han and N. Vasconcelos, "Object-based regions of interest for image compression," in Proceedings of the Data Compression Conference (DCC '08), pp. 132–141, 2008.

[6] T. Acharya and P.-S. Tsai, JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures, John Wiley & Sons, New York, NY, USA, 2005.

[7] V. Osma-Ruiz, J. I. Godino-Llorente, N. Saenz-Lechon, and P. Gomez-Vilda, "An improved watershed algorithm based on efficient computation of shortest paths," Pattern Recognition, vol. 40, no. 3, pp. 1078–1090, 2007.

[8] A. Bieniek and A. Moga, "An efficient watershed algorithm based on connected components," Pattern Recognition, vol. 33, no. 6, pp. 907–916, 2000.

[9] H. Sun, J. Yang, and M. Ren, "A fast watershed algorithm based on chain code and its application in image segmentation," Pattern Recognition Letters, vol. 26, no. 9, pp. 1266–1274, 2005.

[10] M. Neuenhahn, H. Blume, and T. G. Noll, "Pareto optimal design of an FPGA-based real-time watershed image segmentation," in Proceedings of the Conference on Program for Research on Integrated Systems and Circuits (ProRISC '04), 2004.

[11] C. Rambabu and I. Chakrabarti, "An efficient immersion-based watershed transform method and its prototype architecture," Journal of Systems Architecture, vol. 53, no. 4, pp. 210–226, 2007.

[12] C. Rambabu, I. Chakrabarti, and A. Mahanta, "Flooding-based watershed algorithm and its prototype hardware architecture," IEE Proceedings: Vision, Image and Signal Processing, vol. 151, no. 3, pp. 224–234, 2004.

[13] C. Rambabu and I. Chakrabarti, "An efficient hillclimbing-based watershed algorithm and its prototype hardware architecture," Journal of Signal Processing Systems, vol. 52, no. 3, pp. 281–295, 2008.

[14] D. Noguet and M. Ollivier, "New hardware memory management architecture for fast neighborhood access based on graph analysis," Journal of Electronic Imaging, vol. 11, no. 1, pp. 96–103, 2002.

[15] C. J. Kuo, S. F. Odeh, and M. C. Huang, "Image segmentation with improved watershed algorithm and its FPGA implementation," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '01), vol. 2, pp. 753–756, Sydney, Australia, May 2001.

[15] C. J. Kuo, S. F. Odeh, and M. C. Huang, “Image segmentationwith improved watershed algorithm and its FPGA implemen-tation,” in Proceedingsof the IEEE International Symposium onCircuits and Systems (ISCAS ’01), vol. 2, pp. 753–756, Sydney,Australia, May 2001.