IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 24, NO. 3, MARCH 2014 525

An Embedded System-on-Chip Architecture for Real-time Visual Detection and Matching

Jianhui Wang, Sheng Zhong, Luxin Yan, Member, IEEE, and Zhiguo Cao

Abstract—Detecting and matching image features is a fundamental task in video analytics and computer vision systems. It establishes the correspondences between two images taken at different time instants or from different viewpoints. However, its large computational complexity has been a challenge to most embedded systems. This paper proposes a new FPGA-based embedded system architecture for feature detection and matching. It consists of scale-invariant feature transform (SIFT) feature detection, as well as binary robust independent elementary features (BRIEF) feature description and matching. It is able to establish accurate correspondences between consecutive frames for 720-p (1280 × 720) video. It optimizes the FPGA architecture for the SIFT feature detection to reduce the utilization of FPGA resources. Moreover, it implements the BRIEF feature description and matching on FPGA. Due to these contributions, the proposed system achieves feature detection and matching at 60 frames/s for 720-p video. Its processing speed can meet and even exceed the demand of most real-life real-time video analytics applications. Extensive experiments have demonstrated its efficiency and effectiveness.

Index Terms—Binary robust independent elementary features (BRIEF), feature detection and matching, field programmable gate array (FPGA), scale-invariant feature transform (SIFT), system-on-chip (SoC).

I. Introduction

EFFICIENT detection and reliable matching of visual features is a fundamental problem in computer vision applications, such as object recognition, structure from motion, image indexing, and visual localization. Real-time performance is a critical demand of most of these applications, which require the detection and matching of the visual features in real time. Although feature detection and matching methods have been studied in the literature, due to their computational complexity, their pure software implementation without using special hardware is far from satisfactory in performance for real-time applications.

Manuscript received March 21, 2013; revised June 19, 2013 and July 31, 2013; accepted August 5, 2013. Date of publication August 29, 2013; date of current version March 4, 2014. This work was supported in part by the National Pre-Research Foundation under Grant 625010221. This paper was recommended by Associate Editor T.-S. Chang.

The authors are with the Science and Technology on Multi-Spectral Information Processing Laboratory, School of Automation, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2013.2280040

This paper is focused on a new hardware design to enable real-time performance of establishing correspondences between two consecutive frames of high-resolution video.

In the literature, there are many different methods to detect local features in an image, such as Harris [1], scale-invariant feature transform (SIFT) [2], and SURF [3]. SIFT is one of the most efficient methods to detect and describe distinctive invariant features from images. Its significant advantage over other methods is that the SIFT feature is invariant to image translation, scaling, and rotation, while at the same time quite robust to illumination changes. However, it is known that it is very difficult, if not impossible, to achieve software-based real-time computing of SIFT due to its computational complexity. Recently, there have been some studies using special hardware [4]–[7] to accelerate the detection part of the SIFT algorithm, and some of these works may achieve satisfactory real-time performance, such as the design in [7]. However, to our knowledge, a full-fledged feature detection, description, and matching system is yet to be designed. Besides the detection part of SIFT, obtaining the SIFT feature descriptors is also critical, and it has been the performance bottleneck of the whole system because it is very difficult, if not impossible, to integrate the description part of SIFT into an FPGA. The main challenge is its operational complexity, which prevents it from being parallelized effectively.

There have been many modifications and variants of the original SIFT descriptor to speed it up at the algorithmic level. Broadly speaking, these methods can be divided into two classes. One is to shorten the size of the SIFT feature by applying dimensionality reduction, such as principal component analysis (PCA) [8], to the original SIFT feature descriptor. Another way is to quantize its floating-point coordinates into integer codes on fewer bits, such as the results presented in [9]–[11]. Following these important contributions, Calonder et al. [12] presented a method to extract feature descriptors very efficiently, called binary robust independent elementary features (BRIEF), which greatly reduced the memory demanded to store the feature descriptors and the time consumed to match the features, while yielding comparable recognition accuracy.

In order to achieve real-time feature detection and matching for 720-p video, we propose to replace the original SIFT descriptor by the BRIEF descriptor in this paper. Considering the space, power, and real-time constraints of an embedded system, we implement the whole system on a single FPGA chip. This system consists of SIFT detection, BRIEF description, and BRIEF matching.

1051-8215 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

The proposed FPGA-based feature detection and matching system can establish correspondences between two 1280 × 720 images within 16 ms, which can meet the demands of most real-time applications. The main contributions of this paper include the following aspects.

1) To the best of our knowledge, it is the first hardware design to implement visual feature detection, description, and matching on a single FPGA based on SIFT+BRIEF. Moreover, it can complete these computations in real time for 720-p video. More specifically, it can achieve 60 frames/s for 1280 × 720 images.

2) It combines the stability and repeatability of SIFT key-points with the efficiency of BRIEF to meet the real-time requirements of real-life computer vision applications.

3) Due to the optimization in the proposed design, the SIFT feature detection part presented in this paper is one of the fastest designs, while its FPGA resource utilization is the smallest.

The rest of the paper is organized as follows. Section II presents related works. It starts from the early attempts on this topic and reviews related implementations specifically on FPGAs. Section III introduces the proposed SIFT feature detection associated with the BRIEF feature description method. Section IV presents the technical details of our design. Experimental results are reported in Section V and Section VI concludes the paper.

II. Related Works

Generally speaking, there are two approaches to establish correspondences for image sequences: intensity-based and feature-based. Feature-based approaches find corresponding local visual features between images, such as points, lines, and contours, while the intensity-based approaches compare intensity patterns in images via correlation metrics. As shown in the works of Lowe [2], Sivic and Zisserman [13], and Mikolajczyk and Schmid [14], feature-based methods have shown very good results for image matching [15]–[17]. However, due to their computational complexity, it is very difficult, if not impossible, to have pure software implementations (without using any specialized hardware, e.g., GPU) of these methods for real-time applications.

McCartney et al. [15] presented an image registration algorithm for image sequences captured by unmanned aerial vehicles (UAV). This design was a pure software-based system, and it consumed about 1 s to complete image registration for 640 × 480 images, in which the image feature detection and matching steps consumed about 0.8 s. This result was obtained on a personal computer (PC) equipped with an Intel Core 2 Duo processor with 1.67 GHz processing speed and 2 GB of RAM.

Agrawal et al. [16] presented two feature detection methods, named CenSurE-DOB and CenSurE-OCT, respectively, as well as a feature description method called MUSURF. They claimed that the feature detection part CenSurE-DOB took 23 ms and CenSurE-OCT 17 ms to detect center-surround features from 512 × 382 images, and the feature description step only took 10 ms.

Their experiments were obtained from an Intel Pentium-M 2 GHz machine. This design may be fine for real-time applications on low-resolution image sequences; however, its performance is still far from the demands of high-resolution image sequences. Moreover, the time consumed by the feature matching process was excluded from the processing time presented in [16].

Wang et al. [17] presented a multiple view kernel projection (MVKP)-based real-time image matching design. It utilized a kernel-projection scheme to describe the image patch centered at a detected key-point. It treated feature matching as a classification problem, which leads to an online matching speed five times faster than SIFT. It took about 0.1 s to complete the feature description on a PC with a Pentium IV 1.4-GHz CPU.

In order to achieve real-time performance, GPU-based software implementations have been studied, such as the work presented by Cornelis [18], Heymann [19], and Sinha [20]. The work presented in [18] can achieve 100 frames/s for 640 × 480 images and took about 32 ms to match 3000 features to another 3000 features. However, these methods largely depend on the performance of the GPU chip and other hardware configurations, and their performances vary significantly from computer to computer.

As real-time performance is critical to real-life computer vision applications, it is natural to resort to effective hardware design for efficient feature detection and description. Bonato et al. [5] presented an FPGA-based architecture for SIFT feature detection. This implementation operated at 30 frames/s on 320 × 240 images. However, its implementation of the feature description part of SIFT was on the NIOS II soft-core processor. This step took 11.7 ms per detected feature, which makes it infeasible to perform as a full real-time SIFT implementation. As a single image may have hundreds of features, it is still far from satisfactory for real-time performance.

Zhong et al. [4] presented a design for SIFT feature detection and description based on an FPGA+DSP architecture for 320 × 256 images. The feature detection part of their design can achieve satisfactory real-time performance. However, the feature description part of their system was implemented on a DSP. Although it took 80 μs per feature, it may fall behind real-time performance when the number of features in an image reaches 400 or more (which is generally the case for high-resolution images like in HDTV).

Mizuno et al. [6] proposed a SIFT feature detection design for HDTV based on ASIC. In this system, the input images were stored in the external SDRAM and divided into several regions of interest (ROIs), and the feature detection architecture detects features in each ROI. Since the ROI image is much smaller than the input image, it can reduce the on-chip memory effectively. However, it needed external memory to store the input image. Moreover, the performance of the feature description of this design was not reported, and it is unclear if it can be done in real time.

The FPGA-based design presented by Svab et al. [21] took about 100 ms to detect the SURF features from a 1024 × 768 image. In this design, the feature detection part of SURF was implemented in FPGA logical blocks, but the feature description part was implemented on a PowerPC with floating-point arithmetic.

Although 10 frames/s is satisfactory for a few real-life applications, it is still far from true real-time performance. Moreover, when considering the feature matching part, it cannot satisfy the demand of real-time performance.

The SIFT algorithm was modified in [22] to obtain a high-speed feature detector. This system took 31 ms to detect multiple features from 640 × 480 images. It implemented two octaves with four scales of SIFT, and reduced the dimension of the feature descriptor from 128 to 72 in order to obtain the desired speed. Due to these approximations, this dedicated design gives a near real-time performance. However, the performance of the feature description of their design is not presented, let alone the feature matching. Moreover, although the feature descriptor of their design is reduced to 72 from 128, the memory requirement to store the feature descriptor is still very demanding.

Schaeferling et al. [23] proposed a SURF-based object recognition system on a single FPGA. One of the core parts of this system is Flex-SURF+. It contains an array of difference elements (DEs) to overcome the irregular memory access behavior exposed by SURF during the computation of image filter responses. A special computing pipeline processes these filter responses and determines the final detection results. Flex-SURF+ allows a tradeoff between area and speed, which makes it efficient with high-end FPGAs as well as low-end ones, depending on the application requirements. By using Flex-SURF+, the SURF detector's determinant calculation step requires just 70 ms per frame. The minimum, average, and maximum total execution times per 320 × 240 frame are 191 ms, 481 ms, and 1053 ms, respectively.

Huang et al. [7] presented a hardware-based SIFT feature extraction architecture, which can extract SIFT features from 640 × 480 images within 33 ms when the number of feature points to be extracted is fewer than 890. This system consists of two interactive hardware components, one for feature detection and the other for feature description. This system took 3.4 ms to detect SIFT features from 640 × 480 images and took 33.1 μs to extract the descriptor for one detected feature. However, the detail of the feature description was not presented. Moreover, there was no report on the time consumption for feature matching, which is also a computationally intensive part of the system.

In this paper, we propose a single-FPGA-based design to detect and match visual features in real time for real-life applications. Considering the stability and repeatability of the image matching system, we select SIFT as the key-point detector in our design. In order to achieve real-time performance, we replace the SIFT descriptor by the BRIEF descriptor. Moreover, we also implement a BRIEF feature matching component in the FPGA. The processing speed of the proposed system can comfortably achieve 60 frames/s for 1280 × 720 images, which is appealing to most real-life applications.

III. Method

The main purpose of our design is to achieve real-time performance for image feature detection and matching on 720-p video.

Fig. 1. Process diagram of the proposed system.

In this section, we introduce the process to establish correspondences between consecutive video frames. As the proposed system is based on SIFT key-points associated with the BRIEF descriptors, we also briefly review SIFT detection and BRIEF description in this section to make this paper self-contained. For details please see [2], [12], and [24].

A. SIFT Key-point Associated With BRIEF Feature

As mentioned in Section II, it is very difficult for a pure software implementation without using special hardware to achieve real-time performance for high-resolution feature detection and matching. It is natural to resort to effective hardware design. In the proposed design, we implement the whole feature detection and matching system on a single FPGA. Hence, we must choose the appropriate feature detection, description, and matching methods considering not only their computational complexity, but also the constraints on the available FPGA resources. Among the state-of-the-art feature description approaches in the literature, such as SIFT, SURF, and BRIEF, we found that BRIEF is a good choice as it is very efficient and accurate. Moreover, it is also very suitable for parallel computation, which is a very important factor in our FPGA implementation.

Although the BRIEF descriptor is associated with the SURF key-point in the original BRIEF paper [12], we combine it with the SIFT key-point in this design. Since SIFT is more stable and robust than SURF and the number of features detected by SIFT is less than that of SURF, it can greatly reduce the memory requirement to store the feature descriptors. Although detecting SURF feature points is faster than SIFT points in software, these two methods have almost the same processing speed when implemented in hardware. For these reasons, we choose SIFT instead of SURF as the feature detection method in our system.

In order to establish correspondences between consecutive video frames, we need to store the visual features of the two frames. The main task of the feature matching part in our system is to find feature matching pairs that have the minimum Hamming distances. The process diagram of the proposed system is presented in Fig. 1. The SIFT feature detection module detects the SIFT key-points for every frame (of the input 720-p video) and the BRIEF feature description part extracts the BRIEF descriptors for each detected SIFT key-point. Then, the extracted BRIEF feature descriptors are stored. Finally, the BRIEF feature matching part reads the BRIEF feature descriptors from feature storages A and B (for the two images, respectively) to find the matching pairs.

Fig. 2. Block diagram of the SIFT feature detection in two octaves with six scales per octave, where k = 2^(1/3) in this paper.

The experimental results presented in Section V demonstrate the efficiency and effectiveness of this proposed system.

B. SIFT Key-point Detection

The process of SIFT key-point detection consists of two steps: constructing the difference-of-Gaussian (DoG) image pyramid and detecting key-points with stability verification. Besides key-point detection, SIFT also includes a feature description part. As it will be replaced by the BRIEF feature descriptor in this paper, we only review the feature detection part of SIFT. The process of SIFT feature detection is illustrated in Fig. 2. For details please see [2] and [24].

1) DoG Pyramid Construction: The first step in SIFT is to determine the image locations that exhibit significant local changes in their visual appearances. Such locations are the candidates for the SIFT key-points. In order to find these candidates, we need to construct a DoG image pyramid that approximates the image gradient field. First, we need to convolve the input image I(x, y) with a Gaussian kernel K(x, y; σ), where σ is the scale of the Gaussian kernel. The result is the Gaussian-filtered image, denoted as G(x, y; σ), i.e.

G(x, y; σ) = conv2(I(x, y), K(x, y; σ)) (1)

where conv2(·, ·) represents the 2-D convolution operation and

K(x, y; σ) = (1 / (2πσ^2)) · e^(−(x^2 + y^2) / (2σ^2)). (2)

The DoG image is the difference of two Gaussian-filtered images over two consecutive scales, denoted by

D(x, y; σ) = G(x, y; kσ) − G(x, y; σ) (3)

where k is a constant multiplicative factor.

As can be seen in Fig. 2, there are two octaves (groups of images that have the same resolution) and six scales for each octave. In each octave, the Gaussian-filtered images G(x, y; k^i·σ), i = 0 . . . 5, are generated by convolving I(x, y) with K(x, y; k^i·σ). Once a complete octave has been processed, we down-sample the Gaussian-filtered image by taking every other pixel in the rows and columns of G(x, y; k^3·σ) as the source image of the next octave.
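As a concrete illustration of this construction, the short Python sketch below (an illustration only, not the paper's Verilog-HDL implementation) builds the same two-octave, six-scale DoG pyramid; scipy's Gaussian filter stands in for the hardware convolution, and the base scale σ0 = 1.6 is an assumed value.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_pyramid(image, sigma0=1.6, num_scales=6, num_octaves=2):
    """Two octaves, six Gaussian scales per octave (k = 2^(1/3)); the next
    octave starts from G(x, y; k^3*sigma0) down-sampled by taking every
    other pixel, as described in the text."""
    k = 2.0 ** (1.0 / 3.0)
    pyramid = []
    src = image.astype(np.float32)
    for _ in range(num_octaves):
        gaussians = [gaussian_filter(src, sigma0 * (k ** i))
                     for i in range(num_scales)]
        # DoG images: difference of consecutive Gaussian scales, eq. (3)
        dogs = [gaussians[i + 1] - gaussians[i] for i in range(num_scales - 1)]
        pyramid.append(np.stack(dogs))
        src = gaussians[3][::2, ::2]      # source image of the next octave
    return pyramid

if __name__ == "__main__":
    frame = np.random.rand(720, 1280).astype(np.float32)
    print([octave.shape for octave in dog_pyramid(frame)])
    # -> [(5, 720, 1280), (5, 360, 640)]
```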

2) Stable Key-point Detection: Once the DoG image pyramid has been constructed, we detect the local maxima and minima in the DoG images by comparing a pixel with its 26 neighbors in 3 × 3 regions at the current and adjacent scales, and we treat these local maxima and minima points as candidate key-points, as shown in Fig. 2.

Once the key-point candidates have been located, we should eliminate the low-contrast and strong-edge-response points to make the detection robust to noise. To eliminate low-contrast points, we compare the DoG image pixel value (D) with a contrast value. A poorly defined peak in the DoG image has a large principal curvature across the edge but a small one in the perpendicular direction. It is known that the principal curvature values can be computed from the Hessian matrix H

H = | Dxx  Dxy |
    | Dxy  Dyy |  (4)

where Dxx, estimated by taking the difference of neighbor points, is the second order derivative along the x direction.

The eigenvalues of H are proportional to the principal curvatures of D. Let γ be the ratio between the larger eigenvalue α and the smaller one β, so that α = γβ; then

Tr(H)^2 / Det(H) = (α + β)^2 / (αβ) = (γβ + β)^2 / (γβ^2) = (1 + γ)^2 / γ (5)

where Tr(H) is the trace of H and Det(H) the determinant of H. It is clear that γ represents the condition number of the principal curvature matrix, which indicates the singularity, i.e., the degeneration, of the local appearance.

Therefore, to eliminate strong edge response points, we only need to set the threshold γ and check whether (6) is satisfied or not. That is

Tr(H)^2 / Det(H) ≤ (1 + γ)^2 / γ. (6)
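The three tests above (26-neighbor extremum detection, low-contrast rejection, and the edge-response check of (6)) can be summarized by the following Python sketch; the contrast threshold of 3 and γ = 10 are the values used later in Section IV-B, while the finite-difference Hessian estimate is a standard approximation rather than the exact hardware datapath.

```python
import numpy as np

def is_stable_keypoint(dog, s, y, x, contrast_thr=3.0, gamma=10.0):
    """dog: (scales, rows, cols) DoG stack of one octave. Returns True if
    (s, y, x) is a local extremum among its 26 neighbors, is not a
    low-contrast point, and passes the edge-response test of eq. (6)."""
    d = dog[s, y, x]
    cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
    if not (d >= cube.max() or d <= cube.min()):   # 26-neighbor extremum test
        return False
    if abs(d) < contrast_thr:                      # low-contrast rejection
        return False
    # Hessian of D in the image plane, estimated by finite differences
    dxx = dog[s, y, x + 1] + dog[s, y, x - 1] - 2.0 * d
    dyy = dog[s, y + 1, x] + dog[s, y - 1, x] - 2.0 * d
    dxy = (dog[s, y + 1, x + 1] - dog[s, y + 1, x - 1]
           - dog[s, y - 1, x + 1] + dog[s, y - 1, x - 1]) / 4.0
    tr, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:                                   # curvatures of opposite sign
        return False
    return tr * tr / det <= (1.0 + gamma) ** 2 / gamma   # eq. (6)
```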

C. BRIEF Feature Description

Here, we briefly review the BRIEF feature description and matching. For details please see [12]. The process of forming the BRIEF feature description can be divided into two steps: smoothing the image to reduce the sensitivity to noise, and comparing the randomly generated image point pairs to generate the corresponding bits of the feature vector. More specifically, the BRIEF feature extraction can be described as follows. Let β be an intensity comparison result between two positions u and v in a smoothed image patch, as

β(p; u, v) := 1 if I(p, u) < I(p, v), and 0 otherwise (7)

where I(p, u) is the pixel intensity in a smoothed version of p at position u = (x, y)^T. Then, we choose a set of point pairs S = {si | i = 1 . . . 256, si = (ui, vi)} to uniquely define a set of binary tests. The BRIEF descriptor is a 256-D bit string that corresponds to the decimal counterpart of

Σ_{i=1..256} 2^(i−1) β(p; ui, vi). (8)

Considering that there are many Gaussian filters in our design, we chose to use the Gaussian kernel as the smoothing kernel and we set the width of the Gaussian kernel to 15.

There are many different methods to select the test locations (ui, vi) of a point pair in a patch of size W × W. We adopt the method presented in [12], i.e., (U, V) ∼ i.i.d. Uniform(−W/2, W/2): the locations (ui, vi) are evenly distributed over the patch and the tests can lie close to the patch border [12]. Since the BRIEF feature is a bit-string vector, the matching of BRIEF features is a process to find the BRIEF feature pairs that have the minimum Hamming distance.
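A minimal software sketch of the BRIEF extraction and matching just described is given below; it assumes a 17 × 17 patch (the window size used later in Section IV-C), an illustrative smoothing σ, and one fixed set of 256 uniformly drawn point pairs shared by all key-points.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

W = 17                                    # patch size (assumed, see Sec. IV-C)
rng = np.random.default_rng(0)
# 256 point pairs, i.i.d. Uniform(-W/2, W/2), drawn once and reused
PAIRS = rng.integers(-(W // 2), W // 2 + 1, size=(256, 4))

def brief_descriptor(smoothed, kp, pairs=PAIRS):
    """256-bit BRIEF descriptor of key-point kp = (y, x), eq. (7)-(8);
    'smoothed' is the Gaussian-smoothed image."""
    y, x = kp
    bits = np.zeros(256, dtype=np.uint8)
    for i, (uy, ux, vy, vx) in enumerate(pairs):
        bits[i] = smoothed[y + uy, x + ux] < smoothed[y + vy, x + vx]
    return bits

def hamming(a, b):
    """Matching criterion: Hamming distance between two descriptors."""
    return int(np.count_nonzero(a != b))

if __name__ == "__main__":
    img = gaussian_filter(np.random.rand(64, 64), 2.0)   # illustrative sigma
    d1, d2 = brief_descriptor(img, (20, 20)), brief_descriptor(img, (40, 40))
    print(hamming(d1, d2))
```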

IV. Proposed Hardware Architecture

In this section, first, we introduce the overall architecture of the proposed system. In order to implement the proposed feature detection and matching method in real time, we analyze the computational demand of each block in the whole system. Then, we set the clock frequency and parallelism gain for each block to make it work in real time. Finally, we present the details of the proposed on-chip feature detection and matching system, which can detect the SIFT features from 1280 × 720 images, extract the BRIEF descriptors for the detected features, and complete the BRIEF feature matching of consecutive frames at a frame rate of 60 frames/s.

A. Overview of Overall System Architecture

Considering the computational demand of the image feature detection and matching, we implement the system in a single FPGA chip. The overview of the overall architecture is shown in Fig. 3(a)–(d). The entire system consists of 1) the SIFT detection module (SIFT_DET), 2) the BRIEF description module (BRIEF_DESC), 3) the BRIEF storage module (BRIEF_STOR), and 4) the BRIEF matching module (BRIEF_MATCH). SIFT_DET is driven by a camera directly; it detects the SIFT features in the input image, and then passes the detected features to BRIEF_DESC to extract the BRIEF descriptors. BRIEF_STOR has two DPRAMs, which are used to store the BRIEF descriptors of the current frame and the previous frame. The tasks of BRIEF_MATCH include: 1) finding the matching point pairs in the current frame and the previous frame, 2) generating the read interruption (ReadInt) to indicate the end of a frame, and 3) latching the number of matched point pairs.

The scheduling figure of the proposed architecture is presented in Fig. 3(e), where A = 10 265, B = 1048, C = 6, and D = 8978 for 1280 × 720 images. The whole system is synchronized by the frame valid signal (FVAL). SIFT_DET outputs the first result 10 265 clock cycles after the rising edge of FVAL. BRIEF_DESC generates a BRIEF descriptor 1048 clock cycles after it accesses a detected SIFT key-point from KP_FIFO. BRIEF_MATCH reads a BRIEF descriptor from BRIEF_STOR to find the matched BRIEF descriptor in the previous frame six clock cycles after the BRIEF descriptor is stored into BRIEF_STOR. Since the input image of octave 1 is down-sampled from the third Gaussian image in octave 0, the end point of SIFT_DET is 8978 clock cycles after the falling edge of FVAL. The ReadInt is asserted when the next frame comes.

Fig. 3. Overall architecture of the FPGA-based real-time visual detection and matching system. (a) SIFT feature detection module. (b) BRIEF feature description module. (c) BRIEF feature storage module. (d) BRIEF feature matching module. (e) Scheduling figure of the hardware architecture, where A = 8N + 25, B = 1048, C = 6, D = 7N + 18, and N is the image width.


1) Computational Demand Analysis: The computational demand is the number of operations executed per second in real time (60 frames/s). It is the product of the number of operations performed to generate a result with the system throughput, where each operation needs one clock cycle to perform. As the proposed system is a fully paralleled/pipelined system, the processing speed of the whole system is constrained by the slowest module of the system. Hence, every module of the proposed system must be able to complete its computation at a frame rate of 60 frames/s. The computational demands and performances for each system function are presented in Table I, which is built according to the method presented in [5].

According to the detailed analysis presented below, the entire system needs to perform 90.75 G operations per second, which is achieved through the fully paralleled/pipelined architecture as presented in Fig. 3. The SIFT_DET module has a constant throughput defined by the number of octaves and scales. As the proposed system has two octaves and six scales, the number of Gaussian filters in our system is 12. The image resolution is 1280 × 720 for octave 0 and 640 × 360 for octave 1; hence, the total throughput is 414.72 M pixels/s for the Gaussian filters, 345.6 M pixels/s for the image subtraction module, and 207.36 M pixels/s for feature detection. As a Gaussian filter needs to perform 46 operations, image subtraction needs to perform one operation, and feature detection needs to perform 45 operations for each pixel, the numbers of operations per second for these modules are 19.08 G, 345.6 M, and 9.33 G, respectively.

TABLE I

Computational Demands and Achieved Performance for Each System Function in Proposed System

As the SIFT_DET module is directly driven by the camera, the clock frequency of the SIFT_DET module is 75 MHz, which is the pixel clock of 720-p video at 60 frames/s. Hence, the required parallelism gains of each function of SIFT_DET are 254.4×, 4.6×, and 124.4×, respectively.

In contrast to SIFT_DET, the required throughput of BRIEF_DESC and BRIEF_MATCH depends on the number of features detected by SIFT_DET. To predict the performance requirement of the system in advance, we may assume that the number of features in an input image is about 2000. This is a reasonable assumption based on our empirical study of the average number of features in 1280 × 720 images. Thus, the throughput of the BRIEF_DESC module is about 120 k (2 k × 60) features per second in real time. As the number of operations required to extract a BRIEF feature is 2562, the number of operations per second is 307.4 M. In order to meet the requirement of 60 frames/s, we set the clock frequency of this part to 200 MHz; hence, the required parallelism is 1.54×.

Finally, the BRIEF_MATCH module needs to perform feature matching from a set of 2000 features to another set of 2000 features for every frame in the worst case. Then, the throughput of the system is 240 M feature matchings per second in real time. The number of operations for every feature matching is 257. Hence, the required performance is 61.68 G operations per second (worst case). In order to balance the clock frequency and the parallelism, we set the clock frequency of this part to 250 MHz to meet the requirement of 60 frames/s, and the required parallelism gain of this part is 246.7×.
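The arithmetic behind these throughput and parallelism figures can be re-derived in a few lines; the script below only reproduces the numbers quoted in the text (it is not taken from Table I).

```python
# Back-of-the-envelope check of the computational-demand analysis above.
def parallelism(ops_per_result, results_per_second, clock_hz):
    # required parallelism gain = operations per second / clock frequency
    return ops_per_result * results_per_second / clock_hz

FPS = 60
PIXELS = 1280 * 720 + 640 * 360               # octave 0 + octave 1, per frame

# SIFT_DET runs at the 75-MHz pixel clock of 720-p video.
print(parallelism(46, PIXELS * 6 * FPS, 75e6))    # Gaussian filters  -> ~254.4x
print(parallelism(1,  PIXELS * 5 * FPS, 75e6))    # image subtraction -> ~4.6x
print(parallelism(45, PIXELS * 3 * FPS, 75e6))    # feature detection -> ~124.4x

# BRIEF_DESC at 200 MHz, assuming about 2000 features per frame.
print(parallelism(2562, 2000 * FPS, 200e6))       # -> ~1.54x
# BRIEF_MATCH at 250 MHz, 2000 x 2000 matchings per frame (worst case).
print(parallelism(257, 2000 * 2000 * FPS, 250e6)) # -> ~246.7x
```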

Table I summarizes the computational demand analysis stated above, along with the achieved performance of our proposed system. As can be seen in the table, all the modules in our system have achieved a greater parallelism gain than the required one. The detailed design of each module will be introduced in the later sections. Here, we just show the performance of every block in our system achieved through the pure hardware architecture.

B. SIFT Key-point Detection Component

The SIFT key-point detection component consists of the DoG scale space construction module and the stable key-point detection module.

Fig. 4. Diagram of the DoG module. (a) Overview of the overall architecture of the DoG module. (b) Architecture of the line buffer. (c) Architecture of the 2-D Gaussian filter, where SUM = Σ_{i=0..14} Ki = 1024.

The DoG module is driven by the image stream directly from the camera interface and it performs 2-D Gaussian filtering and image subtraction. The result of the DoG module is sent to the stable key-point detection module, which detects the stable key-points as features. The details of each module follow.

1) Implementation of DoG Module: Considering the fact that the 2-D Gaussian kernel is separable and symmetrical [25], Zhong et al. proposed an efficient approach in [4]. In this method, the 2-D convolution is performed by first convolving with a 1-D Gaussian kernel in the horizontal direction and then convolving with another 1-D Gaussian kernel in the vertical direction, as presented in Fig. 4 of [4]. It is more computationally efficient than the traditional 2-D convolution method, which takes 225 multipliers and 224 adders for each 15 × 15 Gaussian kernel. Moreover, it performs the multiplication on the sum of the pixels that share the same coefficient to further reduce the number of multipliers. In their method, a 2-D Gaussian filter takes 16 multipliers, 28 adders, and 14 line buffers for each 15 × 15 Gaussian kernel. The logic resource in their architecture is reduced. However, the memory resource in their design is still very expensive, since it performs the horizontal filter first followed by the vertical filter. In contrast, we propose a new architecture as presented in Fig. 4, which performs the vertical filter first. As a result, the line buffers of every 2-D Gaussian filter in the same octave can be moved out to share the same line buffers. The memory resource utilized in our design is just 1/6 of that of the method presented in [4].
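The two ideas exploited here can be illustrated with a short sketch: separable filtering (a vertical 1-D pass followed by a horizontal one, in the order used by the proposed architecture) and summing the two pixels that share a coefficient before multiplying, so that each 15-tap stage needs only eight multiplications per output pixel. The kernel values and the wrap-around border handling of np.roll are illustrative only.

```python
import numpy as np

def sep_gauss_15(image, coeff):
    """Separable 15-tap Gaussian; coeff[d] is the tap at distance d from the
    center (d = 0..7). Vertical pass first, then horizontal."""
    def pass_1d(img, axis):
        # kernel symmetry: add the two pixels sharing a coefficient, then
        # multiply once -> 8 multiplications per output instead of 15
        out = coeff[0] * img
        for d in range(1, 8):
            out = out + coeff[d] * (np.roll(img, d, axis=axis) +
                                    np.roll(img, -d, axis=axis))
        return out
    return pass_1d(pass_1d(image.astype(np.float64), axis=0), axis=1)

if __name__ == "__main__":
    sigma = 1.6
    d = np.arange(8)
    coeff = np.exp(-d * d / (2.0 * sigma * sigma))
    coeff /= coeff[0] + 2.0 * coeff[1:].sum()      # normalize the 15-tap kernel
    print(sep_gauss_15(np.random.rand(32, 32), coeff).shape)
```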

Fig. 5. Block diagram of the fully parallel stable key-point detection module. (a) Architecture of the window generator. (b) Overall architecture of the stable key-point detection module. (c) Architecture of the low contrast rejection module. (d) Detailed architecture of the extremum detection module. (e) Details of the edge response rejection module.

Additionally, considering the characteristics of FPGA, we normalize the SUM of the Gaussian kernel to 1024, i.e., SUM = Σ_{i=0..14} Ki = 1024 = 2^10. Hence, the division operation in the Gaussian module can be simplified to a shift-right operation, which is much more efficient than performing division in an FPGA. The Gaussian pixel Gs is represented in a fixed-point format (8.10), where eight bits are for the integer part and ten bits for the fraction part.
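A small numeric example of why the SUM = 1024 normalization removes the divider is given below; the pixel and tap values are illustrative, not the actual kernel coefficients.

```python
# With integer taps summing to 1024 = 2^10, "divide by SUM" is a 10-bit shift,
# and the raw accumulator can be read directly as an (8.10) fixed-point value.
pixels  = [18, 200, 43, 200, 18]        # illustrative 8-bit inputs
weights = [52, 240, 440, 240, 52]       # illustrative integer taps, sum = 1024
assert sum(weights) == 1024

acc = sum(p * w for p, w in zip(pixels, weights))   # multiply-accumulate
g_int   = acc >> 10                     # integer part: divide by 1024 as a shift
g_exact = acc / 1024.0                  # value represented by the (8.10) word
print(g_int, g_exact)
```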

2) Implementation of Stable Key-point Detection Module: The feature detection part of our design consists of five modules: window generator, extremum detection, low contrast rejection, strong edge response rejection, and feature information storage, as presented in Fig. 5(b). The window generator module generates 3 × 3 pixel windows, as presented in Fig. 5(a). The extremum detection module evaluates whether the current pixel is a local extremum or not, which requires comparing the current pixel with its 26 neighbors in the 3 × 3 regions at the current and the adjacent scales, as presented in Fig. 5(d). The low contrast rejection module determines whether the current pixel is a low contrast pixel or not, as presented in Fig. 5(c). It is completed by comparing the absolute value of the current D with a constant threshold, which is set to three (the pixel value range is [−31.0, 32.0]) in this paper. To achieve full parallelism, we compare the absolute values of the original gray scales rather than the absolute values of the refined gray scales with the constant threshold. The strong edge response rejection module constructs the Hessian matrix for the current D and compares the ratio of the largest-magnitude eigenvalue to the smallest-magnitude eigenvalue with a threshold, which is set to ten in this paper, as shown in Fig. 5(e).

Finally, we combine the results of the extremum detection module, the low contrast rejection module, and the strong edge response rejection module to determine whether the current pixel is a stable feature or not. The feature information storage module receives the evaluation results and stores the parameters of the stable key-points, which consist of two bits for the octave location o, two bits for the scale location s, 11 bits for the horizontal location x, and 11 bits for the vertical location y, to the key-point FIFO. Considering that features in the same octave will generate the same descriptor if they have the same coordinates even though they are in different scales, we only store one of them if there is more than one feature at the same time.
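A possible packing of these key-point parameters into a 26-bit FIFO word is sketched below; only the field widths come from the text, the field order is an assumption.

```python
def pack_keypoint(o, s, x, y):
    """Pack (octave o: 2 bits, scale s: 2 bits, x: 11 bits, y: 11 bits)
    into one 26-bit word for the key-point FIFO (field order assumed)."""
    assert 0 <= o < 4 and 0 <= s < 4 and 0 <= x < 2048 and 0 <= y < 2048
    return (o << 24) | (s << 22) | (x << 11) | y

def unpack_keypoint(w):
    return (w >> 24) & 0x3, (w >> 22) & 0x3, (w >> 11) & 0x7FF, w & 0x7FF

assert unpack_keypoint(pack_keypoint(1, 2, 1279, 719)) == (1, 2, 1279, 719)
```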

Additionally, as the 2-D Gaussian filtering result Gs is represented in a fixed-point format (8.10), the DoG needs 19 bits (9.10) to represent one pixel. However, an absolute value of Ds greater than 32.0 is highly unlikely in natural images, as presented in [4].

Fig. 6. Feature detection, matching, and error matching rates versus bit width of Ds.

Hence, 16 bits (6.10) (with the sign) for Ds are sufficient. To reduce the FPGA resource utilization without significantly sacrificing accuracy, we have found the optimal pixel width in terms of the feature detection rate (9), matching rate (10), and error matching rate (11).

In this section, the rate of feature detection is defined as the number of detected features in an image with respect to the number of pixels in the image

Pfeature = (#features) / ((#img rows) × (#img cols)). (9)

The rate of matching is defined as the number of matched feature pairs with respect to the product of the number of features in image 1 and the number of features in image 2

Pmatch = (#matched pairs) / ((#img1 features) × (#img2 features)). (10)

The error matching rate is the number of falsely matched pairs with respect to the number of total matched pairs, where we consider a location error greater than three pixels as a false match

Perror = (#false matches) / (#total matches). (11)
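For reference, the three rates defined in (9)-(11) written as plain helper functions over the counts named in the text:

```python
def feature_rate(n_features, rows, cols):
    """Eq. (9): detected features per image pixel."""
    return n_features / (rows * cols)

def match_rate(n_matched_pairs, n_feat_img1, n_feat_img2):
    """Eq. (10): matched pairs over the product of the two feature counts."""
    return n_matched_pairs / (n_feat_img1 * n_feat_img2)

def error_match_rate(n_false_matches, n_total_matches):
    """Eq. (11): false matches (location error > 3 pixels) over total matches."""
    return n_false_matches / n_total_matches
```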

The experimental results are presented in Fig. 6, in which all the values are averages over 100 runs. From the statistics, we can conclude that 9 bits (1 bit for sign, 5 bits for integer, and 3 bits for fraction) for Ds are sufficient, since using a larger pixel width brings little performance improvement.

Due to the proposed fully parallel architecture, the stable key-point detection part in this design performs the stable key-point evaluation in all of the scales per cycle. Additionally, the clock frequency of the detection module is up to 168 MHz. The pixel clock of a 1280 × 720 image at 60 frames/s is less than 80 MHz. Hence, the detection module of the proposed design is able to detect multiple features at a speed of 120 frames/s for 1280 × 720 images.

C. BRIEF Description Component

In this design, we extract the BRIEF descriptor rather than the SIFT descriptor for the detected SIFT key-points. The BRIEF description component consists of six modules: Image Buffer DPRAM Writing (IBDW), Image Buffer DPRAM (IB_DPRAM), Key-point Buffer FIFO (KPB_FIFO), Point Pair Comparison (PPC), Random Generator (RAND), and Point Pair DPRAM (PP_DPRAM), as presented in Fig. 7(a).

Fig. 7. Block diagram of the BRIEF description module. (a) Overview of the overall architecture of BRIEF_DESC. (b) Detail of the RAND module. (c) Finite state machine of the PPC, where Start means a key-point position has been read from KPB_FIFO, Coming means the pixels for the current feature have not been prepared, Gone means the pixels for the current feature have been all or partially overlapped by other pixels, INDPRAM means the pixels for the current feature are currently in the IB_DPRAM, and Finished means the 256 point pairs of the current feature have all been compared.

The IBDW module writes the image into IB_DPRAM, which is used to buffer images for the PPC module. In this design, the BRIEF feature is extracted from a 17 × 17 pixel window. To reduce the Block RAM (BRAM) resource utilization, the depth of IB_DPRAM is set to 32 768 for octave 0 and 16 384 for octave 1, which can buffer 25.6 lines of the source image. The PPC module has to compute the address of the current pixel to fetch a pixel from IB_DPRAM. In order to avoid the influence of bursts in the number of detected key-points, we insert a KPB_FIFO between the SIFT_DET module and the PPC module to balance the speed of feature detection and description. Since BRIEF_DESC utilizes randomly distributed point pairs to extract the BRIEF features, we instantiate a RAND module in the BRIEF_DESC module, as presented in Fig. 7(b). In the RAND module, we utilize the gate delay, which cannot be predicted in FPGA, to generate random numbers that are stored in PP_DPRAM. In this design, we write PP_DPRAM only when the system is reset or powered on. As the BRIEF descriptor in our design is a 256-bit vector, we write 512 random numbers to PP_DPRAM, in which every random number represents an offset from the upright point of the 17 × 17 pixel window.

The PPC module is mainly composed of a finite state machine (FSM), as presented in Fig. 7(c). The functions of these states are as follows (a software sketch of the loop that this FSM implements is given after the list).

1) OTHER: the undefined state or the state after reset. Once it enters this state, the FSM jumps to WAIT.

2) WAIT: the FSM waits for the Start signal to be asserted, and then jumps to the DATA_VALID state.

3) DATA_VALID: checks the validity of the pixels. If the pixels needed by the current feature have not been prepared, the FSM waits in this state until all the needed pixels are in IB_DPRAM; else, if the pixels needed by the current feature have been all or partially overlapped by other pixels, it jumps to WAIT; else, if the pixels needed by the current feature are all in IB_DPRAM, the FSM jumps to UR_COMPUTE.


4) UR_COMPUTE: computes the upright coordinate of the pixel window according to the position information of the key-point and jumps to RP_STATE1.

5) RP_STATE1: gives the address to read data from PP_DPRAM and verifies whether the 256 point pairs have all been compared or not. If yes, it jumps to WAIT; else, it jumps to RP_STATE2.

6) RP_STATE2: computes a pixel address by adding the random number read from PP_DPRAM to the upright position and gives another address to read data from PP_DPRAM, then jumps to RP_STATE3.

7) RP_STATE3: buffers the pixel read from IB_DPRAM as Pixel1 and computes another pixel address by adding the offset number read from PP_DPRAM to the upright position, then jumps to RP_STATE4.

8) RP_STATE4: buffers the pixel read from IB_DPRAM as Pixel2 and compares it with Pixel1, changes the corresponding bit of the descriptor according to the comparison result, then jumps to RP_STATE1.
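As referenced above, the following Python rendering summarizes the per-key-point loop that this FSM implements; the window-origin computation, the address arithmetic, and the comparison direction (taken from (7)) are assumptions for illustration, not the exact hardware behavior.

```python
def ppc_describe(ib_buffer, keypoint, pp_offsets):
    """Build one 256-bit descriptor for 'keypoint' = (y, x): walk the 512
    stored random offsets two at a time (one point pair per RP_STATE1..4
    round) and compare the two fetched pixels, eq. (7). 'ib_buffer' models
    the image lines held in IB_DPRAM; 'pp_offsets' models PP_DPRAM and holds
    512 (dy, dx) offsets relative to the window origin (assumed here to be
    the corner of a 17 x 17 window centered on the key-point)."""
    y0, x0 = keypoint[0] - 8, keypoint[1] - 8      # UR_COMPUTE: window origin
    desc = 0
    for i in range(256):
        (uy, ux) = pp_offsets[2 * i]               # RP_STATE2: first address
        (vy, vx) = pp_offsets[2 * i + 1]           # RP_STATE3: second address
        pixel1 = ib_buffer[y0 + uy][x0 + ux]
        pixel2 = ib_buffer[y0 + vy][x0 + vx]
        if pixel1 < pixel2:                        # RP_STATE4: comparison
            desc |= 1 << i                         # set the corresponding bit
    return desc
```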

D. BRIEF Feature Storage Component

In this paper, we construct an FPGA-based architecture to establish correspondences between consecutive frames. The BRIEF features of each image are matched twice, to the previous frame as well as to the next frame. Hence, we need to store the extracted BRIEF features of the current frame in order to match them to the next frame. In this paper, we adopt the ping-pong operation to store the BRIEF descriptors, as presented in Fig. 3(c). The input MUX selects the DPRAM to store the BRIEF descriptors of the current frame according to the ping-pong signal generated by the ping-pong signal generator, while the output MUX selects the DPRAM from which the features of the current frame and the previous frame are read. The ping-pong signal generator module generates the ping-pong signal according to FVAL.
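A minimal software model of this ping-pong storage, using simple lists in place of the two DPRAMs, is sketched below.

```python
class PingPongStore:
    """Two descriptor banks swap roles at every frame boundary (FVAL), so the
    current frame is written to one bank while the previous frame is still
    readable from the other for matching."""
    def __init__(self):
        self.banks = [[], []]
        self.ping = 0                     # toggled by the ping-pong signal

    def new_frame(self):                  # called on FVAL
        self.ping ^= 1
        self.banks[self.ping].clear()     # reuse this bank for the new frame

    def write(self, descriptor):
        self.banks[self.ping].append(descriptor)

    def current(self):
        return self.banks[self.ping]

    def previous(self):
        return self.banks[self.ping ^ 1]
```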

E. Feature Matching Component

Feature matching is also a computationally demanding step, since we have to compute the feature distances between every feature in the current frame and every feature in the previous frame. The operation requirement of the feature matching for two frames is

Or = NcurNpreOcd (12)

where Ncur is the number of features in the current frame, Npre is the number of features in the previous frame, and Ocd is the number of operations to compute the distance between two features. The design presented in [12] took about 5.03 ms to complete the matching computation for 512 key-points on a 2.66 GHz/Linux x86-64 laptop. However, the number of features in a 1280 × 720 image is about 2000 as per our empirical study. It would take about 80 ms to complete the matching of two 1280 × 720 images, which cannot meet the demand of real-time performance. Considering that the parallel nature of FPGA is very appropriate for BRIEF feature matching, we implemented it in FPGA, as presented in Fig. 8.

Fig. 8. Block diagram of feature matching component.

It consists of four components: FSM1, the feature distance computer (FDC), FSM2, and the matched point pair DPRAM (MPP_DPRAM). The FDC module computes the distance between two features. It is a fully parallel architecture, which can output a distance result per clock cycle. The MPP_DPRAM is used as the matched point pair storage and it latches the number of matched point pairs when FVAL is deasserted.

FSM1 is triggered by every new feature (NewFeature) in the current frame extracted by the BRIEF_DESC module, and it iterates over all of the features in the previous frame to give two features to the FDC module every cycle. The function of each state is as follows.

1) OTHER: the undefined state or the state after reset. Once entering this state, the FSM jumps to WAIT.

2) WAIT: waits until NewFeature is asserted, then jumps to READ_CUR.

3) READ_CUR: reads a feature vector from the feature DPRAM of the current frame. Then jumps to READ_PRE.

4) READ_PRE: reads features from the feature DPRAM of the previous frame and generates the NewResult signal, which is delayed by the FDC module, for FSM2. Moreover, it has to verify whether the features of the previous frame have all been iterated or not. If yes, it jumps to WRITE; else, it reads another feature in the previous frame.

5) WRITE: waits for FSM2 to write the matching point pair to MPP_DPRAM, then jumps to WAIT.

FSM2 receives the distances of the features and finds the feature pairs that have the minimum distance as correspondences. They will be stored in MPP_DPRAM. The function of each state is listed as follows (a software sketch of the overall matching loop is given after the list).

1) OTHER: the undefined state or the state after reset. Once entering this state, the FSM jumps to WAIT.

2) WAIT: waits until NewResult is asserted, then jumps to FIND_MIN, and initializes the minimum distance register (MIN_DIST) to a threshold, which is the upper bound of the valid minimum distance.

3) FIND_MIN: reads a distance result from the FDC module and compares it with MIN_DIST. If the distance is smaller than MIN_DIST, it assigns MIN_DIST with the current distance and marks the current point pair as the matched point pair; otherwise it simply does nothing. It iterates until all the distances have been compared, and then jumps to WRITE.

4) WRITE: if the minimum distance point pair found by FIND_MIN is a valid matched point pair, it writes the point pair to MPP_DPRAM. Once entering this state, the next state is WAIT.
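As referenced above, the overall behavior of FSM1, the FDC, and FSM2 amounts to the brute-force minimum-Hamming-distance search sketched below; descriptors are modeled as 256-bit Python integers, and the threshold of 20 is the MIN_DIST initialization reported in Section V.

```python
def match_brief(cur_descs, pre_descs, max_dist=20):
    """For each descriptor of the current frame (FSM1), scan every descriptor
    of the previous frame (one distance per cycle in the FDC) and keep the
    minimum-distance pair only if it beats the MIN_DIST threshold (FSM2)."""
    matches = []
    for i, c in enumerate(cur_descs):
        best_j, best_d = -1, max_dist           # WAIT: MIN_DIST <- threshold
        for j, p in enumerate(pre_descs):
            d = bin(c ^ p).count("1")           # Hamming distance, FIND_MIN
            if d < best_d:
                best_d, best_j = d, j
        if best_j >= 0:                         # WRITE: store valid matched pair
            matches.append((i, best_j, best_d))
    return matches
```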

V. Test and Verification

In order to evaluate the performance of the proposed system, we present the experimental results in three aspects. First, we evaluate the distinguishability of the proposed system on a self-generated video and the benchmark image sets in [14]. Next, we present the verification platform and evaluate its resource utilization. Then, we compare the performance of the proposed system with the state-of-the-art methods.

A. Performance Evaluation

In order to evaluate the system reliability, we detect stable SIFT key-points as features from a sequence of images. We also extract the BRIEF descriptors on the detected key-points. Then, we match features between two consecutive images by using the Hamming distance, as suggested in the original BRIEF paper [12]. We test the proposed system on a 720-p video, which has 250 image frames1 (resolution 1280 × 720). For this video, 202 587 features are detected, among which 42 379 are matched, with 37 893 correctly matched. The average recognition rate is 89.41%. In this experiment, the initial value of MIN_DIST in the BRIEF_MATCH module is 20. We consider a location error of less than 3.0 pixels as correctly matched. We also establish the correspondences for other images, as presented in Fig. 9. There are four different application scenarios: 1) inside room, 2) natural scenario, 3) navigation of the city, and 4) bird's-eye view of the city. As shown in Fig. 9, the matched point pairs are distributed all over the images. There are some falsely matched point pairs, but these points can be easily eliminated by using robust fitting methods, such as RANSAC.

To illustrate the performance more clearly, we utilize the benchmark image sets2 in [14] to test the distinguishability of the proposed system. Some matching results are presented in Fig. 10. The recognition rates on these benchmark image sets are presented in Fig. 11(a). Currently, the proposed system does not take rotation or scale change into account, although the smoothing provides a fair amount of robustness to rotation and scale change, as confirmed in the experiments below. To quantify the robustness to rotation and scale change, we match the first image of the Graf set (from [14]) with a rotated or down-sampled version of itself.

1. For more experimental results please visit .
2. Downloaded from http://lear.inrialpes.fr/people/mikolajczyk/Database.

Fig. 9. Feature matching results for four different application scenarios. (a) Inside room. (b) Natural scenario. (c) Navigation of the city. (d) Bird's-eye view of the city.

The experimental results for rotation and scale change are presented in Fig. 11(b) and (c), respectively, by plotting the recognition rate with respect to the rotation angle and scaling factor. Moreover, we also compare the performance of the proposed hardware with the golden software implementations of SIFT+BRIEF, SURF, and SIFT in Fig. 11(b) and (c). As shown in Fig. 11(b) and (c), although we have tried to follow the software-based SIFT+BRIEF algorithm as closely as possible, we also made some minor changes to optimize the SIFT detection and BRIEF description for FPGA implementation. All these experiments are done with the proposed embedded architecture.

Moreover, all these test images are sent to the verification platform presented in Fig. 12 via the Gigabit Ethernet port of a personal computer. We also compare the matching result to the software-based SURF+BRIEF implementation presented in [12]. The number of correctly matched point pairs in their design is slightly larger than in the proposed system, but the rate of correctly matched point pairs is almost the same. This is mainly because SURF detects many more visual features than SIFT, and because the Hamming distance threshold in our system is smaller than that in [12]. A possible solution to this issue is to increase the Hamming distance threshold or to change the parameters of the SIFT feature detection to detect more features (but this will reduce the reliability of the SIFT features). Another drawback of this possible solution is that it will largely increase the memory


Fig. 10. Feature matching results for the benchmark image sets presented in [14], where leuven, bikes, ubc, and trees reflect illumination, blurring, JPG compression, and blurring changes, respectively. We match the grayscale versions of these images but display them in color. (a) Leuven. (b) Bikes. (c) ubc. (d) Trees.

usage, since we need to store a 256-bit vector for every feature. It will also increase the number of mismatches and make it very difficult to correct these mismatches.
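A rough calculation illustrates the memory cost mentioned above; the arithmetic below is our own back-of-the-envelope estimate based on the feature counts reported in Section V-A, not a figure from the paper.

```python
# Back-of-the-envelope descriptor storage estimate (our own arithmetic):
# each feature carries a 256-bit BRIEF vector, so storage grows linearly
# with the number of features kept per frame.
DESCRIPTOR_BITS = 256
avg_features_per_frame = 202587 / 250          # average over the 250-frame 720-p test video
descriptor_bytes = avg_features_per_frame * DESCRIPTOR_BITS / 8
print(round(avg_features_per_frame))           # ≈ 810 features per frame
print(round(descriptor_bytes / 1024, 1))       # ≈ 25.3 KB of descriptor storage per frame
```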

B. Development and Verification Platform

In order to evaluate the performance of the proposed system, we establish a verification platform based on the Xilinx XUPV5-LX110T development board. The overview of the verification platform is presented in Fig. 12(a), and a snapshot of the running verification platform3 is presented in Fig. 12(b). The whole verification platform consists of three components: a PC, a XUPV5-LX110T board, and an LCD. The PC is used to send the testing videos via a Gigabit Ethernet port and to receive the matching results from the XUPV5-LX110T. The XUPV5-LX110T displays the images on the LCD via a DVI controller and sends the images to the SIFT BRIEF component to complete image matching.
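For reference, a host-side sender for such a platform could look like the sketch below. The paper does not document the Ethernet wire format, so the address, port, raw 8-bit grayscale framing, and file name here are purely hypothetical; the sketch only illustrates the PC-side role of streaming frames to the board.

```python
# Hypothetical PC-side sender for the verification platform; the address, port,
# and raw-frame framing are illustrative assumptions, not the authors' protocol.
import socket

BOARD_ADDR = ("192.168.1.10", 5000)   # assumed IP/port of the XUPV5-LX110T design
FRAME_BYTES = 1280 * 720              # one 8-bit grayscale 720-p frame

with socket.create_connection(BOARD_ADDR) as sock:
    with open("test_video.raw", "rb") as f:       # hypothetical raw video file
        while chunk := f.read(FRAME_BYTES):
            if len(chunk) == FRAME_BYTES:
                sock.sendall(chunk)               # stream one frame to the board
```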

The proposed system is implemented on a single Xilinx XC5VLX110T FPGA in Verilog-HDL. Table II shows the FPGA resources for each block of our proposed system. The

3 For demo videos of the running verification platform please visit and .

Fig. 11. (a) Recognition rates for the benchmark image sets presented in [14]. (b) Recognition rate when matching the first image of Graf against a rotated version of itself. (c) Recognition rate under (down)scaling using the first image from Graf. L0: original image width; Li: width of matched image i.

entire system uses 32% of the logic resources, 81% of the DSP48E resources, and 92% of the BRAM resources. Due to the optimization in the DoG module and the feature extraction module, this system is able to implement the fully parallel/pipelined feature detection and matching system in a single FPGA chip. Note that the whole-system row in the table also includes an EMAC controller, a ZBT-SRAM controller, and a DVI display controller.

The power consumption of the FPGA chip is 4.512 W, which is estimated by using the Xilinx power estimation tool. The power consumed by the XUPV5-LX110T development board is 13.6 W (5.0 V, 2.72 A), which is measured directly, as presented in Fig. 12(b). Note that the power consumption of the development board also includes the power consumed by the EMAC circuit, the DVI display circuit, and other chips on the board.

Finally, as we implement an EMAC controller in our design, the correspondence point pairs can either be used in an on-chip application or accessed via Ethernet for remote computer


TABLE II

FPGA Resources Utilized by Proposed Architecture

Fig. 12. Verification platform of the SIFT+BRIEF real-time image matching architecture. (a) Architecture of the verification platform. (b) Snapshot of the running verification platform, where the red points represent the detected key-points.

vision applications. Both options have been tested and are ready to be used in real-life computer vision applications, such as structure from motion.

C. Performance Comparison

As presented in Section II, there were several designs attempting to implement feature detection and matching in real time. However, most such solutions implemented just a portion of the entire process. For example, the architecture proposed by Bonato et al. [5] targeted FPGA-based SIFT feature detection and description for 320 × 240 images. The design of Zhong et al. [4] used both an FPGA and a DSP chip to

achieve SIFT feature detection and description for 320 × 256 images. Mizuno et al. [6] proposed an ASIC-based SIFT feature detection design for HDTV. The FPGA-based design proposed by Svab et al. [21] takes about 100 ms to detect SURF features from a 1024 × 768 image. Yao et al. presented an FPGA SIFT feature detection system in [22] for 640 × 480 images. Huang et al. [7] designed a SIFT feature detection and description system for 640 × 480 images based on two interactive hardware components. In addition, there were some other works attempting to implement feature detection and matching on GPUs to achieve real-time performance; for example, Sinha et al. [20] presented a GPU-based design for feature tracking and matching, Heymann et al. [19] implemented SIFT feature detection on a GPU, and Cornelis et al. presented a real-time GPU-based SURF feature detection and matching design in [18].

However, the design presented in [5] only implemented the SIFT feature detection, leaving out the more computationally demanding feature description and feature matching steps. In their design, it takes about 11.7 ms to extract a SIFT feature, which is far from satisfactory for real-time computer vision applications. The design presented in [4] can detect SIFT features at 100 frame/s for 320 × 256 images and takes only 80 μs to extract the SIFT descriptors for the detected features based on an FPGA+DSP architecture. However, it does not perform feature matching, which is also a time-consuming step. The ASIC presented in [6] can detect SIFT features for HDTV at 30 frame/s; however, this design does not contain feature description and feature matching. The FPGA-SURF design presented in [21] takes about 100 ms to detect SURF features and leaves the feature description and feature matching to be implemented in software, which is far from satisfactory for real-time performance. The SIFT feature detection and description architecture presented in [7] can complete feature detection and description for 640 × 480 images within 33 ms when the number of feature points to be extracted is fewer than 890; however, there is no time profiling information about feature matching.

In contrast, in the proposed design, we implement all components, including feature detection, feature description, and feature matching, in a single FPGA for higher resolution images (1280 × 720) than these existing designs. Moreover, the frame rate of our design reaches 60 frame/s, which can easily satisfy the requirements of most real-life computer vision applications. We compare the proposed design with the state-of-the-art designs in terms of performance, as presented in Table III.
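A quick throughput check (our own arithmetic, not a figure from the paper) shows what the 60 frame/s target implies for the pipeline:

```python
# Pixel-rate and per-frame budget implied by 60 frame/s at 1280 x 720
# (our own arithmetic).
WIDTH, HEIGHT, FPS = 1280, 720, 60
print(WIDTH * HEIGHT * FPS / 1e6)   # ≈ 55.3 Mpixel/s sustained input rate
print(1000 / FPS)                   # ≈ 16.7 ms of processing budget per frame
```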


TABLE III

Performance Comparison With the State-of-the-Art

TABLE IV

Comparison of Hardware Resource Utilization With [4], [5], [7], and [21]–[23]

As can be seen in Table III, in terms of illumination, blurring, and JPG compression, our system has similar performance to the state-of-the-art methods. As our system does not take rotation or scale change into account, its robustness to rotation and scale change is a bit weaker. However, this weaker robustness is not considered to be a problem, since the rotation and scale changes between consecutive frames of normal video are generally small. Therefore, our method is suitable for real-time video analytics applications.

The comparison of resource utilization between our system and the approaches presented in [4], [5], [7], and [21]–[23] is not straightforward, because different FPGA technologies are used. Hence, we can only compare the basic resources such as LUTs, registers, DSP blocks, and BRAM. Table IV shows the resource utilization comparison between the proposed approach and the approaches presented in [4], [5], [7], and [21]–[23]. It is evident that the proposed system utilizes the least amount of LUTs, registers, and DSP blocks among the seven different approaches. Our design utilizes more BRAM than those in [4], [5], [22], and [23] but less than that in [7]. That is because the designs presented in [4] and [22] only implemented the feature detection part. Another reason is that the image resolution in our design is much higher than in the designs presented in [4], [5], [22], and [23].

VI. Conclusion

We have presented the design of a computationally efficient feature detection and matching system on a single FPGA chip. This system is able to detect SIFT features, extract BRIEF descriptors for the detected SIFT features, and complete BRIEF feature matching for two images within 16 ms. It is therefore able to comfortably achieve real-time feature detection and matching for image sequences. In our extensive experiments, our system has shown that the processing speed is greatly improved while keeping more or less the same detection and matching accuracy. As the proposed system is a completely stand-alone on-chip system and can be driven directly by video cameras, it can be readily embedded into many real-time systems, such as smart cameras, vision-based robotic systems, and real-time image alignment systems, to provide reliable correspondences. Moreover, the proposed FPGA-based verification system for image processing algorithms is an all-purpose architecture; hence, it can be a good option for most FPGA-based real-time computer vision algorithm implementations.

The major future works include: 1) implementing a robust fitting method, such as RANSAC, after the BRIEF feature matching module to eliminate outlier matching points and 2) optimizing the feature description module to make the system invariant to rotation and scale changes.


Acknowledgment

The authors would like to thank the reviewers for their valuable suggestions and comments.

References

[1] C. Harris, “A combined corner and edge detector,” in Proc. 4th Alvey Vision Conf., 1988, pp. 147–152.

[2] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, 2004.

[3] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “Speeded-up robust features (SURF),” Comput. Vision Image Understand., vol. 110, no. 3, pp. 346–359, 2008.

[4] S. Zhong, J. Wang, L. Yan, L. Kang, and Z. Cao, “A real-time embedded architecture for SIFT,” J. Syst. Arch., vol. 59, no. 1, pp. 16–29, Jan. 2013.

[5] V. Bonato, E. Marques, and G. A. Constantinides, “A parallel hardware architecture for scale and rotation invariant feature detection,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 12, pp. 1703–1712, Dec. 2008.

[6] K. Mizuno, H. Noguchi, G. He, Y. Terachi, T. Kamino, T. Fujinaga, S. Izumi, Y. Ariki, H. Kawaguchi, and M. Yoshimoto, “A low-power real-time SIFT descriptor generation engine for full-HDTV video recognition,” IEICE Trans. Electron., vol. 94, no. 4, pp. 448–457, Apr. 2011.

[7] F.-C. Huang, S.-Y. Huang, J.-W. Ker, and Y.-C. Chen, “High-performance SIFT hardware accelerator for real-time image feature extraction,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 3, pp. 340–351, Mar. 2012.

[8] Y. Ke and R. Sukthankar, “PCA-SIFT: A more distinctive representation for local image descriptors,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., vol. 2, Jun.–Jul. 2004, pp. 506–513.

[9] T. Tuytelaars and C. Schmid, “Vector quantizing feature space with a regular lattice,” in Proc. IEEE Int. Conf. Comput. Vision, vols. 1–6, Oct. 2007, pp. 754–761.

[10] S. Winder, G. Hua, and M. Brown, “Picking the best DAISY,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., vols. 1–4, Jun. 2009, pp. 178–185.

[11] M. Calonder, V. Lepetit, P. Fua, K. Konolige, J. Bowman, and P. Mihelich, “Compact signatures for high-speed interest point description and matching,” in Proc. 12th IEEE Int. Conf. Comput. Vision, Sep.–Oct. 2009, pp. 357–364.

[12] M. Calonder, V. Lepetit, M. Oezuysal, T. Trzcinski, C. Strecha, and P. Fua, “BRIEF: Computing a local binary descriptor very fast,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1281–1298, Jul. 2012.

[13] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proc. 9th IEEE Int. Conf. Comput. Vision, vols. 1–2, Oct. 2003, pp. 1470–1477.

[14] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1615–1630, Oct. 2005.

[15] M. I. McCartney, S. Zein-Sabatto, and M. Malkani, “Image registration for sequence of visual images captured by UAV,” in Proc. IEEE Symp. Comput. Intell. Multimedia Signal Vision Process., Mar.–Apr. 2009, pp. 91–97.

[16] M. Agrawal, K. Konolige, and M. Blas, “CenSurE: Center surround extremas for realtime feature detection and matching,” in Proc. Eur. Conf. Comput. Vision, vol. 5305, 2008, pp. 102–115.

[17] Q. Wang and S. You, “Real-time image matching based on multiple view kernel projection,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., vols. 1–8, Jun. 2007, pp. 3286–3293.

[18] N. Cornelis and L. Van Gool, “Fast scale invariant feature detection and matching on programmable graphics hardware,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit. Workshops, vols. 1–3, Jun. 2008, pp. 1013–1020.

[19] S. Heymann, K. Mueller, A. Smolic, B. Froelich, and T. Wiegand, “SIFT implementation and optimization for general-purpose GPU,” in Proc. 15th Int. Conf. Central Eur. Comput. Graph., Vis. Comput. Vision, Jan.–Feb. 2007, pp. 317–322.

[20] S. N. Sinha, J.-M. Frahm, M. Pollefeys, and Y. Genc, “GPU-based video feature tracking and matching,” in Proc. Workshop Edge Comput. Using New Commodity Arch., vol. 278, 2006, pp. 695–699.

[21] J. Svab, T. Krajnik, J. Faigl, and L. Preucil, “FPGA based speeded up robust features,” in Proc. IEEE Int. Conf. Technol. Pract. Robot Applicat., Nov. 2009, pp. 35–41.

[22] L. Yao, H. Feng, Y. Zhu, Z. Jiang, D. Zhao, and W. Feng, “An architecture of optimised SIFT feature detection for an FPGA implementation of an image matcher,” in Proc. Int. Conf. Field Program. Technol., 2009, pp. 30–37.

[23] M. Schaeferling and G. Kiefer, “Object recognition on a chip: A complete SURF-based system on a single FPGA,” in Proc. Int. Conf. Reconfig. Comput. FPGAs (ReConFig), 2011, pp. 49–54.

[24] D. Lowe, “Object recognition from local scale-invariant features,” in Proc. 7th IEEE Int. Conf. Comput. Vision, vol. 2, 1999, pp. 1150–1157.

[25] S. Mitra, Digital Signal Processing: A Computer Based Approach, vol. 1, 3rd ed. New York, NY, USA: McGraw-Hill, 2004.

Jianhui Wang received the B.S. degree from the Department of Electronics and Information Engineering, Huazhong University of Science and Technology (HUST), Wuhan, China, in 2009. He is currently pursuing the Ph.D. degree at the School of Automation, HUST.

His research interests include real-time embedded systems, image processing, and pattern recognition.

Sheng Zhong received the Ph.D. and M.S. degrees in pattern recognition and intelligent systems, in 2005 and 1999, respectively, as well as the B.S. degree in automatic control in 1994, from Huazhong University of Science and Technology (HUST), Wuhan, China.

He is an Associate Professor with HUST. His research interests include pattern recognition, image processing, and real-time embedded systems.

Luxin Yan (M'12) received the B.S. degree in communication engineering in 2001 and the Ph.D. degree in pattern recognition and intelligent systems in 2007, both from Huazhong University of Science and Technology (HUST), Wuhan, China.

In 2009, he joined the School of Automation, HUST. His research interests include image restoration, video surveillance, and embedded real-time processing hardware.

Zhiguo Cao received the Ph.D. degree in pattern recognition and intelligent systems from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2001, and the M.S. and B.S. degrees from the Department of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, China, in 1990 and 1985, respectively.

He is a Professor with HUST. His research interests include pattern recognition, image processing, data fusion, embedded systems, and system dependability.