
CVPR 2010 Submission #1756. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Accelerated Template Matching with GPU Specific Algorithms

Anonymous CVPR submission

Paper ID 1756

Abstract

Developing parallel processing algorithms for GPUs can lead to a considerable increase in performance. While sequential algorithms, designed for CPUs, can be engineered to run on GPUs, it seems more promising to design algorithms from scratch that target the architecture of GPUs. In this paper we present an algorithm for template matching that is targeted at the GPU architecture and achieves greater than an order of magnitude speedup over algorithms designed for the CPU and reimplemented on the GPU. This shows that it is not only desirable to adapt existing algorithms to run on GPUs, but also that future algorithms should be designed with a multicore architecture in mind to take full advantage of GPUs' benefits.

1. Introduction

The advent of massively multiprocessor GPUs has opened a floodgate of opportunities for parallel processing applications, ranging from cutting-edge gaming graphics to the efficient implementation of classic algorithms [13].

In 2007 NVIDIA released one of the first GPU architectures for general purpose computing. This architecture, known as CUDA (Compute Unified Device Architecture), has since been further refined toward high-performance, computationally intensive applications. The CUDA architecture is an improvement over the previously used General Purpose GPU (GPGPU) architecture in that CUDA contains a fast memory region that can be shared amongst threads, where code can read from arbitrary addresses in memory. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture memory lookups. CUDA also enables faster downloads and read-backs to and from the GPU, as well as full support for integer and bitwise operations, including integer texture lookups [13].

CUDA gives developers access to the native instruction set and memory of the parallel computational elements in CUDA GPUs, which allows the GPUs to effectively become open architectures like CPUs. Unlike CPUs, however, GPUs have a parallel multiprocessor architecture, with each processor core capable of running many threads simultaneously. A hardware thread-execution manager automatically issues threads to the processors without requiring programmers to write explicitly threaded code. These thread or "stream" processors each have their own FPU and registers, and are clustered into groups, with each cluster having its own shared local memory (typically 16K) [7]. Branches in the program code do not impact performance significantly, provided that all 32 threads of a group take the same execution path. Figure 1 depicts the structure of the NVIDIA GeForce 8800 series as an example. The GeForce 8800 contains 128 thread processors, each capable of running 96 threads concurrently, for a total of 12,288 simultaneous threads. These processors are grouped into clusters of 8, with each cluster having a 16K shared local memory [7].
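To make this execution model concrete, the following minimal CUDA kernel is a sketch of our own (the names and sizes are illustrative and not taken from any figure in this paper): each thread handles one element, a block of threads stages data through the fast shared memory, and the branch is uniform across a warp except at the edge of the data.

#include <cuda_runtime.h>

// Illustrative kernel: one element per thread, staged through shared memory.
// Launch as: scaleKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
__global__ void scaleKernel(const float *in, float *out, int n)
{
    __shared__ float tile[256];                  // per-block slice of the 16K shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // stage the element in fast on-chip memory
    __syncthreads();                             // every thread of the block reaches the barrier
    if (i < n)
        out[i] = 2.0f * tile[threadIdx.x];       // branch diverges only in the last partial warp
}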

Template matching is a building block for many high-level Computer Vision applications, such as face and object detection [10, 4], texture synthesis [5], image compression [12, 15], and video compression [11, 15]. Its runtime is often infeasibly slow in raw form [19], and there has been much research into methods for accelerating template matching for various applications. Some methods ignore irrelevant or unnecessary image information to achieve acceleration, but are unable to guarantee finding the best match according to the chosen error measure, e.g. [6, 16, 14]. A second set, which has emerged recently, makes use of bounds on the error measure to achieve acceleration without sacrificing accuracy [19, 8]. Our proposed algorithm falls into that second set.

Previous work with vision algorithms and template matching on the GPU has typically involved adapting sequential algorithms to the data-parallel GPU architecture [3, 9, 18, 1]. We instead designed a template matching algorithm for data-parallel execution from the beginning, to show the advantages of GPU-specific design.

1.1. Template Matching Notation

Throughout this paper we make use of the l1 norm-based distance measure (i.e. the sum of absolute differences) between template and image subwindow. We denote the l1 norm of a vector x by |x|.


Figure 1. NVIDIA GeForce 8800 Architecture

Let vector x ∈ ℝ^n represent the template we are matching. This vector is formed by concatenating the rows of the template image together into one long sequence. We consider each subwindow yi of the search image I a potential match. The subwindows may overlap, and all contain n pixels. For convenience we define Y = {y1, y2, ..., ym} to be the set of all potential matches. In practice m (the number of potential matches) is slightly less than the number of pixels in I.

The error for the ith candidate (or subwindow) is:

Ei = ||x − yi||

Given x and I, a template matching algorithm attempts to find the yi which minimizes Ei.

2. Full Search Method

We first consider the case of the Full Search Method of template matching, otherwise known as the brute force method. The traditional Full Search Method calculates Ei for all yi ∈ Y, and returns yopt = arg min_{yi} Ei. Figures 2 and 3 give the two Full Search Method pseudocodes that we employed for this project: Figure 2 is the pseudocode for the Full Search Method run entirely on the CPU, while Figure 3 is the pseudocode adapted to run on the GPU. Obviously, this approach is computationally expensive. Given that m is the number of subwindows and the template x contains n pixels, it runs in O(mn) time, which comes to ≈ 4 × 10^10 operations. The GPU version achieves speedup that is nearly linear in the number of cores, with the exception of the reduction step, which requires an additional O(log(mn)) steps. The GPU implementation of similar methods has been explored in [1]. While those efforts have resulted in an increase in performance, ultimately they simply adapt an existing algorithm to run on the GPU architecture. Table 1 reports the results that we recorded when implementing the above Full Search Method algorithm in the traditional manner on the CPU, compared to implementing the same algorithm on the GPU using the CUDA architecture.

FullSearchCPU(Y, x)
1  Eopt ← ∞, yopt ← NIL
2  for yi ∈ Y
3     do Ei ← ComputeE(yi, x)
4        if Ei < Eopt
5           then
6              Eopt ← Ei
7              yopt ← yi
8  return Eopt, yopt

Figure 2. Full search method on CPU

Method       Run Time    Copy Time
CPU          23290       N/A
GPU          3042        217.7
GPU Shared   200.68      217.7
GPU Text.    107.38      2.361

Table 1. Runtime in ms for full search template matching on a 512x512 image and a 64x64 template. GPU is the basic form of the full search algorithm as shown in Figure 3. GPU Shared makes use of shared memory to cache the template, and GPU Text. stores all image and template information in the on-card texture memory.

The results in the table illuminate some of the existing bottlenecks in the GPU system, as well as our motivation for working with the GPU. Not only is the GPU much faster than the CPU, but making good use of specific types of GPU memory markedly reduces runtime. Additionally, note that the "copy time" for moving the data to and from the card can be quite significant.
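To ground the "GPU" row of Table 1, the kernel below is a sketch of our own (names and the one-byte-per-pixel layout are assumptions, not the paper's exact code) of the basic one-thread-per-candidate design of Figure 3: thread i computes Ei for subwindow yi and writes it to global memory, after which a separate O(log m) reduction finds the arg min.

#include <cuda_runtime.h>

// One thread per candidate: thread i scores subwindow yi.
// W is the search image row width; h, w are the template dimensions.
__global__ void fullSearchKernel(const unsigned char *I, int W,
                                 const unsigned char *x, int h, int w,
                                 int m, int *E)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // candidate index = thread ID
    if (i >= m) return;
    int cols = W - w + 1;                            // candidates per image row
    int r = i / cols, c = i % cols;                  // top-left corner of yi
    int e = 0;
    for (int a = 0; a < h; ++a)
        for (int b = 0; b < w; ++b)
            e += abs((int)I[(r + a) * W + (c + b)] - (int)x[a * w + b]);
    E[i] = e;   // a follow-up reduction kernel takes arg min over E[0..m-1]
}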

We propose that a considerable increase in performance over these results can be attained when an algorithm that solves an inherently parallel problem is designed from the outset for implementation on a parallel architecture.


FullSearchGPU(Y, x)
1  Eopt ← ∞, yopt ← NIL
2  i ← ThreadID
3  Ei ← ComputeE(yi, x)
4  Ei ← reduce(E)
5  if ThreadID = 0
6     Eopt ← Ei
7     yopt ← yi
8  return Eopt, yopt

Figure 3. Full search method on GPU
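Relative to the basic kernel sketched above, the "GPU Shared" variant of Table 1 caches the template on chip. A hypothetical sketch of that change (again our code, not the paper's): the block cooperatively stages the template into shared memory once, so every thread's inner loop reads it from fast on-chip storage rather than global memory.

#include <cuda_runtime.h>

// "GPU Shared" variant sketch: the 64x64 byte template (4K) is staged into
// shared memory by the whole block before any candidate is scored.
__global__ void fullSearchSharedKernel(const unsigned char *I, int W,
                                       const unsigned char *x, int h, int w,
                                       int m, int *E)
{
    extern __shared__ unsigned char xs[];        // h*w bytes, well under the 16K limit
    for (int k = threadIdx.x; k < h * w; k += blockDim.x)
        xs[k] = x[k];                            // cooperative, coalesced load
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    int cols = W - w + 1;
    int r = i / cols, c = i % cols;
    int e = 0;
    for (int a = 0; a < h; ++a)
        for (int b = 0; b < w; ++b)
            e += abs((int)I[(r + a) * W + (c + b)] - (int)xs[a * w + b]);
    E[i] = e;
}
// Launched with the template size as dynamic shared memory:
// fullSearchSharedKernel<<<blocks, threads, h * w>>>(dI, W, dX, h, w, m, dE);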

size       malloc     copy        malloc 2D   copy 2D
4000       0.067567   0.005253    0.116700    0.014929
400000     0.118616   0.291486    0.122187    0.296680
4000000    0.141160   2.576290    0.180513    2.713126
40000000   0.241793   23.344471   0.629537    24.801236

Table 2. Average results over 1000 trials of basic CUDA memory operations. "malloc" and "malloc 2D" refer to allocating an array and a byte-aligned two-dimensional array on the GPU, respectively. "copy" and "copy 2D" refer to copying data from the "host" (CPU) to the GPU. The first column gives the amount of data used for that experiment, in bytes. All times are in ms.

3. Design Considerations

Designing an algorithm from the beginning to be implemented on a current parallel processing architecture requires a different design paradigm than is currently employed. This is due to both the advantages and limitations of the parallel processing architecture, and deciding how to weigh those against each other. The CUDA architecture in particular has certain drawbacks, due to its current design, that greatly influence how CUDA programmers design their code. For example, the shared memory for each multiprocessor is currently limited to 16K; however, this is by far the fastest memory available on the GPU card. The card also contains a texture memory which is much larger, but is also read-only and slightly slower. Finally, there is the system's main memory, which is naturally very large (relatively speaking), but exacts a tremendous cost in transfer times to and from the GPU memory, with downloading from the GPU to the main memory being the most costly. Essentially, every CUDA program begins on the CPU, or "host", which copies the necessary data from main memory to the GPU, or "device", memory. The CPU then blocks while the GPU functions, or "kernels", perform their computations. When they have completed, the resulting data is copied back to the main memory, and the CPU continues execution. With all this in mind, it becomes apparent that an algorithm that makes the most effective use of a GPU's processing power must do so by making certain design decisions from the beginning. For example, while the shared memory is the fastest on the card, its size is considerably limited, so one must decide what size and type of data is needed and where it should be stored. Additionally, shared memory suffers from "bank conflicts", wherein multiple simultaneous requests to addresses in the same memory bank are serialized. Bank conflicts are avoided when threads access the same address (which is broadcast) or sequential addresses, which fall in different banks; the sketch below illustrates the rule. Furthermore, transfer of data between cores in the same multiprocessor is fast, but transfer between multiprocessors requires the use of slower global memory, so one has to consider which kernels to implement on which cores and group them according to what data they will need to share. In terms of threading, threads should execute in instruction-parallel groups of at least 32 to take the greatest advantage of the CUDA architecture.
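The following toy kernel, an illustration of our own rather than anything from the paper, contrasts the two access patterns: with 16 banks and a 16-thread half-warp, stride-1 indexing is conflict-free, while stride-16 indexing sends every thread of a half-warp to the same bank and serializes the accesses.

#include <cuda_runtime.h>

// Launch as: bankDemo<<<1, 256>>>(d_out);
__global__ void bankDemo(float *out)
{
    __shared__ float buf[256];
    int t = threadIdx.x;
    buf[t] = (float)t;                 // stride 1: each thread of a half-warp hits its own bank
    __syncthreads();
    float fast = buf[t];               // conflict-free read
    float slow = buf[(t * 16) % 256];  // stride 16: all 16 threads hit bank 0, 16-way conflict
    out[t] = fast + slow;
}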

Finally, and perhaps most importantly, due to the considerable cost in time of transferring data from host memory to device memory, one has to decide what data to load from main memory to device memory, as well as when and how much, to minimize these transfers as much as possible. This cost of allocating memory on the card and copying data from the host to the card is explored in Table 2.
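A minimal host-side skeleton of this flow, as a sketch under the assumption of one byte per pixel (error checking omitted): the "malloc"/"copy" rows of Table 2 correspond to cudaMalloc/cudaMemcpy, and the "malloc 2D"/"copy 2D" rows to cudaMallocPitch/cudaMemcpy2D, which allocate rows aligned for coalesced access.

#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const int W = 512, H = 512;
    size_t bytes = (size_t)W * H;
    unsigned char *hImg = (unsigned char *)malloc(bytes);
    memset(hImg, 0, bytes);                          // stand-in for real image data

    unsigned char *dImg, *dImg2D;
    size_t pitch;
    cudaMalloc(&dImg, bytes);                                // "malloc" row of Table 2
    cudaMemcpy(dImg, hImg, bytes, cudaMemcpyHostToDevice);   // "copy" row

    cudaMallocPitch((void **)&dImg2D, &pitch, W, H);         // "malloc 2D" row
    cudaMemcpy2D(dImg2D, pitch, hImg, W, W, H,
                 cudaMemcpyHostToDevice);                    // "copy 2D" row

    /* ... kernel launches go here; the CPU blocks on the copy-back below ... */

    cudaMemcpy(hImg, dImg, bytes, cudaMemcpyDeviceToHost);   // costly read-back
    cudaFree(dImg); cudaFree(dImg2D); free(hImg);
    return 0;
}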

4. GPU Acceleration Method

As stated previously, our intention with this endeavour was to design an algorithm that from the ground up was intended to run on a GPU and take full advantage of its capabilities, the point being to demonstrate that such an algorithm would be considerably faster than any algorithm of the same nature that ran entirely on the CPU, or was designed originally to run on the CPU and then adapted to run on a GPU. With that goal in mind, we chose the following CUDA-enabled NVIDIA graphics card for this project: an NVIDIA GeForce 9800 GTX+, with 16 multiprocessors and 128 cores, running CUDA version 2.30. With the performance capabilities of this card in mind and the above design considerations, we developed an algorithm that takes maximum advantage of the parallel processing capabilities that CUDA offers (see Figures 4-8).

In designing this algorithm, we wanted to off-load as much of the computation that could be conducted in parallel onto the GPU as possible, while still minimizing the amount of memory transfer that had to be done. Consequently, lines 1 through 3 of the pseudocode are conducted on the CPU, while lines 5 through 13 are conducted on the GPU. Essentially, the algorithm begins by performing an initial scan of the data on the CPU, performing very little computation to find initial upper and lower bounds on the match value of each location in the image. This is analogous to the initialization step used in [19], and we make use of their "image strips" to provide the initial bounding. The concept of using upper bounds in template matching was first proposed in [17]. These steps are examined in more depth in Figure 5.


The step which is particularly unique in this instance is the single run of the Prune method on the CPU before beginning the run of the algorithm on the GPU. The Prune step reduces execution time because it lowers the number of locations that the GPU must consider, while doing only a very small fraction of the overall work of the algorithm. Any yi (image location) whose lower bound is greater than the initial E used during pruning is not only removed from consideration, but is never accessed by the GPU. The remaining computation takes place on the GPU. We chose to transfer to the GPU here because the workload increases dramatically at this point, as the algorithm begins comparing pixel values directly to tighten the bounds on the individual yi. The GPU allocates a unique thread for each yi, which tightens the lower and upper bounds on the match distance before testing whether it is possible that the given yi is the best match. The pixel values of the yi are held in texture memory, as is the template, since they are not modified during the run. This allows for a great increase in access and copy speeds. The bounds of the candidates are held in global memory initially, but since we have chosen a one-to-one candidate-to-thread mapping, each thread copies the bounds to local memory and works on them there, avoiding costly global memory access. Although branching is often avoided in GPU work, we stop those threads whose candidates are no longer possible matches (i.e., those threads for which, after an update on the bounds, li > Eguess). These threads wait at a synchronization point, allowing the multiprocessor to allocate more time to the threads that still contain potential matches. Each thread then compares its current distance value against a global minimum to allow for synchronization between multiprocessors. This step is somewhat costly, but only those threads which have determined that their yi is still a potential match perform it. The combination of these steps to reduce the memory footprint, memory copy time, and execution workload on the GPU results in our algorithm's accelerated performance. This design is not hardware specific, and can be ported to any CUDA GPU with similar results.
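As one possible CUDA rendering of this per-candidate stage, the sketch below is our own illustration (the names, the strip-by-strip loop, and the use of atomicMin are our assumptions rather than the paper's exact kernel): each thread owns one surviving candidate, accumulates the error one row-strip at a time so that the running partial sum acts as a lower bound, retires early once that bound exceeds the global guess, and finally publishes its exact value.

#include <cuda_runtime.h>
#include <limits.h>

// One thread per surviving candidate. cand[t] holds the linear index of the
// candidate's top-left corner; Eguess is the global best-so-far, kept in
// global memory and updated atomically so all multiprocessors see it.
__global__ void refineCandidates(const unsigned char *I, int W,
                                 const unsigned char *x, int h, int w,
                                 const int *cand, int nCand,
                                 int *Eguess, int *E)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nCand) return;

    int pos = cand[t];
    int r = pos / W, c = pos % W;
    int e = 0;                                        // running partial sum = lower bound
    for (int a = 0; a < h; ++a) {                     // one image strip (row) at a time
        for (int b = 0; b < w; ++b)
            e += abs((int)I[(r + a) * W + (c + b)] - (int)x[a * w + b]);
        if (e > *Eguess) { E[t] = INT_MAX; return; }  // bound exceeded: retire early
    }
    E[t] = e;                                         // exact error for this candidate
    atomicMin(Eguess, e);                             // tighten the global guess for all threads
}
// The host (or a small reduction kernel) then takes arg min over E[] to
// recover ybest.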

5. Results

After many experiments with data storage, transfer, and manipulation, we have developed a very fast and efficient implementation of our algorithm, which takes full advantage of the CUDA architecture on our GPU. Our experimental design consisted of averaging the results of running our algorithm over a number of trials with a variety of images of different sizes and resolutions. We first tested with a few standard test images (lenna, at 512x512, and pentagon, airport, and man at 1024x1024), and then considered a few images captured on a modern digital camera. We extracted a template from each, and tested with noise ranging from noiseless to very noisy (σ = 70).

ParallelTemplateMatch(x, Y)
   ▷ We first need to initialize the lower and upper bounds for each yi ∈ Y.
1  InitBounds(Y, x)
   ▷ Using those bounds we make a guess at the best match value.
2  Eguess, ybest ← FindBestInitMatch(Y, x)
   ▷ We remove everything in Y whose bounds prohibit it from being a best match.
3  Y ← Prune(Y, Eguess)
   ▷ At this point Y is typically very small. From this point onwards, the
   ▷ code is executed on the GPU by many threads in parallel.
4  while |Y| > 1
5     do
         ▷ Tighten the bounds on the remaining members of Y.
6        i ← ThreadID
7        UpdateBounds(yi, x)
8        if li < Eguess
9           then
10             if ui < Eguess
11                then li, ui ← ComputeE(yi, x)
12                     Eguess ← ui
13                     ybest ← yi
14          else break
15       if Ei < Eguess
16          then Eguess ← Ei, ybest ← yi
17 return ybest, Eguess

Figure 4. The main method of our GPU-based template matching algorithm.

We then ran the above Full Search Method (using textures, as described above) for the same number of trials on the same GPU using the same input. We compare and report the run time for both methods in milliseconds. Our experimentation yielded the following performance results. When comparing the performance of our algorithm to the Full Search Method on small images (512x512) at zero to low noise levels, our algorithm has nearly the same results as the Full Search Method. However, as the amount of noise increases to extreme levels, our algorithm begins to slow down, while the Full Search Method remains unchanged. This is due to the fact that our algorithm can get bogged down in the pre-processing step, wherein it attempts to eliminate candidates, which becomes more difficult as noise levels increase. When comparing our algorithm's performance to that of the Full Search Method on medium to large images, however, one can see the tremendous performance increase of our algorithm. With an image size of 1024x1024 and a template size of 128x128, our algorithm experiences a 12x performance increase over the Full Search Method. Furthermore, with an image size of 2306x1535 and a template size of 304x280, our algorithm performed 7 times faster, and 38 times faster with a 3072x2304 image and a template size of 584x782.


InitBounds(Y, x)
1  for yi ∈ Y
2     do ui ← ||x| + |yi||
3        li ← ||x| − |yi||

FindBestInitMatch(Y, x)
1  lmin ← ∞, ymin ← NIL
2  for yi ∈ Y
3     do
4        if li < lmin
5           then lmin ← li
6                ymin ← yi
7  lmin, umin ← ComputeE(ymin, x)
8  return lmin, ymin

Prune(Y, Eguess)
1  for yi ∈ Y
2     do
3        if li > Eguess
4           then Y ← Y − {yi}
5  return Y

Figure 5. The relevant subroutines called by our main method.

ComputeE(yi, x)
1  Ei ← |Σ_{a,b} (x(a, b) − yi(a, b))|
2  return Ei

Figure 6. Here we assume the L2 norm. The summation is over the pixel coordinates of x and yi. Note that this routine is computationally expensive; in accelerating template matching we are essentially trying to call it as rarely as possible.

UpdateBounds(yi, x)
1  if li ≠ ui
2     then
3        diff ← |yi_a − x_a|
4        li ← diff + ||yi − yi_a| − |x − x_a||
5        ui ← diff + ||yi − yi_a| + |x − x_a||

Figure 7. The procedure for tightening the bounds in our algorithm. What happens in UpdateBounds depends on the error measure; this example uses the L1 norm. One way to visualize a is to imagine masking off the lower (n−1)/n of yi and x, and calling the remaining portions a. Each time we call this function, we would unmask another 1/n and add it to a.
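For completeness, the validity of these bounds follows from the triangle inequality. Writing the unmasked portions as x_a, yi_a and the still-masked remainders as x_r, yi_r (our notation for this derivation, not the paper's), the l1 norm splits over the two portions:

\begin{align*}
E_i = |x - y_i| &= \underbrace{|x_a - y_{i_a}|}_{\text{diff (computed)}} + |x_r - y_{i_r}|,\\
\bigl|\,|x_r| - |y_{i_r}|\,\bigr| \;\le\; |x_r - y_{i_r}| \;&\le\; |x_r| + |y_{i_r}|,\\
\text{so}\quad l_i = \text{diff} + \bigl|\,|x_r| - |y_{i_r}|\,\bigr| \;\le\; E_i \;&\le\; \text{diff} + |x_r| + |y_{i_r}| = u_i.
\end{align*}

Each call to UpdateBounds moves one more strip from the masked remainder into the computed portion, so diff grows and the bracket tightens until li = ui = Ei.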

The performance of our algorithm in these cases varies from image to image. Based on inspection of the histograms of the respective images, we found that higher-contrast images allow stronger performance. Figure 8 and Table 3 summarize these results.

Figure 8. Ratio of the runtime of Full Search on the GPU to the runtime of our algorithm, plotted as speedup against noise level σ for the images 'man', 'airport', and 'pentagon'.

Image     Noise   Parallel   Full Search   Improvement
second    0       3979.879   27930.175     7.018
rob ref   0       6123.839   237215.515    38.736

Table 3. Runtimes in ms for our parallel algorithm and for Full Search. The images 'second' and 'rob ref' were taken with a modern digital camera, and are of size 2306x1535 and 3072x2304 respectively. These larger images allow for comparatively large improvements in runtime.

It is important to note that, due to the relatively sizeable loading times from main memory into GPU memory, and likewise from GPU memory back to main memory, those load times must be taken into consideration when calculating the total cost of a run on a GPU. These copying times are included in our figures. Just as the Full Search Method implemented on the GPU experienced a performance increase of 226x over its implementation strictly on the CPU, so has our algorithm experienced a 38.5x performance increase over the Full Search Method adapted for the GPU. Furthermore, our algorithm overall has an 8701x increase in performance over the original Full Search Method as it is traditionally executed on the CPU. This undoubtedly demonstrates that a very considerable performance increase can be achieved when an algorithm is designed from the beginning to run on a GPU, and when its code is optimized to take full advantage of the GPU's architecture and computing power.

6. Conclusions and Future Work

We have shown here that while adapting existing algorithms to run on GPUs can provide considerable increases in performance, an algorithm that is designed specifically to run on a GPU can achieve a nearly 40x performance increase over algorithms that are simply adapted to run on GPUs, as is the case with our template matching algorithm versus the Full Search Method.


Figure 9. The images 'man' and 'pentagon'. The low-contrast nature of 'pentagon' can be seen in this comparison.

Therefore, a paradigm that involves designing algorithms from the onset to run on GPU architectures must be adopted to take full advantage of all that this technology has to offer. Our design improvements should be extendable to other GPU architectures, where they would offer similar improvements.

NVIDIA recently announced the development and imminent release of its newest CUDA architecture, codenamed "Fermi" [2]. This architecture was developed expressly with computationally intensive scientific research in mind. Consequently, it addresses many of the shortcomings of the existing CUDA architecture, notably the limited shared memory size, the main-memory-to-GPU bottleneck, and the lack of double precision capabilities. In addition, Fermi will feature ECC memory and C++ support [2]. This should allow far more complex algorithms to make their way to the GPU, and allow further research along the lines of this paper. More advanced acceleration techniques, such as [17], may be possible. The work done here could very well be extended to multimedia database search, as our algorithm's ability to eliminate many candidates before calling the GPU would allow searching a very large database without overwhelming the GPU's limited memory. Additionally, using a clever memory copy algorithm, one could adapt this algorithm to search extremely large images, such as those generated by astronomical surveys, by never loading the entire image at once onto the GPU.

References

[1] IAP09 CUDA@MIT 6.963, 2009.
[2] NVIDIA's next generation CUDA compute architecture: Fermi. Sept. 2009.
[3] A. Abate, M. Nappi, S. Ricciardi, and G. Sabatino. GPU accelerated 3D face registration / recognition. In Advances in Biometrics, pages 938-947, 2007.
[4] R. Brunelli and T. Poggio. Face recognition: features versus templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1042-1052, 1993.
[5] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 341-346. ACM, 2001.
[6] A. Goshtasby, S. H. Gage, and J. F. Bartholic. A two-stage cross correlation approach to template matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(3):374-378, 1984.
[7] T. R. Halfhill. Parallel processing with CUDA. Microprocessor Report, (01), 2008.
[8] Y. Hel-Or. Real-time pattern matching using projection kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(9):1430-1445, 2005.
[9] J. Huang, S. P. Ponce, S. I. Park, Y. Cao, and F. Quek. GPU-accelerated computation for robust motion tracking using the CUDA framework. In Proceedings of the 5th International Conference on Visual Information Engineering (VIE 2008), pages 437-442, 2008.
[10] Z. Jin, Z. Lou, J. Yang, and Q. Sun. Face detection using template matching and skin-color information. Neurocomputing, 70(4-6):794-800, 2007.
[11] R. Li, B. Zeng, and M. Liou. A new three-step search algorithm for block motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 4(4):438-442, 1994.
[12] T. Luczak and W. Szpankowski. A suboptimal lossy data compression based on approximate pattern matching. IEEE Transactions on Information Theory, 43(5):1439-1451, 1997.
[13] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879-899, 2008.
[14] O. Pele and M. Werman. Robust real-time pattern matching using Bayesian sequential hypothesis testing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1427-1443, 2008.
[15] N. Rodrigues, E. da Silva, M. de Carvalho, S. de Faria, and V. da Silva. On dictionary adaptation for recurrent pattern image coding. IEEE Transactions on Image Processing, 17(9):1640-1653, 2008.
[16] A. Rosenfeld and G. Vanderburg. Coarse-fine template matching. IEEE Transactions on Systems, Man and Cybernetics, 7(2):104-107, 1977.
[17] H. Schweitzer, R. F. Anderson, and R. A. Deng. A near optimal acceptance-rejection algorithm for exact cross-correlation search. In Proceedings of the International Conference on Computer Vision, Kyoto, Japan, 2009.
[18] L. Di Stefano, S. Mattoccia, and F. Tombari. Speeding-up NCC-based template matching using parallel multimedia instructions. In Proceedings of the Seventh International Workshop on Computer Architecture for Machine Perception (CAMP 2005), pages 193-197, 2005.
[19] F. Tombari, S. Mattoccia, and L. Di Stefano. Full-search-equivalent pattern matching with incremental dissimilarity approximations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):129-141, 2009.
