TECHNIQUES TO LEVERAGE
DATA-PARALLEL GPU ACCELERATION FOR
COMPUTER VISION ALGORITHMS
by
Allen Paul Nichols
B.S., University of Colorado Denver, 2007
A report submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Electrical Engineering
2011
This report for the Master of Science
degree by
Allen Paul Nichols
has been approved
by
Daniel A. Connors
Robert Grabbe
Yiming Deng
Date
Nichols, Allen Paul (M.S., Electrical Engineering)
TECHNIQUES TO LEVERAGE DATA-PARALLEL GPU ACCELERATION FOR COMPUTER VISION ALGORITHMS
Computer vision is the science that enables computer systems to extract infor-
mation from an image or a sequence of images. The development of computer vision is
essential for the advancement of a multitude of areas including medical, entertainment
and security. Computer vision systems are useful for tasks such as industrial control,
event detection, informational organization and object modeling. Other domains of
computer vision systems also include motion analysis, image restoration and scene con-
struction. With a wide variety of emerging applications, the demand for more advanced
computer vision systems is quickly growing.
There are several limiting factors when developing accurate real-time computer
vision systems. Most computer vision tasks require a great deal of mathematical com-
putation. For many computer vision algorithms, the analysis of a single image can take
anywhere from a few seconds to several hours to process. In short, computer vision algorithms require a large number of computations as well as an equally large amount of memory. Computer vision algorithms are applied to a broad range of environments, from home entertainment systems to the operation of unmanned aerial and ground vehicles. Each new generation of application increases the need for more computational resources. Traditionally, software developers and even scientists have relied strictly on increases in processor clock frequency as the primary method of gaining performance for the next generation of applied algorithms.
Emerging technology constraints are slowing the rate of performance growth in
computer systems. Specifically, designers are finding it difficult to address strict processor power consumption and cooling constraints. Design modifications to address
power consumption generally limit processor performance and reduce peak operating
frequency, thus changing the trend of providing increased system performance with every
new processor generation. As such, modern architectures have diverged from the clock
speed race into the multicore era with multiple processing cores on a single chip. While
the multicore design strategy improves chip power density, there remains significant potential for improving run-time power consumption.
Graphics Processing Units (GPUs) are commonly found on graphics cards and
main computer boards. These specialized processors have great potential for solving
a number of problems. Unlike traditional Central Processing Units (CPUs), the GPU
contains as many as several hundred mathematical computation cores. These cores
have evolved recently from being able to perform simple graphics computations to fully
capable processing engines. Owing to their many processing cores, GPUs can perform mathematical computations in a massively parallel manner.
Many applications and algorithms have the potential to take advantage of the
parallel processing capabilities of GPU systems. In most cases, problems possessing
data-level parallelism are best suited for GPU execution. Data parallelism focuses on distributing large amounts of data across different parallel computing cores. A problem is data-parallel if each core can perform the same task on a different piece of the distributed data. There are distinct ranges of data-parallel problems.
Small scale image processing that includes the parallel manipulation or analysis of pix-
els can be achieved with multiprocessor extensions such as Single Instruction, Multiple
Data (SIMD). Larger data-parallel problems can be solved with large-scale distributed systems consisting of multiple independent computers that communicate through a computer network or grid. Data-parallel problems that fall between these two extremes are ideal candidates for optimization. In this categorization, modern
computer vision applications fall into a unique domain since they are more complex than
simple image processing tasks yet would not be described as large enough to require
massive computing resources of a distributed system.
This thesis investigates the potential of adapting computer vision algorithms to
execute on GPUs. As GPUs operate in a heterogeneous system in which both the
CPU and GPU perform some fraction of the computational work, there are unique
performance constraints to explore. This thesis considers two primary parameters in
developing optimal GPU solutions: problem size and problem reformulation.
Problem Size: For every application, the problem size (number of data items
to be processed) is a direct factor to consider when computing results on a GPU. Most
GPU systems are deployed as hardware accelerators for CPU systems, requiring memory
transfers to be performed between each computational component. As the GPU system
includes its own memory system, a necessary step in the heterogeneous system is to
transfer data between the CPU and GPU memory systems. Applications with a smaller
input size may not necessarily benefit from using the GPU even for the computationally
intensive sections. An optimal use of GPUs would include either compile-time or run-
time capabilities to detect computation sizes and automatically choose the best resource
(CPU only or CPU/GPU combination) for the task. Compile-time techniques would have the code written in such a way that the executable decides at fixed computation thresholds to use a particular resource. A run-time approach, by contrast, would allow the application to determine the optimal method at run-time based on computation size, available resources and any other computation parameters.
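Such a run-time approach can be sketched in a few lines. The threshold value, function names and stand-in workload below are illustrative placeholders, not code from this thesis; a real dispatcher would calibrate the threshold from measured transfer and kernel-launch overheads.

```python
# Illustrative run-time dispatcher: pick the CPU or GPU path from input size.
# GPU_THRESHOLD is a hypothetical calibration point below which memory
# transfer overhead would dominate any GPU speedup.

GPU_THRESHOLD = 100_000

def process_on_cpu(data):
    # Stand-in for the serial CPU implementation of some kernel.
    return [x * 2 for x in data]

def process_on_gpu(data):
    # Stand-in for a GPU path; a real version would copy `data` to device
    # memory, launch a kernel, and copy the result back.
    return [x * 2 for x in data]

def process(data):
    """Choose the execution resource at run-time based on problem size."""
    if len(data) < GPU_THRESHOLD:
        return process_on_cpu(data)
    return process_on_gpu(data)
```

Both paths must produce identical results; only the crossover point differs per machine.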
Reformulation: For every application, there is a unique degree to which the computation tasks can be transformed for the GPU resources. Often the reformulation
requires significant code development and sufficient experience with GPU programming
techniques. Several forms of reformulation for GPU execution have been studied for
sorting [1] and parallel reduction methods [1]. This thesis examines two techniques for
mapping non-traditional computations to GPUs: work reformulation and computation
speculation. These techniques are not typically exploited on traditional CPU systems, as they appear to require more computational time. However, because the GPU can perform a large number of computations very quickly, there are successful ways to exploit it even for algorithms that are not data-level parallel.
This thesis is organized as follows: Chapter 2 discusses the motivation and background of computer vision applications. Chapter 3 examines several examples of problem solving on GPUs using reformulation techniques. Chapter 4 presents performance data for the various optimization cases. Finally, Chapter 5 concludes this thesis.
2. Background and Motivation
2.1 Computer Vision
Computer vision systems are integrated within a wide variety of industrial and
scientific applications. Such systems extract key information from an image or series of
images necessary to complete a single specific task or a series of tasks. There are increas-
ing uses of computer vision in emerging software applications and product development.
In the scientific realm, computer vision methods exist in medical imaging, intelligent
robotics and topographical modeling. There are also many industrial applications such
as autonomous vehicles, visual surveillance, industrial inspection and quality control.
Even modern day digital cameras now apply simple algorithms for face detection to
ensure the best outcome in photographs.
Many computer vision applications already have well-defined algorithms and associated methodologies. Most of these algorithms tend to be computationally intensive as well as memory intensive. The computationally intensive nature of computer-based vision algorithms has traditionally deterred the development of new applications because results could not be generated in real-time. In many cases, only off-line application of computer vision systems is expected. Nevertheless, as computer performance
generally increases in each generation, there is a great potential for real-time computer
vision applications to become more common. However, as time progresses, the data in-
put available for computer vision algorithms increases as the image size and resolution
increase. This increase in resolution allows for greater scientific discovery and higher
industrial precision. However, the trade-off of more precise image quality and availability is a further increase in the demand for the speed and volume of computations that must be performed. In many ways, computer system designs are consistently behind the performance needs of the latest applications.
The application of multi-core and many-core processing architecture systems is
an area of current research and development. A system with multiple processor cores
would be ideal for a wide variety of computer vision applications. Each algorithm must
be carefully studied to determine if all or part of the computation can be performed in
parallel. Some computer-based vision algorithms lend themselves to computational parallelism while others do not.
2.1.1 San Diego Vision Benchmark Suite
To gain a better understanding of the possible advantages to be gained with
many-core systems, a set of common computer vision algorithms will be carefully in-
spected and tested. The specific sub-set of algorithms to be examined has been de-
veloped by the Department of Computer Science and Engineering at the University of
California, San Diego. The benchmark suite they have developed is known as “The San Diego Vision Benchmark Suite” (SD-VBS) [2]. The suite contains applications from
the following representative areas: Image Processing and Formation; Image Analysis;
Image Understanding; and Motion, Tracking and Stereo Vision. This suite contains
nine representative computer vision applications and each application contains a set of
image inputs that vary in size.
2.1.2 Image Segmentation
The Image Segmentation algorithm processes an image by dividing it into con-
ceptual regions or segments. These regions include boundaries, borders and objects that
appear in the image. The algorithm functions on the premise that a set of pixels share
a common set of characteristics [2]. This algorithm is commonly applied to fingerprint
and face recognition. Other applications include medical imaging, machine vision and
computational photography.
To ease understanding of this complex algorithm, it has been broken down into
three separate parts or sections. The first section of the algorithm deals with the con-
struction of a similarity matrix. This matrix is computed by analyzing pixel-pairs across
the entire data input stream (image). The second section involves the computation of
discrete regions based on the results from the first matrix computation. The third and
final part of this algorithm normalizes the segmentation results from the regions previ-
ously computed. This application is very computationally intensive based on the fine
granularity of multiple complex operations [2]. Figure 2.1 shows an example of an image
after it has been processed by the Image Segmentation algorithm.
2.1.3 Disparity Map
This algorithm takes a pair of stereo images as an input. These stereo images are taken from slightly different positions looking in the same direction. This is similar to
a person taking a picture with the left eye looking through a traditional camera view-
finder and then switching the camera to the right eye and taking an additional picture.
There are a number of applications in which this algorithm could be utilized. One
possible computer vision application for this algorithm would be to use two cameras
Figure 2.1: A pre-processed image (left) and the same image processed by the Image Segmentation algorithm (right).
as depth sensors on a conveyor belt for an assembly line. After running this algorithm on the input images, the computer control system would have depth information about where each product is on the conveyor belt.
The Disparity algorithm computes depth information based on a set of two
stereo images that are provided. Take, for example, a robot with cameras placed as eyes for virtual vision. This algorithm could be applied to these camera inputs and the
computer could calculate depth information based on the stereo input. Other industrial
applications include systems such as intelligent cruise control, pedestrian tracking and
collision avoidance systems.
The San Diego Vision Benchmark Suite has implemented this algorithm using the
concept of stereopsis [2], which allows depth analysis to be performed based on a set of stereo images. Figure 2.2 shows an example of a set of images that has been processed
by the Disparity algorithm and the generated disparity map.
This algorithm computes a dense disparity map between two images while preserv-
ing any discontinuities resulting from image boundaries. The concept of dense disparity
mapping operates on the premise that every pixel of a given image is important, not
just sections or features of an image. Because every pixel is analyzed, this algorithm
is very computationally intensive. The general sections of the Disparity algorithm are
filtering, correlation, calculation of the sum of squared differences and sorting.
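The correlation and sum-of-squared-differences steps can be illustrated with a small sketch. The window size, search range and pixel rows below are hypothetical, and a real implementation operates on full 2-D windows across the whole image rather than on single rows.

```python
# Illustrative disparity search along one scanline: compare a window in the
# left image against candidate shifted windows in the right image and keep
# the shift with the lowest sum of squared differences (SSD).

def ssd(left_win, right_win):
    """Sum of squared differences between two equal-sized pixel windows."""
    return sum((a - b) ** 2 for a, b in zip(left_win, right_win))

def best_disparity(left_row, right_row, x, win=3, max_d=4):
    """Return the shift d that minimizes SSD for the window starting at x."""
    left_win = left_row[x:x + win]
    best_d, best_cost = 0, float("inf")
    for d in range(min(max_d, x) + 1):      # try each candidate shift
        right_win = right_row[x - d:x - d + win]
        cost = ssd(left_win, right_win)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

Every pixel's search is independent of its neighbors, which is what makes the dense disparity map a candidate for one GPU thread per pixel.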
Figure 2.2: Stereo image inputs (left and right). Output image after processing by the Disparity algorithm (bottom).
2.1.4 Feature Tracking
The operation of feature tracking is fundamental in computer vision systems.
Feature tracking is the process of locating and characterizing moving objects given a set
of subsequent frames. When tracking is enabled in vision systems, multiple objects in a
given field of view can be monitored and analyzed with other computer vision algorithms.
Applications for this algorithm include robotic vision systems and automotive traffic
monitoring systems. Figure 2.3 shows an example of a series of sequential image inputs
and the resulting motion vectors that this algorithm generates.
The SD-VBS implementation involves the Kanade-Lucas-Tomasi (KLT) tracking
algorithm [2]. The overall algorithm can be broken down into three major sections.
The first section operates on pixel level granularity while the second and third sec-
tions operate on coarse grained data or feature points. The first section is an image
processing phase. This phase performs tasks such as noise filtering, gradient image computation and image pyramid construction. This low-level image processing is comprised mostly of Multiply-and-Accumulate computations. The second and third sections contain the core functionality of the algorithm. These routines involve feature extraction and
feature tracking, respectively. The core functionality is based on a large number of
complex matrix operations and vector estimation.
Figure 2.3: Series of sequential image inputs (left) and the resulting representative motion vectors (right).
2.1.5 Support Vector Machines
The support vector machine (SVM) algorithm is used for data classification and
regression analysis. For each application, the algorithm separates the input data into two categories, choosing the boundary that yields the maximal geometric margin between the data sets. This classic machine learning algorithm is closely related to neural networks and is a form of generalized linear classifier. Figure 2.4 shows a representative data set with lines depicting the maximal margins.
The SVM algorithm is organized into two distinct stages. The first stage is the training phase, in which the SVM classifier is trained on the training data. When the training phase is complete, the classifier has a polynomial function that describes the learning data. The second stage involves applying the input data set to the classifier. As the algorithm continues, these same two stages are iterated multiple times to achieve higher accuracy of the polynomial function.
Figure 2.4: A representative data set and the result of the Support Vector Machine algorithm. H3 (green) doesn’t separate the two classes. H1 (blue) does, but with a small margin, and H2 (red) separates them with the maximum margin.
2.2 Data-level Parallelism
Data parallelism is a method of parallel computing across multiple processors.
This type of parallelism focuses on distributing the data across different parallel computing nodes. When multiprocessor systems are given a data-parallel task, each processor
performs the same task on a different piece of data. Flynn’s taxonomy classifies this
as Single Instruction, Multiple Data (SIMD) [3]. In some instances, a single execution thread controls all of the ongoing operations. In other situations, different threads control the operations, but ultimately all threads execute the same code.
We call these algorithms data parallel because their parallelism comes from si-
multaneous operations using large sets of data [4]. To illustrate data-level parallelism,
consider a system with ten processor cores. The task at hand is to add together two matrices with ten elements each. Code can be written so that each processor performs one addition and store. At run-time, each processor executes exactly the same instructions, but on different pieces of data. By assigning each matrix position to a unique processor, the addition completes many times faster than on a system with only one processor core. In this particular example, each matrix position is independent of the surrounding positions, making data-level parallelism possible.
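The ten-core addition example can be sketched with a thread pool standing in for the hardware cores; the worker count and data are illustrative. Every worker runs the identical task, one add-and-store, on a different element index.

```python
# Data-parallel matrix (vector) addition sketch: one "core" per element,
# each executing the same add-and-store on a different piece of data.

from concurrent.futures import ThreadPoolExecutor

def add_matrices(a, b, workers=10):
    out = [0] * len(a)

    def add_one(i):
        # The identical task each core performs, on its own index.
        out[i] = a[i] + b[i]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() distributes the indices across the workers.
        list(pool.map(add_one, range(len(a))))
    return out
```

Because no element depends on any other, the assignment of indices to workers can be arbitrary, which is the defining property of a data-parallel task.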
2.2.1 Data-Level Parallelism Opportunities with SD-VBS
Large portions of the SD-VBS benchmark suite exhibit forms of data-level paral-
lelism. Upon careful inspection of the internal workings of the algorithms, it can be seen
that a large portion of the routines are performing the same task on different pieces of
data repeatedly. There are also several portions of the algorithms which are dependent
on the previous iteration and cannot be parallelized at the data level.
In the Disparity algorithm, one section of the code calculates an output matrix
based on two input matrices. The algorithm takes a given point from each matrix and
performs a simple calculation. The result is then stored in an output matrix. Examples
like this lend themselves very nicely to data-level parallelism. Other portions of this
algorithm allow for the same optimization.
The Image Segmentation algorithm operates at pixel granularity, which results in a large number of repetitive operations. The execution time of this algorithm is almost completely consumed by a sort of a single-dimensional array of numbers. A parallel sort can run significantly faster than a serial sorting method.
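One common parallel formulation of sorting, sketched here with an illustrative chunk count, splits the array into independent runs that could be sorted concurrently and then merges them. The merge below is serial and stands in for the final combining pass; on a GPU the per-chunk sorts would run in parallel.

```python
# Chunk-and-merge sort sketch: sort independent runs (the parallelizable
# part), then perform a k-way merge of the sorted runs.

import heapq

def chunked_sort(values, chunks=4):
    n = max(1, len(values) // chunks)
    # Each run is independent of the others, so on parallel hardware the
    # per-run sorts could execute concurrently.
    runs = [sorted(values[i:i + n]) for i in range(0, len(values), n)]
    return list(heapq.merge(*runs))
```

The speedup depends on input size: for small arrays the chunking overhead dominates, which mirrors the GPU transfer-overhead trade-off discussed above.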
A closer look at the Feature Tracking algorithm shows several opportunities for data-level parallelization. The image processing phase operates on the entire image and is thus friendly to parallelization. The feature extraction and feature tracking portions operate at feature-level precision and are comprised of complex matrix operations, including matrix inversions and motion vector estimations, which are computationally intensive. While parallelism is possible, it is more challenging for this algorithm [2].
The Support Vector Machine algorithm is by nature irregular and random. Its sections are not necessarily candidates for data-level parallelism. The iterative nature of this algorithm is comprised mostly of complex computations. This algorithm can instead be optimized with thread-level parallelism and instruction-level parallelism [2].
2.2.2 The Diminishing Return of the Traditional CPU
Microprocessors that feature a single processing core have evolved over several
decades. These single core processors are the driving force behind the development of
Figure 4.5 shows the performance improvement of BlurImage in terms of execution time measured in microseconds. The blue line shows how much of the overall GPU time is spent on execution; the rest is spent on memory transfers and overhead. For the high-definition image (1080x1920), only 0.35% of the total GPU time is actually spent on execution, while the other 99.65% is used on kernel setup and memory transfers.
The Sobel algorithm performance for the delta X direction can be seen in figure 4.6.
The results show the performance improvement of calcSobel dX for various input sizes.
The minimum, average and maximum values have similar trends on each data set for the CPU versus the GPU. For example, the largest (rightmost) data set is very flat for both the CPU values and the GPU values. In all cases for this algorithm, there is a performance increase from using the GPU implementation.
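The delta X computation at the heart of this kernel is a 3x3 convolution in which every output pixel is independent, which is what maps naturally to one GPU thread per pixel. A minimal scalar sketch (illustrative, not the SD-VBS code) of the response at a single interior pixel:

```python
# Sobel dX response at one pixel. On a GPU, one thread would evaluate this
# for each pixel independently; `img` is a list of rows of grayscale values.

SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

def sobel_dx(img, y, x):
    """Horizontal-gradient response at interior pixel (y, x)."""
    return sum(SOBEL_X[j][i] * img[y - 1 + j][x - 1 + i]
               for j in range(3) for i in range(3))
```

Because the kernel reads a fixed 3x3 neighborhood and writes a single value, the arithmetic per pixel is tiny, consistent with the observation that memory transfers, not computation, dominate the GPU time.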
Similarly, figure 4.7 shows the performance improvements of the Sobel algorithm in the delta Y direction. The results indicate a performance improvement for every given input.
Figure 4.5: Tracking kernel BlurImage comparison of CPU and GPU: minimum, maximum, average.
Figure 4.6: Tracking kernel Sobel dX comparison of CPU and GPU: minimum, maximum, average.
When the algorithm is optimized for the GPU, less than one quarter of one percent of
the total GPU time is used for computation. More than 99.7% of the GPU time is taken
up by memory transfers and kernel setup.
Figure 4.7: Tracking kernel Sobel dY comparison of CPU and GPU: minimum, maximum, average.
Figure 4.8 shows an overview plot of all the kernels involved in Tracking. The plot shows the standard deviation as a percentage of the average. In all cases, the GPU has a lower standard deviation than the same function run on the CPU. Each GPU run has a more consistent execution time than the corresponding CPU-only run. Some of the deviation on the CPU is attributed to the operating system scheduling various other tasks on the CPU while the benchmark is running.
4.3 Computation Speculation
An example of speculative computation comes from SVM. The SVM benchmark
was implemented on the GPU in such a way that a large amount of pre-calculation is
performed at the beginning of the algorithm. Not all of the pre-calculated data may be
used during the run. Figure 4.9 shows the overall performance improvement of SVM by
using the GPU to pre-calculate the entire table. For the smaller inputs we see a 5x to
10x performance increase. For the larger inputs we see a consistent 28x increase. The
Figure 4.8: Tracking kernels (BlurImage, Sobel dX, Sobel dY) standard deviation as a percentage of average.
bars (plotted on the right-axis) show how much of the GPU time is used in the actual
computation (the remainder is GPU kernel setup and memory transfer time). For the
largest input (cif), less than 5% of the GPU execution time is used on computation.
Figure 4.9: GPU performance for SVM application.
Figure 4.10 shows the speedup of table access versus table generation. Each bar indicates the speedup factor of the GPU compared to the CPU. The number on top of each bar shows the ratio of standard deviation relative to the average. The data shows that the GPU time is more consistent than the CPU time and strays little from run to run.
Figure 4.10: GPU performance for each table access versus generation.
The polynomial algorithm, a portion of the SVM application, exhibits extreme performance improvement in the GPU implementation. Figure 4.11 shows the total speedup of the polynomial function and the total speedup of the SVM application. The polynomial function runs as much as sixty times faster than on the CPU for the sqcif input. The overall speedup is around 2.5 to 3 times over the CPU-only implementation for all input sizes.
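The speculative pre-calculation idea can be sketched as bulk table generation followed by cheap lookups. The polynomial kernel and the input ranges below are illustrative stand-ins for the SVM's actual table, not the benchmark's code; the point is that all entries are computed up front even though some may never be used.

```python
# Speculative computation sketch: precompute an entire lookup table in one
# pass (the bulk, parallel-friendly work the GPU does well), so that later
# accesses are constant-time reads.

def poly_kernel(x, y, degree=3):
    # Illustrative polynomial kernel, a stand-in for the SVM's function.
    return (x * y + 1) ** degree

def precompute_table(xs, ys):
    """Speculatively evaluate the kernel for every (x, y) pair."""
    return {(x, y): poly_kernel(x, y) for x in xs for y in ys}
```

On a CPU this wastes work whenever entries go unused, which is why serial code computes entries lazily; on a GPU the bulk generation is cheap enough that the wasted entries cost little.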
Figure 4.11: GPU performance for SVM application.
5. Conclusion
The optimization approaches described and the experimental results have proven to be successful techniques for real-world applications. The results show that, given realistic problem sizes, a performance increase is seen with GPU-optimized algorithms.
The example of sorting optimization shows that in some cases the GPU can actually slow down the overall performance of the algorithm. This particular example also shows that the GPU can perform tasks such as sorting significantly faster than a CPU given a large enough input data set. Future work would focus on optimizing additional commonly used software algorithms.
Many cases exist where optimization can be done by reformulating the computation methods contained in the algorithm. Oftentimes the GPU will actually perform more computations than the CPU version. Reformulation can allow the GPU to perform the same calculations in parallel form. Additional work for this case would focus on creating templates and examples of ways to reformulate commonly seen algorithms.
Speculative computation involves careful observation of the underlying algorithm and identifying the generation of large lookup tables and the like. CPU code is often written to optimize for serial execution and to minimize the total number of computations performed. Future work would involve finding a method to easily identify when such situations occur in software code.
This thesis shows that different types of computationally intensive problems can
be optimized further by using the processing power of the GPU. By identifying and
properly applying the techniques presented, a software developer could increase the
performance of a given process. Utilizing the GPU, which exists in many desktop systems today, is a great way to improve performance, boost efficiency and decrease hardware costs.
BIBLIOGRAPHY
[1] M. Pharr and R. Fernando, GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley Professional, 2005.
[2] SD-VBS: The San Diego Vision Benchmark Suite, October 2009.
[3] M. Flynn, “Some computer organizations and their effectiveness,” IEEE Trans. Comput., vol. C-21, pp. 948+, 1972.
[4] W. D. Hillis and G. L. Steele, Jr., “Data parallel algorithms,” Commun. ACM, vol. 29, pp. 1170–1183, December 1986.
[5] Intel, “The evolution of a revolution,” http://download.intel.com/pressroom/kits/intelprocessorhistory.pdf, accessed 2008.
[11] H. Sutter and J. Larus, “Software and the concurrency revolution,” Queue, vol. 3, pp. 54–62, September 2005.
[12] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, “Brook for GPUs: stream computing on graphics hardware,” in SIGGRAPH ’04: ACM SIGGRAPH 2004 Papers, (New York, NY, USA), pp. 777–786, ACM Press, 2004.
[13] M. McCool, S. Du Toit, T. Popa, B. Chan, and K. Moule, “Shader algebra,” ACM Trans. Graph., vol. 23, pp. 787–795, August 2004.
[14] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, “GPU computing,” Proceedings of the IEEE, vol. 96, pp. 879–899, May 2008.
[15] NVIDIA, NVIDIA CUDA Programming Guide 2.0. 2008.
[16] Standard Performance Evaluation Corporation, “The SPEC CPU 2006 benchmark suite,” 2006.
[17] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: building customized program analysis tools with dynamic instrumentation,” SIGPLAN Not., vol. 40, pp. 190–200, June 2005.
[18] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the International Conference on Computer Vision - Volume 2, ICCV ’99, (Washington, DC, USA), pp. 1150–, IEEE Computer Society, 1999.
[19] S. Warn, W. Emeneker, J. Cothren, and A. W. Apon, “Accelerating SIFT on parallel architectures,” in CLUSTER, pp. 1–4, IEEE, 2009.
[20] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995.