Evaluation and Exploration of Next Generation Systems for Applicability and
Performance (Volodymyr Kindratenko, Guochun Shi)
1 Summary
We have ported the image characterization algorithm implemented in the doc2learn application to the
Graphics Processing Unit (GPU) platform, using both CUDA C targeting NVIDIA GPUs and
OpenCL targeting NVIDIA and AMD GPU architectures. We also implemented the doc2learn
image analysis algorithm in C targeting the microprocessor architecture. Our conclusion is that
the doc2learn image processing part can be accelerated up to 4 times using an NVIDIA GTX 480 GPU,
but 1) the speedup depends on the image size and 2) other parts of the doc2learn application dominate
the execution time.
2 Preliminaries
Doc2learn implements algorithms for computing probability density functions for text, image,
and vector graphics objects embedded in PDF files. Specifically, it implements non-parametric
probability density function estimation techniques that require computing a histogram of the
frequencies of occurrence of all values in a particular object, e.g., an image or text. The computed
probability density functions (histograms) form a feature vector that can be used for
measuring the similarity between pairs of images.
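As an illustration of how the computed histograms are used: the bin counts are normalized into a probability density, and two densities can then be compared with a similarity measure. The report does not specify which measure doc2learn uses; histogram intersection below is one common choice, shown purely as a sketch, and both function names are illustrative.

```c
#include <assert.h>
#include <math.h>

/* Illustration only: turning raw bin counts into a probability density
 * (a normalized histogram) and comparing two densities. The similarity
 * measure doc2learn actually uses is not specified in this report;
 * histogram intersection is a common example. */
void normalize(const int *counts, int nbins, double *pdf)
{
    long total = 0;
    for (int i = 0; i < nbins; i++) total += counts[i];
    for (int i = 0; i < nbins; i++) pdf[i] = (double)counts[i] / (double)total;
}

/* Histogram intersection: 1.0 for identical densities, 0.0 for disjoint. */
double intersection(const double *p, const double *q, int nbins)
{
    double s = 0.0;
    for (int i = 0; i < nbins; i++) s += (p[i] < q[i]) ? p[i] : q[i];
    return s;
}
```

Since identical color distributions give an intersection of 1.0, one minus the intersection can serve as a distance between two images.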
Figure 1. Doc2learn execution profile using has.pdf as an example. The pie chart on the left shows the time
distribution of the entire application. The pie chart on the right shows the time distribution of the data processing
part, which accounts for 25% of the overall application execution time.
Application profiling of doc2learn SVN revision 760 on the has.pdf file (one of the test cases) reveals
that about 25% of the overall time is spent on the probability density function computation
(Figure 1, left), most of which is spent on image analysis (Figure 1, right). This time
distribution is typical for documents containing many images.
Doc2learn is written in Java. Java code is executed by a virtual machine that runs on top of
the actual hardware (microprocessor). The Java virtual machine isolates the application from many of the
performance-enhancing features of the underlying hardware, such as vector units, potentially preventing
full utilization of modern microprocessors unless they are programmed for explicitly.
The main goal of this study is to investigate how the execution of the probability
density function computation for images can be improved on modern microprocessor
architectures and GPUs. We specifically target the image analysis part of the doc2learn
application because it is perceived to be the most data-intensive and time-consuming part of the
computation for documents that contain either many images or large images.
2.1 Doc2learn image characterization algorithm
Doc2learn implements a non-parametric probability density function estimation technique based
on computing a histogram of the frequencies of occurrence of pixels of different
colors in an image. The most significant bits of the red, green, and blue components (and optionally
the alpha channel, when present) of each image pixel are used as indexes into a 3D array that holds the
histogram. This algorithm is an improvement over the histogram computation algorithm used in
earlier versions of doc2learn: instead of building a histogram of all color values
present in the image, a highly reduced histogram is computed, resulting in a very significant
(over an order of magnitude) reduction of the computation time.
Figure 2 contains the Java source code for probability density function estimation taken from the
original doc2learn implementation.
// compute histogram for image
// (size is the histogram bin width, i.e., the number of color values per bin)
int pix, red, green, blue;
for (int r = 0; r < bi.getHeight(); r++ ) {
for (int c = 0; c < bi.getWidth(); c++ ) {
pix = bi.getRGB(c, r);
red = ((pix >> 16) & 0xff) / size;
green = ((pix >> 8) & 0xff) / size;
blue = (pix & 0xff) / size;
histogram[red][green][blue]++;
}
}
Figure 2. Original Java implementation of the image probability density function computation.
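The loop in Figure 2 maps directly to C, in the spirit of the C port described in the Summary. The sketch below is illustrative: the function signature and the flattened-array layout are assumptions, and the actual doc2learn C implementation may differ. As in the Java code, `size` is the bin width per color channel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Minimal C sketch of the histogram loop from Figure 2.
 * `pixels` holds packed 0xAARRGGBB values in row-major order;
 * `size` is the bin width (e.g. 256 / bins per color channel);
 * `histogram` must have room for bins^3 ints. Names are illustrative. */
void compute_histogram(const uint32_t *pixels, int width, int height,
                       int size, int bins, int *histogram)
{
    memset(histogram, 0, (size_t)bins * bins * bins * sizeof(int));
    for (int r = 0; r < height; r++) {
        for (int c = 0; c < width; c++) {
            uint32_t pix = pixels[r * width + c];
            int red   = (int)((pix >> 16) & 0xff) / size;
            int green = (int)((pix >> 8)  & 0xff) / size;
            int blue  = (int)( pix        & 0xff) / size;
            /* Flattened [red][green][blue] index into the bins^3 array. */
            histogram[(red * bins + green) * bins + blue]++;
        }
    }
}
```

Flattening the 3D `histogram[red][green][blue]` array into one contiguous buffer also simplifies passing it across the JNI boundary and to the GPU.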
2.2 Hardware platforms used in this study
Two computer systems were used in this study: 1) an Intel-based system with an NVIDIA Fermi GTX 480 GPU
and 2) an AMD-based system with an ATI Radeon 5870 GPU. Table 1 lists the technical characteristics of
both systems. Unless explicitly stated otherwise, the results reported here, such as the application profile
shown above, were obtained on the Intel-based system, which was found to perform better.
              Intel-based system               AMD-based system
Processor     3.3 GHz single 4-core Core i7    2.8 GHz dual 6-core Istanbul
Figure 12. Doc2learn execution profile when using the NVIDIA CUDA GPU image analysis code.
3.4 OpenCL implementation
We implemented similar variations of the GPU kernels in OpenCL, targeting both NVIDIA and
AMD GPUs. The OpenCL kernels follow our CUDA kernel implementations very closely
wherever the necessary hardware features are supported by OpenCL, which is not
always the case.
For the 3133x3233-pixel test image, the best OpenCL result on the NVIDIA GPU is obtained with
kernel 6: 1.28 ms for the actual kernel execution and 5.48 ms for the data transfer. This is only
slightly worse than the native CUDA implementation.
For the 3133x3233-pixel test image, the best OpenCL result on the AMD GPU is obtained with
kernel 1: 2.79 ms for the kernel execution and 38.12 ms for the data transfer. Unfortunately, data
transfer on the AMD GPU platform is prohibitively slow. The PCIe interface supports data
transfer rates similar to those observed on the NVIDIA hardware, and there is no technical reason
for such poor performance; we believe the AMD GPU drivers are not tuned for best
performance.
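The reported transfer times imply markedly different effective PCIe bandwidths. A quick back-of-the-envelope check, assuming 4 bytes per pixel (an assumption, since the pixel format is not stated here; the function name is illustrative):

```c
/* Effective host-to-GPU transfer bandwidth implied by the reported
 * times for the 3133x3233-pixel test image (~10.1 Mpixels, ~40.5 MB,
 * assuming 4 bytes per pixel -- an assumption, not stated above). */
double effective_bandwidth_gbs(double pixels, double bytes_per_pixel,
                               double transfer_ms)
{
    return pixels * bytes_per_pixel / (transfer_ms * 1e-3) / 1e9;
}
/* NVIDIA: effective_bandwidth_gbs(3133*3233, 4, 5.48)  -> ~7.4 GB/s
 * AMD:    effective_bandwidth_gbs(3133*3233, 4, 38.12) -> ~1.1 GB/s
 * A roughly 7x gap over comparable PCIe links, consistent with a
 * driver limitation rather than a hardware one. */
```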
3.5 Performance study to understand the impact of image size
We investigate the performance of different implementations of the image analysis algorithm as a
function of image size. We consider image sizes ranging from 128x128 to 8192x8192
pixels and seven implementation/platform configurations:
- Java implementation with the optimized Java code running on a single core of the 2.8 GHz AMD Istanbul processor
- Java implementation with the optimized Java code running on a single core of the 3.3 GHz Intel Core i7 processor
- C implementation running on a single core of the 2.8 GHz AMD Istanbul processor
- C implementation running on a single core of the 3.3 GHz Intel Core i7 processor
- CUDA C implementation running on the NVIDIA GTX 480 GPU installed in the 3.3 GHz Intel Core i7 platform
- OpenCL implementation running on the NVIDIA GTX 480 GPU installed in the 3.3 GHz Intel Core i7 platform
- OpenCL implementation running on the ATI Radeon HD 5870 GPU installed in the 2.8 GHz AMD Istanbul platform
Table 3 contains the raw measurements obtained in this study. These measurements are used to
generate the plots shown in Figures 13-15.
Figure 13 plots the execution time of the seven implementation/platform configurations
as a function of image size. The measurements in this plot include all relevant
overheads, such as the JNI overhead and the PCIe data transfer overhead. In other words, the measured
execution time is what the user sees in the Java application that invokes one or another type of
kernel (Java, C, or GPU). From this plot, we can make several important observations:
- Both the Java and C implementations perform substantially worse on the 2.8 GHz AMD Istanbul platform than on the 3.3 GHz Intel Core i7 platform, most likely because the Java and GCC compiler ports do not fully exploit the AMD processor's architectural features.
- On the 3.3 GHz Intel Core i7 platform, the code in which the Java-based computation is replaced with a call to a C-based computation executes faster, with a near-constant speedup across all image sizes tested in this study.
- For a sufficiently large image, all GPU-based implementations outperform the CPU-based implementations.
- The CUDA implementation executed on the NVIDIA GTX 480 GPU outperforms the OpenCL implementations running on both platforms.
Table 3. Performance measurements for different implementations and architectures. Columns: image size; AMD CPU host run time (ms) for the Java-based code and the C-based code (C loop, JNI overhead, total); Intel CPU host run time (ms) for the Java-based code and the C-based code (C loop, JNI overhead, total).
Figure 13. Logarithmic plot of the execution time as a function of image size.
Figure 14 provides additional insight into the performance of the C and GPU implementations. In
this figure, we plot the following ratios, which indicate relative speedups:
- Java time to C time, including JNI overhead, on the 3.3 GHz Intel Core i7 platform. This is what the user sees in the Java application that invokes a C function to process the image.
- Java time to C time, NOT including JNI overhead, on the 3.3 GHz Intel Core i7 platform. This is what the user would see when comparing the Java-based application with an entirely C-based application; it shows the full advantage of C over Java.
- C-based kernel time on the 3.3 GHz Intel Core i7 to the time spent transferring data between the host and the NVIDIA GTX 480 GPU plus executing the kernel on the GPU. This is the GPU vs. CPU speedup, including the PCIe data transfer overhead.
- Java-based implementation time on the 3.3 GHz Intel Core i7 to the time spent by the Java code that calls a C-based implementation that in turn calls the GPU-based implementation. All overheads, JNI and PCIe data transfer, are included. This is what the user sees in the Java application that invokes a GPU-based function to process the image.
- Java-based implementation time on the 3.3 GHz Intel Core i7 to the time spent by the C-based implementation that calls the GPU-based implementation, excluding JNI overhead. This is the GPU vs. Java speedup, including the PCIe data transfer overhead.
From this plot, we can make the following observations:
- Replacing the optimized Java code with a call to a C-based function results in an average 3x speedup. This speedup takes into account the overhead of calling a C function from the Java application.
- We observe an average 6x speedup when comparing the Java computation time with the corresponding C computation time alone, i.e., assuming a fully C-based application. The speedup is even greater, up to 8.8x, for smaller images. This indicates that if the image analysis algorithm used in doc2learn were implemented as a stand-alone C program, it would execute about 6 times faster than the current all-Java doc2learn implementation.
- For small images, the GPU implementation is actually slower than the C-based CPU implementation. For image sizes close to 512x512, the CPU and GPU implementations break even. The maximum GPU speedup for sufficiently large images is only 2.6x compared to the C-based CPU implementation.
- Replacing the optimized Java code with a call to a C function that calls a GPU function results in speedups ranging from 1.2x for small images to 4x for large images. This is what the user sees from the standpoint of the Java application.
- Finally, comparing the fully Java-based implementation with a C-based implementation that calls a GPU kernel, we observe speedups from almost 3x for small images to almost 16x for large images. This indicates that if the image analysis algorithm used in doc2learn were implemented as a stand-alone C program using the NVIDIA GTX 480 GPU, it would execute up to 16 times faster than the current all-Java doc2learn implementation.
Figure 14. Speedup as a function of image size.
Figure 15 provides additional insight into the image analysis execution profile for
two image sizes and three implementation configurations. For a sufficiently large image, the GPU-based
implementation suffers from a substantial JNI (Java-to-C interface) overhead.
2048x2048 image, execution time (ms):
                  Java    Java+C   Java+C+GPU
GPU compute       0.00      0.00         0.53
PCIe overhead     0.00      0.00         2.32
CPU compute       0.00     11.08         0.00
JNI overhead      0.00      7.49         7.71
Java compute     42.23      0.00         0.00

256x256 image, execution time (ms):
                  Java    Java+C   Java+C+GPU
GPU compute       0.00      0.00         0.06
PCIe overhead     0.00      0.00         0.08
CPU compute       0.00      0.10         0.00
JNI overhead      0.00      0.07         0.17
Java compute      0.68      0.00         0.00

Figure 15. Execution profile comparison for two image sizes.
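The per-component times in Figure 15 can be checked against the speedups quoted earlier: for the 2048x2048 image, the end-to-end Java-to-GPU ratio works out to about 4x, matching the "4x for large images" figure above. A small sketch of the arithmetic (function names are illustrative):

```c
#include <assert.h>
#include <math.h>

/* Speedup arithmetic for the 2048x2048 image, using the per-component
 * times (ms) from Figure 15. */

/* Java baseline vs. Java calling C through JNI. */
double java_vs_c_speedup(void)
{
    double java_total = 42.23;          /* Java compute               */
    double c_total    = 11.08 + 7.49;   /* CPU compute + JNI overhead */
    return java_total / c_total;        /* ~2.3x                      */
}

/* Java baseline vs. Java calling the GPU kernel through JNI. */
double java_vs_gpu_speedup(void)
{
    double java_total = 42.23;                /* Java compute              */
    double gpu_total  = 0.53 + 2.32 + 7.71;   /* GPU + PCIe + JNI overhead */
    return java_total / gpu_total;            /* ~4.0x                     */
}
```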
3.6 Performance study to understand the impact of the number of images
We investigate the performance of different implementations of the image analysis algorithm as a
function of the number of images embedded in the PDF file. We consider three implementations
on the Intel platform:
- Java implementation with the optimized Java code running on a single core of the 3.3 GHz Intel Core i7 processor
- C implementation running on a single core of the 3.3 GHz Intel Core i7 processor
- CUDA C implementation running on the NVIDIA GTX 480 GPU installed in the 3.3 GHz Intel Core i7 platform
The dataset used in this study consists of 100 PDF files that contain only images. Each PDF file
contains a number of randomly generated images of a fixed size. Image sizes include 50x50,
100x100, 150x150, and 200x200 pixels, and the image count per PDF file ranges from 10 to 250. This
synthetic dataset was generated by Peter Bajcsy's team and is deemed to be statistically
representative of the set of PDF files the team has been working with.
Table 4 lists all the image sizes and the number of images per PDF file. It also lists the overall
image processing time for each PDF file using each of the three algorithm
implementations. Figure 16 provides a graphical representation of the data from Table 4. The
measured execution times confirm our prior findings:
- For small images, the C-based implementation outperforms both the Java-based and GPU-based implementations