Implementation and Optimization of SIFT on an OpenCL GPU. 6.869-6.338 Final Project, 5/5/2010. Guy-Richard Kayombya
Overview
• Motivation
• Quick Intro to OpenCL
• Implementation
• Results
Motivation
• Learn OpenCL
• Adapt the SIFT algorithm to yet another parallel architecture
• Maybe achieve some speedup
Quick Intro to OpenCL
• New standard from Khronos for heterogeneous parallel computing (v1.0 released Dec 2008)
• Initiated by Apple
• Open and royalty-free
• Cross-vendor and cross-platform
• Makes use of all available processing entities: CPUs, GPUs and other processors
• Scales from embedded to HPC solutions
Quick Intro to OpenCL(2)
One Host, Multiple Devices
Each Device has multiple Compute Units
Each Compute Unit has multiple Processing Elements
E.g., the GT200 has 30 Compute Units (streaming multiprocessors), each with 8 Processing Elements (scalar processors): 30 x 8 = 240 Processing Elements
Quick Intro to OpenCL(3)
• NDRange = size of the problem to solve (1D, 2D or 3D)
• Work-Group = block of work-items
• Work-Item ~ lightweight thread
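The relation between the three levels can be illustrated with a small serial sketch. It assumes a 1D NDRange with no global offset and a global size that is a multiple of the work-group size (as OpenCL 1.0 requires); the function name `work_item_ids` is made up for this illustration.

```python
def work_item_ids(global_size, local_size):
    """Enumerate (group_id, local_id, global_id) for a 1D NDRange.

    Mirrors what get_group_id(0), get_local_id(0) and get_global_id(0)
    would return for each work-item, assuming a zero global offset.
    """
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of the work-group size")
    ids = []
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            # Each work-item's global index follows from its group and local index.
            global_id = group_id * local_size + local_id
            ids.append((group_id, local_id, global_id))
    return ids


# 8 work-items split into 2 work-groups of 4 work-items each.
ids = work_item_ids(global_size=8, local_size=4)
```

On a real device the work-items run concurrently and in no guaranteed order; the loops here only show how the indices relate.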
Quick Intro to OpenCL(4)
• Global: per device
• Local: per Work-Group
• Private: per Work-Item
Quick Intro to OpenCL(5)
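/* Each work-item adds the scalar b to one element of a. */
/* b is passed by value, so it cannot carry the __global qualifier. */
__kernel void vec_inc(__global float *a, const float b)
{
    int gid = get_global_id(0);   /* index of this work-item in the NDRange */
    a[gid] = a[gid] + b;
}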
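To make the execution model of this kernel concrete, here is a serial Python emulation. It is only a sketch: `run_1d` is a hypothetical stand-in for enqueueing a 1D NDRange, and the loop replaces what the device would execute in parallel.

```python
def vec_inc(gid, a, b):
    """Body of the kernel for one work-item: increment a single element."""
    a[gid] = a[gid] + b


def run_1d(kernel, global_size, *args):
    """Hypothetical serial stand-in for launching a 1D NDRange.

    Calls the kernel once per global id; a real device runs these
    invocations concurrently and in no guaranteed order.
    """
    for gid in range(global_size):
        kernel(gid, *args)


data = [1.0, 2.0, 3.0, 4.0]
run_1d(vec_inc, len(data), data, 0.5)   # one work-item per array element
```

Because every work-item touches only its own element, the result does not depend on the order in which the work-items run.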
Implementation
• Abstraction Layer (85 %)
• Gaussian/DoG Pyramids (100 %, semi-optimized)
• Keypoint Detection (95 %, naive)
• Keypoint Refinement (90 %, naive)
• Orientation Assignment (10 %)
• Descriptor Generation (0 %)
Abstraction Layer
• Problem: host-side device code is cumbersome
• Requires dozens of repetitive lines to set up device contexts, kernels, buffers, etc.
• Solution: OpenCL wrapper
• Simplifies creation and management of hybrid host/device buffers and execution of kernels
• Facilitates the transition from serial to parallel execution
• Host/device synchronization
• Memory management issues still need to be fixed
Gaussian Pyramid
• Separable convolution: two 1D filters
• Indirect filtering (blur the previous level rather than the original image) to reduce kernel size: sigma_diff = sqrt(sigma_dst^2 - sigma_src^2)
• Uses convolutionSeparable() provided by Nvidia for efficient 2D separable convolution on the GPU
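The two ideas on this slide can be sketched on the CPU as follows. This is not the Nvidia routine or the project's kernel: the function names, the 3-sigma filter radius and the clamp-to-edge border handling are all assumptions made for the illustration.

```python
import math


def sigma_diff(sigma_dst, sigma_src):
    """Extra blur needed to take an image from sigma_src to sigma_dst."""
    return math.sqrt(sigma_dst ** 2 - sigma_src ** 2)


def gaussian_taps(sigma, radius=None):
    """Normalized 1D Gaussian filter; the 3-sigma radius is an assumption."""
    if radius is None:
        radius = max(1, int(math.ceil(3.0 * sigma)))
    taps = [math.exp(-(i * i) / (2.0 * sigma * sigma))
            for i in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]


def convolve_rows(image, taps):
    """Apply the 1D filter along every row, clamping at the image border."""
    radius = len(taps) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, t in enumerate(taps):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += t * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, sigma):
    """2D Gaussian blur as two 1D passes: rows first, then columns."""
    taps = gaussian_taps(sigma)
    rows_done = convolve_rows(image, taps)
    return transpose(convolve_rows(transpose(rows_done), taps))


# A level already blurred to sigma 3.0 needs only sigma 4.0 more to reach 5.0.
step = sigma_diff(5.0, 3.0)
flat = [[2.0] * 5 for _ in range(4)]
blurred = gaussian_blur(flat, step)
```

Filtering with `sigma_diff` instead of the full target sigma keeps the filter shorter, and the separable form costs two passes of length 2r+1 per pixel instead of one pass of (2r+1)^2.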
Keypoint detection
• Each pixel is processed by one work-item independently
• No state sharing
• Worst case: 26 comparisons per work-item
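A serial sketch of the test one work-item would perform is given below. It assumes the DoG octave is stored as `dog[scale][row][col]`, that the caller only passes interior positions, and that extrema must be strict; none of these details come from the slides.

```python
def is_extremum(dog, s, y, x):
    """Return True if dog[s][y][x] is a strict max or min of its 26 neighbours.

    The neighbourhood is the 3x3x3 cube spanning the adjacent scales,
    rows and columns. (s, y, x) must not lie on the border of the volume.
    """
    value = dog[s][y][x]
    is_max = True
    is_min = True
    for ds in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if ds == 0 and dy == 0 and dx == 0:
                    continue  # skip the centre sample itself
                neighbour = dog[s + ds][y + dy][x + dx]
                if neighbour >= value:
                    is_max = False
                if neighbour <= value:
                    is_min = False
                if not (is_max or is_min):
                    return False  # early exit: most pixels fail quickly
    return True
```

All 26 neighbours are visited only when the pixel really is an extremum (or fails at the very last neighbour), which is the worst case quoted above. Since the function reads shared data and writes nothing, work-items need no synchronization.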
Keypoint Refinement
Each Keypoint is processed independently by one work item
Kernel is a slightly modified version of the keypoint refinement Matlab Mex module by Vedaldi
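For orientation, the textbook form of this step fits a 3D quadratic around the detected sample and solves for the sub-pixel offset, offset = -H^-1 g, with the gradient g and Hessian H estimated by finite differences. The sketch below shows a single such step; it is not the project's kernel or Vedaldi's code, and the names `refine_offset`, `solve3` and `det3` are made up for the illustration.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(h, rhs):
    """Solve h * v = rhs with Cramer's rule; None if h is (near) singular."""
    d = det3(h)
    if abs(d) < 1e-12:
        return None
    solution = []
    for col in range(3):
        m = [row[:] for row in h]
        for r in range(3):
            m[r][col] = rhs[r]
        solution.append(det3(m) / d)
    return solution


def refine_offset(dog, s, y, x):
    """One quadratic-fit step: returns the offset [dx, dy, ds] of the extremum."""
    def at(ds, dy, dx):
        return dog[s + ds][y + dy][x + dx]

    c = at(0, 0, 0)
    # Gradient by central differences.
    gx = 0.5 * (at(0, 0, 1) - at(0, 0, -1))
    gy = 0.5 * (at(0, 1, 0) - at(0, -1, 0))
    gs = 0.5 * (at(1, 0, 0) - at(-1, 0, 0))
    # Hessian by second-order finite differences.
    hxx = at(0, 0, 1) + at(0, 0, -1) - 2.0 * c
    hyy = at(0, 1, 0) + at(0, -1, 0) - 2.0 * c
    hss = at(1, 0, 0) + at(-1, 0, 0) - 2.0 * c
    hxy = 0.25 * (at(0, 1, 1) + at(0, -1, -1) - at(0, 1, -1) - at(0, -1, 1))
    hxs = 0.25 * (at(1, 0, 1) + at(-1, 0, -1) - at(1, 0, -1) - at(-1, 0, 1))
    hys = 0.25 * (at(1, 1, 0) + at(-1, -1, 0) - at(1, -1, 0) - at(-1, 1, 0))
    hessian = [[hxx, hxy, hxs], [hxy, hyy, hys], [hxs, hys, hss]]
    return solve3(hessian, [-gx, -gy, -gs])
```

A full refinement would typically also move to the neighbouring sample when the offset exceeds half a pixel and reject low-contrast or edge-like keypoints; those checks are left out here to keep the sketch short.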
Preliminary Results (Time)
All measurements are performed on an input image of size 512x512. Percentages give the speed relative to the Vedaldi Matlab CPU baseline (100 %).
Gaussian Filtering (sigma 4.1):
• Vedaldi Matlab CPU = 0.19 s (100 %)
• Naïve C++ CPU = 0.33 s (57 %)
• GPU = 0.0094 s (2000 %)
• GPU with data transfer = 0.0133 s (1400 %)
Extrema Detection (octave 0 of pyramid):
• Vedaldi Matlab CPU = 0.179 s (100 %)
• GPU = 0.035725 s (500 %)
Keypoint Refinement (octave 0 of pyramid):
• Vedaldi Matlab CPU = 0.004 s (100 %)
• GPU = 0.0689 s (6 %)
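The percentages appear to be the baseline time divided by the measured time, rounded coarsely. The short sketch below recomputes them from the timings listed on the slide under that assumption; the helper name `relative_speed_percent` is made up for the illustration.

```python
def relative_speed_percent(baseline_s, measured_s):
    """Speed relative to the baseline: 100 % means equally fast."""
    return 100.0 * baseline_s / measured_s


# (stage, implementation, baseline seconds, measured seconds), from the slide.
timings = [
    ("Gaussian filtering", "Naive C++ CPU", 0.19, 0.33),
    ("Gaussian filtering", "GPU", 0.19, 0.0094),
    ("Gaussian filtering", "GPU with data transfer", 0.19, 0.0133),
    ("Extrema detection", "GPU", 0.179, 0.035725),
    ("Keypoint refinement", "GPU", 0.004, 0.0689),
]

results = {
    (stage, impl): relative_speed_percent(base, measured)
    for stage, impl, base, measured in timings
}

for (stage, impl), percent in results.items():
    print(f"{stage:20s} {impl:24s} {percent:8.1f} %")
```

Read this way, the GPU is roughly 20x faster for Gaussian filtering and 5x faster for extrema detection, while the naive refinement kernel runs at only about 6 % of the CPU speed.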
Preliminary Results (Performance)
Refined keypoints for octave 0:
• Blue: Matlab implementation
• Red: OpenCL
• Green: common to both
• 85 % correspondence