Implementation and Optimization of SIFT on a OpenCL GPU 6.869-6.338 Final Project 5/5/2010 Guy-Richard Kayombya

Jan 17, 2018


Implementation and Optimization of SIFT on a OpenCL GPU

6.869-6.338 Final Project 5/5/2010

Guy-Richard Kayombya


Overview

• Motivation
• Quick Intro to OpenCL
• Implementation
• Results


Motivation

• Learn OpenCL
• Adapt the SIFT algorithm to yet another parallel architecture
• Maybe achieve some speedup


Quick Intro to OpenCL

• New standard from Khronos for heterogeneous parallel computing (v1.0 released Dec 2008)
• Initiated by Apple
• Open and royalty-free
• Cross-vendor and cross-platform
• Makes use of all available processing entities: CPUs, GPUs and other processors
• Scales from embedded to HPC solutions


Quick Intro to OpenCL(2)

One Host, Multiple Devices

Each Device has multiple Compute Units

Each Compute Unit has multiple Processing Elements

E.g., the GT200 has 30 compute units (streaming multiprocessors), each with 8 processing elements (scalar processors): 30 x 8 = 240 processing elements


Quick Intro to OpenCL(3)

• NDRange = size of the problem to solve (1D, 2D or 3D)
• Work-group = block of work-items
• Work-item ~ lightweight thread
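To make the three terms concrete, here is a minimal serial C++ sketch (not OpenCL host code) of how a 1D NDRange is split into work-groups. The struct and function names are hypothetical; the relation global_id = group_id * local_size + local_id is the one OpenCL uses when the global offset is zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record of the IDs one work-item would see in a 1D NDRange.
struct WorkItemIds {
    std::size_t global_id;  // what get_global_id(0) would return
    std::size_t group_id;   // what get_group_id(0) would return
    std::size_t local_id;   // what get_local_id(0) would return
};

// Enumerate every work-item of a 1D NDRange, one work-group at a time.
// OpenCL 1.0 requires global_size to be a multiple of local_size.
std::vector<WorkItemIds> enumerate_ndrange(std::size_t global_size,
                                           std::size_t local_size) {
    assert(local_size > 0 && global_size % local_size == 0);
    std::vector<WorkItemIds> items;
    items.reserve(global_size);
    const std::size_t num_groups = global_size / local_size;
    for (std::size_t group = 0; group < num_groups; ++group) {
        for (std::size_t local = 0; local < local_size; ++local) {
            items.push_back({group * local_size + local, group, local});
        }
    }
    return items;
}
```

On a GPU the work-items of a group run concurrently on one compute unit; the nested loops here only enumerate them in a fixed order.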


Quick Intro to OpenCL(4)

• Global: per device
• Local: per work-group
• Private: per work-item


Quick Intro to OpenCL(5)

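__kernel void vec_inc(__global float *a, const float b)
{
    // One work-item per array element; the scalar b is passed by value,
    // so it must not carry the __global qualifier.
    int gid = get_global_id(0);
    a[gid] = a[gid] + b;
}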
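As a rough illustration of what launching vec_inc over a 1D NDRange amounts to, the following serial C++ sketch runs the kernel body once per global ID. It is a CPU stand-in, not OpenCL host code, and the two function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Body of the vec_inc kernel for one work-item; gid stands in for the value
// get_global_id(0) would return on the device.
void vec_inc_work_item(std::size_t gid, std::vector<float>& a, float b) {
    assert(gid < a.size());
    a[gid] = a[gid] + b;
}

// Serial stand-in for enqueueing the kernel over a 1D NDRange of a.size()
// work-items. On a GPU these iterations would run concurrently.
void vec_inc_serial(std::vector<float>& a, float b) {
    for (std::size_t gid = 0; gid < a.size(); ++gid) {
        vec_inc_work_item(gid, a, b);
    }
}
```

Because each work-item touches only its own element, the iterations are independent and can be executed in any order.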


Implementation

• Abstraction Layer (85 %)
• Gaussian/DoG Pyramids (100 %, semi-optimized)
• Keypoint Detection (95 %, naive)
• Keypoint Refinement (90 %, naive)
• Orientation Assignment (10 %)
• Descriptor Generation (0 %)


Abstraction Layer

• Problem: host-side device code is cumbersome
  • Requires dozens of repetitive lines to set up device contexts, kernels, buffers, etc.
• Solution: OpenCL wrapper
  • Simplifies creation and management of hybrid host/device buffers and execution of kernels
  • Facilitates transition from serial to parallel execution
  • Handles host/device synchronization
• Memory management issues still need to be fixed
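The wrapper itself is not shown in the slides. As one possible reading of "hybrid buffers" with synchronization, here is a purely hypothetical C++ sketch in which a buffer tracks which side holds the newest data and copies only when needed. The class name, its methods and the mock device copy are all assumptions; a real wrapper would issue clEnqueueWriteBuffer / clEnqueueReadBuffer where this sketch copies a std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical hybrid buffer: a host copy plus a mock device copy, with
// flags recording which side is out of date.
class HybridBuffer {
public:
    explicit HybridBuffer(std::size_t n) : host_(n, 0.0f), device_(n, 0.0f) {}

    // Host write access: pull newer device data first, then mark the
    // device copy stale.
    std::vector<float>& host_write() {
        sync_to_host();
        device_stale_ = true;
        return host_;
    }

    // Host read access: download only if the device copy is newer.
    const std::vector<float>& host_read() {
        sync_to_host();
        return host_;
    }

    // Access for a kernel: upload only if the host copy is newer, then
    // assume the kernel modifies the data and mark the host copy stale.
    std::vector<float>& device_write() {
        if (device_stale_) {
            device_ = host_;  // stands in for clEnqueueWriteBuffer
            ++uploads_;
            device_stale_ = false;
        }
        host_stale_ = true;
        return device_;
    }

    int uploads() const { return uploads_; }
    int downloads() const { return downloads_; }

private:
    void sync_to_host() {
        if (host_stale_) {
            host_ = device_;  // stands in for clEnqueueReadBuffer
            ++downloads_;
            host_stale_ = false;
        }
    }

    std::vector<float> host_, device_;
    bool host_stale_ = false;
    bool device_stale_ = true;  // nothing has been uploaded yet
    int uploads_ = 0;
    int downloads_ = 0;
};
```

With this kind of bookkeeping, consecutive kernels that work on the same buffer trigger a single upload, and data returns to the host only when host code actually reads it.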


Gaussian Pyramid

• Separable convolution: two 1D filters
• Indirect filtering to reduce kernel size: blur the previous level (sigma_src) by sigma_diff = sqrt(sigma_dst^2 - sigma_src^2) to reach sigma_dst
• Use convolutionSeperable() provided by Nvidia for efficient 2D separable convolution on the GPU
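The two ideas on this slide can be sketched on the CPU as follows. This is a serial C++ illustration, not the Nvidia routine: the function names, the 3-sigma kernel radius and the clamped borders are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Blur still needed to take an image from sigma_src to sigma_dst.
double incremental_sigma(double sigma_dst, double sigma_src) {
    assert(sigma_dst > sigma_src);
    return std::sqrt(sigma_dst * sigma_dst - sigma_src * sigma_src);
}

// Normalized 1D Gaussian taps; radius ceil(3 * sigma) is an assumed cutoff.
std::vector<float> gaussian_taps(double sigma) {
    const int radius = static_cast<int>(std::ceil(3.0 * sigma));
    std::vector<float> taps(2 * radius + 1);
    double sum = 0.0;
    for (int i = -radius; i <= radius; ++i) {
        const double w = std::exp(-0.5 * i * i / (sigma * sigma));
        taps[i + radius] = static_cast<float>(w);
        sum += w;
    }
    for (float& t : taps) t = static_cast<float>(t / sum);
    return taps;
}

// One 1D pass over a row-major image: horizontal if along_x, else vertical.
// Samples outside the image are clamped to the nearest edge pixel.
std::vector<float> convolve_1d(const std::vector<float>& src, int width,
                               int height, const std::vector<float>& taps,
                               bool along_x) {
    const int radius = static_cast<int>(taps.size() / 2);
    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int sx = along_x ? std::min(std::max(x + k, 0), width - 1) : x;
                const int sy = along_x ? y : std::min(std::max(y + k, 0), height - 1);
                acc += taps[k + radius] * src[sy * width + sx];
            }
            dst[y * width + x] = acc;
        }
    }
    return dst;
}

// 2D Gaussian blur as two 1D passes (rows, then columns).
std::vector<float> gaussian_blur(const std::vector<float>& img, int width,
                                 int height, double sigma) {
    const std::vector<float> taps = gaussian_taps(sigma);
    return convolve_1d(convolve_1d(img, width, height, taps, true),
                       width, height, taps, false);
}
```

Blurring by sigma_diff instead of sigma_dst keeps the tap count small, and the separable form costs on the order of 2(2r+1) multiply-adds per pixel instead of (2r+1)^2 for a full 2D kernel of radius r.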


Keypoint detection

• Each pixel is processed independently by one work-item
• No state sharing
• Worst case: 26 comparisons per work-item
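A serial C++ sketch of the per-pixel test is given below. It assumes the usual SIFT definition of a scale-space extremum (the pixel against its 8 neighbours in the same DoG layer and 9 in each adjacent layer); the struct and function names are hypothetical, and contrast thresholds and border handling are left out.

```cpp
#include <cassert>
#include <vector>

// Three adjacent DoG layers of one octave, each row-major with equal size.
struct DogTriple {
    const std::vector<float>* below;
    const std::vector<float>* current;
    const std::vector<float>* above;
    int width;
    int height;
};

// Work done by one work-item: is pixel (x, y) of the middle layer strictly
// greater or strictly smaller than all 26 scale-space neighbours?
// The function only reads the layers, so pixels share no state.
bool is_scale_space_extremum(const DogTriple& dog, int x, int y) {
    assert(x >= 1 && y >= 1 && x < dog.width - 1 && y < dog.height - 1);
    const float v = (*dog.current)[y * dog.width + x];
    const std::vector<float>* layers[3] = {dog.below, dog.current, dog.above};
    bool is_max = true;
    bool is_min = true;
    for (int s = 0; s < 3; ++s) {
        for (int dy = -1; dy <= 1; ++dy) {
            for (int dx = -1; dx <= 1; ++dx) {
                if (s == 1 && dx == 0 && dy == 0) continue;  // skip the centre
                const float n = (*layers[s])[(y + dy) * dog.width + (x + dx)];
                if (v <= n) is_max = false;
                if (v >= n) is_min = false;
                if (!is_max && !is_min) return false;  // early exit
            }
        }
    }
    return true;  // still a maximum or a minimum after all 26 neighbours
}
```

Most pixels are rejected after a few neighbours; only true extrema visit all 26, which is the worst case quoted above.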


Keypoint Refinement

Each Keypoint is processed independently by one work item

Kernel is a slightly modified version of the keypoint refinement Matlab Mex module by Vedaldi
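Vedaldi's module is not reproduced here. The sketch below shows only the core step commonly used in SIFT keypoint refinement (Lowe's quadratic fit): estimate the gradient g and Hessian H of the DoG function at the keypoint by finite differences and solve H * offset = -g for a sub-pixel offset in (x, y, scale). The names and the sampling callback are hypothetical, and the real kernel may additionally iterate or reject keypoints by threshold.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <functional>
#include <utility>

using Vec3 = std::array<double, 3>;

// Solve the 3x3 system H * x = rhs by Gaussian elimination with partial
// pivoting. Returns false if H is numerically singular.
bool solve3x3(std::array<Vec3, 3> H, Vec3 rhs, Vec3& x) {
    for (int col = 0; col < 3; ++col) {
        int pivot = col;
        for (int r = col + 1; r < 3; ++r)
            if (std::fabs(H[r][col]) > std::fabs(H[pivot][col])) pivot = r;
        if (std::fabs(H[pivot][col]) < 1e-12) return false;
        std::swap(H[col], H[pivot]);
        std::swap(rhs[col], rhs[pivot]);
        for (int r = col + 1; r < 3; ++r) {
            const double f = H[r][col] / H[col][col];
            for (int c = col; c < 3; ++c) H[r][c] -= f * H[col][c];
            rhs[r] -= f * rhs[col];
        }
    }
    for (int r = 2; r >= 0; --r) {  // back substitution
        double acc = rhs[r];
        for (int c = r + 1; c < 3; ++c) acc -= H[r][c] * x[c];
        x[r] = acc / H[r][r];
    }
    return true;
}

// D(dx, dy, ds) samples the DoG at integer offsets from the keypoint.
// Computes one step offset = -H^{-1} g using central differences.
bool refine_offset(const std::function<double(int, int, int)>& D, Vec3& offset) {
    const double c = D(0, 0, 0);
    const Vec3 g = {0.5 * (D(1, 0, 0) - D(-1, 0, 0)),
                    0.5 * (D(0, 1, 0) - D(0, -1, 0)),
                    0.5 * (D(0, 0, 1) - D(0, 0, -1))};
    const double dxx = D(1, 0, 0) + D(-1, 0, 0) - 2.0 * c;
    const double dyy = D(0, 1, 0) + D(0, -1, 0) - 2.0 * c;
    const double dss = D(0, 0, 1) + D(0, 0, -1) - 2.0 * c;
    const double dxy = 0.25 * (D(1, 1, 0) - D(-1, 1, 0) - D(1, -1, 0) + D(-1, -1, 0));
    const double dxs = 0.25 * (D(1, 0, 1) - D(-1, 0, 1) - D(1, 0, -1) + D(-1, 0, -1));
    const double dys = 0.25 * (D(0, 1, 1) - D(0, -1, 1) - D(0, 1, -1) + D(0, -1, -1));
    std::array<Vec3, 3> H;
    H[0] = {dxx, dxy, dxs};
    H[1] = {dxy, dyy, dys};
    H[2] = {dxs, dys, dss};
    return solve3x3(H, {-g[0], -g[1], -g[2]}, offset);
}
```

Each keypoint needs only its own 3x3x3 neighbourhood, which is why one work-item per keypoint is a natural mapping.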


Preliminary Results (Time)

All measurements are performed on an input image of size 512x512. Percentages give speed relative to the Vedaldi Matlab CPU baseline (baseline time / measured time).

Gaussian Filtering (sigma 4.1):
• Vedaldi Matlab CPU = 0.19 s (100 %)
• Naïve C++ CPU = 0.33 s (57 %)
• GPU = 0.0094 s (2000 %)
• GPU with data transfer = 0.0133 s (1400 %)

Extrema Detection (octave 0 of pyramid):
• Vedaldi Matlab CPU = 0.179 s (100 %)
• GPU = 0.035725 s (500 %)

Keypoint Refinement (octave 0 of pyramid):
• Vedaldi Matlab CPU = 0.004 s (100 %)
• GPU = 0.0689 s (6 %)
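The percentages above appear to be the baseline time divided by the measured time; that reading reproduces the quoted figures up to rounding. A small C++ sketch of the arithmetic, with a hypothetical function name:

```cpp
#include <cassert>
#include <cmath>

// Speed of a run relative to a baseline, in percent:
// 100 * t_baseline / t_run. Values above 100 mean faster than the baseline.
double relative_speed_percent(double t_baseline_s, double t_run_s) {
    assert(t_baseline_s > 0.0 && t_run_s > 0.0);
    return 100.0 * t_baseline_s / t_run_s;
}
```

For example, relative_speed_percent(0.19, 0.0094) gives roughly 2021, which the slide rounds to 2000 %, and relative_speed_percent(0.004, 0.0689) gives roughly 5.8, quoted as 6 %: the naive refinement kernel is slower than the CPU baseline.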


Preliminary Results (Performance)

Refined Keypoints for octave 0

• Blue: Matlab implementation
• Red: OpenCL
• Green: Common

85% Correspondence