Page 1: Implementation and Optimization of SIFT on a OpenCL GPU 6.869-6.338 Final Project 5/5/2010 Guy-Richard Kayombya.

Implementation and Optimization of SIFT on a OpenCL GPU

6.869-6.338 Final Project, 5/5/2010

Guy-Richard Kayombya

Overview

• Motivation
• Quick Intro to OpenCL
• Implementation
• Results

Motivation

• Learn OpenCL
• Adapt the SIFT algorithm to yet another parallel architecture
• Maybe achieve some speedup

Quick Intro to OpenCL

• New standard from Khronos for heterogeneous parallel computing (v1.0 released Dec 2008)
• Initiated by Apple
• Open and royalty free
• Cross-vendor and cross-platform
• Makes use of all available processing entities: CPUs, GPUs and other processors
• Scales from embedded to HPC solutions

Quick Intro to OpenCL(2)

One Host, Multiple Devices

Each Device has multiple Compute Units

Each Compute Unit has multiple Processing Elements

E.g.: GT200 has 30 Compute Units (streaming multiprocessors), each with 8 Processing Elements (scalar processors): 30 x 8 = 240 Processing Elements

Quick Intro to OpenCL(3)

• NDRange = size of the problem to solve (1D, 2D or 3D)
• Work-Group = block of work-items
• Work-Item ~ lightweight thread
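
The relationship between these three terms can be sketched in plain Python (a CPU-side illustration, not OpenCL code). For a 1D NDRange with no global offset, a work-item's global ID is its work-group ID times the work-group size plus its local ID. The function name `enumerate_work_items` and the sizes used are illustrative assumptions.

```python
def enumerate_work_items(global_size, local_size):
    """List (global_id, group_id, local_id) for a 1D NDRange.

    Mirrors what get_global_id(0), get_group_id(0) and get_local_id(0)
    would return inside a kernel. OpenCL 1.0 requires global_size to be
    a multiple of local_size.
    """
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    items = []
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            global_id = group_id * local_size + local_id
            items.append((global_id, group_id, local_id))
    return items


# 8 work-items split into 2 work-groups of 4 (sizes chosen for illustration).
items = enumerate_work_items(global_size=8, local_size=4)
```

Each tuple in `items` describes one work-item; the global IDs cover the whole NDRange exactly once.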

Quick Intro to OpenCL(4)

• Global: per device
• Local: per Work-Group
• Private: per Work-Item

Quick Intro to OpenCL(5)

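__kernel void vec_inc(__global float *a, const float b)
{
    // One work-item per element: each adds the scalar b to its own entry of a.
    // A scalar argument is passed by value, so it cannot be __global.
    int gid = get_global_id(0);
    a[gid] = a[gid] + b;
}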
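
As a plain-Python reference for what this kernel computes, the sketch below runs the kernel body once per global ID. It is a serial emulation for checking results, not how an OpenCL runtime dispatches work; `launch_1d` is a hypothetical helper introduced only for this illustration.

```python
def vec_inc(gid, a, b):
    """Body of the kernel for one work-item: a[gid] = a[gid] + b."""
    a[gid] = a[gid] + b


def launch_1d(kernel, global_size, *args):
    """Hypothetical stand-in for enqueueing a 1D NDRange: call the kernel
    once per global ID, one after another."""
    for gid in range(global_size):
        kernel(gid, *args)


data = [1.0, 2.0, 3.0, 4.0]
launch_1d(vec_inc, len(data), data, 0.5)
```

Because no work-item reads an element written by another, the order of execution does not matter, which is what makes the kernel trivially parallel.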

Implementation

• Abstraction Layer (85 %)
• Gaussian/DoG Pyramids (100 % - semi-optimized)
• Keypoint Detection (95 % - Naive)
• Keypoint Refinement (90 % - Naive)
• Orientation Assignment (10 %)
• Descriptor Generation (0 %)

Abstraction Layer

• Problem: host code for driving the device is cumbersome. It requires dozens of repetitive lines to set up device contexts, kernels, buffers, etc.
• Solution: an OpenCL wrapper
• Simplifies creation and management of hybrid Host/Client buffers and execution of kernels
• Facilitates the transition from serial to parallel execution
• Host/Client synchronization
• Memory management issues still need to be fixed
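
One way such a wrapper could handle Host/Client synchronization is to track which copy of a buffer is current and to transfer data only when the other side asks for it. The sketch below is a hypothetical illustration in plain Python, with a list standing in for device memory. The class `HybridBuffer` and its methods are assumptions made for this example and are not taken from the project's code.

```python
class HybridBuffer:
    """Hypothetical host/device buffer pair with lazy synchronization."""

    def __init__(self, host_data):
        self.host = list(host_data)
        self.device = [0.0] * len(self.host)  # stand-in for device memory
        self.host_dirty = True     # host copy is newer than device copy
        self.device_dirty = False  # device copy is newer than host copy
        self.transfers = 0         # number of simulated copies

    def for_device(self):
        """Return the device copy, uploading first if the host changed."""
        if self.host_dirty:
            self.device[:] = self.host
            self.host_dirty = False
            self.transfers += 1
        return self.device

    def mark_device_written(self):
        """Record that a kernel wrote to the device copy."""
        self.device_dirty = True

    def for_host(self):
        """Return the host copy, downloading first if a kernel wrote."""
        if self.device_dirty:
            self.host[:] = self.device
            self.device_dirty = False
            self.transfers += 1
        return self.host


buf = HybridBuffer([1.0, 2.0, 3.0])
dev = buf.for_device()        # upload (first transfer)
for i in range(len(dev)):     # pretend a kernel doubles each element
    dev[i] *= 2.0
buf.mark_device_written()
dev = buf.for_device()        # already current, no new upload
result = buf.for_host()       # download (second transfer)
```

The dirty flags keep redundant transfers out of the picture, which matters because host/device copies can cost as much as the kernel itself.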

Gaussian Pyramid

• Separable convolution: two 1D filters
• Indirect filtering to reduce kernel size: sigma_diff = sqrt(sigma_dst^2 - sigma_src^2)
• Use convolutionSeperable() provided by Nvidia for efficient 2D separable convolution on the GPU
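
The two ideas on this slide can be sketched on the CPU in plain Python. This is not the Nvidia routine: it is a minimal reference under stated assumptions (a 3*sigma kernel radius, clamped borders, and example sigmas of 1.6 and 3.2 chosen only for illustration). It shows that an already blurred image needs only sigma_diff of extra blur, and that a 2D Gaussian blur can be done as a row pass followed by a column pass.

```python
import math


def sigma_diff(sigma_dst, sigma_src):
    """Extra blur needed to take an image from sigma_src to sigma_dst."""
    return math.sqrt(sigma_dst ** 2 - sigma_src ** 2)


def gaussian_kernel_1d(sigma, radius=None):
    """Normalized 1D Gaussian taps; the 3*sigma radius is an assumption."""
    if radius is None:
        radius = max(1, int(math.ceil(3.0 * sigma)))
    taps = [math.exp(-(x * x) / (2.0 * sigma * sigma))
            for x in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]


def convolve_rows(image, taps):
    """Filter each row with a 1D kernel, clamping indices at the borders."""
    radius = len(taps) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, t in enumerate(taps):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += t * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(image):
    """Swap rows and columns of a list-of-lists image."""
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, sigma):
    """Separable blur: a row pass followed by a column pass."""
    taps = gaussian_kernel_1d(sigma)
    rows_done = convolve_rows(image, taps)
    return transpose(convolve_rows(transpose(rows_done), taps))


# An image already at sigma 1.6 needs only sigma_diff of extra blur to
# reach sigma 3.2, so the filter is shorter than blurring from scratch.
extra = sigma_diff(3.2, 1.6)
flat = [[5.0] * 6 for _ in range(4)]
blurred = gaussian_blur(flat, extra)
```

With a kernel of n taps, the two 1D passes cost about 2n multiplications per pixel instead of n*n for a full 2D kernel.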

Keypoint detection

• Each pixel is processed independently by one work-item
• No state sharing
• Worst case: 26 comparisons per work-item
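
The per-pixel test can be sketched in plain Python as a reference for what one work-item would do. The sketch assumes the DoG pyramid of one octave is stored as `dog[scale][row][column]` and that the pixel is not on a border; both the layout and the function name `is_extremum` are assumptions made for illustration.

```python
def is_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly greater or strictly smaller than
    all 26 neighbours in the 3x3x3 block around it (scale, row, column)."""
    center = dog[s][y][x]
    is_max = True
    is_min = True
    for ds in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if ds == 0 and dy == 0 and dx == 0:
                    continue  # skip the pixel itself
                neighbour = dog[s + ds][y + dy][x + dx]
                if neighbour >= center:
                    is_max = False
                if neighbour <= center:
                    is_min = False
                if not is_max and not is_min:
                    # Early exit: most pixels are rejected before all
                    # 26 neighbours have been visited.
                    return False
    return True


# Three 3x3 layers of zeros with a single peak in the middle.
dog = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
dog[1][1][1] = 1.0
peak_found = is_extremum(dog, 1, 1, 1)
```

Since each call reads only its own neighbourhood and writes nothing shared, the test maps directly onto one work-item per pixel.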

Keypoint Refinement

Each Keypoint is processed independently by one work item

Kernel is a slightly modified version of the keypoint refinement Matlab Mex module by Vedaldi
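
The slides do not show the refinement kernel, so the following is only a sketch of the standard SIFT refinement step (a 3D quadratic fit around the detected extremum, as described by Lowe), written in plain Python. It is not Vedaldi's code or the OpenCL kernel. It assumes the same `dog[scale][row][column]` layout as an illustration, and it omits the iteration and the contrast and edge tests that complete implementations apply.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def refine_offset(dog, s, y, x):
    """Sub-sample offset (dx, dy, ds) of an extremum, or None if singular.

    Solves H * offset = -g, with gradient g and Hessian H estimated by
    central differences on the 3x3x3 neighbourhood of dog[s][y][x].
    """
    def d(ds, dy, dx):
        return dog[s + ds][y + dy][x + dx]

    c = d(0, 0, 0)
    grad = [(d(0, 0, 1) - d(0, 0, -1)) / 2.0,
            (d(0, 1, 0) - d(0, -1, 0)) / 2.0,
            (d(1, 0, 0) - d(-1, 0, 0)) / 2.0]
    hxx = d(0, 0, 1) - 2.0 * c + d(0, 0, -1)
    hyy = d(0, 1, 0) - 2.0 * c + d(0, -1, 0)
    hss = d(1, 0, 0) - 2.0 * c + d(-1, 0, 0)
    hxy = (d(0, 1, 1) - d(0, 1, -1) - d(0, -1, 1) + d(0, -1, -1)) / 4.0
    hxs = (d(1, 0, 1) - d(1, 0, -1) - d(-1, 0, 1) + d(-1, 0, -1)) / 4.0
    hys = (d(1, 1, 0) - d(1, -1, 0) - d(-1, 1, 0) + d(-1, -1, 0)) / 4.0
    hessian = [[hxx, hxy, hxs], [hxy, hyy, hys], [hxs, hys, hss]]
    rhs = [-g for g in grad]

    det = det3(hessian)
    if abs(det) < 1e-12:
        return None
    offset = []
    for col in range(3):  # Cramer's rule, one unknown per column
        m = [row[:] for row in hessian]
        for r in range(3):
            m[r][col] = rhs[r]
        offset.append(det3(m) / det)
    return tuple(offset)


def sample(si, yi, xi):
    """Synthetic quadratic with its peak at (x, y, s) = (0.2, -0.1, 0.3)
    relative to the centre sample."""
    px, py, ps = xi - 1, yi - 1, si - 1
    return -((px - 0.2) ** 2 + (py + 0.1) ** 2 + (ps - 0.3) ** 2)


dog = [[[sample(si, yi, xi) for xi in range(3)] for yi in range(3)]
       for si in range(3)]
offset = refine_offset(dog, 1, 1, 1)
```

On the synthetic quadratic the finite differences are exact, so the recovered offset matches the known peak position up to rounding.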

Preliminary Results (Time)

All measurements are performed on an input image of size 512x512.

Gaussian Filtering (sigma 4.1):
• Vedaldi Matlab CPU = 0.19 s – 100 %
• Naïve C++ CPU = 0.33 s – 57 %
• GPU = 0.0094 s – 2000 %
• GPU with data transfer = 0.0133 s – 1400 %

Extrema Detection (octave 0 of pyramid):
• Vedaldi Matlab CPU = 0.179 s – 100 %
• GPU = 0.035725 s – 500 %

Keypoint Refinement (octave 0 of pyramid):
• Vedaldi Matlab CPU = 0.004 s – 100 %
• GPU = 0.0689 s – 6 %
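
The percentages are consistent with reading them as speed relative to the Vedaldi Matlab CPU baseline, that is, baseline time divided by measured time. That reading is an inference from the numbers rather than something the slides state. The short Python check below recomputes the figures from the timings; the function name `relative_speed_percent` is made up for this example.

```python
def relative_speed_percent(baseline_seconds, measured_seconds):
    """Speed relative to the baseline, as a percentage (100 = same speed)."""
    return 100.0 * baseline_seconds / measured_seconds


# Gaussian filtering, baseline 0.19 s.
gaussian = {
    "Naive C++ CPU": relative_speed_percent(0.19, 0.33),
    "GPU": relative_speed_percent(0.19, 0.0094),
    "GPU with data transfer": relative_speed_percent(0.19, 0.0133),
}
# Extrema detection, baseline 0.179 s.
extrema_gpu = relative_speed_percent(0.179, 0.035725)
# Keypoint refinement, baseline 0.004 s.
refinement_gpu = relative_speed_percent(0.004, 0.0689)
```

Under this reading, the GPU is roughly 20 times faster for Gaussian filtering and 5 times faster for extrema detection, while keypoint refinement on the GPU runs at about 6 % of the CPU speed, i.e. roughly 17 times slower.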

Preliminary Results (Performance)

Refined Keypoints for octave 0

• Blue: Matlab implementation
• Red: OpenCL
• Green: Common
• 85% Correspondence

