Image Rotation Using CUDA

A Practical Training Report on

IMAGE ROTATION USING CUDA

Under the Guidance of

Prof. Tim Poston

National Institute of Advanced Studies, Bangalore

Submitted By

Harmit Singh

08IT28

VII SEM B.Tech (IT)

Department of Information Technology

National Institute of Technology Karnataka, Surathkal 2011-2012

DECLARATION BY STUDENT

I hereby declare that the project entitled “Parallel Programming using CUDA”, carried out by me during the Summer Term of the academic year 2011 – 2012 for the Practical Training/Educational Tour as per the B.Tech (I.T) degree curriculum, is my original work and has been completed successfully according to my guide’s direction and NITK’s specifications.

Harmit Singh

(Signature of the Student)

Place: Surathkal

Date : 26-09-2011

ACKNOWLEDGEMENT

I take this opportunity to express my sincere gratitude to my mentor Mr. Vivek Na and Prof. Tim Poston, who have sincerely helped and supported me during my project. I am thankful to them for devoting their precious time, out of their busy schedules, to discussions on the project.

Without their kind support and help, the project could not have been completed successfully. I would like to extend my sincere thanks to all of them.

I would like to express my gratitude to Sir Ashutosh Mukherji, Professor at the National Institute of Advanced Studies, for his kind co-operation and encouragement, which helped me in the completion of this project.

My thanks and appreciation also go to my colleague in developing the project and to the people who have willingly helped me out with their abilities.

ABSTRACT

Problem Statement:

Developing parallel code for fast interpolation in image rotation, required in the numerics of a novel 3-degree-of-freedom optical sensor, and for equalization of pressure across a grid of cells, which will play a part in a ‘deformable sets’ schema that will model foams, the folding of the brain, and other phenomena.

Aim:

Developing an algorithm for image rotation on the CPU and then deploying the algorithm on parallel GPU threads using CUDA.

Implementation details:

An NVIDIA graphics card with a 240-core GPU is installed on a PC, and the SDK toolkit is run to enable CUDA for parallel programming. The algorithm is then developed to run on the parallel architecture.

CONTENTS

1. Introduction
   1.1 About Parallel Programming
   1.2 GPU
   1.3 GPU Computing
   1.4 About CUDA
   1.5 About Project

2. Working Details of CUDA
   2.1 The CUDA Architecture
   2.2 Advantages of CUDA
   2.3 CUDA Programming Model
   2.4 Installing the CUDA Development Tools
   2.5 Purpose of nvcc
   2.6 Writing C/C++ Code for CUDA
   2.7 Kernels
   2.8 Thread Hierarchy
   2.9 Image Rotation
   2.10 Bilinear Interpolation
   2.11 Implementation
   2.12 Results
   2.13 Limitations
   2.14 Future usages of CUDA architecture for image processing

3. Conclusion

4. References

1. Introduction

1.1 About Parallel Programming

In parallel programming, a problem is broken into discrete series of instructions that are executed in parallel on GPU threads. The application developer modifies the application so that its compute-intensive kernels are mapped to the GPU, while the rest of the application remains on the CPU. The problem can then be solved in less time with the multiple compute resources of the GPU than with the single compute resource of a CPU.

1.2 GPU

A graphics processing unit or GPU (also occasionally called visual processing unit or VPU) is a specialized processor that offloads 3D or 2D graphics rendering from the microprocessor. It is used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. In a personal computer, a GPU can be present on a video card, or it can be on the motherboard. More than 90% of new desktop and notebook computers have integrated GPUs, which are usually far less powerful than those on a dedicated video card.

GPUs can be of the following types:
1. Dedicated video cards: these have their own dedicated memory.
2. Integrated graphics processors: these share a portion of system RAM.
3. Hybrid: these share RAM while having their own cache memory.

1.3 GPU Computing

GPU computing is the use of a GPU (graphics processing unit) to do general-purpose scientific and engineering computing. The model for GPU computing is to use a CPU and a GPU together in a heterogeneous computing model. The sequential part of the application runs on the CPU and the computationally intensive part runs on the GPU. From the user’s perspective, the application simply runs faster because it uses the high performance of the GPU.

1.4 About CUDA

Compute Unified Device Architecture (CUDA) is NVIDIA's parallel computing architecture: a software platform for massively parallel high-performance computing on the company’s GPUs. It enables dramatic increases in computing performance by harnessing the power of the GPU. Computing is evolving from "central processing" on the CPU to "co-processing" on the CPU and GPU, and NVIDIA invented the CUDA parallel computing architecture to enable this new computing paradigm. CUDA’s programming model differs significantly from single-threaded CPU code. In a single-threaded model, the CPU fetches a single instruction stream that operates serially on the data, whereas in CUDA many threads execute simultaneously, each processing its own part of the data.

1.5 About Project

The basis of the project was to develop parallel code for fast interpolation in image rotation, which occurs in the numerics of a novel 3-degree-of-freedom optical sensor, and for equalization of pressure across a grid of cells, which will play a part in a ‘deformable sets’ schema that will model foams, the folding of the brain, and other phenomena. The algorithms for image rotation and pressure balance, originally developed for serial execution, can readily be implemented in CUDA.

2. Working Details of CUDA

2.1 The CUDA Architecture

The CUDA architecture consists of several components:
1. Parallel compute engines inside NVIDIA GPUs
2. OS kernel-level support for hardware initialization, configuration, etc.
3. User-mode driver, which provides a device-level API for developers
4. PTX instruction set architecture (ISA) for parallel computing kernels and functions

CUDA includes C/C++ software development tools, function libraries, and a hardware abstraction mechanism that hides the GPU hardware from developers. Although CUDA requires programmers to write special code for parallel processing, it doesn’t require them to explicitly manage threads in the conventional sense, which greatly simplifies the programming model. CUDA development tools work alongside a conventional C/C++ compiler, so programmers can mix GPU code with general-purpose code for the host CPU.

2.2 Advantages of CUDA

CUDA has several advantages over traditional general purpose computation on GPUs (GPGPU) using graphics APIs.

• Scattered reads – code can read from arbitrary addresses in memory.
• Shared memory – CUDA exposes a fast shared memory region (16KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.
• Faster downloads and readbacks to and from the GPU.
• Full support for integer and bitwise operations, including integer texture lookups.

2.3 CUDA Programming Model

• Parallel code (a kernel) is launched and executed on a device by many threads.
• Threads are grouped into thread blocks.
• Parallel code is written for a single thread.
• Each thread is free to execute a unique code path.
• Built-in thread and block ID variables identify each thread.

2.4 Installing the CUDA Development Tools

1. Verify the system has a CUDA-capable GPU and a supported version of Linux.
2. Download the NVIDIA driver and the CUDA software.
3. Install the NVIDIA driver.
4. Install the CUDA software.

Test your installation by compiling and running one of the sample programs in the CUDA software to validate that the hardware and software are running correctly and communicating with each other.

2.5 Purpose of nvcc

Compiling a CUDA source file involves several splitting, compilation, preprocessing, and merging steps, and several of these steps are subtly different for different modes of CUDA compilation (such as compilation for device emulation, or the generation of device code repositories). The purpose of the CUDA compiler driver nvcc is to hide these intricate details of CUDA compilation from developers.

nvcc uses the following compilers for host code compilation:

On Linux platforms: the GNU compiler, gcc
On Windows platforms: the Microsoft Visual Studio compiler, cl

2.6 Writing C/C++ Code for CUDA

After a developer has performed the data analysis, the next step is to express the solution in C or C++. For this purpose, CUDA adds a few special extensions and API calls to the language.

One extension, __global__, is a declaration specification (“decl spec”) for the C/C++ parser. Applied to a function, for example one named saxpy_parallel, it indicates that the function is a CUDA kernel that should be compiled for an NVIDIA GPU, not for the host CPU, and that the kernel is globally accessible to the whole program.

Another extension is the execution configuration, written for example as <<<nblocks, 256>>>, which specifies the dimensions of the data grid and its blocks. The first parameter specifies the dimensions of the grid in blocks, and the second parameter specifies the dimensions of the blocks in threads.

A third extension, __shared__, indicates that a variable is placed in on-chip shared memory and is shared among the threads of the same thread block. There are also some special API calls, such as cudaMalloc() and cudaFree(), for CUDA-specific memory allocation, and cudaMemcpy() and cudaMemcpy2D(), for copying regions of memory between the CPU and the GPU.
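As a minimal sketch of how these extensions and API calls fit together, the example below defines a kernel named saxpy_parallel that computes y = a*x + y, together with a host function that allocates device memory, copies the data, launches the kernel, and copies the result back. The host function name saxpy_on_gpu, the kernel's parameter list, and the block size of 256 are assumptions made for illustration, and error checking is omitted for brevity.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// __global__ marks a kernel: compiled for the GPU and launched from the host.
__global__ void saxpy_parallel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                                      // guard the last, partial block
        y[i] = a * x[i] + y[i];
}

// Hypothetical host helper (the name is an assumption): computes y = a*x + y
// on the GPU for n elements. Error checking is omitted.
void saxpy_on_gpu(int n, float a, const float *x, float *y)
{
    float *d_x = NULL;
    float *d_y = NULL;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&d_x, bytes);               // CUDA-specific allocation
    cudaMalloc((void **)&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;                  // round up so all n elements are covered
    saxpy_parallel<<<nblocks, 256>>>(n, a, d_x, d_y);

    // This blocking copy runs only after the kernel above has finished.
    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}
```

Calling saxpy_on_gpu(n, a, x, y) with two host arrays of n floats overwrites y with the result. The grid size is rounded up, so the bounds check inside the kernel is what keeps the extra threads of the last block from writing past the end of the arrays.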

2.7 Kernels

C for CUDA extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions. A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads for each call is specified using the new <<<…>>> syntax.

Each of the threads that execute a kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
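The following sketch illustrates a kernel definition and its launch with one block of N threads, where each thread uses threadIdx.x to pick the element it works on. The kernel name VecAdd, the value of N, and the host helper vec_add_on_gpu are assumptions chosen for illustration, and error checking is omitted.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

#define N 256  // number of threads and of vector elements (illustrative value)

// Kernel definition: the body runs once in each of the N threads.
__global__ void VecAdd(const float *A, const float *B, float *C)
{
    int i = threadIdx.x;      // unique ID of this thread within the block
    C[i] = A[i] + B[i];
}

// Hypothetical host helper: adds two N-element host vectors on the GPU.
void vec_add_on_gpu(const float *A, const float *B, float *C)
{
    float *dA = NULL, *dB = NULL, *dC = NULL;
    size_t bytes = N * sizeof(float);

    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    // Kernel invocation: a grid of 1 block containing N threads.
    VecAdd<<<1, N>>>(dA, dB, dC);

    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

Because only one block is launched, N cannot exceed the maximum number of threads per block; larger vectors need several blocks and an index that also takes blockIdx into account.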

2.8 Thread Hierarchy

threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional thread block. This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or field.

Thread batching:

1.) A kernel launches a grid of thread blocks
2.) Threads within a block cooperate via shared memory
3.) Threads within a block can synchronize
4.) Threads in different blocks cannot cooperate

Each thread has access to:

1.) threadIdx.x - thread ID within block
2.) blockIdx.x - block ID within grid
3.) blockDim.x - number of threads per block
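The points above can be sketched with a small kernel in which the threads of a block cooperate through shared memory and synchronize with __syncthreads(). Each block reverses its own segment of an array. The names reverse_in_blocks and reverse_in_blocks_on_gpu and the block size of 256 are assumptions for illustration; the sketch also assumes that the array length is a multiple of the block size and omits error checking.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block (illustrative value)

// Each block reverses its own BLOCK-element segment of the array in place.
// Assumes the array length is a multiple of BLOCK.
__global__ void reverse_in_blocks(float *data)
{
    __shared__ float tile[BLOCK];                    // visible to the threads of one block only

    int local = threadIdx.x;                         // thread ID within the block
    int global = blockIdx.x * blockDim.x + local;    // position in the whole array

    tile[local] = data[global];                      // stage the segment in shared memory
    __syncthreads();                                 // wait until every thread of the block has written

    data[global] = tile[blockDim.x - 1 - local];     // read a value written by another thread
}

// Hypothetical host helper: n must be a multiple of BLOCK.
void reverse_in_blocks_on_gpu(float *host, int n)
{
    float *dev = NULL;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    reverse_in_blocks<<<n / BLOCK, BLOCK>>>(dev);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```

The reversal stays inside each block because threads in different blocks cannot cooperate: no thread ever reads an element that was staged by another block.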

Figure 1: Thread Processors and Shared Memory

2.9 Image Rotation

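Image rotation is a common operation in digital image processing. The usual method is a geometric transformation that rotates the image by an angle θ about its centre. For example [1], the original pixel p[x, y] rotates to p’[x’, y’].

If the rotation angle is a multiple of π/2 (θ = nπ/2 for an integer n), the rotated pixel coordinates are integers too. For other angles the rotated coordinates are generally not integers, which distorts the result. In this case interpolation can be used to estimate the rotated pixel values and reduce the distortion.

A rotation matrix is a matrix that is used to perform a rotation in Euclidean space. For example, the matrix

$$R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

rotates points in the xy-plane counterclockwise through an angle θ about the origin. This rotates column vectors by means of the following matrix multiplication:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$$

So the coordinates (x', y') of the point (x, y) after rotation are:

$$x' = x\cos\theta - y\sin\theta,$$

$$y' = x\sin\theta + y\cos\theta.$$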
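As a small sketch of the coordinate rotation, the function below rotates a point about an arbitrary centre, which is what rotating an image about its centre requires: the centre is moved to the origin, the rotation is applied, and the centre is added back. The function name rotate_point and its parameter names are assumptions for illustration. It is marked __host__ __device__ so that the same code can be called on the CPU and from inside a kernel.

```cuda
#include <math.h>
#include <cuda_runtime.h>

// Rotates the point (x, y) by theta radians about the centre (cx, cy) and
// stores the result in (*xr, *yr).
__host__ __device__ inline void rotate_point(float x, float y,
                                             float cx, float cy,
                                             float theta,
                                             float *xr, float *yr)
{
    float c = cosf(theta);
    float s = sinf(theta);
    float dx = x - cx;            // move the centre of rotation to the origin
    float dy = y - cy;

    *xr = cx + dx * c - dy * s;   // x' = x cos(theta) - y sin(theta)
    *yr = cy + dx * s + dy * c;   // y' = x sin(theta) + y cos(theta)
}
```

With cx = cy = 0 the function reduces to the rotation about the origin. Results computed in single precision are only approximately equal to the exact values, even for multiples of π/2.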

2.10 Bilinear Interpolation

Bilinear interpolation is a resampling method that uses the distance-weighted average of the four nearest pixel values to estimate a new pixel value.

The key idea is to perform linear interpolation first in one direction, and then again in the other direction. Although each step is linear in the sampled values and in the position, the interpolation as a whole is not linear but rather quadratic in the sample location.

Figure 2: The four red dots show the data points and the green dot is the point at which we want to interpolate.

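Suppose that we want to find the value of the unknown function f at the point P = (x, y). It is assumed that we know the value of f at the four points Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), and Q22 = (x2, y2).

We first do linear interpolation in the x-direction. This yields

$$f(R_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}),$$

where R1 = (x, y1), and

$$f(R_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}),$$

where R2 = (x, y2).

We proceed by interpolating in the y-direction:

$$f(P) \approx \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2).$$

This gives us the desired estimate of f(x, y).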
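The two interpolation steps can be sketched as a short function on an image stored row by row, where neighbouring pixels are one unit apart so that x2 - x1 = y2 - y1 = 1. The function name bilinear, the row-major layout, and the clamping at the right and bottom edges are assumptions made for illustration. The function is marked __host__ __device__ so it can be used on both the CPU and the GPU.

```cuda
#include <math.h>
#include <cuda_runtime.h>

// Bilinear interpolation at (x, y) on a row-major image with unit pixel spacing.
// Requires 0 <= x <= width - 1 and 0 <= y <= height - 1.
__host__ __device__ inline float bilinear(const float *img, int width, int height,
                                          float x, float y)
{
    int x1 = (int)floorf(x);
    int y1 = (int)floorf(y);
    int x2 = (x1 + 1 < width) ? x1 + 1 : x1;     // clamp at the right edge
    int y2 = (y1 + 1 < height) ? y1 + 1 : y1;    // clamp at the bottom edge

    float tx = x - (float)x1;                    // (x - x1) / (x2 - x1) with x2 - x1 = 1
    float ty = y - (float)y1;                    // (y - y1) / (y2 - y1) with y2 - y1 = 1

    float q11 = img[y1 * width + x1];            // f(Q11)
    float q21 = img[y1 * width + x2];            // f(Q21)
    float q12 = img[y2 * width + x1];            // f(Q12)
    float q22 = img[y2 * width + x2];            // f(Q22)

    float r1 = (1.0f - tx) * q11 + tx * q21;     // f(R1): interpolate along x at y1
    float r2 = (1.0f - tx) * q12 + tx * q22;     // f(R2): interpolate along x at y2
    return (1.0f - ty) * r1 + ty * r2;           // f(P): interpolate along y
}
```

For a 2 x 2 image with values 0, 10 in the first row and 20, 30 in the second, the value at (0.5, 0.5) is the average of all four pixels, 15.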

http://en.wikipedia.org/wiki/File:Bilinear_interpolation.png

2.11 Implementation

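To rotate an image by an angle Q:

V(X, Y) denotes the initial pixel value of the image at co-ordinates (X, Y).

The co-ordinates (X', Y') after rotation, where the value of that pixel has to be stored, are:

X' = X*cos(Q) - Y*sin(Q);

Y' = X*sin(Q) + Y*cos(Q);

This is implemented in CUDA as follows. The index of a thread and its thread ID relate to each other in a straightforward way: for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx). Each thread combines its block index and thread index to find the pixel (i, j) it handles. Because the rotated co-ordinates are generally not integers, they are rounded to the nearest pixel, and positions that fall outside the image are skipped.

// N is the image side length in pixels, a compile-time constant.
// Kernel definition: each thread moves one source pixel to its rotated position.
__global__ void Rotate(float Source[N][N], float Destination[N][N], float Q)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N) {
        // Rotate (i, j) by Q (in radians) and round to the nearest pixel.
        int x = (int)floorf(i * cosf(Q) - j * sinf(Q) + 0.5f);
        int y = (int)floorf(i * sinf(Q) + j * cosf(Q) + 0.5f);
        if (x >= 0 && x < N && y >= 0 && y < N)
            Destination[x][y] = Source[i][j];
    }
}

int main()
{
    ...
    // Kernel invocation: 16 x 16 threads per block, with enough blocks to cover
    // the N x N image. Source and Destination are device arrays.
    dim3 dimBlock(16, 16);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x, (N + dimBlock.y - 1) / dimBlock.y);
    Rotate<<<dimGrid, dimBlock>>>(Source, Destination, Q);
}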

2.12 Results

Given a square input matrix of integers representing a grayscale image, the program generates a corresponding output matrix of the same dimensions which contains the image rotated by a given angle theta.

How?

Take every co-ordinate pair (x, y) in the destination matrix and rotate it (about the exact centre of the matrix) by -theta, giving (nx, ny), which will almost always be non-integer values.

Look up the 4 values in the source matrix that are nearest to the position (nx, ny). Interpolate those values to get the image intensity at the point (nx, ny) and place that value (after rounding) in the destination matrix at (x, y).
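The procedure described above can be sketched as a kernel with one thread per destination pixel. This is an illustrative sketch rather than the project's actual code: the names rotate_bilinear and rotate_on_gpu, the use of float pixel values, the 16 x 16 block size, and the choice to write 0 for destination pixels whose source position falls outside the image are all assumptions, and error checking is omitted.

```cuda
#include <cstddef>
#include <math.h>
#include <cuda_runtime.h>

// One thread per destination pixel of an n x n row-major image: rotate the pixel's
// co-ordinates by -theta about the image centre, then interpolate the source there.
__global__ void rotate_bilinear(const float *src, float *dst, int n, float theta)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // destination column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // destination row
    if (x >= n || y >= n) return;

    float c = cosf(theta), s = sinf(theta);
    float centre = 0.5f * (float)(n - 1);            // exact centre of the pixel grid
    float dx = (float)x - centre, dy = (float)y - centre;

    // Rotation by -theta: cos(-theta) = c and sin(-theta) = -s.
    float nx = centre + dx * c + dy * s;
    float ny = centre - dx * s + dy * c;

    float value = 0.0f;                              // assumption: outside the source -> 0
    if (nx >= 0.0f && nx <= (float)(n - 1) && ny >= 0.0f && ny <= (float)(n - 1)) {
        int x1 = (int)floorf(nx), y1 = (int)floorf(ny);
        int x2 = (x1 + 1 < n) ? x1 + 1 : x1;         // clamp at the image border
        int y2 = (y1 + 1 < n) ? y1 + 1 : y1;
        float tx = nx - (float)x1, ty = ny - (float)y1;
        float r1 = (1.0f - tx) * src[y1 * n + x1] + tx * src[y1 * n + x2];
        float r2 = (1.0f - tx) * src[y2 * n + x1] + tx * src[y2 * n + x2];
        value = floorf((1.0f - ty) * r1 + ty * r2 + 0.5f);  // round to the nearest integer
    }
    dst[y * n + x] = value;
}

// Hypothetical host helper: rotates an n x n host image by theta radians.
void rotate_on_gpu(const float *src, float *dst, int n, float theta)
{
    float *d_src = NULL, *d_dst = NULL;
    size_t bytes = (size_t)n * n * sizeof(float);

    cudaMalloc((void **)&d_src, bytes);
    cudaMalloc((void **)&d_dst, bytes);
    cudaMemcpy(d_src, src, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(16, 16);
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x, (n + dimBlock.y - 1) / dimBlock.y);
    rotate_bilinear<<<dimGrid, dimBlock>>>(d_src, d_dst, n, theta);

    cudaMemcpy(dst, d_dst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_dst);
}
```

Working backwards from the destination means every output pixel receives exactly one value, so the rotated image has no holes, and each thread writes only its own pixel, so threads never compete for the same memory location.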

Implementing this algorithm in CUDA gives very good results for large images, but for small images it is slower than the serial version.

2.13 Limitations

Copying between host and device memory may incur a performance hit due to system bus bandwidth and latency (this can be partly alleviated with asynchronous memory transfers, handled by the GPU's DMA engine). Therefore, with small images the results are not optimal compared to serial execution of the rotation. However, large images perform better with CUDA because the work is done in parallel.
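The asynchronous transfers mentioned above can be sketched with page-locked (pinned) host memory and a CUDA stream. The kernel scale and the helper scale_async are hypothetical placeholders for real work, chosen only to show the sequence of calls, and error checking is omitted.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Placeholder kernel: multiplies every element by a factor.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Hypothetical helper: scales n floats on the GPU using asynchronous copies.
void scale_async(float *result, const float *input, int n, float factor)
{
    float *pinned = NULL, *dev = NULL;
    size_t bytes = n * sizeof(float);
    cudaStream_t stream;

    cudaMallocHost((void **)&pinned, bytes);   // page-locked buffer, needed for truly asynchronous copies
    cudaMalloc((void **)&dev, bytes);
    cudaStreamCreate(&stream);

    for (int i = 0; i < n; ++i)
        pinned[i] = input[i];

    // These three operations are queued on the stream and run in order on the
    // device, while each call returns to the CPU without waiting.
    cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n, factor);
    cudaMemcpyAsync(pinned, dev, bytes, cudaMemcpyDeviceToHost, stream);

    // The CPU is free to do unrelated work here.

    cudaStreamSynchronize(stream);             // wait for the queued work to finish

    for (int i = 0; i < n; ++i)
        result[i] = pinned[i];

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(pinned);
}
```

Asynchronous copies do not make the transfer itself faster. They only let the CPU, or work queued on other streams, proceed while the copy is in flight, so small images remain dominated by the transfer cost.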

The maximum number of threads per block is 512, so the concurrency available within a block is limited.
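The limit depends on the GPU, so it is safer to query it at run time than to hard-code it. A minimal sketch, in which the function name max_threads_per_block is an assumption for illustration:

```cuda
#include <cuda_runtime.h>

// Returns the maximum number of threads per block supported by the given
// device, or -1 if the device properties cannot be read.
int max_threads_per_block(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.maxThreadsPerBlock;
}
```

A block shape such as 16 x 16 uses 256 threads, so it stays within a limit of 512.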

2.14 Future usages of CUDA architecture for image processing.

Accelerated rendering of 3D graphics

Accelerated interconversion of video file formats

Accelerated encryption, decryption and compression

Medical analysis simulations, for example virtual reality based on CT and MRI scan images.

Physical simulations, in particular in fluid dynamics.

3. Conclusion

In this implementation I have demonstrated three features of the algorithm that help it achieve such high efficiency:

● Straightforward parallelism with sequential memory access patterns
● Data reuse that keeps the arithmetic units busy
● Fully pipelined arithmetic, including complex operations such as rotation of co-ordinates, which is much faster clock-for-clock on a GeForce 8800 GTX GPU than on a CPU

The result is an algorithm that runs more than 50 times as fast as a highly tuned serial implementation, or 250 times faster than our portable C implementation. At this performance level, 3D simulations of large numbers of pixels can be run interactively, efficiently, and effectively.

4. References

[1] NVIDIA_CUDA_BestPracticesGuide_2.3.pdf

[2] NVIDIA_CUDA_Programming_Guide_2.3.pdf

[3] CUDA_Getting_Started_2.2_Windows.pdf

[4] CUDA_Reference_Manual_2.3.pdf

[5] nvcc_2.0.pdf

[6] NVIDIA Corporation. 2007. NVIDIA CUDA Compute Unified Device Architecture Programming Guide. Version 0.8.1

[7] http://nbodylab.interconnect.com/docs/P3.1.6_revised.pdf
