8/2/2019 Report on Gpu
SEMINAR REPORT
ON
GRAPHIC PROCESSING
UNIT
Maharaja Agrasen Institute of technology,
PSP area,Sector 22, Rohini, New Delhi 110085
SUBMITTED BY:
DHIRAJ JAIN
0431482707
C2-2
INDEX

1. Acknowledgement
2. Introduction to GPU
3. GPU Forms
4. Stream Processing and GPGPU
5. CUDA
6. Nvidia's CUDA: The End of the CPU?
   6.1. Introduction
   6.2. The CUDA Architecture
   6.3. CUDA Software Development Kit
   6.4. CUDA APIs
   6.5. Theory: CUDA from the Hardware Point of View
   6.6. Theory: CUDA from the Software Point of View
   6.7. Performance
   6.8. Analysis
   6.9. Working
   6.10. Conclusion
7. References
ACKNOWLEDGEMENT
It is a pleasure to acknowledge my debt to the many people involved, directly or indirectly, in the production of this seminar. I would like to thank my faculty guide, Mrs. POOJA GUPTA, for providing motivational guidance during the entire preparation of the seminar, answering a number of my technical queries and, despite her busy schedule, always giving time to check the progress report.
I gratefully acknowledge the efforts of several of my colleagues who offered many suggestions throughout this project. Their constructive criticism and timely reviews have resulted in several improvements.
Thanks a lot for your guidance and support.
DHIRAJ JAIN
0431482707
Introduction To GPU
August 31, 1999 marks the introduction of the Graphics Processing Unit (GPU) for
the PC industry. The technical definition of a GPU is "a single chip processor with
integrated transform, lighting, triangle setup/clipping, and rendering engines that is
capable of processing a minimum of 10 million polygons per second."
The GPU changes everything you have ever seen or experienced on your PC.
As 3D becomes more pervasive in our lives, the need for faster processing speeds
increases. With the advent of the GPU, computationally intensive transform and
lighting calculations were offloaded from the CPU onto the GPU, allowing for
faster graphics processing speeds. This means scenes can increase in detail and
complexity without sacrificing performance. In essence, the GPU gives you truly
stunning realism for free.
The difficulty in virtual representations of the real world lies in robustly mimicking
how objects interact with one another and their surroundings, due to the intense,
split-second computations needed to process all the variables. These computations
once bottlenecked the CPU; with the GPU taking them over, that bottleneck no
longer applies.
(Figure: scene rendering before and after the GPU)
GPU forms
Dedicated graphics cards
The most powerful class of GPUs typically interface with the motherboard by means of an
expansion slot such as PCI Express (PCIe) or Accelerated Graphics Port (AGP) and can usually
be replaced or upgraded with relative ease, assuming the motherboard is capable of supporting
the upgrade. A few graphics cards still use Peripheral Component Interconnect (PCI) slots, but
their bandwidth is so limited that they are generally used only when a PCIe or AGP slot is
unavailable.
A dedicated GPU is not necessarily removable, nor does it necessarily interface with the
motherboard in a standard fashion. The term "dedicated" refers to the fact that dedicated graphics
cards have RAM that is dedicated to the card's use, not to the fact that most dedicated GPUs are
removable. Dedicated GPUs for portable computers are most commonly interfaced through a
non-standard and often proprietary slot due to size and weight constraints. Such ports may still
be considered PCIe or AGP in terms of their logical host interface, even if they are not physically
interchangeable with their counterparts.
Technologies such as SLI by NVIDIA and CrossFire by ATI allow multiple GPUs to be used to
draw a single image, increasing the processing power available for graphics.
Integrated graphics solutions
Intel GMA X3000 IGP (under heatsink)
Integrated graphics solutions, or shared graphics solutions, are graphics
processors that utilize a portion of a computer's system RAM rather than dedicated
graphics memory. Computers with integrated graphics account for 90% of all PC
shipments. These solutions are cheaper to implement than dedicated graphics
solutions, but are less capable. Historically, integrated solutions were often
considered unfit to play 3D games or run graphically intensive programs such as
Adobe Flash (examples of such IGPs would be offerings from SiS and VIA circa
2004). However, today's integrated solutions, such as Intel's GMA X4500HD
(Intel G45 chipset), AMD's Radeon HD 3200 (AMD 780G chipset) and NVIDIA's
GeForce 8200 (NVIDIA nForce 730a), are more than capable of handling 2D
graphics from Adobe Flash or low-stress 3D graphics. However, most integrated
graphics still struggle with high-end video games. Chips like the Nvidia GeForce
9400M in Apple's MacBook and MacBook Pro and AMD's Radeon HD 3300
(AMD 790GX) have improved performance, but still lag behind dedicated graphics
cards. Modern desktop motherboards often include an integrated graphics solution
and have expansion slots available to add a dedicated graphics card later.
As a GPU is extremely memory intensive, an integrated solution may find itself
competing with the CPU for the already slow system RAM, as it has minimal or no
dedicated video memory. System RAM bandwidth may range from 2 GB/s to
12.8 GB/s, yet dedicated GPUs enjoy between 10 GB/s and over 100 GB/s of
bandwidth, depending on the model.
Older integrated graphics chipsets lacked hardware transform and lighting, but
newer ones include it.
Hybrid solutions
This newer class of GPUs competes with integrated graphics in the low-end
desktop and notebook markets. The most common implementations are
ATI's HyperMemory and NVIDIA's TurboCache. Hybrid graphics cards are
somewhat more expensive than integrated graphics, but much less expensive than
dedicated graphics cards. These also share memory with the system, but have a
smaller dedicated amount of it than discrete graphics cards do, to make up for the
high latency of the system RAM. Technologies within PCI Express make this
possible. While these solutions are sometimes advertised as having as much as
768 MB of RAM, this refers to how much can be shared with the system memory.
Stream Processing and General-Purpose GPUs (GPGPU)
A newer concept is to use a modified form of a stream processor to allow a general-
purpose graphics processing unit. This concept turns the massive floating-point
computational power of a modern graphics accelerator's shader pipeline into
general-purpose computing power, as opposed to being hard-wired solely to do
graphical operations. In certain applications requiring massive vector operations,
this can yield several orders of magnitude higher performance than a conventional
CPU. The two largest discrete (see "Dedicated graphics cards" above) GPU
designers, ATI and NVIDIA, are beginning to pursue this new market with an
array of applications. Both Nvidia and ATI have teamed with Stanford University
to create a GPU-based client for the Folding@Home distributed computing project
(for protein folding calculations). In certain circumstances the GPU calculates forty
times faster than the conventional CPUs traditionally used in such applications.
Recently Nvidia began releasing cards supporting an API extension to the C
programming language called CUDA ("Compute Unified Device Architecture"),
which allows specified functions from a normal C program to run on the GPU's
stream processors. This makes C programs capable of taking advantage of a GPU's
ability to operate on large matrices in parallel, while still making use of the CPU
where appropriate. CUDA is also the first API to allow CPU-based applications to
directly access the resources of a GPU for more general-purpose computing
without the limitations of using a graphics API.
Since 2005 there has been interest in using the performance offered by GPUs for
evolutionary computation in general, and for accelerating the fitness evaluation in
genetic programming in particular. There is a short introduction on pages 90-92 of
A Field Guide to Genetic Programming. Most approaches compile linear or tree
programs on the host PC and transfer the executable to the GPU to run. Typically
the performance advantage is only obtained by running the single active program
simultaneously on many example problems in parallel, using the GPU's SIMD
architecture. However, substantial acceleration can also be obtained by not
compiling the programs but instead transferring them to the GPU and interpreting
them there. Acceleration can then be obtained by interpreting multiple programs
simultaneously, running multiple example problems simultaneously, or a
combination of both. A modern GPU (e.g. the 8800 GTX or later) can readily
interpret hundreds of thousands of very small programs simultaneously.
CUDA: Compute Unified Device Architecture
What is CUDA?
NVIDIA CUDA (Compute Unified Device Architecture) technology is the world's
only C language environment that enables programmers and developers to write
software to solve complex computational problems in a fraction of the time by
tapping into the many-core parallel processing power of GPUs. With millions of
CUDA-capable GPUs already deployed, thousands of software programmers are
already using the free CUDA software tools to accelerate applications, from video
and audio encoding to oil and gas exploration, product design, medical imaging,
and scientific research.
Providing orders of magnitude more performance than current CPUs and
simplifying software development by extending the standard C language, CUDA
technology enables developers to create innovative solutions for data-intensive
problems. For advanced research and language development, CUDA includes a
low-level assembly language layer and driver interface.
CUDA is a software and GPU architecture that makes it possible to use the many
processor cores (and eventually thousands of cores) in a GPU to perform general-
purpose mathematical calculations. CUDA is accessible to all programmers
through an extension to the C and C++ programming languages for parallel
computing.
Technology Features:
- Standard C language for parallel application development on the GPU
- Standard numerical libraries for FFT (Fast Fourier Transform) and BLAS (Basic Linear Algebra Subroutines)
- Dedicated CUDA driver for computing, with a fast data transfer path between GPU and CPU
- CUDA driver interoperates with OpenGL and DirectX graphics drivers
- Support for Linux 32/64-bit and Windows XP 32/64-bit operating systems
How is CUDA different from GPGPU?
CUDA is designed from the ground-up for efficient general purpose computation
on GPUs. It uses a C-like programming language and does not require remapping
algorithms to graphics concepts. CUDA is an extension to C for parallel
computing. It allows the programmer to program in C, without the need to translate
problems into graphics concepts. Anyone who can program C can swiftly learn to
program in CUDA.
GPGPU (General-Purpose computation on GPUs) uses graphics APIs like DirectX
and OpenGL for computation. It requires detailed knowledge of graphics APIs and
hardware. The programming model is limited in terms of random read and write
and thread cooperation.
CUDA exposes several hardware features that are not available via the graphics
API. The most significant of these is shared memory, a small (currently
16 KB per multiprocessor) area of on-chip memory that can be accessed in
parallel by blocks of threads. This allows caching of frequently used data and can
provide large speedups over using textures to access data. Combined with a thread
synchronization primitive, this allows cooperative parallel processing of on-chip
data, greatly reducing the expensive off-chip bandwidth requirements of many
parallel algorithms. This benefits a number of common applications such as linear
algebra, Fast Fourier Transforms, and image processing filters.
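As a rough sketch of the idea (an illustration, not code from any NVIDIA sample), a kernel can stage data in shared memory, synchronize with the barrier primitive, and then let threads read their neighbours' values on-chip instead of from slow off-chip memory. The kernel name and the fixed block size of 256 are assumptions of this example:

```cuda
// Sketch: a 3-point moving average. Each block stages its slice of the input
// in shared memory, synchronizes, then every thread reads its neighbours
// on-chip. Assumes one-dimensional blocks of 256 threads.
__global__ void avg3(const float *in, float *out, int n)
{
    __shared__ float tile[256 + 2];              // one halo element on each side
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x + 1] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)                        // left halo (zero-padded at the edge)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)           // right halo
        tile[blockDim.x + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;

    __syncthreads();                             // the thread-synchronization primitive

    if (i < n)
        out[i] = (tile[threadIdx.x] + tile[threadIdx.x + 1]
                  + tile[threadIdx.x + 2]) / 3.0f;
}
```

Without shared memory, each of the three reads per thread would go to off-chip memory; here each input element is fetched from off-chip memory only once per block.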
Whereas fragment programs in the graphics API are limited to outputting 32 floats
(RGBA * 8 render targets) at a pre-specified location, CUDA supports scattered
writes - i.e. an unlimited number of stores to any address. This enables many new
algorithms that were not possible to perform efficiently using graphics-based
GPGPU.
The graphics API forces the user to store data in textures, which requires packing
long arrays into 2D textures. This is cumbersome and imposes extra addressing
math. CUDA can perform loads from any address. CUDA also offers highly
optimized data transfers to and from the GPU.
Nvidia's CUDA: The End of the
CPU?
Introduction
Let's take a trip back in time, way back to 2003, when Intel and AMD became
locked in a fierce struggle to offer increasingly powerful processors. In just a few
years, clock speeds increased quickly as a result of that competition, especially
with Intel's release of its Pentium 4.
But the clock speed race would soon hit a wall. After riding the wave of sustained
clock speed boosts (between 2001 and 2003 the Pentium 4's clock speed doubled
from 1.5 to 3 GHz), users now had to settle for improvements of a few measly
megahertz that the chip makers managed to squeeze out (between 2003 and 2005,
clock speeds only increased from 3 to 3.8 GHz).
Even architectures optimized for high clock speeds, like the Prescott, ran afoul of
the problem, and for good reason: this time the challenge wasn't simply an
industrial one. The chip makers had simply come up against the laws of physics.
Some observers were even prophesying the end of Moore's Law. But that was far
from being the case. While its original meaning has often been misinterpreted, the
real subject of Moore's Law was the number of transistors on a given surface area
of silicon. For a long time, the increase in the number of transistors in a CPU
was accompanied by a concomitant increase in performance, which no doubt
explains the confusion. But then the situation became complicated. CPU architects
had come up against the law of diminishing returns: the number of transistors that
had to be added to achieve a given gain in performance was becoming ever greater,
and the approach was headed for a dead end.
The CUDA Architecture
The CUDA architecture consists of several components, in the green boxes
below:
1. Parallel compute engines inside NVIDIA GPUs
2. OS kernel-level support for hardware initialization, configuration, etc.
3. User-mode driver, which provides a device-level API for developers
4. PTX instruction set architecture (ISA) for parallel computing kernels and
functions
The CUDA Software Development
Environment
The CUDA Software Development Environment provides all the tools, examples
and documentation necessary to develop applications that take advantage of the
CUDA architecture.
Libraries: Advanced libraries that include BLAS, FFT, and other functions
optimized for the CUDA architecture

C Runtime: The C Runtime for CUDA provides support for executing standard C
functions on the GPU and allows native bindings for other high-level languages
such as Fortran, Java, and Python

Tools: NVIDIA C Compiler (nvcc), CUDA Debugger (cuda-gdb), CUDA Visual
Profiler (cudaprof), and other helpful tools

Documentation: Includes the CUDA Programming Guide, API specifications, and
other helpful documentation

Samples: SDK code samples and documentation that demonstrate best practices
for a wide variety of GPU computing algorithms and applications
The CUDA Software Development Environment supports two different
programming interfaces:
1. A device-level programming interface, in which the application uses DirectX
Compute, OpenCL or the CUDA Driver API directly to configure the GPU, launch
compute kernels, and read back results.
2. A language integration programming interface, in which an application uses the
C Runtime for CUDA and developers use a small set of extensions to indicate
which compute functions should be performed on the GPU instead of the CPU.
When using the device-level programming interface, developers write compute
kernels in separate files using the kernel language supported by their API of
choice. DirectX Compute kernels (aka compute shaders) are written in HLSL.
OpenCL kernels are written in a C-like language called OpenCL C. The CUDA
Driver API accepts kernels written in C or PTX assembly.
When using the language integration programming interface, developers write
compute functions in C and the C Runtime for CUDA automatically handles
setting up the GPU and executing the compute functions. This programming
interface enables developers to take advantage of native support for high-level
languages such as C, C++, Fortran, Java, Python, and more (see below), reducing
code complexity and development costs through type integration and code
integration:
Type integration allows standard types as well as vector types and user-defined
types (including structs) to be used seamlessly across functions that are executed on
the CPU and functions that are executed on the GPU.
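A minimal sketch of the language integration interface (an illustration, not an example from the report; the kernel name and sizes are assumptions): a function marked __global__ runs on the GPU, and the C Runtime for CUDA handles device setup when it is launched from ordinary C code with the <<<...>>> syntax:

```cuda
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    float *a, *b, *c;
    cudaMalloc((void **)&a, n * sizeof(float));  // allocations in device memory
    cudaMalloc((void **)&b, n * sizeof(float));
    cudaMalloc((void **)&c, n * sizeof(float));

    // 4 blocks of 256 threads cover all 1024 elements.
    add<<<4, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();                     // wait for the GPU to finish

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The same source file mixes host and device code; nvcc splits it and compiles each part for the right processor.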
The CUDA APIs
But Brook's critical success was enough to attract the attention of ATI and Nvidia,
since the two giants saw the incipient interest in this type of initiative as an
opportunity to broaden their market even more by reaching a new sector that had so
far been indifferent to their graphics achievements.
Researchers who were in on Brook at its origin quickly joined the Santa Clara
development teams to put together a global strategy for targeting the new market.
The idea was to offer a hardware/software ensemble suited to this type of
calculation. Since Nvidia's developers know all the secrets of their GPU, there was
no question of relying only on a graphics API, which communicates with the
hardware only via a driver, with all the problems that implies, as we saw above. So the
CUDA (Compute Unified Device Architecture) development team created a set of
software layers to communicate with the GPU.
As you can see on this diagram, CUDA provides two APIs:
A high-level API: the CUDA Runtime API;
A low-level API: the CUDA Driver API.
Since the high-level API is implemented above the low-level API, each call to a
function of the Runtime is broken down into more basic instructions managed by
the Driver API. Note that these two APIs are mutually exclusive: the programmer
must use one or the other, but it's not possible to mix function calls from both. The
term "high-level API" is relative. Even the Runtime API is still what a lot of people
would consider very low-level; yet it still offers functions that are highly practical
for initialization or context management. But don't expect a lot more abstraction;
you still need a good knowledge of Nvidia GPUs and how they work.
The Driver API, then, is more complex to manage; it requires more work to launch
processing on the GPU. But the upside is that it's more flexible, giving the
programmer who wants it additional control. The two APIs are capable of
communicating with OpenGL or Direct3D resources (only Direct3D 9 for the
moment). The usefulness of this is obvious: CUDA could be used to generate
resources (geometry, procedural textures, etc.) that could then be passed to the
graphics API, or conversely, the 3D API could send the results of the rendering
to CUDA, which in that case would be used to perform post-processing. There are
numerous examples of such interactions, and the advantage is that the resources
remain stored in the GPU's RAM without having to transit through the bottleneck
of the PCI Express bus.
Conversely, we should point out that sharing resources, in this case video memory,
with graphics data is not always idyllic and can lead to a few headaches. For
example, on a change of resolution or color depth, the graphics data have priority.
So, if the resources for the frame buffer need to increase, the driver won't hesitate
to grab the ones that are allocated to applications using CUDA, causing them to
crash. It's not very elegant, granted, but you have to admit that the situation
shouldn't come up very often. And since we're on the subject of little
disadvantages: if you want to use several GPUs for a CUDA application, you'll
have to disable SLI mode first, or only a single GPU will be visible to CUDA.
Finally, the third software layer is a set of libraries, two to be precise:
CUBLAS, which has a set of building blocks for linear algebra calculations
on the GPU;
CUFFT, which can handle calculation of Fourier transforms, an algorithm
much used in the field of signal processing.
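To give a feel for the library layer, here is a hedged sketch of a SAXPY call (y = alpha*x + y) through the legacy CUBLAS interface of that era; the vector length and values are made up for the example:

```cuda
#include <cublas.h>

int main(void)
{
    const int n = 1024;
    float x[1024], y[1024];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    cublasInit();                                    // initialize the CUBLAS library

    float *dx, *dy;
    cublasAlloc(n, sizeof(float), (void **)&dx);     // allocate vectors on the GPU
    cublasAlloc(n, sizeof(float), (void **)&dy);
    cublasSetVector(n, sizeof(float), x, 1, dx, 1);  // copy host -> device
    cublasSetVector(n, sizeof(float), y, 1, dy, 1);

    cublasSaxpy(n, 3.0f, dx, 1, dy, 1);              // y = 3*x + y, computed on the GPU

    cublasGetVector(n, sizeof(float), dy, 1, y, 1);  // copy the result back

    cublasFree(dx);
    cublasFree(dy);
    cublasShutdown();
    return 0;
}
```

No kernel is written at all: the application stays in plain host C, and the library supplies the GPU code.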
A Few Definitions
Before we dive into CUDA, let's define a few terms that are sprinkled throughout
Nvidia's documentation. The company has chosen to use a rather special
terminology that can be hard to grasp. First we need to define what a thread is in
CUDA, because the term doesn't have quite the same meaning as a CPU thread,
nor is it the equivalent of what we call "threads" in our GPU articles. A thread on
the GPU is a basic element of the data to be processed. Unlike CPU threads, CUDA
threads are extremely lightweight, meaning that a context change between two
threads is not a costly operation.
The second term frequently encountered in the CUDA documentation is warp. No
confusion possible this time (unless you think the term might have something to do
with Star Trek or Warhammer). No, the term is taken from the terminology of
weaving, where it designates the threads arranged lengthwise on a loom and crossed
by the woof. A warp in CUDA, then, is a group of 32 threads, which is the
minimum size of the data processed in SIMD fashion by a CUDA multiprocessor.
But that granularity is not always sufficient to be easily usable by a programmer,
and so in CUDA, instead of manipulating warps directly, you work with blocks that
can contain 64 to 512 threads.
Finally, these blocks are put together in grids. The advantage of the grouping is that
the number of blocks processed simultaneously by the GPU is closely linked to
hardware resources, as we'll see later. The number of blocks in a grid makes it
possible to totally abstract that constraint and apply a kernel to a large quantity of
threads in a single call, without worrying about fixed resources. The CUDA
runtime takes care of breaking it all down for you. This means that the model is
extremely extensible. If the hardware has few resources, it executes the blocks
sequentially; if it has a very large number of processing units, it can process them in
parallel. This in turn means that the same code can target entry-level GPUs,
high-end ones, and even future GPUs.
The other terms you'll run into frequently in the CUDA API are used to designate
the CPU, which is called the host, and the GPU, which is referred to as the device.
After that little introduction, which we hope hasn't scared you away, it's time to
plunge in!
The Theory: CUDA from the Hardware Point of
View
With CUDA, Nvidia presents its architecture in a slightly different way and
exposes certain details that hadn't been revealed before now.
As you can see above, Nvidia's shader core is made up of several clusters Nvidia
calls Texture Processor Clusters. An 8800 GTX, for example, has eight clusters,
an 8800 GTS six, and so on. Each cluster, in fact, is made up of a texture unit and
two streaming multiprocessors. These processors consist of a front end that
reads/decodes and launches instructions, and a back end made up of a group of eight
calculating units and two SFUs (Super Function Units), where the instructions are
executed in SIMD fashion: the same instruction is applied to all the threads in the
warp. Nvidia calls this mode of execution SIMT (for single instruction, multiple
threads). It's important to point out that the back end operates at double the
frequency of the front end. In practice, then, the part that executes the instructions
appears to be twice as wide as it actually is (that is, as a 16-way SIMD unit
instead of an eight-way one). The streaming multiprocessor's operating mode is as
follows: at each cycle, a warp ready for execution is selected by the front end,
which launches execution of an instruction. To apply the instruction to all 32
threads in the warp, the back end will take four cycles, but since it operates at
double the frequency of the front end, from the front end's point of view only two
cycles will be executed. So, to avoid having the front end remain unused for one
cycle, and to maximize the use of the hardware, the ideal is to alternate types of
instructions every cycle: a classic instruction for one cycle and an SFU instruction
for the other.
Each multiprocessor also has a certain amount of resources that should be understood
in order to make the best use of them. It has a small memory area called
shared memory, with a size of 16 KB per multiprocessor. This is not a cache
memory: the programmer has a free hand in its management. As such, it's like the
Local Store of the SPUs on Cell processors. This detail is particularly interesting,
and demonstrates the fact that CUDA is indeed a set of software and hardware
technologies. This memory area is not used for pixel shaders; as Nvidia says,
tongue in cheek, "We dislike pixels talking to each other."
This memory area provides a way for threads in the same block to communicate.
It's important to stress the restriction: all the threads in a given block are guaranteed
to be executed by the same multiprocessor. Conversely, the assignment of blocks to
the different multiprocessors is completely undefined, meaning that two threads
from different blocks can't communicate during their execution. That means that
using this memory is complicated. But it can also be worthwhile, because except
for cases where several threads try to access the same memory bank, causing a
conflict, access to shared memory is as fast as access to the registers.
The shared memory is not the only memory the multiprocessors can access.
Obviously they can use the video memory, but it has lower bandwidth and higher
latency. Consequently, to limit too-frequent access to this memory, Nvidia has also
provided its multiprocessors with a cache (approximately 8 KB per multiprocessor)
for access to constants and textures.
The multiprocessors also have 8,192 registers that are shared among all the threads
of all the blocks active on that multiprocessor. The number of active blocks per
multiprocessor can't exceed eight, and the number of active warps is limited to 24
(768 threads). So, an 8800 GTX can have up to 12,288 threads being processed at a
given instant. It's worth mentioning all these limits because they help in
dimensioning the algorithm as a function of the available resources.
Optimizing a CUDA program, then, essentially consists of striking the optimum
balance between the number of blocks and their size: more threads per block will
be useful in masking the latency of the memory operations, but at the same time the
number of registers available per thread is reduced. What's more, a block of 512
threads would be particularly inefficient, since only one block could be active on a
multiprocessor, potentially wasting 256 thread slots. So, Nvidia advises using blocks of
128 to 256 threads, which offers the best compromise between masking latency and
the number of registers needed for most kernels.
The Theory: CUDA from the Software Point of
View
From a software point of view, CUDA consists of a set of extensions to the C
language, which of course recalls BrookGPU, and a few specific API calls. Among
the extensions are type qualifiers that apply to functions and variables. The
keyword to remember here is __global__, which when prefixed to a function
indicates that the latter is a kernel, that is, a function that will be called by the
CPU and executed by the GPU. The __device__ keyword designates a function that
will be executed by the GPU (which CUDA refers to as the "device") but can only
be called from the GPU (in other words, from another __device__ function or from
a __global__ function). Finally, the __host__ keyword is optional, and designates a
function that's called by the CPU and executed by the CPU; in other words, a
traditional function.
There are a few restrictions associated with __device__ and __global__ functions:
they can't be recursive (that is, they can't call themselves) and they can't have a
variable number of arguments. Finally, since __device__ functions are resident in
the GPU's memory space, it is, logically enough, impossible to obtain their address.
Variables also have new qualifiers that allow control of the memory area where
they'll be stored. A variable preceded by the keyword __shared__ indicates that it
will be stored in the streaming multiprocessor's shared memory. The way a
__global__ function is called is also a little different. That's because the execution
configuration has to be defined at the time of the call; more concretely, the size of
the grid to which the kernel is applied and the size of each block. Take the example
of a kernel with the following signature:
__global__ void Func(float* parameter);
which will be called as follows:
Func<<<Dg, Db>>>(parameter);
where Dg is the grid dimension and Db the dimension of a block. These two
variables are of dim3, a new vector type introduced by CUDA.
The CUDA API essentially comprises functions for memory manipulation in
VRAM: cudaMalloc to allocate memory, cudaFree to free it and cudaMemcpy to
copy data between RAM and VRAM and vice-versa.
We'll end this overview with the way a CUDA program is compiled, which is
interesting: compiling is done in several phases. First of all, the code dedicated to
the CPU is extracted from the file and passed to the standard compiler. The code
dedicated to the GPU is first converted into an intermediate language, PTX. This
intermediate language is like an assembler, and so enables the generated source
code to be studied for potential inefficiencies. Finally, the last phase translates this
intermediate language into commands that are specific to the GPU and encapsulates
them in binary form in the executable.
Performance
However, we did decide to measure the processing time to see if there was any
advantage to using CUDA even with our crude implementation, or on the other
hand if it was going to take long, exhaustive practice to get any real control over the
use of the GPU. The test machine was our development box, a laptop computer
with a Core 2 Duo T5450 and a GeForce 8600M GT, operating under Vista. It's far
from being a supercomputer, but the results are interesting since our test is not all
that favorable to the GPU. It's fine for Nvidia to show us huge accelerations on
systems equipped with monster GPUs and enormous bandwidth, but in practice
many of the 70 million CUDA GPUs existing on current PCs are much less
powerful, and so our test is quite germane.
The results we got are as follows for processing a 2048x2048 image:
CPU, 1 thread: 1,419 ms
CPU, 2 threads: 749 ms
CPU, 4 threads: 593 ms
GPU (8600M GT), blocks of 256 pixels: 109 ms
GPU (8600M GT), blocks of 128 pixels: 94 ms
GPU (8800 GTX), blocks of 128 or 256 pixels: 31 ms
Several observations can be made about these results. First of all, you'll notice that
despite our crack about programmers' laziness, we did modify the initial CPU
implementation by threading it. As we said, the code is ideal for this situation: all
you do is break down the initial image into as many zones as there are threads. Note
that we got an almost linear acceleration going from one to two threads on our dual-
core CPU, which shows the strongly parallel nature of our test program. Fairly
unexpectedly, the four-thread version proved faster, whereas we were expecting to
see no difference at all on our processor, or even, more logically, a slight loss
of efficiency due to the additional cost generated by the creation of the additional
threads. What explains that result? It's hard to say, but it may be that the Windows
thread scheduler has something to do with it; in any case, the result was
reproducible. With a texture of smaller dimensions (512x512), the gain achieved
by threading was a lot less marked (approximately 35% as opposed to 100%) and
the behavior of the four-thread version was more logical, showing no gain over the
two-thread version. The GPU was still faster, but less markedly so (the 8600M GT
was three times faster than the two-thread version).
The second notable observation is that even the slowest GPU implementation was
nearly six times faster than the best-performing CPU version. For a first program
and a trivial version of the algorithm, that's very encouraging. Notice also that we
got significantly better results using smaller blocks, whereas intuitively you might
think that the reverse would be true. The explanation is simple: our program uses
14 registers per thread, so with 256-thread blocks it needs 3,584 registers
per block, and to saturate a multiprocessor it takes 768 threads, as we saw. In our
case, that's three blocks, or 10,752 registers. But a multiprocessor has only 8,192
registers, so it can only keep two blocks active. Conversely, with blocks of 128
pixels, we need 1,792 registers per block; 8,192 divided by 1,792 and rounded
down works out to four blocks being processed. In practice, the
number of threads is the same (512 per multiprocessor, whereas theoretically it
takes 768 to saturate it), but having more blocks gives the GPU additional
flexibility with memory access: when an operation with a long latency is executed,
it can launch execution of the instructions on another block while waiting for the
results to be available. Four blocks would certainly mask the latency better,
especially since our program makes several memory accesses.
Analysis
Finally, in spite of what we said earlier about this not being a horse race, we
couldn't resist the temptation of running the program on an 8800 GTX, which
proved to be three times as fast as the mobile 8600, independent of the size of the
blocks. You might think the result would be four or more times faster based on the
respective architectures: 128 ALUs compared to 32 and a higher clock frequency
(1.35 GHz compared to 950 MHz), but in practice that wasn't the case. Here again,
the most likely hypothesis is that we were limited by the memory accesses. To be
more precise, the initial image is accessed like a CUDA multidimensional array, a
very complicated term for what's really nothing more than a texture. There are
several advantages:
accesses get the benefit of the texture cache;
we have a wrapping mode, which avoids having to manage the edges of the
image, unlike the CPU version.
We could also have taken advantage of free filtering with normalized addressing
between [0,1] instead of [0, width] and [0, height], but that wasn't useful in our
case. As you know as a faithful reader, the 8600 has 16 texture units compared to
32 for the 8800 GTX, so there's only a two-to-one ratio between the two
architectures. Add the difference in frequency and we get a ratio of (32 x 0.575) /
(16 x 0.475) = 2.4, in the neighborhood of the three-to-one we actually observed.
That theory also has the advantage of explaining why the size of the blocks doesn't
change much on the G80, since the ALUs are limited by the texture units anyway.
In addition to the encouraging results, our first steps with CUDA went very well
considering the unfavorable conditions we'd chosen. Developing on a Vista laptop
means you're forced to use CUDA SDK 2.0, still in its beta stage, with the 174.55
driver, which is also in beta. Despite all that, we have no unpleasant surprises to
report, just a little scare when the first execution of our program, still very buggy,
tried to address memory beyond the allocated space.
The monitor blinked frenetically, then went black until Vista launched the video
driver recovery service and all was well. But you have to admit it was surprising
when you're used to seeing an ordinary Segmentation Fault with standard programs
in cases like that. Finally, one (very small) criticism of Nvidia: in all the
documentation available for CUDA, it's a shame not to find a little tutorial
explaining step by step how to set up the development environment in Visual
Studio. That's not too big a problem, since the SDK is full of example programs you
can explore to find out how to build the skeleton of a minimal project for a CUDA
application, but for beginners a tutorial would have been a lot more convenient.
WORKING
THE GRAPHICS PIPELINE
The task of any 3D graphics system is to synthesize an image from a description
of a scene. This scene contains the geometric primitives to be viewed as well as
descriptions of the lights illuminating the scene. GPU designers traditionally have
expressed this image-synthesis process as a hardware pipeline of specialized
stages.
Here, I will provide an overview of the classic graphics pipeline.
The goal is to highlight those aspects of the real-time rendering calculation that
allow graphics application developers to exploit modern GPUs as general-purpose
parallel computation engines.
How GPUs Work
1. Pipeline input
Most real-time graphics systems assume that everything is made of triangles,
and they first carve up any more complex shapes, such as quadrilaterals or
curved surface patches, into triangles. The developer uses a computer graphics
library (such as OpenGL or Direct3D) to provide each triangle to the graphics
pipeline one vertex at a time; the GPU assembles vertices into triangles as
needed.
2. Model transformations
A GPU can specify each logical object in a scene in its own locally defined
coordinate system.
This convenience comes at a price: before rendering, the GPU must first
transform all objects into a common coordinate system. To ensure that triangles
aren't warped or twisted into curved shapes, this transformation is limited to
simple affine operations such as rotations, translations, scalings, and the like.
3. Lighting
Once each triangle is in a global coordinate system, the GPU can compute its
color based on the lights in the scene. As an example, we describe the
calculations for a single point light source (imagine a very small lightbulb). The
GPU handles multiple lights by summing the contributions of each individual
light.
The traditional graphics pipeline supports the Phong lighting equation, a
phenomenological appearance model that approximates the look of plastic.
The Phong lighting equation gives the output color:
C = Kd Li (N . L) + Ks Li (R . V)^s
where Kd is the diffuse color, Ks the specular color, Li the light intensity, N the
surface normal, L the direction to the light, R the reflected light direction, V the
direction to the viewer, and s the shininess exponent.
4. Camera simulation
The graphics pipeline next projects each colored 3D triangle onto the virtual
camera's film plane. Like the model transformations, the GPU does this using
matrix-vector multiplication, again leveraging efficient vector operations in
hardware. This stage's output is a stream of triangles in screen coordinates, ready to
be turned into pixels.
5. Rasterization
Each visible screen-space triangle overlaps some pixels on the display;
determining these pixels is called rasterization. GPU designers have
incorporated many rasterization algorithms over the years, which all exploit one
crucial observation: each pixel can be treated independently from all other
pixels. Therefore, the machine can handle all pixels in parallel; indeed, some
exotic machines have had a processor for each pixel. This inherent independence
has led GPU designers to build increasingly parallel sets of pipelines.
6. Texturing
The actual color of each pixel can be taken directly from the lighting
calculations, but for added realism, images called textures are often draped over
the geometry to give the illusion of detail. GPUs store these textures in high-
speed memory, which each pixel calculation must access to determine or modify
that pixel's color.
7. Hidden surfaces
In most scenes, some objects obscure other objects. If each pixel were simply
written to display memory, the most recently submitted triangle would appear to
be in front. All modern GPUs provide a depth buffer, a region of memory that
stores the distance from each pixel to the viewer. Before writing to the display,
the GPU compares a pixel's distance to the distance of the pixel that's already
present, and it updates the display memory only if the new pixel is closer.
Figure 1. Programmable shading. The introduction of programmable shading in
2001 led to several visual effects not previously possible, such as this simulation
of refractive chromatic dispersion for a soap bubble effect.
Figure 2. Unprecedented visual realism. Modern GPUs can use programmable
shading to achieve near-cinematic realism, as this interactive demonstration
shows, featuring actress Adrianne Curry on an NVIDIA GeForce 8800 GTX.
Conclusion
Nvidia introduced CUDA with the release of the GeForce 8800. At that time the
promises they were making were extremely seductive, but we kept our enthusiasm
in check. After all, wasn't this likely to be just a way of staking out the territory and
surfing the GPGPU wave? Without an SDK available, you can't blame us for
thinking it was all just a marketing operation and that nothing really concrete would
come of it. It wouldn't be the first time a good initiative has been announced too
early and never really saw the light of day due to a lack of resources, especially in
such a competitive sector. Now, a year and a half after the announcement, we can
say that Nvidia has kept its word.
Not only was the SDK available quickly in a beta version, in early 2007, but it's
also been updated frequently, proving the importance of this project for Nvidia.
Today CUDA has developed nicely; the SDK is available in a beta 2.0 version for
the major operating systems (Windows XP and Vista, and Linux; 1.1 for Mac
OS X), and Nvidia is devoting an entire section of its site to developers.
On a more personal level, the impression we got from our first steps with CUDA
was extremely positive. Even if you're familiar with the GPU's architecture, it's
natural to be apprehensive about programming it, and while the API looks clear at
first glance, you can't keep from thinking it won't be easy to get convincing results
with the architecture. Won't the gain in processing time be siphoned off by the
multiple CPU-GPU transfers? And how do you make good use of those thousands of
threads with almost no synchronization primitives? We started our experimentation
with all these uncertainties in mind. But they soon evaporated when the first version
of our algorithm, trivial as it was, already proved to be significantly faster than the
CPU implementation.
So, CUDA is not a gimmick intended for researchers who want to cajole their
university into buying them a GeForce. CUDA is genuinely usable by any
programmer who knows C, provided he or she is ready to make a small investment
of time and effort to adapt to this new programming paradigm. That effort won't be
wasted provided your algorithms lend themselves to parallelization. We should also
tip our hat to Nvidia for providing ample, quality documentation to answer all the
questions of beginning programmers.
So what does CUDA need in order to become the API to reckon with? In a word:
portability. We know that the future of IT is in parallel computing; everybody's
preparing for the change, and all initiatives, both software and hardware, are taking
that direction. Currently, in terms of development paradigms, we're still in
prehistory: creating threads by hand and making sure to carefully plan access to
shared resources is still manageable today, when the number of processor cores can
be counted on the fingers of one hand; but in a few years, when processors will
number in the hundreds, that won't be a possibility. With CUDA, Nvidia is
proposing a first step in solving this problem, but the solution is obviously
reserved only for their own GPUs, and not even all of them. Only the GF8 and 9
(and their Quadro/Tesla derivatives) are currently able to run CUDA programs.
Nvidia may boast that it has sold 70 million CUDA-compatible GPUs worldwide,
but that's still not enough for it to impose itself as the de facto standard. All the
more so since their competitors aren't standing by idly. AMD is offering its own
SDK (Stream Computing) and Intel has also announced a solution (Ct), though it's
not available yet. So the war is on, and there won't be room for three competitors
unless another player, say Microsoft, were to step in and pick up all the marbles
with a common API, which would certainly be welcomed by developers.
So Nvidia still has a lot of challenges to meet to make CUDA stick, since while
technologically it's undeniably a success, the task now is to convince developers
that it's a credible platform, and that doesn't look like it'll be easy. However,
judging by the many recent announcements in the news about the API, the future
doesn't look unpromising.
Supported GPUs
A table of devices officially supporting CUDA (note that many applications require at least 256
MB of dedicated VRAM).

Nvidia GeForce
GeForce GTX 295
GeForce GTX 285
GeForce GTX 280
GeForce GTX 275
GeForce GTX 260
GeForce GTS 250
GeForce GT 220
GeForce G210
GeForce 9800 GX2
GeForce 9800 GTX+
GeForce 9800 GTX
GeForce 9800 GT
GeForce 9600 GSO
GeForce 9600 GT
GeForce 9500 GT
GeForce 9400 GT
GeForce 9400 mGPU
GeForce 9300 mGPU
GeForce 8800 Ultra
GeForce 8800 GTX
GeForce 8800 GTS
GeForce 8800 GT
GeForce 8800 GS
GeForce 8600 GTS
GeForce 8600 GT
GeForce 8600 mGT
GeForce 8500 GT
GeForce 8400 GS
GeForce 8300 mGPU
GeForce 8200 mGPU
GeForce 8100 mGPU

Nvidia GeForce Mobile
GeForce GTX 280M
GeForce GTX 260M
GeForce GTS 260M
GeForce GTS 250M
GeForce GTS 160M
GeForce GT 240M
GeForce GT 230M
GeForce GT 220M
GeForce G210M
GeForce 9800M GTX
GeForce 9800M GTS
GeForce 9800M GT
GeForce 9800M GS
GeForce 9700M GTS
GeForce 9700M GT
GeForce GT 130M
GeForce GT 120M
GeForce 9650M GT
GeForce 9650M GS
GeForce 9600M GT
GeForce 9600M GS
GeForce 9500M GS
GeForce 9500M G
GeForce 9400M G
GeForce 9300M GS
GeForce 9300M G
GeForce 9200M GS
GeForce 9100M G
GeForce 8800M GTS
GeForce 8700M GT
GeForce 8600M GT
GeForce 8600M GS

Nvidia Quadro
Quadro FX 5800
Quadro FX 5600
Quadro FX 4800
Quadro FX 4700 X2
Quadro FX 4600
Quadro FX 3700
Quadro FX 1800
Quadro FX 1700
Quadro FX 570
Quadro FX 370
Quadro NVS 290
Quadro Plex 1000 Model IV
Quadro Plex 1000 Model S4

Nvidia Quadro Mobile
Quadro FX 3600M
Quadro FX 1600M
Quadro FX 770M
Quadro FX 570M
Quadro FX 370M
Quadro NVS 360M
Quadro NVS 140M
Quadro NVS 135M
Quadro NVS 130M

Nvidia Tesla
Tesla S1070
Tesla C1060
Tesla C870
Tesla D870
Tesla S870
Example
This example code in C++ loads a texture from an image into an array on the GPU:

cudaArray* cu_array;
texture<float, 2> tex;

// Allocate array
cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
cudaMallocArray(&cu_array, &description, width, height);

// Copy image data to array
cudaMemcpy(cu_array, image, width*height*sizeof(float), cudaMemcpyHostToDevice);

// Bind the array to the texture
cudaBindTextureToArray(tex, cu_array);

// Run kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel<<<gridDim, blockDim>>>(d_odata, width, height);
cudaUnbindTexture(tex);

__global__ void kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    float c = tex2D(tex, x, y);
    odata[y*width+x] = c;
}
REFERENCES
1. GPU Computing. Proceedings of the IEEE, May 2008, Volume 96, Issue 5, pages 879-899.
ISSN 0018-9219. DOI: 10.1109/JPROC.2008.917757.
2. http://developer.nvidia.com/object/gpu_programming_guide.html
3. www.tomshardware.com/reviews/nvidia-cuda-gpu