8/2/2019 Report on Gpu
SEMINAR REPORT
ON
GRAPHIC PROCESSING
UNIT
Maharaja Agrasen Institute of technology,
PSP area,Sector 22, Rohini, New Delhi 110085
SUBMITTED BY:
DHIRAJ JAIN
0431482707
C2-2
INDEX

1. Acknowledgement
2. Introduction to GPU
3. GPU Forms
4. Stream Processing and GPGPU
5. CUDA
6. Nvidia's CUDA: The End of the CPU?
   6.1. Introduction
   6.2. The CUDA Architecture
   6.3. CUDA Software Development Kit
   6.4. CUDA APIs
   6.5. Theory: CUDA from the Hardware Point of View
   6.6. Theory: CUDA from the Software Point of View
   6.7. Performance
   6.8. Analysis
   6.9. Working
   6.10. Conclusion
7. References
ACKNOWLEDGEMENT
It is a pleasure to acknowledge my debt to the many people involved, directly or indirectly, in the production of this seminar. I would like to thank my faculty guide, Mrs. POOJA GUPTA, for providing motivational guidance during the entire preparation of the seminar, answering a number of my technical queries and, despite her busy schedule, always giving time to check the progress report.
I gratefully acknowledge the efforts of several of my colleagues who offered many suggestions throughout this project. Their constructive criticism and timely reviews have resulted in several improvements.
Thanks a lot for your guidance and support.
DHIRAJ JAIN
0431482707
Introduction To GPU
August 31, 1999 marks the introduction of the Graphics Processing Unit (GPU) for
the PC industry. The technical definition of a GPU is "a single chip processor with
integrated transform, lighting, triangle setup/clipping, and rendering engines that is
capable of processing a minimum of 10 million polygons per second."
The GPU changes everything you have ever seen or experienced on your PC.
As 3D becomes more pervasive in our lives, the need for faster processing speeds
increases. With the advent of the GPU, computationally intensive transform and
lighting calculations were offloaded from the CPU onto the GPU, allowing for
faster graphics processing speeds. This means scenes can increase in detail and
complexity without sacrificing performance. In essence, the GPU gives you truly
stunning realism for free.
The difficulty in virtual representations of the real world lies in robustly mimicking
how objects interact with one another and their surroundings, due to the intense,
split-second computations needed to process all the variables. These computations
once bottlenecked the CPU; with the GPU taking them over, that bottleneck no
longer applies.
(Figure: scene rendering before and after the GPU)
GPU forms
Dedicated graphics cards
The most powerful class of GPUs typically interface with the motherboard by means of an
expansion slot such as PCI Express (PCIe) or Accelerated Graphics Port (AGP) and can usually
be replaced or upgraded with relative ease, assuming the motherboard is capable of supporting
the upgrade. A few graphics cards still use Peripheral Component Interconnect (PCI) slots, but
their bandwidth is so limited that they are generally used only when a PCIe or AGP slot is
unavailable.
A dedicated GPU is not necessarily removable, nor does it necessarily interface with the
motherboard in a standard fashion. The term "dedicated" refers to the fact that dedicated graphics
cards have RAM that is dedicated to the card's use, not to the fact that most dedicated GPUs are
removable. Dedicated GPUs for portable computers are most commonly interfaced through a
non-standard and often proprietary slot due to size and weight constraints. Such ports may still
be considered PCIe or AGP in terms of their logical host interface, even if they are not physically
interchangeable with their counterparts.
Technologies such as SLI by NVIDIA and CrossFire by ATI allow multiple GPUs to be used to
draw a single image, increasing the processing power available for graphics.
Integrated graphics solutions
Intel GMA X3000 IGP (under heatsink)
Integrated graphics solutions, or shared graphics solutions, are graphics
processors that utilize a portion of a computer's system RAM rather than dedicated
graphics memory. Computers with integrated graphics account for 90% of all PC
shipments. These solutions are cheaper to implement than dedicated graphics
solutions, but are less capable. Historically, integrated solutions were often
considered unfit to play 3D games or run graphically intensive programs such as
Adobe Flash (examples of such IGPs would be offerings from SiS and VIA circa
2004). However, today's integrated solutions, such as Intel's GMA X4500HD
(Intel G45 chipset), AMD's Radeon HD 3200 (AMD 780G chipset) and NVIDIA's
GeForce 8200 (NVIDIA nForce 730a), are more than capable of handling 2D
graphics from Adobe Flash or low-stress 3D graphics. However, most integrated
graphics still struggle with high-end video games. Chips like the Nvidia GeForce
9400M in Apple's MacBook and MacBook Pro and AMD's Radeon HD 3300
(AMD 790GX) have improved performance, but still lag behind dedicated graphics
cards. Modern desktop motherboards often include an integrated graphics solution
and have expansion slots available to add a dedicated graphics card later.
As a GPU is extremely memory intensive, an integrated solution may find itself
competing with the CPU for the already slow system RAM, as it has minimal or no
dedicated video memory. System RAM bandwidth may range from 2 GB/s to
12.8 GB/s, yet dedicated GPUs enjoy between 10 GB/s and over 100 GB/s of
bandwidth, depending on the model.
Older integrated graphics chipsets lacked hardware transform and lighting, but
newer ones include it.
Hybrid solutions
This newer class of GPUs competes with integrated graphics in the low-end
desktop and notebook markets. The most common implementations are
ATI's HyperMemory and NVIDIA's TurboCache. Hybrid graphics cards are
somewhat more expensive than integrated graphics, but much less expensive than
dedicated graphics cards. These also share memory with the system, but have a
smaller dedicated amount of it than discrete graphics cards do, to make up for the
high latency of the system RAM. Technologies within PCI Express make this
possible. While these solutions are sometimes advertised as having as much as
768 MB of RAM, this refers to how much can be shared with the system memory.
Stream Processing and General-Purpose GPUs (GPGPU)
A newer concept is to use a modified form of a stream processor to allow a general-
purpose graphics processing unit. This concept turns the massive floating-point
computational power of a modern graphics accelerator's shader pipeline into
general-purpose computing power, as opposed to being hard-wired solely to do
graphical operations. In certain applications requiring massive vector operations,
this can yield several orders of magnitude higher performance than a conventional
CPU. The two largest discrete (see "Dedicated graphics cards" above) GPU
designers, ATI and NVIDIA, are beginning to pursue this new market with an
array of applications. Both Nvidia and ATI have teamed with Stanford University
to create a GPU-based client for the Folding@Home distributed computing project
(for protein folding calculations). In certain circumstances the GPU calculates forty
times faster than the conventional CPUs traditionally used in such applications.
Recently Nvidia began releasing cards supporting an API extension to the C
programming language called CUDA ("Compute Unified Device Architecture"),
which allows specified functions from a normal C program to run on the GPU's
stream processors. This makes C programs capable of taking advantage of a GPU's
ability to operate on large matrices in parallel, while still making use of the CPU
where appropriate. CUDA is also the first API to allow CPU-based applications to
directly access the resources of a GPU for more general-purpose computing
without the limitations of using a graphics API.
Since 2005 there has been interest in using the performance offered by GPUs for
evolutionary computation in general, and for accelerating the fitness evaluation in
genetic programming in particular. There is a short introduction on pages 90-92 of
A Field Guide to Genetic Programming. Most approaches compile linear or tree
programs on the host PC and transfer the executable to the GPU to run. Typically
the performance advantage is only obtained by running the single active program
simultaneously on many example problems in parallel, using the GPU's SIMD
architecture. However, substantial acceleration can also be obtained by not
compiling the programs but instead transferring them to the GPU and interpreting
them there. Acceleration can then be obtained by interpreting multiple programs
simultaneously, running multiple example problems simultaneously, or a
combination of both. A modern GPU (e.g. the 8800 GTX or later) can readily
interpret hundreds of thousands of very small programs simultaneously.
CUDA: Compute Unified Device Architecture
What is CUDA?
NVIDIA CUDA (Compute Unified Device Architecture) technology is the world's
only C language environment that enables programmers and developers to write
software to solve complex computational problems in a fraction of the time by
tapping into the many-core parallel processing power of GPUs. With millions of
CUDA-capable GPUs already deployed, thousands of software programmers are
already using the free CUDA software tools to accelerate applications, from video
and audio encoding to oil and gas exploration, product design, medical imaging,
and scientific research.
Providing orders of magnitude more performance than current CPUs and
simplifying software development by extending the standard C language, CUDA
technology enables developers to create innovative solutions for data-intensive
problems. For advanced research and language development, CUDA includes a
low-level assembly language layer and driver interface.
CUDA is a software and GPU architecture that makes it possible to use the many
processor cores (and eventually thousands of cores) in a GPU to perform general-
purpose mathematical calculations. CUDA is accessible to all programmers
through an extension to the C and C++ programming languages for parallel
computing.
Technology Features:
- Standard C language for parallel application development on the GPU
- Standard numerical libraries for FFT (Fast Fourier Transform) and BLAS (Basic Linear Algebra Subroutines)
- Dedicated CUDA driver for computing, with a fast data transfer path between GPU and CPU
- CUDA driver interoperates with OpenGL and DirectX graphics drivers
- Support for Linux 32/64-bit and Windows XP 32/64-bit operating systems
How is CUDA different from GPGPU?
CUDA is designed from the ground-up for efficient general purpose computation
on GPUs. It uses a C-like programming language and does not require remapping
algorithms to graphics concepts. CUDA is an extension to C for parallel
computing. It allows the programmer to program in C, without the need to translate
problems into graphics concepts. Anyone who can program C can swiftly learn to
program in CUDA.
GPGPU (General-Purpose computation on GPUs) uses graphics APIs like DirectX
and OpenGL for computation. It requires detailed knowledge of graphics APIs and
hardware. The programming model is limited in terms of random read and write
and thread cooperation.
CUDA exposes several hardware features that are not available via the graphics
API. The most significant of these is shared memory, a small (currently
16 KB per multiprocessor) area of on-chip memory that can be accessed in
parallel by blocks of threads. This allows caching of frequently used data and can
provide large speedups over using textures to access data. Combined with a thread
synchronization primitive, this allows cooperative parallel processing of on-chip
data, greatly reducing the expensive off-chip bandwidth requirements of many
parallel algorithms. This benefits a number of common applications such as linear
algebra, Fast Fourier Transforms, and image processing filters.
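As a rough sketch of the idea (an illustration, not code from any NVIDIA sample), a kernel can stage data in shared memory, synchronize with the barrier primitive, and then let threads read their neighbours' values on-chip instead of from slow off-chip memory. The kernel name and the fixed block size of 256 are assumptions of this example:

```cuda
// Sketch: a 3-point moving average. Each block stages its slice of the input
// in shared memory, synchronizes, then every thread reads its neighbours
// on-chip. Assumes one-dimensional blocks of 256 threads.
__global__ void avg3(const float *in, float *out, int n)
{
    __shared__ float tile[256 + 2];              // one halo element on each side
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x + 1] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)                        // left halo (zero-padded at the edge)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)           // right halo
        tile[blockDim.x + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;

    __syncthreads();                             // the thread-synchronization primitive

    if (i < n)
        out[i] = (tile[threadIdx.x] + tile[threadIdx.x + 1]
                  + tile[threadIdx.x + 2]) / 3.0f;
}
```

Without shared memory, each of the three reads per thread would go to off-chip memory; here each input element is fetched from off-chip memory only once per block.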
Whereas fragment programs in the graphics API are limited to outputting 32 floats
(RGBA * 8 render targets) at a pre-specified location, CUDA supports scattered
writes - i.e. an unlimited number of stores to any address. This enables many new
algorithms that were not possible to perform efficiently using graphics-based
GPGPU.
The graphics API forces the user to store data in textures, which requires packing
long arrays into 2D textures. This is cumbersome and imposes extra addressing
math. CUDA can perform loads from any address. CUDA also offers highly
optimized data transfers to and from the GPU.
Nvidia's CUDA: The End of the
CPU?
Introduction
Let's take a trip back in time, way back to 2003, when Intel and AMD became
locked in a fierce struggle to offer increasingly powerful processors. In just a few
years, clock speeds increased quickly as a result of that competition, especially
with Intel's release of its Pentium 4.
But the clock speed race would soon hit a wall. After riding the wave of sustained
clock speed boosts (between 2001 and 2003 the Pentium 4's clock speed doubled
from 1.5 to 3 GHz), users now had to settle for improvements of a few measly
megahertz that the chip makers managed to squeeze out (between 2003 and 2005,
clock speeds only increased from 3 to 3.8 GHz).
Even architectures optimized for high clock speeds, like the Prescott, ran afoul of
the problem, and for good reason: this time the challenge wasn't simply an
industrial one. The chip makers had simply come up against the laws of physics.
Some observers were even prophesying the end of Moore's Law. But that was far
from being the case. While its original meaning has often been misinterpreted, the
real subject of Moore's Law was the number of transistors on a given surface area
of silicon. For a long time, the increase in the number of transistors in a CPU
was accompanied by a concomitant increase in performance, which no doubt
explains the confusion. But then the situation became complicated. CPU architects
had come up against the law of diminishing returns: the number of transistors that
had to be added to achieve a given gain in performance was becoming ever greater,
and the approach was headed for a dead end.
The CUDA Architecture
The CUDA architecture consists of several components, in the green boxes
below:
1. Parallel compute engines inside NVIDIA GPUs
2. OS kernel-level support for hardware initialization, configuration, etc.
3. User-mode driver, which provides a device-level API for developers
4. PTX instruction set architecture (ISA) for parallel computing kernels and
functions
The CUDA Software Development
Environment
The CUDA Software Development Environment provides all the tools, examples
and documentation necessary to develop applications that take advantage of the
CUDA architecture.
Libraries: Advanced libraries that include BLAS, FFT, and other functions
optimized for the CUDA architecture

C Runtime: The C Runtime for CUDA provides support for executing standard C
functions on the GPU and allows native bindings for other high-level languages
such as Fortran, Java, and Python

Tools: NVIDIA C Compiler (nvcc), CUDA Debugger (cuda-gdb), CUDA Visual
Profiler (cudaprof), and other helpful tools

Documentation: Includes the CUDA Programming Guide, API specifications, and
other helpful documentation

Samples: SDK code samples and documentation that demonstrate best practices
for a wide variety of GPU computing algorithms and applications
The CUDA Software Development Environment supports two different
programming interfaces:
1. A device-level programming interface, in which the application uses DirectX
Compute, OpenCL or the CUDA Driver API directly to configure the GPU, launch
compute kernels, and read back results.
2. A language integration programming interface, in which an application uses the
C Runtime for CUDA and developers use a small set of extensions to indicate
which compute functions should be performed on the GPU instead of the CPU.
When using the device-level programming interface, developers write compute
kernels in separate files using the kernel language supported by their API of
choice. DirectX Compute kernels (aka compute shaders) are written in HLSL.
OpenCL kernels are written in a C-like language called OpenCL C. The CUDA
Driver API accepts kernels written in C or PTX assembly.
When using the language integration programming interface, developers write
compute functions in C and the C Runtime for CUDA automatically handles
setting up the GPU and executing the compute functions. This programming
interface enables developers to take advantage of native support for high-level
languages such as C, C++, Fortran, Java, Python, and more (see below), reducing
code complexity and development costs through type integration and code
integration:
Type integration allows standard types as well as vector types and user-defined
types (including structs) to be used seamlessly across functions that are executed on
the CPU and functions that are executed on the GPU.
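A minimal sketch of the language integration interface (an illustration, not an example from the report; the kernel name and sizes are assumptions): a function marked __global__ runs on the GPU, and the C Runtime for CUDA handles device setup when it is launched from ordinary C code with the <<<...>>> syntax:

```cuda
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    float *a, *b, *c;
    cudaMalloc((void **)&a, n * sizeof(float));  // allocations in device memory
    cudaMalloc((void **)&b, n * sizeof(float));
    cudaMalloc((void **)&c, n * sizeof(float));

    // 4 blocks of 256 threads cover all 1024 elements.
    add<<<4, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();                     // wait for the GPU to finish

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The same source file mixes host and device code; nvcc splits it and compiles each part for the right processor.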
The CUDA APIs
But Brook's critical success was enough to attract the attention of ATI and Nvidia,
since the two giants saw the incipient interest in this type of initiative as an
opportunity to broaden their market even more by reaching a new sector that had so
far been indifferent to their graphics achievements.
Researchers who were in on Brook at its origin quickly joined the Santa Clara
development teams to put together a global strategy for targeting the new market.
The idea was to offer a hardware/software ensemble suited to this type of
calculation. Since Nvidia's developers know all the secrets of their GPU, there was
no question of relying only on a graphics API, which communicates with the
hardware only via a driver, with all the problems that implies, as we saw above. So the
CUDA (Compute Unified Device Architecture) development team created a set of
software layers to communicate with the GPU.
As you can see on this diagram, CUDA provides two APIs:
A high-level API: the CUDA Runtime API;
A low-level API: the CUDA Driver API.
Since the high-level API is implemented above the low-level API, each call to a
function of the Runtime is broken down into more basic instructions managed by
the Driver API. Note that these two APIs are mutually exclusive: the programmer
must use one or the other, but it's not possible to mix function calls from both. The
term "high-level API" is relative. Even the Runtime API is still what a lot of people
would consider very low-level; yet it still offers functions that are highly practical
for initialization or context management. But don't expect a lot more abstraction;
you still need a good knowledge of Nvidia GPUs and how they work.
The Driver API, then, is more complex to manage; it requires more work to launch
processing on the GPU. But the upside is that it's more flexible, giving the
programmer who wants it additional control. The two APIs are capable of
communicating with OpenGL or Direct3D resources (only Direct3D 9 for the
moment). The usefulness of this is obvious: CUDA could be used to generate
resources (geometry, procedural textures, etc.) that could then be passed to the
graphics API, or conversely, the 3D API could send the results of the rendering
to CUDA, which in that case would be used to perform post-processing. There are
numerous examples of such interactions, and the advantage is that the resources
remain stored in the GPU's RAM without having to transit through the bottleneck
of the PCI Express bus.
Conversely, we should point out that sharing resources, in this case video memory,
with graphics data is not always idyllic and can lead to a few headaches. For
example, on a change of resolution or color depth, the graphics data have priority.
So, if the resources for the frame buffer need to increase, the driver won't hesitate
to grab the ones that are allocated to applications using CUDA, causing them to
crash. It's not very elegant, granted, but you have to admit that the situation
shouldn't come up very often. And since we're on the subject of little
disadvantages: if you want to use several GPUs for a CUDA application, you'll
have to disable SLI mode first, or only a single GPU will be visible to CUDA.
Finally, the third software layer is a set of libraries, two to be precise:
CUBLAS, which has a set of building blocks for linear algebra calculations
on the GPU;
CUFFT, which can handle calculation of Fourier transforms, an algorithm
much used in the field of signal processing.
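To give a feel for the library layer, here is a hedged sketch of a SAXPY call (y = alpha*x + y) through the legacy CUBLAS interface of that era; the vector length and values are made up for the example:

```cuda
#include <cublas.h>

int main(void)
{
    const int n = 1024;
    float x[1024], y[1024];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    cublasInit();                                    // initialize the CUBLAS library

    float *dx, *dy;
    cublasAlloc(n, sizeof(float), (void **)&dx);     // allocate vectors on the GPU
    cublasAlloc(n, sizeof(float), (void **)&dy);
    cublasSetVector(n, sizeof(float), x, 1, dx, 1);  // copy host -> device
    cublasSetVector(n, sizeof(float), y, 1, dy, 1);

    cublasSaxpy(n, 3.0f, dx, 1, dy, 1);              // y = 3*x + y, computed on the GPU

    cublasGetVector(n, sizeof(float), dy, 1, y, 1);  // copy the result back

    cublasFree(dx);
    cublasFree(dy);
    cublasShutdown();
    return 0;
}
```

No kernel is written at all: the application stays in plain host C, and the library supplies the GPU code.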
A Few Definitions
Before we dive into CUDA, let's define a few terms that are sprinkled throughout
Nvidia's documentation. The company has chosen to use a rather special
terminology that can be hard to grasp. First we need to define what a thread is in
CUDA, because the term doesn't have quite the same meaning as a CPU thread,
nor is it the equivalent of what we call "threads" in our GPU articles. A thread on
the GPU is a basic element of the data to be processed. Unlike CPU threads, CUDA
threads are extremely lightweight, meaning that a context change between two
threads is not a costly operation.
The second term frequently encountered in the CUDA documentation is warp. No
confusion possible this time (unless you think the term might have something to do
with Star Trek or Warhammer). No, the term is taken from the terminology of
weaving, where it designates the threads arranged lengthwise on a loom and crossed
by the woof. A warp in CUDA, then, is a group of 32 threads, which is the
minimum size of the data processed in SIMD fashion by a CUDA multiprocessor.
But that granularity is not always sufficient to be easily usable by a programmer,
and so in CUDA, instead of manipulating warps directly, you work with blocks that
can contain 64 to 512 threads.
Finally, these blocks are put together in grids. The advantage of the grouping is that
the number of blocks processed simultaneously by the GPU is closely linked to
hardware resources, as we'll see later. The number of blocks in a grid makes it
possible to totally abstract that constraint and apply a kernel to a large quantity of
threads in a single call, without worrying about fixed resources. The CUDA
runtime takes care of breaking it all down for you. This means that the model is
extremely extensible. If the hardware has few resources, it executes the blocks
sequentially; if it has a very large number of processing units, it can process them in
parallel. This in turn means that the same code can target entry-level GPUs,
high-end ones, and even future GPUs.
The other terms you'll run into frequently in the CUDA API are used to designate
the CPU, which is called the host, and the GPU, which is referred to as the device.
After that little introduction, which we hope hasn't scared you away, it's time to
plunge in!
The Theory: CUDA from the Hardware Point of
View
With CUDA, Nvidia presents its architecture in a slightly different way and
exposes certain details that hadn't been revealed before now.
As you can see above, Nvidia's shader core is made up of several clusters Nvidia
calls Texture Processor Clusters. An 8800 GTX, for example, has eight clusters,
an 8800 GTS six, and so on. Each cluster, in fact, is made up of a texture unit and
two streaming multiprocessors. These processors consist of a front end that
reads/decodes and launches instructions, and a back end made up of a group of eight
calculating units and two SFUs (Super Function Units), where the instructions are
executed in SIMD fashion: the same instruction is applied to all the threads in the
warp. Nvidia calls this mode of execution SIMT (for single instruction, multiple
threads). It's important to point out that the back end operates at double the
frequency of the front end. In practice, then, the part that executes the instructions
appears to be twice as wide as it actually is (that is, as a 16-way SIMD unit
instead of an eight-way one). The streaming multiprocessor's operating mode is as
follows: at each cycle, a warp ready for execution is selected by the front end,
which launches execution of an instruction. To apply the instruction to all 32
threads in the warp, the back end will take four cycles, but since it operates at
double the frequency of the front end, from the front end's point of view only two
cycles will be executed. So, to avoid having the front end remain unused for one
cycle, and to maximize the use of the hardware, the ideal is to alternate types of
instructions every cycle: a classic instruction for one cycle and an SFU instruction
for the other.
Each multiprocessor also has a certain amount of resources that should be understood
in order to make the best use of them. It has a small memory area called
shared memory, with a size of 16 KB per multiprocessor. This is not a cache
memory: the programmer has a free hand in its management. As such, it's like the
Local Store of the SPUs on Cell processors. This detail is particularly interesting,
and demonstrates the fact that CUDA is indeed a set of software and hardware
technologies. This memory area is not used for pixel shaders; as Nvidia says,
tongue in cheek, "We dislike pixels talking to each other."
This memory area provides a way for threads in the same block to communicate.
It's important to stress the restriction: all the threads in a given block are guaranteed
to be executed by the same multiprocessor. Conversely, the assignment of blocks to
the different multiprocessors is completely undefined, meaning that two threads
from different blocks can't communicate during their execution. That means that
using this memory is complicated. But it can also be worthwhile, because except
for cases where several threads try to access the same memory bank, causing a
conflict, access to shared memory is as fast as access to the registers.
The shared memory is not the only memory the multiprocessors can access.
Obviously they can use the video memory, but it has lower bandwidth and higher
latency. Consequently, to limit too-frequent access to this memory, Nvidia has also
provided its multiprocessors with a cache (approximately 8 KB per multiprocessor)
for access to constants and textures.
The multiprocessors also have 8,192 registers that are shared among all the threads
of all the blocks active on that multiprocessor. The number of active blocks per
multiprocessor can't exceed eight, and the number of active warps is limited to 24
(768 threads). So, an 8800 GTX can have up to 12,288 threads being processed at a
given instant. It's worth mentioning all these limits because they help in
dimensioning the algorithm as a function of the available resources.
Optimizing a CUDA program, then, essentially consists of striking the optimum
balance between the number of blocks and their size: more threads per block will
be useful in masking the latency of the memory operations, but at the same time the
number of registers available per thread is reduced. What's more, a block of 512
threads would be particularly inefficient, since only one block could be active on a
multiprocessor, potentially wasting 256 thread slots. So, Nvidia advises using blocks of
128 to 256 threads, which offers the best compromise between masking latency and
the number of registers needed for most kernels.
The Theory: CUDA from the Software Point of
View
From a software point of view, CUDA consists of a set of extensions to the C
language, which of course recalls BrookGPU, and a few specific API calls. Among
the extensions are type qualifiers that apply to functions and variables. The
keyword to remember here is __global__, which when prefixed to a function
indicates that the latter is a kernel, that is, a function that will be called by the
CPU and executed by the GPU. The __device__ keyword designates a function that
will be executed by the GPU (which CUDA refers to as the "device") but can only
be called from the GPU (in other words, from another __device__ function or from
a __global__ function). Finally, the __host__ keyword is optional, and designates a
function that's called by the CPU and executed by the CPU; in other words, a
traditional function.
There are a few restrictions associated with __device__ and __global__ functions:
they can't be recursive (that is, they can't call themselves) and they can't have a
variable number of arguments. Finally, since __device__ functions are resident in
the GPU's memory space, it is, logically enough, impossible to obtain their address.
Variables also have new qualifiers that allow control of the memory area where
they'll be stored. A variable preceded by the keyword __shared__ indicates that it
will be stored in the streaming multiprocessor's shared memory. The way a
__global__ function is called is also a little different. That's because the execution
configuration has to be defined at the time of the call; more concretely, the size of
the grid to which the kernel is applied and the size of each block. Take the example
of a kernel with the following signature:
__global__ void Func(float* parameter);
which will be called as follows:
Func<<<Dg, Db>>>(parameter);
where Dg is the grid dimension and Db the dimension of a block. These two
variables are of dim3, a new vector type introduced by CUDA.
The CUDA API essentially comprises functions for memory manipulation in
VRAM: cudaMalloc to allocate memory, cudaFree to free it and cudaMemcpy to
copy data between RAM and VRAM and vice-versa.
We'll end this overview with the way a CUDA program is compiled, which is
interesting: compiling is done in several phases. First of all, the code dedicated to
the CPU is extracted from the file and passed to the standard compiler. The code
dedicated to the GPU is first converted into an intermediate language, PTX. This
intermediate language is like an assembler, and so enables the generated source
code to be studied for potential inefficiencies. Finally, the last phase translates this
intermediate language into commands that are specific to the GPU and encapsulates
them in binary form in the executable.
Performance
However, we did decide to measure the processing time to see if there was any
advantage to using CUDA even with our crude implementation, or on the other
hand if it was going to take long, exhaustive practice to get any real control over the
use of the GPU. The test machine was our development box, a laptop computer
with a Core 2 Duo T5450 and a GeForce 8600M GT, operating under Vista. It's far
from being a supercomputer, but the results are interesting since our test is not all
that favorable to the GPU. It's fine for Nvidia to show us huge accelerations on
systems equipped with monster GPUs and enormous bandwidth, but in practice
many of the 70 million CUDA GPUs existing on current PCs are much less
powerful, and so our test is quite germane.
The results we got are as follows for processing a 2048x2048 image:
CPU, 1 thread: 1,419 ms
CPU, 2 threads: 749 ms
CPU, 4 threads: 593 ms
GPU (8600M GT), blocks of 256 pixels: 109 ms
GPU (8600M GT), blocks of 128 pixels: 94 ms
GPU (8800 GTX), blocks of 128 or 256 pixels: 31 ms
Several observations can be made about these results. First of all, you'll notice that
despite our crack about programmers' laziness, we did modify the initial CPU
implementation by threading it. As we said, the code is ideal for this situation: all
you do is break down the initial image into as many zones as there are threads. Note
that we got an almost linear acceleration going from one to two threads on our dual-
core CPU, which shows the strongly parallel nature of our test program. Fairly
unexpectedly, the four-thread version proved faster, whereas we were expecting to
see no difference at all on our processor, or even, more logically, a slight loss
of efficiency due to the additional cost generated by the creation of the additional
threads. What explains that result? It's hard to say, but it may be that the Windows
thread scheduler has something to do with it; in any case, the result was
reproducible. With a texture of smaller dimensions (512x512), the gain achieved
by threading was a lot less marked (approximately 35% as opposed to 100%) and
the behavior of the four-thread version was more logical, showing no gain over the
two-thread version. The GPU was still faster, but less markedly so (the 8600M GT
was three times faster than the two-thread version).
The second notable observation is that even the slowest GPU implementation was
nearly six times faster than the best-performing CPU version. For a first program
and a trivial version of the algorithm, that's very encouraging. Notice also that we
got significantly better results using smaller blocks, whereas intuitively you might
think that the reverse would be true. The explanation is simple: our program uses
14 registers per thread, so with 256-thread blocks it needs 3,584 registers
per block, and to saturate a multiprocessor it takes 768 threads, as we saw. In our
case, that's three blocks, or 10,752 registers. But a multiprocessor has only 8,192
registers, so it can only keep two blocks active. Conversely, with blocks of 128
pixels, we need 1,792 registers per block; 8,192 divided by 1,792 and rounded
down works out to four blocks being processed. In practice, the
number of threads is the same (512 per multiprocessor, whereas theoretically it
takes 768 to saturate it), but having more blocks gives the GPU additional
flexibility with memory access: when an operation with a long latency is executed,
it can launch execution of the instructions on another block while waiting for the
results to be available. Four blocks would certainly mask the latency better,
especially since our program makes several memory accesses.
Analysis
Finally, in spite of what we said earlier about this not being a horse race, we
couldn't resist the temptation of running the program on an 8800 GTX, which
proved to be three times as fast as the mobile 8600, independent of the size of the
blocks. You might think the result would be four or more times faster based on the
respective architectures: 128 ALUs compared to 32 and a higher clock frequency
(1.35 GHz compared to 950 MHz), but in practice that wasn't the case. Here again,
the most likely hypothesis is that we were limited by the memory accesses. To be
more precise, the initial image is accessed like a CUDA multidimensional array, a
very complicated term for what's really nothing more than a texture. There are
several advantages:
accesses get the benefit of the texture cache;
we have a wrapping mode, which avoids having to manage the edges of the
image, unlike the CPU version.
We could also have taken advantage of free filtering with normalized addressing
between [0,1] instead of [0, width] and [0, height], but that wasn't useful in our
case. As you know as a faithful reader, the 8600 has 16 texture units compared to
32 for the 8800 GTX, so there's only a two-to-one ratio between the two
architectures. Add the difference in frequency and we get a ratio of (32 x 0.575) /
(16 x 0.475) = 2.4, in the neighborhood of the three-to-one we actually observed.
That theory also has the advantage of explaining why the size of the blocks doesn't
change much on the G80, since the ALUs are limited by the texture units anyway.
In addition to the encouraging results, our first steps with CUDA went very well
considering the unfavorable conditions we'd chosen. Developing on a Vista laptop
means you're forced to use CUDA SDK 2.0, still in its beta stage, with the 174.55
driver, which is also in beta. Despite all that, we have no unpleasant surprises to
report, just a little scare when the first execution of our program, still very buggy,
tried to address memory beyond the allocated space.
The monitor blinked frenetically, then went black until Vista launched the video
driver recovery service and all was well. But you have to admit it was surprising
when you're used to seeing an ordinary Segmentation Fault with standard programs
in cases like that. Finally, one (very small) criticism of Nvidia: in all the
documentation available for CUDA, it's a shame not to find a little tutorial
explaining step by step how to set up the development environment in Visual
Studio. That's not too big a problem, since the SDK is full of example programs you
can explore to find out how to build the skeleton of a minimal project for a CUDA
application, but for beginners a tutorial would have been a lot more convenient.
WORKING
THE GRAPHICS PIPELINE
The task of any 3D graphics system is to synthesize an image from a description
of a scene. This scene contains the geometric primitives to be viewed as well as
descriptions of the lights illuminating the scene. GPU designers traditionally have
expressed this image-synthesis process as a hardware pipeline of specialized
stages.
Here, I will provide an overview of the classic graphics pipeline.
The goal is to highlight those aspects of the real-time rendering calculation that
allow graphics application developers to exploit modern GPUs as general-purpose
parallel computation engines.
How GPUs Work
1. Pipeline input
Most real-time graphics systems assume that everything is made of triangles,
and they first carve up any more complex shapes, such as quadrilaterals or
curved surface patches, into triangles. The developer uses a computer graphics
library (such as OpenGL or Direct3D) to provide each triangle to the graphics
pipeline one vertex at a time; the GPU assembles vertices into triangles as
needed.
2. Model transformations
A GPU can specify each logical object in a scene in its own locally defined
coordinate system.
This convenience comes at a price: before rendering, the GPU must first
transform all objects into a common coordinate system. To ensure that triangles
aren't warped or twisted into curved shapes, this transformation is limited to
simple affine operations such as rotations, translations, scalings, and the like.
3. Lighting
Once each triangle is in a global coordinate system, the GPU can compute its
color based on the lights in the scene. As an example, we describe the
calculations for a single point light source (imagine a very small lightbulb). The
GPU handles multiple lights by summing the contributions of each individual
light.
The traditional graphics pipeline supports the Phong lighting equation, a
phenomenological appearance model that approximates the look of plastic.
The Phong lighting equation gives the output color:
C = Kd Li (N . L) + Ks Li (R . V)^s
where Kd is the diffuse color, Ks the specular color, Li the light intensity, N the
surface normal, L the direction to the light, R the reflected light direction, V the
direction to the viewer, and s the shininess exponent.
4. Camera simulation
The graphics pipeline next projects each colored 3D triangle onto the virtual
camera's film plane. Like the model transformations, the GPU does this using
matrix-vector multiplication, again leveraging efficient vector operations in
hardware. This stage's output is a stream of triangles in screen coordinates, ready to
be turned into pixels.
5. Rasterization
Each visible screen-space triangle overlaps some pixels on the display;
determining these pixels is called rasterization. GPU designers have
incorporated many rasterization algorithms over the years, which all exploit one
crucial observation: each pixel can be treated independently from all other
pixels. Therefore, the machine can handle all pixels in parallel; indeed, some
exotic machines have had a processor for each pixel. This inherent independence
has led GPU designers to build increasingly parallel sets of pipelines.
6. Texturing
The actual color of each pixel can be taken directly from the lighting
calculations, but for added realism, images called textures are often draped over
the geometry to give the illusion of detail. GPUs store these textures in high-
speed memory, which each pixel calculation must access to determine or modify
that pixel's color.
7. Hidden surfaces
In most scenes, some objects obscure other objects. If each pixel were simply
written to display memory, the most recently submitted triangle would appear to
be in front. All modern GPUs provide a depth buffer, a region of memory that
stores the distance from each pixel to the viewer. Before writing to the display,
the GPU compares a pixel's distance to the distance of the pixel that's already
present, and it updates the display memory only if the new pixel is closer.
Figure 1. Programmable shading. The introduction of programmable shading in
2001 led to several visual effects not previously possible, such as this simulation
of refractive chromatic dispersion for a soap bubble effect.
Figure 2. Unprecedented visual realism. Modern GPUs can use programmable
shading to achieve near-cinematic realism, as this interactive demonstration
shows, featuring actress Adrianne Curry on an NVIDIA GeForce 8800 GTX.
Conclusion
Nvidia introduced CUDA with the release of the GeForce 8800. At that time the
promises they were making were extremely seductive, but we kept our enthusiasm
in check. After all, wasn't this likely to be just a way of staking out the territory and
surfing the GPGPU wave? Without an SDK available, you can't blame us for
thinking it was all just a marketing operation and that nothing really concrete would
come of it. It wouldn't be the first time a good initiative has been announced too
early and never really saw the light of day due to a lack of resources, especially in
such a competitive sector. Now, a year and a half after the announcement, we can
say that Nvidia has kept its word.
Not only was the SDK available quickly in a beta version, in early 2007, but it's
also been updated frequently, proving the importance of this project for Nvidia.
Today CUDA has developed nicely; the SDK is available in a beta 2.0 version for
the major operating systems (Windows XP and Vista, and Linux; 1.1 for Mac
OS X), and Nvidia is devoting an entire section of its site to developers.
On a more personal level, the impression we got from our first steps with CUDA
was extremely positive. Even if you're familiar with the GPU's architecture, it's
natural to be apprehensive about programming it, and while the API looks clear at
first glance, you can't keep from thinking it won't be easy to get convincing results
with the architecture. Won't the gain in processing time be siphoned off by the
multiple CPU-GPU transfers? And how do you make good use of those thousands of
threads with almost no synchronization primitives? We started our experimentation
with all these uncertainties in mind. But they soon evaporated when the first version
of our algorithm, trivial as it was, already proved to be significantly faster than the
CPU implementation.
So, CUDA is not a gimmick intended for researchers who want to cajole their
university into buying them a GeForce. CUDA is genuinely usable by any
programmer who knows C, provided he or she is ready to make a small investment
of time and effort to adapt to this new programming paradigm. That effort won't be
wasted provided your algorithms lend themselves to parallelization. We should also
tip our hat to Nvidia for providing ample, quality documentation to answer all the
questions of beginning programmers.
So what does CUDA need in order to become the API to reckon with? In a word:
portability. We know that the future of IT is in parallel computing; everybody's
preparing for the change, and all initiatives, both software and hardware, are taking
that direction. Currently, in terms of development paradigms, we're still in
prehistory: creating threads by hand and making sure to carefully plan access to
shared resources is still manageable today, when the number of processor cores can
be counted on the fingers of one hand; but in a few years, when processors will
number in the hundreds, that won't be a possibility. With CUDA, Nvidia is
proposing a first step in solving this problem, but the solution is obviously
reserved only for their own GPUs, and not even all of them. Only the GF8 and 9
(and their Quadro/Tesla derivatives) are currently able to run CUDA programs.
Nvidia may boast that it has sold 70 million CUDA-compatible GPUs worldwide,
but that's still not enough for it to impose itself as the de facto standard. All the
more so since their competitors aren't standing by idly. AMD is offering its own
SDK (Stream Computing) and Intel has also announced a solution (Ct), though it's
not available yet. So the war is on, and there won't be room for three competitors
unless another player, say Microsoft, were to step in and pick up all the marbles
with a common API, which would certainly be welcomed by developers.
So Nvidia still has a lot of challenges to meet to make CUDA stick, since while
technologically it's undeniably a success, the task now is to convince developers
that it's a credible platform, and that doesn't look like it'll be easy. However,
judging by the many recent announcements in the news about the API, the future
doesn't look unpromising.
Supported GPUs
A table of devices officially supporting CUDA (note that many applications require at least 256
MB of dedicated VRAM).

Nvidia GeForce
GeForce GTX 295
GeForce GTX 285
GeForce GTX 280
GeForce GTX 275
GeForce GTX 260
GeForce GTS 250
GeForce GT 220
GeForce G210
GeForce 9800 GX2
GeForce 9800 GTX+
GeForce 9800 GTX
GeForce 9800 GT
GeForce 9600 GSO
GeForce 9600 GT
GeForce 9500 GT
GeForce 9400 GT
GeForce 9400 mGPU
GeForce 9300 mGPU
GeForce 8800 Ultra
GeForce 8800 GTX
GeForce 8800 GTS
GeForce 8800 GT
GeForce 8800 GS
GeForce 8600 GTS
GeForce 8600 GT
GeForce 8600 mGT
GeForce 8500 GT
GeForce 8400 GS
GeForce 8300 mGPU
GeForce 8200 mGPU
GeForce 8100 mGPU

Nvidia GeForce Mobile
GeForce GTX 280M
GeForce GTX 260M
GeForce GTS 260M
GeForce GTS 250M
GeForce GTS 160M
GeForce GT 240M
GeForce GT 230M
GeForce GT 220M
GeForce G210M
GeForce 9800M GTX
GeForce 9800M GTS
GeForce 9800M GT
GeForce 9800M GS
GeForce 9700M GTS
GeForce 9700M GT
GeForce GT 130M
GeForce GT 120M
GeForce 9650M GT
GeForce 9650M GS
GeForce 9600M GT
GeForce 9600M GS
GeForce 9500M GS
GeForce 9500M G
GeForce 9400M G
GeForce 9300M GS
GeForce 9300M G
GeForce 9200M GS
GeForce 9100M G
GeForce 8800M GTS
GeForce 8700M GT
GeForce 8600M GT
GeForce 8600M GS

Nvidia Quadro
Quadro FX 5800
Quadro FX 5600
Quadro FX 4800
Quadro FX 4700 X2
Quadro FX 4600
Quadro FX 3700
Quadro FX 1800
Quadro FX 1700
Quadro FX 570
Quadro FX 370
Quadro NVS 290
Quadro Plex 1000 Model IV
Quadro Plex 1000 Model S4

Nvidia Quadro Mobile
Quadro FX 3600M
Quadro FX 1600M
Quadro FX 770M
Quadro FX 570M
Quadro FX 370M
Quadro NVS 360M
Quadro NVS 140M
Quadro NVS 135M
Quadro NVS 130M

Nvidia Tesla
Tesla S1070
Tesla C1060
Tesla C870
Tesla D870
Tesla S870
Example
This example code in C++ loads a texture from an image into an array on the GPU:

cudaArray* cu_array;
texture<float, 2> tex;

// Allocate array
cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
cudaMallocArray(&cu_array, &description, width, height);

// Copy image data to array
cudaMemcpy(cu_array, image, width*height*sizeof(float), cudaMemcpyHostToDevice);

// Bind the array to the texture
cudaBindTextureToArray(tex, cu_array);

// Run kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel<<<gridDim, blockDim>>>(d_odata, width, height);
cudaUnbindTexture(tex);

__global__ void kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    float c = tex2D(tex, x, y);
    odata[y*width+x] = c;
}
REFERENCES
1. GPU Computing. Proceedings of the IEEE, May 2008, Volume 96, Issue 5, pages 879-899.
ISSN 0018-9219. DOI: 10.1109/JPROC.2008.917757.
2. http://developer.nvidia.com/object/gpu_programming_guide.html
3. www.tomshardware.com/reviews/nvidia-cuda-gpu