    SEMINAR REPORT

    ON

GRAPHICS PROCESSING UNIT

Maharaja Agrasen Institute of Technology,

PSP Area, Sector 22, Rohini, New Delhi 110085

    SUBMITTED BY:

    DHIRAJ JAIN

    0431482707

    C2-2


    INDEX

1. Acknowledgement

2. Introduction to GPU

3. GPU forms

4. Stream processing and GPGPU

5. CUDA

6. Nvidia's CUDA: The End of the CPU?

   6.1. Introduction
   6.2. The CUDA Architecture
   6.3. CUDA Software Development Kit
   6.4. CUDA APIs
   6.5. Theory: CUDA from the Hardware Point of View
   6.6. Theory: CUDA from the Software Point of View
   6.7. Performance
   6.8. Analysis
   6.9. Working
   6.10. Conclusion

7. References


    ACKNOWLEDGEMENT

It is a pleasure to acknowledge my debt to the many people involved, directly or indirectly, in the production of this seminar. I would like to thank my faculty guide Mrs. POOJA GUPTA for providing motivational guidance during the entire preparation of the seminar, answering a number of my technical queries and, despite her busy schedule, always giving time to check the progress report.

I gratefully acknowledge the efforts of several of my colleagues who offered many suggestions throughout this project. Their constructive criticism and timely review have resulted in several improvements.

Thanks a lot for your guidance and support.

    DHIRAJ JAIN

    0431482707


    Introduction To GPU

    August 31, 1999 marks the introduction of the Graphics Processing Unit (GPU) for

    the PC industry. The technical definition of a GPU is "a single chip processor with

    integrated transform, lighting, triangle setup/clipping, and rendering engines that is

    capable of processing a minimum of 10 million polygons per second."

    The GPU changes everything you have ever seen or experienced on your PC.

As 3D becomes more pervasive in our lives, the need for faster processing speeds increases. With the advent of the GPU, computationally intensive transform and lighting calculations were offloaded from the CPU onto the GPU, allowing for faster graphics processing speeds. This means scenes can increase in detail and complexity without sacrificing performance. In essence, the GPU gives you truly stunning realism for free.

The difficulty in virtual representations of the real world is robustly mimicking how objects interact with one another and their surroundings, because of the intense, split-second computations needed to process all the variables. These computations once created a bottleneck that tied up the CPU's resources; with the GPU taking them over, that no longer applies.


Figure: the rendering pipeline before and after the GPU.

    GPU forms

    Dedicated graphics cards

The most powerful class of GPUs typically interface with the motherboard by means of an expansion slot such as PCI Express (PCIe) or Accelerated Graphics Port (AGP) and can usually be replaced or upgraded with relative ease, assuming the motherboard is capable of supporting the upgrade. A few graphics cards still use Peripheral Component Interconnect (PCI) slots, but their bandwidth is so limited that they are generally used only when a PCIe or AGP slot is unavailable.

A dedicated GPU is not necessarily removable, nor does it necessarily interface with the motherboard in a standard fashion. The term "dedicated" refers to the fact that dedicated graphics cards have RAM that is dedicated to the card's use, not to the fact that most dedicated GPUs are removable. Dedicated GPUs for portable computers are most commonly interfaced through a non-standard and often proprietary slot due to size and weight constraints. Such ports may still be considered PCIe or AGP in terms of their logical host interface, even if they are not physically interchangeable with their counterparts.

Technologies such as SLI by NVIDIA and CrossFire by ATI allow multiple GPUs to be used to draw a single image, increasing the processing power available for graphics.

    Integrated graphics solutions


Figure: Intel GMA X3000 IGP (under heatsink).

Integrated graphics solutions, or shared graphics solutions, are graphics processors that utilize a portion of a computer's system RAM rather than dedicated graphics memory. Computers with integrated graphics account for 90% of all PC shipments. These solutions are cheaper to implement than dedicated graphics solutions, but are less capable. Historically, integrated solutions were often considered unfit to play 3D games or run graphically intensive programs such as Adobe Flash (examples of such IGPs would be offerings from SiS and VIA circa 2004). However, today's integrated solutions such as Intel's GMA X4500HD (Intel G45 chipset), AMD's Radeon HD 3200 (AMD 780G chipset) and NVIDIA's GeForce 8200 (NVIDIA nForce 730a) are more than capable of handling 2D graphics from Adobe Flash or low-stress 3D graphics. However, most integrated graphics still struggle with high-end video games. Chips like the Nvidia GeForce 9400M in Apple's new MacBook and MacBook Pro and AMD's Radeon HD 3300 (AMD 790GX) have improved performance, but still lag behind dedicated graphics cards. Modern desktop motherboards often include an integrated graphics solution and have expansion slots available to add a dedicated graphics card later.

As a GPU is extremely memory intensive, an integrated solution may find itself competing with the CPU for the already slow system RAM, as it has minimal or no dedicated video memory. System RAM may provide 2 GB/s to 12.8 GB/s of bandwidth, yet dedicated GPUs enjoy between 10 GB/s and over 100 GB/s of bandwidth, depending on the model.

    Older integrated graphics chipsets lacked hardware transform and lighting, but

    newer ones include it.


    Hybrid solutions

This newer class of GPUs competes with integrated graphics in the low-end desktop and notebook markets. The most common implementations of this are ATI's HyperMemory and NVIDIA's TurboCache. Hybrid graphics cards are somewhat more expensive than integrated graphics, but much less expensive than dedicated graphics cards. These also share memory with the system, but have a smaller dedicated amount of it than discrete graphics cards do, to make up for the high latency of the system RAM. Technologies within PCI Express can make this possible. While these solutions are sometimes advertised as having as much as 768 MB of RAM, this refers to how much can be shared with the system memory.

Stream Processing and General Purpose GPUs (GPGPU)

A new concept is to use a modified form of a stream processor to allow a general purpose graphics processing unit. This concept turns the massive floating-point computational power of a modern graphics accelerator's shader pipeline into general-purpose computing power, as opposed to being hard wired solely to do graphical operations. In certain applications requiring massive vector operations, this can yield several orders of magnitude higher performance than a conventional CPU. The two largest discrete (see "Dedicated graphics cards" above) GPU designers, ATI and NVIDIA, are beginning to pursue this new market with an array of applications. Both Nvidia and ATI have teamed with Stanford University to create a GPU-based client for the Folding@Home distributed computing project (for protein folding calculations). In certain circumstances the GPU calculates forty times faster than the conventional CPUs traditionally used in such applications.

Recently Nvidia began releasing cards supporting an API extension to the C programming language called CUDA ("Compute Unified Device Architecture"), which allows specified functions from a normal C program to run on the GPU's stream processors. This makes C programs capable of taking advantage of a GPU's ability to operate on large matrices in parallel, while still making use of the CPU where appropriate. CUDA is also the first API to allow CPU-based applications to directly access the resources of a GPU for more general-purpose computing without the limitations of using a graphics API.


Since 2005 there has been interest in using the performance offered by GPUs for evolutionary computation in general, and for accelerating the fitness evaluation in genetic programming in particular. There is a short introduction on pages 90-92 of A Field Guide To Genetic Programming. Most approaches compile linear or tree programs on the host PC and transfer the executable to the GPU to run. Typically the performance advantage is only obtained by running the single active program simultaneously on many example problems in parallel, using the GPU's SIMD architecture. However, substantial acceleration can also be obtained by not compiling the programs, but instead transferring them to the GPU and interpreting them there. Acceleration can then be obtained by interpreting multiple programs simultaneously, simultaneously running multiple example problems, or combinations of both. A modern GPU (e.g. the 8800 GTX or later) can readily simultaneously interpret hundreds of thousands of very small programs.

    CUDA: Compute Unified Device Architecture

    What is CUDA?

NVIDIA CUDA (Compute Unified Device Architecture) technology is the world's only C language environment that enables programmers and developers to write software to solve complex computational problems in a fraction of the time by tapping into the many-core parallel processing power of GPUs. With millions of CUDA-capable GPUs already deployed, thousands of software programmers are already using the free CUDA software tools to accelerate applications, from video and audio encoding to oil and gas exploration, product design, medical imaging, and scientific research.

Providing orders of magnitude more performance than current CPUs and simplifying software development by extending the standard C language, CUDA technology enables developers to create innovative solutions for data-intensive problems. For advanced research and language development, CUDA includes a low-level assembly language layer and driver interface.

CUDA is a software and GPU architecture that makes it possible to use the many processor cores (and eventually thousands of cores) in a GPU to perform general-purpose mathematical calculations. CUDA is accessible to all programmers through an extension to the C and C++ programming languages for parallel computing.

    Technology Features:

- Standard C language for parallel application development on the GPU
- Standard numerical libraries for FFT (Fast Fourier Transform) and BLAS (Basic Linear Algebra Subroutines)
- Dedicated CUDA driver for computing, with a fast data transfer path between GPU and CPU
- CUDA driver interoperates with OpenGL and DirectX graphics drivers
- Support for Linux 32/64-bit and Windows XP 32/64-bit operating systems

    How is CUDA different from GPGPU?

CUDA is designed from the ground up for efficient general-purpose computation on GPUs. It uses a C-like programming language and does not require remapping algorithms to graphics concepts. CUDA is an extension to C for parallel computing. It allows the programmer to program in C, without the need to translate problems into graphics concepts. Anyone who can program in C can swiftly learn to program in CUDA.

    GPGPU (General-Purpose computation on GPUs) uses graphics APIs like DirectX

    and OpenGL for computation. It requires detailed knowledge of graphics APIs and

    hardware. The programming model is limited in terms of random read and write

    and thread cooperation.

CUDA exposes several hardware features that are not available via the graphics API. The most significant of these is shared memory, which is a small (currently 16 KB per multiprocessor) area of on-chip memory that can be accessed in parallel by blocks of threads. This allows caching of frequently used data and can provide large speedups over using textures to access data. Combined with a thread synchronization primitive, this allows cooperative parallel processing of on-chip data, greatly reducing the expensive off-chip bandwidth requirements of many parallel algorithms. This benefits a number of common applications such as linear algebra, Fast Fourier Transforms, and image processing filters.
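To illustrate, here is a minimal sketch (not from the report) of how a block of threads can cooperate through shared memory, staging data with __shared__ and synchronizing with __syncthreads() before combining partial results; the 256-thread block size is an assumption:

    __global__ void blockSum(const float* in, float* blockResults, int n)
    {
        __shared__ float cache[256];             // on-chip staging area, one slot per thread
        int tid = threadIdx.x;                   // index within the block
        int i   = blockIdx.x * blockDim.x + tid; // global index into the input

        cache[tid] = (i < n) ? in[i] : 0.0f;     // each thread loads one element
        __syncthreads();                         // wait until the whole block has loaded

        // Tree reduction inside the block: 256 -> 128 -> ... -> 1, all in shared memory
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                cache[tid] += cache[tid + stride];
            __syncthreads();
        }

        if (tid == 0)                            // thread 0 publishes the block's partial sum
            blockResults[blockIdx.x] = cache[0];
    }

Each input element is read from off-chip memory only once; all the intermediate traffic stays in the 16 KB of on-chip shared memory.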


    Whereas fragment programs in the graphics API are limited to outputting 32 floats

    (RGBA * 8 render targets) at a pre-specified location, CUDA supports scattered

    writes - i.e. an unlimited number of stores to any address. This enables many new

    algorithms that were not possible to perform efficiently using graphics-based

    GPGPU.

    The graphics API forces the user to store data in textures, which requires packing

    long arrays into 2D textures. This is cumbersome and imposes extra addressing

    math. CUDA can perform loads from any address. CUDA also offers highly

    optimized data transfers to and from the GPU.

Nvidia's CUDA: The End of the CPU?

    Introduction

Let's take a trip back in time, way back to 2003, when Intel and AMD became locked in a fierce struggle to offer increasingly powerful processors. In just a few years, clock speeds increased quickly as a result of that competition, especially with Intel's release of its Pentium 4.

But the clock speed race would soon hit a wall. After riding the wave of sustained clock speed boosts (between 2001 and 2003 the Pentium 4's clock speed doubled from 1.5 to 3 GHz), users now had to settle for improvements of a few measly megahertz that the chip makers managed to squeeze out (between 2003 and 2005, clock speeds only increased from 3 to 3.8 GHz).

Even architectures optimized for high clock speeds, like the Prescott, ran afoul of the problem, and for good reason: this time the challenge wasn't simply an industrial one. The chip makers had simply come up against the laws of physics. Some observers were even prophesying the end of Moore's Law. But that was far from being the case. While its original meaning has often been misinterpreted, the real subject of Moore's Law was the number of transistors on a given surface area of silicon. For a long time, the increase in the number of transistors in a CPU was accompanied by a concomitant increase in performance, which no doubt explains the confusion. But then the situation became complicated. CPU architects had come up against the law of diminishing returns: the number of transistors that had to be added to achieve a given gain in performance was becoming ever greater, and was headed for a dead end.

    The CUDA Architecture

The CUDA architecture consists of several components:

1. Parallel compute engines inside NVIDIA GPUs

2. OS kernel-level support for hardware initialization, configuration, etc.

3. User-mode driver, which provides a device-level API for developers

4. PTX instruction set architecture (ISA) for parallel computing kernels and functions


The CUDA Software Development Environment

    The CUDA Software Development Environment provides all the tools, examples

    and documentation necessary to develop applications that take advantage of the

    CUDA architecture.

Libraries: Advanced libraries that include BLAS, FFT, and other functions optimized for the CUDA architecture.

C Runtime: The C Runtime for CUDA provides support for executing standard C functions on the GPU and allows native bindings for other high-level languages such as Fortran, Java, and Python.

Tools: NVIDIA C Compiler (nvcc), CUDA Debugger (cuda-gdb), CUDA Visual Profiler (cudaprof), and other helpful tools.

Documentation: Includes the CUDA Programming Guide, API specifications, and other helpful documentation.

Samples: SDK code samples and documentation that demonstrate best practices for a wide variety of GPU Computing algorithms and applications.

The CUDA Software Development Environment supports two different programming interfaces:

1. A device-level programming interface, in which the application uses DirectX Compute, OpenCL or the CUDA Driver API directly to configure the GPU, launch compute kernels, and read back results.

2. A language integration programming interface, in which an application uses the C Runtime for CUDA and developers use a small set of extensions to indicate which compute functions should be performed on the GPU instead of the CPU.


When using the device-level programming interface, developers write compute kernels in separate files using the kernel language supported by their API of choice. DirectX Compute kernels (aka compute shaders) are written in HLSL. OpenCL kernels are written in a C-like language called OpenCL C. The CUDA Driver API accepts kernels written in C or PTX assembly.

When using the language integration programming interface, developers write compute functions in C, and the C Runtime for CUDA automatically handles setting up the GPU and executing the compute functions. This programming interface enables developers to take advantage of native support for high-level languages such as C, C++, Fortran, Java, Python, and more, reducing code complexity and development costs through type integration and code integration:

Type integration allows standard types as well as vector types and user-defined types (including structs) to be used seamlessly across functions that are executed on the CPU and functions that are executed on the GPU, as the sketch below illustrates.
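As a hypothetical sketch of type integration (the struct and kernel names are ours, not from the SDK), the same C struct can be used by host code and by a kernel:

    struct Particle { float x, y, z, mass; };      // ordinary C struct, same layout on CPU and GPU

    __global__ void nudge(Particle* p, float dt, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            p[i].x += dt / p[i].mass;              // device code reads the struct's fields directly
    }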


    The CUDA APIs

But Brook's critical success was enough to attract the attention of ATI and Nvidia, since the two giants saw the incipient interest in this type of initiative as an opportunity to broaden their market even more by reaching a new sector that had so far been indifferent to their graphics achievements.

Researchers who were in on Brook at its origin quickly joined the Santa Clara development teams to put together a global strategy for targeting the new market. The idea was to offer a hardware/software ensemble suited to this type of calculation. Since Nvidia's developers know all the secrets of their GPU, there was no question of relying only on a graphics API, which only communicates with the hardware via a driver, with all the problems that implies, as we saw above. So the CUDA (Compute Unified Device Architecture) development team created a set of software layers to communicate with the GPU.

CUDA provides two APIs:


A high-level API: the CUDA Runtime API;

A low-level API: the CUDA Driver API.

Since the high-level API is implemented above the low-level API, each call to a function of the Runtime is broken down into more basic instructions managed by the Driver API. Note that these two APIs are mutually exclusive: the programmer must use one or the other, but it's not possible to mix function calls from both. The term "high-level API" is relative. Even the Runtime API is still what a lot of people would consider very low-level; yet it still offers functions that are highly practical for initialization or context management. But don't expect a lot more abstraction; you still need a good knowledge of Nvidia GPUs and how they work.

The Driver API, then, is more complex to manage; it requires more work to launch processing on the GPU. But the upside is that it's more flexible, giving the programmer who wants it additional control. The two APIs are capable of communicating with OpenGL or Direct3D resources (only Direct3D 9 for the moment). The usefulness of this is obvious: CUDA could be used to generate resources (geometry, procedural textures, etc.) that could then be passed to the graphics API, or conversely, it's possible that the 3D API could send the results of the rendering to CUDA, which in that case would be used to perform post-processing. There are numerous examples of such interactions, and the advantage is that the resources remain stored in the GPU's RAM without having to transit through the bottleneck of the PCI Express bus.

Conversely, we should point out that sharing resources (in this case, video memory) with graphics data is not always idyllic and can lead to a few headaches. For example, for a change of resolution or color depth, the graphics data have priority. So, if the resources for the frame buffer need to increase, the driver won't hesitate


to grab the ones that are allocated to applications using CUDA, causing them to crash. It's not very elegant, granted; but you have to admit that the situation shouldn't come up very often. And since we're on the subject of little disadvantages: if you want to use several GPUs for a CUDA application, you'll have to disable SLI mode first, or only a single GPU will be visible to CUDA.

Finally, the third software layer is a set of libraries, two to be precise:

CUBLAS, which has a set of building blocks for linear algebra calculations on the GPU, as sketched after this list;

CUFFT, which can handle calculation of Fourier transforms, an algorithm much used in the field of signal processing.
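A hedged sketch in the style of the legacy CUBLAS interface of that period (signatures should be checked against the SDK headers):

    #include <cublas.h>

    // Sum of absolute values of a host vector, computed on the GPU via CUBLAS
    float sumAbs(const float* x, int n)
    {
        float* d_x = 0;
        cublasInit();                                     // start the library
        cublasAlloc(n, sizeof(float), (void**)&d_x);      // allocate the vector in VRAM
        cublasSetVector(n, sizeof(float), x, 1, d_x, 1);  // copy host -> device
        float result = cublasSasum(n, d_x, 1);            // BLAS-1 asum, run on the GPU
        cublasFree(d_x);
        cublasShutdown();
        return result;
    }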


    A Few Definitions

Before we dive into CUDA, let's define a few terms that are sprinkled throughout Nvidia's documentation. The company has chosen to use a rather special terminology that can be hard to grasp. First we need to define what a thread is in CUDA, because the term doesn't have quite the same meaning as a CPU thread, nor is it the equivalent of what we call "threads" in our GPU articles. A thread on the GPU is a basic element of the data to be processed. Unlike CPU threads, CUDA threads are extremely lightweight, meaning that a context change between two threads is not a costly operation.

The second term frequently encountered in the CUDA documentation is warp. No confusion possible this time (unless you think the term might have something to do with Star Trek or Warhammer). No, the term is taken from the terminology of weaving, where it designates threads arranged lengthwise on a loom and crossed by the woof. A warp in CUDA, then, is a group of 32 threads, which is the minimum size of the data processed in SIMD fashion by a CUDA multiprocessor. But that granularity is not always sufficient to be easily usable by a programmer, and so in CUDA, instead of manipulating warps directly, you work with blocks that can contain 64 to 512 threads.

Finally, these blocks are put together in grids. The advantage of the grouping is that the number of blocks processed simultaneously by the GPU is closely linked to hardware resources, as we'll see later. The number of blocks in a grid makes it possible to totally abstract that constraint and apply a kernel to a large quantity of threads in a single call, without worrying about fixed resources. The CUDA runtime takes care of breaking it all down for you. This means that the model is extremely extensible. If the hardware has few resources, it executes the blocks sequentially; if it has a very large number of processing units, it can process them in parallel. This in turn means that the same code can target entry-level GPUs, high-end ones, and even future GPUs.
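A minimal sketch (names are ours) of what that abstraction looks like in practice: the launch specifies only the block size and the number of blocks, and each thread recovers its own global index:

    // Kernel: each thread handles one element, whatever the grid size turns out to be
    __global__ void scale(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard threads past the end of the data
            data[i] *= 2.0f;
    }

    // Host side: the grid size simply absorbs the problem size
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    scale<<<blocks, threadsPerBlock>>>(d_data, n);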


The other terms you'll run into frequently in the CUDA API are used to designate the CPU, which is called the host, and the GPU, which is referred to as the device. After that little introduction, which we hope hasn't scared you away, it's time to plunge in!


The Theory: CUDA from the Hardware Point of View

With CUDA, Nvidia presents its architecture in a slightly different way and exposes certain details that hadn't been revealed before now.

Nvidia's Shader Core is made up of several clusters Nvidia calls Texture Processor Clusters. An 8800 GTX, for example, has eight clusters, an 8800 GTS six, and so on. Each cluster, in fact, is made up of a texture unit and two streaming multiprocessors. These processors consist of a front end that reads/decodes and launches instructions, and a back end made up of a group of eight calculating units and two SFUs (Special Function Units), where the instructions are executed in SIMD fashion: the same instruction is applied to all the threads in the warp. Nvidia calls this mode of execution SIMT (for single instruction, multiple threads). It's important to point out that the back end operates at double the frequency of the front end. In practice, then, the part that executes the instructions appears to be twice as wide as it actually is (that is, as a 16-way SIMD unit instead of an eight-way one). The streaming multiprocessor's operating mode is as follows: at each cycle, a warp ready for execution is selected by the front end, which launches execution of an instruction. To apply the instruction to all 32 threads in the warp, the back end will take four cycles, but since it operates at double the frequency of the front end, from its point of view only two cycles will be executed. So, to avoid having the front end remain unused for one cycle and to maximize the use of the hardware, the ideal is to alternate types of instructions every cycle: a classic instruction for one cycle and an SFU instruction for the other.

Each multiprocessor also has a certain amount of resources that should be understood in order to make the best use of them. They have a small memory area called Shared Memory, with a size of 16 KB per multiprocessor. This is not a cache memory: the programmer has a free hand in its management. As such, it's like the Local Store of the SPUs on Cell processors. This detail is particularly interesting, and demonstrates the fact that CUDA is indeed a set of software and hardware technologies. This memory area is not used for pixel shaders; as Nvidia says, tongue in cheek, "We dislike pixels talking to each other."

This memory area provides a way for threads in the same block to communicate. It's important to stress the restriction: all the threads in a given block are guaranteed to be executed by the same multiprocessor. Conversely, the assignment of blocks to the different multiprocessors is completely undefined, meaning that two threads from different blocks can't communicate during their execution. That means using this memory is complicated. But it can also be worthwhile, because except for cases where several threads try to access the same memory bank and cause a conflict, access to shared memory is as fast as access to the registers.


The shared memory is not the only memory the multiprocessors can access. Obviously they can use the video memory, but it has lower bandwidth and higher latency. Consequently, to limit too-frequent access to this memory, Nvidia has also provided its multiprocessors with a cache (approximately 8 KB per multiprocessor) for access to constants and textures.

The multiprocessors also have 8,192 registers that are shared among all the threads of all the blocks active on that multiprocessor. The number of active blocks per multiprocessor can't exceed eight, and the number of active warps is limited to 24 (768 threads). So, an 8800 GTX can have up to 12,288 threads being processed at a given instant. It's worth mentioning all these limits because it helps in dimensioning the algorithm as a function of the available resources.

Optimizing a CUDA program, then, essentially consists of striking the optimum balance between the number of blocks and their size. More threads per block will be useful in masking the latency of the memory operations, but at the same time the number of registers available per thread is reduced. What's more, a block of 512 threads would be particularly inefficient, since only one block might be active on a multiprocessor, potentially wasting 256 threads. So, Nvidia advises using blocks of 128 to 256 threads, which offers the best compromise between masking latency and the number of registers needed for most kernels.


The Theory: CUDA from the Software Point of View

From a software point of view, CUDA consists of a set of extensions to the C language, which of course recalls BrookGPU, and a few specific API calls. Among the extensions are type qualifiers that apply to functions and variables. The keyword to remember here is __global__, which when prefixed to a function indicates that the latter is a kernel, that is, a function that will be called by the CPU and executed by the GPU. The __device__ keyword designates a function that will be executed by the GPU (which CUDA refers to as the device) but can only be called from the GPU (in other words, from another __device__ function or from a __global__ function). Finally, the __host__ keyword is optional, and designates a function that's called by the CPU and executed by the CPU, in other words, a traditional function.

There are a few restrictions associated with __device__ and __global__ functions: they can't be recursive (that is, they can't call themselves) and they can't have a variable number of arguments. Finally, as __device__ functions reside in the GPU's memory space, logically enough it's impossible to obtain their address.

Variables also have new qualifiers that allow control of the memory area where they'll be stored. A variable preceded by the keyword __shared__ indicates that it will be stored in the streaming multiprocessor's shared memory. The way a __global__ function is called is also a little different. That's because the execution configuration has to be defined at the time of the call; more concretely, the size of the grid to which the kernel is applied and the size of each block. Take the example of a kernel with the following signature:

__global__ void Func(float* parameter);

which will be called as follows:

Func<<<Dg, Db>>>(parameter);

where Dg is the grid dimension and Db the dimension of a block. These two variables are of a new vector type (dim3) introduced by CUDA.

The CUDA API essentially comprises functions for memory manipulation in VRAM: cudaMalloc to allocate memory, cudaFree to free it, and cudaMemcpy to copy data between RAM and VRAM and vice-versa.
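Putting those calls together with the launch syntax above, a minimal host-side sequence might look like this (a sketch with error checking omitted; Func is the kernel from the example above):

    float* h_data = (float*)malloc(N * sizeof(float));       // input in RAM
    float* d_data = 0;
    cudaMalloc((void**)&d_data, N * sizeof(float));          // allocate VRAM
    cudaMemcpy(d_data, h_data, N * sizeof(float),
               cudaMemcpyHostToDevice);                      // RAM -> VRAM

    dim3 Db(256);               // block dimension: 256 threads
    dim3 Dg((N + 255) / 256);   // grid dimension: enough blocks to cover N
    Func<<<Dg, Db>>>(d_data);   // launch the kernel on the device

    cudaMemcpy(h_data, d_data, N * sizeof(float),
               cudaMemcpyDeviceToHost);                      // VRAM -> RAM
    cudaFree(d_data);                                        // release VRAM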


We'll end this overview with the way a CUDA program is compiled, which is interesting: compiling is done in several phases. First of all, the code dedicated to the CPU is extracted from the file and passed to the standard compiler. The code dedicated to the GPU is first converted into an intermediate language, PTX. This intermediate language is like an assembler, and so enables the generated source code to be studied for potential inefficiencies. Finally, the last phase translates this intermediate language into commands that are specific to the GPU and encapsulates them in binary form in the executable.
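In practice the intermediate code can be inspected; for instance, nvcc can be asked to stop at the PTX stage (a sketch; consult the nvcc documentation of the period for the exact flags):

    nvcc --ptx kernel.cu        (emits kernel.ptx, the assembler-like intermediate code)
    nvcc -o program program.cu  (full build: CPU code goes to the host compiler, GPU code goes through PTX into the binary)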

    Performance

However, we did decide to measure the processing time to see if there was any advantage to using CUDA even with our crude implementation, or if, on the other hand, it was going to take long, exhaustive practice to get any real control over the use of the GPU. The test machine was our development box, a laptop computer with a Core 2 Duo T5450 and a GeForce 8600M GT, operating under Vista. It's far from being a supercomputer, but the results are interesting, since our test is not all that favorable to the GPU. It's fine for Nvidia to show us huge accelerations on systems equipped with monster GPUs and enormous bandwidth, but in practice many of the 70 million CUDA GPUs existing on current PCs are much less powerful, and so our test is quite germane.

The results we got are as follows for processing a 2048x2048 image:

CPU, 1 thread: 1,419 ms
CPU, 2 threads: 749 ms
CPU, 4 threads: 593 ms
GPU (8600M GT), blocks of 256 pixels: 109 ms
GPU (8600M GT), blocks of 128 pixels: 94 ms
GPU (8800 GTX), blocks of 128 or 256 pixels: 31 ms

Several observations can be made about these results. First of all, you'll notice that despite our crack about programmers' laziness, we did modify the initial CPU implementation by threading it. As we said, the code is ideal for this situation: all you do is break down the initial image into as many zones as there are threads. Note that we got an almost linear acceleration going from one to two threads on our dual-core CPU, which shows the strongly parallel nature of our test program. Fairly unexpectedly, the four-thread version proved faster, whereas we were expecting to see no difference at all on our processor, or even, more logically, a slight loss of efficiency due to the additional cost generated by the creation of the additional threads. What explains that result? It's hard to say, but it may be that the Windows thread scheduler has something to do with it; in any case the result was reproducible. With a texture of smaller dimensions (512x512), the gain achieved by threading was a lot less marked (approximately 35% as opposed to 100%) and the behavior of the four-thread version was more logical, showing no gain over the two-thread version. The GPU was still faster, but less markedly so (the 8600M GT was three times faster than the two-thread version).


The second notable observation is that even the slowest GPU implementation was nearly six times faster than the best-performing CPU version. For a first program and a trivial version of the algorithm, that's very encouraging. Notice also that we got significantly better results using smaller blocks, whereas intuitively you might think that the reverse would be true. The explanation is simple: our program uses 14 registers per thread, and with 256-thread blocks it would need 3,584 registers per block, and to saturate a multiprocessor it takes 768 threads, as we saw. In our case, that's three blocks or 10,752 registers. But a multiprocessor has only 8,192 registers, so it can only keep two blocks active. Conversely, with blocks of 128 pixels, we need 1,792 registers per block; 8,192 divided by 1,792 and rounded down works out to four blocks being processed. In practice, the number of threads is the same (512 per multiprocessor, whereas theoretically it takes 768 to saturate it), but having more blocks gives the GPU additional flexibility with memory access: when an operation with a long latency is executed, it can launch execution of the instructions on another block while waiting for the results to be available. Four blocks will certainly mask the latency better, especially since our program makes several memory accesses.
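The arithmetic above can be written out as a small helper; this is a sketch using the G80 limits quoted earlier (8,192 registers, 8 blocks, 768 threads per multiprocessor):

    // How many blocks can a G80 multiprocessor keep active for a given kernel?
    int activeBlocks(int threadsPerBlock, int regsPerThread)
    {
        int byRegisters = 8192 / (threadsPerBlock * regsPerThread); // register budget
        int byThreads   = 768 / threadsPerBlock;                    // thread budget
        int blocks = (byRegisters < byThreads) ? byRegisters : byThreads;
        return (blocks > 8) ? 8 : blocks;                           // hard cap of 8 blocks
    }

    // With 14 registers per thread, as in our program:
    //   activeBlocks(256, 14) == 2  -> 512 active threads
    //   activeBlocks(128, 14) == 4  -> 512 active threads, but more blocks to hide latency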

    Analysis

Finally, in spite of what we said earlier about this not being a horserace, we couldn't resist the temptation of running the program on an 8800 GTX, which proved to be three times as fast as the mobile 8600, independent of the size of the blocks. You might think the result would be four or more times faster based on the respective architectures: 128 ALUs compared to 32 and a higher clock frequency (1.35 GHz compared to 950 MHz), but in practice that wasn't the case. Here again the most likely hypothesis is that we were limited by the memory accesses. To be more precise, the initial image is accessed like a CUDA multidimensional array, a very complicated term for what's really nothing more than a texture. There are several advantages:

accesses get the benefit of the texture cache;


we have a wrapping mode, which avoids having to manage the edges of the image, unlike the CPU version.

We could also have taken advantage of free filtering with normalized addressing between [0,1] instead of [0, width] and [0, height], but that wasn't useful in our case. As you know as a faithful reader, the 8600 has 16 texture units compared to 32 for the 8800 GTX. So there's only a two-to-one ratio between the two architectures. Add the difference in frequency and we get a ratio of (32 x 0.575) / (16 x 0.475) = 2.4, in the neighborhood of the three-to-one we actually observed. That theory also has the advantage of explaining why the size of the blocks doesn't change much on the G80, since the ALUs are limited by the texture units anyway.

In addition to the encouraging results, our first steps with CUDA went very well considering the unfavorable conditions we'd chosen. Developing on a Vista laptop means you're forced to use CUDA SDK 2.0, still in its beta stage, with the 174.55 driver, which is also in beta. Despite all that, we have no unpleasant surprises to report, just a little scare when the first execution of our program, still very buggy, tried to address memory beyond the allocated space.

The monitor blinked frenetically, then went black until Vista launched the video driver recovery service and all was well. But you have to admit it was surprising when you're used to seeing an ordinary Segmentation Fault with standard programs in cases like that. Finally, one (very small) criticism of Nvidia: in all the documentation available for CUDA, it's a shame not to find a little tutorial explaining step by step how to set up the development environment in Visual Studio. That's not too big a problem, since the SDK is full of example programs you can explore to find out how to build the skeleton of a minimal project for a CUDA application, but for beginners a tutorial would have been a lot more convenient.

    WORKING

    THE GRAPHICS PIPELINE

The task of any 3D graphics system is to synthesize an image from a description of a scene. This scene contains the geometric primitives to be viewed as well as descriptions of the lights illuminating the scene. GPU designers traditionally have expressed this image-synthesis process as a hardware pipeline of specialized stages.

Here, I will provide an overview of the classic graphics pipeline. The goal is to highlight those aspects of the real-time rendering calculation that allow graphics application developers to exploit modern GPUs as general-purpose parallel computation engines.

    How GPUs Work

1. Pipeline input

Most real-time graphics systems assume that everything is made of triangles, and they first carve up any more complex shapes, such as quadrilaterals or curved surface patches, into triangles. The developer uses a computer graphics library (such as OpenGL or Direct3D) to provide each triangle to the graphics pipeline one vertex at a time; the GPU assembles vertices into triangles as needed.

2. Model transformations

A scene can specify each logical object in its own locally defined coordinate system. This convenience comes at a price: before rendering, the GPU must first transform all objects into a common coordinate system. To ensure that triangles aren't warped or twisted into curved shapes, this transformation is limited to simple affine operations such as rotations, translations, scalings, and the like.

3. Lighting

Once each triangle is in a global coordinate system, the GPU can compute its color based on the lights in the scene. As an example, we describe the calculations for a single point light source (imagine a very small lightbulb). The GPU handles multiple lights by summing the contributions of each individual light.

The traditional graphics pipeline supports the Phong lighting equation, a phenomenological appearance model that approximates the look of plastic. The Phong lighting equation gives the output color:

C = Kd Li (N.L) + Ks Li (R.V)^s

where Kd is the diffuse color, Ks the specular color, Li the light color, N the surface normal, L the direction to the light, R the reflection vector, V the direction to the viewer, and s the shininess exponent.
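As a rough sketch (not from the report), the equation maps directly to scalar code; the clamp to zero on the dot products is the usual convention:

    #include <math.h>

    // Phong lighting for one light and one color channel; N, L, R, V are unit vectors
    float phong(float Kd, float Ks, float Li, float NdotL, float RdotV, float s)
    {
        float diffuse  = Kd * Li * fmaxf(NdotL, 0.0f);           // diffuse term, Kd Li (N.L)
        float specular = Ks * Li * powf(fmaxf(RdotV, 0.0f), s);  // specular term, Ks Li (R.V)^s
        return diffuse + specular;
    }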


4. Camera simulation

The graphics pipeline next projects each colored 3D triangle onto the virtual camera's film plane. Like the model transformations, the GPU does this using matrix-vector multiplication, again leveraging efficient vector operations in hardware. This stage's output is a stream of triangles in screen coordinates, ready to be turned into pixels.

5. Rasterization

Each visible screen-space triangle overlaps some pixels on the display; determining these pixels is called rasterization. GPU designers have incorporated many rasterization algorithms over the years, which all exploit one crucial observation: each pixel can be treated independently from all other pixels. Therefore, the machine can handle all pixels in parallel; indeed, some exotic machines have had a processor for each pixel. This inherent independence has led GPU designers to build increasingly parallel sets of pipelines.

6. Texturing

The actual color of each pixel can be taken directly from the lighting calculations, but for added realism, images called textures are often draped over the geometry to give the illusion of detail. GPUs store these textures in high-speed memory, which each pixel calculation must access to determine or modify that pixel's color.

7. Hidden surfaces

In most scenes, some objects obscure other objects. If each pixel were simply written to display memory, the most recently submitted triangle would appear to be in front. All modern GPUs provide a depth buffer, a region of memory that stores the distance from each pixel to the viewer. Before writing to the display, the GPU compares a pixel's distance to the distance of the pixel that's already present, and it updates the display memory only if the new pixel is closer.
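In pseudo-C, that per-pixel depth test amounts to the following sketch (buffer names are illustrative):

    // Depth test for a candidate fragment at pixel (x, y)
    int idx = y * screenWidth + x;
    if (newDepth < depthBuffer[idx]) {      // closer to the viewer than what is stored?
        depthBuffer[idx] = newDepth;        // record the new nearest distance
        colorBuffer[idx] = newColor;        // and write the pixel to display memory
    }
    // otherwise the fragment is hidden and both buffers are left untouched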


Figure 1. Programmable shading. The introduction of programmable shading in 2001 led to several visual effects not previously possible, such as this simulation of refractive chromatic dispersion for a soap bubble effect.


Figure 2. Unprecedented visual realism. Modern GPUs can use programmable shading to achieve near-cinematic realism, as this interactive demonstration shows, featuring actress Adrianne Curry on an NVIDIA GeForce 8800 GTX.


    Conclusion

Nvidia introduced CUDA with the release of the GeForce 8800. At that time the promises they were making were extremely seductive, but we kept our enthusiasm in check. After all, wasn't this likely to be just a way of staking out the territory and surfing the GPGPU wave? Without an SDK available, you can't blame us for thinking it was all just a marketing operation and that nothing really concrete would come of it. It wouldn't be the first time a good initiative has been announced too early and never really saw the light of day due to a lack of resources, especially in such a competitive sector. Now, a year and a half after the announcement, we can say that Nvidia has kept its word.


Not only was the SDK available quickly in a beta version, in early 2007, but it's also been updated frequently, proving the importance of this project for Nvidia. Today CUDA has developed nicely; the SDK is available in a beta 2.0 version for the major operating systems (Windows XP and Vista, and Linux, with 1.1 for Mac OS X), and Nvidia devotes an entire section of its site to developers.

On a more personal level, the impression we got from our first steps with CUDA was extremely positive. Even if you're familiar with the GPU's architecture, it's natural to be apprehensive about programming it, and while the API looks clear at first glance, you can't keep from thinking it won't be easy to get convincing results with the architecture. Won't the gain in processing time be siphoned off by the multiple CPU-GPU transfers? And how to make good use of those thousands of threads with almost no synchronization primitive? We started our experimentation with all these uncertainties in mind. But they soon evaporated when the first version of our algorithm, trivial as it was, already proved to be significantly faster than the CPU implementation.

So, CUDA is not a gimmick intended for researchers who want to cajole their university into buying them a GeForce. CUDA is genuinely usable by any programmer who knows C, provided he or she is ready to make a small investment of time and effort to adapt to this new programming paradigm. That effort won't be wasted, provided your algorithms lend themselves to parallelization. We should also


tip our hat to Nvidia for providing ample, quality documentation to answer all the questions of beginning programmers.

So what does CUDA need in order to become the API to reckon with? In a word: portability. We know that the future of IT is in parallel computing; everybody's preparing for the change, and all initiatives, both software and hardware, are taking that direction. Currently, in terms of development paradigms, we're still in prehistory: creating threads by hand and making sure to carefully plan access to shared resources is still manageable today, when the number of processor cores can be counted on the fingers of one hand; but in a few years, when processors will number in the hundreds, that won't be a possibility. With CUDA, Nvidia is proposing a first step in solving this problem, but the solution is obviously reserved only for their own GPUs, and not even all of them. Only the GF8 and 9 (and their Quadro/Tesla derivatives) are currently able to run CUDA programs.

Nvidia may boast that it has sold 70 million CUDA-compatible GPUs worldwide, but that's still not enough for it to impose itself as the de facto standard. All the more so since their competitors aren't standing by idly. AMD is offering its own SDK (Stream Computing) and Intel has also announced a solution (Ct), though it's

not available yet. So the war is on and there won't be room for three competitors, unless another player, say Microsoft, were to step in and pick up all the marbles with a common API, which would certainly be welcomed by developers.

So Nvidia still has a lot of challenges to meet to make CUDA stick, since while technologically it's undeniably a success, the task now is to convince developers that it's a credible platform, and that doesn't look like it'll be easy. However, judging by the many recent announcements in the news about the API, the future doesn't look unpromising.

Supported GPUs

A table of devices officially supporting CUDA (note that many applications require at least 256 MB of dedicated VRAM).

Nvidia GeForce
GeForce GTX 295
GeForce GTX 285
GeForce GTX 280
GeForce GTX 275
GeForce GTX 260
GeForce GTS 250
GeForce GT 220
GeForce G210
GeForce 9800 GX2
GeForce 9800 GTX+
GeForce 9800 GTX
GeForce 9800 GT
GeForce 9600 GSO
GeForce 9600 GT
GeForce 9500 GT
GeForce 9400 GT
GeForce 9400 mGPU
GeForce 9300 mGPU
GeForce 8800 Ultra
GeForce 8800 GTX
GeForce 8800 GTS
GeForce 8800 GT
GeForce 8800 GS
GeForce 8600 GTS
GeForce 8600 GT
GeForce 8600 mGT
GeForce 8500 GT
GeForce 8400 GS
GeForce 8300 mGPU
GeForce 8200 mGPU
GeForce 8100 mGPU

Nvidia GeForce Mobile
GeForce GTX 280M
GeForce GTX 260M
GeForce GTS 260M
GeForce GTS 250M
GeForce GTS 160M
GeForce GT 240M
GeForce GT 230M
GeForce GT 220M
GeForce G210M
GeForce GT 130M
GeForce GT 120M
GeForce 9800M GTX
GeForce 9800M GTS
GeForce 9800M GT
GeForce 9800M GS
GeForce 9700M GTS
GeForce 9700M GT
GeForce 9650M GT
GeForce 9650M GS
GeForce 9600M GT
GeForce 9600M GS
GeForce 9500M GS
GeForce 9500M G
GeForce 9400M G
GeForce 9300M GS
GeForce 9300M G
GeForce 9200M GS
GeForce 9100M G
GeForce 8800M GTS
GeForce 8700M GT
GeForce 8600M GT
GeForce 8600M GS

Nvidia Quadro
Quadro FX 5800
Quadro FX 5600
Quadro FX 4800
Quadro FX 4700 X2
Quadro FX 4600
Quadro FX 3700
Quadro FX 1800
Quadro FX 1700
Quadro FX 570
Quadro FX 370
Quadro NVS 290
Quadro Plex 1000 Model IV
Quadro Plex 1000 Model S4

Nvidia Quadro Mobile
Quadro FX 3600M
Quadro FX 1600M
Quadro FX 770M
Quadro FX 570M
Quadro FX 370M
Quadro NVS 360M
Quadro NVS 140M
Quadro NVS 135M
Quadro NVS 130M

Nvidia Tesla
Tesla S1070
Tesla C1060
Tesla C870
Tesla D870
Tesla S870

Example

This example code in C++ loads a texture from an image into an array on the GPU (the kernel definition is shown first so the listing compiles as a single file):

    texture<float, 2, cudaReadModeElementType> tex;

    // Kernel: each thread copies one texel into the output array
    __global__ void kernel(float* odata, int width, int height)
    {
        unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
        unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
        float c = tex2D(tex, x, y);
        odata[y*width + x] = c;
    }

    // Host code (d_odata is the device output buffer, allocated elsewhere with cudaMalloc)
    cudaArray* cu_array;

    // Allocate the array
    cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
    cudaMallocArray(&cu_array, &description, width, height);

    // Copy image data to the array
    cudaMemcpyToArray(cu_array, 0, 0, image, width*height*sizeof(float), cudaMemcpyHostToDevice);

    // Bind the array to the texture
    cudaBindTextureToArray(tex, cu_array);

    // Run the kernel
    dim3 blockDim(16, 16, 1);
    dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
    kernel<<<gridDim, blockDim>>>(d_odata, width, height);
    cudaUnbindTexture(tex);

    REFERENCES


1. "GPU Computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879-899, May 2008. ISSN 0018-9219, DOI: 10.1109/JPROC.2008.917757.

2. http://developer.nvidia.com/object/gpu_programming_guide.html

3. http://www.tomshardware.com/reviews/nvidia-cuda-gpu
