Seminar Report: CUDA
Nipun Saxena, 7-IT-058
Mar 26, 2015


CONTENTS

INTRODUCTION
ABSTRACT
BACKGROUND
CUDA-ENABLED GPUS
EXAMPLE OF CUDA PROCESSING FLOW
CURRENT CUDA ARCHITECTURES
FUTURE USAGES OF CUDA ARCHITECTURE
CUDA ON x86 PLATFORM
GPGPU
SUPPORTED GPUS
ADVANTAGES & LIMITATIONS
GPU V/S CPU


INTRODUCTION

CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units (GPUs) that is accessible to software developers through variants of industry-standard programming languages. Programmers use 'C for CUDA' (C with NVIDIA extensions and certain restrictions), compiled through a PathScale Open64 C compiler, to code algorithms for execution on the GPU. The CUDA architecture shares a range of computational interfaces with two competitors: the Khronos Group's Open Computing Language and Microsoft's DirectCompute. Third-party wrappers are also available for Python, Perl, Fortran, Java, Ruby, Lua, MATLAB and IDL.

CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. Using CUDA, the latest NVIDIA GPUs become accessible for computation like CPUs. Unlike CPUs, however, GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly, rather than executing a single thread very fast. This approach of solving general-purpose problems on GPUs is known as GPGPU.

In the computer game industry, in addition to graphics rendering, GPUs are used in game physics calculations (physical effects like debris, smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate non-graphical applications in computational biology, cryptography and other fields by an order of magnitude or more. An example of this is the BOINC distributed computing client.

CUDA provides both a low-level API and a higher-level API. The initial CUDA SDK was made public on 15 February 2007, for Microsoft Windows and Linux. Mac OS X support was later added in version 2.0, which supersedes the beta released February 14, 2008. CUDA works with all NVIDIA GPUs from the G8X series onwards, including GeForce, Quadro and the Tesla line. NVIDIA states that programs developed for the GeForce 8 series will also work without modification on all future NVIDIA video cards, due to binary compatibility.

ABSTRACT


CUDA is NVIDIA’s parallel computing architecture. It enables dramatic increases in computing performance by harnessing the power of the GPU.

With millions of CUDA-enabled GPUs sold to date, software developers, scientists and researchers are finding broad-ranging uses for CUDA, including image and video processing, computational biology and chemistry, fluid dynamics simulation, CT image reconstruction, seismic analysis, ray tracing, and much more.

Computing is evolving from "central processing" on the CPU to "co-processing" on the CPU and GPU. To enable this new computing paradigm, NVIDIA invented the CUDA parallel computing architecture that is now shipping in GeForce, ION, Quadro, and Tesla GPUs, representing a significant installed base for application developers.

In the consumer market, nearly every major consumer video application has been, or will soon be, accelerated by CUDA, including products from Elemental Technologies, MotionDSP and LoiLo, Inc.

CUDA has been enthusiastically received in the area of scientific research. For example, CUDA now accelerates AMBER, a molecular dynamics simulation program used by more than 60,000 researchers in academia and pharmaceutical companies worldwide to accelerate new drug discovery.

In the financial market, Numerix and CompatibL announced CUDA support for a new counterparty risk application and achieved an 18X speedup. Numerix is used by nearly 400 financial institutions.

An indicator of CUDA adoption is the ramp of the Tesla GPU for GPU computing. There are now more than 700 GPU clusters installed around the world at Fortune 500 companies ranging from Schlumberger and Chevron in the energy sector to BNP Paribas in banking.

And with the recent launches of Microsoft Windows 7 and Apple Snow Leopard, GPU computing is going mainstream. In these new operating systems, the GPU will not only be the graphics processor, but also a general purpose parallel processor accessible to any application.

BACKGROUND


• CUDA is a platform for performing massively parallel computations on graphics accelerators

• CUDA was developed by NVIDIA Corporation

• It was first available with their G8X line of graphics cards

• Approximately 1 million CUDA capable GPUs are shipped every week

• CUDA presents a unique opportunity to develop widely-deployed parallel applications

The CUDA platform represents the shift from traditional clock-speed-intensive processing to distributed stream processing.

Implementations

There are two levels for the runtime API.

The C API (cuda_runtime_api.h) is a C-style interface that does not require compiling with nvcc. The C++ API (cuda_runtime.h) is a C++-style interface built on top of the C API. It wraps some of the C API routines, using overloading, references and default arguments. These wrappers can be used from C++ code and can be compiled with any C++ compiler. The C++ API also has some CUDA-specific wrappers that wrap C API routines dealing with symbols, textures, and device functions. These wrappers require the use of nvcc because they depend on code being generated by the compiler. For example, the execution configuration syntax to invoke kernels is only available in source code compiled with nvcc.

CUDA-ENABLED GPUS 


Dozens of vendors provide graphics cards based upon CUDA-capable NVIDIA GPUs, and it is almost impossible for a high-performance PC or motherboard vendor to introduce a product that does not do a good job of hosting such GPU cards. The insatiable demand of gamers for more performance has also spawned an industry of vendors offering ever-faster memory, more powerful power supplies and other system components that are perfect for creating outstanding GIS desktop and server machines.

The easiest way to see if a particular graphics card is CUDA-enabled is to first check which NVIDIA GPU it utilizes. Almost all contemporary NVIDIA GPUs are CUDA-enabled. If anything, the surprise is discovering that quite a few NVIDIA GPUs aimed at motherboard chipsets or mobile applications like portable computers are also CUDA-enabled, albeit with a smaller number of processors per GPU.

NVIDIA presently provides three families of CUDA-enabled GPU products. All three families may be used with Manifold and CUDA:

• GeForce – The GeForce line of NVIDIA GPUs is sold primarily through a wide variety of graphics card and motherboard manufacturers, which incorporate the NVIDIA chips into their own graphics cards. Performance tends to be high and prices kept low by fierce competition in gaming markets.

• Quadro – The Quadro line of NVIDIA GPUs is manufactured and sold directly by NVIDIA in very high-end professional workstation graphics markets. Quadro cards provide extraordinarily high resolutions and massive graphics memory for the most demanding workstation applications. Some Quadro products also appear in high-end portable computers, and some are provided in external cabinets similar to Tesla packaging.

• Tesla – The Tesla line of NVIDIA GPUs is also manufactured and sold directly by NVIDIA to support high-performance computing where supercomputer performance through parallel processing is required. Although they are also available as plug-in cards, Tesla GPUs are best known for being packaged into external cabinets (either desktop or rack mount) that provide two or four GPUs per cabinet. The external cabinets used for Tesla and some Quadro products attach to desktop computers using a special cable that plugs into an interface card in a standard PCI-E slot. This allows the external Tesla or Quadro configuration to appear to software as if it were a plugged-in PCI-E card just like typical GeForce cards. However, because the actual GPUs are hosted in an external cabinet that provides power and cooling, the host computer does not need to be retrofitted with additional power and cooling.

Parallel Computing

Whether you are porting an existing application, designing a new application, or just want to get your work done faster using the applications you have, these resources will help you get started.

GPU Acceleration in Existing Applications

Many scientists, engineers, and professionals can realize the benefits of parallel computing on GPUs simply by upgrading to GPU-accelerated versions of the applications they already use. Examples include LabVIEW, Mathematica, MATLAB, and many more.

Developing Your Own Applications

If you are developing applications or libraries, first decide whether you want to take advantage of existing libraries that are already optimized for parallel computing on GPUs. If all the functionality you need already exists in a library, you may simply need to use these libraries in your application. Even if you know you want to write your own custom code for the GPU, it’s worth reviewing the available libraries to see what you can leverage.

If you will be writing your own code for the GPU, there are many available language solutions and APIs, so it's worth reviewing the options and selecting the solution that best meets your needs.

Approach – Examples
Application Integration – MATLAB, Mathematica, LabVIEW
Implicit Parallel Languages – PGI Accelerator, HMPP
Abstraction Layer/Wrapper – PyCUDA, CUDA.NET, jCUDA
Language Integration – CUDA C/C++, PGI CUDA Fortran
Low-level Device API – CUDA C/C++, DirectCompute, OpenCL

Once you have decided which libraries and language solution or API you’re going to use, you’re ready to start programming. If you selected a solution


provided by NVIDIA, download the latest CUDA Toolkit and review the Getting Started Guide. There's also a collection of essential training materials, webinars, etc. on our CUDA Education & Training page.

Choose Your Development Platform

With over 300M CUDA-architecture GPUs sold to date, most developers will be able to use the GPU you already have to get started. When you’re ready to test and deploy your applications, make sure you review our current product lines and OEM solutions to select the best systems for your needs.

* Tesla products are designed for datacenter and workstation computing applications
* Quadro products are designed for professional graphics and engineering applications
* GeForce products are designed for interactive gaming and consumer applications

Language bindings

Python - PyCUDA KappaCUDA

Java - jCUDA, JCuda, JCublas, JCufft

.NET - CUDA.NET

MATLAB - Jacket, GPUmat

Fortran - FORTRAN CUDA, PGI CUDA Fortran Compiler

Perl - KappaCUDA

Ruby - KappaCUDA

Lua - KappaCUDA

IDL - GPULib

EXAMPLE OF CUDA PROCESSING FLOW


1. Copy data from main memory to GPU memory

2. The CPU instructs the GPU to start processing

3. The GPU executes the computation in parallel on each core

4. Copy the result from GPU memory back to main memory
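A minimal CPU-side sketch of this four-step flow, with an ordinary vector standing in for GPU memory. The real sequence would use cudaMalloc, cudaMemcpy and a kernel launch, which require the CUDA toolkit; the function name and the toy squaring kernel here are illustrative assumptions:

```cpp
#include <vector>
#include <cassert>

// "Device" memory is modelled as a separate vector that data
// must be copied into and out of explicitly, mirroring the
// disjoint host/device memory spaces of a real GPU.
std::vector<int> process_on_device(const std::vector<int>& host_in) {
    // 1. Copy data from main memory to (simulated) GPU memory.
    std::vector<int> dev = host_in;

    // 2./3. The CPU "instructs" the device; each element is
    // processed independently, as each GPU core would in parallel.
    for (int& v : dev)
        v = v * v;          // toy kernel: square each element

    // 4. Copy the result from (simulated) GPU memory back.
    return dev;
}
```

The explicit copy steps matter in practice: as noted later in this report, the bus bandwidth between CPU and GPU can be a bottleneck.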

CURRENT CUDA ARCHITECTURES


The next-generation CUDA architecture (codename "Fermi"), which is standard in NVIDIA's GeForce 400 series GPUs, is designed from the ground up to natively support more programming languages such as C++. It has eight times the peak double-precision floating-point performance of NVIDIA's previous-generation Tesla GPUs. It also introduced several new features, including:

up to 512 CUDA cores and 3.0 billion transistors

NVIDIA Parallel DataCache technology

NVIDIA GigaThread engine

ECC memory support

Native support for Visual Studio

FUTURE USAGES OF CUDA ARCHITECTURE

• Search for extra-terrestrial intelligence
• Accelerated rendering of 3D graphics
• Real-time cloth simulation (OptiTex.com – Real Time Cloth Simulation)
• Distributed calculations, such as predicting the native conformation of proteins
• Medical analysis simulations, for example virtual reality based on CT and MRI scan images
• Physical simulations, in particular in fluid dynamics
• Environment statistics
• Accelerated encryption, decryption and compression
• Accelerated interconversion of video file formats

CUDA ON x86 PLATFORM


NVIDIA's CUDA architecture provides developers with a way to efficiently program NVIDIA GPUs using a very easy-to-read, C-like syntax. Since its launch in 2007, CUDA has become incredibly popular for a wide spectrum of supercomputing applications, from finance to oil and gas. The reason that CUDA ended up as the de facto way to write GPGPU applications is straightforward: it was first. NVIDIA took the idea of pitching GPUs to the HPC market much more seriously than AMD, and the company was very aggressive about improving CUDA and pitching it to developers. The end result is an architecture and software stack that is mature and developer-friendly, both of which go a long way in the HPC world.

The open alternative to CUDA is, of course, OpenCL, and of the three major GPU vendors (including Intel), NVIDIA has far and away the most robust OpenCL support right now. AMD's OpenCL support is less advanced, while Intel's support won't debut until the end of this year. So if NVIDIA has put more effort than everyone else into OpenCL, a framework that is also designed to target both GPUs and CPUs, why go to the trouble of adding x86 as a CUDA target?

There are a few likely answers, the first of which is that user demand is there. Adding x86 support will also open up the possibility of improving CUDA's ease of use for programmers by reducing the number of abstractions that they have to juggle.

Another advantage of adding x86 support to CUDA is that it could speed up adoption of the platform by making it easier to get started with. Not every developer who might like to learn CUDA has access to an NVIDIA GPU, so by expanding the hardware that CUDA can target to include x86, you'll be able to get CUDA on, say, an ATI-equipped iMac.

Finally, there's the secret NVIDIA x86 CPU project, which may or may not factor into this move. If NVIDIA does decide to launch its own x86 part, then having two options for architectures (CUDA and OpenCL) that developers can target, with code running seamlessly across its entire product line, will be a big bonus. The actual work of porting CUDA to x86 will be done by a consulting company, the Portland Group, which is owned by STMicro.

GPGPU

General-purpose computing on graphics processing units


(GPGPU, also referred to as GPGP and, less often, GP²) is the technique of using a GPU, which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the CPU. It is made possible by the addition of programmable stages and higher-precision arithmetic to the rendering pipelines, which allows software developers to use stream processing on non-graphics data.

GPU improvements

GPU functionality has, traditionally, been very limited. In fact, for many years the GPU was only used to accelerate certain parts of the graphics pipeline. Some improvements were needed before GPGPU became feasible.

Programmability

Programmable vertex and fragment shaders were added to the graphics pipeline to enable game programmers to generate even more realistic effects. Vertex shaders allow the programmer to alter per-vertex attributes, such as position, color, texture coordinates, and normal vector. Fragment shaders are used to calculate the color of a fragment, or per-pixel. Programmable fragment shaders allow the programmer to substitute, for example, a lighting model other than those provided by default by the graphics card, typically simple Gouraud shading. Shaders have enabled graphics programmers to create lens effects, displacement mapping, and depth of field.

DirectX 8 introduced Shader Model 1.1; DirectX 8.1 added Pixel Shader Models 1.2, 1.3 and 1.4; and DirectX 9 defined Shader Model 2.x and 3.0. Each shader model increased the flexibility and capability of the programming model, ensuring that conforming hardware followed suit. The DirectX 10 specification introduces Shader Model 4.0, which unifies the programming specification for vertex, geometry ("geometry shaders" are new to DirectX 10) and fragment processing, allowing for a better fit for unified shader hardware and thus providing a single computational pool of programmable resources.

Data types


* 8 bits per pixel – Palette mode, where each value is an index in a table with the real color value specified in one of the other formats. Possibly two bits for red, three bits for green, and three bits for blue.
* 16 bits per pixel – Usually allocated as five bits for red, six bits for green, and five bits for blue.
* 24 bits per pixel – Eight bits for each of red, green, and blue.
* 32 bits per pixel – Eight bits for each of red, green, blue, and alpha.
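The 16- and 32-bit layouts above can be made concrete with small bit-packing helpers; the function names are hypothetical and not part of any graphics API:

```cpp
#include <cstdint>
#include <cassert>

// 16 bpp: five bits red, six bits green, five bits blue,
// packed high-to-low as RRRRRGGGGGGBBBBB.
uint16_t pack_rgb565(uint8_t r5, uint8_t g6, uint8_t b5) {
    return static_cast<uint16_t>(((r5 & 0x1F) << 11) |
                                 ((g6 & 0x3F) << 5)  |
                                  (b5 & 0x1F));
}

// 32 bpp: eight bits each for red, green, blue, and alpha,
// packed high-to-low as RRGGBBAA.
uint32_t pack_rgba8888(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return (uint32_t)r << 24 | (uint32_t)g << 16 |
           (uint32_t)b << 8  | (uint32_t)a;
}
```

Green gets the extra bit in the 16-bit format because the human eye is most sensitive to green.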

GPGPU programming concepts

GPUs are designed specifically for graphics and thus are very restrictive in terms of operations and programming. Because of their nature, GPUs are only effective at tackling problems that can be solved using stream processing and the hardware can only be used in certain ways.

Stream processing

GPUs can only process independent vertices and fragments, but can process many of them in parallel. This is especially effective when the programmer wants to process many vertices or fragments in the same way. In this sense, GPUs are stream processors – processors that can operate in parallel by running a single kernel on many records in a stream at once.

A stream is simply a set of records that require similar computation. Streams provide data parallelism. Kernels are the functions that are applied to each element in the stream. In GPUs, vertices and fragments are the elements in streams, and vertex and fragment shaders are the kernels to be run on them. Since GPUs process elements independently, there is no way to have shared or static data. For each element we can only read from the input, perform operations on it, and write to the output. It is permissible to have multiple inputs and multiple outputs, but never a piece of memory that is both readable and writable.
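The stream/kernel distinction can be sketched in plain C++ as a CPU analogue (on a GPU the kernel would run once per element, in parallel; run_kernel is an illustrative name, not a CUDA API):

```cpp
#include <vector>
#include <functional>
#include <cassert>

// A stream is a set of records needing similar computation; a
// kernel is the function applied to each record. Input and output
// buffers are kept separate, mirroring the GPU rule that no piece
// of memory is both readable and writable by the kernel.
template <typename T>
std::vector<T> run_kernel(const std::vector<T>& stream,
                          std::function<T(T)> kernel) {
    std::vector<T> out;
    out.reserve(stream.size());
    for (const T& record : stream)
        out.push_back(kernel(record));   // each element independent
    return out;
}
```

Because each invocation sees only its own record, the elements could be processed in any order, or all at once in parallel, without changing the result.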

Arithmetic intensity is defined as the number of operations performed per word of memory transferred. It is important for GPGPU applications to have high arithmetic intensity; otherwise, memory access latency will limit the computational speed-up.

Applications of GPU


• k-nearest neighbor algorithm
• Computer clusters, or a variation of parallel computing (utilizing GPU cluster technology), for highly calculation-intensive tasks:
  – High-performance computing clusters (HPC clusters), often referred to as supercomputers
  – Grid computing (a form of distributed computing): networking many heterogeneous computers to create a virtual computer architecture
  – Load-balancing clusters, sometimes referred to as a server farm
• Statistical physics
• Segmentation – 2D and 3D
• Level-set methods
• CT reconstruction
• Audio signal processing
  – Audio and sound effects processing, to use a GPU for DSP (digital signal processing)
  – Analog signal processing
  – Speech processing
• Digital image processing
• Video processing
  – Hardware-accelerated video decoding and post-processing: inverse discrete cosine transform (iDCT), variable-length decoding (VLD), inverse quantization (IQ), in-loop deblocking, and bitstream processing (CAVLC/CABAC) using special-purpose hardware, because bitstream processing is a serial task not suitable for regular GPGPU computation
  – Color correction
  – Hardware-accelerated video encoding and pre-processing
• Global illumination – photon mapping, radiosity, subsurface scattering
• Geometric computing – constructive solid geometry, distance fields, collision detection, transparency computation, shadow generation
• Scientific computing
  – Weather forecasting
  – Climate research
  – Molecular modeling on GPU
  – Quantum mechanical physics
  – Astrophysics


• Bioinformatics
• Computational finance
• Medical imaging
• Computer vision
• Digital signal processing / signal processing
• Control engineering
• Neural networks
• Database operations
• Lattice Boltzmann methods
• Cryptography and cryptanalysis
• Electronic design automation
• Antivirus software
• Intrusion detection

SUPPORTED GPUS


ADVANTAGES


CUDA has several advantages over traditional general purpose computation on GPUs (GPGPU) using graphics APIs.

• Scattered reads – code can read from arbitrary addresses in memory.

• Shared memory – CUDA exposes a fast shared memory region (16 KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.

• Faster downloads and readbacks to and from the GPU.

• Full support for integer and bitwise operations, including integer texture lookups.
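The "user-managed cache" idea can be sketched as a CPU analogue: stage a fixed-size tile of the input into a small local buffer before operating on it, the way a CUDA thread block stages data into its 16 KB of shared memory. Real CUDA code would declare the buffer __shared__; this sketch is illustrative only:

```cpp
#include <vector>
#include <cassert>
#include <cstddef>
#include <algorithm>

// Sum a large array by staging fixed-size tiles into a small local
// buffer first. The tile of 4096 ints is exactly 16 KB, matching
// the shared memory size mentioned above.
long long tiled_sum(const std::vector<int>& data) {
    constexpr size_t TILE = 4096;        // elements per "shared" tile
    int tile_buf[TILE];                  // stand-in for shared memory
    long long total = 0;
    for (size_t base = 0; base < data.size(); base += TILE) {
        size_t n = std::min(TILE, data.size() - base);
        std::copy(data.begin() + base, data.begin() + base + n, tile_buf);
        for (size_t i = 0; i < n; ++i)   // all reads now hit the tile
            total += tile_buf[i];
    }
    return total;
}
```

On a real GPU the payoff comes when many threads reuse the same staged tile, turning repeated slow global-memory reads into fast shared-memory reads.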

LIMITATIONS

CUDA (with compute capability 1.x) uses a recursion-free, function-pointer-free subset of the C language, plus some simple extensions. However, a single process must run spread across multiple disjoint memory spaces, unlike other C language runtime environments. Fermi GPUs now have (nearly) full support of C++.

Code compiled for devices with compute capability 2.0 (Fermi) and greater may make use of C++ classes, as long as none of the member functions are virtual (this restriction will be removed in some future release).

Texture rendering is not supported.

For double precision (only supported in newer GPUs like the GTX 260) there are some deviations from the IEEE 754 standard: round-to-nearest-even is the only supported rounding mode for reciprocal, division, and square root. In single precision, denormals and signalling NaNs are not supported; only two IEEE rounding modes are supported (chop and round-to-nearest-even), and those are specified on a per-instruction basis rather than in a control word; and the precision of division/square root is slightly lower than single precision.

The bus bandwidth and latency between the CPU and the GPU may be a bottleneck.


Threads should be running in groups of at least 32 for best performance, with the total number of threads numbering in the thousands. Branches in the program code do not impact performance significantly, provided that each of 32 threads takes the same execution path; the SIMD execution model becomes a significant limitation for any inherently divergent task (e.g. traversing a space-partitioning data structure during raytracing).

Unlike OpenCL, CUDA-enabled GPUs are only available from NVIDIA (GeForce 8 series and above, Quadro and Tesla).

GPU V/S CPU

Graphics processing units (GPUs) have, for many years, powered the display of images and motion on computer displays. GPUs are now powerful enough to do more than just move images across the screen. They are capable of performing high-end computations that are the staple of many engineering activities.


Benchmarks that focus on floating point arithmetic, those most often used in these engineering computations, show that GPUs can perform such computations much faster than the traditional central processing units (CPUs) used in today’s workstations—sometimes as much as 20 times faster, depending on the computation.

But the performance advantage in these benchmarks doesn’t automatically make it a slam dunk for running engineering applications. Comparing CPUs with GPUs is like comparing apples with oranges.

GPU Challenges and Rewards

The GPU remains a specialized processor, and its performance in graphics computation belies a host of difficulties to perform true general-purpose computing. The processors themselves require recompiling any software; they have rudimentary programming tools, as well as limits in programming languages and features.

These difficulties mean applications are limited to those that commercial software vendors develop and make available to engineering customers, or in some cases, where source code is owned by the engineering firm and ported to the GPU. Vendors have to perceive that a market for a GPU version of their software exists, while engineering groups have to determine that it will pay for them to make the investment in hardware, software and expertise.

Commercial GPU-based systems are becoming increasingly common. NVIDIA, in addition to providing processors to third parties, also builds its own systems and clusters under the Tesla brand. These include the Tesla Personal Supercomputer, which has up to 448 cores in a multiprocessor configuration, with up to 6GB of memory per processor. The cluster systems include either straight GPU or GPU-CPU systems in 1U configurations for the data center. A 1U NVIDIA unit with a quad-processor configuration can do four teraflops of single-precision operations, and about 340 gigaflops of double precision.

In addition, third-party systems are available from engineering system vendors such as Appro, Microway, Supermicro and Tyan. These systems typically provide multiple processors and cores, and deliver high levels of computational power for specific uses.


That concept is a long way from the industry standard Intel and AMD CPUs, which are used to power the majority of workstations (and even high-end supercomputers). Changing that would be an expensive and time-consuming affair for software vendors.

Nevertheless, the cost and performance of GPUs can make a difference in how design engineering is done. Imagine being able to run an analysis on your design 20 times faster than you can today, for example.

Benchmarks

But it's not a simple matter. First of all, "20 times faster" is highly problematic: just because some computations can be sped up by that much doesn't mean that the entire analysis would be. In fact, the overall analysis could even be slower than using a CPU, if the CPU can compute other parts of the analysis faster.

Second, it would be a significant software development effort to run even fairly common code on a GPU. Some types of code may require modification, while other types may not be able to run on the GPU at all. Many engineering software vendors aren’t yet convinced that the effort can pay for itself and make a profit.

So it turns out that you still need the traditional CPU after all. You need it because that is where the vast majority of engineering and office software runs, where the primary software development skill set resides, and whose all-around performance is at least good enough to remain in that role for the foreseeable future.

Intel hasn’t been sitting still as GPUs have increased performance. Up until the beginning of this year, the company had been working on its own multi-core processor, codenamed Larrabee. While it ultimately canceled the initial release of a Larrabee processor, the technology still exists, and will likely find its way into either an Intel-designed GPU or a hybrid CPU.

Such technology may ultimately provide the best of both worlds: compatible performance on most applications, and high performance on engineering computations.


[Figure: CPU vs GPU die layout. The CPU devotes much of its area to control logic and cache alongside a few ALUs; the GPU devotes most of its area to a large array of ALUs, with minimal control logic and cache. Each processor is backed by its own DRAM.]