© NVIDIA Corporation 2011

The ‘Super’ Computing Company

From Super Phones to Super Computers

CUDA 4.0


CUDA Toolkit 4.0 Release Candidate: Available to Registered Developers on March 4th

Press Embargo : February 28th – 6am PST (San Francisco)


CUDA 4.0: Application Porting Made Simpler

• Rapid Application Porting: Unified Virtual Addressing

• Faster Multi-GPU Programming: GPUDirect 2.0

• Easier Parallel Programming in C++: Thrust


CUDA 4.0 for Broader Developer Adoption

• CUDA 1.0 (2007): Researchers and Early Adopters

• CUDA 2.0 (2008): Scientists and HPC Applications

• CUDA 3.0 (2009): Application Innovation Leaders

• CUDA 4.0 (2011): Broader Developer Adoption


NVIDIA GPUDirect™: Towards Eliminating the CPU Bottleneck

Version 1.0, for applications that communicate over a network

• Direct access to GPU memory for 3rd party devices

• Eliminates unnecessary system memory copies & CPU overhead

• Supported by Mellanox and QLogic

• Up to 30% improvement in communication performance

Version 2.0, for applications that communicate within a node

• Peer-to-Peer memory access, transfers & synchronization

• MPI implementations natively support GPU data transfers

• Less code, higher programmer productivity

Details @ http://www.nvidia.com/object/software-for-tesla-products.html


Before GPUDirect v2.0: Required Copy into Main Memory

[Diagram: GPU1 with GPU1 memory and GPU2 with GPU2 memory, attached over PCI-e to the chipset, alongside the CPU and system memory.]


GPUDirect v2.0: Peer-to-Peer Communication

Direct Transfers between GPUs

[Diagram: GPU1 with GPU1 memory and GPU2 with GPU2 memory, attached over PCI-e to the chipset, alongside the CPU and system memory.]


Unified Virtual Addressing: Easier to Program with Single Address Space

No UVA: Multiple Memory Spaces

[Diagram: system memory (CPU), GPU0 memory, and GPU1 memory, connected over PCI-e. Each memory has its own address range, 0x0000 to 0xFFFF.]

UVA: Single Address Space

[Diagram: system memory (CPU), GPU0 memory, and GPU1 memory, connected over PCI-e. All three share one address range, 0x0000 to 0xFFFF.]
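The difference between the two layouts can be illustrated with a toy model. This is only a sketch in host-side C++ and does not call the CUDA API; `Region`, `make_unified_layout`, and `owner_of` are hypothetical names invented for the illustration, and the region size is arbitrary. With separate memory spaces, an address such as 0x1000 exists in every memory, so the program has to track which memory each pointer belongs to. With a single address space, each memory occupies a disjoint range, so the owner can be recovered from the address alone.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy model only (assumption): one contiguous range per memory.
struct Region {
    std::string owner;   // e.g. "system", "gpu0", "gpu1"
    std::uint64_t base;  // first address of the range
    std::uint64_t size;  // number of addresses in the range
};

// Place the memories back to back in one shared address space.
std::vector<Region> make_unified_layout(const std::vector<std::string>& owners,
                                        std::uint64_t size_each) {
    std::vector<Region> layout;
    std::uint64_t next_base = 0;
    for (const std::string& owner : owners) {
        layout.push_back({owner, next_base, size_each});
        next_base += size_each;
    }
    return layout;
}

// Recover the owning memory from the address alone.
// Returns "unmapped" if no region contains the address.
std::string owner_of(const std::vector<Region>& layout, std::uint64_t addr) {
    for (const Region& r : layout) {
        if (addr >= r.base && addr - r.base < r.size) {
            return r.owner;
        }
    }
    return "unmapped";
}
```

In this model, a layout built from "system", "gpu0", and "gpu1" with 0x10000 addresses each maps 0x0000 to system memory and 0x20000 to GPU1 memory, with no extra bookkeeping by the caller.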


C++ Templatized Algorithms & Data Structures (Thrust)

Powerful open source C++ parallel algorithms & data structures

Similar to C++ Standard Template Library (STL)

Automatically chooses the fastest code path at compile time

Divides work between GPUs and multi-core CPUs

Parallel sorting @ 5x to 100x faster than STL and TBB

Data Structures

• thrust::device_vector

• thrust::host_vector

• thrust::device_ptr

• Etc.

Algorithms

• thrust::sort

• thrust::reduce

• thrust::exclusive_scan

• Etc.

http://code.google.com/p/thrust/downloads/list
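Since Thrust is described as similar to the STL, the three algorithms listed above have close standard-library counterparts. The sketch below is host-only C++ and uses neither Thrust nor a GPU; `Summary`, `summarize`, and `exclusive_scan_host` are hypothetical names chosen for the illustration. Each step is commented with the Thrust algorithm it stands in for.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Result of the three-step pipeline (hypothetical type for this sketch).
struct Summary {
    std::vector<int> sorted;   // input in ascending order
    int total;                 // sum of all elements
    std::vector<int> offsets;  // exclusive prefix sums of the sorted input
};

// Exclusive scan: out[i] is the sum of in[0..i-1], so out[0] is 0.
// Stands in for thrust::exclusive_scan.
std::vector<int> exclusive_scan_host(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

Summary summarize(std::vector<int> values) {
    // Stands in for thrust::sort.
    std::sort(values.begin(), values.end());
    // Stands in for thrust::reduce (sum with initial value 0).
    int total = std::accumulate(values.begin(), values.end(), 0);
    std::vector<int> offsets = exclusive_scan_host(values);
    return Summary{values, total, offsets};
}
```

For the input {3, 1, 2}, this produces the sorted sequence {1, 2, 3}, a total of 6, and offsets {0, 1, 3}.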

[Chart: C and C++ marked as the Parallel Programming Sweet Spot. Source: http://www.tiobe.com]


CUDA 4.0: Highlights

Easier Parallel Application Porting

• Share GPUs across multiple threads

• Single thread access to all GPUs

• No-copy pinning of system memory

• New CUDA C/C++ features

• Thrust templated primitives library

• NPP image/video processing library

• Layered Textures

New & Improved Developer Tools

• Auto Performance Analysis

• C++ Debugging

• GPU Binary Disassembler

• cuda-gdb for MacOS

Faster Multi-GPU Programming

• Unified Virtual Addressing

• NVIDIA GPUDirect™ v2.0

• Peer-to-Peer Access

• Peer-to-Peer Transfers

• GPU-accelerated MPI


GPU Technology Conference 2011: Oct. 11-14 | San Jose, CA

3rd annual GPU Technology Conference

New for 2011:

• Co-located with Los Alamos HPC Symposium

• 300+ Research Scientists from National Labs

2010 highlights

• 280 hours of sessions

• 100+ Research posters

• 42 countries represented

www.gputechconf.com


BACKGROUND SLIDES: CUDA 4.0


NVIDIA CUDA Summary

Libraries

New in CUDA 4.0

• Thrust C++ Library

• Templated Performance Primitives

NVIDIA Library Support

• Complete math.h

• Complete BLAS Library (1, 2 and 3)

• Sparse Matrix Math Library

• RNG Library

• FFT Library (1D, 2D and 3D)

• Image Processing Library (NPP)

• Video Processing Library (NPP)

3rd Party Math Libraries

• CULA Tools

• MAGMA

• IMSL

• VSIPL

Tools

New in CUDA 4.0

• Parallel Nsight Pro

NVIDIA Tools Support

• Parallel Nsight 1.0 IDE

• cuda-gdb Debugger with multi-GPU

• CUDA/OpenCL Visual Profiler

• CUDA Memory Checker

• CUDA C SDK

• CUDA Disassembler

CUDA Partner Tools

• Allinea DDT

• RogueWave/TotalView

• Vampir

• Tau

• CAPS HMPP

Platform

New in CUDA 4.0

• GPUDirect 2.0: Fast Path to Data

Hardware Support

• ECC Memory

• Double Precision

• Native 64-bit Architecture

• Concurrent Kernel Execution

• Dual Copy Engines

• Multi-GPU support

• 6GB per GPU supported

Operating System Support

• MS Windows 32/64

• Linux 32/64 support

• Mac OSX support

• Cluster Management

• GPUDirect

• Tesla Compute Cluster (TCC)

• Graphics Interoperability

Programming Model

New in CUDA 4.0

• Unified Virtual Addressing

• C++ new/delete

• C++ Virtual Functions

C support

• NVIDIA C Compiler

• CUDA C Parallel Extensions

• Function Pointers

• Recursion

• Atomics

• malloc/free

C++ support

• Classes/Objects

• Class Inheritance

• Polymorphism

• Operator Overloading

• Class Templates

• Function Templates

• Virtual Base Classes

• Namespaces

Fortran, OpenCL


cuda-gdb Now Available for MacOS

Details @ http://developer.nvidia.com/object/cuda-gdb.html



Automated Performance Analysis in Visual Profiler

Summary analysis & hints

• Session

• Device

• Context

• Kernel

New UI for kernel analysis

• Identify limiting factor

• Analyze instruction throughput

• Analyze memory throughput

• Analyze kernel occupancy


NVIDIA Parallel Nsight™

Professional features now available free of charge!

Key Features

Professional Profiler Standard

Microsoft Visual Studio 2010 support

Single System Debugging

Tesla Compute Cluster

CUDA Toolkit 3.2


CUDA 3rd Party Ecosystem

Tools

Parallel Debuggers:

• Visual Studio IDE with Parallel Nsight Pro

• Allinea DDT Debugger

• TotalView Debugger

Performance Tools:

• ParaTools VampirTrace

• TauCUDA Performance Tools

• PAPI

• HPC Toolkit

Compute Platform Providers

Cloud Compute:

• Amazon EC2

• Peer 1

OEMs:

• Dell

• HP

• IBM

Cluster Tools

Cluster Management:

• Platform LSF Cluster Manager

• Platform Symphony

• Bright Cluster Manager

Job Scheduling:

• Altair PBS

• Cluster Resources TORQUE

MPI Libraries

• MPI

• OpenMPI

• QLogic OFED

Compilers

• PGI CUDA Fortran

• PGI Accelerators

• PGI CUDA x86

• CAPS HMPP

• TidePowerd GPU.net

• pyCUDA



NVIDIA CUDA Developer Resources

ENGINES & LIBRARIES

• Math Libraries: CUFFT, CUBLAS, CUSPARSE, CURAND

• 3rd Party Libraries: CULA LAPACK, VSIPL

• NPP Image Libraries: Performance primitives for imaging

• App Acceleration Engines: Ray Tracing (OptiX, iRay)

• Video Libraries: NVCUVID / NVCUVENC

DEVELOPMENT TOOLS

• CUDA Toolkit: Complete GPU computing development kit

• cuda-gdb: GPU hardware debugging

• Visual Profiler: GPU hardware profiler for CUDA C and OpenCL

• Parallel Nsight: Integrated development environment for Visual Studio

SDKs AND CODE SAMPLES

• GPU Computing SDK: CUDA C/C++, DirectCompute, OpenCL code samples and documentation

• Books: CUDA by Example, GPU Gems

• Optimization Guides: Best Practices for GPU computing and graphics development

http://developer.nvidia.com


Proven Research Vision

Johns Hopkins University

Nanyang University

Technical University-Czech

CSIRO

SINTEF

HP Labs

ICHEC

Barcelona Supercomputing Center

Clemson University

Fraunhofer SCAI

Karlsruhe Institute of Technology

World Class Research Leadership and Teaching

University of Cambridge

Harvard University

University of Utah

University of Tennessee

University of Maryland

University of Illinois at Urbana-Champaign

Tsinghua University

Tokyo Institute of Technology

Chinese Academy of Sciences

National Taiwan University

Georgia Institute of Technology

http://research.nvidia.com

GPGPU Education: 350+ Universities

Academic Partnerships / Fellowships

GPU Computing Research & Education

Mass. Gen. Hospital/NE Univ

North Carolina State University

Swinburne University of Tech.

Technische Univ. Munich

UCLA

University of New Mexico

University Of Warsaw-ICM

VSB-Tech University of Ostrava

And more coming shortly.


CUDA Applications Momentum Increasing


Today’s CUDA CAE Solutions

Structural Mechanics

• ANSYS Mechanical

• AFEA

• Abaqus/Standard (beta)

Fluid Dynamics

• AcuSolve

• Moldflow

• Culises (OpenFOAM)

• Particleworks

Electromagnetics

• Nexxim

• EMPro

• CST MS

• XFdtd

• SEMCAD X

http://www.simulia.com/index.html

http://www.remcom.com/

http://www.speag.com/
