Product Availability Update

| Product   | Inventory                          | Lead time for big orders | Notes                          |
| C1060     | 200 units                          | 8 weeks                  | Build to order                 |
| M1060     | 500 units                          | 8 weeks                  | Build to order                 |
| S1070-400 | 50 units                           | 10 weeks                 | Build to order                 |
| S1070-500 | 25 units + 75 being built          | 10 weeks                 | Build to order                 |
| M2050     | Shipping now; building 20K for Q2  | 8 weeks                  | Sold out through mid-July      |
| S2050     | Shipping now; building 200 for Q2  | 8 weeks                  | Sold out through mid-July      |
| C2050     | 2000 units                         | 8 weeks                  | Will maintain inventory        |
| M2070     | Sept 2010                          |                          | Get PO in now to get priority  |

Parallel Processing on GPUs in the Fermi Architecture
(original title: Processamento Paralelo em GPUs na Arquitetura Fermi)
Arnaldo Tavares, Tesla Sales Manager for Latin America

Quadro or Tesla?
TESLA™ / QUADRO™
GPU Computing: CPU + GPU co-processing.
CPU: 4 cores, 48 GigaFlops (DP). GPU: 448 cores.
[Chart: performance in Gigaflops (scale 0-3000) and power in Megawatts (scale 0-8) for Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, and Tera 1000]
8
3 of Top 5 Supercomputers
[Chart: the same six systems and axes, repeated from the previous slide]
9
What if Every Supercomputer Had Fermi?
Top 500 Supercomputers (Nov 2009)
[Chart: Linpack Teraflops (scale 0-1000) for Oak Ridge National Laboratory, Lawrence Livermore National Laboratory, IDRIS, a network company, IT service providers (D), semiconductor companies (P/O), Merlion Trade GmbH, a geoscience company (P), and hosting services]
150 GPUs, 37 TeraFlops, $740K: Top 150
225 GPUs, 55 TeraFlops, $1.1M: Top 100
450 GPUs, 110 TeraFlops, $2.2M: Top 50
10
Hybrid ExaScale Trajectory
2008: 1 TFLOP, 7.5 KWatts
2010: 1.27 PFLOPS, 2.55 MWatts
2017*: 2 EFLOPS, 10 MWatts
* This is a projection based on Moore's law and does not represent a committed roadmap.
11
Tesla Roadmap
12
The March of the GPUs
[Chart: Peak memory bandwidth, GBytes/sec (scale 0-250), 2007-2012: NVIDIA T10, T20, T20A vs Intel Nehalem 3 GHz, Westmere 3 GHz, and 8-core Sandy Bridge 3 GHz]
[Chart: Peak double precision FP, GFlops/sec (scale 0-1200), 2007-2012: same processors; NVIDIA GPU figures with ECC off]
13
Project Denver
14
Expected Tesla Roadmap with Project Denver
15
Workstation / Data Center Solutions
Workstations: up to 4x Tesla C2050/70 GPUs
Integrated CPU-GPU server: 2x Tesla M2050/70 GPUs in 1U
OEM CPU server + Tesla S2050/70: 4 Tesla GPUs in 2U
16
Tesla C-Series Workstation GPUs

|                         | Tesla C2050                  | Tesla C2070                 |
| Processor               | Tesla 20-series GPU          | Tesla 20-series GPU         |
| Number of cores         | 448                          | 448                         |
| Caches                  | 64 KB L1 cache + shared memory per 32 cores; 768 KB L2 cache | same |
| Peak floating point     | 1030 Gigaflops (single), 515 Gigaflops (double) | same     |
| GPU memory              | 3 GB (2.625 GB with ECC on)  | 6 GB (5.25 GB with ECC on)  |
| Memory bandwidth        | 144 GB/s (GDDR5)             | 144 GB/s (GDDR5)            |
| System I/O              | PCIe x16 Gen2                | PCIe x16 Gen2               |
| Power                   | 238 W (max)                  | 238 W (max)                 |
| Availability            | Shipping now                 | Shipping now                |
17
How is the GPU Used?
Basic component: the "Streaming Multiprocessor" (SM)
SIMD: "Single Instruction, Multiple Data"
The same instruction is issued to all cores, but each core operates on different data
"SIMD at the SM, MIMD at the GPU chip"
Source: Presentation from Felipe A. Cruz, Nagasaki University
18
The Use of GPUs and Bottleneck Analysis
Source: Presentation from Takayuki Aoki, Tokyo Institute of Technology
19
The Fermi Architecture
3 billion transistors
16 Streaming Multiprocessors (SMs)
6 x 64-bit memory partitions = 384-bit memory interface
Host interface: connects the GPU to the CPU via PCI-Express
GigaThread global scheduler: distributes thread blocks to SM thread schedulers
20
SM Architecture
[Diagram: one SM: instruction cache; dual scheduler/dispatch units; register file; 32 CUDA cores; 16 load/store units; 4 special function units; interconnect network; 64 KB configurable cache/shared memory; uniform cache]
32 CUDA cores per SM (512 total)
16 load/store units: source and destination addresses calculated for 16 threads per clock
4 special function units (sine, cosine, square root, etc.)
64 KB of RAM, configurable between shared memory and L1 cache
Dual warp scheduler
21
Dual Warp Scheduler
1 Warp = 32 parallel threads
2 Warps issued and executed concurrently
Each Warp goes to 16 CUDA Cores
Most instructions can be dual issued (exception: Double Precision instructions)
Dual-Issue Model allows near peak hardware performance
22
CUDA Core Architecture
[Diagram: the SM block diagram as before, zooming into a single CUDA core: dispatch port, operand collector, FP unit, INT unit, result queue]
New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
Newly designed integer ALU optimized for 64-bit and extended precision operations
Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision
23
Fused Multiply-Add Instruction (FMA)
24
GigaThreadTM Hardware Thread Scheduler (HTS)
Hierarchically manages thousands of simultaneously active threads
10x faster application context switching (each program receives a time slice of processing resources)
void saxpy_serial(float ... )
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

void main( )
{
    float x;
    saxpy_serial(..);
    ...
}

[Build flow: C CUDA key kernels -> NVCC (Open64) -> CUDA object files; rest of C application -> CPU compiler -> CPU object files; both -> linker -> CPU-GPU executable]
Modify into parallel CUDA code
34
C for CUDA : C with a few keywords
Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C for CUDA code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
35
Software Programming
Source: Presentation from Andreas Klöckner, NYU
36
[Slides 37-43: further image-only "Software Programming" slides from the same source]
CUDA C/C++ Leadership
Timeline (2007-2010): July 07, Nov 07, April 08, Aug 08, July 09, Nov 09, Mar 10

CUDA Toolkit 1.0: C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples
CUDA Toolkit 1.1: Win XP 64, atomics support, multi-GPU support
CUDA Toolkit 2.0: double precision, compiler optimizations, Vista 32/64, Mac OS X, 3D textures, HW interpolation
CUDA Visual Profiler 2.2
CUDA Toolkit 2.3: DP FFT, 16-to-32-bit conversion intrinsics, performance enhancements
cuda-gdb HW debugger
Parallel Nsight Beta / CUDA Toolkit 3.0: C++ inheritance, Fermi architecture support, tools updates, driver/runtime interop
44
Why should I choose Tesla over consumer cards?

Features
- 4x higher double precision (on 20-series): higher performance for scientific CUDA applications
- ECC, only on Tesla & Quadro (20-series): data reliability inside the GPU and on DRAM memories
- Bi-directional PCI-E communication (Tesla has dual DMA engines; GeForce has only one): higher performance for CUDA applications by overlapping communication and computation
- Larger memory for larger data sets (3 GB and 6 GB products): higher performance on a wide range of applications (medical, oil & gas, manufacturing, FEA, CAE)
- Cluster management software tools available on Tesla only: needed for GPU monitoring and job scheduling in data center deployments
- TCC (Tesla Compute Cluster) driver for Windows, supported only on Tesla: higher performance for CUDA applications due to lower kernel launch overhead; TCC adds support for RDP and services
- Integrated OEM workstations and servers: trusted, reliable systems built for Tesla products
- Professional ISVs certify CUDA applications only on Tesla: bug reproduction, support, and feature requests for Tesla only

Quality & Warranty
- 2-to-4-day stress testing and memory burn-in for reliability; added margin in memory and core clocks; built for 24/7 computing in data center and workstation environments
- Manufactured and guaranteed by NVIDIA: no changes in key components like GPU and memory without notice; always the same clocks for known, reliable performance
- 3-year warranty from HP: reliable, long-life products

Support & Lifecycle
- Enterprise support, higher priority for CUDA bugs and requests; ability to influence the CUDA and GPU roadmap; early access to feature requests