Financial computing on NVIDIA GPUs

Mike Giles
[email protected]

Oxford University Mathematical Institute
Oxford-Man Institute for Quantitative Finance
Oxford eResearch Centre

Acknowledgments: Gerd Heber, Abinash Pati, Vignesh Sundaresh, Xiaoke Su, and funding from Microsoft, EPSRC, TCS/CRL
Overview
trends in mainstream HPC
the co-processor alternatives
NVIDIA graphics cards
CUDA programming
LIBOR Monte Carlo application
finite difference PDE applications
Computing – Recent Past
driven by the cost benefits of massive economies of scale, specialised chips (e.g. CRAY vector chips) died out, leaving Intel/AMD dominant
Intel/AMD chips designed for office/domestic use, not for high performance computing
increased speed through higher clock frequencies, and complex parallelism within each CPU
PC clusters provided the high-end compute power, initially in universities and then in industry
at the same time, NVIDIA and ATI grew big on graphics chip sales driven by computer games
Computing – Present/Future
move to faster clock frequencies stopped due to high power consumption (proportional to f²?)
big push now is to multicore (multiple processing units within a single chip) at (slightly) reduced clock frequencies
graphics chips have even more cores (up to 240 on NVIDIA GPUs)
– big new development here is a more general purpose programming environment
Why? At least partly because computer games do increasing amounts of “physics” simulation
CPUs and GPUs
[Figure: side-by-side comparison of CPU and GPU architectures, copyright NVIDIA 2006/7]
Mainstream CPUs
currently up to 6 cores – 16 cores likely within 5 years?
intended for general applications
MIMD (Multiple Instruction / Multiple Data)
– each core works independently of the others, executing different instructions, often for different processes
specialised vector capabilities (SSE2/SSE3) for vectors of length 4 (s.p.) or 2 (d.p.) – motivated by graphics requirements but sometimes used for scientific applications?
Mainstream CPUs
How does one exploit all of these cores?
OpenMP multithreading for shared-memory parallelism
– easy to get parallel code running
– can be harder to get good parallel performance
– degree of difficulty: 2/10
MPI message-passing for distributed-memory parallelism
– hard to get started: need to partition data, and programming is low-level and tedious
– generally easier to get good parallel performance
– degree of difficulty: 6/10
Mainstream CPUs
Importance of standards:
makes it possible to write portable code to run on anyhardware
encourages developers to work on code optimisation
encourages academic/commercial development of toolsand libraries to assist application developers
Co-processor alternatives
Cell processor, developed by IBM/Sony/Toshiba for the Sony Playstation 3
GPUs:
NVIDIA GeForce 8 and 9 series GPUs, developed primarily for the high-end computer games market
each card has fast graphics memory which is used for:
– global memory accessible by all multiprocessors
– special read-only constant memory
– additional local memory for each multiprocessor
NVIDIA GPUs
For high-end HPC, NVIDIA have Tesla systems:
C1060 card:
– PCIe card, plugs into a standard PC/workstation
– single GPU with 240 cores and 4GB graphics memory
S1070 server:
– 4 cards packaged in a 1U server
– connects to 2 external servers, one for each pair of cards
– each GPU has 240 cores plus 4GB graphics memory
neither product has any graphics output; both are intended purely for scientific computing
NVIDIA GPUs
Most important hardware feature is that the 8 cores in a multiprocessor are SIMD (Single Instruction / Multiple Data) cores:
all cores execute the same instructions simultaneously
vector style of programming harks back to CRAY vector supercomputing
natural for graphics processing and much scientificcomputing
SIMD is also a natural choice for massively multicore tosimplify each core
requires specialised programming (no standard)
CUDA programming
CUDA is NVIDIA’s program development environment:
based on C with some extensions
lots of example code and good documentation
– 2-4 week learning curve for those with experience of OpenMP and MPI programming
growing user community active on NVIDIA forum
main process runs on the host system (Intel/AMD CPU) and launches multiple copies of the “kernel” process on the graphics card
communication is through data transfers to/from graphics memory
minimum of 4 threads per core, but more is better
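To make this structure concrete, here is a minimal sketch of the host/kernel pattern described above; the kernel name, data and sizes are invented for the illustration:

  // a minimal illustrative sketch, not code from the talk
  #include <stdlib.h>

  __global__ void scale(float *d_x)          // "kernel" executed on the GPU
  {
    int i = blockIdx.x*blockDim.x + threadIdx.x;   // global thread index
    d_x[i] = 2.0f*d_x[i];                          // trivial example work
  }

  int main()
  {
    int    n = 4096, nbytes = n*sizeof(float);
    float *h_x = (float *) malloc(nbytes), *d_x;

    for (int i=0; i<n; i++) h_x[i] = 1.0f;

    cudaMalloc((void **)&d_x, nbytes);                     // graphics memory
    cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);  // host -> device

    scale<<<n/256, 256>>>(d_x);            // 16 blocks of 256 threads each

    cudaMemcpy(h_x, d_x, nbytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_x); free(h_x);              // de-allocate and terminate
    return 0;
  }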
CUDA programming
How hard is it to program?
Needs combination of skills:
splitting the application between the multiple multiprocessors is similar to MPI programming, but no need to split data – it all resides in main graphics memory
SIMD CUDA programming within each multiprocessor is a bit like OpenMP programming – needs good understanding of memory operation
difficulty also depends a lot on application
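To make the analogy concrete, here is a small illustrative kernel (invented for this note, not taken from the talk): the grid of blocks partitions the work in an MPI-like way, while the threads of one block cooperate through shared memory much as OpenMP threads share an address space. It assumes a launch with 256 threads per block:

  // illustrative kernel: each block sums 256 values cooperatively
  __global__ void block_sum(const float *d_in, float *d_out)
  {
    __shared__ float s[256];                // per-multiprocessor shared memory
    int tid = threadIdx.x;

    s[tid] = d_in[blockIdx.x*blockDim.x + tid];  // each thread loads one value
    __syncthreads();                             // wait for all loads

    for (int m=blockDim.x/2; m>0; m/=2) {        // tree reduction in the block
      if (tid < m) s[tid] += s[tid+m];
      __syncthreads();
    }
    if (tid==0) d_out[blockIdx.x] = s[0];        // one partial sum per block
  }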
CUDA programming
One option is to use linear algebra libraries to off-load partsof a calculation:
libraries for BLAS and FFTs (with LAPACK coming soon?)
performance restricted by the 5GB/s bandwidth of the PCIe-2 link between host and graphics card
still, a quick easy win for some applications (e.g. solving 10,000 simultaneous linear equations)
spectral CFD testcase from Univ. of Washington gets 20× speedup using the MATLAB/CUDA interface
degree of difficulty: 2/10
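As a sketch of the off-load pattern, using the CUBLAS API of that era (sizes, names and error checking are simplified for illustration):

  // off-loading a matrix-matrix product C = A*B via the legacy CUBLAS API
  #include <cublas.h>

  void gpu_sgemm(int n, const float *A, const float *B, float *C)
  {
    float *dA, *dB, *dC;
    cublasInit();
    cublasAlloc(n*n, sizeof(float), (void **)&dA);
    cublasAlloc(n*n, sizeof(float), (void **)&dB);
    cublasAlloc(n*n, sizeof(float), (void **)&dC);
    cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);   // PCIe transfer in
    cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);
    cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
    cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);   // PCIe transfer out
    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
  }

Note the trade-off against the PCIe bottleneck: the transfers move O(n²) data while the GPU does O(n³) work, so the off-load only pays for sufficiently large n.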
CUDA programming
Monte Carlo application:
ideal because it is trivially parallel – each path calculation is independent of the others
degree of difficulty: 4/10
we obtained excellent results for a LIBOR model
timings in seconds for 96,000 paths, with 40 active threads per core, each thread doing just one path
the host code:
– launches multiple copies of the execution kernel on the device
– copies back the results from device memory
– de-allocates memory and terminates
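A minimal sketch of the one-thread-per-path pattern: this is a plain geometric Brownian motion payoff, not the LIBOR model itself, and it assumes the normal random numbers have already been generated in d_z, stored timestep-major so that the reads coalesce:

  // one path per thread; launch with gridDim.x*blockDim.x == npath
  __global__ void mc_kernel(const float *d_z, float *d_v, int npath, int M,
                            float S0, float r, float sigma, float dt, float K)
  {
    int   path = blockIdx.x*blockDim.x + threadIdx.x;
    float S = S0;

    for (int m=0; m<M; m++)       // M timesteps along this path
      S *= expf((r-0.5f*sigma*sigma)*dt + sigma*sqrtf(dt)*d_z[m*npath+path]);

    d_v[path] = expf(-r*M*dt)*fmaxf(S-K, 0.0f);   // discounted call payoff
  }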
CUDA multithreading
Lots of active threads is the key to high performance:
no “context switching”: each active thread keeps its own registers, so the size of the register file limits the number of active threads
threads execute in “warps” of 32 threads per multiprocessor (4 per core) – execution alternates between “active” warps, with warps becoming temporarily “inactive” when waiting for data
CUDA multithreading
for each thread, one operation completes long before the next starts – this avoids the complexity of the pipeline overlaps which can limit the performance of modern processors
[Diagram: operations 1-5 of several warps interleaved along a time axis – each warp's next operation begins long after its previous one completes]
memory access from device memory has a delay of 400-600 cycles; with 40 threads this is equivalent to 10-15 operations, and can be managed by the compiler
CUDA programming
Other Monte Carlo considerations:
need RNG routines
– which ones?
– skip-ahead for multiple threads?
need to generate correlated streams (a bit tricky due to the limited shared memory in each 8-core multiprocessor)
QMC much trickier because of the memory requirements for BB or PCA construction
working with NAG to develop a generic Monte Carlo engine
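One possible way to handle the correlation (a sketch, not NAG's engine): factor the correlation matrix as C = L L^T on the host, then apply y = L z to each vector of independent normals on the device. Here L is kept in constant memory, since it is identical for every thread, leaving the scarce shared memory free; the dimension 16 is illustrative:

  #define NDIM 16                     // illustrative number of factors

  // lower-triangular Cholesky factor, copied in via cudaMemcpyToSymbol
  __constant__ float L[NDIM][NDIM];

  // y = L z : turn independent N(0,1) samples z into correlated samples y
  __device__ void correlate(const float *z, float *y)
  {
    for (int i=0; i<NDIM; i++) {
      float sum = 0.0f;
      for (int j=0; j<=i; j++) sum += L[i][j]*z[j];
      y[i] = sum;
    }
  }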
CUDA programming
Finite difference application:
recently started work on 2D/3D finite difference applications:
– Jacobi iteration for the discrete Laplace equation
– CG iteration for the discrete Laplace equation
– ADI time-marching
conceptually straightforward for someone who is used to partitioning grids for MPI implementations
each multiprocessor works on a block of the grid
threads within each block read data into local shared memory, do the calculations in parallel and write the new data back to main device memory
degree of difficulty: 6/10 for explicit solvers, 8/10 for the ADI solver
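A sketch of one such explicit kernel: a single Jacobi iteration for the 2-D Laplace equation, with each block staging its tile (plus a halo of depth 1) in shared memory. Block dimensions are illustrative, and boundary handling assumes the interior dimensions divide exactly by the tile size:

  #define BX 16
  #define BY 16

  // one Jacobi iteration; each block updates a BX x BY tile of interior points
  __global__ void jacobi2d(const float *d_u, float *d_unew, int nx)
  {
    __shared__ float u[BY+2][BX+2];

    int i  = blockIdx.x*BX + threadIdx.x + 1;   // global interior indices
    int j  = blockIdx.y*BY + threadIdx.y + 1;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

    u[ty][tx] = d_u[j*nx+i];                              // tile interior
    if (threadIdx.x==0)    u[ty][0]    = d_u[j*nx+i-1];   // west halo
    if (threadIdx.x==BX-1) u[ty][BX+1] = d_u[j*nx+i+1];   // east halo
    if (threadIdx.y==0)    u[0][tx]    = d_u[(j-1)*nx+i]; // south halo
    if (threadIdx.y==BY-1) u[BY+1][tx] = d_u[(j+1)*nx+i]; // north halo
    __syncthreads();                                      // tile fully loaded

    d_unew[j*nx+i] = 0.25f*( u[ty][tx-1] + u[ty][tx+1]
                           + u[ty-1][tx] + u[ty+1][tx] );
  }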
CUDA programming
3D finite difference implementation:
insufficient shared memory to hold a whole 3D block, so hold 3 working planes at a time (halo depth of 1, just one Jacobi iteration at a time)
key steps in kernel code (sketched below):
– load in the k=0 z-plane (including x and y halos)
– loop over all z-planes:
    load the k+1 z-plane (over-writing the k−2 plane)
    process the k z-plane
    store the new k z-plane
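A sketch of one way to realise this loop: a common variant keeps only the current k-plane in shared memory, with each thread's own k−1 and k+1 values held in registers. Block sizes and divisibility assumptions are as in the 2-D sketch:

  #define BX 16
  #define BY 16

  // plane-by-plane 3-D Jacobi; each block handles one xy-tile, marching in k
  __global__ void jacobi3d(const float *d_u, float *d_unew,
                           int nx, int ny, int nz)
  {
    __shared__ float plane[BY+2][BX+2];         // k-plane plus x/y halos

    int i  = blockIdx.x*BX + threadIdx.x + 1;   // interior point indices
    int j  = blockIdx.y*BY + threadIdx.y + 1;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

    float below = d_u[(0*ny + j)*nx + i];       // k-1 value, in a register
    float mid   = d_u[(1*ny + j)*nx + i];       // k value

    for (int k=1; k<nz-1; k++) {
      float above = d_u[((k+1)*ny + j)*nx + i]; // load the k+1 plane value

      __syncthreads();                 // previous iteration done with 'plane'
      plane[ty][tx] = mid;             // stage the k-plane in shared memory
      if (threadIdx.x==0)    plane[ty][0]    = d_u[(k*ny + j)*nx + i-1];
      if (threadIdx.x==BX-1) plane[ty][BX+1] = d_u[(k*ny + j)*nx + i+1];
      if (threadIdx.y==0)    plane[0][tx]    = d_u[(k*ny + j-1)*nx + i];
      if (threadIdx.y==BY-1) plane[BY+1][tx] = d_u[(k*ny + j+1)*nx + i];
      __syncthreads();                 // k-plane fully staged

      d_unew[(k*ny + j)*nx + i] =      // 7-point Jacobi update
          ( plane[ty][tx-1] + plane[ty][tx+1]
          + plane[ty-1][tx] + plane[ty+1][tx] + below + above ) / 6.0f;

      below = mid;  mid = above;       // cycle the three working planes
    }
  }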
50× speedup relative to a single Xeon core, compared to a 5× speedup using OpenMP with 8 cores.
CUDA programming
Development of PDE demo codes is being funded byTCS/CRL:
TCS: Tata Consultancy Services – India’s biggest IT services company
CRL: Computational Research Laboratories – part of the Tata group, with an HP supercomputer ranked #4 in the Top 500 six months ago (now #8)
demo codes will be made freely available on my website
trying to create a generic 3D library/template to enable easy development of new applications
looking for new test applications
Will GPUs have real impact?
I think they’re the most exciting development since the initial development of PVM and Beowulf clusters
Have generated a lot of interest/excitement in academia, being used by application scientists, not just computer scientists
Potential for 10-100× speedup and improvement in GFLOPS/£ and GFLOPS/watt
Effectively a personal cluster in a PC under your desk
Needs work on tools and libraries to simplify development effort
Webpages
Wikipedia overviews of GeForce cards:
en.wikipedia.org/wiki/GeForce_8_Series
en.wikipedia.org/wiki/GeForce_9_Series
NVIDIA’s CUDA homepage:
www.nvidia.com/object/cuda_home.html