S3012 - Simplifying Portable Killer Apps with OpenACC and CUDA-5 Concisely and Efficiently
Rob Farber, Chief Scientist, BlackDog Endeavors, LLC
• Author, “CUDA Application Design and Development”
• Research consultant: ICHEC, Fortune 100 companies, and others
• Scientist
• Dr. Dobb’s Journal CUDA & OpenACC tutorials
• OpenCL “The Code Project” tutorials
• Columnist: Scientific Computing, and other venues
Farber, “Pragmatic Parallelism Part 1: Introducing OpenACC”
! matrix-acc.f
program example1
  …
  !$acc data copyin(a,b) copy(c)
  !$acc kernels loop
  ! Compute matrix multiplication.
  do i=1, n_size
    do j=1, n_size
      do k = 1, n_size
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      enddo
    enddo
  enddo
  !$acc end data
end program example1
/* matrix-omp.c */
int main()
{
  …
  // Compute matrix multiplication.
  #pragma omp parallel for default(none) shared(a,b,c) private(i,j,k)
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      for (k = 0; k < SIZE; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }
  return 0;
}
• Delve down into lower-level programming when
  – You need higher performance
  – The high-level API does not do what you want
    » It is necessary to use a lower-level capability
    » You need to make use of some hardware feature
“Computational Universality”: An XOR Neural Network
• The XOR example nicely emphasizes the importance of hidden neurons
• They re-represent the input such that the problem becomes linearly separable (see the sketch below)
• Networks with hidden units can implement any Boolean function: they are computationally universal devices!
• Networks without hidden units cannot learn XOR and cannot represent large classes of problems
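To make the hidden-neuron point concrete, here is a minimal sketch (not from the talk; the 2-2-1 network and its hand-picked weights are illustrative, not learned) showing how two hidden units re-represent the inputs so one output unit can separate them:

#include <stdio.h>

/* Heaviside step activation */
static int step(double x) { return x > 0.0 ? 1 : 0; }

/* XOR via two hidden units: h1 = OR(a,b), h2 = AND(a,b),
 * output = h1 AND NOT h2. Neither hidden unit alone solves XOR,
 * but their re-representation is linearly separable. */
static int xor_net(int a, int b)
{
    int h1 = step(1.0*a + 1.0*b - 0.5);  /* fires if a OR b  */
    int h2 = step(1.0*a + 1.0*b - 1.5);  /* fires if a AND b */
    return step(1.0*h1 - 2.0*h2 - 0.5);  /* h1 AND NOT h2    */
}

int main(void)
{
    for (int a = 0; a <= 1; ++a)
        for (int b = 0; b <= 1; ++b)
            printf("%d XOR %d = %d\n", a, b, xor_net(a, b));
    return 0;
}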
NetTalk Sejnowski, T. J. and Rosenberg, C. R. (1986) NETtalk: a parallel network that learns to read aloud, Cognitive Science, 14, 179-211 http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)
[NetTalk demo: 500 learning loops, finished]
"Applications of Neural Net and Other Machine Learning Algorithms to DNA Sequence Analysis", (1989).
NetTalk Sejnowski, T. J. and Rosenberg, C. R. (1986) NETtalk: a parallel network that learns to read aloud, Cognitive Science, 14, 179-211 http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)
[Figure: neural network with internal connections; inputs are letters of text (NetTalk) and DNA bases A, T, C, G]
"Applications of Neural Net and Other Machine Learning Algorithms to DNA Sequence Analysis", A.S. Lapedes, C. Barnes, C. Burks, R.M. Farber, K. Sirotkin, Computers and DNA, SFI Studies in the Sciences of Complexity, vol. VII, Eds. G. Bell and T. Marr, Addison-Wesley, (1989).
Predicting binding affinity (the closer you look, the greater the complexity)
[Electron microscope image]
The question for computational biology
• How do we know you are not playing expensive computer games with our money?
Utilize a blind test
[Figure: neural network with hexamer inputs A0 through A5, internal connections, and output: binding affinity for a specific antibody]
• Possible hexamers: 20^6 = 64M
• 1k - 2k pseudo-random (hexamer, binding affinity) pairs
• Approx. 0.001% sampling
“Learning Affinity Landscapes: Prediction of Novel Peptides”, Alan Lapedes and Robert Farber, Los Alamos National Laboratory Technical Report LA-UR-94-4391 (1994).
Hill climbing to find high affinity
[Figure: neural network with inputs A0 through A5, internal connections, and output Affinity_Antibody]
Learn: Affinity_Antibody = f(A0, …, A5)
Evaluate f over candidate hexamers: f(F,F,F,F,F,F), f(F,F,F,F,F,L), f(F,F,F,F,F,V), f(F,F,F,F,L,L), …, f(P,C,T,N,S,L)
Predict P,C,T,N,S,L has the highest binding affinity
Confirm experimentally
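A minimal sketch of the hill climb itself (not from the talk; score() is a toy stand-in for the trained network's predicted affinity): mutate one position at a time over the 20 amino acids and keep any change that raises the prediction.

#include <stdio.h>

#define LEN 6
static const char AA[] = "ACDEFGHIKLMNPQRSTVWY"; /* the 20 amino acids */

/* Hypothetical stand-in for the trained network's predicted affinity. */
static double score(const char *hex)
{
    double s = 0.0;
    for (int i = 0; i < LEN; ++i) s += (hex[i] * (i + 1)) % 7; /* toy */
    return s;
}

/* Greedy hill climbing: at each position try all 20 residues and keep
 * the best; repeat until no single mutation improves the prediction. */
static void hill_climb(char *hex)
{
    for (int improved = 1; improved; ) {
        improved = 0;
        double best = score(hex);
        for (int i = 0; i < LEN; ++i) {
            char keep = hex[i];
            for (int a = 0; a < 20; ++a) {
                hex[i] = AA[a];
                double s = score(hex);
                if (s > best) { best = s; keep = hex[i]; improved = 1; }
            }
            hex[i] = keep; /* retain the best residue found at position i */
        }
    }
}

int main(void)
{
    char hex[LEN + 1] = "FFFFFF";
    hill_climb(hex);
    printf("predicted best hexamer: %s (score %.1f)\n", hex, score(hex));
    return 0;
}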
Two important points
• The computer appears to correctly predict experimental data
• Demonstrated that complex binding affinity relationships can be learned from a small set of samples
  – Necessary because it is only possible to sample a very small subset of the binding affinity landscape for drug candidates
1995 drug design hardware vs 2013 (analyzed all available chemical databases … TB of data)
• Quad-core 512 MB Sun workstation
  – My Samsung S3 is more powerful and has 2 GB of RAM
• 80 GB disk and a TB DLT tape stacker
  – Replaced by a single TB laptop hard drive
• 60 Gflop/s Connection Machine
  – Replaced by a mobile GeForce GPU
You can change the world! $30M of hardware replaced by a GPU-accelerated laptop
Example: PCA (Principal Components Analysis)
• Widely used in data mining and data reduction
  – Discuss a method proposed by Sanger (1989) (sketched below)
• Extends to Nonlinear PCA (NLPCA)
  – Discuss a method by E. Oja, J. Karhunen, L. Wang, and R. Vigario (1995)
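For context, a hedged sketch of Sanger's 1989 rule (the generalized Hebbian algorithm) in its textbook form; this is not code from the talk, and it assumes the input vectors have been shifted to zero mean:

#include <stdio.h>

#define DIM  4   /* input dimension               */
#define COMP 2   /* leading components to extract */

/* One step of Sanger's rule:
 *   dw[i][j] = eta * y[i] * ( x[j] - sum_{k<=i} y[k]*w[k][j] )
 * Repeated over many zero-mean samples, row i of w converges to the
 * i-th principal component of the data, in order of eigenvalue. */
static void sanger_update(double w[COMP][DIM], const double x[DIM], double eta)
{
    double y[COMP];
    for (int i = 0; i < COMP; ++i) {           /* outputs y = W x */
        y[i] = 0.0;
        for (int j = 0; j < DIM; ++j) y[i] += w[i][j] * x[j];
    }
    for (int i = 0; i < COMP; ++i)
        for (int j = 0; j < DIM; ++j) {
            double recon = 0.0;                /* sum_{k<=i} y[k]*w[k][j] */
            for (int k = 0; k <= i; ++k) recon += y[k] * w[k][j];
            w[i][j] += eta * y[i] * (x[j] - recon);
        }
}

int main(void)
{
    double w[COMP][DIM] = { {0.1, 0.2, 0.0, 0.1},
                            {0.0, 0.1, 0.2, 0.1} };
    double x[DIM] = {1.0, -0.5, 0.25, -0.25};  /* one zero-mean-ish sample */
    for (int t = 0; t < 100; ++t) sanger_update(w, x, 0.01);
    printf("w[0][0] after 100 updates: %f\n", w[0][0]);
    return 0;
}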
[Figure: bottleneck network diagram with rows of input (I), bottleneck (B), and output (O) neurons]
• The general mapping scales according to data
• Exascale capable!
• Provides the ability to compare linear and nonlinear performance
  – The world is nonlinear … so are many computational models

Intel Xeon Phi runs Linux
• Great from a code portability point of view
• Watch out for Operating System jitter! Read: “The Case of the Missing Supercomputer Performance”
TF/s devices open the door to new topics
• Works great for manufacturing optimization
  – Best product for lowest cost of materials
  – Works great for color matching
• Multiterm objective functions
  – Best design for the lowest (cost, weight, {your metric here}, …)
  – A teraflop/s per device can run many optimizations to map the decision space
• Machine learning with memory or variable inputs
  – Recurrent neural networks, IIR filters, …
  – Have to iterate the network during training

You can change the world!
Data handling can take as much time as the computational problem!
• Longhorn GPU capabilities
  – 2,048 GB of GPU memory in 512 Quadro FX 5800 GPUs
• ORNL Titan
  – 112,128 GB of GPU memory in 18,688 K20x GPUs
• Need:
  1. Fast and scalable data load (see the sketch below)
  2. Fast and scalable, heterogeneous, flexible, and robust data preprocessing workflows
• What a mouthful!
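As one hedged illustration of need 1 (plain C with POSIX-style I/O; rank and nRanks are hypothetical and would come from MPI or the job launcher), each reader seeks to its own contiguous slice of a single binary file, so the load scales with the number of readers:

#include <stdio.h>
#include <stdlib.h>

/* Each reader loads its own contiguous slice of one big binary file. */
static float *loadSlice(const char *path, long nFloats,
                        int rank, int nRanks, long *sliceLen)
{
    long chunk = nFloats / nRanks;
    long start = (long)rank * chunk;
    *sliceLen  = (rank == nRanks - 1) ? nFloats - start : chunk;

    float *buf = malloc(*sliceLen * sizeof(float));
    FILE  *f   = fopen(path, "rb");
    if (!buf || !f) { perror(path); exit(1); }

    /* seek past the other readers' slices, then read only ours */
    fseek(f, start * (long)sizeof(float), SEEK_SET);
    fread(buf, sizeof(float), (size_t)*sliceLen, f);
    fclose(f);
    return buf;
}

int main(void)
{
    long n = 0;    /* "examples.bin" is a hypothetical data file */
    float *slice = loadSlice("examples.bin", 1000000L, 0, 4, &n);
    printf("rank 0 loaded %ld floats (first = %f)\n", n, slice[0]);
    free(slice);
    return 0;
}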
Expect 600+ GF/s average sustained performance per device { *big number* here }
Big data social media
• Need a simplifying framework
  – A laptop can represent a billion-node graph
  – People don’t understand billion-node graphs!
    • Million-node graphs are not comprehensible
    • Thousand-node graphs are too complex
    • Hundred-node graphs are still too big
    • A few to tens of nodes are potentially understandable
• Validate against 3rd-party experts and machine metrics
• Understand this is a lens looking into a social reality
• Cannot forget that the computer only represents reality!
Sorry, part of my next talk: S3443 - Clicking GPUs into a Portable, Persistent and Scalable Massive Data Framework
Time: 15:00 - 15:50, Location: Room 230B
Important design concept #2
• Try to maintain just one source tree
  – OpenACC/OpenMP pragmas are interesting
  (disclaimer/shameless commerce: I’m writing an OpenACC book)
OpenACC portability

/* matrix-acc.c */
int main()
{
  …
  // Compute matrix multiplication.
  #pragma acc kernels copyin(a,b) copy(c)
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      for (k = 0; k < SIZE; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }
  return 0;
}
! matrix-acc.f
program example1
  …
  !$acc data copyin(a,b) copy(c)
  !$acc kernels loop
  ! Compute matrix multiplication.
  do i=1, n_size
    do j=1, n_size
      do k = 1, n_size
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      enddo
    enddo
  enddo
  !$acc end data
end program example1
int main()
{
cout << "Hello World" << endl;
// load data and initialize parameters
init();
#pragma acc data \
copyin(param[0:N_PARAM]) \
pcopyin(example[0:nExamples*EXAMPLE_SIZE])
{
optimize( objFunc ); // the optimizer calls the objective function
}
return 0;
}
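The payoff of the enclosing data region is that param and example stay resident on the device across every call the optimizer makes, so only a scalar result crosses PCIe. A hedged sketch of what such an objective function might look like (the least-squares body and data layout are illustrative, not from the talk; the names follow the slide):

/* Hypothetical least-squares objective. The acc data region in main()
 * already placed param and example on the device; present() asserts
 * that, so each call moves only the scalar err across the bus. */
double objFunc(void)
{
    double err = 0.0;
    #pragma acc parallel loop reduction(+:err) \
                present(param[0:N_PARAM], example[0:nExamples*EXAMPLE_SIZE])
    for (int i = 0; i < nExamples; ++i) {
        /* illustrative layout: slot 0 holds the target, slot 1 an input */
        double pred = param[0] * example[i*EXAMPLE_SIZE + 1];
        double d    = pred - example[i*EXAMPLE_SIZE];
        err += d * d;
    }
    return err;
}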
[Diagram: demo codes in C, C++, and Fortran]
Coprocessor and GPU demos shown at SC12 by PGI and CAPS; CAPS demo at SC12 via OpenCL translation.
Lessons
• Use the highest-level interface first
  – Delve down into lower-level programming when
    • You need higher performance
    • The high-level API does not do what you want
• Use a single source tree (sketch below)
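One way to keep a single source tree serving both pragma families (a sketch, not from the talk; _OPENACC is the standard feature macro every OpenACC compiler defines):

/* The same loop builds for OpenMP hosts and OpenACC accelerators;
 * whichever toolchain compiles it selects its own pragma. */
void scale(float *x, int n, float a)
{
#ifdef _OPENACC
    #pragma acc kernels loop copy(x[0:n])
#else
    #pragma omp parallel for
#endif
    for (int i = 0; i < n; ++i)
        x[i] *= a;
}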
OpenACC source tree
[Diagram: a single source tree with C/C++ (legacy and new) files, Fortran (legacy and new) files, and CUDA files, with translation to OpenCL]
• Will OpenCL match CUDA-5 features like dynamic parallelism?
  – Dynamic parallelism is part of the OpenACC version 2 specification
  – Necessary for divide-and-conquer problems, among others
CUDA + Primitive Restart (a potent combination!)
Primitive restart:
  – A feature of OpenGL 3.1
  – Roughly 60x faster than optimized OpenGL
  – Avoids the PCIe bottleneck
  – Variable-length data works great!
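A minimal sketch of the OpenGL 3.1 mechanism (assumes GLEW and a current GL 3.1+ context; the index-buffer name is hypothetical): a single glDrawElements call renders many variable-length strips, and a sentinel index tells the GPU where each new strip begins, so the variable-length data never re-crosses PCIe.

#include <GL/glew.h>   /* assumes GLEW and a current OpenGL 3.1+ context */

/* Draw several variable-length triangle strips with one call: the
 * sentinel index restarts the strip on the GPU side. */
void drawStrips(GLuint indexBuffer)   /* indexBuffer: hypothetical name */
{
    const GLuint RESTART = 0xFFFFFFFFu;
    const GLuint indices[] = { 0, 1, 2, 3,   /* first strip       */
                               RESTART,      /* start a new strip */
                               4, 5, 6 };    /* second strip      */

    glEnable(GL_PRIMITIVE_RESTART);
    glPrimitiveRestartIndex(RESTART);

    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexBuffer);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices,
                 GL_STATIC_DRAW);
    glDrawElements(GL_TRIANGLE_STRIP, 8, GL_UNSIGNED_INT, 0);
}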
LiDAR: 131M points at 15 - 33 FPS (C2070)
In collaboration with Global Navigation Sciences (http://globalnavigationsciences.com/)
Sedláček, M. (2004). “Evaluation of RGB and HSV Models in Human Faces Detection.” Central European Seminar on Computer Graphics, Budmerice; CompSysTech’2004, pp. 125-131.
The entire segmentation method:

__global__ void kernelSkin(float4 *pos, uchar4 *colorPos,
                           unsigned int width, unsigned int height,
                           int lowPureG, int highPureG,
                           int lowPureR, int highPureR)
{
  // One thread per pixel
  unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
  unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
  int r = colorPos[y*width+x].x;
  int g = colorPos[y*width+x].y;
  int b = colorPos[y*width+x].z;
  // Normalized ("pure") chromaticity, scaled to 0-255
  int pureR = 255*( ((float)r)/(r+g+b) );
  int pureG = 255*( ((float)g)/(r+g+b) );
  // Zero out any pixel whose chromaticity falls outside the skin-tone box
  if( !( (pureG > lowPureG) && (pureG < highPureG)
      && (pureR > lowPureR) && (pureR < highPureR) ) )
    colorPos[y*width+x] = make_uchar4(0,0,0,0);
}
For the demo, think Kinect and 3D morphing for augmented reality (identify flesh-colored blobs for hands).
Artifacts are caused by picking a colorspace rectangle rather than an ellipse.
Manipulating real-time video (Chapter 12 source code).