Outline Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed memory environments Using the Parallel Computing Toolbox Using the Matlab Distributed Compute Engine Server Using pMatlab Mixing Matlab and Fortran/C code Compiling MEX code from C/Fortran Turning Matlab routines into C code
Outline
• Speeding up Matlab Computations
  • Symmetric Multi-Processing with Matlab
  • Accelerating Matlab computations with GPUs
  • Running Matlab in distributed memory environments
    • Using the Parallel Computing Toolbox
    • Using the Matlab Distributed Compute Engine Server
    • Using pMatlab
• Mixing Matlab and Fortran/C code
  • Compiling MEX code from C/Fortran
  • Turning Matlab routines into C code
Symmetric Multi-Processing
By default Matlab uses all cores on a given node for operations that can be threaded.

Arrays and matrices
• Basic information: ISFINITE, ISINF, ISNAN, MAX, MIN
• Operators: +, -, .*, ./, .\, .^, *, ^, \ (MLDIVIDE), / (MRDIVIDE)
• Array operations: PROD, SUM
• Array manipulation: BSXFUN, SORT

Linear algebra
• Matrix Analysis: DET, RCOND
• Linear Equations: CHOL, INV, LDL, LINSOLVE, LU, QR
• Eigenvalues and singular values: EIG, HESS, SCHUR, SVD, QZ

Elementary math
• Trigonometric: ATAN2, COS, CSC, HYPOT, SEC, SIN, TAN, including variants for radians, degrees, hyperbolics, and inverses
• Exponential: EXP, POW2, SQRT
• Complex: ABS
• Rounding and remainder: CEIL, FIX, FLOOR, MOD, REM, ROUND
• LOG, LOG2, LOG10, LOG1P, EXPM1, SIGN, BITAND, BITOR, BITXOR

Special Functions
• ERF, ERFC, ERFCINV, ERFCX, ERFINV, GAMMA, GAMMALN

Data Analysis
• CONV2, FILTER, FFT and IFFT of multiple columns or long vectors, FFTN, IFFTN
Symmetric Multi-Processing
To be sure you only use the resources you request, either request an entire node with all of its CPUs, or request a single CPU and tell Matlab to use only a single thread.
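One way to hold Matlab to a single computational thread from inside a session is the built-in maxNumCompThreads function; starting Matlab with the -singleCompThread option is the command-line equivalent. The sketch below is illustrative only: the 2000-by-2000 matrix is an arbitrary stand-in for real work, and the job-submission side is not shown because the original job script is not part of this transcript.

```matlab
% Limit this Matlab session to one computational thread.
previous = maxNumCompThreads(1);   % returns the limit that was in effect before
fprintf('Thread limit was %d, now %d\n', previous, maxNumCompThreads);

% Threaded operations such as matrix multiply now stay on a single core.
A = rand(2000);                    % arbitrary test matrix (assumption)
tic; B = A*A; toc
```

Restoring the earlier limit afterwards is a matter of calling maxNumCompThreads(previous).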
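Using GPUs with Matlab
The workflow has three steps: distribute the data to the GPU (gpuArray), perform the FFT on the GPU (fft2), and gather the data from the GPU back onto the CPU (gather).
For our example, doing the FFT on the GPU is about 6.5x faster (1.21 s on the CPU versus roughly 0.19 s on the GPU once the 0.16 s transfer time is subtracted), or about 3.5x faster if you include moving the data to the GPU and back.
>> H=hilb(5000);
>> tic; A=gather(gpuArray(H)); toc
Elapsed time is 0.161166 seconds.
>> tic; A=gather(fft2(gpuArray(H))); toc
Elapsed time is 0.348159 seconds.
>> tic; A=fft2(H); toc
Elapsed time is 1.210464 seconds.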
Using GPUs with Matlab
Matlab has no built-in hilb() function that can run on the GPU, but we can write our own function (kernel) in CUDA, save it to hilbert.cu, and compile it with nvcc to generate a Parallel Thread eXecution (PTX) file.
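The kernel source and the Matlab code that loads it are not preserved in this transcript. As a rough sketch of the Matlab side, a PTX file can be wrapped with parallel.gpu.CUDAKernel and launched with feval. Everything specific here is an assumption: the kernel signature `__global__ void hilbert(double *H, int n)`, the 16-by-16 thread block, and the file names hilbert.ptx and hilbert.cu sitting in the current directory.

```matlab
% Sketch only. Assumes hilbert.cu defines the hypothetical kernel
%   __global__ void hilbert(double *H, int n)
% and was compiled beforehand with:  nvcc -ptx hilbert.cu
n = 5000;
k = parallel.gpu.CUDAKernel('hilbert.ptx', 'hilbert.cu');
k.ThreadBlockSize = [16 16 1];                  % 256 threads per block (assumed layout)
k.GridSize = [ceil(n/16) ceil(n/16) 1];         % enough blocks to cover an n-by-n matrix

% The non-const pointer argument comes back as the output gpuArray.
H_ = feval(k, zeros(n, 'gpuArray'), int32(n));

% The matrix never leaves the GPU until the final gather.
Z = gather(fft2(H_));
```

Building the matrix on the GPU avoids the host-to-device copy that gpuArray(hilb(n)) would require.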
You should create a pool that is the same size as the number of processors you requested in your job submission. MathWorks also sells licenses for the Distributed Computing Server, which allows matlabpools that span more than just the local node.
You can achieve parallelism in several ways:
• parfor loops: execute for loops in parallel
• spmd: execute instructions in parallel (using ‘labindex’ or ‘drange’)
• pmode: interactive version of spmd
• distributed arrays: very similar to gpuArrays
Parallel Computing Toolbox
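The parfor variant of the running example is not preserved in this transcript, so the following is only a sketch of what a parfor loop over Hilbert-matrix FFTs could look like. The peak array is a hypothetical addition used to return one value per iteration. A pool is assumed to be open already (matlabpool(4) in the releases these slides target; later releases use parpool(4) instead).

```matlab
% Sketch only: iterations are independent, so parfor may run them in any order.
N = 100;
peak = zeros(1, N);            % sliced output variable: one slot per iteration
parfor n = 1:N
    H = hilb(n);               % n-by-n Hilbert matrix
    Z = fft2(H);               % 2D FFT of this iteration's matrix
    peak(n) = max(abs(Z(:)));  % hypothetical per-iteration result
end
```

Each iteration writes only to peak(n), which is what lets Matlab split the loop across workers without conflicts.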
matlabpool(4)
spmd
    for n=drange(1:100)
        H=hilb(n);
        Z=fft2(H);
        f=figure('Visible','off');
        imagesc(log(abs(Z)));
    end
end
matlabpool close
matlabpool(4)
spmd
    for n=labindex:numlabs:100
        H=hilb(n);
        Z=fft2(H);
        f=figure('Visible','off');
        imagesc(log(abs(Z)));
    end
end
matlabpool close
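To see which iterations each worker receives from the labindex:numlabs:100 stride, the split can be emulated in plain serial Matlab. In this sketch numWorkers and lab are stand-ins for numlabs and labindex, so no pool is needed.

```matlab
% Serial emulation of the strided split used inside spmd.
numWorkers = 4;                  % stands in for numlabs
N = 100;
counts = zeros(1, numWorkers);
for lab = 1:numWorkers           % stands in for labindex
    idx = lab:numWorkers:N;      % iterations this worker would execute
    counts(lab) = numel(idx);
    fprintf('lab %d handles n = %d, %d, ..., %d (%d iterations)\n', ...
            lab, idx(1), idx(2), idx(end), counts(lab));
end
```

Because hilb(n) and fft2 get more expensive as n grows, an interleaved split like this gives every worker a mix of small and large n. As far as I know, drange instead hands each worker a contiguous block of the range, which would leave the last worker with the largest matrices.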
Parallel Computing Toolbox
What about building the Hilbert matrix in parallel?

matlabpool(4)
spmd
    % Define partition
    codist=codistributor1d(1,[250,250,250,250],[1000,1000]);
    % Get local indices in x-direction
    [i_lo, i_hi]=codist.globalIndices(1);
    % Allocate space for local part
    H_local=zeros(250,1000);
    % Initialize local array with Hilbert values
    for i=i_lo:i_hi
        for j=1:1000
            H_local(i-i_lo+1,j)=1/(i+j-1);
        end
    end
    % Assemble codistributed array
    H_ = codistributed.build(H_local, codist);
end
% Now it's distributed like before!
Z_=fft(fft(H_,[],1),[],2);
Z=gather(Z_);
imagesc(log(abs(Z)));
matlabpool close
Using pMatlab
pMatlab is an alternative way to get distributed Matlab functionality without relying on Matlab's Distributed Computing Server. It is built on top of MatlabMPI (an MPI implementation for Matlab, written in Matlab, that uses file I/O for communication). It supports various operations on distributed arrays (up to 4D):
• Elementary math functions (trig, exponential, complex, remainder/rounding)
• 2D convolutions, FFTs, discrete cosine transform
• FFTs need to be properly mapped (the array cannot be distributed along the transform dimension)
It does not have as much functionality as the Parallel Computing Toolbox, but it does support ghosting and more flexible partitioning!
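The mapping restriction on FFTs can be illustrated without pMatlab at all. The sketch below uses plain serial Matlab and an 8-by-8 Hilbert matrix (an arbitrary choice) to mimic two workers: splitting along the non-transform dimension reproduces the full FFT, while splitting along the transform dimension does not, because each piece then sees only half of every column.

```matlab
% Serial illustration: FFT along dimension 1 (down each column).
H = hilb(8);
full = fft(H, [], 1);

% Column blocks: each "worker" holds complete columns, so local FFTs suffice.
left  = fft(H(:, 1:4), [], 1);
right = fft(H(:, 5:8), [], 1);
colErr = max(max(abs([left, right] - full)));

% Row blocks: each "worker" holds only half of every column.
top    = fft(H(1:4, :), [], 1);
bottom = fft(H(5:8, :), [], 1);
rowErr = max(max(abs([top; bottom] - full)));

fprintf('split across columns: error %g\n', colErr);
fprintf('split across rows:    error %g\n', rowErr);
```

The column split gives an error at rounding level, while the row split is visibly wrong, which is why the distributed array has to be mapped along a dimension other than the one being transformed.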
Using pMatlab
Since pMatlab works by launching other Matlab instances, we need them to start up with pMatlab functionality. To do so, we need to add a few lines to the startup.m file in our Matlab path.
function result = log_abs_fft_hilb(n)
    assert(isa(n,'uint32'));
    result=log(abs(fft2(hilb(n))));
Turning Matlab code into C
Now we can also export a static library that we can link to:
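The export command itself is not preserved in this transcript. With MATLAB Coder, one plausible form is shown below as a sketch: it assumes log_abs_fft_hilb.m is on the path and that the assert(isa(n,'uint32')) line inside the function is what tells the code generator the input type, so no -args option is given.

```matlab
% Sketch only: generate a static C library from log_abs_fft_hilb.m.
cfg = coder.config('lib');             % configuration object for a static library target
codegen -config cfg log_abs_fft_hilb   % output is written under codegen/lib/log_abs_fft_hilb
```

If the input type were not asserted inside the function, it would have to be supplied to codegen with an -args example value instead.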
This will create a subdirectory codegen/lib/log_abs_fft_hilb containing the source files (.c and .h), the compiled object files (.o), and a library (log_abs_fft_hilb.a). The source files are portable to any platform with a C compiler (e.g., BlueStreak). We can rebuild the library on BlueStreak by running
To use the function, we still need to write a main routine that links to it. This requires working with Matlab's variable types (which include dynamically resizable arrays).

    #include <stdio.h>
    /* Matlab type definitions */
    #include "rtwtypes.h"
    #include "log_abs_fft_hilb_types.h"
    /* Declarations of the entry point and the emx helpers; header names
       follow the usual generated pattern, check codegen/lib/log_abs_fft_hilb */
    #include "log_abs_fft_hilb.h"
    #include "log_abs_fft_hilb_emxutil.h"

    int main(void) {
        uint32_T n=64;                 /* Argument to Matlab function */
        emxArray_real_T *result;       /* Return value of Matlab function */
        int32_T i,j;
        /* Initialize Matlab array to have rank 2 */
        emxInit_real_T(&result, 2);
        /* Call Matlab function */
        log_abs_fft_hilb(n, result);
        /* Output result; the data is stored in column-major order */
        for(i=0;i<result->size[0];i++) {
            for(j=0;j<result->size[1];j++) {
                printf("%f ",result->data[i+result->size[0]*j]);
            }
            printf("\n");
        }
        /* Free up memory associated with return array */
        emxFree_real_T(&result);
        return 0;
    }
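The index expression i+size[0]*j used in the C code is the zero-based form of Matlab's own linear indexing. The following small check uses an arbitrary 3-by-4 matrix and confirms that the C-style offset, shifted by one, addresses the same element as the (row, column) subscript.

```matlab
% Column-major storage: element (i,j), zero-based, lives at offset i + nRows*j.
A = reshape(1:12, 3, 4);          % 3 rows, 4 columns, filled column by column
nRows = size(A, 1);
for i = 0:size(A, 1)-1            % zero-based row index, as in the C code
    for j = 0:size(A, 2)-1        % zero-based column index
        offset = i + nRows*j;     % same formula as result->data[i + size[0]*j]
        assert(A(offset + 1) == A(i + 1, j + 1));
    end
end
disp('C-style offsets match Matlab linear indexing');
```

Looping over j in the outer loop and i in the inner loop walks memory contiguously, which is the order the second C example uses to fill its input array.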
Turning Matlab code into C
And here is another example, calling 2D FFTs on real data:

    int main(void) {
        int32_T q0;
        int32_T i, j;
        int32_T n=8;
        emxArray_creal_T *result;
        emxArray_real_T *input;
        emxInit_creal_T(&result, 2);
        emxInit_real_T(&input, 2);
        /* Remember the old element count, then resize the input to n-by-n */
        q0 = input->size[0] * input->size[1];
        input->size[0]=n;
        input->size[1]=n;
        emxEnsureCapacity((emxArray__common *)input, q0, (int32_T)sizeof(real_T));
        /* Fill the input with Hilbert values in column-major order */
        for(j=0;j<input->size[1];j++) {
            for(i=0;i<input->size[0];i++) {
                input->data[i+input->size[0]*j]=1.0 / (real_T)(i+j+1);
            }
        }
        /* ... */
    }
• Exported FFTs only work on vectors of length 2^N.
• Error checking is disabled in exported C code.
• Mex code will have the same functionality as exported C code, but will also have error checking. It will warn about doing FFTs on arbitrary-length vectors, etc.
• Always test your mex code!
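Given the power-of-two restriction, it can help to check a length before exporting and, where the application tolerates it, zero-pad up to the next power of two. This is a sketch using the built-in nextpow2; the length 100 and the test vector are arbitrary choices.

```matlab
% Check whether a length is a power of two and pad if it is not.
n = 100;
x = 1 ./ (1:n);                   % arbitrary test vector of length n
m = 2^nextpow2(n);                % smallest power of two >= n
isPow2 = (m == n);                % true only when n is already a power of two
if ~isPow2
    fprintf('length %d is not a power of two, padding to %d\n', n, m);
end
X = fft(x, m);                    % fft zero-pads x to length m
```

Note that the zero-padded transform is sampled on a finer frequency grid than the length-n transform, so the two are not interchangeable in every application.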
Matlab code is not that different from C code
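Or you can write your own C code that uses open source mathematical libraries (e.g., FFTW):

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <complex.h>
    #include <fftw3.h>

    int main(void) {
        int n=4096;
        int i,j;
        /* Allocate on the heap: n-by-n arrays of this size would overflow the stack */
        fftw_complex *input=fftw_malloc(sizeof(fftw_complex)*n*n);
        fftw_complex *temp=fftw_malloc(sizeof(fftw_complex)*n*n);
        double *result=malloc(sizeof(double)*n*n);
        fftw_plan p;
        if (!input || !temp || !result) return 1;
        p=fftw_plan_dft_2d(n, n, input, temp, FFTW_FORWARD, FFTW_ESTIMATE);
        /* Fill the input with Hilbert values */
        for (i=0;i<n;i++) {
            for (j=0;j<n;j++) {
                input[i*n+j]=1.0/(double)(i+j+1);
            }
        }
        fftw_execute(p);
        /* log(abs(fft2(H))) */
        for (i=0;i<n;i++) {
            for (j=0;j<n;j++) {
                result[i*n+j]=log(cabs(temp[i*n+j]));
            }
        }
        for (i=0;i<n;i++) {
            for (j=0;j<n;j++) {
                printf("%f ",result[i*n+j]);
            }
            printf("\n");
        }
        fftw_destroy_plan(p);
        fftw_free(input);
        fftw_free(temp);
        free(result);
        return 0;
    }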