Unmanaged Parallelization via P/Invoke

Dmitri [email protected]

http://activemesa.com
http://spbalt.net
http://devtalk.net

“Premature optimization is the root of all evil.”

Donald Knuth, “Structured Programming with go to Statements”, ACM Computing Surveys, Vol. 6, No. 4, Dec. 1974, p. 268.

“In practice, it is often necessary to keep performance goals in mind when first designing software, but the programmer balances the goals of design and optimization.”

Wikipedia, http://en.wikipedia.org/wiki/Program_optimization

Brief intro
  Why unmanaged code?
  Why parallelize?
P/Invoke
SIMD
OpenMP
Intel stack: TBB, MKL, IPP
GPGPU: CUDA, Accelerator
Miscellanea

Today

Threads & ThreadPool
Sync structures
  Monitor.(Try)Enter/Exit
  ReaderWriterLock(Slim)
  Mutex
  Semaphore
Wait handles
  Manual/AutoResetEvent
Pulse & wait
Async delegates
Async simplifications
  F# async workflow
  AsyncEnumerator (PowerThreading)

Tomorrow

Tasks and TaskManager
  Task
  Future
Data-level parallelism
  Parallel.For/ForEach
Parallel LINQ
  AsParallel()

Performance
  Low-level (fine-tuning) framework
  Instruction-level parallelism
  GPU SIMD
  General vectorization
Simple cross-machine framework
Managed interfaces for SIMD/MPI-optimized libraries
Threading tools
  Debugging
  Profiling
  Inferencing
  Cross-machine debugging
  Task management UI

WTF?!? Isn’t C# 5% faster than C?
It depends.

Why is there a difference?
  More safety (e.g., CLR array bounds checking)
  JIT: no auto-parallelization
  JIT: no SIMD
  Lack of fine control

IL can be every bit as fast as C/C++
  But this is only true for simple problems
  The code is only as good as the JITter

Libraries (MKL, IPP)

OpenMP

Intel TBB, Microsoft PPL

SIMD (CPU & GPGPU)

Part I

A way of calling unmanaged C++ from .Net
  Not the same as C++/CLI
For interaction with ‘legacy’ systems
Can pass data between managed and unmanaged code
  Literals (int, string)
  Pointers (e.g., pointer to array)
  Structures
Marshalling is taken care of by the runtime

Make a Win32 C++ DLL
    MYLIB_API int Add(int first, int second)
    {
        return first + second;
    }
Specify a post-build step to copy the DLL to the .Net assembly's output folder

Important: default DLL location is solution root
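The slide's Add function relies on a MYLIB_API macro that is not shown. A minimal sketch of what such a macro might look like follows; the exact definition is an assumption (the Visual Studio DLL project template generates something similar, usually with an extra dllimport branch for consumers of the header). On non-Windows compilers the macro expands to nothing so the file still builds.

```cpp
// Hypothetical export macro: marks a function as exported from the DLL.
// Assumption: this file is only ever compiled as part of the DLL itself.
#if defined(_WIN32)
  #define MYLIB_API __declspec(dllexport)
#else
  #define MYLIB_API
#endif

// Same function as on the slide. It has C++ linkage, so the exported
// name is decorated (mangled) by the compiler.
MYLIB_API int Add(int first, int second)
{
    return first + second;
}
```

Because the function keeps C++ linkage, the name visible to P/Invoke is the decorated one reported by dumpbin /exports.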

Build the DLL
Make a .Net application
    [DllImport("MyLib.dll")]
    public static extern int Add(int first, int second);

Call the method

Basic C# ↔ C++ Interop

DLL not found
  Make sure post-build step copies DLL to target folder
  Or that DLL is in PATH
  DLL relies on other DLLs which are not found
    Open Visual Studio command prompt
    Use dumpbin /dependents mylib.dll to find out
    Copy files to target dir
    This is common in Debug mode

An attempt was made to load DLL with incorrect format
  32-bit/64-bit mismatch

Entry point not found
  Make sure method names and signatures are equivalent
  Make sure calling convention matches
    [DllImport(…, CallingConvention = …)]
  On 64-bit systems, specify entry name explicitly
    Use dumpbin /exports
    [DllImport(…, EntryPoint = "?Add@@YAHHH@Z")]
  No, extern "C" does not help

It all works
  Congratulations!

Special cases
  String handling
    Unicode vs. ANSI
    LP(C)WSTR
  Arrays
    fixed
  Memory allocation
  Calling convention
  “Bitness” issues
  … and lots more!

Handling them
  Marshal
  MarshalAsAttribute
  [In] and [Out]
  StructLayout
  IntPtr
  … and lots more

Handle on a case-by-case basis
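To make the special cases above concrete, here is a sketch of what the unmanaged side of three of them might look like: an array passed as pointer plus length, a wide (Unicode) string, and a plain structure. All three function names and the Point2D type are hypothetical. On the managed side one would typically declare an int[] parameter, use CharSet.Unicode for the string, and mirror the struct with StructLayout(LayoutKind.Sequential).

```cpp
#include <cmath>
#include <cwchar>

// Hypothetical plain-data struct; the managed mirror needs the same
// field order and types.
struct Point2D
{
    double x;
    double y;
};

// Array case: a managed int[] arrives as a pointer to the first
// element; the length has to be passed separately.
int SumArray(const int* values, int count)
{
    int total = 0;
    for (int i = 0; i < count; ++i)
        total += values[i];
    return total;
}

// String case: a null-terminated wide string (LPCWSTR on Windows).
int WideLength(const wchar_t* text)
{
    return text ? static_cast<int>(std::wcslen(text)) : 0;
}

// Structure case: passed by pointer (ref or in on the managed side).
double Distance(const Point2D* a, const Point2D* b)
{
    const double dx = b->x - a->x;
    const double dy = b->y - a->y;
    return std::sqrt(dx * dx + dy * dy);
}
```

Export decoration (MYLIB_API or similar) is left out here to keep the focus on the signatures.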

Make sure signatures match
  Including return types!
To debug
  If your OS is 64-bit, make sure .Net assemblies compile in 32-bit mode
  Make sure unmanaged code debugging is turned on
In 64-bit
  Launch the DLL project with the .Net assembly as its debug target

Good luck! :)

Visit the P/Invoke wiki at http://pinvoke.net

Part II

An API for multi-platform shared-memory parallel programming in C/C++ and Fortran.
Uses #pragma statements to decorate code
Easy!!!
  Syntax can be learned very quickly
  Can be turned off and on in project settings

Enable it (disabled by default)
Use it!
  No further action necessary
To use configuration API
  #include <omp.h>
  Call methods, e.g., omp_get_num_procs()
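A small sketch of the configuration API call mentioned above. The wrapper name AvailableProcessors is hypothetical. The _OPENMP macro is defined by the compiler only when OpenMP is switched on, so the same source still builds with OpenMP turned off.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the number of processors OpenMP reports, or 1 when the
// code was compiled without OpenMP support.
int AvailableProcessors()
{
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1;
#endif
}
```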

void MultiplyMatricesDoubleOMP(int size, double* m1[], double* m2[], double* result[])
{
    int i, j, k;
    #pragma omp parallel for shared(size,m1,m2,result) private(i,j,k)
    for (i = 0; i < size; i++)
    {
        for (j = 0; j < size; j++)
        {
            result[i][j] = 0;
            for (k = 0; k < size; k++)
            {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }
}

#pragma omp parallel for
  Tells the compiler to split the loop's iterations across threads
shared(size,m1,m2,result)
  Variables shared between all threads
private(i,j,k)
  Variables of which each thread gets its own copy
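As a variation on the matrix example, the sketch below declares the loop counters and the accumulator inside the loops. Variables declared inside the parallel region are private to each thread automatically, so no private(...) clause is needed. The function name and the flat row-major std::vector layout are assumptions made for this illustration, not the demo's code.

```cpp
#include <vector>

// Multiplies two size x size matrices stored row-major in flat vectors.
// result must already hold size * size elements.
void MultiplyFlatOMP(int size,
                     const std::vector<double>& m1,
                     const std::vector<double>& m2,
                     std::vector<double>& result)
{
    // Each thread handles a subset of the rows; rows do not overlap,
    // so no synchronization is required.
    #pragma omp parallel for
    for (int i = 0; i < size; i++)
    {
        for (int j = 0; j < size; j++)
        {
            double sum = 0.0;  // private: declared inside the loop
            for (int k = 0; k < size; k++)
                sum += m1[i * size + k] * m2[k * size + j];
            result[i * size + j] = sum;
        }
    }
}
```

With OpenMP disabled the pragma is ignored and the loop runs sequentially with the same result.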

Using OpenMP in your C++ app

Homepage
  http://openmp.org
cOMPunity (community of OMP users)
  http://www.compunity.org/
OpenMP debug/optimization article (Russian)
  http://bit.ly/BJbPU
VivaMP (static analyzer for OpenMP code)
  http://viva64.com/vivamp-tool

Part III

Libraries save you from reinventing the wheel
  Tested
  Optimized (e.g., for multi-core, SIMD)
These typically have C++ and Fortran interfaces
  Some also have MPI support
Of course, there are .Net libraries too :)
The ‘trick’ is to use these libraries from C#
  Fortran-compatible API is tricky!
  Data structure passing can be quite arcane!

Intel makes multi-core processors
  Multi-core know-how
Parallel Composer
  C++ Compiler (autoparallelization, OpenMP 3.0)
  Libraries (Math Kernel Library, Integrated Performance Primitives, Threading Building Blocks)
  Parallel debugger extension
Parallel Inspector (memory/threading errors)
Parallel Amplifier (hotspots, concurrency, locks and waits)
Parallel Advisor Lite

Intel Parallel Studio

Low-level parallelization framework from Intel
Lets you fine-tune code for multi-core
Is a library
  Uses a set of primitives
Has OSS license

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
using namespace tbb;

// Functor
struct Average {
    float* input;
    float* output;
    void operator()( const blocked_range<int>& range ) const {
        for( int i=range.begin(); i!=range.end(); ++i )
            output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f);
    }
};

// Note: The input must be padded such that input[-1] and input[n]
// can be used to calculate the first and last output values.
void ParallelAverage( float* output, float* input, size_t n ) {
    Average avg;
    avg.input = input;
    avg.output = output;
    // Library call
    parallel_for(blocked_range<int>( 0, n, 1000 ), avg);
}

Integrated Performance Primitives

High-performance libraries for
  Signal processing
  Image processing
  Computer vision
  Speech recognition
  Data compression
  Cryptography
  String manipulation
  Audio processing
  Video coding
  Realistic rendering

Also support codec construction

Math Kernel Library

Optimized, multithreaded library for math
Support for
  BLAS
  LAPACK
  ScaLAPACK
  Sparse Solvers
  Fast Fourier Transforms
  Vector Math
  … and lots more

Part IV

CPU support for performing operations on large registers
  Normal-size data is loaded into 128-bit registers
  Operation on multiple elements with a single instruction
    E.g., add 4 numbers at once
Requires special CPU instructions
  Less portable
Supported in C++ via ‘intrinsics’

SSE is an instruction set
  Preceded by MMX (MMX intrinsics are n/a in 64-bit builds)
  Now SSE and SSE2
Compiler intrinsics
  C++ functions that map to one or more SSE assembly instructions
Determining support
  Use cpuid
  Non-issue if you
    Are a systems integrator
    Run your own servers (e.g., Asp.Net)
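One way the cpuid check might be written is sketched below. The function name HasSse2 is hypothetical. The sketch uses __cpuid from <intrin.h> under MSVC and __get_cpuid from <cpuid.h> under GCC/Clang, and reports false on non-x86 targets. SSE2 support is signalled by bit 26 of EDX for CPUID leaf 1.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>

bool HasSse2()
{
    int regs[4] = {0, 0, 0, 0};            // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);                       // leaf 1: feature flags
    return (regs[3] & (1 << 26)) != 0;      // EDX bit 26 = SSE2
}

#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>

bool HasSse2()
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;                       // leaf 1 not available
    return (edx & (1u << 26)) != 0;         // EDX bit 26 = SSE2
}

#else

// Not an x86 target: SSE2 does not apply.
bool HasSse2()
{
    return false;
}

#endif
```

A caller would typically check this once at start-up and choose between a SIMD code path and a scalar fallback.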

128-bit data types
  __m128
  __m128i (integer intrinsics, SSE2)
  __m128d (double intrinsics, SSE2)
Operations for load and set
    __m128 a = _mm_set_ps(1,2,3,4);
To get at data, dereference and choose type
  E.g., myValue.m128_f32[0] gets first float
Perform operations (add, multiply, etc.)
  E.g., _mm_mul_ps(first, second) multiplies two values yielding a third
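A short sketch of the load / operate / store cycle with SSE intrinsics. The function name MultiplyFour is hypothetical. Instead of the MSVC-specific m128_f32 member, this version writes the result back with _mm_storeu_ps, which is also accepted by GCC and Clang. A scalar fallback is included, as an assumption, for targets without SSE.

```cpp
#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#endif

// Multiplies four floats from a with four floats from b.
// The pointers do not need to be 16-byte aligned because the
// unaligned load/store variants are used.
void MultiplyFour(const float* a, const float* b, float* out)
{
#ifdef SKETCH_HAS_SSE
    const __m128 va = _mm_loadu_ps(a);       // load 4 floats
    const __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));  // 4 multiplies, 1 instruction
#else
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] * b[i];
#endif
}
```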

Make or get data
  Either create with initialized values
    static __m128 factor = _mm_set_ps(1.0f, 0.3f, 0.59f, 0.11f);
  Or load it into a SIMD-sized memory location
    __m128 pixel;
    pixel.m128_f32[0] = s->m128i_u8[(p<<2)];
  Or convert an existing pointer
    __m128* pixel = (__m128*)(&my_array + p);
Perform a SIMD operation and get data
    pixel = _mm_mul_ps(pixel, factor);
Get the data
    const BYTE sum = (BYTE)(pixel.m128_f32[0]);

Image processing with SIMD
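The demo source is not part of these slides, so the following is only a sketch of how the weights above (0.3, 0.59, 0.11) could turn one pixel into a grey value. Assumptions: the pixel is stored as four bytes in B, G, R, A order, the alpha channel is ignored (weight 0 rather than the 1.0 used on the slide), and the function name GrayFromBgra is hypothetical. Note that _mm_set_ps takes its arguments from the highest element to the lowest, so the last argument ends up in element 0.

```cpp
#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#endif

// Converts one BGRA pixel (4 bytes) to a single grey value.
unsigned char GrayFromBgra(const unsigned char* px)
{
    // Widen the four channel bytes to floats: B, G, R, A.
    const float channels[4] = {
        static_cast<float>(px[0]), static_cast<float>(px[1]),
        static_cast<float>(px[2]), static_cast<float>(px[3])
    };
    float weighted[4];
#ifdef SKETCH_HAS_SSE
    // Element 0 = 0.11 (blue), 1 = 0.59 (green), 2 = 0.3 (red), 3 = 0 (alpha).
    const __m128 factor = _mm_set_ps(0.0f, 0.3f, 0.59f, 0.11f);
    _mm_storeu_ps(weighted, _mm_mul_ps(_mm_loadu_ps(channels), factor));
#else
    const float factor[4] = {0.11f, 0.59f, 0.3f, 0.0f};
    for (int i = 0; i < 4; ++i)
        weighted[i] = channels[i] * factor[i];
#endif
    // Sum the three colour contributions and round to nearest.
    return static_cast<unsigned char>(
        weighted[0] + weighted[1] + weighted[2] + 0.5f);
}
```

A real implementation would process many pixels per call; handling one pixel at a time keeps the data flow easy to follow.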

Part V

Graphics cards have GPUs
These are highly parallelized
  Pipelining
  Useful for graphics
GPUs are programmable
  We can do math ops on vectors
  Mainly float, with double support emerging

GPUs have programmable parts
  Vertex shader (vertex position)
  Pixel shader (pixel colour)
Treat data as texture (render target)
  Load inputs as texture
  Use pixel shader
  Get data from result texture
Special languages used to program them
  HLSL (DirectX)
  GLSL (OpenGL)

High-level wrappers (CUDA, Accelerator)

A Microsoft Research project
  Not for commercial use
Uses a managed API
Employs data-parallel arrays
  Int
  Float
  Bool
  Bitmap-aware
Requires Pixel Shader 2.0 (PS 2.0)

Sorry! No demo.
Accelerator does not work on 64-bit :(

If a library already exists, use it
If C# is fast enough, use it
To speed things up, try
  TPL/PLINQ
  Manual parallelization
  unsafe (can be combined with TPL)
If you are unhappy, then
  Write in C++
  Speculatively add OpenMP directives
  Fine-tune with TBB as needed

System.Drawing.Bitmap is slow
  Has very slow GetPixel()/SetPixel() methods
Can lock the bitmap in memory and manipulate it in unmanaged code
What we need to pass in
  Pointer to bitmap image data
  Bitmap width and height
  Bitmap stride (bytes per row in memory, including padding)
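The three inputs listed above map naturally onto an unmanaged function signature. The sketch below inverts every channel byte of an image; the function name InvertBitmap and the extra bytesPerPixel parameter are assumptions for illustration. Rows are addressed through the stride rather than width * bytesPerPixel, because each row may carry padding bytes that must not be touched.

```cpp
#include <cstddef>

// Inverts every channel byte of a bitmap in place.
// data:   pointer to the first row of pixel data
// stride: distance in bytes between the starts of consecutive rows
void InvertBitmap(unsigned char* data, int width, int height,
                  int stride, int bytesPerPixel)
{
    // Rows are independent, so they can be processed in parallel.
    #pragma omp parallel for
    for (int y = 0; y < height; ++y)
    {
        unsigned char* row = data + static_cast<std::ptrdiff_t>(y) * stride;
        const int rowBytes = width * bytesPerPixel;  // excludes padding
        for (int x = 0; x < rowBytes; ++x)
            row[x] = static_cast<unsigned char>(255 - row[x]);
    }
}
```

On the managed side the pointer, dimensions and stride would typically come from Bitmap.LockBits, which exposes them as Scan0, Width, Height and Stride.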

Image-rendered headings with subpixel postprocessing (http://bit.ly/10x0G8)
  WPF FlowDocument for initial generation
  C++/OpenMP for postprocessing
  Asp.Net for serving the result
Freeform rendered text with OpenType features (http://bit.ly/1cCP50)
  Bitmap rendering in Direct2D (C++/lightweight COM API)
  OpenType markup language

The End
