Dmitri [email protected]
http://activemesa.com http://spbalt.net http://devtalk.net
“Premature optimization is the root of all evil.”
Donald Knuth, Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol. 6, No. 4, Dec. 1974, p. 268.
“In practice, it is often necessary to keep performance goals in mind when first designing software, but the programmer balances the goals of design and optimization.”
Wikipedia, http://en.wikipedia.org/wiki/Program_optimization
Brief intro
  Why unmanaged code?
  Why parallelize?
P/Invoke
SIMD
OpenMP
Intel stack: TBB, MKL, IPP
GPGPU: CUDA, Accelerator
Miscellanea
Today
Threads & ThreadPool
Sync structures
  Monitor.(Try)Enter/Exit
  ReaderWriterLock(Slim)
  Mutex
  Semaphore
Wait handles
  Manual/AutoResetEvent
Pulse & wait
Async delegates
Async simplifications
  F# async workflow
  AsyncEnumerator (PowerThreading)
Tomorrow
Tasks and TaskManager
  Task
  Future
Data-level parallelism
  Parallel.For/ForEach
Parallel LINQ
  AsParallel()
Performance
  Low-level (fine-tuning) framework
  Instruction-level parallelism
  GPU SIMD
  General vectorization
  Simple cross-machine framework
  Managed interfaces for SIMD/MPI-optimized libraries
Threading tools
  Debugging
  Profiling
  Inferencing
  Cross-machine debugging
  Task management UI
WTF?!? Isn’t C# 5% faster than C?
It depends.
Why is there a difference?
  More safety (e.g., CLR array bounds checking)
  JIT: No auto-parallelization
  JIT: No SIMD
  Lack of fine control
IL can be every bit as fast as C/C++
  But this is only true for simple problems
  The code is only as good as the JITter
Libraries (MKL, IPP)
OpenMP
Intel TBB, Microsoft PPL
SIMD (CPU & GPGPU)
Part I
A way of calling unmanaged C++ from .Net
  Not the same as C++/CLI
For interaction with ‘legacy’ systems
Can pass data between managed and unmanaged code
  Literals (int, string)
  Pointers (e.g., pointer to array)
  Structures
Marshalling is taken care of by the runtime
Make a Win32 C++ DLL
  MYLIB_API int Add(int first, int second)
  {
    return first + second;
  }
Specify a post-build step to copy the DLL to the .Net assembly's output folder
  Important: default DLL location is solution root
Build the DLL
Make a .Net application
  [DllImport("MyLib.dll")]
  public static extern int Add(int first, int second);
Call the method
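The one-line Add above relies on a MYLIB_API macro that the slides do not show. A minimal sketch of the native side, assuming the macro follows the usual Visual Studio DLL-project pattern; the MYLIB_EXPORTS symbol and the non-Windows fallback are assumptions, not taken from the slides:

```cpp
// Header and source collapsed into one file for illustration.
// MYLIB_EXPORTS is normally set by the DLL project's settings; it is defined
// here only so that the sketch compiles on its own (assumption).
#define MYLIB_EXPORTS 1

#if defined(_WIN32)
#  if defined(MYLIB_EXPORTS)
#    define MYLIB_API __declspec(dllexport)  // building the DLL: export the symbol
#  else
#    define MYLIB_API __declspec(dllimport)  // consuming the DLL from other C++ code
#  endif
#else
#  define MYLIB_API                           // other platforms: nothing to decorate
#endif

// The function that the C# side declares with [DllImport("MyLib.dll")].
MYLIB_API int Add(int first, int second)
{
    return first + second;
}
```

Because Add is compiled as C++, its exported name is decorated, so it is worth checking the export table before writing the DllImport declaration.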
Basic C# ↔ C++ Interop
DLL not found
  Make sure post-build step copies DLL to target folder
  Or that DLL is in PATH
An attempt was made to load DLL with incorrect format
  DLL relies on other DLLs which are not found
    Open Visual Studio command prompt
    Use dumpbin /dependents mylib.dll to find out
    Copy files to target dir
    This is common in Debug mode
  32-bit/64-bit mismatch
Entry point not found
  Make sure method names and signatures are equivalent
  Make sure calling convention matches
    [DllImport(…, CallingConvention = …)]
  On 64-bit systems, specify entry name explicitly
    Use dumpbin /exports
    [DllImport(…, EntryPoint = "?Add@@YAHHH@Z")]
    No, extern "C" does not help
It all works
  Congratulations!
Special cases
  String handling
    Unicode vs. ANSI
    LP(C)WSTR
  Arrays
    fixed
  Memory allocation
  Calling convention
  “Bitness” issues
  … and lots more!
Handling them
  Marshal
  MarshalAsAttribute
  [In] and [Out]
  StructLayout
  IntPtr
  … and lots more
Handle on a case-by-case basis
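To make the special cases concrete, here is a sketch of the unmanaged side for three of them: an array passed as a pointer plus a length, a wide (Unicode) string, and a structure. All names (SumArray, CountChars, Point2D, ManhattanLength) are hypothetical. The managed declarations would have to mirror these signatures exactly, for example with CharSet.Unicode for the string and a sequential StructLayout for the structure.

```cpp
#include <cwchar>  // std::wcslen

// Hypothetical native functions showing what the marshaller has to match.

// Array: C# passes int[]; native code sees a pointer and an explicit length.
int SumArray(const int* values, int count)
{
    int total = 0;
    for (int i = 0; i < count; ++i)
        total += values[i];
    return total;
}

// String: LPCWSTR is a const wchar_t*, so the C# side must marshal as Unicode.
int CountChars(const wchar_t* text)
{
    return text ? static_cast<int>(std::wcslen(text)) : 0;
}

// Structure: field order and sizes must match the managed definition.
struct Point2D
{
    int x;
    int y;
};

// Takes the structure by pointer, as a C# 'ref' parameter would pass it.
int ManhattanLength(const Point2D* p)
{
    int ax = p->x < 0 ? -p->x : p->x;
    int ay = p->y < 0 ? -p->y : p->y;
    return ax + ay;
}
```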
Make sure signatures match
  Including return types!
To debug
  If your OS is 64-bit, make sure .Net assemblies compile in 32-bit mode
  Make sure unmanaged code debugging is turned on
  In 64-bit
    Launch the DLL project with the .Net assembly as the debug target
Good luck! :)
Visit the P/Invoke wiki @ http://pinvoke.net
Part II
An API for multi-platform shared-memory parallel programming in C/C++ and Fortran.
Uses #pragma statements to decorate code
Easy!!!
  Syntax can be learned very quickly
  Can be turned off and on in project settings
Enable it (disabled by default)
Use it!
  No further action necessary
To use configuration API
  #include <omp.h>
  Call methods, e.g., omp_get_num_procs()
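A minimal sketch of the configuration API in use, relying only on standard omp.h functions. The _OPENMP guard (a macro that compilers define when OpenMP is enabled) lets the same file build with OpenMP switched off. The function names AvailableProcessors and MaxThreads are illustrative.

```cpp
#ifdef _OPENMP
#include <omp.h>  // only pulled in when OpenMP is enabled
#endif

// Number of processors OpenMP can see; 1 when OpenMP is turned off.
int AvailableProcessors()
{
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1;
#endif
}

// Upper bound on the number of threads a parallel region would use.
int MaxThreads()
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```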
void MultiplyMatricesDoubleOMP(int size, double* m1[], double* m2[], double* result[])
{
  int i, j, k;
  #pragma omp parallel for shared(size,m1,m2,result) private(i,j,k)
  for (i = 0; i < size; i++)
  {
    for (j = 0; j < size; j++)
    {
      result[i][j] = 0;
      for (k = 0; k < size; k++)
      {
        result[i][j] += m1[i][k] * m2[k][j];
      }
    }
  }
}
#pragma omp parallel for
  Tells the compiler to split the iterations of the loop that follows across threads
shared(size,m1,m2,result)
  Variables shared between all threads
private(i,j,k)
  Variables of which every thread gets its own copy
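The difference between the two clauses is easier to see in a smaller loop. In this sketch (function and variable names are made up for illustration) the arrays are shared, while the scratch variable tmp is private so that threads do not overwrite each other's intermediate value. With OpenMP disabled the pragma is ignored and the loop simply runs serially.

```cpp
// Computes out[i] = in[i] * in[i] + in[i] for every element.
void SquarePlusSelf(int size, const double* in, double* out)
{
    int i;
    double tmp;  // scratch value; must not be shared between threads
    #pragma omp parallel for shared(size, in, out) private(i, tmp)
    for (i = 0; i < size; i++)
    {
        tmp = in[i] * in[i];   // each thread writes its own copy of tmp
        out[i] = tmp + in[i];  // every iteration owns a distinct out[i]
    }
}
```

Had tmp been listed under shared instead, two threads could interleave between the two statements and produce wrong results.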
Using OpenMP in your C++ app
Homepage
  http://openmp.org
cOMPunity (community of OMP users)
  http://www.compunity.org/
OpenMP debug/optimization article (Russian)
  http://bit.ly/BJbPU
VivaMP (static analyzer for OpenMP code)
  http://viva64.com/vivamp-tool
Part III
Libraries save you from reinventing the wheel
  Tested
  Optimized (e.g., for multi-core, SIMD)
These typically have C++ and Fortran interfaces
  Some also have MPI support
Of course, there are .Net libraries too :)
The ‘trick’ is to use these libraries from C#
  Fortran-compatible API is tricky!
  Data structure passing can be quite arcane!
Intel makes multi-core processors
  Multi-core know-how
Intel Parallel Studio
  Parallel Composer
    C++ Compiler (autoparallelization, OpenMP 3.0)
    Libraries (Math Kernel Library, Integrated Performance Primitives, Threading Building Blocks)
    Parallel debugger extension
  Parallel Inspector (memory/threading errors)
  Parallel Amplifier (hotspots, concurrency, locks and waits)
  Parallel Advisor Lite
Low-level parallelization framework from Intel
Lets you fine-tune code for multi-core
Is a library
  Uses a set of primitives
Has OSS license
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
using namespace tbb;

// Functor
struct Average {
  float* input;
  float* output;
  void operator()( const blocked_range<int>& range ) const {
    for( int i=range.begin(); i!=range.end(); ++i )
      output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f);
  }
};

// Note: The input must be padded such that input[-1] and input[n]
// can be used to calculate the first and last output values.
void ParallelAverage( float* output, float* input, size_t n ) {
  Average avg;
  avg.input = input;
  avg.output = output;
  // Library call
  parallel_for( blocked_range<int>( 0, n, 1000 ), avg );
}
Integrated Performance Primitives
High-performance libraries for
  Signal processing
  Image processing
  Computer vision
  Speech recognition
  Data compression
  Cryptography
  String manipulation
  Audio processing
  Video coding
  Realistic rendering
Also support codec construction
Math Kernel Library
Optimized, multithreaded library for math
Support for
  BLAS
  LAPACK
  ScaLAPACK
  Sparse Solvers
  Fast Fourier Transforms
  Vector Math
  … and lots more
Part IV
CPU support for performing operations on large registers
Normal-size data is loaded into 128-bit registers
Operation on multiple elements with a single instruction
  E.g., add 4 numbers at once
Requires special CPU instructions
  Less portable
Supported in C++ via ‘intrinsics’
SSE is an instruction set
  Preceded by MMX (MMX intrinsics are n/a in 64-bit builds)
  Now SSE and SSE2
Compiler intrinsics
  C++ functions that map to one or more SSE assembly instructions
Determining support
  Use cpuid
  Non-issue if you
    Are a systems integrator
    Run your own servers (e.g., Asp.Net)
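One way to query support at run time, sketched for x86 with GCC/Clang (__get_cpuid from cpuid.h) and MSVC (__cpuid from intrin.h). The helper names and the SKETCH_X86 macro are made up for this example. The bit positions used are those of CPUID leaf 1: EDX bit 25 for SSE and bit 26 for SSE2. On non-x86 targets the sketch simply reports no support.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#  include <intrin.h>
#  define SKETCH_X86 1
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__i386__) || defined(__x86_64__))
#  include <cpuid.h>
#  define SKETCH_X86 1
#else
#  define SKETCH_X86 0
#endif

// Returns the EDX feature flags from CPUID leaf 1, or 0 if unavailable.
static unsigned int CpuFeatureFlagsEdx()
{
#if SKETCH_X86 && defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);                      // regs = {EAX, EBX, ECX, EDX}
    return static_cast<unsigned int>(regs[3]);
#elif SKETCH_X86
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                          // leaf 1 not supported
    return edx;
#else
    return 0;                              // not an x86 target
#endif
}

bool CpuHasSse()  { return (CpuFeatureFlagsEdx() & (1u << 25)) != 0; }
bool CpuHasSse2() { return (CpuFeatureFlagsEdx() & (1u << 26)) != 0; }
```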
128-bit data types
  __m128
  __m128i (integer intrinsics, SSE2)
  __m128d (double intrinsics, SSE2)
Operations for load and set
  __m128 a = _mm_set_ps(1,2,3,4);
To get at data, dereference and choose type
  E.g., myValue.m128_f32[0] gets first float
Perform operations (add, multiply, etc.)
  E.g., _mm_mul_ps(first, second) multiplies two values yielding a third
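Putting the three steps together, a minimal sketch for an x86 compiler with SSE enabled. It reads the lanes back with _mm_storeu_ps instead of the MSVC-specific m128_f32 member, and the function name MultiplyLanes is invented for the example. Note that _mm_set_ps takes its arguments from the highest lane down, so lane 0 holds the last argument.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_set_ps, _mm_mul_ps, ...

// Multiplies four floats by four floats with a single SSE multiply.
// result[0..3] receives the products, lane 0 first.
void MultiplyLanes(float result[4])
{
    __m128 first   = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);      // lanes 0..3: 4, 3, 2, 1
    __m128 second  = _mm_set_ps(10.0f, 10.0f, 10.0f, 10.0f);  // every lane is 10
    __m128 product = _mm_mul_ps(first, second);                // four multiplies at once
    _mm_storeu_ps(result, product);                            // unaligned store to memory
}
```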
Make or get data
  Either create with initialized values
    static __m128 factor = _mm_set_ps(1.0f, 0.3f, 0.59f, 0.11f);
  Or load it into a SIMD-sized memory location
    __m128 pixel;
    pixel.m128_f32[0] = s->m128i_u8[(p<<2)];
  Or convert an existing pointer
    __m128* pixel = (__m128*)(&my_array[p]);
Perform a SIMD operation
  pixel = _mm_mul_ps(pixel, factor);
Get the data
  const BYTE sum = (BYTE)(pixel.m128_f32[0]);
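The fragments above can be assembled into a per-pixel routine. This is a sketch under assumptions: the pixel layout is taken to be blue, green, red, alpha, the weights are the ones from the factor constant, and the function name ToGray is invented. It reads lanes with _mm_storeu_ps rather than m128_f32 so that it also compiles outside MSVC.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Converts one 4-byte pixel (assumed order: blue, green, red, alpha) to a
// grey level using the weights 0.11, 0.59 and 0.3.
unsigned char ToGray(const unsigned char pixel[4])
{
    // _mm_set_ps lists lanes from high to low, so lane 0 (blue) gets 0.11.
    const __m128 factor = _mm_set_ps(1.0f, 0.3f, 0.59f, 0.11f);

    // Widen the four bytes to floats, again highest lane first.
    __m128 channels = _mm_set_ps((float)pixel[3], (float)pixel[2],
                                 (float)pixel[1], (float)pixel[0]);

    // One instruction scales all four channels.
    channels = _mm_mul_ps(channels, factor);

    float lanes[4];
    _mm_storeu_ps(lanes, channels);

    // Sum the weighted colour channels; alpha (lane 3) is ignored.
    float sum = lanes[0] + lanes[1] + lanes[2];
    return (unsigned char)sum;  // truncates toward zero
}
```

Converting one pixel at a time leaves most of the benefit on the table; the gain comes from keeping data in SIMD registers across a whole row.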
Image processing with SIMD
Part V
Graphics cards have GPUs
These are highly parallelized
  Pipelining
  Useful for graphics
GPUs are programmable
  We can do math ops on vectors
  Mainly float, with double support emerging
GPUs have programmable parts
  Vertex shader (vertex position)
  Pixel shader (pixel colour)
Treat data as texture (render target)
  Load inputs as texture
  Use pixel shader
  Get data from result texture
Special languages used to program them
  HLSL (DirectX)
  GLSL (OpenGL)
High-level wrappers (CUDA, Accelerator)
A Microsoft Research project
  Not for commercial use
Uses a managed API
Employs data-parallel arrays
  Int
  Float
  Bool
  Bitmap-aware
Requires PS 2.0
Sorry! No demo.
  Accelerator does not work on 64-bit :(
If a library already exists, use it
If C# is fast enough, use it
To speed things up, try
  TPL/PLINQ
  Manual parallelization
  unsafe (can be combined with TPL)
If you are unhappy, then
  Write in C++
  Speculatively add OpenMP directives
  Fine-tune with TBB as needed
System.Drawing.Bitmap is slow
  Has very slow GetPixel()/SetPixel() methods
Can lock the bitmap in memory and manipulate it in unmanaged code
What we need to pass in
  Pointer to bitmap image data
  Bitmap width and height
  Bitmap stride (bytes per row in memory, including padding)
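A sketch of what the unmanaged side of such a call might look like. The function name InvertBitmap and the 4-bytes-per-pixel layout are assumptions for illustration. The point is that rows are addressed through the stride, not through width times pixel size, because each row may carry padding. With OpenMP disabled the pragma is ignored and the rows are processed serially.

```cpp
#include <cstddef>  // std::ptrdiff_t

// Inverts the colour channels of a 32-bit-per-pixel bitmap in place.
// data   : pointer to the first byte of the first row
// width  : pixels per row
// height : number of rows
// stride : bytes from the start of one row to the start of the next
void InvertBitmap(unsigned char* data, int width, int height, int stride)
{
    #pragma omp parallel for
    for (int y = 0; y < height; ++y)
    {
        // Step by stride, not width * 4: rows may end in padding bytes.
        unsigned char* row = data + static_cast<std::ptrdiff_t>(y) * stride;
        for (int x = 0; x < width; ++x)
        {
            unsigned char* px = row + x * 4;
            px[0] = static_cast<unsigned char>(255 - px[0]);  // first colour channel
            px[1] = static_cast<unsigned char>(255 - px[1]);  // second colour channel
            px[2] = static_cast<unsigned char>(255 - px[2]);  // third colour channel
            // px[3] (alpha) is left untouched
        }
    }
}
```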
Image-rendered headings with subpixel postprocessing (http://bit.ly/10x0G8)
  WPF FlowDocument for initial generation
  C++/OpenMP for postprocessing
  Asp.Net for serving the result
Freeform rendered text with OpenType features (http://bit.ly/1cCP50)
  Bitmap rendering in Direct2D (C++/lightweight COM API)
  OpenType markup language
The End
a l b v o b q l l k u t m y w m w r e e r q q m q i q d n w g s s w d av p d v n u x j l s y t u b n b y c t h r r y u v a s t a d t n z f f xg q h b j j p y o w s i g i c i i g s o f n f r j f d c f g m k w u y jv b v e m i t i j x u v w s j u g u y l b o c m y k u b w s w n p x i ok a y c q o s u n k s c g x j x j e q p h j i a c m j z h c k v x k a kf e c r u u x q p p k o f w g x b v j m b e l e e w k s c v n n o g c zw w f w i n e h j q l h x u v j o m h g s x a j z b d n u a s c n a j ix w i n w z j d s p n w i p c n d s r m j h z q j g b w j m e z k j v az o u q w d c j c f o x w t h v s r h o m j y n a u p p u p h z n s j rm b z o w k i n t h l i k z w m z m f x c h o m w x b s m x u c j x o sh x u e t p u x e o v l h a y p f f v a x z x l z u l c l n q g e g m xy k k k q j n h p i j w i p d d a x z s z e m p c l i m s u g e i z o mq p r p d w m y q t o v m p T H E y E N D v z d c z x m g q q r h n b ji b q i p x n h w i d o h m a w c x m g h c y r i k n p n d m c x l z eh h s c l f s y l k j s p t d q e b k v u x k m k z p g k e n a f h h ro x v w k u j u t n e u q f a d n e d y y y f c z c a p x y f b r w e yo f a v f h z r y a n z u q r o g n f p x l j y l u a n r d o r v k m fj y n h p c c t k x y t b f j r n x g c z h s p c e i q g x k p f g r nl y i i f t i s b i f c k c h e s l w y s u p d v x b r l q l k i z d zw s a w r i i u m n i x r c j n d h n w g s f s i l h a b h l h x m v pt e g k n o i s g s x v b o k e c i j y b e d r t p e x v r c w u v d sd o a z t t m u i u v u b p l w c p x n k k v a a v b b s e e f d b f yi v c j k r g r y t j a m f v h b f s b z l i n a x c l r l z i v l c bn u d l l g u y r t t u q t l y j l q u h a o u o p t g v l q q r k r qy p l z d x n q n q v t f b u h r y n k f q i t h i u w i n m l o c c c