High Performance Programming with C++
Hafiza Rabbia Ibrahim, July 25, 2011
R.Ibrahim (CE Master)
Dec 19, 2015
Outline
• Motivation
• Return Value Optimization (RVO)
• Inlining
• Standard Template Library (STL)
• Constructors and Destructors
• Virtual Functions
• Coding Optimization
Motivation
Performance:
• Space efficiency
• Time efficiency
Return Value Optimization (RVO)
Why ?
• Methods must return an object
• An object must be created to return
• Constructing an object is time-consuming
“An optimization often performed by compilers to speed up your source code by transforming it and eliminating object creation.”
• For instance, let’s walk through a simple example of adding two complex numbers.
Without optimization, the compiler-generated code (shown as pseudocode) for Complex_Add() is:
void Complex_Add(Complex& __tempResult, const Complex& c1, const Complex& c2)
{
    struct Complex retVal;
    retVal.Complex::Complex();               // construct retVal
    retVal.real = c1.real + c2.real;
    retVal.imag = c1.imag + c2.imag;
    __tempResult.Complex::Complex(retVal);   // copy-construct __tempResult
    retVal.Complex::~Complex();              // destroy retVal
    return;
}
• The compiler can optimize Complex_Add() by eliminating the local object retVal and replacing it with __tempResult. This is RVO:
void Complex_Add(Complex& __tempResult, const Complex& c1, const Complex& c2)
{
    __tempResult.Complex::Complex();         // construct __tempResult
    __tempResult.real = c1.real + c2.real;
    __tempResult.imag = c1.imag + c2.imag;
    return;
}
[Bar chart: Execution time comparison (seconds). With RVO: 1.3; without RVO: 1.89.]
Is it mandatory?
• NO!
• Applying RVO is at the discretion of the compiler implementation. Consult your compiler documentation, or experiment, to find out if and when RVO is applied.
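One way to run the experiment suggested above is to count copy constructions. The sketch below is only an illustration: the Complex type, its copies counter, and the helper copiesForOneAdd() are hypothetical names, not part of the original example. It returns an unnamed temporary, the form that compilers elide most readily; note that since C++17 this particular form of elision is required by the language, while eliding a named local object remains optional.

```cpp
#include <cassert>

// Hypothetical Complex type that counts how often it is copy-constructed.
struct Complex {
    double real;
    double imag;
    static int copies;   // number of copy constructions so far

    Complex(double r = 0.0, double i = 0.0) : real(r), imag(i) {}
    Complex(const Complex& other) : real(other.real), imag(other.imag) { ++copies; }
};

int Complex::copies = 0;

// Returns an unnamed temporary, so the result can be built directly
// in the caller's storage instead of being copied out.
Complex add(const Complex& a, const Complex& b)
{
    return Complex(a.real + b.real, a.imag + b.imag);
}

// Reports how many copy constructions a single call to add() caused.
int copiesForOneAdd()
{
    Complex a(1.0, 2.0);
    Complex b(3.0, 4.0);
    const int before = Complex::copies;
    Complex c = add(a, b);
    assert(c.real == 4.0 && c.imag == 6.0);
    return Complex::copies - before;
}
```

If copiesForOneAdd() reports zero, the compiler constructed the result in place; a non-zero count means a temporary was copied.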
INLINING
What are we avoiding?
• Method invocation costs
How are we avoiding it?
• Optimization tricks
What we are avoiding: Method Invocation Costs
Registers:
• Register 0, Register 1, Register 2, ..., Register X
• Argument Pointer
• Frame Pointer
• Stack Pointer
• Instruction Pointer
Memory:
• Variables passed in as arguments to the method
• Registers used by, and therefore saved by, the method
• Memory allocated for the method’s automatic variables
• Arguments pushed on the stack in preparation for a call
• Unused memory
• 6 to 8 registers are saved
• At least 40 cycles are consumed (data movement to and from memory)
Expensive in terms of machine cycles!
Why Inline?
• Arguably the most significant performance enhancement technique available in C++.
Program’s Fast Path
• The portion of a program that supports the normal, error-free, common usage cases of the program’s execution.
• Typically less than 10% of the program’s code lies on this fast path.
Inlining and Fast Path
“Inlining allows us to remove calls from the fast path.”
Inlining Performance Story
• Performance gain from avoiding expensive method invocation
• Performance gain from cross-call optimization
Performance gain of Avoiding method invocation
#include <iostream>

// inline    (uncomment to inline calc)
int calc(int a, int b)
{
    return a + b;
}

int main()
{
    int x[1000] = {};   // zero-initialized so that the reads below are well defined
    int y[1000] = {};
    int z[1000] = {};
    for (int i = 0; i < 1000; ++i) {
        for (int j = 0; j < 1000; ++j) {
            for (int k = 0; k < 1000; ++k) {
                z[i] = calc(y[j], x[k]);
            }
        }
    }
}
When outlined: 62 seconds execution time
When inlined: 8 seconds execution time
Here, inlining provided roughly an 8x performance gain.
Performance gain of Cross Call Optimization
Cross-call optimization takes the form of doing things at compile time to avoid the necessity of doing them at run time. For instance:
enum TrigFuns { SIN, COS, TAN };

// inline
float calc_trig(TrigFuns fun, float val)
{
    switch (fun) {
        case SIN: return sin(val);
        case COS: return cos(val);
        case TAN: return tan(val);
    }
    return 0;   // not reached for valid TrigFuns values
}

// inline
TrigFuns get_trig_fun()
{
    return SIN;
}
Performance gain of Cross Call Optimization (cont.)
// inline
float get_float() { return 90; }

void calculator()
{
    . . .
    TrigFuns tf = get_trig_fun();
    float value = get_float();
    reg0 = calc_trig(tf, value);
    . . .
}
If inlined: the compiler can optimize across the calls and reduce them to a simple calculation.
If outlined: no cross-method optimization is possible; only intra-method optimization is possible.
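To make the cross-call effect concrete, here is a compilable sketch of the same example with the inline keyword enabled. The function calculator_after_inlining() is a hypothetical illustration of what an optimizer may reduce the call chain to once it can see through all three calls; whether a given compiler actually does so is not guaranteed.

```cpp
#include <cassert>
#include <cmath>

enum TrigFuns { SIN, COS, TAN };

inline float calc_trig(TrigFuns fun, float val)
{
    switch (fun) {
        case SIN: return std::sin(val);
        case COS: return std::cos(val);
        case TAN: return std::tan(val);
    }
    return 0.0f;   // not reached for valid enumerators
}

inline TrigFuns get_trig_fun() { return SIN; }
inline float get_float() { return 90.0f; }

// The call chain as written on the slides, returning the result
// instead of storing it in reg0.
inline float calculator()
{
    TrigFuns tf = get_trig_fun();
    float value = get_float();
    return calc_trig(tf, value);
}

// Hypothetical end result of cross-call optimization: with every call
// visible, the switch and both getters collapse to one known computation.
inline float calculator_after_inlining()
{
    return std::sin(90.0f);
}
```

Both functions compute the same value; the difference is how much work is left for run time.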
Why not Inline?
If inlining is that good, why don’t you inline everything?
Issues with Inlining
• Size of the program’s code increases
• Storage issues
multiple instances -> each has a unique address -> each takes its own storage in the cache -> effective cache capacity decreases -> cache miss rate increases
• Degenerative characteristics
exponential code growth
int D()
{
    . . . // 500 code bytes of functionality
}
int C()
{
    D();
    . . . // 500 code bytes of functionality
    D();
}
int B()
{
    C();
    . . . // 500 code bytes of functionality
    C();
}
int A()
{
    B();
    . . . // 500 code bytes of functionality
    B();
}
int main()
{
    A(); A(); A(); A(); A();
    A(); A(); A(); A(); A();
}
Inlining A, B, C, and D will increase the code size by more than 70K bytes, i.e. a 37x increase.
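The 37x figure can be checked with a little arithmetic. The sketch below assumes that each function contributes exactly 500 bytes of its own code and that the size of the call instructions and of main's own code is negligible; the helper names are hypothetical.

```cpp
#include <cassert>

// Size of a function once its callee is inlined at every call site:
// its own code plus one full copy of the callee per call.
long inlinedSize(long ownBytes, int calls, long calleeBytes)
{
    return ownBytes + calls * calleeBytes;
}

// Code size of main() after A, B, C, and D are all inlined.
long inlinedMainSize()
{
    const long d = inlinedSize(500, 0, 0);   // D:  500 bytes
    const long c = inlinedSize(500, 2, d);   // C: 1500 bytes
    const long b = inlinedSize(500, 2, c);   // B: 3500 bytes
    const long a = inlinedSize(500, 2, b);   // A: 7500 bytes
    return 10 * a;                           // ten inlined copies of A
}

// Code size when nothing is inlined: one copy each of A, B, C, and D.
long outlinedSize()
{
    return 4 * 500;
}
```

Under these assumptions the inlined program is 75,000 bytes against 2,000 bytes outlined, a ratio of 37.5.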
When should you inline?
The decision depends on dynamic call frequency and method size.
Low dynamic frequency (the bottom 80% of call frequency):
• Large method (more than 20 lines of code): Don't inline
• Medium method (between 5 and 20 lines of code): Don't inline
• Small method (less than 5 lines of code): Inline if you have the time and patience
Medium dynamic frequency (the top 5–20% of call frequency):
• Large method: Don't inline
• Medium method: Consider rewriting the method to expose its fast path and then inline
• Small method: Always inline
High dynamic frequency (the top 5% of call frequency):
• Large method: Consider rewriting the method to expose its fast path and then inline
• Medium method: Selectively inline the high frequency static invocation points
• Small method: Always inline
How we are avoiding: Inlining Optimization Tricks
Conditional Inlining
Methods are outlined via the .c file or inlined via the .inl file.
File: x.h
class X {
    ...
    int y(int a);
};
#if defined(INLINE)
#include "x.inl"
#endif

File: x.inl
#if !defined(INLINE)
#define inline
#endif
inline int X::y(int a) { .... }

File: x.c
#if !defined(INLINE)
#include "x.inl"
#endif

When INLINE is not defined, the .h file will not include the inlined methods; instead, these methods will be included in the .c file, and the inline directive will be stripped from the front of each method.
Selective Inlining: inlining a method only at specific call sites
File: x.h
class x {
public:
    int inline_y(int a);
    int y(int a);
};
#include "x.inl"

File: x.inl
inline int x::inline_y(int a) { .... }   // original implementation of y

File: x.c
int x::y(int a) { return inline_y(a); }
Concluding words about Inlining
• Inlining “might” improve the performance.
• Inlining may backfire, e.g. by increasing the size of the code.
Be sure about the real cost of calls on your system before using inlining!
Standard Template Library(STL)
Questions to be answered
Faced with a given computational task, what containers should I use? Are some better than others for a given scenario?
How good is the performance of the STL? Can I do better by rolling my own home-grown containers and algorithms?
Execution time Comparisons
[Bar chart: Inserting at the front, execution time in milliseconds. vector<int>: 800; list<int>: 10.]
Execution time Comparisons (cont.)
[Bar chart: Deleting elements at the front, execution time in milliseconds. vector<int>: 700; list<int>: 7.]
Execution time Comparisons (cont.)
[Bar chart: Container traversal speed, execution time in milliseconds. array: 110; vector: 110; list: 2600.]
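Comparisons like these are easy to repeat on your own machine. The following is a minimal sketch of a front-insertion timing harness, not the code used to produce the charts; the function names timeFrontInsert and sumAll are hypothetical, and the absolute numbers will depend on compiler, optimization level, and hardware.

```cpp
#include <cassert>
#include <chrono>
#include <list>
#include <vector>

// Inserts the integers 0..n-1 at the front of a container and
// returns the elapsed time in milliseconds.
template <class Container>
double timeFrontInsert(Container& c, int n)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        c.insert(c.begin(), i);   // vector shifts every element; list links one node
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Traverses a container and sums its elements, so that both
// containers can be checked for identical contents.
template <class Container>
long sumAll(const Container& c)
{
    long total = 0;
    for (int value : c) {
        total += value;
    }
    return total;
}
```

Calling timeFrontInsert on a std::vector<int> and on a std::list<int> with the same n gives two timings to compare; sumAll can be timed the same way to compare traversal.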
Can I do better?
STL:
char s[] = "abcde";      // an array, so the characters can be modified
reverse(&s[0], &s[5]);   // reverse the sequence as required

Home grown:
char s[] = "abcde";
char temp;
temp = s[4];   // s[0] <-> s[4]
s[4] = s[0];
s[0] = temp;
temp = s[3];   // s[1] <-> s[3]
s[3] = s[1];
s[1] = temp;
Comparison STL speed to Home-grown code
[Bar chart: STL speed compared to home-grown code, execution time in milliseconds. STL: 55; home grown: 14.]
Conclusions about STL performance
Outperforming the STL is possible.
However, you have to bend over backwards to concoct scenarios in which a home-grown implementation outperforms the STL.
To outperform the STL, a home-grown implementation must have something that the STL does NOT have!
Constructors and Destructors
Why this analysis?
• The performance of constructors and destructors is often poor due to the fact that an object's constructor (destructor) may call the constructors (destructors) of member objects and parent objects.
• This can result in constructors (destructors) that take a long time to execute, especially with objects in complex hierarchies or objects that contain several member objects.
• Hence a Performance Hit!
Connection between the cost of constructors/destructors and inheritance-based design
• Encounter: implementation of thread synchronization constructs
• In multithreaded applications, thread synchronization is needed to restrict concurrent access to shared resources
• Thread synchronization constructs can be any of:
Semaphore, Mutex, Critical Section
Strategy:
• Encapsulate the lock in an object, e.g. a MutexLock object
• Let the constructor obtain the lock
• The destructor will release the lock automatically (as it does for regular objects)
• The compiler inserts a call to the lock destructor prior to each return statement
• And the lock is always released!
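The strategy above can be sketched without any threading library. In the illustration below, FakeMutex is a hypothetical stand-in for a real mutex that merely records whether it is held, and ScopedLock and incrementIfBelow are likewise hypothetical names. In modern C++ the standard library's std::lock_guard plays the role of ScopedLock.

```cpp
#include <cassert>

// Hypothetical stand-in for a real mutex: it only records whether it is held.
struct FakeMutex {
    bool locked = false;
    void lock()   { assert(!locked); locked = true; }
    void unlock() { assert(locked);  locked = false; }
};

// The constructor obtains the lock; the destructor releases it.
template <class Mutex>
class ScopedLock {
public:
    explicit ScopedLock(Mutex& m) : mutex_(m) { mutex_.lock(); }
    ~ScopedLock() { mutex_.unlock(); }

    ScopedLock(const ScopedLock&) = delete;             // a lock must not be copied
    ScopedLock& operator=(const ScopedLock&) = delete;

private:
    Mutex& mutex_;
};

// A function with two return paths; the lock is released on both.
int incrementIfBelow(FakeMutex& m, int& counter, int limit)
{
    ScopedLock<FakeMutex> guard(m);
    if (counter >= limit) {
        return -1;    // early return: the destructor still releases the lock
    }
    ++counter;
    return counter;   // normal return: same destructor, same release
}
```

Neither return statement mentions the lock, yet the mutex is free again after every call.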
Performance comparison of constructor/destructor behaviour with a mutex, in the case of:
• Non-inherited object
• Inherited object
Lock class implementation
class Lock {
public:
    Lock(pthread_mutex_t& key) : theKey(key) { pthread_mutex_lock(&theKey); }
    ~Lock() { pthread_mutex_unlock(&theKey); }
private:
    pthread_mutex_t &theKey;
};
BaseLock class implementation
class BaseLock {
public:
    BaseLock(pthread_mutex_t &key, const LogSource &lsrc) {}
    virtual ~BaseLock() {}
};
This class is intended as a root class for the various lock classes that are expected to be derived from it.
Subclass of BaseLock: MutexLock class implementation
class MutexLock : public BaseLock {
public:
    MutexLock(pthread_mutex_t &key, const LogSource &lsrc);   // constructor
    ~MutexLock();                                             // destructor
private:
    pthread_mutex_t &theKey;
    const LogSource &src;
};
The LogSource object is meant to capture the filename and source code line where the object was constructed.
MutexLock constructor
MutexLock::MutexLock(pthread_mutex_t& aKey, const LogSource& source)
    : BaseLock(aKey, source), theKey(aKey), src(source)
{
    pthread_mutex_lock(&theKey);
#if defined(DEBUG)
    cout << "MutexLock " << &aKey << " created at " << src.file() << " line " << src.line() << endl;
#endif
}
MutexLock Destructor
MutexLock::~MutexLock()
{
    pthread_mutex_unlock(&theKey);
#if defined(DEBUG)
    // aKey is not in scope in the destructor, so print the stored key instead
    cout << "MutexLock " << &theKey << " destroyed at " << src.file() << " line " << src.line() << endl;
#endif
}
Non-inherited Mutex Object
int main()
{
    . . .
    // Start timing here
    for (i = 0; i < 1000000; ++i) {
        SimpleMutex m(mutex);   // constructor locks, destructor unlocks
        sharedCounter++;
    }
    // Stop timing here
    . . .
}
SimpleMutex: an object of a class containing acquire() and release() methods
Inherited Mutex Object
int main()
{
    . . .
    // Start timing here
    for (i = 0; i < 1000000; i++) {
        DerivedMutex m(mutex);   // constructor locks, destructor unlocks
        sharedCounter++;
    }
    // Stop timing here
    . . .
}
SimpleMutex is replaced by DerivedMutex (an object of a class derived from BaseMutex)
[Bar chart: Execution time comparison (seconds). Non-inherited case: 1.01; inherited case: 1.62. The difference is labelled as the inheritance cost.]
Concluding Remarks
• Distinguish between overall computational cost, required cost, and computational penalty.
• Eliminate the part that is not required (the penalty) by using some other mechanism.
• Overall cost increases with the size of the derivation tree.
VIRTUAL FUNCTIONS
Impact on performance
• A class with virtual functions gets a virtual function table (vtbl), and each object of the class is assigned a pointer to it (vptr).
Virtual functions (VFs) seem to inflict a performance cost in several ways:
The vptr must be initialized in the constructor.
VFs are called using pointer indirection, resulting in a few extra instructions per method invocation.
Inlining is a compile-time decision. The compiler cannot inline VFs whose resolution takes place at run time.
Performance Comparison for virtual and Non-virtual methods
class Virtual {
private:
    int mv;
public:
    Virtual() { mv = 0; }
    virtual ~Virtual() {}
    virtual int foo() const { return (mv); }
};
• Creating virtual objects costs more than creating non-virtual objects, because the pointer to the virtual function table (vptr) must be initialized.
• It takes slightly longer to call virtual functions, because of the additional level of indirection.
Performance Comparison for virtual and Non-virtual methods (cont.)
class NonVirtual {
private:
    int mnv;
public:
    NonVirtual() { mnv = 0; }
    ~NonVirtual() {}
    int foo() const { return (mnv); }
};
• Construction/destruction shows the performance penalty of initializing the pointer to the virtual function table.
• Virtual function invocation is slightly more expensive than invoking a function through a function pointer (memory overhead).
[Bar chart: relative execution time, virtual vs. non-virtual. ctor/dtor: virtual 1, non-virtual 0.73; foo: virtual 1, non-virtual 0.96.]
If a specific virtual function creates a performance problem for you, what are your options?
To eliminate a virtual call, you must allow the compiler to resolve the function binding at compile time.
You bypass dynamic binding by
– hard-coding (derive distinct classes from string: CriticalSection)
– inheritance (derive a single ThreadSafeString class that contains a pointer to a Locker object. Use polymorphism to select the particular synchronization mechanism at runtime)
– templates (Create a template-based string class parameterized by the Locker type.)
Hard-Coding (synchronization mechanism example)
• Standard string class serves as a base class
class CriticalSectionString : public string {
public:
    . . .
    int length();
private:
    CriticalSectionLock cs;
};
int CriticalSectionString::length()
{
    cs.lock();
    int len = string::length();
    cs.unlock();
    return len;
}
+ Although lock() and unlock() are VFs, the compiler can resolve them statically: it can bypass dynamic binding and choose the correct lock() and unlock() to use.
+ This allows the compiler to inline those calls.
- You need to write a separate string class for each synchronization flavour -> poor code reuse!
Inheritance
• Implementing a string class for each synchronization mechanism is a pain so you can factor out the synchronization choice into a constructor argument.
class ThreadSafeString : public string {
public:
    ThreadSafeString(const char *s, Locker *lockPtr) : string(s), pLock(lockPtr) {}
    . . .
    int length();
private:
    Locker *pLock;   // pointer to the Locker object
};
// The length() method is now implemented as follows:
int ThreadSafeString::length()
{
    pLock->lock();
    int len = string::length();
    pLock->unlock();
    return len;
}
+ More compact than the previous one
- The lock() and unlock() virtual calls can only be resolved at execution time and hence cannot be inlined
Templates
• Templates combine the best of both worlds: reuse and efficiency.
template <class LOCKER>
class ThreadSafeString : public string {
public:
    ThreadSafeString(const char *s) : string(s) {}
    . . .
    int length();
private:
    LOCKER lock;
};
// The length method implementation is similar to the previous ones:
template <class LOCKER>
inline
int ThreadSafeString<LOCKER>::length()
{
    lock.lock();
    int len = string::length();
    lock.unlock();
    return len;
}
+ Provides relief from the virtual function calls to lock() and unlock().
+ Enables the compiler to resolve the calls statically and inline them.
+ Pushes the type resolution to compile time.
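A complete, compilable variant of the template approach might look like the sketch below. The two locker types, NullLocker and CountingLocker, are hypothetical policies invented for illustration, and unlike the slide version this sketch holds a std::string member instead of inheriting from string, purely to keep the example short.

```cpp
#include <cassert>
#include <string>

// Hypothetical locker for single-threaded use: both calls do nothing
// and can be inlined away entirely.
struct NullLocker {
    void lock() {}
    void unlock() {}
};

// Hypothetical locker that counts calls, to show that they are made.
struct CountingLocker {
    int locks = 0;
    int unlocks = 0;
    void lock() { ++locks; }
    void unlock() { ++unlocks; }
};

// The locker type is fixed at compile time, so lock() and unlock()
// are ordinary non-virtual calls that the compiler may inline.
template <class LOCKER>
class ThreadSafeString {
public:
    explicit ThreadSafeString(const char* s) : text_(s) {}

    int length()
    {
        lock_.lock();
        const int len = static_cast<int>(text_.length());
        lock_.unlock();
        return len;
    }

    const LOCKER& locker() const { return lock_; }

private:
    std::string text_;
    LOCKER lock_;
};
```

Choosing ThreadSafeString<NullLocker> or ThreadSafeString<CountingLocker> selects the synchronization flavour without a virtual call and without writing a second string class.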
Coding Optimizations
Caching
• Remember the results of frequent and costly computations
• Then you will not have to perform those computations over and over again
• For instance, evaluating a loop-invariant expression inside a loop is inefficient:
for (...; !done; ...) {
    done = patternMatch(pat1, pat2, isCaseSensitive());
}
// The two string patterns are compared according to the third argument (itself a function call)
// isCaseSensitive() is independent of the loop iterations
int isSensitive = isCaseSensitive();
for(... ; !done; ... ) {
done = patternMatch (pat1, pat2, isSensitive);
}
Now you compute case sensitivity once, cache it in a local variable, and reuse it!
Useless Computations
• Pointless computations whose results are never used!
• For instance: wasted initialization of a member object
class Student {
public:
    Student(char *nm);
    . . .
private:
    string name;
};
// The Student constructor turns the input character pointer into a string object representing the student's name:
Student::Student(char *nm)
{
    // Before the body runs, the compiler inserts a call to the string
    // default constructor for name; the body then follows with:
    name = nm;
    . . .
}
The assignment wipes away the result of the compiler-generated call to the string default constructor. We can eliminate this pointless computation by using an explicit string constructor:
Student::Student(char *nm) : name(nm)   // explicit string constructor
{
    ....
}
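The wasted work can be made visible by counting operations. In the sketch below, Name is a hypothetical stand-in for string that counts its default constructions, converting constructions, and assignments; StudentAssign and StudentInit are hypothetical names for the two constructor styles shown above.

```cpp
#include <cassert>

// Hypothetical member type that counts the operations performed on it.
struct Name {
    static int defaultCtors;
    static int convertingCtors;
    static int assignments;

    const char* text;

    Name() : text("") { ++defaultCtors; }
    Name(const char* s) : text(s) { ++convertingCtors; }
    Name& operator=(const Name& other)
    {
        text = other.text;
        ++assignments;
        return *this;
    }
};

int Name::defaultCtors = 0;
int Name::convertingCtors = 0;
int Name::assignments = 0;

void resetCounts()
{
    Name::defaultCtors = 0;
    Name::convertingCtors = 0;
    Name::assignments = 0;
}

// Assignment in the body: name is default-constructed first, then a
// temporary Name is built from nm and assigned over it.
struct StudentAssign {
    Name name;
    StudentAssign(const char* nm) { name = nm; }
};

// Member initializer list: name is constructed directly from nm.
struct StudentInit {
    Name name;
    StudentInit(const char* nm) : name(nm) {}
};
```

Constructing a StudentAssign costs three operations on the member (default construction, conversion, assignment), while StudentInit needs only the single converting construction.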
Lazy Evaluation
• You should not perform costly computations “just in case.”
• We ought to delay object definition to the scope where it is being used.
• For instance, some code routed messages between downstream and upstream communication adapters. One of the objects we used was very expensive to construct and destroy:
int route(Message *msg)
{
    ExpensiveClass upstream(msg);
    if (goingUpstream) {
        . . . // do something with the expensive object
    }
    // upstream object not used here
    return SUCCESS;
}
• The expensive upstream object is used only 50% of the time.
• A better solution defines the expensive upstream object in the scope where it is actually necessary:
int route(Message *msg)
{
    if (goingUpstream) {
        ExpensiveClass upstream(msg);
        // do something with the expensive object
    }
    // upstream object not used here
    return SUCCESS;
}
80-20 Rule: Speed up the common path
• 80% of the execution scenarios will traverse only 20% of your source code, and 80% of the elapsed time will be spent in 20% of the functions encountered on the execution path.
• For instance, consider the evaluation order of sub-expressions:
if (e1 || e2)
{ ... }
• If e1 and e2 are equally likely to evaluate to TRUE, the sub-expression with the smaller computational cost should be placed first!
• If e1 and e2 are of equal computational cost, the one most likely to be TRUE should be placed first!
• p1 = conditional probability of e1 being TRUE
• c1 = computational cost of e1
Cost = c1 + (1 - p1) * c2
• If e2 evaluates to TRUE 100% of the time, p2 = 1.0
• If e1 evaluates to TRUE 90% of the time, p1 = 0.9
• c1 = 10 instructions; c2 = 100 instructions
Cost = 10 + 0.1 * 100 = 20
• If we flip e1 and e2, i.e. if (e2 || e1):
• Cost = c2 + (1 - p2) * c1
Cost = 100 + 0 * 10 = 100
• So, if (e1 || e2) is a better choice than if (e2 || e1)
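The cost formula is simple enough to wrap in a helper for trying other probabilities and costs. The function name expectedOrCost is hypothetical; it assumes, as the slides do, that the second operand is evaluated only when the first is FALSE.

```cpp
#include <cassert>
#include <cmath>

// Expected cost of evaluating (first || second) with short-circuiting:
// the first operand is always evaluated, the second only when the
// first turns out to be FALSE.
double expectedOrCost(double firstCost, double firstProbTrue, double secondCost)
{
    return firstCost + (1.0 - firstProbTrue) * secondCost;
}
```

With the numbers above, expectedOrCost(10, 0.9, 100) gives about 20 instructions for if (e1 || e2), and expectedOrCost(100, 1.0, 10) gives 100 for if (e2 || e1).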
Concluding words about coding optimizations
• Are you ever going to use the result?
It sounds silly, but it happens. At times we perform a computation and never use the results.
• Do you need the results now?
Defer a computation to the point where it is actually needed. Premature computations may never be used on some execution flows.
• Do you know the result already?
We do costly computations even though their results are already available two lines above. If you already computed it earlier in the execution flow, make the result available for reuse.
Thank you for your attention.
Questions...?