High Performance Programming with C++

Hafiza Rabbia Ibrahim (CE Master), July 25, 2011



Outline

• Motivation

• Return Value Optimization (RVO)

• Inlining

• Standard Template Library (STL)

• Constructors and Destructors

• Virtual Functions

• Coding Optimizations


Motivation

Performance

• Space efficiency

• Time efficiency


Return Value Optimization (RVO)


Why?

• Methods must often return an object

• An object has to be created in order to return it

• Constructing an object is time consuming

“An optimization often performed by compilers to speed up your source code by transforming it and eliminating object creation.”


• For instance, let's walk through a simple example with complex numbers.

Without optimization, the compiler-generated pseudocode for Complex_Add() is:

void Complex_Add(Complex& __tempResult, const Complex& c1, const Complex& c2)
{
    struct Complex retVal;
    retVal.Complex::Complex();              // construct retVal
    retVal.real = c1.real + c2.real;
    retVal.imag = c1.imag + c2.imag;
    __tempResult.Complex::Complex(retVal);  // copy-construct __tempResult
    retVal.Complex::~Complex();             // destroy retVal
    return;
}


• The compiler can optimize Complex_Add() by eliminating the local object retVal and replacing it with __tempResult. This is RVO:

void Complex_Add(Complex& __tempResult, const Complex& c1, const Complex& c2)
{
    __tempResult.Complex::Complex();        // construct __tempResult
    __tempResult.real = c1.real + c2.real;
    __tempResult.imag = c1.imag + c2.imag;
    return;
}
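At source level, the function behind the pseudocode above could look like the following sketch. The Complex class body and both function names are assumptions reconstructed from the pseudocode, not code taken from the slides. The second form returns an unnamed temporary, which gives the compiler the easiest opportunity to build the result directly in the caller's storage.

```cpp
// Hypothetical Complex class, reconstructed from the pseudocode above.
class Complex {
public:
    Complex(double r = 0.0, double i = 0.0) : real(r), imag(i) {}
    double real;
    double imag;
};

// Named local object: the compiler may or may not eliminate retVal.
Complex Complex_Add_Named(const Complex& a, const Complex& b)
{
    Complex retVal;
    retVal.real = a.real + b.real;
    retVal.imag = a.imag + b.imag;
    return retVal;
}

// Unnamed temporary: the result can be constructed directly
// in the storage the caller provides for the return value.
Complex Complex_Add(const Complex& a, const Complex& b)
{
    return Complex(a.real + b.real, a.imag + b.imag);
}
```

Both versions compute the same value; they differ only in how much object construction the compiler has to remove.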


Figure: Execution time comparison (seconds). With RVO: 1.3; without RVO: 1.89.


Is it mandatory?

• NO!

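• Whether RVO is applied is at the discretion of the compiler implementation. You need to consult your compiler documentation or experiment to find out if and when RVO is applied. (Since C++17, the copy is guaranteed to be elided when a function returns an unnamed temporary; eliding a named local object remains optional.)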


INLINING

What are we avoiding?

• Method invocation costs

How are we avoiding it?

• Optimization tricks


What we are avoiding: Method Invocation Costs

REGISTERS: Register 0, Register 1, Register 2, ..., Register X, Argument Pointer, Frame Pointer, Stack Pointer, Instruction Pointer

MEMORY:

• Variables passed in as arguments to the method

• Registers used by, and therefore saved by, the method

• Memory allocated for the method's automatic variables

• Arguments pushed on the stack in preparation for a call

• Unused memory

Cost of a call:

• 6 to 8 registers are saved

• At least 40 cycles are consumed (data movement to and from memory)

Expensive in terms of machine cycles!


Why Inline?

• Inlining is arguably the most significant performance enhancement technique available in C++.

Program's Fast Path

• The portion of a program that supports the normal, error-free, common usage cases of the program's execution.

• Typically less than 10% of the program's code lies on this fast path.

Inlining and Fast Path

“Inlining allows us to remove calls from the fast path.”


Inlining Performance Story

• Performance gain from avoiding expensive method invocation

• Performance gain from cross-call optimization


Performance gain of Avoiding method invocation

#include <iostream>

//inline
int calc(int a, int b)
{
    return a + b;
}

int main()
{
    int x[1000] = {0};
    int y[1000] = {0};
    int z[1000] = {0};

    for (int i = 0; i < 1000; ++i) {
        for (int j = 0; j < 1000; ++j) {
            for (int k = 0; k < 1000; ++k) {
                z[i] = calc(y[j], x[k]);
            }
        }
    }
}

When outlined: 62 seconds execution time

When inlined: 8 seconds execution time

Inlining gave roughly an 8x performance gain here.


Performance gain of Cross Call Optimization

Cross-call optimization takes the form of doing things at compile time to avoid the necessity of doing them at run time. For instance:

#include <cmath>

enum TrigFuns { SIN, COS, TAN };

//inline
float calc_trig(TrigFuns fun, float val)
{
    switch (fun) {
        case SIN: return sin(val);
        case COS: return cos(val);
        case TAN: return tan(val);
    }
    return 0.0f;   // not reached for valid TrigFuns values
}

//inline
TrigFuns get_trig_fun()
{
    return SIN;
}


Performance gain of Cross Call Optimization (cont.)

//inline
float get_float()
{
    return 90;
}

void calculator()
{
    ---
    TrigFuns tf = get_trig_fun();
    float value = get_float();
    reg0 = calc_trig(tf, value);
    ---
}

If inlined: the compiler can see that tf is always SIN and value is always 90, so the three calls reduce to one simple calculation.

If outlined: no cross-method optimization is possible; only intra-method optimization remains.


Why not Inline?

If inlining is that good, why don't you inline everything?


Issues with Inlining

• The size of the program's compiled code increases

• Storage issues

Multiple instances of the inlined code -> each has a unique address -> each takes its own storage in the cache -> effective cache capacity decreases -> cache miss rate increases

• Degenerative characteristics

Exponential code growth


int D()
{
    . . .   // 500 code bytes of functionality
}

int C()
{
    D();
    . . .   // 500 code bytes of functionality
    D();
}

int B()
{
    C();
    . . .   // 500 code bytes of functionality
    C();
}

int A()
{
    B();
    . . .   // 500 code bytes of functionality
    B();
}

int main()
{
    A(); A(); A(); A(); A();
    A(); A(); A(); A(); A();
}

Inlining A, B, C and D will increase the code size by more than 70k bytes, i.e. roughly a 37x increase.
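The 70k and 37x figures can be checked with a little arithmetic. The sketch below assumes that each function contributes exactly 500 bytes of its own code and ignores call overhead and the size of main itself; the constant names are illustrative only.

```cpp
// Each function has 500 bytes of its own code (figure from the example above).
constexpr int kOwnBytes = 500;

// Out-of-line: one copy each of A, B, C and D.
constexpr int kOutlined = 4 * kOwnBytes;          // 2000

// Fully inlined: every function holds its own code plus two copies of its callee.
constexpr int kD = kOwnBytes;                     // 500
constexpr int kC = kOwnBytes + 2 * kD;            // 1500
constexpr int kB = kOwnBytes + 2 * kC;            // 3500
constexpr int kA = kOwnBytes + 2 * kB;            // 7500
constexpr int kInlinedMain = 10 * kA;             // 75000: main holds ten copies of A

static_assert(kInlinedMain - kOutlined > 70000, "growth exceeds 70k bytes");
static_assert(kInlinedMain / kOutlined == 37, "about a 37x increase");
```

Each additional level of nesting roughly doubles the inlined size, which is where the exponential growth comes from.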


When should you inline?

Dynamic frequency | Large method (more than 20 lines of code) | Medium method (between 5 and 20 lines of code) | Small method (less than 5 lines of code)

Low (the bottom 80% of call frequency) | Don't inline | Don't inline | Inline if you have the time and patience

Medium (the top 5 to 20% of call frequency) | Don't inline | Consider rewriting the method to expose its fast path and then inline | Always inline

High (the top 5% of call frequency) | Consider rewriting the method to expose its fast path and then inline | Selectively inline the high frequency static invocation points | Always inline


How we are avoiding: Inlining Optimization Tricks

Conditional Inlining

Outlined in the .c file, inlined in the .inl file.

File: x.h

class X
{
    ...
    int y(int a);
};
#if defined(INLINE)
#include "x.inl"
#endif

File: x.inl

#if !defined(INLINE)
#define inline
#endif
inline int X::y(int a)
{
    ....
}

File: x.c

#include "x.h"
#if !defined(INLINE)
#include "x.inl"
#endif

When INLINE is not defined, the .h file will not include the inlined methods; instead, these methods will be included in the .c file, and the inline directive will be stripped from the front of each method.


Selective Inlining: inlining a method only at specific call sites

File: x.h

class X
{
public:
    int inline_y(int a);
    int y(int a);
};
#include "x.inl"

File: x.inl

inline int X::inline_y(int a)
{
    ....   // original implementation of y
}

File: x.c

int X::y(int a)
{
    return inline_y(a);
}


Concluding words about Inlining

• Inlining “might” improve performance.

• Inlining may backfire, e.g. by increasing the size of the code.

Be sure about the real cost of calls on your system before using inlining!


Standard Template Library (STL)


Questions to be answered

Faced with a given computational task, what containers should I use? Are some better than others for a given scenario?

How good is the performance of the STL? Can I do better by rolling my own home-grown containers and algorithms?


Execution time Comparisons

Figure: Inserting at the front, execution time in milliseconds. vector<int>: 800; list<int>: 10.
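The gap for front insertion follows from the containers' structure: inserting at the front of a vector shifts every existing element, while a list only relinks nodes. A minimal timing sketch is shown below. The helper name time_front_insert is hypothetical, and the measured numbers will depend on the machine and compiler, so no specific output is claimed.

```cpp
#include <chrono>
#include <cstddef>
#include <list>
#include <vector>

// Inserts n integers at the front of a container and returns the elapsed milliseconds.
template <class Container>
double time_front_insert(Container& c, std::size_t n)
{
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        c.insert(c.begin(), static_cast<int>(i));   // vector: shifts all elements; list: links one node
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling it once with a std::vector<int> and once with a std::list<int> for the same n gives a comparison of the kind shown in the figure.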


Execution time Comparisons (cont.)

Figure: Deleting elements at the front, execution time in milliseconds. vector<int>: 700; list<int>: 7.


Execution time Comparisons (cont.)

Figure: Container traversal speed, execution time in milliseconds. array: 110; vector: 110; list: 2600.


Can I do better?

STL:

char s[] = "abcde";
reverse(&s[0], &s[5]);   // the reversed sequence is required

HOME GROWN:

char s[] = "abcde";
char temp;

temp = s[4];   // s[0] <-> s[4]
s[4] = s[0];
s[0] = temp;

temp = s[3];   // s[1] <-> s[3]
s[3] = s[1];
s[1] = temp;

Comparison STL speed to Home-grown code

Figure: Execution time in milliseconds. STL: 55; home grown: 14.


Conclusions about STL performance

• Outperforming the STL is possible.

• However, you have to bend over backwards to concoct scenarios in which a home-grown implementation outperforms the STL.

• To outperform the STL, a home-grown implementation must exploit something that the STL does NOT have!


Constructors and Destructors


Why this analysis?

• The performance of constructors and destructors is often poor due to the fact that an object's constructor (destructor) may call the constructors (destructors) of member objects and parent objects.

• This can result in constructors (destructors) that take a long time to execute, especially with objects in complex hierarchies or objects that contain several member objects.

• Hence a Performance Hit!


Connection between the cost of constructors/destructors and inheritance-based design

• Encounter: implementation of thread synchronization constructs

• In multithreaded applications, thread synchronization is needed to restrict concurrent access to shared resources

• A thread synchronization construct can be any of:

Semaphore, Mutex, Critical Section


Strategy:

• Encapsulate the lock in an object, e.g. a MutexLock object

• Let the constructor obtain the lock

• The destructor will release the lock automatically (as it does for regular objects)

• The compiler inserts a call to the lock destructor prior to each return statement

• And the lock is always released!
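The strategy above can be sketched as follows. For portability, this sketch swaps in std::mutex from the C++11 standard library in place of the pthread mutex used in the slides; the names ScopedLock, g_counterMutex, g_sharedCounter and increment_if_below are hypothetical.

```cpp
#include <mutex>

// Hypothetical scoped lock: acquires in the constructor, releases in the destructor.
class ScopedLock {
public:
    explicit ScopedLock(std::mutex& m) : mutex_(m) { mutex_.lock(); }
    ~ScopedLock() { mutex_.unlock(); }
    ScopedLock(const ScopedLock&) = delete;
    ScopedLock& operator=(const ScopedLock&) = delete;
private:
    std::mutex& mutex_;
};

std::mutex g_counterMutex;   // hypothetical shared state for the example
int g_sharedCounter = 0;

// Both return paths release the lock, because the destructor
// of guard runs whenever the scope is left.
bool increment_if_below(int limit)
{
    ScopedLock guard(g_counterMutex);
    if (g_sharedCounter >= limit) {
        return false;        // lock released here
    }
    ++g_sharedCounter;
    return true;             // and here
}
```

The standard library provides std::lock_guard with the same behaviour; the hand-written class is shown only to make the constructor and destructor roles visible.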


Performance comparison of constructor/destructor behaviour with a mutex, in the case of:

• A non-inherited object

• An inherited object


Lock class implementation

class Lock
{
public:
    Lock(pthread_mutex_t& key) : theKey(key) { pthread_mutex_lock(&theKey); }
    ~Lock() { pthread_mutex_unlock(&theKey); }

private:
    pthread_mutex_t &theKey;
};


BaseLock class implementation

class BaseLock
{
public:
    BaseLock(pthread_mutex_t &key, const LogSource &lsrc) {}
    virtual ~BaseLock() {}
};

This class is intended as a root class for the various lock classes that are expected to be derived from it.


Subclass of BaseLock: MutexLock class implementation

class MutexLock : public BaseLock
{
public:
    MutexLock(pthread_mutex_t &key, const LogSource &lsrc);   // constructor
    ~MutexLock();                                             // destructor

private:
    pthread_mutex_t &theKey;
    const LogSource &src;
};

The LogSource object is meant to capture the filename and source code line where the object was constructed.


MutexLock constructor

MutexLock::MutexLock(pthread_mutex_t& aKey, const LogSource& source)
    : BaseLock(aKey, source), theKey(aKey), src(source)
{
    pthread_mutex_lock(&theKey);

#if defined(DEBUG)
    cout << "MutexLock " << &aKey << " created at " << src.file() << " line " << src.line() << endl;
#endif
}


MutexLock Destructor

MutexLock::~MutexLock()
{
    pthread_mutex_unlock(&theKey);

#if defined(DEBUG)
    cout << "MutexLock " << &theKey << " destroyed at " << src.file() << " line " << src.line() << endl;
#endif
}


Non-inherited Mutex Object

int main()
{
    . . .
    // Start timing here
    for (i = 0; i < 1000000; ++i) {
        SimpleMutex m(mutex);   // constructor locks, destructor unlocks
        sharedCounter++;
    }
    // Stop timing here
    . . .
}

SimpleMutex: an object of a stand-alone class that provides acquire() and release() methods.


Inherited Mutex Object

int main()
{
    . . .
    // Start timing here
    for (i = 0; i < 1000000; i++) {
        DerivedMutex m(mutex);   // constructor locks, destructor unlocks
        sharedCounter++;
    }
    // Stop timing here
    . . .
}

Here SimpleMutex is replaced by DerivedMutex, an object of a class derived from BaseMutex.


Figure: Execution time comparison (seconds). Non-inherited case: 1.01; inherited case: 1.62. The difference is the cost of inheritance.


Concluding Remarks

• Distinguish between overall computational cost, required cost, and computational penalty.

• Eliminate the part that is not required by using some other mechanism.

• Overall cost increases with the size of the derivation tree.


VIRTUAL FUNCTIONS


Impact on performance

• A class with virtual functions gets a virtual function table (vtbl), and each object of that class is assigned a pointer to it, the vptr.

Virtual functions seem to inflict a performance cost in several ways:

• The vptr must be initialized in the constructor.

• VFs are called using pointer indirection, resulting in a few extra instructions per method invocation.

• Inlining is a compile-time decision. The compiler cannot inline VFs whose resolution takes place at run time.


Performance Comparison for virtual and Non-virtual methods

class Virtual
{
private:
    int mv;

public:
    Virtual() { mv = 0; }
    virtual ~Virtual() {}
    virtual int foo() const { return (mv); }
};

• Creating virtual objects costs more than creating non-virtual objects, because the pointer to the virtual function table (vptr) must be initialized.

• It takes slightly longer to call virtual functions, because of the additional level of indirection.


Performance Comparison for virtual and Non-virtual methods (cont.)

class NonVirtual
{
private:
    int mnv;

public:
    NonVirtual() { mnv = 0; }
    ~NonVirtual() {}
    int foo() const { return (mnv); }
};


• Construction/destruction shows the performance penalty of initializing the vptr.

• Invoking a virtual function is slightly more expensive than invoking a non-virtual one, much like calling through a function pointer; the vptr also adds memory overhead to each object.

Figure: Execution time comparison. ctor/dtor: virtual 1, non virtual 0.73; foo: virtual 1, non virtual 0.96.


If a specific virtual function creates a performance problem for you, what are your options?

To eliminate a virtual call, you must allow the compiler to resolve the function binding at compile time.

The design options, illustrated with a thread-safe string class, are:

– hard-coding (derive a distinct class from string for each synchronization mechanism, e.g. CriticalSectionString)

– inheritance (derive a single ThreadSafeString class that contains a pointer to a Locker object, and use polymorphism to select the particular synchronization mechanism at run time)

– templates (create a template-based string class parameterized by the Locker type)


Hard-Coding (synchronization mechanism example)

• The standard string class serves as a base class

class CriticalSectionString : public string
{
public:
    . . .
    int length();

private:
    CriticalSectionLock cs;
};

int CriticalSectionString::length()
{
    cs.lock();
    int len = string::length();
    cs.unlock();
    return len;
}

+ Although lock() and unlock() are VFs, they can be resolved statically by the compiler, which can bypass dynamic binding and choose the correct lock() and unlock() to use.

+ This allows the compiler to inline those calls.

- You need to write a separate string class for each synchronization flavour -> poor code reuse!


Inheritance

• Implementing a string class for each synchronization mechanism is a pain so you can factor out the synchronization choice into a constructor argument.

class ThreadSafeString : public string
{
public:
    ThreadSafeString(const char *s, Locker *lockPtr) : string(s), pLock(lockPtr) { }
    . . .
    int length();

private:
    Locker *pLock;   // pointer to the Locker object
};

// The length() method is now implemented as follows:
int ThreadSafeString::length()
{
    pLock->lock();
    int len = string::length();
    pLock->unlock();
    return len;
}

+ more compact than the previous one

- the lock( ) and unlock( ) virtual calls can only be resolved at execution time and hence cannot be inlined


Templates

• Templates combine the best of both worlds: reuse and efficiency

template <class LOCKER>
class ThreadSafeString : public string
{
public:
    ThreadSafeString(const char *s) : string(s) {}
    . . .
    int length();

private:
    LOCKER lock;
};

// The length() method implementation is similar to the previous ones:
template <class LOCKER>
inline int ThreadSafeString<LOCKER>::length()
{
    lock.lock();
    int len = string::length();
    lock.unlock();
    return len;
}

+ Provides relief from the virtual function calls to lock() and unlock().

+ Enables the compiler to resolve the lock() and unlock() calls statically and inline them.

+ Pushes the type resolution to compile time.
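A complete, compilable variant of the template approach might look like the sketch below. The two locker types, CountingLocker and NullLocker, are hypothetical stand-ins: a real locker would wrap a mutex or a critical section. The lock member is made public here only so that the example can inspect it.

```cpp
#include <string>

// Hypothetical locker that only counts calls; a real one would wrap a mutex.
struct CountingLocker {
    int locks = 0;
    int unlocks = 0;
    void lock()   { ++locks; }
    void unlock() { ++unlocks; }
};

// Hypothetical locker for single-threaded use: both calls are empty.
struct NullLocker {
    void lock()   {}
    void unlock() {}
};

template <class LOCKER>
class ThreadSafeString : public std::string {
public:
    explicit ThreadSafeString(const char* s) : std::string(s) {}
    int length();
    LOCKER lock;   // public only for the sake of the example
};

template <class LOCKER>
inline int ThreadSafeString<LOCKER>::length()
{
    lock.lock();                                        // non-virtual, bound at compile time
    int len = static_cast<int>(std::string::length());
    lock.unlock();
    return len;
}
```

Because LOCKER is known at compile time, ThreadSafeString<CountingLocker> and ThreadSafeString<NullLocker> are separate classes whose lock() and unlock() calls are ordinary non-virtual calls that the compiler is free to inline.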


Coding Optimizations


Caching

• Caching means remembering the results of frequent and costly computations

• So you will not have to perform those computations over and over again

• For instance, re-evaluating a loop-invariant expression inside a loop is inefficient:

for (...; !done; ...) {
    done = patternMatch(pat1, pat2, isCaseSensitive());
}

// patternMatch compares two string patterns; its third argument is itself a function call.
// isCaseSensitive() is independent of the loop iterations.

int isSensitive = isCaseSensitive();

for (...; !done; ...) {
    done = patternMatch(pat1, pat2, isSensitive);
}

Now you compute case sensitivity once, cache it in a local variable, and reuse it!


Useless Computations

• Pointless computations whose results are never used!

• For instance: wasted initialization of a member object

class Student
{
public:
    Student(char *nm);
    . . .
private:
    string name;
};

// The Student constructor turns the input character pointer into a
// string object representing the student's name:

Student::Student(char *nm)
{
    name = nm;   // assignment in the constructor body
    . . .
}

Before the constructor body runs, the compiler-generated code has already called the string default constructor for name; the assignment then wipes away that result. We can eliminate this pointless computation by using an explicit string constructor:

Student::Student(char *nm) : name(nm)   // explicit string constructor
{
    ....
}
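The wasted work can be made visible by counting operations. In the sketch below, TracedName is a hypothetical stand-in for string that counts how often each operation runs, and StudentAssign and StudentInit are hypothetical names for the two constructor styles.

```cpp
#include <string>

// Hypothetical stand-in for string that counts which operations run.
struct TracedName {
    static int defaultCtors;
    static int convertingCtors;
    static int assignments;

    TracedName() { ++defaultCtors; }
    TracedName(const char* s) : value(s) { ++convertingCtors; }
    TracedName& operator=(const char* s) { value = s; ++assignments; return *this; }

    std::string value;
};
int TracedName::defaultCtors = 0;
int TracedName::convertingCtors = 0;
int TracedName::assignments = 0;

// Assignment in the body: name is default-constructed first, then overwritten.
struct StudentAssign {
    explicit StudentAssign(const char* nm) { name = nm; }
    TracedName name;
};

// Initializer list: name is constructed once, directly from nm.
struct StudentInit {
    explicit StudentInit(const char* nm) : name(nm) {}
    TracedName name;
};
```

Constructing a StudentAssign costs one default construction plus one assignment, while constructing a StudentInit costs a single converting construction.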


Lazy Evaluation

• You should not perform costly computations “just in case.”

• We ought to delay object definition to the scope where the object is used.

• For instance, consider code that routed messages between downstream and upstream communication adapters. One of the objects we used was very expensive to construct and destroy:

int route(Message *msg)
{
    ExpensiveClass upstream(msg);
    if (goingUpstream) {
        . . .   // do something with the expensive object
    }
    // upstream object not used here
    return SUCCESS;
}

• The expensive upstream object is used only 50% of the time.

• A better solution is to define the expensive upstream object in the scope where it is actually necessary:

int route(Message *msg)
{
    if (goingUpstream) {
        ExpensiveClass upstream(msg);
        // do something with the expensive object
    }
    // upstream object not used here
    return SUCCESS;
}


80-20 Rule: Speed up the common path

• 80% of the execution scenarios will traverse only 20% of your source code, and 80% of the elapsed time will be spent in 20% of the functions encountered on the execution path.

• For instance, consider the evaluation order of sub-expressions:

if (e1 || e2)
{ ... }

• If e1 and e2 are equally likely to evaluate to TRUE, the sub-expression with the smaller computational cost should be placed first!

• If e1 and e2 are of equal computational cost, the one most likely to be TRUE should be placed first!

• p1 = conditional probability of e1 being TRUE

• c1 = computational cost of e1 (p2 and c2 are defined likewise for e2)

Cost = c1 + (1 - p1) * c2

• If e2 evaluates to TRUE 100% of the time, p2 = 1.0

• If e1 evaluates to TRUE 90% of the time, p1 = 0.9

• c1 = 10 instructions; c2 = 100 instructions

Cost = 10 + 0.1 * 100 = 20

• If we flip e1 and e2, i.e. if (e2 || e1):

Cost = c2 + (1 - p2) * c1

Cost = 100 + 0 * 10 = 100

• So,

if (e1 || e2) is a better choice than if (e2 || e1)
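The cost formula above is easy to evaluate for other probabilities and costs. The helper below is a minimal sketch; the function name expected_or_cost and its parameter names are hypothetical.

```cpp
// Expected cost of `first || second` under short-circuit evaluation:
// `second` is evaluated only when `first` turns out to be false.
double expected_or_cost(double pFirst, double cFirst, double cSecond)
{
    return cFirst + (1.0 - pFirst) * cSecond;
}
```

With the numbers from the example, expected_or_cost(0.9, 10.0, 100.0) evaluates to about 20 instructions for if (e1 || e2), and expected_or_cost(1.0, 100.0, 10.0) to 100 for if (e2 || e1).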


Concluding words about coding optimizations

• Are you ever going to use the result?

It sounds silly, but it happens. At times we perform computation and never use the results

• Do you need the results now?

Defer a computation to the point where it is actually needed. Premature computations may never be used on some execution flows.

• Do you know the result already?

We do costly computations even though their results were already available two lines above. If you already computed a result earlier in the execution flow, make it available for reuse.


Thank you for your attention.

Questions...?
