Transcript
  • Intel Threading Building Blocks (TBB) http://www.threadingbuildingblocks.org

  • Overview

    TBB enables you to specify tasks instead of threads

    TBB targets threading for performance

    TBB is compatible with other threading packages

    TBB emphasizes scalable, data-parallel programming

    TBB relies on generic programming

  • Example 1: Compute average of 3 numbers (serial version)

    void SerialAverage( float* output, float* input, size_t n ) {
        for( size_t i=0; i<n; ++i )
            output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f);
    }

  • #include "tbb/parallel_for.h"#include "tbb/blocked_range.h"#include "tbb/task_scheduler_init.h"

    using namespace tbb;

    class Average {public: float* input; float* output; void operator()( const blocked_range& range ) const { for( int i=range.begin(); i!=range.end(); ++i ) output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f); }};

    // Note: The input must be padded such that input[-1] and input[n] // can be used to calculate the first and last output values.void ParallelAverage( float* output, float* input, size_t n ) { Average avg; avg.input = input; avg.output = output; parallel_for( blocked_range( 0, n, 1000 ), avg );}

  • //! Problem size
    const int N = 100000;

    int main( int argc, char* argv[] ) {
        float output[N];
        float raw_input[N+2];
        raw_input[0] = 0;
        raw_input[N+1] = 0;
        float* padded_input = raw_input+1;

        task_scheduler_init init;

        ............
        ............
        ParallelAverage(output, padded_input, N);
    }

  • Serial vs. Parallel

  • Notes on Grain Size

  • Effect of Grain Size on A[i]=B[i]*c Computation (one million indices)

  • Auto Partitioner
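    A minimal sketch of how a partitioner can be supplied, reusing the Average body from Example 1 (the function name ParallelAverageAuto is illustrative): with auto_partitioner, the library chooses the subrange sizes instead of relying on an explicit grain size.

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"
    #include "tbb/partitioner.h"
    using namespace tbb;

    // Same computation as ParallelAverage, but the grain size is omitted
    // and auto_partitioner picks the subrange sizes.
    void ParallelAverageAuto( float* output, float* input, size_t n ) {
        Average avg;                 // body class defined in the earlier example
        avg.input  = input;
        avg.output = output;
        parallel_for( blocked_range<int>( 0, int(n) ), avg, auto_partitioner() );
    }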

  • Reduction: Serial and Parallel

  • Class for use by Parallel Reduce
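    A minimal sketch of such a body class, assuming a simple array-sum reduction (the names SumFoo and ParallelSumFoo are illustrative): the splitting constructor creates a fresh accumulator for a stolen subrange, and join merges the partial sums.

    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"
    using namespace tbb;

    class SumFoo {
        const float* my_a;
    public:
        float sum;
        SumFoo( const float a[] ) : my_a(a), sum(0) {}
        SumFoo( SumFoo& other, split ) : my_a(other.my_a), sum(0) {}
        // Accumulate over a subrange.
        void operator()( const blocked_range<size_t>& r ) {
            for( size_t i=r.begin(); i!=r.end(); ++i )
                sum += my_a[i];
        }
        // Merge the partial sum from a joined subrange.
        void join( const SumFoo& rhs ) { sum += rhs.sum; }
    };

    float ParallelSumFoo( const float a[], size_t n ) {
        SumFoo sf(a);
        parallel_reduce( blocked_range<size_t>( 0, n, 1000 ), sf );
        return sf.sum;
    }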

  • Split-Join Sequence

  • Parallel_scan (parallel prefix)

  • Parallel Scan

  • Parallel Scan with Partitioner

    parallel_scan breaks a range into subranges and computes a partial result in each subrange in parallel.

    Then, the partial result for subrange k is used to update the information in subrange k+1, starting from k=0 and proceeding sequentially up to the last subrange.

    Finally, each subrange uses its updated information to compute its final result in parallel with all the other subranges.

  • Parallel Scan Requirements

  • Parallel Scan
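    A minimal sketch of a parallel_scan body, assuming a running-sum (prefix-sum) computation from an array x into y (the names SumBody and RunningSum are illustrative): the pre-scan pass only accumulates, the final pass also writes results, reverse_join combines partial sums, and assign copies the final state back.

    #include "tbb/parallel_scan.h"
    #include "tbb/blocked_range.h"
    using namespace tbb;

    class SumBody {
        float sum;
        float* const y;
        const float* const x;
    public:
        SumBody( float y_[], const float x_[] ) : sum(0), y(y_), x(x_) {}
        SumBody( SumBody& b, split ) : sum(0), y(b.y), x(b.x) {}
        template<typename Tag>
        void operator()( const blocked_range<int>& r, Tag ) {
            float temp = sum;
            for( int i=r.begin(); i<r.end(); ++i ) {
                temp += x[i];
                if( Tag::is_final_scan() ) y[i] = temp;   // write only on the final pass
            }
            sum = temp;
        }
        void reverse_join( SumBody& a ) { sum = a.sum + sum; }
        void assign( SumBody& b ) { sum = b.sum; }
        float get_sum() const { return sum; }
    };

    float RunningSum( float y[], const float x[], int n ) {
        SumBody body(y, x);
        parallel_scan( blocked_range<int>( 0, n ), body );
        return body.get_sum();
    }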

  • Parallel While and Pipeline

  • Linked List Example

    Assuming Foo takes at least a few thousand instructions to run, it is possible to get speedup by parallelizing.

  • parallel_while

    Requires two user-defined objects (both are sketched below):

    1. Object that defines the stream of items
       - must have pop_if_present
       - pop_if_present need not be thread safe
       - nonscalable, since fetching is serialized
       - it is still possible to get useful speedup

    2. Object that defines the loop body, i.e., operator()

  • Stream of Objects

  • Loop Body (Operator)

  • Parallelized Foo Acting on a Linked List

    Note: the body of parallel_while can add more work by calling w.add(item)
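    A minimal sketch of the two objects, assuming a singly linked list of Item nodes and an expensive per-item function Foo (Item, ItemStream, and ApplyFoo are illustrative names). If the body needed to generate extra work, it could hold a reference to the parallel_while object and call its add method.

    #include "tbb/parallel_while.h"

    struct Item {
        Item* next;
        float value;
    };
    void Foo( float& value );   // assumed to be defined elsewhere

    // Stream object: serves items one at a time. pop_if_present need not be
    // thread safe; parallel_while serializes calls to it.
    class ItemStream {
        Item* my_head;
    public:
        ItemStream( Item* head ) : my_head(head) {}
        bool pop_if_present( Item*& item ) {
            if( my_head ) {
                item = my_head;
                my_head = my_head->next;
                return true;
            }
            return false;
        }
    };

    // Loop body: applied to items in parallel.
    class ApplyFoo {
    public:
        void operator()( Item* item ) const { Foo(item->value); }
        typedef Item* argument_type;
    };

    void ParallelApplyFooToList( Item* root ) {
        ItemStream stream(root);
        tbb::parallel_while<ApplyFoo> w;
        w.run( stream, ApplyFoo() );
    }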

  • Notes on parallel_while scaling

  • Pipelining: Single pipeline

  • Pipelining: Parallel pipeline

  • Pipelining Example

  • Character Buffer

  • Top-Level Code for Building and Running the Pipeline
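    A minimal sketch of building and running a pipeline, using a toy three-stage example (generate integers serially, square them in parallel, print them serially); the filter names are illustrative, not from the original slides.

    #include "tbb/pipeline.h"
    #include "tbb/task_scheduler_init.h"
    #include <cstdio>

    class Generate : public tbb::filter {
        int next, limit;
    public:
        Generate( int n ) : tbb::filter(/*is_serial=*/true), next(0), limit(n) {}
        void* operator()( void* ) {
            if( next >= limit ) return NULL;     // returning NULL ends the pipeline
            return new int(next++);
        }
    };

    class Square : public tbb::filter {
    public:
        Square() : tbb::filter(/*is_serial=*/false) {}   // parallel stage
        void* operator()( void* item ) {
            int* p = static_cast<int*>(item);
            *p = (*p) * (*p);
            return p;
        }
    };

    class Print : public tbb::filter {
    public:
        Print() : tbb::filter(/*is_serial=*/true) {}
        void* operator()( void* item ) {
            int* p = static_cast<int*>(item);
            std::printf( "%d\n", *p );
            delete p;
            return NULL;
        }
    };

    int main() {
        tbb::task_scheduler_init init;
        Generate g(100); Square s; Print p;

        tbb::pipeline pipeline;
        pipeline.add_filter(g);
        pipeline.add_filter(s);
        pipeline.add_filter(p);
        pipeline.run( /*max_number_of_live_tokens=*/8 );
        pipeline.clear();
        return 0;
    }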

  • Non-Linear Pipes

  • Trick solution: Use a Topologically Sorted Pipeline

    Note that the latency is increased

  • parallel_sort
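    A minimal usage sketch (the array names are illustrative): parallel_sort works on iterator ranges and optionally takes a comparator.

    #include "tbb/parallel_sort.h"
    #include <functional>

    void SortExamples( float a[], float b[], size_t n ) {
        tbb::parallel_sort( a, a + n );                          // ascending order
        tbb::parallel_sort( b, b + n, std::greater<float>() );   // descending order
    }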

  • CONTAINERS

  • concurrent_queue

    The template class concurrent_queue<T> implements a concurrent queue with values of type T.

    Multiple threads may simultaneously push and pop elements from the queue.

    In a single-threaded program, a queue is a first-in first-out structure.

    But if multiple threads are pushing and popping concurrently, the definition of first is uncertain.

    The only guarantee of ordering offered by concurrent_queue is that if a thread pushes multiple values, and another thread pops those same values, they will be popped in the same order that they were pushed.

  • Pushing is provided by the push method. There are blocking and nonblocking flavors of pop:

    pop_if_present : This method is nonblocking: it attempts to pop a value, and if it cannot because the queue is empty, it returns anyway.

    pop : This method blocks until it pops a value. If a thread must wait for an item to become available and it has nothing else to do, it should use pop(item) and not while(!pop_if_present(item)) continue; because pop uses processor resources more efficiently than the loop.

  • concurrent_queue

    Unlike STL, TBB containers are not templated with an allocator argument. The library retains control over memory allocation.

    Unlike most STL containers, concurrent_queue::size_type is a signed integral type, not unsigned. This is because concurrent_queue::size( ) is defined as the number of push operations started minus the number of pop operations started.

    By default, a concurrent_queue is unbounded. It may hold any number of values until memory runs out. It can be bounded by setting the queue capacity with the set_capacity method. Setting the capacity causes push to block until there is room in the queue.

  • concurrent_queue Example
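    A minimal sketch, assuming one producer and any number of consumers draining the queue with the nonblocking pop described above (newer TBB releases call this method try_pop):

    #include "tbb/concurrent_queue.h"
    #include <cstdio>

    tbb::concurrent_queue<int> queue;

    void Produce( int n ) {
        for( int i=0; i<n; ++i )
            queue.push(i);                     // safe to call from several threads
    }

    void ConsumeAvailable() {
        int value;
        while( queue.pop_if_present(value) )   // nonblocking: stops when the queue is empty
            std::printf( "popped %d\n", value );
    }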

  • concurrent_vector

    A concurrent_vector<T> is a dynamically growable array of items of type T for which it is safe to simultaneously access elements in the vector while growing it.

    However, be careful not to let another task access an element that is under construction or is otherwise being modified.

    A concurrent_vector never moves an element until the array is cleared, which can be an advantage over the STL std::vector (which can move elements to resize the vector), even for single-threaded code.

  • concurrent_vector

  • concurrent_vector

  • concurrent_vector Example
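    A minimal sketch, assuming many tasks append results concurrently (the names are illustrative). Because elements never move, an index or pointer obtained after push_back stays valid until the vector is cleared.

    #include "tbb/concurrent_vector.h"

    tbb::concurrent_vector<float> results;

    void RecordResult( float value ) {
        results.push_back(value);          // safe to call concurrently from many tasks
    }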

  • concurrent_hash_map

  • concurrent_hash_map

    A concurrent_hash_map acts as a container of elements of type std::pair<const Key,T>.

    Typically, when accessing a container element, you are interested in either updating it or reading it.

    The template class concurrent_hash_map supports these two operations with the accessor and const_accessor classes.

    An accessor represents update (write) access. As long as it points to an element, all other attempts to look up that key in the table block until the accessor is done.

    const_accessor is similar, except that it represents read-only access. Therefore, multiple const_accessors can point to the same element at the same time.

  • concurrent_hash_map

    The find and insert methods take an accessor or const_accessor as an argument.

    The choice tells concurrent_hash_map whether you are asking for update or read-only access, respectively.

    Once the method returns, the access lasts until the accessor or const_accessor is destroyed.

    Because having access to an element can block other threads, try to shorten the lifetime of the accessor or const_accessor. To do so, declare it in the innermost block possible.

    To release access even sooner than the end of the block, use the release method.

    The method remove(key) can also operate concurrently. It implicitly requests write access. Therefore, before removing the key, it waits on any other accesses on the key.

  • Use of release method
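    A minimal sketch of accessor use and early release, assuming a word-count table keyed by std::string (MyHashCompare, StringTable, and the functions are illustrative names):

    #include "tbb/concurrent_hash_map.h"
    #include <string>

    struct MyHashCompare {
        static size_t hash( const std::string& s ) {
            size_t h = 0;
            for( const char* c = s.c_str(); *c; ++c )
                h = h*31 + size_t(*c);
            return h;
        }
        static bool equal( const std::string& a, const std::string& b ) {
            return a == b;
        }
    };

    typedef tbb::concurrent_hash_map<std::string,int,MyHashCompare> StringTable;
    StringTable table;

    void CountOccurrence( const std::string& word ) {
        StringTable::accessor a;          // write access
        table.insert( a, word );          // inserts {word, 0} if the key is absent
        a->second += 1;
        a.release();                      // give up the lock before doing unrelated work
        // ... work that does not touch the element ...
    }

    int Lookup( const std::string& word ) {
        StringTable::const_accessor a;    // read-only access; many readers may share it
        if( table.find( a, word ) )
            return a->second;
        return 0;
    }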

  • MUTUAL EXCLUSION in TBB

    In TBB, you program in terms of tasks, not threads; therefore, you will probably think of mutual exclusion of tasks.

    TBB offers two kinds of mutual exclusion:

    - Mutexes: These will be familiar to anyone who has used locks in other environments, and they include common variants such as reader-writer locks.

    - Atomic operations: These are based on atomic operations offered by hardware processors, and they provide a solution that is simpler and faster than mutexes in a limited set of situations.

  • MUTEXES

    In TBB, mutual exclusion is implemented by classes known as mutexes and locks.

    A mutex is an object on which a task can acquire a lock. Only one task at a time can have a lock on a mutex; other tasks have to wait their turn.

  • MUTEX EXAMPLE

    With the object-oriented interface, destruction of the scoped_lock object causes the lock to be released, no matter whether the protected region was exited by normal control flow or an exception.

  • MUTEX EXAMPLE
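    A minimal sketch of the scoped_lock pattern, assuming a shared counter protected by a spin_mutex (names are illustrative):

    #include "tbb/spin_mutex.h"

    long counter = 0;
    tbb::spin_mutex counterMutex;

    void IncrementCounter() {
        tbb::spin_mutex::scoped_lock lock(counterMutex);   // acquire the lock
        ++counter;
    }   // lock destroyed here, so the mutex is released even on an exception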

  • FLAVORS of MUTEXES

    The simplest mutex is the spin_mutex. A task trying to acquire a lock on a busy spin_mutex waits until it can acquire the lock. A spin_mutex is appropriate when the lock is held for only a few instructions.

  • SCALABLE MUTEX

    Some mutexes are called scalable. In a strict sense, this is not an accurate name because a mutex limits execution to one task at a time and is therefore necessarily a drag on scalability.

    A scalable mutex is rather one that does no worse than forcing single- threaded performance.

    A mutex actually can do worse than serialize execution if the waiting tasks consume excessive processor cycles and memory bandwidth, reducing the speed of tasks trying to do real work.

    Scalable mutexes are often slower than nonscalable mutexes under light contention, so a nonscalable mutex may be better. When in doubt, use a scalable mutex.

  • FAIR MUTEX

    Mutexes can be fair or unfair.

    A fair mutex lets tasks through in the order they arrive.

    Fair mutexes avoid starving tasks. Each task gets its turn.

    However, unfair mutexes can be faster because they let tasks that are running go through first, instead of the task that is next in line, which may be sleeping because of an interrupt.

  • REENTRANT MUTEX

    Mutexes can be reentrant or nonreentrant.

    A reentrant mutex allows a task that is already holding a lock on the mutex to acquire another lock on it.

    This is useful in some recursive algorithms, but it typically adds overhead to the lock implementation.

  • SLEEP or SPIN MUTEX

    Mutexes can cause a task to spin in user space or sleep while it is waiting.

    For short waits, spinning in user space is fastest because putting a task to sleep takes cycles.

    For long waits, sleeping is better because it causes the task to give up its processor to some task that needs it. Spinning is also undesirable in processors with multiple-task support in a single core, such as Intel processors with hyperthreading technology.

  • MUTEX TYPES

    A spin_mutex is nonscalable, unfair, nonreentrant, and spins in user space. It would seem to be the worst of all possible worlds, except that it is very fast in lightly contended situations. If you can design your program so that contention is somehow spread out among many spin mutexes, you can improve performance over other kinds of mutexes. If a mutex is heavily contended, your algorithm will not scale anyway. Consider redesigning the algorithm instead of looking for a more efficient lock.

    A queuing_mutex is scalable, fair, nonreentrant, and spins in user space. Use it when scalability and fairness are important.

    A spin_rw_mutex and a queuing_rw_mutex are similar to spin_mutex and queuing_mutex, but they additionally support reader locks.

    A mutex is a wrapper around the system's native mutual exclusion mechanism. On Windows systems, it is implemented on top of a CRITICAL_SECTION. On Linux systems, it is implemented on top of a pthread mutex.

  • Reader/Writer, Upgrade/Downgrade

    Requests for a reader lock are distinguished from requests for a writer lock via an extra Boolean parameter in the constructor for scoped_lock. The parameter is false to request a reader lock and true to request a writer lock. It defaults to true when it is omitted.

    It is also possible to upgrade a reader lock to a writer lock by using the method upgrade_to_writer.

  • Reader/Writer Mutex Example

    Note: upgrade_to_writer returns true if the upgrade happened without releasing and re-acquiring the lock, and false if the lock had to be released and re-acquired.
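    A minimal sketch, assuming a shared std::vector searched under a reader lock and extended only after upgrading to a writer lock (the container and function names are illustrative):

    #include "tbb/spin_rw_mutex.h"
    #include <vector>

    std::vector<int> data;
    tbb::spin_rw_mutex dataMutex;

    void AddIfAbsent( int value ) {
        // false requests a reader lock
        tbb::spin_rw_mutex::scoped_lock lock( dataMutex, /*is_writer=*/false );
        for( size_t i=0; i<data.size(); ++i )
            if( data[i] == value ) return;
        if( !lock.upgrade_to_writer() ) {
            // The lock was released and re-acquired during the upgrade, so another
            // task may have inserted the value in the meantime; check again.
            for( size_t i=0; i<data.size(); ++i )
                if( data[i] == value ) return;
        }
        data.push_back(value);
    }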

  • ATOMIC OPERATIONS

    Atomic operations are a fast and relatively easy alternative to mutexes.

    They do not suffer from the deadlock and convoying problems.

    The main limitation of atomic operations is that they are limited in current computer systems to fairly small data sizes: the largest is usually the size of the largest scalar, often a double-precision floating-point number.

    Atomic operations are also limited to a small set of operations supported by the underlying hardware processor.

    The template class atomic<T> implements atomic operations with C++ style.

  • Fundamental Operations on an atomic variable
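    A minimal sketch of the fundamental operations on tbb::atomic (the counter variable is illustrative):

    #include "tbb/atomic.h"

    tbb::atomic<int> counter;   // zero-initialized when declared at namespace scope

    void AtomicExamples() {
        counter.fetch_and_add(5);          // counter += 5, returns the old value
        counter.fetch_and_increment();     // ++counter, returns the old value
        counter.fetch_and_decrement();     // --counter, returns the old value
        counter.fetch_and_store(42);       // counter = 42, returns the old value
        counter.compare_and_swap(7, 42);   // if counter==42, set it to 7; returns the old value
        int snapshot = counter;            // atomic read
        counter = snapshot + 1;            // atomic write (the whole read-modify-write is not atomic)
    }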

  • SCALABLE MEMORY ALLOCATION

    The scalable memory allocator is cleanly separate from the rest of TBB so that your choice of memory allocator for concurrent usage is independent of your choice of parallel algorithm and container templates.

    When ordinary, nonthreaded allocators are used, memory allocation becomes a serious bottleneck in a multithreaded program because each thread competes for a global lock for each allocation and deallocation of memory from a single global heap.

    The TBB scalable allocator is built for scalability and speed. In some situations, this comes at a cost of wasted virtual space. Specifically, it wastes a lot of space when allocating blocks in the 9K to 12K range. It is also not yet terribly sophisticated about paging issues.

  • FALSE SHARING

    False sharing occurs when multiple threads use memory locations that are close together, even if they are not actually using the same memory locations.

    Because processor cores fetch and hold memory in chunks called cache lines, any memory accesses within the same cache line should be done only by the same thread.

    Otherwise, accesses to memory on the same cache line will cause unnecessary contention and swapping of cache lines back and forth, resulting in slowdowns which can easily be a hundred times worse for the affected memory accesses.

    Example: float A[1000]; float B[1000]; (see the sketch below)
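    A short illustration of the slide's example, assuming thread 1 writes only A and thread 2 writes only B:

    float A[1000];   // written only by thread 1
    float B[1000];   // written only by thread 2
    // A[999] and B[0] usually fall on the same cache line, so the two threads
    // contend on that line even though they never touch the same element.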

  • Memory Allocators

    The TBB scalable memory allocator uses a per-thread memory management algorithm to minimize the contention associated with allocation from a single global heap.

    TBB offers two choices, both similar to the STL template class std::allocator:

    - scalable_allocator: This template offers just scalability, but it does not completely protect against false sharing. Memory is returned to each thread from a separate pool, which helps protect against false sharing if the memory is not shared with other threads.

    - cache_aligned_allocator: This template offers both scalability and protection against false sharing. It addresses false sharing by making sure each allocation is done on a cache line.

    Note that protection against false sharing between two objects is guaranteed only if both are allocated with cache_aligned_allocator.

  • Memory Allocators

    The functionality of cache_aligned_allocator comes at some cost in space because it allocates in multiples of cache-line-size memory chunks, even for a small object.

    The padding is typically 128 bytes. Hence, allocating many small objects with cache_aligned_allocator may increase memory usage.

  • Library to Link

    Both the debug and release versions of Threading Building Blocks are divided into two dynamic shared libraries, one with general support and the other with the scalable memory allocator.

    The latter is distinguished by malloc in its name (although it does not define a routine actually called malloc).

    For example, the release versions for Windows are tbb.dll and tbbmalloc.dll, respectively.

  • Using the Allocator Argument to C++ STL Template Classes

    The following code shows how to declare an STL vector that uses cache_aligned_allocator for allocation:

    std::vector< int, cache_aligned_allocator<int> > v;

  • MALLOC/FREE/REALLOC/CALLOC
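    A minimal sketch of the C-style entry points provided by the scalable allocator library (declared in tbb/scalable_allocator.h):

    #include "tbb/scalable_allocator.h"

    void ScalableMallocExample() {
        void* p = scalable_malloc( 1024 );               // like malloc
        p = scalable_realloc( p, 2048 );                 // like realloc
        scalable_free( p );                              // like free
        void* q = scalable_calloc( 100, sizeof(int) );   // like calloc
        scalable_free( q );
    }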