Compiled by: M.Rajasekhara Babu, M.Narayana Moorthy, Kaushik S., K.Manikandan
Faculty / School of Computing Science and Engineering / VIT University
http://sites.google.com/site/mrajasekharababu/mtech09/multi-corelab

VIT UNIVERSITY (Estd. u/s 3 of UGC Act 1956)
Vellore - 632 014, Tamil Nadu, India
School of Computing Sciences

Multi-Core Programming Lab (CSE512)

1. Syllabus
2. Guidelines for
   a. Observation
   b. Soft Record
3. Cycle sheets
4. Literature
   a. OpenMP
   b. Introduction to multi-core architectures
   c. Virtual & cache memory
   d. Fundamentals of parallel computers
   e. Parallel programming
Objective: To provide hands-on experience in parallel programming for multi-core architectures.

Expected Outcome: After completing this course, a student should be able to:
- Parallelize code for an application
- Understand the issues and recent trends in the area of parallel programming
Instructions for writing the observation: There is no separate record writing for this Multi-Core Programming Lab, so students should maintain this observation notebook as a record.
1. Every student should have a 200-page notebook.
2. Leave the first four pages empty for the index.
3. Maintain the index in the prescribed format.
4. Write the program in the given format on the right side of the notebook.
5. Results should be written on the left side of the notebook.
6. Start every new program on a fresh page.
7. Specify the page numbers in the prescribed format.
Students are asked to submit the soft record at the end of the course. Guidelines to prepare the Soft Record for the Multi-Core Programming Lab:
1. Front Page
2. Contents
Prepare the index list in the prescribed format given for the observation.
3. Programs
a. Prepare a separate file for every program, which includes the aim, requirements, program, and results.
b. Results should be placed as snapshots of your program outputs. Provide brief information on each result.
c. Rename each file as <Cycle Number>_<Program Sequential Number> (e.g., C1_5 represents Cycle 1, Program 5).
d. Page numbers of every file should continue from the previous file (e.g., if the file for program 1 ends at page 5, the next file should start at page 6).
e. Subheadings: <Times New Roman> <12> <bold> <uppercase>
f. Information under subheadings: <Times New Roman> <12>
4. Rename each file with its program number.
5. Burn all .doc and source code files onto a CD and submit it to the faculty member on or before 6th April 2008.
6. 10% of the marks will be awarded for this soft record, so any student who fails to submit, or who makes a poor submission, will lose credit.
statements within the associated structured block are executed by one or more of the
threads. The barrier implied at the end of a work-sharing construct without a
nowait clause is executed by all threads in the team.
If a thread modifies a shared object, it affects not only its own execution
environment, but also those of the other threads in the program. The modification is
guaranteed to be complete, from the point of view of one of the other threads, at the
next sequence point (as defined in the base language) only if the object is declared to
be volatile. Otherwise, the modification is guaranteed to be complete after first the
modifying thread, and then (or concurrently) the other threads, encounter a flush directive that specifies the object (either implicitly or explicitly). Note that when the
flush directives that are implied by other OpenMP directives are not sufficient to
ensure the desired ordering of side effects, it is the programmer's responsibility to
supply additional, explicit flush directives.
Upon completion of the parallel construct, the threads in the team synchronize at an
implicit barrier, and only the master thread continues execution. Any number of
parallel constructs can be specified in a single program. As a result, a program may
fork and join many times during execution.
The OpenMP C/C++ API allows programmers to use directives in functions called
from within parallel constructs. Directives that do not appear in the lexical extent of
a parallel construct but may lie in the dynamic extent are called orphaned directives.
Orphaned directives give programmers the ability to execute major portions of their
program in parallel with only minimal changes to the sequential program. With this
functionality, users can code parallel constructs at the top levels of the program call
tree and use directives to control execution in any of the called functions.
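As a hedged illustration (the function names here are invented, not from the specification), an orphaned for directive can sit in a routine called from a parallel region:

    #include <omp.h>

    /* The for directive below is orphaned: it is not in the lexical
     * extent of any parallel construct, but it binds to the parallel
     * region that is active when process() is called. */
    void process(float *a, int n)
    {
        int i;
        #pragma omp for
        for (i = 0; i < n; i++)
            a[i] += 1.0f;
    }

    void caller(float *a, int n)
    {
        #pragma omp parallel
        process(a, n);   /* the work sharing happens inside process() */
    }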
Unsynchronized calls to C and C++ output functions that write to the same file may
result in output in which data written by different threads appears in
nondeterministic order. Similarly, unsynchronized calls to input functions that read
from the same file may read data in nondeterministic order. Unsynchronized use of
I/O, such that each thread accesses a different file, produces the same results as
serial execution of the I/O functions.
1.4 Compliance

An implementation of the OpenMP C/C++ API is OpenMP-compliant if it recognizes
and preserves the semantics of all the elements of this specification, as laid out in
Chapters 1, 2, 3, 4, and Appendix C. Appendices A, B, D, E, and F are for information
purposes only and are not part of the specification. Implementations that include
only a subset of the API are not OpenMP-compliant.
The for directive places restrictions on the structure of the corresponding for loop. Specifically, the corresponding for loop must have canonical shape:

    for (init-expr; var logical-op b; incr-expr)

    init-expr       One of the following:
                    var = lb
                    integer-type var = lb

    incr-expr       One of the following:
                    ++var
                    var++
                    --var
                    var--
                    var += incr
                    var -= incr
                    var = var + incr
                    var = incr + var
                    var = var - incr

    var             A signed integer variable. If this variable would otherwise be
                    shared, it is implicitly made private for the duration of the
                    for. This variable must not be modified within the body of the
                    for statement. Unless the variable is specified lastprivate,
                    its value after the loop is indeterminate.

    logical-op      One of the following: <, <=, >, >=

    lb, b, incr     Loop-invariant integer expressions. There is no synchronization
                    during the evaluation of these expressions. Thus, any evaluated
                    side effects produce indeterminate results.

Note that the canonical form allows the number of loop iterations to be computed on entry to the loop. This computation is performed with values in the type of var, after integral promotions. In particular, if the value of b - lb + incr cannot be represented in that type, the result is indeterminate. Further, if logical-op is < or <=, then incr-expr must cause var to increase on each iteration of the loop. If logical-op is > or >=, then incr-expr must cause var to decrease on each iteration of the loop.

The schedule clause specifies how iterations of the for loop are divided among threads of the team. The correctness of a program must not depend on which thread executes a particular iteration. The value of chunk_size, if specified, must be a loop-invariant integer expression with a positive value. There is no synchronization during the evaluation of this expression. Thus, any evaluated side effects produce indeterminate results. The schedule kind can be one of the following: static, dynamic, guided, or runtime.
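As a minimal sketch (not part of the specification text), the loop below is in canonical shape, and the schedule clause hands out iterations in chunks of four:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int i, n = 16;
        /* init-expr: i = 0; logical-op: <; incr-expr: i++ */
        #pragma omp parallel for schedule(static, 4)
        for (i = 0; i < n; i++)
            printf("iteration %d done by thread %d\n", i, omp_get_thread_num());
        return 0;
    }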
There is an implicit barrier after the single construct unless a nowait clause is
specified.
Restrictions to the single directive are as follows:
■ Only a single nowait clause can appear on a single directive.
■ The copyprivate clause must not be used with the nowait clause.
Cross References:
■ private, firstprivate, and copyprivate clauses, see Section 2.7.2 on page 25.
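A minimal sketch (not from the specification) of a single directive with the copyprivate clause, broadcasting one thread's value to the rest of the team:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        float x;
        #pragma omp parallel private(x)
        {
            /* One thread produces the value; copyprivate then copies it
             * into the private x of every other thread in the team. */
            #pragma omp single copyprivate(x)
            x = 3.14f;   /* in a real program this might come from input */

            printf("thread %d sees x = %f\n", omp_get_thread_num(), x);
        }
        return 0;
    }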
2.5 Combined Parallel Work-sharing Constructs

Combined parallel work-sharing constructs are shortcuts for specifying a parallel
region that contains only one work-sharing construct. The semantics of these
directives are identical to that of explicitly specifying a parallel directive
followed by a single work-sharing construct.
The following sections describe the combined parallel work-sharing constructs:
■ the parallel for directive.
■ the parallel sections directive.
2.5.1 parallel for Construct

The parallel for directive is a shortcut for a parallel region that contains only a single for directive. The syntax of the parallel for directive is as follows:

    #pragma omp parallel for [clause[[,] clause] ...] new-line
        for-loop

This directive allows all the clauses of the parallel directive and the for directive, except the nowait clause, with identical meanings and restrictions. The semantics are identical to explicitly specifying a parallel directive immediately followed by a for directive.
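A minimal usage sketch (the function and variable names are assumptions, not from the specification):

    /* Scale every element of an array in parallel; this is equivalent to
     * a parallel region that immediately contains a single for directive. */
    void scale(float *a, int n, float s)
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            a[i] *= s;
    }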
■ binop is not an overloaded operator and is one of +, *, -, /, &, ^, |, <<, or >>.
Although it is implementation-defined whether an implementation replaces all
atomic directives with critical directives that have the same unique name, the
atomic directive permits better optimization. Often hardware instructions are
available that can perform the atomic update with the least overhead.
Only the load and store of the object designated by x are atomic; the evaluation of
expr is not atomic. To avoid race conditions, all updates of the location in parallel
should be protected with the atomic directive, except those that are known to be
free of race conditions.
Restrictions to the atomic directive are as follows:
■ All atomic references to the storage location x throughout the program are
required to have a compatible type.
Examples:

    extern float a[], *p = a, b;
    /* Protect against races among multiple updates. */
    #pragma omp atomic
    a[index[i]] += b;
    /* Protect against races with updates through a. */
    #pragma omp atomic
    p[i] -= 1.0f;

    extern union {int n; float x;} u;
    /* ERROR - References through incompatible types. */
    #pragma omp atomic
    u.n++;
    #pragma omp atomic
    u.x -= 1.0f;

2.6.5 flush Directive

The flush directive, whether explicit or implied, specifies a "cross-thread" sequence point at which the implementation is required to ensure that all threads in a team have a consistent view of certain objects (specified below) in memory. This means that previous evaluations of expressions that reference those objects are complete and subsequent evaluations have not yet begun. For example, compilers must restore the values of the objects from registers to memory, and hardware may need to flush write buffers to memory and reload the values of the objects from memory.
Note that because the flush directive does not have a C language statement as part of its syntax, there are some restrictions on its placement within a program. See Appendix C for the formal grammar. The example below illustrates these restrictions.

    /* ERROR - The flush directive cannot be the immediate
     * substatement of an if statement. */
    if (x!=0)
        #pragma omp flush (x)
    ...

    /* OK - The flush directive is enclosed in a
     * compound statement */
    if (x!=0) {
        #pragma omp flush (x)
    }

Restrictions to the flush directive are as follows:
■ A variable specified in a flush directive must not have a reference type.

2.6.6 ordered Construct

The structured block following an ordered directive is executed in the order in which iterations would be executed in a sequential loop. The syntax of the ordered directive is as follows:

    #pragma omp ordered new-line
        structured-block

An ordered directive must be within the dynamic extent of a for or parallel for construct. The for or parallel for directive to which the ordered construct binds must have an ordered clause specified as described in Section 2.4.1 on page 11. In the execution of a for or parallel for construct with an ordered clause, ordered constructs are executed strictly in the order in which they would be executed in a sequential execution of the loop.

Restrictions to the ordered directive are as follows:
■ An iteration of a loop with a for construct must not execute the same ordered directive more than once, and it must not execute more than one ordered directive.
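A small sketch (names assumed) of the usual pattern: the iterations run in parallel, but the ordered block executes in sequential loop order:

    #include <stdio.h>

    void print_squares(int n)
    {
        int i;
        #pragma omp parallel for ordered
        for (i = 0; i < n; i++) {
            int sq = i * i;        /* computed in parallel */
            #pragma omp ordered
            printf("%d\n", sq);    /* printed in sequential order */
        }
    }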
2.7 Data Environment

This section presents a directive and several clauses for controlling the data
environment during the execution of parallel regions, as follows:
■ A threadprivate directive (see the following section) is provided to make file-
scope, namespace-scope, or static block-scope variables local to a thread.
■ Clauses that may be specified on the directives to control the sharing attributes of
variables for the duration of the parallel or work-sharing constructs are described
in Section 2.7.2 on page 25.
2.7.1 threadprivate Directive

The threadprivate directive makes the named file-scope, namespace-scope, or static block-scope variables specified in the variable-list private to a thread. variable-list is a comma-separated list of variables that do not have an incomplete type. The syntax of the threadprivate directive is as follows:

    #pragma omp threadprivate(variable-list) new-line
Each copy of a threadprivate variable is initialized once, at an unspecified point
in the program prior to the first reference to that copy, and in the usual manner (i.e.,
as the master copy would be initialized in a serial execution of the program). Note
that if an object is referenced in an explicit initializer of a threadprivate variable,
and the value of the object is modified prior to the first reference to a copy of the
variable, then the behavior is unspecified.
As with any private variable, a thread must not reference another thread's copy of a
threadprivate object. During serial regions and master regions of the program,
references will be to the master thread's copy of the object.
After the first parallel region executes, the data in the threadprivate objects is
guaranteed to persist only if the dynamic threads mechanism has been disabled and
if the number of threads remains unchanged for all parallel regions.
The restrictions to the threadprivate directive are as follows:
■ A threadprivate directive for file-scope or namespace-scope variables must
appear outside any definition or declaration, and must lexically precede all
references to any of the variables in its list.
■ Each variable in the variable-list of a threadprivate directive at file or
namespace scope must refer to a variable declaration at file or namespace scope.
■ A threadprivate directive for static block-scope variables must appear in the
scope of the variable and not in a nested scope. The directive must lexically
precede all references to any of the variables in its list.
■ Each variable in the variable-list of a threadprivate directive in block scope
must refer to a variable declaration in the same scope that lexically precedes the
directive. The variable declaration must use the static storage-class specifier.
■ If a variable is specified in a threadprivate directive in one translation unit, it
must be specified in a threadprivate directive in every translation unit in
which it is declared.
■ A threadprivate variable must not appear in any clause except the copyin, copyprivate, schedule, num_threads, or if clause.
■ The address of a threadprivate variable is not an address constant.
■ A threadprivate variable must not have an incomplete type or a reference
type.
■ A threadprivate variable with non-POD class type must have an accessible,
unambiguous copy constructor if it is declared with an explicit initializer.
The following example illustrates how modifying a variable that appears in an initializer can cause unspecified behavior, and also how to avoid this problem by using an auxiliary object and a copy-constructor.

    int x = 1;
    T a(x);
    const T b_aux(x); /* Capture value of x = 1 */
    T b(b_aux);
    #pragma omp threadprivate(a, b)

    void f(int n) {
        x++;
        #pragma omp parallel for
        /* In each thread:
         * Object a is constructed from x (with value 1 or 2?)
         * Object b is copy-constructed from b_aux
         */
        for (int i=0; i<n; i++) {
            g(a, b); /* Value of a is unspecified. */
        }
    }

Cross References:
■ Dynamic threads, see Section 3.1.7 on page 39.
■ OMP_DYNAMIC environment variable, see Section 4.3 on page 49.
2.7.2 Data-Sharing Attribute Clauses

Several directives accept clauses that allow a user to control the sharing attributes of
variables for the duration of the region. Sharing attribute clauses apply only to
variables in the lexical extent of the directive on which the clause appears. Not all of
the following clauses are allowed on all directives. The list of clauses that are valid
on a particular directive are described with the directive.
If a variable is visible when a parallel or work-sharing construct is encountered, and
the variable is not specified in a sharing attribute clause or threadprivate directive, then the variable is shared. Static variables declared within the dynamic
extent of a parallel region are shared. Heap allocated memory (for example, using
malloc() in C or C++ or the new operator in C++) is shared. (The pointer to this
memory, however, can be either private or shared.) Variables with automatic storage
duration declared within the dynamic extent of a parallel region are private.
Most of the clauses accept a variable-list argument, which is a comma-separated list of
variables that are visible. If a variable referenced in a data-sharing attribute clause
has a type derived from a template, and there are no other references to that variable
in the program, the behavior is undefined.
All variables that appear within directive clauses must be visible. Clauses may be
repeated as needed, but no variable may be specified in more than one clause, except
that a variable can be specified in both a firstprivate and a lastprivate clause.
The following sections describe the data-sharing attribute clauses:
■ private, Section 2.7.2.1 on page 25.
■ firstprivate, Section 2.7.2.2 on page 26.
■ lastprivate, Section 2.7.2.3 on page 27.
■ shared, Section 2.7.2.4 on page 27.
■ default, Section 2.7.2.5 on page 28.
■ reduction, Section 2.7.2.6 on page 28.
■ copyin, Section 2.7.2.7 on page 31.
■ copyprivate, Section 2.7.2.8 on page 32.
2.7.2.1 private
The private clause declares the variables in variable-list to be private to each thread
in a team. The syntax of the private clause is as follows:
    private(variable-list)
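A minimal sketch of the clause in use (names invented for illustration):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int tmp;   /* each thread gets its own, uninitialized, copy */
        #pragma omp parallel private(tmp)
        {
            tmp = omp_get_thread_num();   /* must assign before use */
            printf("thread %d holds tmp = %d\n", tmp, tmp);
        }
        return 0;
    }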
The default clause allows the user to affect the data-sharing attributes of variables. The syntax of the default clause is as follows:

    default(shared | none)

Specifying default(shared) is equivalent to explicitly listing each currently visible variable in a shared clause, unless it is threadprivate or const-qualified. In the absence of an explicit default clause, the default behavior is the same as if default(shared) were specified.

Specifying default(none) requires that at least one of the following must be true for every reference to a variable in the lexical extent of the parallel construct:
■ The variable is explicitly listed in a data-sharing attribute clause of a construct that contains the reference.
■ The variable is declared within the parallel construct.
■ The variable is threadprivate.
■ The variable has a const-qualified type.
■ The variable is the loop control variable for a for loop that immediately follows a for or parallel for directive, and the variable reference appears inside the loop.

Specifying a variable on a firstprivate, lastprivate, or reduction clause of an enclosed directive causes an implicit reference to the variable in the enclosing context. Such implicit references are also subject to the requirements listed above.

Only a single default clause may be specified on a parallel directive.

A variable's default data-sharing attribute can be overridden by using the private, firstprivate, lastprivate, reduction, and shared clauses, as demonstrated by the following example:

    #pragma omp parallel for default(shared) firstprivate(i) \
        private(x) private(r) lastprivate(i)

2.7.2.6 reduction

This clause performs a reduction on the scalar variables that appear in variable-list, with the operator op. The syntax of the reduction clause is as follows:

    reduction(op: variable-list)
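A minimal sketch (the function name is assumed): each thread accumulates into a private copy of sum, and the private copies are combined with + at the end of the loop:

    float dot(const float *a, const float *b, int n)
    {
        float sum = 0.0f;
        int i;
        #pragma omp parallel for reduction(+: sum)
        for (i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }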
■ A variable that is specified in the reduction clause must not be const-qualified.
■ Variables that are private within a parallel region or that appear in the reduction clause of a parallel directive cannot be specified in a reduction clause on a work-sharing directive that binds to the parallel construct.

    #pragma omp parallel private(y)
    {
        /* ERROR - private variable y cannot be
         * specified in a reduction clause */
        #pragma omp for reduction(+: y)
        for (i=0; i<n; i++)
            y += b[i];
    }

    /* ERROR - variable x cannot be specified in both
     * a shared and a reduction clause */
    #pragma omp parallel for shared(x) reduction(+: x)

2.7.2.7 copyin

The copyin clause provides a mechanism to assign the same value to threadprivate variables for each thread in the team executing the parallel region. For each variable specified in a copyin clause, the value of the variable in the master thread of the team is copied, as if by assignment, to the thread-private copies at the beginning of the parallel region. The syntax of the copyin clause is as follows:

    copyin(variable-list)

The restrictions to the copyin clause are as follows:
■ A variable that is specified in the copyin clause must have an accessible, unambiguous copy assignment operator.
■ A variable that is specified in the copyin clause must be a threadprivate variable.
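A hedged sketch of copyin together with a threadprivate variable (names invented):

    #include <stdio.h>
    #include <omp.h>

    int counter = 0;
    #pragma omp threadprivate(counter)

    int main(void)
    {
        counter = 50;   /* assigned to the master thread's copy */
        #pragma omp parallel copyin(counter)
        {
            /* every thread starts from the master's value, 50 */
            counter += omp_get_thread_num();
            printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
        }
        return 0;
    }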
The copyprivate clause provides a mechanism to use a private variable to broadcast a value from one member of a team to the other members. It is an alternative to using a shared variable for the value when providing such a shared variable would be difficult (for example, in a recursion requiring a different variable at each level). The copyprivate clause can only appear on the single directive. The syntax of the copyprivate clause is as follows:

    copyprivate(variable-list)

The effect of the copyprivate clause on the variables in its variable-list occurs after the execution of the structured block associated with the single construct, and before any of the threads in the team have left the barrier at the end of the construct. Then, in all other threads in the team, for each variable in the variable-list, that variable becomes defined (as if by assignment) with the value of the corresponding variable in the thread that executed the construct's structured block.

Restrictions to the copyprivate clause are as follows:
■ A variable that is specified in the copyprivate clause must not appear in a private or firstprivate clause for the same single directive.
■ If a single directive with a copyprivate clause is encountered in the dynamic extent of a parallel region, all variables specified in the copyprivate clause must be private in the enclosing context.
■ A variable that is specified in the copyprivate clause must have an accessible, unambiguous copy assignment operator.

2.8 Directive Binding

Dynamic binding of directives must adhere to the following rules:
■ The for, sections, single, master, and barrier directives bind to the dynamically enclosing parallel, if one exists, regardless of the value of any if clause that may be present on that directive. If no parallel region is currently being executed, the directives are executed by a team composed of only the master thread.
■ The ordered directive binds to the dynamically enclosing for.
■ The atomic directive enforces exclusive access with respect to atomic directives in all threads, not just the current team.
■ The critical directive enforces exclusive access with respect to critical directives in all threads, not just the current team.
This section describes the OpenMP C and C++ run-time library functions. The
header <omp.h> declares two types, several functions that can be used to control
and query the parallel execution environment, and lock functions that can be used to
synchronize access to data.
The type omp_lock_t is an object type capable of representing that a lock is
available, or that a thread owns a lock. These locks are referred to as simple locks.
The type omp_nest_lock_t is an object type capable of representing either that a lock is available, or both the identity of the thread that owns the lock and a nesting count (described below). These locks are referred to as nestable locks.
The library functions are external functions with “C” linkage.
The descriptions in this chapter are divided into the following topics:
■ Execution environment functions (see Section 3.1 on page 35).
■ Lock functions (see Section 3.2 on page 41).
3.1 Execution Environment Functions

The functions described in this section affect and monitor threads, processors, and
the parallel environment:
■ the omp_set_num_threads function.
■ the omp_get_num_threads function.
■ the omp_get_max_threads function.
■ the omp_get_thread_num function.
■ the omp_get_num_procs function.
■ the omp_in_parallel function.
3.1.2 omp_get_num_threads Function

The omp_get_num_threads function returns the number of threads currently in the team executing the parallel region from which it is called. The format is as follows:

    #include <omp.h>
    int omp_get_num_threads(void);

The num_threads clause, the omp_set_num_threads function, and the OMP_NUM_THREADS environment variable control the number of threads in a team. If the number of threads has not been explicitly set by the user, the default is implementation-defined. This function binds to the closest enclosing parallel directive. If called from a serial portion of a program, or from a nested parallel region that is serialized, this function returns 1.

Cross References:
■ OMP_NUM_THREADS environment variable, see Section 4.2 on page 48.
■ num_threads clause, see Section 2.3 on page 8.
■ parallel construct, see Section 2.3 on page 8.

3.1.3 omp_get_max_threads Function

The omp_get_max_threads function returns an integer that is guaranteed to be at least as large as the number of threads that would be used to form a team if a parallel region without a num_threads clause were to be encountered at that point in the code. The format is as follows:

    #include <omp.h>
    int omp_get_max_threads(void);

The following expresses a lower bound on the value of omp_get_max_threads:

    threads-used-for-next-team <= omp_get_max_threads

Note that if a subsequent parallel region uses the num_threads clause to request a specific number of threads, the guarantee on the lower bound of the result of omp_get_max_threads no longer holds.

The omp_get_max_threads function's return value can be used to dynamically allocate sufficient storage for all threads in the team formed at the subsequent parallel region.
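A small sketch exercising both functions (the printed numbers depend on the implementation and environment):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* In a serial region the team consists of exactly one thread. */
        printf("serial region: %d thread(s)\n", omp_get_num_threads());
        printf("at most %d thread(s) in the next team\n", omp_get_max_threads());

        #pragma omp parallel
        {
            #pragma omp single
            printf("parallel region: %d thread(s)\n", omp_get_num_threads());
        }
        return 0;
    }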
3.1.8 omp_get_dynamic Function

The omp_get_dynamic function returns a nonzero value if dynamic adjustment of threads is enabled, and returns 0 otherwise. The format is as follows:

    #include <omp.h>
    int omp_get_dynamic(void);

If the implementation does not implement dynamic adjustment of the number of threads, this function always returns 0.

Cross References:
■ For a description of dynamic thread adjustment, see Section 3.1.7 on page 39.

3.1.9 omp_set_nested Function

The omp_set_nested function enables or disables nested parallelism. The format is as follows:

    #include <omp.h>
    void omp_set_nested(int nested);

If nested evaluates to 0, nested parallelism is disabled, which is the default, and nested parallel regions are serialized and executed by the current thread. If nested evaluates to a nonzero value, nested parallelism is enabled, and parallel regions that are nested may deploy additional threads to form nested teams.

This function has the effects described above when called from a portion of the program where the omp_in_parallel function returns zero. If it is called from a portion of the program where the omp_in_parallel function returns a nonzero value, the behavior of this function is undefined.

This call has precedence over the OMP_NESTED environment variable.

When nested parallelism is enabled, the number of threads used to execute nested parallel regions is implementation-defined. As a result, OpenMP-compliant implementations are allowed to serialize nested parallel regions even when nested parallelism is enabled.

Cross References:
■ OMP_NESTED environment variable, see Section 4.4 on page 49.
■ omp_in_parallel function, see Section 3.1.6 on page 38.
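A minimal sketch combining the two functions; whether the inner region really gets two threads is implementation-defined, as noted above:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_dynamic(0);   /* ask for fixed team sizes */
        omp_set_nested(1);    /* request nested parallelism */
        printf("dynamic adjustment enabled: %d\n", omp_get_dynamic());

        #pragma omp parallel num_threads(2)
        {
            /* A compliant implementation may still serialize this region. */
            #pragma omp parallel num_threads(2)
            printf("inner team has %d thread(s)\n", omp_get_num_threads());
        }
        return 0;
    }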
    setenv OMP_NUM_THREADS 16

Cross References:
■ num_threads clause, see Section 2.3 on page 8.
■ omp_set_num_threads function, see Section 3.1.1 on page 36.
■ omp_set_dynamic function, see Section 3.1.7 on page 39.

4.3 OMP_DYNAMIC

The OMP_DYNAMIC environment variable enables or disables dynamic adjustment of the number of threads available for execution of parallel regions unless dynamic adjustment is explicitly enabled or disabled by calling the omp_set_dynamic library routine. Its value must be TRUE or FALSE.

If set to TRUE, the number of threads that are used for executing parallel regions may be adjusted by the runtime environment to best utilize system resources.

If set to FALSE, dynamic adjustment is disabled. The default condition is implementation-defined.

Example:

    setenv OMP_DYNAMIC TRUE

Cross References:
■ For more information on parallel regions, see Section 2.3 on page 8.
■ omp_set_dynamic function, see Section 3.1.7 on page 39.

4.4 OMP_NESTED

The OMP_NESTED environment variable enables or disables nested parallelism unless nested parallelism is enabled or disabled by calling the omp_set_nested library routine. If set to TRUE, nested parallelism is enabled; if it is set to FALSE, nested parallelism is disabled. The default value is FALSE.
will skip the single section and stop at the barrier at the end of the single construct. If other threads can proceed without waiting for the thread executing the single section, a nowait clause can be specified on the single directive.

    #pragma omp parallel
    {
        #pragma omp single
        printf("Beginning work1.\n");
        work1();
        #pragma omp single
        printf("Finishing work1.\n");
        #pragma omp single nowait
        printf("Finished work1 and beginning work2.\n");
        work2();
    }

A.10 Specifying Sequential Ordering

Ordered sections (Section 2.6.6 on page 22) are useful for sequentially ordering the output from work that is done in parallel. The following program prints out the indexes in sequential order:

    #pragma omp for ordered schedule(dynamic)
    for (i=lb; i<ub; i+=st)
        work(i);

A.11 Specifying a Fixed Number of Threads

Some programs rely on a fixed, prespecified number of threads to execute correctly. Because the default setting for the dynamic adjustment of the number of threads is implementation-defined, such programs can choose to turn off the dynamic threads capability and set the number of threads explicitly.
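A minimal sketch of that idea, assuming a program that needs exactly 16 threads (the function name is invented):

    #include <omp.h>

    void run_fixed(void)
    {
        omp_set_dynamic(0);        /* turn off dynamic thread adjustment */
        omp_set_num_threads(16);   /* the next team has exactly 16 threads */
        #pragma omp parallel
        {
            /* code that relies on exactly 16 threads */
        }
    }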
    void f1(int *q)
    {
        *q = 1;
        #pragma omp flush
        // x, p, and *q are flushed
        // because they are shared and accessible
        // q is not flushed because it is not shared.
    }

    void f2(int *q)
    {
        #pragma omp barrier
        *q = 2;
        #pragma omp barrier
        // a barrier implies a flush
        // x, p, and *q are flushed
        // because they are shared and accessible
        // q is not flushed because it is not shared.
    }

    int g(int n)
    {
        int i = 1, j, sum = 0;
        *p = 1;
        #pragma omp parallel reduction(+: sum) num_threads(10)
        {
            f1(&j);
            // i, n and sum were not flushed
            // because they were not accessible in f1
            // j was flushed because it was accessible
            sum += j;

            f2(&j);
            // i, n, and sum were not flushed
            // because they were not accessible in f2
            // j was flushed because it was accessible
            sum += i + j + *p + n;
        }
        return sum;
    }
critical section, but to do other work while waiting for entry to the second. The
omp_set_lock function blocks, but the omp_test_lock function does not,
allowing the work in skip() to be done.
    #include <omp.h>
    int main()
    {
        omp_lock_t lck;
        int id;

        omp_init_lock(&lck);
        #pragma omp parallel shared(lck) private(id)
        {
            id = omp_get_thread_num();

            omp_set_lock(&lck);
            printf("My thread id is %d.\n", id);
            // only one thread at a time can execute this printf
            omp_unset_lock(&lck);

            while (! omp_test_lock(&lck)) {
                skip(id);   /* we do not yet have the lock,
                               so we must do something else */
            }
            work(id);       /* we now have the lock
                               and can do the work */
            omp_unset_lock(&lck);
        }
        omp_destroy_lock(&lck);
    }
A.24 Example of the private Clause

The private clause (Section 2.7.2.1 on page 25) of a parallel region is only in effect for the lexical extent of the region, not for the dynamic extent of the region. Therefore, in the example that follows, any uses of the variable a within the for loop in the routine f refer to a private copy of a, while a usage in routine g refers to the global a.
    int a;

    void f(int n) {
        a = 0;

        #pragma omp parallel for private(a)
        for (int i=1; i<n; i++) {
            a = i;
            g(i, n);
            d(a);    // Private copy of "a"
            ...
        }
        ...
    }

    void g(int k, int n) {
        h(k,a);      // The global "a", not the private "a" in f
    }
Mind-boggling Trends in Chip Industry • Long history since 1971
- Introduction of Intel 4004 - http://www.intel4004.com
• Today we talk about more than one billion transistors on a chip - Intel Montecito (in market since July'06) has 1.7B transistors - Die size has increased steadily (what is a die?)
• Intel Prescott: 112 mm², Intel Pentium 4EE: 237 mm², Intel Montecito: 596 mm² - Minimum feature size has shrunk from 10 micron in 1971 to 0.065 micron today
Agenda
• Unpipelined microprocessors
• Pipelining: simplest form of ILP
• Out-of-order execution: more ILP
• Multiple issue: drink more ILP
• Scaling issues and Moore's Law
• Why multi-core - TLP and de-centralized design
• Tiled CMP and shared cache
• Implications on software

Unpipelined Microprocessors
• Typically an instruction enjoys five phases in its life
- Fetch from memory
- Decode and register read
- Execute
- Data memory access
- Register write
• Unpipelined execution would take a long single cycle or multiple short cycles
- Only one instruction inside the processor at any point in time

Pipelining: simplest form of ILP

Pipelining
• One simple observation
- Exactly one piece of hardware is active at any point in time • Why not fetch a new instruction every cycle?
- Five instructions in five different phases - Throughput increases five times (ideally)
• Bottom-line is - If consecutive instructions are independent, they can be processed in parallel - The first form of instruction-level parallelism (ILP)
• Control dependence
- On average, every fifth instruction is a branch (coming from if-else, for, do-while, …)
- Branches execute in the third phase: introduces bubbles unless you are smart
Control Dependence
• What do you fetch in the X and Y slots? Options: nothing, fall-through, learn past history and predict (today the best predictors achieve on average 97% accuracy for SPEC2000)

Data Dependence
• Take three bubbles? - Back-to-back dependence is too frequent - Solution: hardware bypass paths - Allow the ALU to bypass the produced value in time: not always possible
Need a live bypass! (requires some negative time travel: not yet feasible in real world)
Out-of-order Multiple Issue • Some hardware nightmares
- Complex issue logic to discover independent instructions - Increased pressure on cache
• Impact of a cache miss is much bigger now in terms of lost opportunity • Various speculative techniques are in place to "ignore" the slow and stupid memory
- Increased impact of control dependence • Must feed the processor with multiple correct instructions every cycle • One cycle of bubble means lost opportunity of multiple instructions
- Complex logic to verify

Scaling issues and Moore's Law

Moore's Law
• Number of transistors on-chip doubles every 18 months
- So much of innovation was possible only because we had transistors - Phenomenal 58% performance growth every year
• Moore’s Law is facing a danger today - Power consumption is too high when clocked at multi-GHz frequency and it is proportional to the number of switching transistors
• Wire delay doesn't decrease with transistor size

Scaling Issues
• Hardware for extracting ILP has reached the point of diminishing return
- Need a large number of in-flight instructions - Supporting such a large population inside the chip requires power-hungry delay sensitive logic and storage
- Verification complexity is getting out of control • How to exploit so many transistors? - Must be a de-centralized design which avoids long wires
Why Multi-Core

Multi-core
• Put a few reasonably complex processors or many simple processors on the chip
- Each processor has its own primary cache and pipeline - Often a processor is called a core - Often called a chip-multiprocessor (CMP)
• Hey Mainak, you are missing the point
- Did we use the transistors properly?
- Depends on whether you can keep the cores busy
- Introduces the concept of thread-level parallelism (TLP)
Thread-level Parallelism • Look for concurrency at a granularity coarser than instructions
- Put a chunk of consecutive instructions together and call it a thread (largely wrong!)
- Each thread can be seen as a "dynamic" subgraph of the sequential control-flow graph: take a loop and unroll its graph
- The edges spanning the subgraphs represent data dependence across threads
• The goal of parallelization is to minimize such edges • Threads should mostly compute independently on different cores; but need to talk once in a while to get things done! • Parallelizing sequential programs is fun, but often tedious for non-experts
- So look for parallelism at even coarser grain - Run multiple independent programs simultaneously
• Known as multi-programming
• The biggest reason why quotidian Windows fans would buy small-scale multiprocessors and multi-core today
• Can play AOE while running heavy-weight simulations and downloading movies
• Have you seen the state of the poor machine when running anti-virus?

Communication in Multi-core
• Ideal for shared address space
- Fast on-chip hardwired communication through cache (no OS intervention)
- Two types of architectures
• Tiled CMP: each core has its private cache hierarchy (no cache sharing); Intel Pentium D, Dual Core Opteron, Intel Montecito, Sun UltraSPARC IV, IBM Cell (more specialized)
• Shared cache CMP: outermost level of cache hierarchy is shared among cores; Intel Woodcrest, Intel Conroe, Sun Niagara, IBM Power4, IBM Power5

Tiled CMP and Shared cache

Tiled CMP (Hypothetical Floor-plan)
Implications on Software
• A tall memory hierarchy
- Each core could run multiple threads
• Each core in Niagara runs four threads
- Within a core, threads communicate through the private cache (fastest)
- Across cores, communication happens through the shared L2 or the coherence controller (if tiled)
- Multiple such chips can be connected over a scalable network
• Adds one more level of memory hierarchy
• A very non-uniform access stack

Research Directions
• Hexagon of puzzles
- Running single-threaded programs efficiently on this sea of cores
- Managing the energy envelope efficiently
- Allocating shared cache efficiently
- Allocating shared off-chip bandwidth efficiently
- Making parallel programming easy
• Transactional memory
• Speculative parallelization
- Verification of hardware and parallel software

References
• A good reading is Parallel Computer Architecture by Culler and Singh, with Gupta
- Caveat: does not talk about multi-core, but introduces the general area of shared memory multiprocessors
• Papers
- Check out the most recent issue of Intel Technology Journal
• http://www.intel.com/technology/itj/
• http://www.intel.com/technology/itj/archive.htm
- Conferences: ASPLOS, ISCA, HPCA, MICRO, PACT
- Journals: IEEE Micro, IEEE TPDS, ACM TACO
Agenda
• Convergence of parallel architectures
• Fundamental design issues
• ILP vs. TLP

Convergence of parallel architectures

Communication architecture
• Historically, parallel architectures are tied to programming models
- Diverse designs made it impossible to write portable parallel software
- But the driving force was the same: need for fast processing
• Today parallel architecture is seen as an extension of microprocessor architecture with a communication architecture
- Defines the basic communication and synchronization operations and provides hw/sw implementation of those

Layered architecture
• A parallel architecture can be divided into several layers
• - Programming model: shared address, message passing, multiprogramming, data parallel, dataflow, etc.
• - Compiler + libraries
• - Operating systems support
• - Communication hardware
• - Physical communication medium
• Communication architecture = user/system interface + hw implementation (roughly defined by the last four layers)
• - Compiler and OS provide the user interface to communicate between and synchronize threads
Shared address • Communication takes place through a logically shared portion of memory
• - User interface is normal load/store instructions • - Load/store instructions generate virtual addresses • - The VAsare translated to PAsby TLB or page table • - The memory controller then decides where to find this PA • - Actual communication is hidden from the programmer
• The general communication hw consists of multiple processors connected over some medium so that they can talk to memory banks and I/O devices
• - The architecture of the interconnect may vary depending on projected cost and target performance
Communication medium
• - Interconnect could be a crossbar switch so that any processor can talk to any memory bank in one "hop" (provides latency and bandwidth advantages)
• - Scaling a crossbar becomes a problem: cost is proportional to square of the size
• - Instead, could use a scalable switch-based network; latency increases and bandwidth decreases because now multiple processors contend for switch ports
• - From mid 80s shared bus became popular leading to the design of SMPs
• - Pentium Pro Quad was the first commodity SMP • - Sun Enterprise server provided a highly pipelined wide shared
bus for scalability reasons; it also distributed the memory to each processor, but there was no local bus on the boards, i.e., the memory was still "symmetric" (must use the shared bus)
• - NUMA or DSM architectures provide a better solution to the scalability problem; the symmetric view is replaced by local and remote memory and each node (containing processor(s) with caches, memory controller and router) gets connected via a scalable network (mesh, ring etc.); Examples include Cray/SGI T3E, SGI Origin 2000, Alpha GS320, Alpha/HP GS1280 etc.
Message passing
• Very popular for large-scale computing
• The system architecture looks exactly the same as DSM, but there is no shared memory
• The user interface is via send/receive calls to the message layer
• The message layer is integrated into the I/O system instead of the memory system
• Send specifies a local data buffer that needs to be transmitted; send also specifies a tag
• A matching receive at the destination node with the same tag reads in the data from the kernel space buffer to user memory
• Effectively, provides a memory-to-memory copy
• Actual implementation of message layer
• - Initially it was very topology dependent • - A node could talk only to its neighbors through FIFO buffers • - These buffers were small in size and therefore while sending a
message send would occasionally block waiting for the receive to start reading the buffer (synchronous message passing)
• - Soon the FIFO buffers got replaced by DMA (direct memory access) transfers so that a send can initiate a transfer from memory to I/O buffers and finish immediately (DMA happens in background); same applies to the receiving end also
• - The parallel algorithms were designed specifically for certain topologies: a big problem
• To improve usability of machines, the message layer started providing support for arbitrary source and destination (not just nearest neighbors)
• - Essentially involved storing a message in intermediate "hops" and forwarding it to the next node on the route
• - Later this store-and-forward routing got moved to hardware where a switch could handle all the routing activities
• - Further improved to do pipelined wormhole routing so that the time taken to traverse the intermediate hops became small compared to the time it takes to push the message from processor to network (limited by node-to-network bandwidth)
• - Examples include IBM SP2, Intel Paragon • - Each node of Paragon had two i860 processors, one of which
was dedicated to servicing the network (send/recv. etc.)
Convergence • Shared address and message passing are two distinct programming models, but the architectures look very similar
• - Both have a communication assist or network interface to initiate messages or transactions
• - In shared memory this assist is integrated with the memory controller
• - In message passing this assist normally used to be integrated with the I/O, but the trend is changing
• - There are message passing machines where the assist sits on the memory bus or machines where DMA over network is supported (direct transfer from source memory to destination memory)
• - Finally, it is possible to emulate send/recv. on shared memory through shared buffers, flags and locks
• - Possible to emulate a shared virtual mem. on message passing machines through modified page fault handlers
A generic architecture • In all the architectures we have discussed thus far a node essentially contains processor(s) + caches, memory and a communication assist (CA)
• - CA = network interface (NI) + communication controller • The nodes are connected over a scalable network • The main difference remains in the architecture of the CA
• - And even under a particular programming model (e.g., shared memory) there is a lot of choices in the design of the CA
• - Most innovations in parallel architecture take place in the communication assist (also called communication controller or node controller)
Fundamental design issues

Design issues
• Need to understand architectural components that affect software
- Compiler, library, program
- User/system interface and hw/sw interface
- How do programming models efficiently talk to the communication architecture?
- How to implement efficient primitives in the communication layer?
- In a nutshell, what issues of a parallel machine will affect the performance of the parallel applications?
• Naming, Operations, Ordering, Replication, Communication cost

Naming
• How are the data in a program referenced?
• - In sequential programs a thread can access any variable in its virtual address space
• - In shared memory programs a thread can access any private or shared variable (same load/store model of sequential programs)
• - In message passing programs a thread can access local data directly
• Clearly, naming requires some support from hw and OS • - Need to make sure that the accessed virtual address gets
translated to the correct physical address

Operations
• What operations are supported to access data?
• - For sequential and shared memory models load/store are sufficient
• - For message passing models send/receive are needed to access remote data
• - For shared memory, hw (essentially the CA) needs to make sure that a load/store operation gets correctly translated to a message if the address is remote
• - For message passing, CA or the message layer needs to copy data from local memory and initiate send, or copy data from receive buffer to local memory
Ordering • How are the accesses to the same data ordered?
• - For sequential model, it is the program order: true dependence order
• - For shared memory, within a thread it is the program order, across threads some "valid interleaving" of accesses as expected by the programmer and enforced by synchronization operations (locks, point-to-point synchronization through flags, global synchronization through barriers)
• - Ordering issues are very subtle and important in shared memory model (some microprocessor re-ordering tricks may easily violate correctness when used in shared memory context)
• - For message passing, ordering across threads is implied through point-to-point send/receive pairs (producer-consumer relationship) and mutual exclusion is inherent (no shared variable)
Replication • How is the shared data locally replicated?
• - This is very important for reducing communication traffic
• - In microprocessors data is replicated in the cache to reduce memory accesses
• - In message passing, replication is explicit in the program and happens through receive (a private copy is created)
• - In shared memory a load brings in the data to the cache hierarchy so that subsequent accesses can be fast; this is totally hidden from the program and therefore the hardware must provide a layer that keeps track of the most recent copies of the data (this layer is central to the performance of shared memory multiprocessors and is called the cache coherence protocol)
Communication cost
• Three major components of the communication architecture that affect performance
- Latency: time to do an operation (e.g., load/store or send/recv.)
- Bandwidth: rate of performing an operation
- Overhead or occupancy: how long is the communication layer occupied doing an operation
• Latency
- Already a big problem for microprocessors
- Even bigger problem for multiprocessors due to remote operations
- Must optimize application or hardware to hide or lower latency (algorithmic optimizations or prefetching or overlapping computation with communication)
• Bandwidth
- How many ops in unit time, e.g., how many bytes transferred per second
- Local BW is provided by heavily banked memory or faster and wider system bus
- Communication BW has two components: 1. node-to-network BW (also called network link BW) measures how fast bytes can be pushed into the router from the CA, 2. within-network bandwidth: affected by scalability of the network and architecture of the switch or router
• Linear cost model: Transfer time = T0 + n/B, where T0 is the start-up overhead, n is the number of bytes transferred, and B is the BW (a worked example with assumed numbers follows below)
- Not sufficient since overlap of comp. and comm. is not considered; also does not count how the transfer is done (pipelined or not)
• Better model:
- Communication time for n bytes = Overhead + CA occupancy + Network latency + Size/BW + Contention
- T(n) = Ov + Oc + L + n/B + Tc
- Overhead and occupancy may be functions of n
- Contention depends on the queuing delay at various components along the communication path, e.g., waiting time at the communication assist or controller, waiting time at the router, etc.
- Overall communication cost = frequency of communication x (communication time - overlap with useful computation)
- Frequency of communication depends on various factors such as how the program is written or the granularity of communication supported by the underlying hardware

ILP vs. TLP
• Microprocessors enhance performance of a sequential program by extracting parallelism from an instruction stream (called instruction-level parallelism)
• Multiprocessors enhance performance of an explicitly parallel program by running multiple threads in parallel (called thread-level parallelism)
• TLP provides parallelism at a much larger granularity compared to ILP
• In multiprocessors ILP and TLP work together
- Within a thread ILP provides performance boost
- Across threads TLP provides speedup over a sequential version of the parallel program
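A worked example of the linear cost model above, with assumed numbers: take T0 = 1 µs of start-up overhead and B = 1 GB/s. Transferring n = 4096 bytes costs 1 µs + 4096/10^9 s ≈ 5.1 µs, so start-up is about 20% of the total; for a 64-byte transfer the same T0 is over 90% of the cost, which is why batching many small messages into one large transfer usually pays off.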
Prolog: Why bother? • As an architect why should you be concerned with parallel programming?
• - Understanding program behavior is very important in developing high-performance computers
• - An architect designs machines that will be used by the software programmers: so need to understand the needs of a program
• - Helps in making design trade-offs and cost/performance analysis, i.e., what hardware feature is worth supporting and what is not
• - Normally an architect needs to have a fairly good knowledge in compilers and operating systems
Agenda
• Steps in writing a parallel program
• Example

Writing a parallel program
• Start from a sequential description
• Identify work that can be done in parallel
• Partition work and/or data among threads or processes
• - Decomposition and assignment
• Add necessary communication and synchronization
• - Orchestration • Map threads to processors (Mapping) • How good is the parallel program?
• - Measure speedup = sequential execution time / parallel execution time (ideally equal to the number of processors)
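For example (numbers illustrative): if the sequential version takes 100 s and the parallel version takes 25 s on 8 processors, the speedup is 100/25 = 4 against an ideal of 8, i.e., an efficiency of 50%.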
Some definitions • Task
• - Arbitrary piece of sequential work
• - Concurrency is only across tasks
• - Fine-grained task vs. coarse-grained task: controls granularity of
parallelism (spectrum of grain: one instruction to the whole sequential program) • Process/thread
• - Logical entity that performs a task • - Communication and synchronization happen between threads
• Processors • - Physical entity on which one or more processes execute
Decomposition
• Find concurrent tasks and divide the program into tasks
- The level or grain of concurrency needs to be decided here
- Too many tasks: may lead to too much overhead in communicating and synchronizing between tasks
- Too few tasks: may lead to idle processors
- Goal: just enough tasks to keep the processors busy
• The number of tasks may vary dynamically
- New tasks may get created as the computation proceeds, e.g., new rays in ray tracing
- The number of available tasks at any point in time is an upper bound on the achievable speedup
Static assignment
• Given a decomposition, it is possible to assign tasks statically
- For example, some computation on an array of size N can be decomposed statically by assigning a range of indices to each process: for k processes, P0 operates on indices 0 to (N/k)-1, P1 operates on N/k to (2N/k)-1, ..., P(k-1) operates on (k-1)N/k to N-1 (see the sketch after this list)
- For regular computations this works great: simple and low-overhead
• What if the nature of the computation depends on the index?
- For certain index ranges you do some heavy-weight computation, while for others you do something simple
- Is there a problem?
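The following is a minimal POSIX threads sketch of the static block assignment described above. The array size N, the thread count K, and the body of the per-index work are hypothetical placeholders for whatever the real computation is.

#include <pthread.h>
#include <stdio.h>

#define N 1000000   /* array size (hypothetical) */
#define K 4         /* number of threads/processes (hypothetical) */

static double a[N];

/* Thread tid statically owns the contiguous block of indices
   [tid*N/K, (tid+1)*N/K), exactly as in the decomposition above. */
static void *worker(void *arg) {
    long tid = (long)arg;
    long lo = tid * (long)N / K;
    long hi = (tid + 1) * (long)N / K;
    for (long i = lo; i < hi; i++)
        a[i] = a[i] * 2.0 + 1.0;   /* stand-in for the real computation */
    return NULL;
}

int main(void) {
    pthread_t t[K];
    for (long i = 0; i < K; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < K; i++)
        pthread_join(t[i], NULL);
    printf("a[0] = %f\n", a[0]);
    return 0;
}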
Dynamic assignment
• Static assignment may lead to load imbalance, depending on how irregular the application is
• Dynamic decomposition/assignment solves this issue by allowing a process to dynamically choose any available task whenever it is done with its previous task
- Normally in this case you decompose the program in such a way that the number of available tasks is larger than the number of processes
- Same example: divide the array into portions of 10 indices each, so you have N/10 tasks
- An idle process grabs the next available task
- Provides better load balance, since longer tasks can execute concurrently with the smaller ones
• Dynamic assignment comes with its own overhead
- Now you need to maintain a shared count of the number of available tasks
- The update of this variable must be protected by a lock
- Need to be careful that this lock contention does not outweigh the benefits of dynamic decomposition (see the sketch after this list)
• In more complicated applications a task may not just operate on an index range, but could manipulate a subtree or a complex data structure
- Normally a dynamic task queue is maintained, where each task is probably a pointer to the data
- The task queue gets populated as new tasks are discovered
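The following is a minimal POSIX threads sketch of dynamic assignment using a lock-protected shared counter, as described above. The array size, the chunk size of 10 indices, and the per-index work are hypothetical.

#include <pthread.h>
#include <stdio.h>

#define N        1000000  /* total indices (hypothetical) */
#define CHUNK    10       /* indices per task, as in the example above */
#define NTHREADS 4

static double a[N];
static long next_task = 0;   /* shared count of handed-out work */
static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;

/* An idle thread grabs the next available task (a block of CHUNK
   indices); the shared counter update is protected by a lock. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&task_lock);
        long start = next_task;
        next_task += CHUNK;
        pthread_mutex_unlock(&task_lock);
        if (start >= N) break;   /* no tasks left */
        long end = (start + CHUNK < N) ? start + CHUNK : N;
        for (long i = start; i < end; i++)
            a[i] = a[i] * 2.0 + 1.0;   /* stand-in for the real work */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("done: a[0] = %f\n", a[0]);
    return 0;
}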
Decomposition types
• Decomposition by data
- The most commonly found decomposition technique
- The data set is partitioned into several subsets, and each subset is assigned to a process
- The type of computation may or may not be identical on each subset
- Very easy to program and manage (a small OpenMP sketch follows)
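A minimal OpenMP sketch of decomposition by data: each thread gets a subset of the array and applies the same computation to it. The array and the transformation applied are hypothetical.

#include <stdio.h>
#include <omp.h>

#define N 1000000   /* data set size (hypothetical) */

static double data[N];

int main(void) {
    /* The data set is partitioned among the threads; schedule(static)
       gives each thread one contiguous block of iterations. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        data[i] = data[i] * 0.5 + 1.0;   /* stand-in computation */
    printf("data[0] = %f\n", data[0]);
    return 0;
}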
• Computational decomposition
- Not so popular: tricky to program and manage
- All processes operate on the same data, but probably carry out different kinds of computation
- More common in systolic arrays, pipelined graphics processing units (GPUs), etc.
Orchestration
• Involves structuring communication and synchronization among processes, organizing data structures to improve locality, and scheduling tasks
- This step normally depends on the programming model and the underlying architecture
• The goals are to:
- Reduce communication and synchronization costs
- Maximize locality of data reference
- Schedule tasks to maximize concurrency: do not schedule dependent tasks in parallel
- Reduce the overhead of parallelization and concurrency management (e.g., management of the task queue, the overhead of initiating a task, etc.); a sketch of one such optimization follows this list
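One common way to reduce synchronization cost during orchestration is to accumulate into per-thread partial results and combine them once at the end, instead of locking a shared variable on every update. The minimal OpenMP sketch below (with a hypothetical array and sum) shows the idea; the same pattern applies to the diff accumulation in the equation solver later.

#include <stdio.h>
#include <omp.h>

#define N 1000000   /* hypothetical problem size */

static double a[N];

int main(void) {
    for (long i = 0; i < N; i++)
        a[i] = 1.0;

    double sum = 0.0;
    /* reduction(+:sum) gives each thread a private partial sum and
       combines the partial sums once at the end, avoiding a lock or
       critical section on every iteration. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}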
Mapping
• At this point you have a parallel program
- You just need to decide which and how many processes go to each processor of the parallel machine
• Could be specified by the program
- Pin particular processes to particular processors for the whole life of the program; the processes cannot migrate to other processors (a pinning sketch follows this list)
• Could be controlled entirely by the OS
- Schedule processes on idle processors
- Various scheduling algorithms are possible, e.g., round robin: process #k goes to processor #k
- A NUMA-aware OS normally takes multiprocessor-specific metrics into account in scheduling
• How many processes per processor? Most common is one-to-one
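The following is a minimal, Linux-specific sketch of pinning a thread to a processor using pthread_setaffinity_np. The choice of processor 0 is hypothetical, and portable code would need a different mechanism, since this call is a GNU extension.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    /* Pin the calling thread to processor 0 for its whole life;
       the OS scheduler will not migrate it to other processors. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* hypothetical choice: processor #0 */
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed\n");
        return 1;
    }
    printf("pinned to processor 0\n");
    return 0;
}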
An example
• Iterative equation solver
- The main kernel in the Ocean simulation
- Update each 2-D grid point via Gauss-Seidel iterations:
A[i,j] = 0.2 * (A[i,j] + A[i,j+1] + A[i,j-1] + A[i+1,j] + A[i-1,j])
- Pad the n-by-n grid to (n+2)-by-(n+2) to avoid corner problems
- Update only the interior n-by-n grid
- One iteration consists of updating all n^2 points in place and accumulating the difference from the previous value at each point
- If the difference is less than a threshold, the solver is said to have converged to a stable grid equilibrium
Sequential program

int n;          /* grid size */
float **A, diff;

begin main()
    read(n);    /* size of grid */
    Allocate(A);
    Initialize(A);
    Solve(A);
end main

begin Solve(A)
    int i, j, done = 0;
    float temp;
    while (!done)
        diff = 0.0;
        for i = 0 to n-1
            for j = 0 to n-1
                temp = A[i,j];
                A[i,j] = 0.2 * (A[i,j] + A[i,j+1] + A[i,j-1] + A[i-1,j] + A[i+1,j]);
                diff += fabs(A[i,j] - temp);
            endfor
        endfor
        if (diff/(n*n) < TOL) then done = 1;
    endwhile
end Solve

Decomposition
• Look for concurrency in loop iterations
- In this case the iterations are really dependent
- Iteration (i, j) depends on iterations (i, j-1) and (i-1, j)
- Each anti-diagonal can be computed in parallel
- Must synchronize after each anti-diagonal (globally or point-to-point)
- Alternative: red-black ordering (a different update pattern)
• Can update all red points first, synchronize globally with a barrier, and then update all black points (a sketch follows this list)
- May converge faster or slower than the sequential program
- The converged equilibrium may also be different if there are multiple solutions
- The Ocean simulation uses this decomposition
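The following is a minimal C sketch of one red-black sweep over the padded grid, taking red points to be those with (i + j) even. The grid size is hypothetical. Within each color phase the updates are independent, so in a threaded version a global barrier would separate the red phase from the black phase (here the phases are simply sequential loops).

#include <math.h>
#include <stdio.h>

#define SIZE 256   /* interior grid size n (hypothetical) */

static float A[SIZE + 2][SIZE + 2];   /* padded (n+2)-by-(n+2) grid */

/* One red-black iteration: update all red points ((i+j) even), then
   all black points ((i+j) odd); return the accumulated difference. */
static float red_black_sweep(int n) {
    float diff = 0.0f;
    for (int color = 0; color <= 1; color++) {   /* 0 = red, 1 = black */
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                if ((i + j) % 2 != color) continue;
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j+1] + A[i][j-1]
                                  + A[i-1][j] + A[i+1][j]);
                diff += fabsf(A[i][j] - temp);
            }
        /* in a threaded version, a global barrier goes here */
    }
    return diff;
}

int main(void) {
    printf("diff = %f\n", red_black_sweep(SIZE));
    return 0;
}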
• We will ignore the loop-carried dependence and go ahead with a straightforward loop decomposition
- Allow updates to all points in parallel
- This is yet another different update order and may affect convergence
- An update to a point may or may not see the new updates to its nearest neighbors (this parallel algorithm is non-deterministic)

while (!done)
    diff = 0.0;
    for_all i = 0 to n-1
        for_all j = 0 to n-1
            temp = A[i,j];
            A[i,j] = 0.2 * (A[i,j] + A[i,j+1] + A[i,j-1] + A[i-1,j] + A[i+1,j]);
            diff += fabs(A[i,j] - temp);
        end for_all
    end for_all
    if (diff/(n*n) < TOL) then done = 1;
end while

• Offers concurrency across elements: the degree of concurrency is n^2
• Make the j loop sequential to have a row-wise decomposition: degree-n concurrency (an OpenMP rendering of this version appears below)
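The following is a minimal OpenMP rendering of the row-wise decomposition just described: the i loop is parallel, the j loop is sequential within each thread, and diff is accumulated with a reduction. The grid size and tolerance are hypothetical.

#include <math.h>
#include <stdio.h>
#include <omp.h>

#define SIZE 256     /* interior grid size n (hypothetical) */
#define TOL  1e-3f   /* convergence threshold (hypothetical) */

static float A[SIZE + 2][SIZE + 2];   /* padded (n+2)-by-(n+2) grid */

int main(void) {
    int done = 0;
    while (!done) {
        float diff = 0.0f;
        /* Row-wise decomposition: rows are updated in parallel, so an
           update may or may not see new values in neighboring rows
           (non-deterministic, as noted above). */
        #pragma omp parallel for reduction(+:diff)
        for (int i = 1; i <= SIZE; i++)
            for (int j = 1; j <= SIZE; j++) {
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j+1] + A[i][j-1]
                                  + A[i-1][j] + A[i+1][j]);
                diff += fabsf(A[i][j] - temp);
            }
        if (diff / ((float)SIZE * SIZE) < TOL) done = 1;
    }
    printf("converged\n");
    return 0;
}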
Assignment
• Possible static assignment: block row decomposition
- Process 0 gets rows 0 to (n/p)-1, process 1 gets rows n/p to (2n/p)-1, etc.
• Another static assignment: cyclic row decomposition
- Process 0 gets rows 0, p, 2p, ...; process 1 gets rows 1, p+1, 2p+1, ...
• Dynamic assignment
- Grab the next available row, work on it, grab a new row, ...
• Static block row assignment minimizes nearest-neighbor communication by assigning contiguous rows to the same process

Shared memory version

/* include files */
MAIN_ENV;
int P, n;
void Solve();
struct gm_t {
    LOCKDEC(diff_lock);