Page 1:

Parallel Programming on the SGI Origin2000

With thanks to Moshe Goldberg, TCC and Igor Zacharov, SGI

Taub Computer Center, Technion

Mar 2005

Anne Weill-Zrahia

Page 2:

Parallel Programming on the SGI Origin2000

1) Parallelization Concepts
2) SGI Computer Design
3) Efficient Scalar Design
4) Parallel Programming - OpenMP
5) Parallel Programming - MPI

Page 3:

4) Parallel Programming - OpenMP

Page 4:

[Diagram: a race condition on a joint bank account. The initial amount is IL 500. Limor in Haifa and Shimon in Tel Aviv each read the balance of IL 500 at the same time. Limor takes IL 150 and writes back IL 350; Shimon takes IL 400 and writes back IL 100. The final amount depends on whose write lands last. Caption: "Is this your joint bank account?"]

Page 5:

Introduction

- Parallelization is requested with an instruction to the compiler:

      f77 -o prog -mp prog.f
  or:
      f77 -o prog -pfa prog.f

- Now try to understand what a compiler has to determine when deciding how to parallelize.

- Note that when we talk loosely about parallelization, what is meant is: "Is the program as presented here parallelizable?"

- This is an important distinction, because sometimes rewriting can transform non-parallelizable code into a parallelizable form, as we will see...

Page 6:

Data dependency types

1) Iteration i depends on values calculated in the previous iteration i-1 (loop-carried dependence):

      do i=2,n
        a(i) = a(i-1)      ! cannot be parallelized
      enddo

2) Data dependence within a single iteration (non-loop-carried dependence):

      do i=2,n
        c = ....
        a(i) = ... c ...   ! parallelizable
      enddo

3) Reduction:

      do i=1,n
        s = s + x(i)       ! parallelizable
      enddo

All data dependencies in programs are variations on these fundamental types.
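The Introduction slide noted that rewriting can sometimes remove a dependence. A minimal sketch of that idea (the arrays and the induction variable j are illustrative, not from the slides): the first loop carries a dependence through j, but j can be computed directly from i, and the rewritten loop has no dependence left.

      j = 0                      ! loop-carried dependence through j
      do i = 1, n
        j = j + 2
        a(j) = b(i)
      enddo

      do i = 1, n                ! rewritten: j is always 2*i
        a(2*i) = b(i)
      enddo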

Page 7:

Data dependency analysis

Question: Are the following loops parallelizable?

      do i=2,n
        a(i) = b(i-1)
      enddo

YES!

      do i=2,n
        a(i) = a(i-1)
      enddo

NO!

Why?

Page 8:

Data dependency analysis

      do i=2,n
        a(i) = b(i-1)
      enddo

YES!

              CPU1        CPU2        CPU3
  cycle 1     A(2)=B(1)   A(3)=B(2)   A(4)=B(3)
  cycle 2     A(5)=B(4)   A(6)=B(5)   A(7)=B(6)

Page 9:

Data dependency analysis

      do i=2,n
        a(i) = a(i-1)
      enddo

Scalar (non-parallel) run:

              CPU1
  cycle 1     A(2)=A(1)
  cycle 2     A(3)=A(2)
  cycle 3     A(4)=A(3)
  cycle 4     A(5)=A(4)

In each cycle, NEW data from the previous cycle is read.

Page 10:

Data dependency analysis

      do i=2,n
        a(i) = a(i-1)
      enddo

NO!

              CPU1        CPU2        CPU3
  cycle 1     A(2)=A(1)   A(3)=A(2)   A(4)=A(3)

Will probably read OLD data.

Page 11:

Data dependency analysis

      do i=2,n
        a(i) = a(i-1)
      enddo

NO!

              CPU1        CPU2        CPU3
  cycle 1     A(2)=A(1)   A(3)=A(2)   A(4)=A(3)
  cycle 2     A(5)=A(4)   A(6)=A(5)   A(7)=A(6)

May read NEW data; will probably read OLD data.

Page 12:

Data dependency analysis

Another question: Are the following loops parallelizable?

      do i=3,n,2
        a(i) = a(i-1)
      enddo

YES!

      do i=1,n
        s = s + a(i)
      enddo

Depends!

Page 13:

Data dependency analysis

      do i=3,n,2
        a(i) = a(i-1)
      enddo

YES!

              CPU1          CPU2          CPU3
  cycle 1     A(3)=A(2)     A(5)=A(4)     A(7)=A(6)
  cycle 2     A(9)=A(8)     A(11)=A(10)   A(13)=A(12)

Page 14:

Data dependency analysis

      do i=1,n
        s = s + a(i)
      enddo

Depends!

              CPU1        CPU2        CPU3
  cycle 1     S=S+A(1)    S=S+A(2)    S=S+A(3)
  cycle 2     S=S+A(4)    S=S+A(5)    S=S+A(6)

- The value of S will be undetermined, and typically it will vary from one run to the next.
- This bug in parallel programming is called a "race condition".
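OpenMP, introduced later in these slides, handles exactly this case with its reduction clause: each thread accumulates a private partial sum, and the partial sums are combined at a synchronization point. A minimal sketch, assuming a real array a(n) declared elsewhere:

      s = 0.0
c$omp parallel do reduction(+:s)
      do i = 1, n
        s = s + a(i)             ! each thread adds into its own private copy of s
      enddo
c$omp end parallel do
c     the private copies are summed into the shared s at the end of the loop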

Page 15:

Data dependency analysis

What is the principle involved here?

The examples shown fall into two categories:

1) Data being read is independent of data that is written:

      a(i) = b(i-1)    i=2,3,4...
      a(i) = a(i-1)    i=3,5,7...

2) Data being read depends on data that is written:

      a(i) = a(i-1)    i=2,3,4...
      s = s + a(i)     i=1,2,3...

Page 16:

Data dependency analysis

Here is a typical situation:

Is there a data dependency in the following loop?

      do i = 1,n
        a(i) = sin(x(i))
        result = a(i) + b(i)
        c(i) = result * c(i)
      enddo

No!

Clearly, "result" is a temporary variable that is reassigned for every iteration.

Note: "result" must be a "private" variable (this will be discussed later).
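A sketch of how this loop might be marked up once the "parallel do" directive has been introduced (arrays a, b, c, x assumed declared elsewhere): declaring "result" private gives each thread its own copy, as the note above requires. The loop index i is private by default in OpenMP.

c$omp parallel do private(result)
      do i = 1,n
        a(i) = sin(x(i))
        result = a(i) + b(i)     ! each thread uses its own private result
        c(i) = result * c(i)
      enddo
c$omp end parallel do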

Page 17:

Data dependency analysis

Here is a (slightly different) typical situation:

Is there a data dependency in the following loop?

      do i = 1,n
        a(i) = sin(result)
        result = a(i) + b(i)
        c(i) = result * c(i)
      enddo

Yes!

The value of "result" is carried over from one iteration to the next.

This is the classical read/write situation, but now it is somewhat hidden.

Page 18:

Data dependency analysis

The loop could (symbolically) be rewritten:

      do i = 1,n
        a(i) = sin(result(i-1))
        result(i) = a(i) + b(i)
        c(i) = result(i) * c(i)
      enddo

Now substitute the expression for a(i):

      do i = 1,n
        a(i) = sin(result(i-1))
        result(i) = sin(result(i-1)) + b(i)
        c(i) = result(i) * c(i)
      enddo

This is really of the type "a(i)=a(i-1)"!

Page 19:

Data dependency analysis

One more: Can the following loop be parallelized?

      do i = 3,n
        a(i) = a(i-2)
      enddo

If this is parallelized, there will probably be different answers from one run to another.

Why?

Page 20:

Data dependency analysis

      do i = 3,n
        a(i) = a(i-2)
      enddo

              CPU1        CPU2
  cycle 1     A(3)=A(1)   A(4)=A(2)
  cycle 2     A(5)=A(3)   A(6)=A(4)

This looks like it will be safe.

Page 21:

Data dependency analysis

      do i = 3,n
        a(i) = a(i-2)
      enddo

HOWEVER: what if there are 3 CPUs and not 2?

              CPU1        CPU2        CPU3
  cycle 1     A(3)=A(1)   A(4)=A(2)   A(5)=A(3)

In this case, a(3) is read and written in two threads at once.

Page 22:

RISC memory levels

[Diagram: a single CPU connected through a cache to main memory.]


Page 24:

RISC memory levels

[Diagram: multiple CPUs (CPU 0 and CPU 1), each with its own cache (Cache 0 and Cache 1), sharing one main memory.]


Page 27:

Definition of OpenMP

- An Application Program Interface (API) for shared-memory parallel programming
- A directive-based approach with library support
- Targets existing applications and widely used languages:
  * Fortran API first released October 1997
  * C, C++ API first released October 1998
- Multi-vendor/platform support

Page 28:

Why was OpenMP developed?

- Parallel programming before OpenMP:
  * standards existed for distributed memory (MPI and PVM)
  * no standard for shared-memory programming
- Vendors had different directive-based APIs for SMP:
  * SGI, Cray, Kuck & Assoc, DEC
  * vendor proprietary; similar, but not the same
  * most were targeted at loop-level parallelism
- Commercial users and high-end software vendors have a big investment in existing codes
- End result: users wanting portability were forced to use MPI even for shared memory:
  * this sacrifices built-in SMP hardware benefits
  * it requires major effort

Page 29:

The Spread of OpenMP

Organization: Architecture Review Board
Web site: www.openmp.org

Hardware: HP/DEC, IBM, Intel, SGI, Sun

Software: Portland Group (PGI), NAG, Intel, Kuck & Assoc (KAI), Absoft

Page 30:

OpenMP Interface model

Directives and pragmas:
  * control structures
  * work sharing
  * data scope attributes: private, firstprivate, lastprivate, shared, reduction

Runtime library routines:
  * control and query: number of threads, nested parallel?, throughput mode
  * lock API

Environment variables:
  * runtime environment: schedule type, max number of threads, nested parallelism, throughput mode

Page 31:

OpenMP execution model

An OpenMP program starts in a single thread, in sequential mode.

To create additional threads, the user opens a parallel region:
  * additional slave threads are launched
  * the master thread is part of the team
  * threads "disappear" at the end of the parallel region

This model is repeated as needed.

[Diagram: a master thread runs sequentially, forks into a parallel region of 4 threads, returns to the master, then forks into a region of 2 threads and later one of 3 threads.]

Page 32:

Creating parallel threads

Fortran:

c$omp parallel [clause,clause]
      code to run in parallel
c$omp end parallel

C/C++:

#pragma omp parallel [clause,clause]
{
   code to run in parallel
}

Replicated execution:

      i=0
c$omp parallel
      call foo(i,a,b)
c$omp end parallel
      print*,i

[Diagram: the single statement i=0 is followed by four threads each calling foo, then a single print*,i.]

Number of threads: set by a library call or an environment variable.
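A minimal complete program in the spirit of this slide (a sketch, not from the original deck), in which each thread reports its number. The c$ sentinel (conditional compilation, shown on a later slide) lets the same source compile even without -mp:

      program hello
      integer iam
      integer omp_get_thread_num
      iam = 0
c$omp parallel private(iam)
c$    iam = omp_get_thread_num()
      print *, 'hello from thread', iam
c$omp end parallel
      end

Before running, the number of threads would be set with the OMP_NUM_THREADS environment variable.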

Page 36:

OpenMP on the Origin 2000

Switches, formats:

  f77 -mp

A directive with a continuation line:

c$omp parallel do
c$omp+shared(a,b,c)

OR:

c$omp parallel do shared(a,b,c)

Conditional compilation (a line beginning with the c$ sentinel is compiled only when OpenMP is enabled; otherwise it is an ordinary comment):

c$    iam = omp_get_thread_num()+1

Page 37:

OpenMP on the Origin 2000 - C

Switches, formats:

  cc -mp

#pragma omp parallel for \
        shared(a,b,c)

OR:

#pragma omp parallel for shared(a,b,c)

Page 38:

OpenMP on the Origin 2000

Parallel Do Directive

c$omp parallel do private(i)
      do i=1,n
        a(i) = i+1
      enddo
c$omp end parallel do        ! the end directive is optional

Topics: Clauses, Detailed construct
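Wrapped into a complete program, the directive above might be used like this (a sketch; the array size and printed values are illustrative):

      program pardo
      integer i, n
      parameter (n = 1000)
      real a(n)
c$omp parallel do private(i) shared(a)
      do i = 1, n
        a(i) = i + 1             ! iterations are divided among the threads
      enddo
c$omp end parallel do
      print *, a(1), a(n)
      end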

Page 39:

OpenMP on the Origin 2000

Parallel Do Directive - Clauses

shared
private
default(private|shared|none)
firstprivate
lastprivate
reduction({operator|intrinsic}:var)
schedule(type[,chunk])
if(scalar_logical_expression)
ordered
copyin(var)

Page 40:

Allocating private and shared variables

[Diagram: a shared variable S exists in the single-threaded regions before and after the parallel region and is seen by all threads inside it; each thread in the parallel region also has its own private variable P.]

S = shared variable
P = private variable

Page 41:

Clauses in OpenMP - 1

Clauses for the "parallel" directive specify data association rules and conditional computation.

shared (list) - data accessible by all threads, which all refer to the same storage

private (list) - data private to each thread; a new storage location is created with that name for each thread, and the contents of that storage are not available outside the parallel region

default (private | shared | none) - default association for variables not otherwise mentioned

firstprivate (list) - same as private(list), but the contents are given an initial value from the variable with the same name outside the parallel region

lastprivate (list) - available only for work-sharing constructs; a shared variable with that name is set to the last computed value of a thread's private variable in the work-sharing construct
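A sketch of firstprivate and lastprivate in action (the variables x and y and the array b are illustrative, not from the slides): each thread's private x starts at 5.0, and after the loop the shared y holds the value computed in the sequentially last iteration, i = n.

      x = 5.0
c$omp parallel do firstprivate(x) lastprivate(y)
      do i = 1, n
        y = x + i                ! private y; x was initialized to 5.0 in every thread
        b(i) = y
      enddo
c$omp end parallel do
c     here y = 5.0 + n, the same value as in a sequential run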

Page 42:

Clauses in OpenMP - 2

reduction ({op|intrinsic}:list)
- variables in the list are named scalars of intrinsic type
- a private copy of each variable will be made in each thread and initialized according to the intended operation
- at the end of the parallel region or other synchronization point, all private copies will be combined
- the operation must be of one of the forms:
    x = x op expr
    x = intrinsic(x,expr)
    if (x.LT.expr) x = expr
    x++; x--; ++x; --x;
  where expr does not contain x

C/C++ initialization values:

  Op       Init
  + or -   0
  *        1
  &        ~0
  |        0
  ^        0
  &&       1
  ||       0

Fortran initialization values:

  Op/intrinsic   Init
  + or -         0
  *              1
  .AND.          .TRUE.
  .OR.           .FALSE.
  .EQV.          .TRUE.
  .NEQV.         .FALSE.
  MAX            smallest number
  MIN            largest number
  IAND           all bits on
  IOR or IEOR    0

Example:

c$omp parallel do reduction(+:a,y) reduction(.OR.:s)

Page 43:

Clauses in OpenMP - 3

copyin (list) - the list must contain common block (or global) names that have been declared threadprivate; data in the master thread's copy of that common block is copied to the thread-private storage at the beginning of the parallel region. There is no "copyout" clause - data in a private common block is not available outside of that thread.

if (scalar_logical_expression) - when an "if" clause is present, the enclosed code block is executed in parallel only if the scalar_logical_expression is .TRUE.

ordered - only for do/for work-sharing constructs; the code in the ORDERED block is executed in the same sequence as in a sequential execution

schedule (kind[,chunk]) - only for do/for work-sharing constructs; specifies the scheduling discipline for loop iterations

nowait - the end of a work-sharing construct and the SINGLE directive imply a synchronization point, unless "nowait" is specified

Page 44:

OpenMP on the Origin 2000

Parallel Sections Directive

c$omp parallel sections private(i)
c$omp section
      block1
c$omp section
      block2
c$omp end parallel sections

Topics: Clauses, Detailed construct
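A sketch of two independent blocks running as sections, one per thread (the subroutine names are hypothetical):

c$omp parallel sections
c$omp section
      call init_a(a, n)          ! runs on one thread
c$omp section
      call init_b(b, n)          ! may run concurrently on another thread
c$omp end parallel sections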

Page 45:

OpenMP on the Origin 2000

Parallel Sections Directive - Clauses

shared
private
default(private|shared|none)
firstprivate
lastprivate
reduction({operator|intrinsic}:var)
if(scalar_logical_expression)
copyin(var)

Page 46:

OpenMP on the Origin 2000

Defining a Parallel Region - Individual Do Loops

c$omp parallel shared(a,b)
c$omp do private(j)
      do j=1,n
        a(j)=j
      enddo
c$omp end do nowait
c$omp do private(k)
      do k=1,n
        b(k)=k
      enddo
c$omp end do
c$omp end parallel

Page 47:

OpenMP on the Origin 2000

Defining a Parallel Region - Explicit Sections

c$omp parallel shared(a,b)
c$omp section
      block1
c$omp single
      block2
c$omp section
      block3
c$omp end parallel

Page 48:

OpenMP on the Origin 2000

Synchronization Constructs

master / end master
critical / end critical
barrier
atomic
flush
ordered / end ordered

Page 49:

OpenMP on the Origin 2000

Run-Time Library Routines

Execution environment:

omp_set_num_threads
omp_get_num_threads
omp_get_max_threads
omp_get_thread_num
omp_get_num_procs
omp_in_parallel
omp_set_dynamic / omp_get_dynamic
omp_set_nested / omp_get_nested

Page 50:

OpenMP on the Origin 2000

Run-Time Library Routines

Lock routines:

omp_init_lock
omp_destroy_lock
omp_set_lock
omp_unset_lock
omp_test_lock
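A sketch of the lock routines guarding a shared counter. The integer*8 lock variable is an assumption for the 64-bit Origin; the early OpenMP Fortran spec only requires an integer large enough to hold an address.

      integer*8 lck              ! lock variable (integer*8 assumed for 64-bit Origin)
      integer count
      count = 0
      call omp_init_lock(lck)
c$omp parallel shared(count, lck)
      call omp_set_lock(lck)
      count = count + 1          ! only one thread at a time is inside the lock
      call omp_unset_lock(lck)
c$omp end parallel
      call omp_destroy_lock(lck)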

Page 51:

OpenMP on the Origin 2000

Environment Variables

OMP_NUM_THREADS or MP_SET_NUMTHREADS
OMP_DYNAMIC
OMP_NESTED

Page 52:

Exercise 5 – OpenMP to parallelize a loop

Pages 53-54:

[Slides 53-54 showed the Fortran source for Exercise 5; only the labels "initial values" and "main loop" survive in the transcript.]

Page 57:

Enhancing Performance

• Ensuring sufficient work: running a loop in parallel adds runtime costs (see the sketch below)

• Scheduling loops for load balancing
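The IF clause from slide 43 is one way to ensure sufficient work: the loop runs in parallel only when it is long enough to amortize the threading overhead. A sketch, with an illustrative threshold:

c$omp parallel do if(n .gt. 800)
      do i = 1, n
        a(i) = b(i) + c(i)       ! parallel only when n > 800; serial otherwise
      enddo
c$omp end parallel do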

Page 58:

The SCHEDULE clause

SCHEDULE (TYPE[,CHUNK])

Static: iterations are divided into chunks (of size CHUNK, or equal-sized by default) that are assigned to threads in a fixed order.

Dynamic: at runtime, chunks are assigned to threads dynamically, as threads become free.
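A sketch of dynamic scheduling for iterations of uneven cost (the chunk size and the subroutine work are illustrative): idle threads grab the next chunk of 10 iterations as they finish, which balances the load.

c$omp parallel do schedule(dynamic,10)
      do i = 1, n
        call work(i)             ! iterations handed out 10 at a time, on demand
      enddo
c$omp end parallel do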

Page 59:

OpenMP summary

- A small number of compiler directives to set up parallel execution of code, plus a runtime library including locking functions
- Portable directives (supported by different vendors in the same way)
- Parallelization is for the SMP programming model - the machine should have a global address space
- The number of execution threads is controlled outside the program
- A correct OpenMP program should not depend on the exact number of execution threads, nor on the scheduling mechanism for work distribution
- In addition, a correct OpenMP program should be (weakly) serially equivalent - that is, its results should agree with the sequential program to within rounding accuracy
- On SGI, OpenMP programming can be mixed with the MPI library, so it is possible to have "hierarchical parallelism":
  * OpenMP parallelism within a single node (global address space)
  * MPI parallelism between nodes in a cluster (network connection)