Parallel ProgrammingJacob Y. Kazakia © 20051 Parallel Programming Why Think Parallel? From supercomputers, to mainframes, to desktops, today’s computers.

Parallel Programming Jacob Y. Kazakia © 2005 1

Parallel Programming

Why Think Parallel?

From supercomputers, to mainframes, to desktops, today’s computers contain multiple processors. The authors of Numerical Recipes estimate that by 2010 an average desktop computer may have as many as 512 user controlled processors.

Is there parallel programming independent of computer architecture?

A particular program optimized for a given architecture may run hundreds of times slower on a different architecture. Yet, some general principles may be developed which help us write good , general purpose parallel code that works on a variety of architectures including purely serial machines.


some news

Parallel processing: Coming to a desktop near you Intel's Paul Otellini pronounced the megahertz era dead today News Story by Tom Krazit SEPTEMBER 07, 2004 (COMPUTERWORLD) - Intel Corp. and the PC industry are about to go through a major change in the way client computers are designed, built and marketed, Intel President and Chief Operating Officer Paul Otellini said during his introductory address today at the Fall Intel Developer Forum in San Francisco. Otellini officially pronounced the megahertz era dead during his talk. Intel has gradually shifted over the past two years away from a marketing strategy based on ever-increasing clock speeds to a plan that improves performance with new features and technologies. The company will focus on parallel processing with future products, Otellini said. This will include multicore processors, virtualization technology and a continuation of Intel's hyperthreading technology. Analysts had been hoping that Intel would provide more details about plans to introduce dual-core processors in 2005, which it abruptly announced in May. Otellini didn't take the bait, declining to even provide the code names of the upcoming processors. He did, however, reiterate that the company would introduce dual-core chips for desktops, servers and notebooks in 2005, with most of the growth coming in 2006.

http://www.computerworld.com/printthis/2004/0,4814,95738,00.html

http://www.computerworld.com/emailthis/2004/0,4831,95738,00.html

mailto:[email protected][email protected]&SUBJECT=Parallel%20processing:%20Coming%20to%20a%20desktop%20near%20you%20%2895738%29


some architecture

All in a MemoryOne of the challenges of designing parallel, multi-processor systems is to resolve the issue of memory and communication between individual processors. In multi-processor supercomputers, memory typically takes one of three forms. Symmetric Multiprocessors (SMP) These systems typically possess a relatively small number of processors (usually less than 16) which can share a common block of memory, much like office workers drawing files from a central filing cabinet. It's easy to develop fast, efficient programs for this design because every processor has direct access to all data. However, the disadvantages of this type of arrangement are that they're not scalable beyond dozens of processors. After all, only a limited number of office workers can be expected to share one filing cabinet; the same is true for processors. Also the technologies needed to connect the processors are rather expensive. Credits


more architecture

Message-passing Distributed Memory (MDM)

Distributed memory systems employing so-called message passing can accommodate thousands of processors. Each processor has a private block of memory from which to draw data. The drawbacks? Such systems tend to be slower because data often must be shuffled back and forth between the processors and they're more difficult to program. Also, there's less available software as compared with the number programs that can be run on SMP and DSM architectures. As a result, a lot of applications have either to be "ported" from other types of processors, or custom-built -- an expensive proposition. Credits


and still more

Distributed Shared Memory (DSM)

New, hybrid or distributed shared memory systems are emerging that maximize the strengths of each of the previous systems. Groups of processors (called nodes) share a local memory; the nodes are networked so that any processor can access any portion of memory. These systems should have the advantage of being scalable and quite easy to program. Though these systems are still somewhat experimental, more and more software for them is becoming available. Credits


Methods & ToolsHow much control of processors?

Generally speaking there are two different ways of writing parallel code.

a) Use a good language to write the code in a parallel processor friendly way and then let the compiler optimizer do the rest.

b) Use a general purpose interface, such as MPI, to completely control the use and participation of the various processors.

Languages for parallel code

FORTAN 90 is one such language. It is the only cross-platform standard language now available.

Other languages are Compositional C++ or CC++ and Fortran M or FM.


Compositional C++What is CC++ ?

CC++ is a general-purpose parallel programming language comprising all of C++ plus six new keywords. It is a strict superset of the C++ language in that any valid C or C++ program that does not use a CC++ keyword is also a valid CC++ program.

The CC++ extensions implement six basic abstractions: The processor object is a mechanism for controlling locality. A computation may comprise one or more

processor objects. Within a processor object, sequential C++ code can execute without modification. In particular, it can access local data structures. The keyword global identifies a processor object class, and the predefined class proc_t controls processor object placement.

The global pointer, identified by the type modifier global, is a mechanism for linking together processor objects. A global pointer must be used to access a data structure or to perform computation (using a remote procedure call, or RPC) in another processor object.

The thread is a mechanism for specifying concurrent execution. Threads are created independently from processor objects, and more than one thread can execute in a processor object. The par, parfor, and spawn statements create threads.

The sync variable, specified by the type modifier sync, is used to synchronize thread execution. The atomic function, specified by the keyword atomic, is a mechanism used to control the interleaving of

threads executing in the same processor object. Transfer functions, with predefined type CCVoid, allow arbitrary data structures to be transferred between

processor objects as arguments to remote procedure calls.

From the online book “Designing and Building Parallel Programs”, by Ian Foster http://www-unix.mcs.anl.gov/dbpp/

http://www-unix.mcs.anl.gov/dbpp/


FM languageWhat is FM ?

FM is a small set of extensions to Fortran. Thus, any valid Fortran program is also a valid FM program. (There is one

exception to this rule: the keyword COMMON must be renamed to PROCESS COMMON. However, FM compilers usually provide a flag that causes this renaming to be performed automatically.) The extensions are modeled whenever possible on existing Fortran concepts. Hence, tasks are defined in the same way as subroutines, communication operations have a syntax similar to Fortran I/O statements, and mapping is specified with respect to processor arrays.

The FM extensions are summarized in the following; detailed descriptions are provided in subsequent sections. In this chapter, FM extensions (and defined parameters) are typeset in UPPER CASE, and other program components in lower case.

A task is implemented as a process. A process definition has the same syntax as a subroutine, except that the keyword PROCESS is substituted for the keyword subroutine. Process common data are global to any subroutines called by that process but are not shared with other processes.

Single-producer, single-consumer channels and multiple-producer, single-consumer mergers are created with the executable statements CHANNEL and MERGER, respectively. These statements take new datatypes, called inports and outports, as arguments and define them to be references to the newly created communication structure.

Processes are created in process blocks and process do-loops, and can be passed inports and outports as arguments. Statements are provided to SEND messages on outports, to RECEIVE messages on inports, and to close an outport

( ENDCHANNEL). Messages can include port variables, thereby allowing a process to transfer to another process the right to send or receive messages on a channel or merger.

Mapping constructs can be used to specify that a program executes in a virtual processor array of specified size and shape, to locate a process within this processor array, or to specify that a process is to execute in a subset of this processor array.

For convenience, processes can be passed ordinary variables as arguments, as well as ports; these variables are copied on call and return, so as to ensure deterministic execution. Copying can be suppressed to improve performance.

From the online book “Designing and Building Parallel Programs”, by Ian Foster http://www-unix.mcs.anl.gov/dbpp/



MPI Resources

Are there resources for MPI ( Message Passing Interface) ?

Online book: “Designing and Building Parallel Programs”, by Ian Foster http://www-unix.mcs.anl.gov/dbpp/

The Message Passing Interface (MPI) standardhttp://www-unix.mcs.anl.gov/mpi/

Open MPI: Open Source High Performance Computinghttp://www.open-mpi.org/

Our own notes and examples among the lectures of ME 413

LAM/MPI Parallel Computinghttp://www.lam-mpi.org/

Boston University: Multiprocessing by Message Passing MPI http://scv.bu.edu/Tutorials/MPI/


http://www-unix.mcs.anl.gov/mpi/

http://www.open-mpi.org/

http://www.lam-mpi.org/

http://scv.bu.edu/Tutorials/MPI/


Fortran 90 Parallelism:Arrays and Intrinsic functions

Fortran 90 provides operations and intrinsic functions that act on arrays of data in parallel, in a manner optimized by the compiler for each particular hardware architecture.

For example, if we have : real, dimension(20) :: a, b, c

we can write: c = a + b

and the compiler causes it to be carried out on multiple pieces of data in as parallel a manner as the hardware allows.

(The usual loop construct would have caused the execution of the additions in a particular order. But this order is not at all necessary for the purpose of this operation. In other words, the serial code would over specify the desired task).

1) Array parallel Operations


F90 example

program MATADD implicit none integer i, j, k integer, parameter :: N =40 real, dimension(40,40):: A, B, C character (10) t_start, t_end

do j = 1, 40 do k = 1, 40 A(j,k)= 2.3*j - 0.9*k B(j,k)= j+k end do end do

call DATE_AND_TIME(TIME=t_start) print *,"" print *, "start time =", t_start do i = 1, 100000 do j = 1, 40 do k = 1, 40 C(j,k)=A(j,k)+B(j,k) end do end do

end do

print *, "" print *, " RESULT : ", C(40,40) print *, "" call DATE_AND_TIME(TIME=t_end) print *, "end time =", t_end print *, "" print *, "Note that the time characters printed above are in the format" print *, " hhmmss.sss " print *,"2 characters for hours, 2 for minutes, 2 for seconds.miliseconds" write(*,1) call DATE_AND_TIME(TIME=t_start) print *,"" print *, "start time =", t_start do i = 1, 100000 C=A+B

end do

print *, "" print *, " RESULT : ", C(40,40) print *, "" call DATE_AND_TIME(TIME=t_end) print *, "end time =", t_end print *, "" print *, "Note that the time characters printed above are in the format" print *, " hhmmss.sss " print *,"2 characters for hours, 2 for minutes, 2 for seconds.miliseconds" write(*,1)1 format(///)end program MATADD


Run times for F90 example

The run times for the example on the previous slide are summarized in the table below for three different machines

Compaq F on desktop

Rigel

Run #1

Rigel

Run #2

Fire cluster

Loops part

11.717 sec 15.961 sec 55.773 sec 1.092 sec

Parallel part

2.444 sec 7.620 sec 7.620 sec 1.812 sec


F 90 Par.-Broadcasts

Fortran 90 has the ability to do broadcasts from one compiler to the others.

So when we write something like:

real :: s real, dimension(20) :: x, y integer :: k

y = x + s

The additions are performed at different processors for different segments of x, while the same s is used in all processors.


F 90 Par.-Dimensional Expansion

The spread intrinsic function

Given an array x of rank m ( m <7) it creates another array of rank m+1 by copying the given array in the new dimension as many times as the lengths of the array indicate.

For example if we have the one dimensional array ( m=1) of length 20, the intrinsic function generates another two dimensional array ( m=2) of 20 x n.

202020

222

111

20

2

1

..

.....

.....

..

..

.

.

xxx

xxx

xxx

x

x

x

n copies


The spread intrinsic function

202020

222

111

20

2

1

..

.....

.....

..

..

.

.

xxx

xxx

xxx

x

x

x

n copies

This is the form of the statementusing the intrinsic spread:

a = spread ( x , dim=2 , ncopies = n )

The help facility of Compaq Fortran 90 software ( now available at our lab ) describes this function in detail. Part of this description is given in the next slide

2021

2021

2021

20

2

1

..

.....

.....

..

..

.

.

xxx

xxx

xxx

x

x

x

n copies a = spread ( x , dim=1 , ncopies = n )


spread functiondetails

SPREADTransformational Intrinsic Function (Generic): Creates a replicated array with an added dimension by making copies of existing elements along a specified dimension.

Syntax

result = SPREAD (source, dim, ncopies)

source (Input) Must be a scalar or array (of any data type). The rank must be less than 7.

dim (Input) Must be scalar and of type integer. It must have a value in the range 1 to n + 1 (inclusive), where n is the rank of source.

ncopies Must be scalar and of type integer. It becomes the extent of the additional dimension in the result.

Results: The result is an array of the same type as source and of rank that is one greater than source.If source is an array, each array element in dimension dim of the result is equal to the corresponding array element in source.If source is a scalar, the result is a rank-one array with ncopies elements, each with the value source.If ncopies <= zero, the result is an array of size zero.


example on the use of spread function

Suppose we need to calculate:

40,......3,2,140

1

jforxxwk

kjj

We could write the serial code:

real, dimension (40) :: w, xinteger :: j, k

do j = 1, 40 w( j ) = 0.0 do k = 1, 40 w( j ) = w( j ) + abs ( x ( j ) + x( k ) ) end doend do

or the parallel code:

real, dimension (40) :: w, xreal, dimension (40,40) :: a, b

a = spread( x, dim = 2, ncopies=40) b = spread( x, dim = 1, ncopies=40)a = a + b

w = sum( abs(a), dim = 1)


F90 example on dimensional expansionprogram DIMEXP implicit none integer i, j, k integer, parameter :: N =40 real, dimension(N,N):: a, b real, dimension(N) :: x, w character (10) t_start, t_end do j = 1, N x(j)= j * (1 - 0.5 * j) end do call DATE_AND_TIME(TIME=t_start) print *,"" print *, "start time =", t_start do i = 1, 100000 do j = 1, N w(j)= 0.0 do k = 1, N w(j) = w(j) + abs( x(j) + x(k) ) end do end do end doprint *, "" print *, " RESULT : ", w(N) print *, "" call DATE_AND_TIME(TIME=t_end) print *, "end time =", t_end

print *, "" print *, "Note that the time characters printed above are in the format" print *, " hhmmss.sss " print *,"2 characters for hours, 2 for minutes, 2 for seconds.miliseconds" write(*,1) call DATE_AND_TIME(TIME=t_start) print *,"" print *, "start time =", t_start do i = 1, 100000 a = spread( x, dim =2, ncopies = N) b = spread( x, dim =1, ncopies = N) a = a + b w = sum ( abs(a), dim=1) end do print *, "" print *, " RESULT : ", w(N) print *, "" call DATE_AND_TIME(TIME=t_end) print *, "end time =", t_end print *, "" print *, "Note that the time characters printed above are in the format" print *, " hhmmss.sss " print *,"2 characters for hours, 2 for minutes, 2 for seconds.miliseconds" write(*,1)1 format(///)end program DIMEXP


Run times tableThe run times for the example on the previous slide are summarized in the table below for three different machines

N=4 Compaq F on desktop

Rigel

Run #1

Rigel

Run #2

Fire cluster

Loops part 0.110 sec 0.292 sec 0.163 sec 0.016 sec

Parallel part

0.171 sec 0.290 sec 0.291 sec 0.432 sec


Rigel

Run #1

Rigel

Run #2

Fire cluster


Parallel part

51.507 sec 25.299 sec 25.413 sec 8.390 sec


Example modified

program DIMEXP implicit none integer i, j, k integer, parameter :: N =40 real, dimension(N,N):: a real, dimension(N) :: x, w character (10) t_start, t_end do j = 1, N x(j)= j * (1 - 0.5 * j) end do call DATE_AND_TIME(TIME=t_start) print *,"" print *, "start time =", t_start do i = 1, 100000 do j = 1, N w(j)= 0.0 do k = 1, N w(j) = w(j) + abs( x(j) + x(k) ) end do end do end do print *, "" print *, " RESULT : ", w(N) print *, "" call DATE_AND_TIME(TIME=t_end) print *, "end time =", t_end print *, ""

print *, "Note that the time characters printed above are in the format" print *, " hhmmss.sss " print *,"2 characters for hours, 2 for minutes, 2 for seconds.miliseconds" write(*,1) call DATE_AND_TIME(TIME=t_start) print *,"" print *, "start time =", t_start do i = 1, 100000 a = spread( x, dim =2, ncopies = N) + spread( x, dim =1, ncopies = N)

w = sum ( abs(a), dim=1) end do print *, "" print *, " RESULT : ", w(N) print *, "" call DATE_AND_TIME(TIME=t_end) print *, "end time =", t_end print *, "" print *, "Note that the time characters printed above are in the format" print *, " hhmmss.sss " print *,"2 characters for hours, 2 for minutes, 2 for seconds.miliseconds" write(*,1)1 format(///)end program DIMEXP

get rid of b matrix


Run times for modified program



Rigel

Run #1

Rigel

Run #2

Fire cluster


Parallel part

0.091 sec 0.162 sec 0.162 sec 0.435 sec


Rigel

Run #1

Rigel

Run #2

Fire cluster


Parallel part

5.441 sec 54.326 sec 14.376 sec 46.722 sec


Outer product

The outer product of two arrays a and b is formed by the relation:

c jk = aj bk with j, k = 1,2,3,…..N

This can be programmed as:

do j = 1, N do k = 1, N c(j, k) = a(j) * b(k) end do end do

or alternatively as:

c = spread(a, dim = 2, ncopies = size(b) ) * spread(b, dim = 1, ncopies = size(a) )


Matmul intrinsicMATMULTransformational Intrinsic Function (Generic): Performs matrix multiplication of numeric or logical matrices. Syntax

result = MATMUL (matrix_a, matrix_b)

matrix_a (Input) Must be an array of rank one or two. It must be of numeric (integer, real, or complex) or logical type.

matrix_b (Input) Must be an array of rank one or two. It must be of numeric type if matrix_a is of numeric type or logical type if matrix_a is logical type. At least one argument must be of rank two. The size of the first (or only) dimension of matrix_b must equal the size of the last (or only) dimension of matrix_a.

Results: The result is an array whose type depends on the data type of the arguments, according to the rules shown in Conversion Rules for Numeric Assignment Statements. The rank and shape of the result depends on the rank and shapes of the arguments, as follows:

If matrix_a has shape (n, m) and matrix_b has shape (m, k), the result is a rank-two array with shape (n, k).

If matrix_a has shape (m) and matrix_b has shape (m, k), the result is a rank-one array with shape (k).

If matrix_a has shape (n, m) and matrix_b has shape (m), the result is a rank-one array with shape (n).

If the arguments are of numeric type, element (i, j) of the result has the value SUM ((row i of matrix_a) * (column j of matrix_b)). If the arguments are of logical type, element (i, j) of the result has the value ANY ((row i of matrix_a) .AND. (column j of matrix_b)).


Matmul intrinsic 2 Examples

A is matrix[ 2 3 4 ] [ 3 4 5 ],

B is matrix [ 2 3 ] [ 3 4 ] [ 4 5 ],

X is vector (1, 2), and Y is vector (1, 2, 3).

The result of MATMUL (A, B) is the matrix-matrix product AB with the value[ 29 38 ] [ 38 50 ].

The result of MATMUL (X, A) is the vector-matrix product XA with the value (8, 11, 14).

The result of MATMUL (A, Y) is the matrix-vector product AY with the value (20, 26).


example on matmulprogram MATmultp implicit none integer i, j, k, m integer, parameter :: N =40 real, dimension(40,40):: A, B, C character (10) t_start, t_end do j = 1, 40 do k = 1, 40 A(j,k)= 2.3*j - 0.9*k B(j,k)= j+k end do end do call DATE_AND_TIME(TIME=t_start) print *,"" print *, "start time =", t_start do i = 1, 10000 do j = 1, 40 do k = 1, 40 C(j,k) = 0.0 do m= 1, 40 C(j,k)= C(j,k) + A(j,m)*B(m,k) end do end do end do end do print *, "" print *, " RESULT : ", C(40,40) print *, "" call DATE_AND_TIME(TIME=t_end) print *, "end time =", t_end

print *, "" print *, "Note that the time characters printed above are in the format" print *, " hhmmss.sss " print *,"2 characters for hours, 2 for minutes, 2 for seconds.miliseconds" write(*,1) call DATE_AND_TIME(TIME=t_start) print *,"" print *, "start time =", t_start do i = 1, 10000 C= matmul(A, B) end do print *, "" print *, " RESULT : ", C(40,40) print *, "" call DATE_AND_TIME(TIME=t_end) print *, "end time =", t_end print *, "" print *, "Note that the time characters printed above are in the format" print *, " hhmmss.sss " print *,"2 characters for hours, 2 for minutes, 2 for seconds.miliseconds" write(*,1)1 format(///)end program MATmultp


Run times for matmul example


Compaq F on desktop

Rigel

Run #1

Rigel

Run #2

Fire cluster


Parallel part

22.6 sec 2.5 sec 2.5 sec 6.23 sec


Array Sections 1DFortran 90 makes available array sections. Array sections are subsets of an already declared array. They are thus “windows into arrays” Numerical Recipes in Fortran 90.

Suppose we have the declarations:real, dimension (10) :: A = ( / 1.2, 2.3, 3, 4.7, 5.2, 6.3, 7.2, 8.1, 0.9,1.1 / )integer, dimension( 4 ) :: index = ( / 3, 4, 8, 9 / )We then have:

1.1

9.0

1.8

2.7

3.6

2.5

7.4

3

3.2

2.1

A

3.6

2.5

7.4

3

3.2

)6:2(

)10:1((:)

A

AAA

9.0

1.8

7.4

3

)(

1.1

3.6

3

)/)10,6,3(/(

indexA

A


Array Sections 2D

Suppose we have the declaration:real, dimension (6,6) :: A and that we define the array A as:

6.65.64.63.62.61.6

6.55.54.53.52.51.5

6.45.44.43.42.41.4

6.35.34.33.32.31.3

6.25.24.23.22.21.2

6.15.14.13.12.11.1

A

3.32.31.3

3.22.21.2

3.12.11.1

)3:1,3:1(

)6:1,6:1(:),(:

A

AAA

6.55.54.53.5

6.45.44.43.4

6.35.34.33.3

6.25.24.23.2

)6:3,5:2(

3.62.61.6

3.52.51.5

3.42.41.4

)3:1,6:4(

A

A


Matmul by sections

We have two arrays A and B as:

44434241

34333231

24232221

14131211

44434241

34333231

24232221

14131211

bbbb

bbbb

bbbb

bbbb

Band

aaaa

aaaa

aaaa

aaaa

A

We can write these arrays in terms of sections as:

2221

1211

2221

1211

BB

BBBand

AA

AAA

Where

)4:3,4:3(

)2:1,4:3(

)4:3,2:1(

)2:1,2:1(

22

21

12

11

AA

AA

AA

AA

)4:3,4:3(

)2:1,4:3(

)4:3,2:1(

)2:1,2:1(

22

21

12

11

BB

BB

BB

BB


Matmul by sections 2

2221

1211

2221

1211

BB

BBBand

AA

AAA

We can write:

),(),(),(),(

),(),(),(),(

2222122121221121

2212121121121111

BAmatmulBAmatmulBAmatmulBAmatmul

BAmatmulBAmatmulBAmatmulBAmatmulBA

Parallel ProgrammingJacob Y. Kazakia © 20051 Parallel Programming Why Think Parallel? From supercomputers, to mainframes, to desktops, today’s computers.

Documents

multiple processors

thousands of processors

individual processors

dozens of processors

upcoming processors

multicore processors

dualcore processors

small number of processors