BARC-1513

ISSUES IN DEVELOPING PARALLEL ITERATIVE ALGORITHMS
FOR SOLVING PARTIAL DIFFERENTIAL EQUATIONS
ON A (TRANSPUTER-BASED) DISTRIBUTED PARALLEL COMPUTING SYSTEM

by

S. Rajagopalan, A. Jethra, A.N. Khare and M.D. Ghodgaonkar
Electronics Division

and

R. Srivenkateshan and S.V.G. Menon
Theoretical Physics Division

1990
B.A.R.C. - 1513
GOVERNMENT OF INDIA
ATOMIC ENERGY COMMISSION
ISSUES IN DEVELOPING PARALLEL ITERATIVE ALGORITHMS FOR SOLVING
PARTIAL DIFFERENTIAL EQUATIONS ON A (TRANSPUTER-BASED)
DISTRIBUTED PARALLEL COMPUTING SYSTEM
by
S. Rajagopalan*, A. Jethra*, A.N. Khare*, M.D. Ghodgaonkar*, R. Srivenkateshan and S.V.G. Menon
BIBLIOGRAPHIC DESCRIPTION SHEET FOR TECHNICAL REPORT
(as per IS : 9400 - 1980)
01  Security classification :    Unclassified
02  Distribution :               External
03  Report status :              New
04  Series :                     B.A.R.C. External
05  Report type :                Technical Report
06  Report No. :                 B.A.R.C.-1513
07  Part No. or Volume No. :
08  Contract No. :
10  Title and subtitle :         Issues in developing parallel iterative
                                 algorithms for solving partial differential
                                 equations on a (transputer-based)
                                 distributed parallel computing system
11  Collation :                  29 p., 4 tabs., 2 figs., 3 appendixes
13  Project No. :
20  Personal author(s) :         (1) S. Rajagopalan; A. Jethra;
                                     A.N. Khare; M.D. Ghodgaonkar
                                 (2) R. Srivenkateshan; S.V.G. Menon
21  Affiliation of author(s) :   (1) Electronics Division, Bhabha Atomic
                                     Research Centre, Bombay
                                 (2) Theoretical Physics Division, Bhabha
                                     Atomic Research Centre, Bombay
22  Corporate author(s) :        Bhabha Atomic Research Centre,
                                 Bombay-400 085
23  Originating unit :           Electronics Division, B.A.R.C., Bombay
24  Sponsor(s) Name :            Department of Atomic Energy
    Type :                       Government
30  Date of submission :         June 1990
31  Publication/Issue date :     July 1990
40  Publisher/Distributor :      Head, Library and Information Division,
                                 Bhabha Atomic Research Centre,
                                 Bombay-400 085
42  Form of distribution :       Hard copy
50  Language of text :           English
51  Language of summary :        English
52  No. of references :          7 refs.
53  Gives data on :
60  Abstract : Issues relating to implementing iterative procedures
for the numerical solution of elliptic partial differential
equations on a distributed parallel computing system are
discussed. Preliminary investigations show that a speed-up of
about 3.85 is achievable on a four-transputer pipeline network.
2. G.C. Fox and S.W. Otto, "Algorithms for Concurrent Processors", Physics Today (May 1984).
3. D. Pountain and D. May, A Tutorial Introduction to Occam Programming, BSP Professional Books, London (1987).
4. M.J. Quinn, Designing Efficient Algorithms for Parallel Computers, McGraw-Hill Book Company, New York (1987).
5. Transputer Databook, Inmos Limited, UK (1989).
6. Transputer Development System, Prentice Hall International, London (1988).
7. E.L. Wachspress, Iterative Solution of Elliptic Systems and Applications to the Neutron Diffusion Equations of Reactor Physics, Prentice Hall International, London (1966).
APPENDIX A: Concepts in Parallelism
Since an iterative process is generally time-consuming, we would like to increase the power of the computer. There are two options: either increase the performance of a single processor by using the latest technology, or apply more than one processor to the problem.
A.1 Types of parallelism: There are many classes of parallelism, SIMD, MIMD, etc.; for details, refer [1] and [4]. For example, in the case of the Single Instruction Multiple Data (SIMD) machine, a master control unit issues the same instruction simultaneously to many slave processors operating in lockstep, each of which enacts it (with a limited degree of local autonomy) on its own set of data. In the case of the Multiple Instruction Multiple Data (MIMD) machine, each processor carries its own program and data and works independently of the others until it becomes necessary to communicate with them. We have applied a version of the SIMD approach to both problems and the MIMD approach to the generalized Helmholtz equation. For the latter problem, the SIMD version was far superior (for at least one obvious reason, which is mentioned in the appropriate section).
A.2 Topologies: There are various topologies which may be used in networking the processors. In the pipeline topology, each processor (except the first and the last) is connected to a left and a right processor. The processors form a ring if the last is connected to the first. In a mesh architecture, the processors are placed on the points of a rectangular grid, with connections being mapped onto the grid lines.
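As a minimal Occam sketch of the pipeline topology (the channel array and the trivial stage behaviour are illustrative assumptions, not taken from the report's code), four stages can be connected so that stage i reads from its left neighbour and writes to its right one:

[5]CHAN OF INT c :     -- c[0] is fed, and c[4] drained, by processes not shown
PAR i = 0 FOR 4
  WHILE TRUE
    INT x :
    SEQ
      c[i] ? x         -- take a value from the left neighbour
      c[i + 1] ! x     -- pass it on to the right neighbour

Connecting the last stage back to c[0] instead of c[4] would turn the pipeline into a ring.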
A.3 Performance Evaluation: In devising a parallel algorithm, we are concerned with maximizing the speed-up and the efficiency. These parameters are defined as follows:

    speed-up   = t1 / tp
    efficiency = t1 / (p * tp)

where t1 is the time taken by a sequential program on one processor and tp is the time taken by a parallel program on p processors.
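As an illustration, the speed-up of about 3.85 reported in this work for a four-transputer pipeline (p = 4) corresponds to an efficiency of 3.85 / 4, i.e. about 0.96.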
Two considerations are important in discussing efficiency. First, the nodes spend some time communicating with their neighbours. This term is minimized if the internode communications demanded by the algorithm always proceed by a "hardwired" path. Load balancing is the second factor affecting efficiency. One needs to ensure that all nodes have essentially identical computing loads.
A.4 Deadlock

A set of active concurrent processes is said to be deadlocked if each holds non-preemptible resources that must be acquired by some other process in order for it to proceed [1]. As an example, consider two processes A and B. Each has to send data to and receive data from the other. If, at any point during execution, both are sending data to each other, then a deadlock will occur. This is because each is waiting for the other to receive its data but neither is receiving.
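A minimal Occam sketch of this situation (the channel and variable names are illustrative assumptions, not taken from the report's code):

CHAN OF INT a.to.b, b.to.a :
PAR
  INT x :
  SEQ                -- process A: outputs first, then inputs
    a.to.b ! 1       -- blocks until B performs the matching input ...
    b.to.a ? x
  INT y :
  SEQ                -- process B: also outputs first
    b.to.a ! 2       -- ... while B blocks waiting for A: deadlock
    a.to.b ? y

The deadlock disappears if one process reverses the order of its communications, or if each process performs its output and input concurrently under a PAR, since either change allows both transfers to complete.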
A.5 General strategies for exploiting a parallel computing system: There are three such strategies:

1. Geometric parallelism: Each processor executes the same (or almost the same) program on data corresponding to a region of the system being simulated and communicates boundary data to neighbouring processors handling neighbouring regions.

(We have applied this strategy to both problems. The domain of the system being simulated was, in each case, mapped onto a 2- (or 3-) dimensional rectangular grid, and we attempted to predict the behaviour of the (evenly spaced) grid points at steady state. The domain was horizontally divided into p (= four) rectangular regions, and each processor in the pipeline worked on one, with the first processor processing the top region and the last processor the bottom strip. A sketch of such a worker process is given after this list.)

Note that this strategy is useful, when applicable, for problems involving large amounts of data.
2. Algorithmic parallelism: Each processor is responsible for part of the algorithm.

(We applied this strategy to the second problem. Since we could divide the algorithm into only two parts, we could not fully exploit the parallel computing system.)
3. Event parallelism: Each processor executes the same program in isolation from all the other processors.

We could not apply this strategy here; it is applicable, for instance, to neutron transport calculations employing the Monte Carlo method.
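As an illustration of the geometric strategy, one stage of the pipeline might be organized as in the following Occam sketch. The process and channel names and the strip dimensions (modelled on the constants blck and overlap of Appendix C) are assumptions for illustration, not the report's actual code:

PROC worker (CHAN OF ANY from.left, to.left, from.right, to.right)
  [34][130]REAL32 u :   -- blck (= 32) working rows plus one shared row on each side
  BOOL running :
  SEQ
    running := TRUE
    ...  initialise the local strip
    WHILE running
      SEQ
        ...  relax the interior rows and set running from the error test
        PAR             -- exchange the shared rows with both neighbours
          to.left ! u[1]
          to.right ! u[32]
          from.left ? u[0]
          from.right ? u[33]
:

The first and last stages of the pipeline would, of course, use only one pair of channels.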
APPENDIX B: The Parallel Computing System
B.1 Hardware
This section briefly describes the parallel system being developed by the Electronics Division of BARC.
In general terms, the system may be described as a distributed memory (that is, no shared memory) parallel
computing system.
The basic element in the system will be the Inmos T800 transputer chip (peak: 30 MIPS and 4.3 Mflops). This is a reduced instruction set computer which integrates a 32-bit microprocessor, a 64-bit floating point unit, 4 Kbytes of on-chip RAM, interfaces to additional memory and I/O devices, and 4 bidirectional serial links (2.4 Mbytes/s per link) which provide point-to-point connections to other transputers. The links are autonomous direct-memory-access devices that transfer data without any processor intervention.

Communication between processes is achieved by means of channels: a channel between two processes running on the same processor is implemented by a single word in memory, while a channel between two processes running on different transputers is implemented by means of the above links. Process communication is synchronized and unbuffered. Therefore, communication can only begin when both processes are ready.
The transputer supports concurrency by means of a built-in microscheduler. Active processes on a processor, waiting to be executed, are held in queues. This removes the need for any software kernel.
The present system has the following features:

1. A personal computer (IBM PC-AT)
2. One root transputer (Inmos T414)
3. Four T800 transputers.
The PC acts as the host; the root transputer interfaces with the host and is networked to the other transputers. At present, the network is hardwired, as shown in Figure 2 below. For example, in the case of the root transputer, link 0 is connected to the host PC, link 1 is connected to link 1 of the transputer labelled P3, and link 2 is connected to link 3 of the transputer labelled P0.
The system, when completed, will have, apart from additional T800 transputers, Inmos C004 link switches. The network may then be dynamically reconfigured into many architectures, including those mentioned above.
FIGURE 2: The present hardwired network: the IBM PC-AT host connected to the root transputer, which is linked to the four T800 transputers P0 to P3.
For more details, refer [5].
B.2 Occam
As indicated earlier, the transputer was developed with a language-first approach, the language being Occam. Occam has thus been specifically designed for parallel processing, and the hardware and software architectures required for a specific program are defined by an Occam description. Although Fortran, Pascal and C are supported, programming in Occam provides maximum performance and allows exploitation of concurrency. It allows parallelism to be expressed explicitly. It has the PAR construct: statements following PAR are executed concurrently.
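For example (a minimal fragment with assumed variable names):

INT a, b :
PAR
  a := 1     -- these two assignments proceed concurrently;
  b := 2     -- the PAR terminates when both have finished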
Two Occam processes communicate with each other along channels. As mentioned earlier, the channels are memory locations if the processes are on the same transputer and hardware links otherwise. The synchronization of two processes is realized by the hardware, so the programmer need not worry about this crucial issue of parallel programming.
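A minimal illustration with assumed names, showing two processes synchronizing over a channel:

CHAN OF INT c :
PAR
  c ! 42     -- the sender blocks until the receiver is ready
  INT v :
  c ? v      -- the transfer occurs when both sides reach the channel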
An Occam program can be executed on one transputer, where its validity and response times can be checked; if the "software network" defined by the Occam program can be realized on the actual system, the program can be mapped onto the hardware architecture with only minor modifications.
For more details of the language, refer [3].
B.3 Support Software
The Transputer Development System is an Occam development system which provides a completely integrated environment, including the INMOS folding editor. This provides a secure development environment with limited reliance on host services, which means that no operating system support from the host is required.
B.4 Configuration Programs and Loading the Transputer Network
An application must be expressed as a number of parallel processes. Once this is done, the programmer needs to add information describing the link topology and the association of each code with an individual transputer. This is called configuration, and the additional code is called the configuration program. There are two configuration files: one for the code running on the root transputer and another for the different codes running on the network.
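A minimal sketch of such a configuration (the process names producer and consumer are assumptions; the link names are those defined in Appendix C, section 1): two separately compiled processes are placed on two transputers, and the channel joining them is placed on the hardware link that connects the machines.

#USE "links4.tsr"
CHAN OF ANY pipe :
PLACED PAR
  PROCESSOR 0 T8
    PLACE pipe AT link2out :   -- output end of the channel on link 2
    producer (pipe)
  PROCESSOR 1 T8
    PLACE pipe AT link3in :    -- input end on link 3 of the neighbour
    consumer (pipe)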
APPENDIX C: Codes for Selected Programs
1. LOGICAL NAMES OF LINKS USED BY CONFIGURATION PROGRAMS (#USE links)

VAL link0in IS 4:
VAL link0out IS 0:
VAL link1in IS 5:
VAL link1out IS 1:
VAL link2in IS 6:
VAL link2out IS 2:
VAL link3in IS 7:
VAL link3out IS 3:
2. PREDEFINED VALUES USED BY NETWORK PROGRAMS (#USE params)

VAL REAL32 d1 IS 1.34(REAL32):
VAL REAL32 d2 IS 0.9(REAL32):
VAL REAL32 siga1 IS 0.00173(REAL32):
VAL REAL32 siga2 IS 0.00492(REAL32):
VAL REAL32 sign1 IS 0.000953(REAL32):
VAL REAL32 sign2 IS 0.00591(REAL32):
VAL REAL32 sig1 IS 0.008(REAL32):
VAL REAL32 zt IS 525.0(REAL32):
VAL REAL32 xt IS 525.0(REAL32):
VAL REAL32 yt IS 525.0(REAL32):
VAL REAL32 epsi IS 0.001(REAL32):
VAL REAL32 epso IS 0.001(REAL32):
VAL INT nout IS 100:     -- ITERATIONS CANNOT EXCEED nout
VAL INT ni12 IS 126:     -- NI - 2
VAL INT ni11 IS 127:     -- NI - 1
VAL INT ni IS 128:       -- # OF ROWS IN GRID
VAL INT ni1 IS 129:      -- NI + 1
VAL INT nj12 IS 126:     -- NJ - 2
VAL INT nj11 IS 127:     -- NJ - 1
VAL INT nj IS 128:       -- # OF COLUMNS IN GRID
VAL INT nj1 IS 129:      -- NJ + 1
VAL INT nt IS 4:         -- # OF TRANSPUTERS IN PIPELINE
VAL INT nt1 IS 3:        -- NT - 1
VAL INT overlap IS 1:    -- SHARED ROWS OF DIFFERENT PROCESSES
VAL INT overlap1 IS 2:   -- OVERLAP + 1
VAL INT blck12 IS 30:    -- BLCK - OVERLAP1
VAL INT blck11 IS 31:    -- BLCK - OVERLAP
VAL INT blck IS 32:      -- (NI / NT) ROWS PROCESSED BY EACH TRANSPUTER
VAL INT blck1 IS 33:     -- BLCK + OVERLAP
VAL INT blck2 IS 34:     -- BLCK + OVERLAP1
3. CONFIGURATION FILE FOR ROOT TRANSPUTER

#USE "links4.tsr"
... SC PRGhlmhlz
CHAN OF ANY pipein, pipeout:
PLACE pipein AT link2in:
PLACE pipeout AT link2out:
pghlmhlz(keyboard, screen, pipein, pipeout)
4. PROGRAM ON ROOT TRANSPUTER

#USE userio
PROC pghlmhlz(CHAN OF INT keyboard, CHAN OF ANY screen, pipei, pipeo)
  SEQ
    -- PROMPT TO SCREEN
    newline(screen)
    write.full.string(screen, "PLEASE WAIT. ")
    newline(screen)
    -- PROMPT TO 0-TH PROCESSOR
    pipeo ! 0
    -- OUTPUT FROM 0-TH PROCESSOR
    INT count:
    REAL32 time, errf, errt, errs, ev:
    SEQ
      pipei ? time
      write.full.string(screen, "TIME:")
      write.real32(screen, time, 5, 5)
      newline(screen)
      pipei ? count
      write.full.string(screen, "# OF ITERATIONS:")
      write.int(screen, count, 5)
      newline(screen)
      pipei ? errf
      write.full.string(screen, "ERRF:")
      write.real32(screen, errf, 5, 5)
      newline(screen)
      pipei ? errt
      write.full.string(screen, "ERRT:")
      write.real32(screen, errt, 5, 5)
      newline(screen)
      pipei ? errs
      write.full.string(screen, "ERRS:")
      write.real32(screen, errs, 5, 5)
      newline(screen)
      pipei ? ev
      write.full.string(screen, "EV:")
      write.real32(screen, ev, 5, 5)
      newline(screen)
    -- PROMPT TO USER
    write.full.string(screen, "HIT ANY KEY TO RETURN TO TDS")
    INT t:
    keyboard ? t
:
5. CONFIGURATION FILE FOR NETWORK PROGRAM

#USE "links8.tsr"  -- LIBRARY OF LOGICAL NAMES FOR LINKS
... SC p0PRGhz2    -- 0-TH PROCESSOR
... SC pOPRGhz2    -- INTERMEDIATE PROCESSOR (ODD)
... SC pEPRGhz2    -- INTERMEDIATE PROCESSOR (EVEN)
... SC pNPRGhz2    -- LAST PROCESSOR IN PIPELINE
-- piperw: RIGHTWARD PIPELINE LINK FOR UPDATING SHARED REGION
-- pipelw: LEFTWARD PIPELINE LINK FOR UPDATING SHARED REGION
[4]CHAN OF ANY piperw, pipelw:
-- piperw1: RIGHTWARD PIPELINE LINK FOR HALTING CHECK
-- pipelw1: LEFTWARD PIPELINE LINK FOR HALTING CHECK