
The LOOP Approach, a new Method for the Evaluation of Parallel Systems

Jürgen Brehm

Address until July 20th, 1995:

Department of Computer Science
Vanderbilt University
Box 1679, Station B
Nashville, TN 37235

USA

Address after July 20th, 1995:

Institut für Rechnerstrukturen und Betriebssysteme
Universität Hannover

Lange Laube 3
30159 Hannover

Germany

e-mail: [email protected]

Abstract

The increasing number of different parallel computers requires a method to compare the performance of such systems. Values like MIPS and MFLOPS often used by computer vendors are normally of secondary value since such information says little about the behavior of real applications running on a certain system. This problem, well known from single processor systems, is complicated even further in multiprocessor systems. Architectural features such as the arrangement of processors and the performance of interconnection networks significantly influence the overall system performance and cannot be described by these values.

This paper describes a new “LOOP” approach to benchmark message passing multiprocessor systems. The approach uses a description language for parallel workloads, a program generator for deadlock free message passing programs, and an interface to a visualization tool. The package using the approach has been implemented on three different machines: the MEIKO Transputer system, the nCUBE/2 Hypercube, and the Intel Paragon. After a brief discussion of existing multiprocessor benchmarks, the paper describes the new approach in detail and presents results for the MEIKO, nCUBE, and Paragon systems.

Keywords: Benchmarking, Workload Characterization, Performance Evaluation


1. Introduction

Benchmarking computer systems is an important issue for both computer architects and users. The purposes of benchmarking range from identifying architectural bottlenecks to informing purchase decisions. The number of multiprocessor architectures has increased substantially in recent years, and so have the efforts to evaluate these machines. Unfortunately, none of the existing approaches is feasible for a wide range of existing multiprocessors or is adaptable to user-defined workloads. This paper describes a portable high-level workload description language (the LOOP language) for parallel systems. To automatically produce program code for the different systems, a program generator was developed that translates the LOOP computation and communication instructions into instrumented parallel C code. The instrumentation results in trace data that are visualized by appropriate tools. Thus, a user can evaluate a system for the specific workloads of his applications. Additionally, LOOP descriptions of a set of parameterized workloads are part of the LOOP benchmark package. These workloads are used to compare different systems.

The rest of the paper is organized as follows. Section 2 gives a brief survey of standard benchmark approaches. Section 3 describes the benchmarking of parallel systems in general, and section 4 explains the LOOP approach in detail. A set of parameterized parallel workloads that are used as standard benchmarks is described in section 5. Results for these workloads on the MEIKO, nCUBE, and Paragon systems are provided in section 6. The final section concludes the paper with an outlook on future work.

2. Standard Approaches

This section describes several well known benchmark tests. An overview of existing benchmarks for single processor computers provides a useful perspective. Most benchmark programs consist of synthetic programs or real applications (uni- and multiprocessor applications).

Often, small benchmark programs are used to get a first impression of a system’s performance. In general these programs are easy to port to another machine, but they typically measure only a single aspect of the machine, for example, the integer performance.

Dhrystone: A well known benchmark program is Dhrystone. This program was originally written in Ada by R. Weicker and was later implemented in C by R. Richardson. Dhrystone evaluates the performance of the CPU and the compiler. This synthetic benchmark program /Wei91/ generates a representative workload which is typical for single processor machines. By examining a large amount of program code and analyzing the types of operations, Weicker tries to recreate typical program behavior. The resulting benchmark is one where the frequency of each operation type mimics that of the examined programs.

The performance of the CPU and the optimization features of the compiler are tested with this benchmark. It does not test floating point arithmetic, nor does it stress the operating system. Due to its small size, it fits in almost every cache and may exaggerate the cache’s effectiveness. Much work has been done to keep compilers from doing special optimizations specifically for Dhrystone.

Whetstone: The Whetstone benchmark test was designed to test the performance of floating point operations. Like Dhrystone, it is a synthetic benchmark whose procedures are designed to generate typical workload behavior rather than to execute a specific task.

SSBA: SSBA is a benchmark suite assembled by the French UNIX user group (AFUU). It has especially been designed to test the performance of UNIX-based systems. A recent version tests multiprocessor systems.

Linpack: Another approach to benchmarking is to examine the performance of real applications. Compilers, databases, and other programs are often used to simulate the overall system performance of a general purpose machine. Two famous suites of such programs are Linpack and SPEC.

The Linpack benchmark is widely used in scientific environments. It consists of several procedures which calculate problems such as the solution of large systems of linear equations, matrix multiplications, and dot products.

SPEC: The SPEC benchmark suite consists of compilers, databases, and other application programs that are typically found on general purpose machines /Spec91/. They are a typical mixture of floating point intensive, integer arithmetic intensive, and memory bound applications. Large production code packages such as SPICE are used to create similar effects. The newer versions of the SPEC benchmark suite also contain some synthetic benchmarks.

SLALOM: SLALOM is used to test parallel systems. The most distinctive feature of the SLALOM benchmark is that instead of fixing the problem size and measuring the execution time, the execution time is fixed and the problem size is chosen such that the benchmark completes in the allotted time. In /Sla90/ the algorithm, which calculates how a coupled set of diffuse surfaces emits and absorbs radiation, is introduced, and it is shown how the problem size can be scaled.

SPLASH: SPLASH is a set of several typical applications which are often used on parallel systems. These applications are well documented and thus can be used for benchmarking a system. This set of applications is described in /Sin91/.

3. Benchmarking Parallel Systems

3.1. The System Under Test

There are several approaches to the evaluation of systems, depending on the desired level of abstraction. Although there is a continuum of possible views, two examples of different abstraction levels are illustrated in Figure 1.

For the programmer of a high level application, the system under test (SUT) includes several components such as the compiler, the operating system, and the underlying hardware. In a multiprocessor environment the interconnection network is also part of the SUT. The programmer might be interested in several performance features, including:

• response time,

• elapsed time,

• resource utilization,

• communication patterns,

• concurrency profile, and the

• space-time-diagram.

A different view of the same computer system is shown in Figure 1b. A hardware developer is normally less interested in the performance of the compiler or the operating system. Thus, the benchmarks of most use are specifically designed for the evaluation of certain components of interest. The features of most interest to the hardware developer include:

• native MIPS,

• cache hit rate,

• bus utilization, and

• memory access times.

The system under test considered in this paper is the one of the application programmer, Figure 1a. The compiler, the operating system, and the underlying hardware are regarded as a black box. The instrumentation yields data on the single node performance, the communication behavior, and the overall performance. By looking at the amount and speed of communication at the nodes, potential and existing bottlenecks in the interconnection network may be found.

In the field of parallel computing, a broad black box approach with fixed workloads (e.g., as used by SPEC) is no longer adequate. Some knowledge of the machine architecture always influences the design of the application program. The LOOP method introduced in the next section allows certain machine dependent optimizations.

Figure 1: SUT depending on the level of abstraction (a. SUT for the application programmer; b. SUT for the hardware developer).

The term parallel systems as used here refers to massively parallel computer systems and not to architectures such as multiprocessor workstations. Benchmarking the latter is similar to the evaluation of single processor architectures. These environments normally run a number of processes per processor with little communication between them. They are programmed in a code-parallel manner. Besides evaluating pure processor performance, benchmark tests must determine processor capabilities, e.g., how many processes can be handled at the same time and how long the context switches take. Under this scenario, workload mixes consisting of conventional benchmarks can be used, as long as it is guaranteed that all available processors have some computational work to do. Workstation networks running applications in an SPMD (single program multiple data) mode can also be considered massively parallel systems and can thus be evaluated using the LOOP approach.

The situation in evaluating massively parallel computers is more difficult than for single processor architectures. Standard benchmark tests cannot be used for such systems since they have not been specially designed for these architectures. Special algorithms are required since applications normally are tuned to certain processor or cache topologies. Contrary to single processor architectures, massively parallel machines are typically not stressed under normal workload conditions. This creates the need for a new kind of benchmark: the communication features of the system should be evaluated, and special workload characteristics should be described. The remainder of this section summarizes the approach of a new benchmark that is able to evaluate the overall system performance of massively parallel computer systems. The advantages of this parameterized benchmark over previous benchmarks are given, and the methods by which the new approach can be applied are explained.

3.2. Using Parameters

None of the existing benchmark tests described in section 2 allows the use of parameters by which the user load can be calibrated. This restricts the influence that a user has on the execution of a benchmark program. This restriction is useful for ensuring that results are uniform and comparable. On the other hand, the user is bound to a program which probably does not represent the same workload as the specific application of interest. Giving the user the ability to describe the behavior of a certain application enables the benchmark to mimic the behavior of the described program. Two approaches are possible.

A first and rather static approach is to create a single program that is able to change its behavior according to input parameters from the user. Such a program can change its behavior only in a restricted manner.

Another, more flexible approach is to develop a program that not only makes use of these parameters, but also generates different programs. These synthetic programs then comprise the benchmark workload. This implies the creation of a Benchmark Generator rather than the development of a benchmark program in isolation.

In both cases, the use of parameters has the important advantage that only a single program has to be ported to different machines in order to obtain a wide variety of synthetic workloads. This implies that a user does not have to port an application program to the new architecture to investigate its behavior. By incorporating several scaling parameters it is possible to simulate the application’s (communication and computation) behavior under different conditions.


A second important advantage of this approach is the fact that special features of a system’s performance can be tested individually. For example, different kinds of message patterns can be generated by manipulating message size and frequency parameters.

The benchmark generator also has another advantage over simply porting special applications and using them as benchmark programs: evaluation facilities and tracing capabilities can automatically be included. Using a set of parameters to describe an application implies a trade-off between the conflicting goals of easy usability and model representativeness. A large number of parameters makes it easier to create a workload with behavior close to the application from which these parameters are derived. However, the extra parameters add to the complexity of the benchmark. Ideally, the benchmark should be characterized by a small set of parameters while not sacrificing representativeness.

4. The LOOP Method

In the LOOP approach /BBS94a,Schl93/, the workload is not defined as one specific program described in detail. Instead, an environment for a user specified evaluation of parallel computer systems is provided. The LOOP method has been developed assuming that the user has structural knowledge of the intended workload. The benchmark generator then constructs a workload with the same structural characteristics.

It is often useful to obtain a first impression of a new algorithm’s behavior on a known machine. The exact amount of code related to communication handling does not have to be specified. Instead, one can concentrate on the algorithm itself. Some predefined standard workloads described in Section 5 can be used to get a first impression of the system without the need to fully implement a user specific application.

In Figure 2, the LOOP approach is illustrated. A structural load description (LOOP program) is fed into the generator. The generator produces the corresponding parallel instrumented program. The program can be run on the target architecture with different input parameters, and the behavior can be analyzed using collected trace information. The central part of the LOOP method is the Workload Generator. This generator is the only program that has to be ported to a new machine in order to test the new machine with a wide variety of workloads.

Although the problems evaluated using the LOOP method can be defined at a high level of abstraction, the use of a powerful visualization tool allows the user to examine such things as the communication structure in detail. Communication bottlenecks in the hardware or in the chosen algorithm can be detected. It can be determined whether a certain network topology is suitable for a problem with specific characteristics. Examples are described in sections 5 and 6.

Figure 2: Generating workloads using the LOOP method. Structural parameters (the LOOP program: program structure, data structure, communication structure, computation structure) are fed into the Workload Generator, which produces a parallel instrumented program; the program runs on the parallel machine with runtime parameters (problem size, number of processors) and produces trace data for visualization.

Given a description of the workload and the architecture to be tested, the workload generator constructs a program in standard C which is executed on the target system. To have a widely accepted communication model, the generator uses the PICL library /PICL90/ (Portable Instrumented Communication Library). Besides the ability to write portable code for massively parallel systems, this library is capable of tracing basic communication instructions. PICL trace information can be analyzed with a visualization tool, ParaGraph /Para92/.

The main goal in the design of the LOOP model is to ensure ease of use. This is accomplished by making the description of the workload significantly more concise than the user’s actual application. In situations where a new architecture is to be evaluated, this is especially important. Another important goal in the LOOP design is to make the programming of parallel program communication for message passing systems as easy as possible.

In the LOOP method, the user can describe programs in a pseudo-code like manner similar to that often found in the literature, for example, /Gol83,Pre88/. To accomplish this, the LOOP language is an extension of standard C and PICL /Schl93/. LOOP constructs and data structures can easily be manipulated. The implementation of the abstract constructs for the description of communication workloads guarantees deadlock-free workloads, because sender and corresponding receiver are automatically addressed as pairs.

4.1. How to write LOOP programs

The first step in the evaluation of a parallel system with the LOOP method is to provide structural information about the desired workload to the generator. Based on this information, the generator produces executable code for the target system. All structural information is given using the LOOP language.

In a typical experiment, it is often useful to execute the same workloads with different problem sizes and with a varying number of processors. Such sensitivity analysis finds limitations in the hardware regarding memory or cache behavior. Therefore, certain parameters can be specified to the generated workload at runtime. This implies that even without rebuilding the program, problem size limitations of algorithms or hardware can be tested.

4.1.1. Structural Parameters

The structural parameters are specified via LOOP language constructs. In this section, the most important constructs are explained and their usage is demonstrated via examples.

4.1.1.1. Programs and Data Structures

Considering problems normally solved on massively parallel systems, numerical applications are arguably the most important. Numerical programming problems mainly deal with operations on matrices and vectors. Programs for these kinds of problems, therefore, consist of iterations over matrices and vectors. For this reason, the design of the LOOP language focuses on offering convenient ways to describe such operations.

Along with the \LOOP construct, which determines loop nesting, several instructions handling high-level data structures are available. In Figure 3, a LOOP program abstraction of a simple parallel matrix multiplication is shown. The use of the \LOOP construct and the declaration of high-level data structures can be seen. The communication is specified via the \COMMUNICATE statement, which is explained later. The declared matrices are allocated dynamically by the \MAT construct and initialized as specified by the \INIT_ARRAY construct. This initialization can be omitted if specific array values are not required.

All LOOP language instructions are prefixed with a backslash (\). The generator recognizes these tagged instructions and converts them to normal C. This implies that additional C code can be incorporated directly into the LOOP program.

int main(void)
{
    int nodes, me, host, prob_size, amount;

    \OPEN0(&nodes, &me, &host, &prob_size, TRACE_BUF_SIZE);

    /* Declarations: */
    \MAT(double) Mat1, Mat2, Mat3;

    /* random init */
    \INIT_ARRAY(Mat1);
    \INIT_ARRAY(Mat2);
    \INIT_ARRAY(Mat3, 0);

    amount = sizeof(double) * prob_size * prob_size / nodes;

    \LOOP i3 {
        \LOOP i2 {
            \LOOP i1 {
                Mat3[i3][i2] += Mat1[i3][i1] * Mat2[i1][i2];
            }
            \COMMUNICATE(amount, 1, 1);
        }
    }
    \CLOSE0();
}

Figure 3: Complete program for a parallel matrix-matrix multiplication

The construct

\LOOP [iterator]

in Figure 3 is used to describe iterations over the complete problem size. The use of iteration variables (i1 - i3 in our example) is optional. The variables can be omitted if they are not needed. The default number of LOOP iterations is the problem size. No definition or initialization for the iterators and the matrices is necessary. Iteration variables are automatically declared by their use in the \LOOP construct. High-level data structures like matrices are declared and initialized by using special instructions which, in our example, are the constructs

\MAT and \INIT_ARRAY.

The code given in Figure 3 is the complete input for the generator. The result of the generator is a parallel instrumented program, PIP (see Figure 2). The benchmark package includes a user friendly interface that aids in all phases of the machine evaluation. After the PIP is generated, the code is compiled and executed. A tracefile is collected and written to disk.
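As a rough illustration, the computational part of the \LOOP nest in Figure 3 might expand to C code along the following lines. This is only a minimal single-node sketch under stated assumptions: the function name, the flattened row-major layout, and iterating each \LOOP over the full problem size n are illustrative choices, not the generator's actual output, and the PICL instrumentation and the \COMMUNICATE expansion are omitted.

```c
/* Hypothetical expansion of the triple \LOOP nest from Figure 3:
   every \LOOP iterates over the problem size n, and the body
   accumulates Mat3 += Mat1 * Mat2 for n x n row-major matrices. */
static void loop_body(int n, const double *Mat1, const double *Mat2,
                      double *Mat3)
{
    for (int i3 = 0; i3 < n; i3++)          /* \LOOP i3 */
        for (int i2 = 0; i2 < n; i2++)      /* \LOOP i2 */
            for (int i1 = 0; i1 < n; i1++)  /* \LOOP i1 */
                Mat3[i3 * n + i2] += Mat1[i3 * n + i1] * Mat2[i1 * n + i2];
}
```

On a real target the generator would additionally place the expanded communication code after the innermost loop and wrap the whole program in the PICL open/close and tracing calls.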

4.1.1.2. Computational Load

In the above example the computational load results from the statement

Mat3[i3][i2] += Mat1[i3][i1] * Mat2[i1][i2]

in the inner loop. Often the user may not be able to give the exact statements generating the desired computational load. In such cases it is important to have a set of statements with which the computational behavior can be described. Since the resulting computational load depends upon the position in the loop hierarchy, these statements have to be determined for each loop level of the program structure. Three examples of such statements illustrate various options.

\MATPROD

indicates the calculation of a matrix multiplication. The type of operations can be determined by the declaration of the matrices which are to be multiplied. The sizes of the matrices (submatrices) can be given as arguments.

\MATVECPROD


is similar to the matrix multiplication instruction and generates code for a matrix-vector product.

\SCALPROD

calculates a scalar product. Two vectors and the name of the resulting scalar are given as arguments.

These and other operations often used in massively parallel programming are provided to make the description of the computational load easy. Since various data structures can be specified, it is easy to generate different classes of workloads. Using these constructs, it is possible, for example, to describe a diverse set of abstract tests for integer and floating point performance.
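To make one of these constructs concrete, the per-node part of a \SCALPROD might plausibly expand into a local dot-product kernel such as the following. The function and parameter names are illustrative assumptions; the generator's actual output would also carry PICL instrumentation and combine the partial results across nodes.

```c
/* Hypothetical local expansion of \SCALPROD: each node computes the
   dot product of its block of the two argument vectors. The partial
   results would then be combined across all nodes to form the scalar. */
static double scalprod_local(int block_len, const double *vec1,
                             const double *vec2)
{
    double result = 0.0;
    for (int i = 0; i < block_len; i++)
        result += vec1[i] * vec2[i];
    return result;
}
```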

4.1.1.3. Modeling Communication

The design of a workload for multiprocessor machines should include a description of the load placed on the interconnection network. Several goals are discussed in the following.

First, it is important to have a simple, easy to use description of several communication related parameters. Such parameters include:

• the frequency and type of messages,

• the size and pattern of messages, and

• the locality and communication distance, including the sending and receiving nodes.

Another goal, related to the programming of message-passing architectures, is to make the communication code deadlock-free. The development of such code is problematic since statements for receiving messages are normally blocking. In order to make the code deadlock-free, a mechanism is needed which assures that an adequate number of messages is sent to nodes which are waiting to receive.

Describing communication in an abstract way implies that neither setup routines nor special point-to-point programming should be necessary. The LOOP language provides high level instructions from which the generator is able to produce correct communication code for each node. These are motivated through the following example.

4.1.1.4. The COMMUNICATE Statement

One common communication function is the transfer of messages between processor nodes of a given distance. In parallel computing, situations can be found in which several processors at a certain distance communicate regularly. Nodes typically send results to another node and receive new data from a third node.

As an example, a simple parallel matrix multiplication is considered. First, the basic algorithm is described in Figure 4.

In this figure, it is shown that matrices A and B are distributed in blocks of rows and columns, respectively. Each node is able to compute a certain sub-matrix of the resulting matrix C. Having calculated the sub-matrix, each node must send its column block of matrix B to another node and receive a new block of columns from a third node. Typically, the blocks of columns are exchanged cyclically.

Figure 4: Structure of a parallel matrix-matrix multiplication (A x B = C). Each node performs the following step ‘number of nodes’ times:

• gets part of A and B

• computes a submatrix of C

• sends own part of B to another node

• receives new part of B from a third node

Reconsidering the LOOP program shown in Figure 3, the communication can be modelled with the

\COMMUNICATE (size, distance, partners)

instruction. The arguments specify the amount of data which has to be communicated, the distance between the communicating processors, and the number of processors to which the data shall be transferred, respectively. Thus, the

\COMMUNICATE(amount,1,1)

statement in Figure 3 indicates that a certain number of matrix entries (amount) has to be transferred to one partner at distance one. Optionally, the \COMMUNICATE statement can be given a compound statement. Such compound statements indicate cases in which the low level send and receive operations generated by the high level \COMMUNICATE instruction are positioned at different places in the parallel code. The sending operations are performed before the execution of the compound statement, and the receiving of the messages is done afterwards. Inside the compound statement a certain amount of computation may be performed. Because of the non-blocking send operation, overlapping between communication and computation can be achieved. Generally speaking, with the \COMMUNICATE instruction it is possible to check the performance of the interconnection network with respect to

• the communication distance and

• the message length.

4.1.1.5. Information Exchange

Similar to the \COMMUNICATE construct, the \EXCHANGE statement generates communication between two processors of a certain distance. Although it might initially seem possible, this construct cannot be replaced by the use of two \COMMUNICATE instructions, which would generate only very few bidirectional communications. Therefore, a construct which generates only bidirectional communication between pairs of processors is provided. The construct

\EXCHANGE (size, distance, partners)

generates bidirectional communication by sending and receiving messages of length ‘size’ between ‘partners’ (i.e., processors) of a certain communication ‘distance’.

In parallel linear algebra and image processing there are several algorithms which make use of data exchange between pairs of processors. As an example we consider the principles of a parallel red-black relaxation algorithm.

All elements of the matrix are marked in a chess-board like manner using the colors red and black. Each processor gets a part of the matrix as shown in Figure 5. In each iteration step all elements of a processor’s submatrix are recalculated. New values are calculated by using a function which takes a certain neighborhood of the point into account. This means for each point, a statement

Figure 5: Parallel red-black relaxation (the matrix is split among processors Proc 1, Proc 2, and Proc 3; rows shared between neighboring processors must be exchanged)


P’[i][j] := f( P[i][j], P[i-1][j], P[i+1][j], P[i][j-1], P[i][j+1] )

has to be calculated.

Red elements are recalculated first, for which elements shared between two processors have to be exchanged. After that, recalculation and exchange of black elements is done. This process continues until the changes in the matrix values are below a certain error threshold, at which point the algorithm is assumed to have found the solution to the problem.

int main(void)
{
    int amount, host, me,
        psize /* problem size */, nodes /* number of nodes */;
    \MAT (TYPE: MY_TYPE, S1: psize_node) Mat;
    \OPEN0(&nodes, &me, &host, &psize, T_BUF);
    int psize_node = (int) (psize / nodes);
    \INIT_ARRAY(Mat);
    amount = psize * sizeof(MY_TYPE);
    \LOOP (ITERATIONS) {
        \EXCHANGE (amount, 1, 2);
        relax (0, psize, psize_node);
        \EXCHANGE (amount, 1, 2);
        relax (1, psize, psize_node);
    }
    \CLOSE0();
    return EXIT_SUCCESS;
}

Figure 6: LOOP program for parallel red-black relaxation

A LOOP program which is capable of modelling this algorithm is shown in Figure 6. The subroutine relax performs the iteration of one color (red or black) on the matrix. The function for the computation of new elements is the arithmetic mean of the four neighboring elements. In contrast to a real algorithm, which would terminate if the error between two iteration steps is smaller than a given limit, this example iterates a fixed number of times. At runtime the user can specify the number of iterations by defining the value of iterations.

Besides placing computational load on each processor, the program as described in Figure 6 stresses the interconnection network with a large number of bidirectional messages, expressed by the \EXCHANGE construct. Here it can be seen how normal C constructs are integrated in the LOOP program. The distribution of the matrix itself and other setup overhead is not considered in this example. To handle the distribution of data, some high-level communication routines, as shown below, are provided.

4.1.1.6. High-level Communication Routines

Two major types of high-level communication routines are provided: multi-broadcast and vector communication kernels. The first one has been implemented to place a massive load on the interconnection network. All of these massively communicating routines are implemented by sending a certain amount of data from all nodes to all others. This functionality is called a multi-broadcast operation. The LOOP package has three slightly different implementations of this multi-broadcast statement.

\MULTIBCAST0(amount)

All nodes start by performing all send operations first; after that, all receives are executed. The amount of data sent by each node is given as an argument. All nodes begin by sending to node 0 and proceed by sending to the remaining nodes in numerical order. (Nodes do not send messages to themselves.) Message receiving is done in the same order, starting at node 0.

\MULTIBCAST_ME(amount)

This procedure is similar to the operation described above. The only difference is that nodes do not start sending to node 0. Instead, each starts with the node which is numbered one greater than itself (modulo the highest processor number). This makes sure that node 0 and the communication paths near node 0 are not overwhelmingly loaded.


\MULTIBCAST_ALTER(amount)

In contrast to the two functions mentioned above, this one does not separate all send and receive operations. Instead, as the name indicates, it places a receive operation after each send. The order of these operations is such that send operations are done with increasing and receive operations with decreasing processor numbers. This strategy avoids hot spots and puts an evenly distributed communication load on the interconnection network.

The other class of high-level communication routines are vector communication kernels. Such routines are often used in parallel computing. Routines for distributing, collecting, and broadcasting vectors are provided. The communication pattern used by these procedures is based on a virtual (binary) tree topology. Although a tree topology might not map well onto the target architecture, it does offer the advantage that collection and distribution can be done in logarithmic time. Three different routines are provided by the LOOP language.

\TREE_BCAST(start_data,amount)

Node 0 sends (broadcasts) ‘amount’ bytes to all other nodes. The data sent is located at the position indicated by ‘start_data’.

\TREE_COLLECT_VEC(start_vec)

Processor 0 collects the parts of a distributed vector from all other processors and rebuilds the original vector at ‘start_vec’.

\TREE_DISTRIB_VEC(start_vec)

Processor 0 starts distributing a vector at ‘start_vec’ to all other processors. Each node gets its own part of the vector. These tree communications are carried out in log2(processors) steps. In each step, the data to be collected/distributed is passed to the next upper/lower level of the assumed virtual tree topology. A default logical tree topology is implemented with the LOOP package. The mapping of the nodes onto the target architecture can be tuned by the user.

4.1.2. Runtime Parameters

In the preceding sections, the structural parameters needed to generate a certain type of workload have been described. For different executions, this generation step need not be repeated; only different runtime parameters are needed.

One important runtime parameter is the problem size. The overall execution time and communication behavior depend directly on this parameter. For example, in the SLALOM benchmark, by varying the problem size it is possible to analyze the cache influence: workloads with a large problem size do not fit in small data caches. A second important runtime parameter is the number of processors allocated to the generated workload. Both the problem size and the number of allocated processors are important in determining the granularity at which the problem is solved most efficiently on the tested machine. To be able to model iterative algorithms, a third runtime parameter, the number of iterations, is provided.

Problem size: The size of the data structures over which the program iterates.

Processors: The number of processors allocated to the workload.

Iterations: In case of iterative algorithms, the number of passes the algorithm makes over the specified data structures.

4.2. PICL and ParaGraph

An important feature of any successful benchmark is to design it to be portable across as many machines as possible. This is a difficult task in the case of multiprocessor architectures, because there is no standard programming language. There are also various programming models (e.g., host-node model, node model, synchronous communication, asynchronous communication). A group of researchers at Oak Ridge National Laboratory (ORNL) addressed this task by constructing a communication library. The idea is simple:

1) Identify the communication needs of a message passing program (e.g., send, receive, barrier, broadcast, etc.).

2) Provide the user with routines for those needs.

3) Put the routines in a software library that is easy to install on a wide variety of multiprocessors.

4) Make it publicly available.

The result is PICL (Portable Instrumented Communication Library), which has been implemented on several multiprocessor systems. PICL programs are portable between machines on which PICL is implemented. PICL includes all communication routines that are needed for parallel message passing programs. The generator of the benchmarking package transforms the LOOP description of a parallel workload into a parallel program with C and PICL statements. A detailed description of PICL can be found in /PICL90/, which is also part of the LOOP benchmark documentation package1.

PICL automatically instruments the code for tracing purposes. The resulting traces can be interpreted with ParaGraph, a graphical display system for visualizing the behavior and performance of parallel programs on message passing multicomputer architectures. Visual animation is provided based on execution trace information monitored during an actual run of a parallel program. The resulting trace data is replayed pictorially and provides a dynamic depiction of the behavior of the parallel program. Graphical summaries of overall performance behavior are also provided. Different visual perspectives provide different insights into the same performance data. A description of

1. available via ftp ([email protected])

ParaGraph can be found in /PARA92/, which is also part of the benchmark documentation package.

The output of the generator was chosen to be PICL programs for three reasons: instrumentation, availability, and portability. PICL provides instrumentation, it is public domain software, and it is implemented on several systems. The generator output is not inherently restricted to PICL. Whenever a new message passing paradigm becomes available that meets the three requirements above, the generator output can easily be changed. This implies that the basic LOOP approach is independent of the underlying message passing hardware. It is not possible to describe all LOOP statements in this paper; a complete description can be found in /BBS94a/.

5. Predefined Benchmarks

Once the basic LOOP structure has been specified, it is possible to write generic LOOP programs to analyze a wide range of system features. The LOOP package includes some programs that can be used as predefined benchmarks. The predefined benchmarks consist of LOOP programs for typical parallel workloads (e.g., matrix multiplication, conjugate gradient, relaxation, fast Fourier transformation) and of one special synthetic test program that provides an overall impression of the computation and communication performance of the machine. This special test program is termed the Fingerprint LOOP program. The predefined benchmarks are designed for users who want to evaluate and compare different machines.

5.1. The Fingerprint

To assess the communication capabilities of a machine in comparison to its computational power, the Fingerprint benchmark has been developed. The goal is to provide quick, visual reference information for a first-glance comparison between different machines. As shown in the annotated version of the space-time diagram in Figure 8, the Fingerprint was designed to illustrate

(1) the time needed for a certain computation-intensive phase,

(2a-c) the time needed for communications of different message lengths, and

(3) the effect of heavy communication loads which partially saturate the communication network.

This latter effect is provoked by concentrating communication on node 0, then on node 1, and so on. Thus, communication delays tend to be compounded for higher processor numbers. Therefore, a severe “V-type“ profile indicates a high number of conflicts in the communication network. A more rectangular (i.e., vertical) profile is typical for a non-saturated network, as evidenced by the ends of phases 2a and 2b. In Figure 7 the LOOP source code for the Fingerprint workload is given.

#include “LOOP.h“
#define TRACE_BUF_SIZE 250000
#define MY_TYPE double
int main(void)
{
    int myself, allnodes, host, problemsize;
    \OPEN0(&allnodes, &myself, &host,
           &problemsize, TRACE_BUF_SIZE);
    \VEC(TYPE: MY_TYPE) sc, vec1, vec2;
    \VISIBLE_SYNC();
    \LOOP {
        \SCALPROD(sc, vec1, vec2);
    }
    \VISIBLE_SYNC();
    /* multi-broadcast, 1 byte */
    \MULTIBCAST0(1);
    \VISIBLE_SYNC();
    /* multi-broadcast, 500 bytes */
    \MULTIBCAST0(500);
    \VISIBLE_SYNC();
    /* multi-broadcast, 1000 bytes */
    \MULTIBCAST0(1000);
    \VISIBLE_SYNC();
    \CLOSE0();
    return EXIT_SUCCESS;
}

Figure 7: LOOP source code for Fingerprint

Figure 8: Space time diagram for a typical 16 processor fingerprint execution

[Figure 8 annotations: phase 1 (computation), phases 2a-2c (short, medium, and large communication), phase 3 (overload); the phases are separated by visible syncs]


The execution of the Fingerprint workload falls into two parts. In the first part, a scalar product of two vectors of size problemsize is calculated on each node. The parameter problemsize is specified at runtime, making several different executions possible. The second part consists of three multi-broadcast instructions with different message lengths. In each multi-broadcast, the selected amount of information is transmitted from each node to all other nodes. This produces a heavy load on the interconnection network. The different amounts of data sent increase the network load and produce space-time diagrams which can be compared across various machines.

To make the boundary between the computation and communication phases easier to see, a \VISIBLE_SYNC() construct is used. Since the PICL sync0 instruction does not produce any trace data to be visualized by ParaGraph, the LOOP system provides this special form of synchronization. All nodes execute a sync0 operation. Next, every node sends a short message to its right-hand neighbor1. This produces a vertical line in the space-time diagram. To synchronize the execution of all nodes after sending a message to and receiving a message from the neighbors, a second sync0 is performed by all nodes. The second sync0 minimizes the time difference for the processors to start the next phase of the program.

5.2. Parameterized Applications

For the comparison of different computer systems, the benchmark package provides five different LOOP workload programs:

- fingerprint (fp),
- conjugate gradient method (cg),
- matrix multiplication sync (mmm_s),
- matrix multiplication async (mmm_a), and
- red-black relaxation (red_black).

1. Using a virtual ring topology

The fingerprint workload is described in the previous section. The second workload (cg) is a LOOP workload for a parallel conjugate gradient method. The two different matrix multiplication versions use asynchronous communication (mmm_a) and synchronous communication (mmm_s). In the first case, sending of messages is overlapped with computation. This can be important for architectures capable of doing computation and communication in parallel. The synchronous version is favorable for architectures with synchronous message passing hardware. The last workload simulates a parallel red-black relaxation algorithm. The LOOP workload program is shown in Figure 6.

6. Results

For the comparison of a MEIKO (a 64-node T800-based multiprocessor), an nCUBE/2 (a 128-node hypercube-connected multiprocessor), and an Intel Paragon (a 512-node i860-based mesh-connected multiprocessor), the five LOOP workloads described above are executed on each system. The workloads are executed with 16 and 32 processors on all three machines /BBS94b/. The timings are given in Table 1. On the nCUBE/2 and the Paragon, the workloads are also executed with 64 and 128 nodes. The results are shown in Table 3. The result tables are organized as follows. First, the name of the LOOP workload program is given. The first parameter is the number of allocated processors, the second parameter is the problem size, and the third parameter is the number of iterations (if applicable).

6.1. Execution Times

Table 1 shows the results of the LOOP benchmarks executed on the different systems.


An interesting result is that none of the three target architectures profits from the overlapping communication in the second matrix multiplication algorithm. On the Paragon and the nCUBE, asynchronous communication results in a lower bandwidth. The asynchronous communication can also slow down computation, because the processor and the communication unit try to access main memory simultaneously. On the MEIKO, asynchronous communications are converted to synchronous communications at the hardware level, resulting in additional overhead. [Note: A separate experiment showed that the message passing paradigm “send as soon as possible, receive as late as possible“ does not necessarily improve performance.]

To see the results from a relative viewpoint, the times are converted to “Paragon seconds“ (see Table 2). The workloads in the tables are ordered from communication-bound loads to computation-bound loads. That is, the fingerprint workload has the highest communication/computation ratio, while the relaxation workload has the lowest. From the published single-node peak performances, one could expect that the performance of the nCUBE/2 and the MEIKO would be similar and that the Paragon would be an order of magnitude faster.

Runtime in seconds

Workload              MEIKO    nCUBE   Paragon
fp 16/100             0.282    0.073   0.024
fp 32/100             0.559    0.112   0.036
cg 16/256/8           0.601    0.299   0.062
cg 32/256/8           0.624    0.237   0.055
mmm_s 16/256         13.461    8.586   1.692
mmm_s 32/256          7.788    4.643   0.855
mmm_a 16/256         13.680    8.437   1.692
mmm_a 32/256          7.922    4.470   0.886
red_black 16/1024/5   6.885    4.439   0.567
red_black 32/1024/5   3.621    2.210   0.284

Table 1: Execution times for the LOOP benchmarks

The first experiment (Fingerprint) shows that this expectation is not necessarily true. For a communication-bound synthetic workload (i.e., fp 16/100), a slowdown of only 3 is observed for the nCUBE/2 and a slowdown of 11 is observed for the MEIKO. However, the lower the communication/computation ratio is, the more the MEIKO and the nCUBE/2 are outperformed by the Paragon. For the most computation-bound workload, red_black 32/1024/5, the MEIKO and nCUBE/2 come closest to each other, and both are approximately an order of magnitude slower than the Paragon. The Fingerprint results (execution time, space-time diagram) show that the nCUBE/2 scores better with respect to communication-bound workloads.

Slowdown

Workload              MEIKO    nCUBE   Paragon
fp 16/100            11.729    3.033   1.0
fp 32/100            15.517    3.111   1.0
cg 16/256/8           9.931    4.823   1.0
cg 32/256/8          11.345    4.309   1.0
mmm_s 16/256          7.956    5.074   1.0
mmm_s 32/256          8.800    5.246   1.0
mmm_a 16/256          8.085    4.986   1.0
mmm_a 32/256          8.941    5.045   1.0
red_black 16/1024/5  12.121    7.743   1.0
red_black 32/1024/5  12.750    7.782   1.0

Table 2: Slowdown against Paragon

                      Runtime in secs.     Slowdown
Workload              nCUBE    Paragon     nCUBE
fp 64/100             0.181    0.072       2.613
fp 128/100            0.380    0.145       2.621
cg 64/1024/8          1.129    0.189       5.974
cg 128/1024/8         0.829    0.145       5.717
mmm_s 64/256          2.968    0.484       6.132
mmm_s 128/256         2.598    0.327       7.945
mmm_a 64/256          2.816    0.484       5.818
mmm_a 128/256         2.468    0.328       7.524
red_bl. 64/1024/5     1.078    0.143       7.531
red_bl. 128/1024/5    0.544    0.074       7.351

Table 3: Runtimes for the LOOP benchmarks on 64 and 128 nodes


Some of the results are expected (e.g., overall performance). However, some tests provide interesting insight into machine behavior (through trace visualization tools). These results can be used by both parallel programmers and system developers to improve performance. Examples include balancing the computation and communication performance (e.g., Fingerprint) and improving asynchronous communication (e.g., matrix multiplication). It is noted that the parallelization for message passing systems is still rather coarse (i.e., a certain amount of computation between communications is needed); otherwise, slowdowns can easily result from adding processors (e.g., the result for the conjugate gradient workload on the MEIKO for 16 and 32 processors).

6.2. Standard Result Sheets

For a more complete overview of the test results, three standard result sheets for each experiment have been developed. The first page contains information on the workload, including the structural and runtime parameters, information on the system hardware (e.g., number of processors, type of interconnection network), the measured performance metrics (e.g., execution time, percentages for busy, idle, and overhead times), a profile of the parallel workload, and a utilization summary for each processor. The second page gives an overview of various statistical information of an experiment. It contains information such as the number of messages sent and received; the average, maximum, and minimum times for the send and receive operations; and message queue lengths1. An example of the standard result sheets for a conjugate gradient LOOP workload is provided in the Appendix.

1. A report explaining the standard evaluation sheets for all tests in more detail is available via ftp at ftp.vanderbilt.edu (directory /pub/bench)

The PICL tracefiles contain all the information on communication and computation events. ParaGraph offers a wide variety of displays to visualize these events. Thus, the user can go into as much detail as desired.

7. Conclusions and Future Work

A primary goal of the LOOP approach is to provide the user with a set of parallel workloads which can be used to compare various aspects of different systems quickly. A second goal is to offer a convenient way to implement user-defined workloads on parallel systems. There is a variety of LOOP statements (not all of which are described in this paper) for typical communication and computation loads. LOOP programs can be complemented with C statements. Once the description of a workload is complete, the remaining steps are automatic. The user only has to specify runtime parameters (problem size, number of processors, and number of iterations). The program is then executed and a tracefile is generated. ParaGraph can be used to visualize the trace data.

Regarding future work on the LOOP approach, two extensions are planned:

1) The use of LOOP programs for automatic workload characterization, and

2) the use of the LOOP language for fast prototyping of message passing programs.


8. References

/BBS94a/ J. Brehm et al.: A Multiprocessor Benchmark, User’s Guide and Reference Manual, ESPRIT III technical report, available via ftp.irb.uni-hannover.de

/BBS94b/ J. Brehm et al.: A Multiprocessor Benchmark, Appendix D, Machine Evaluations, ESPRIT III technical report, available via ftp.irb.uni-hannover.de

/Gol83/ Gene H. Golub et al.: Matrix Computations, North Oxford Academic, Oxford 1983

/Jain91/ R. Jain: The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation and Modeling, John Wiley & Sons, New York 1991

/Ker88/ B. W. Kernighan; D. M. Ritchie: The C Programming Language, Prentice-Hall, 1988

/Kil94/ U. Killermann: Implementierung der parallelen Programmierumgebung PPRC auf nCube (Implementation of the parallel programming environment PPRC on the nCube), Diplomarbeit IRB, Hannover 1994

/PARA91/ M. T. Heath; J. E. Finger: ParaGraph: A Tool for Visualizing Performance of Parallel Programs, IEEE Software, 8(5), September 1991, pp. 29-39

/PARA92/ M. T. Heath; J. E. Finger: ParaGraph: A Tool for Visualizing Performance of Parallel Programs, User Guide, Oak Ridge National Laboratory, Oak Ridge, October 1992

/PICL90/ G. A. Geist; M. T. Heath; B. W. Peyton; P. H. Worley: PICL - A Portable Instrumented Communication Library, Technical Report, Oak Ridge National Laboratory, Oak Ridge, July 1990

/PICL90a/ G. A. Geist; M. T. Heath; B. W. Peyton; P. H. Worley: PICL - A Portable Instrumented Communication Library, C Reference Manual, Oak Ridge National Laboratory, Oak Ridge, July 1990

/Pre88/ H. Press et al.: Numerical Recipes in C - The Art of Scientific Computing, Cambridge University Press, New York 1988

/Schl93/ T. Schlemeier: Entwicklung eines Generators für parallele Benchmarkprogramme (Development of a generator for parallel benchmark programs), Diplomarbeit IRB, Hannover 1993

/Sin91/ Singh; Weber; Gupta: SPLASH: Stanford Parallel Applications for Shared Memory, Stanford University, CA 94305, Technical Report CSL-TR-91-469

/Sla90/ J. Gustafson et al.: SLALOM: The First Scalable Supercomputer Benchmark, Supercomputing Review, November 1990, pp. 56-61

/Spec91/ SunTech Journal: SPECulations (Defining the SPEC Benchmark), January 1991

/Wei91/ Reinhold Weicker: Benchmarking: Status, Kritik, Aussichten (Benchmarking: status, criticism, prospects), Proceedings zur 6. GI/ITG Fachtagung Messung, Modellierung und Bewertung von Rechensystemen, pp. 259-277, Springer-Verlag, Berlin 1991


Appendix: Example for the standard result sheets for a conjugate gradient workload

Evaluated Problem

Fingerprint(1,500,1000)

No. nodes: 16
Problem size: 100
Iterations: 100

Main Results

Execution time: 0.0728 sec
Percent Processors Busy: 51.32 %
Percent Processors Overhead: 43.37 %
Percent Processors Idle: 5.32 %
Avg Time Send (usec): 4378
Avg Time Rcvd (usec): 4378

Hardware

nCUBE/2

Nodes: 128 at 20 MHz
Network: Hypercube


Appendix: Example for the standard result sheets for a conjugate gradient workload

Statistics evaluated by ParaGraph (Scaling factor for times: 100):


[Figure: ParaGraph statistics display for nCube 2, Fingerprint (100, 1, 500, 1000), Scaling 120]
