
PRELOADING SCHEMES FOR THE PASM PARALLEL MEMORY SYSTEM

David Lee Tuomenoksa†    Howard Jay Siegel

Purdue University
School of Electrical Engineering
West Lafayette, Indiana 47907

Abstract — Parallel processing systems, such as PASM, employ a large number of primary memory modules. A memory system organization using parallel secondary storage devices and double-buffered primary memories has been devised for PASM in order to prevent primary/secondary memory transfers from becoming a bottleneck. To efficiently use the memory system, it is desirable to overlap the operation of the parallel secondary storage devices with computations being performed by the processors. Due to the dynamically reconfigurable architecture of PASM, the processors which will execute a new task will not be selected until they are ready to execute the task. Thus, to make effective use of double-buffering, a task must be preloaded prior to the final selection of the processors on which it will execute. Two schemes which allow the parallel secondary storage devices to preload input data and programs into the primary memories so that system performance can be improved are presented and compared. Results show that both methods are effective techniques.

I. Introduction

In large-scale reconfigurable parallel processing systems the transfer of data and programs between the primary memories of the processors and the secondary storage can become a bottleneck. There are several types of reconfigurable parallel processing systems. A partitionable SIMD/MIMD system can be dynamically reconfigured to operate as one or more independent SIMD (single instruction stream-multiple data stream) [4] and/or MIMD (multiple instruction stream-multiple data stream) [4] machines (e.g., PASM [18], TRAC [8,16]). A multiple-SIMD system is a parallel processing system which can be dynamically reconfigured to form one or more independent SIMD machines of varying sizes (e.g., MAP [11,12]). When a partitionable SIMD/MIMD or multiple-SIMD system is forming an SIMD machine, data must be loaded into the processors' primary memories. When a partitionable SIMD/MIMD system is forming an MIMD machine, in addition to data, a program must be loaded into the primary memory of each processor which is executing the task.

PASM is a partitionable SIMD/MIMD multimicrocomputer system being designed at Purdue University for image processing and pattern recognition applications [18]. In order to prevent the primary/secondary memory transfers from becoming a bottleneck in PASM, a memory system employing parallel secondary storage devices and double-buffered primary memories has been devised [18]. To improve processor utilization by taking advantage of the double-buffering, it is necessary to overlap the operation of the parallel secondary storage devices with computations being performed by the processors. This overlap can be obtained by preloading the programs and data for the next task while the previous task is being executed, and then by overlapping the unloading of output data with execution of the next task.

This research was supported by the Air Force Office of Scientific Research, Air Force Systems Command, USAF, under grant number AFOSR-78-3581, and by a Purdue University Graduate Fellowship. The United States Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation hereon.

† D. L. Tuomenoksa is now with American Bell, Holmdel, New Jersey 07733.


Since PASM is to be used in a research environment for parallel algorithm development (in some cases interactively), it is undesirable to require the user to specify the maximum allowable execution time of a task before it can be executed. The first-fit multiple-queue (FFMQ) scheduling algorithm, which has been described in [20], does not put this requirement on the user. If the FFMQ scheduling algorithm is used, due to the dynamically reconfigurable architecture of PASM, it is not known a priori which task a given group of processors will execute. Since tasks must be preloaded prior to the final selection of processors, it appears that the system would be unable to preload tasks when using the FFMQ scheduling algorithm. The problem considered is how to determine where the data and programs for tasks can be preloaded while using the FFMQ scheduling algorithm so that the performance of the memory system can be improved. Without such a preloading scheme, the full potential of the double-buffered memory modules will not be realized.

This paper presents preloading schemes which can be used in conjunction with the FFMQ scheduling algorithm. Two schemes which solve the preloading problem by determining which task's or tasks' programs and input data should be preloaded into a given set of processor memories are presented. The first scheme uses the scheduling algorithm to preschedule the task(s) which will follow the current task. The second scheme uses the scheduling algorithm to predict which task(s) may follow a given task. The performance of these preloading schemes as applied to PASM is demonstrated and contrasted through simulation studies. The preloading schemes described can be adapted to other multiple-SIMD and partitionable SIMD/MIMD systems.

II. PASM Background

PASM, a partitionable SIMD/MIMD machine, is a large-scale dynamically reconfigurable parallel processing system [18] (see Fig. 1). The System Control Unit is a conventional machine, such as a PDP-11, and is responsible for the overall coordination of the activities of the other components of PASM. The Parallel Computation Unit (PCU) contains N = 2^n processors, N memory modules, and an interconnection network (see Fig. 2). The PCU processors are microprocessors that perform the actual SIMD and MIMD computations. The PCU memory modules are used by the PCU processors for data storage in SIMD mode and both data and instruction storage in MIMD mode. A pair of memory units is used for each PCU memory module so that data can be moved between one memory unit and secondary storage while the PCU processor operates on data in the other memory unit (double-buffering).



Fig. 1. Block diagram overview of PASM.

"U

Fig. 2. PASM Parallel Computation Unit (PCU).

A processor and its associated memory module form a PCU processing element (PE). The PEs are physically addressed from 0 to N−1. The pair of memory units forming the ith PCU memory module are labeled iA and iB (see Fig. 2). The interconnection network provides a means of communication among the PEs. PASM will use either a Cube type [1] or ADM type [10] of multistage network.

The Micro Controllers (MCs) are a set of microprocessors which act as the control units for the PEs in SIMD mode and orchestrate the activities of the PEs in MIMD mode. There are Q = 2^q MCs, addressed from 0 to Q−1. Like the PE memory modules, the MC memory modules are double-buffered. Each MC controls N/Q PEs, where possible values of N and Q are 1024 and 16, respectively. An MC-group is composed of an MC processor, its memory module, and the N/Q PEs which are controlled by the MC. The N/Q PEs connected to MC i are those whose addresses have the value i in their low-order q bit positions (see Fig. 3). Control Storage contains the programs for the MCs.
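This fixed assignment rule is a simple address computation. The following sketch (illustrative Python; the function names are ours, not PASM software) shows the rule using the example sizes from the text:

    # Sketch of the fixed PE-to-MC assignment: MC i controls the N/Q
    # PEs whose physical addresses have the value i in their low-order
    # q bit positions. Names and sizes here are illustrative.
    N, Q = 1024, 16          # example sizes from the text
    q = Q.bit_length() - 1   # Q = 2**q, so q = 4 here

    def mc_of_pe(pe_address: int) -> int:
        """MC that controls the PE with this physical address."""
        return pe_address & (Q - 1)      # low-order q bits

    def pes_of_mc(mc: int) -> list[int]:
        """The N/Q PEs controlled by MC `mc`."""
        return [Q * i + mc for i in range(N // Q)]

    assert all(mc_of_pe(pe) == 3 for pe in pes_of_mc(3))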

A virtual machine of size RN/Q, where R = 2^r and 1 ≤ R ≤ Q, is obtained by combining the efforts of R MC-groups.

Fig. 3. Organization of the Memory Storage System for N = 16 and Q = 4.

According to the partitioning rule for PASM [18], the physical addresses of these MCs must have the same low-order q−r bits so that all of the PEs in the partition have the same low-order q−r physical address bits. For example, for Q = 16, allowable MC partitions include: (6), (14), (2,10), (0,4,8,12), and (1,3,5,7,9,11,13,15). Q is therefore the maximum number of partitions allowable, and N/Q is the size of the smallest partition. The reason for using this particular partitioning rule is that it allows multistage networks like the multistage Cube and the ADM, which are being considered for PASM, to be partitioned into independent subnetworks [17]. This rule is also valid for multistage Omega [9], shuffle-exchange [13], and indirect binary n-cube [15] networks, as well as other data manipulator [3] type networks such as the Gamma [14] network [17].

The designator of a virtual machine composed of an allowable partition is the smallest physical address of the MCs in the virtual machine. This designator corresponds to the low-order q−r bits of the physical address of each MC in the virtual machine. For the partitions in the above example, the designators are: 6, 14, 2, 0, and 1, respectively.
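The partitioning rule and the designator can be illustrated with a short sketch (hypothetical Python, continuing the example sizes above):

    # Sketch of the partitioning rule: R = 2**r MCs form an allowable
    # partition iff they agree in their low-order q-r address bits;
    # the designator is the smallest MC address in the partition.
    q = 4                      # Q = 16 MCs, as in the example above

    def partition(designator: int, r: int) -> list[int]:
        """All MC addresses sharing the designator's low-order q-r bits."""
        step = 1 << (q - r)    # 2**(q-r)
        return [designator + j * step for j in range(1 << r)]

    print(partition(2, 1))     # [2, 10]
    print(partition(0, 2))     # [0, 4, 8, 12]
    print(partition(1, 3))     # [1, 3, 5, 7, 9, 11, 13, 15]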

The approach of permanently assigning a fixed number of PCU processors to each MC has the advantages that the operating system need only schedule Q MCs, rather than N PCU processors, and that it simplifies the MC/PE interaction, from both a hardware and software point of view, when a virtual machine is being formed. In addition, this fixed assignment scheme is exploited in the design of the Memory Storage System in order to allow the effective use of parallel secondary storage devices [18].

The FFMQ scheduling algorithm, which is being considered for use with PASM [20], makes use of q + 1 first-in first-out task queues, TQ_0, TQ_1, ..., TQ_q. A task which requires 2^k MC-groups is put into TQ_k. Whenever there are free MC-groups, the FFMQ algorithm selects the first job in TQ_k, where k is the largest integer such that a virtual machine of size 2^k MCs is available for execution. If TQ_k is empty, then the first task from TQ_{k−1} is selected. This process is continued until all available MCs have been assigned or until k = 0.



The FFMQ algorithm assigns the task to the free virtual machine with the lowest designator. This is a nonpreemptive scheduling policy, since all tasks run until completion, and it is a multiple-queue scheduling policy. The FFMQ algorithm is a centralized scheduling algorithm [7] since the System Control Unit, which is executing the FFMQ algorithm, has complete and accurate information regarding the states of all tasks in the system.
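A simplified sketch of this policy is given below. It is an illustration of the published description [20], not PASM's actual operating system code, and the data structures are our assumptions:

    from collections import deque

    q = 4                                  # Q = 2**q = 16 MCs
    TQ = [deque() for _ in range(q + 1)]   # TQ[k]: tasks needing 2**k MCs

    def free_virtual_machine(free_mcs, k):
        """Lowest-designator free partition of 2**k MCs, or None."""
        step = 1 << (q - k)                # designators fit in q-k bits
        for designator in range(step):
            mcs = [designator + j * step for j in range(1 << k)]
            if all(m in free_mcs for m in mcs):
                return mcs
        return None

    def ffmq_schedule(TQ, free_mcs):
        """First fit: serve the largest queue whose size still fits in
        the free MCs, working down to k = 0; return (task, MCs) pairs."""
        assignments = []
        for k in range(q, -1, -1):
            while TQ[k]:
                vm = free_virtual_machine(free_mcs, k)
                if vm is None:
                    break                  # no 2**k machine free; go smaller
                task = TQ[k].popleft()
                free_mcs.difference_update(vm)
                assignments.append((task, vm))
        return assignments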

When PASM is forming a virtual machine which is to execute an SIMD task, data must be loaded into the PCU memory units and a program must be loaded into the MC memory units. When forming a virtual machine which is to execute an MIMD task, both data and programs must be loaded into each of the PCU memory units. In this paper the loading/unloading of data for SIMD tasks from the PCU memory units is considered. The loading of the SIMD program into the MC memory units is not considered since it can be overlapped with the loading of the data, following the same preloading scheme. The analysis in this paper can easily be extended to MIMD tasks; instead of loading just data, both programs and data would be loaded.

The Memory Storage System, which provides secondary storage space for the PCU memory modules, consists of N/Q independent Memory Storage Units (MSUs). It is controlled by the Memory Management System. The MSUs are numbered from 0 to (N/Q)−1. Each is connected to Q PCU memory modules, as shown in Fig. 3. For 0 ≤ i < N/Q, MSU i is connected to those PCU memory modules whose physical addresses are of the form (Q * i) + k, 0 ≤ k < Q. For 0 ≤ k < Q, MC-group k contains those PCU processors whose physical addresses are of the form (Q * i) + k, 0 ≤ i < N/Q. Thus, MSU i is connected to the ith PE of each MC-group.

The two main advantages of this approach for a partition of size N/Q (i.e., one MC-group) are that (1) all of the PCU memory modules can be loaded in parallel and (2) the data is directly available no matter which partition (MC-group) is chosen. This is done by storing the data for a task which is to be loaded into the ith logical PE of the virtual machine in MSU i, 0 ≤ i < N/Q. Thus, no matter which MC-group is chosen, the data from the ith MSU can be loaded into the ith PCU memory module of the virtual machine, for all i, 0 ≤ i < N/Q, simultaneously.

Thus, for virtual machines of size N/Q, this secondary storage scheme allows all N/Q PCU memory modules to be loaded in one parallel block transfer. Consider the situation where a virtual machine of size RN/Q is desired, 1 ≤ R ≤ Q. Only R parallel block loads are required if the data for the PCU memory module whose high-order n−q logical address bits equal i is loaded into MSU i. This is true no matter which partition of R MCs (which agree in the low-order q−r address bits) is chosen [18].
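The wiring and data-layout rules just described amount to two small address computations, sketched here in illustrative Python (our names; the example sizes are those of Fig. 3):

    # Sketch of the Memory Storage System wiring and data layout.
    N, Q = 16, 4                        # Fig. 3 sizes; N/Q = 4 MSUs

    def modules_on_msu(i: int) -> list[int]:
        """PCU memory modules wired to MSU i: addresses (Q*i) + k."""
        return [Q * i + k for k in range(Q)]

    def msu_for_logical_pe(logical_addr: int, R: int) -> int:
        """MSU holding the data block for a logical PE of an
        R-MC-group machine: the high-order n-q logical address bits."""
        r = R.bit_length() - 1          # R = 2**r
        return logical_addr >> r        # drop the low-order r bits

    # MSU 1 feeds the 1st PE of every MC-group, whichever is chosen:
    assert modules_on_msu(1) == [4, 5, 6, 7]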

A memory frame is the amount of space used in the PCU memory units for storage of data from secondary storage for a particular task. It is possible that a task may need to process more than one memory frame. Besides being used for preloading, the double-buffered PCU memory modules can also be used to overlap task execution on one memory frame with the loading or unloading of another memory frame.
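The double-buffering itself can be pictured as a pair of memory units whose roles swap at task or memory frame boundaries. The following sketch is only a conceptual illustration, not PASM hardware or software:

    # Conceptual sketch of a double-buffered PCU memory module: the
    # processor computes on one memory unit while the Memory Storage
    # System loads/unloads the other; the roles are then swapped.
    class DoubleBufferedModule:
        def __init__(self):
            self.units = {"A": None, "B": None}   # memory frames by unit
            self.active = "A"                     # unit the PE computes on

        @property
        def staging(self) -> str:
            return "B" if self.active == "A" else "A"

        def preload(self, frame):
            """Memory Storage System fills the non-active unit."""
            self.units[self.staging] = frame

        def swap(self):
            """Task switch: the preloaded unit becomes the active one."""
            self.active = self.staging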

III. Memory System Model

In this section the model for the Memory System used for the analysis in this paper is described. A memory unit set is the set of PCU memory units within a single MC-group which have the same label (e.g., A; see Fig. 2). A data block consists of all the data to be loaded for one memory unit set. For a particular task which requires R MC-groups, there are R data blocks in a memory frame.

When a task is assigned to an MC-group, one of the memory units of each PCU memory module within the MC-group is used by the task. Without loss of generality for SIMD tasks, it is assumed that all of the memory units within an MC-group which are used by a given task are in the same memory unit set. Hence, all of the memory units within a memory unit set will always be assigned to the same task and will have the same status. Since all memory units within a memory unit set always have the same status, in this model it is also assumed that the loading/unloading of data for a memory unit set is done simultaneously and is considered as one action. In general, this is also true for MIMD tasks. (However, it is possible that for MIMD tasks in which the PEs have differing secondary memory system requirements these assumptions may not hold.)

All requests which are made to the Memory Management System and serviced by the Memory Storage System are for one data block. This results from the fact that the Memory Storage System can only load/unload one memory unit set at a time. There are three types of requests: load, preload, and unload. A load request is a request for input data for a task which has been assigned to its MC-group(s) (i.e., the MC-groups are ready to execute the task). A preload request is a request for input data for a task which has not been assigned to its MC-group(s) (i.e., the MC-groups are not ready to execute the task). An unload request is a request to unload output data for a completed task.

Load requests have the highest priority since the MC-group which is associated with the request has already been assigned to the task and is idle waiting for its input data. Unload requests have the second highest priority since they are for tasks which have already completed execution and the user is waiting for the output data. Preload requests have the lowest priority since the MC-group which is associated with the request is not idle waiting for the input data to be loaded. The Memory Storage System services requests from three request queues, one for each type of request.
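One minimal way to realize these priorities, sketched under assumed names, is a single priority queue with FIFO ordering within each request type:

    import heapq

    # Sketch of the stated service order: load requests before unload
    # requests before preload requests. Field names are assumptions.
    LOAD, UNLOAD, PRELOAD = 0, 1, 2      # smaller value = served first

    class MemoryStorageSystem:
        def __init__(self):
            self.requests = []           # heap of data-block requests
            self.seq = 0                 # FIFO tie-break within a priority

        def submit(self, kind: int, data_block):
            heapq.heappush(self.requests, (kind, self.seq, data_block))
            self.seq += 1

        def next_request(self):
            """Serve one data block: loads, then unloads, then preloads."""
            return heapq.heappop(self.requests)[2] if self.requests else None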

IV. Preloading Schemes

Preloading enables the Memory Storage System to preload the data into the PCU memory units of a given virtual machine while the PCU processors of that virtual machine are still executing the previous task. In this section two task preloading schemes for use with the FFMQ scheduling algorithm are presented. These schemes make use of the double-buffered PCU memory modules. While a task is being executed using one of the memory unit sets of a given MC-group (e.g., the A memory units), the next task to be executed can be preloaded into the other memory unit set (e.g., the B memory units). Since there are only two memory units associated with each PCU processor, each processor can have at most two tasks associated with it. Hence, only single task look-ahead preloading is considered. In general, there will be more than one task preloaded into the system since different MC-groups can have different tasks preloaded.

Preloading is driven by the size of a currently executing task. When a task of size 2^k MCs, 0 ≤ k ≤ q, begins to execute, the preloading of tasks of size 2^k (or smaller) is considered for that set of MCs. Thus, a task of size 2^l MCs can be preloaded only if there are tasks of size 2^l or greater currently executing. The two preloading schemes to be presented are prescheduling and prediction.



Fig. 4. An example of the use of the prescheduling scheme for determining where input data should be preloaded. The status of a PASM with four MC-groups (Q = 4) is shown. Status of the eight (2Q) memory unit sets, the three (q + 1) task queues (for scheduling), and the Memory Storage System are given. Shaded area indicates when a memory unit set is being accessed (either by loading, unloading, or preloading) by the Memory Storage System. "L," "P," and "U" indicate that the Memory Storage System is loading input data, preloading input data, and unloading output data, respectively.



Prescheduling. When prescheduling is used, the task manager attempts to schedule tasks in advance of when they would normally be scheduled. Whenever a task starts executing on a virtual machine, the prescheduler determines which tasks (if any) will follow the execution of the given task. If there are any such tasks, they are preassigned to the appropriate MCs as their next task to be executed. The prescheduling algorithm uses the FFMQ scheduling algorithm [20]; but, instead of attempting to schedule tasks for the entire machine, the prescheduling algorithm attempts only to schedule the virtual machine (or MC-groups) which is executing the given task. When a task completes execution, if no tasks have been prescheduled to follow the completing task, the regular FFMQ scheduling algorithm is called. It is noted that task prescheduling supplements task scheduling; it does not replace it.
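In outline, the prescheduling hook might look as follows. This is an illustrative sketch reusing the ffmq_schedule function sketched in Section II; mc_next_task and request_preload are our assumed names, not a real interface:

    # Sketch of prescheduling: when a task starts on a virtual machine,
    # run FFMQ restricted to that machine's MCs, and preassign (and
    # request preloading for) whatever it would schedule there.
    mc_next_task = {}                      # MC -> preassigned follow-on task

    def preschedule(TQ, vm_mcs, memory_manager):
        """Called by the task manager when a task begins on `vm_mcs`."""
        # Tasks scheduled here are dequeued from TQ: they are committed.
        for next_task, mcs in ffmq_schedule(TQ, set(vm_mcs)):
            for mc in mcs:
                mc_next_task[mc] = next_task
            memory_manager.request_preload(next_task, mcs)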

The following example will illustrate the use of prescheduling on a PASM with four MC-groups (i.e., Q = 4). The status of the system is given as a function of time in Fig. 4. The status of the task queues is given whenever there is a change. At time 10.0 the system completes execution of task α, which required four MC-groups. Task β has already been preloaded, so it begins executing immediately. The prescheduling algorithm is called by the task manager to preschedule the task or tasks which will follow task β. The FFMQ algorithm determines that tasks γ and δ (which each require two MC-groups) will follow β. Tasks γ and δ are removed from TQ_1 and are preassigned to the appropriate MCs. The task manager then requests that the data for tasks γ and δ be preloaded. The Memory Storage System unloads the output data from task α and preloads the data for γ and δ. Recall that the Memory Storage System is only able to transfer the data for one MC-group at a time. Hence it takes four transfers to unload task α, two to preload γ, and two to preload δ. At time 10.8 task β is executing and tasks γ and δ are preloaded and ready to be executed. At time 12.0 task β completes executing and tasks γ and δ start executing.

As indicated in Fig. 4, task ε, which requires two MC-groups, was prescheduled to follow task γ. So no matter when task γ completes execution, task ε will follow it. As a result, even though task ε arrived at the system before task ζ, the MC-groups did not start executing it until 5.5 seconds after task ζ. Consequently, the response time for task ε is much greater than that of task ζ. This is an example of how the prescheduling scheme alters the order in which tasks are executed.

Prediction. As with prescheduling, the prediction preloading algorithm is invoked each time a task starts executing. When prediction is used, the task manager predicts which task or tasks may follow the task which started executing. Unlike the prescheduling scheme, the task(s) are not removed from the task queue. The prediction scheme uses the FFMQ scheduling algorithm to predict which tasks may follow a given task. The predicted task or tasks are then preloaded by the Memory Storage System into their predicted memory units. The same enqueued task may be predicted and preloaded to follow each currently executing task whose size is equal to or greater than that of the enqueued task. The FFMQ scheduling algorithm is executed whenever a task completes execution (i.e., when MCs become free) and whenever a new task arrives to be scheduled [20]. When a task is scheduled for execution by the FFMQ scheduling algorithm, the assignment of tasks to MC-groups is made as if no preloading had taken place. When a task is assigned, the task manager sends requests to the Memory Management System indicating that the Memory Storage System should load the data. Recall that one data block is sent to each MC-group and that the task manager sends a separate request for each data block. After the Memory Management System receives the data block requests, it voids any requests for data blocks which have been preloaded. Making data requests on a block-by-block basis allows the system to take advantage of and to account for partial preloading. Having the task manager request all data blocks (regardless of whether they have been preloaded) removes the burden of keeping track of preloaded data from the task manager (which executes on the System Control Unit).
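A corresponding sketch of prediction follows, again with assumed names and reusing the ffmq_schedule sketch from Section II. The key differences from prescheduling are that FFMQ is run on a copy of the queues, so nothing is dequeued, and that at real assignment time every data block is requested and the blocks already preloaded are voided:

    from collections import deque

    preloaded = set()                      # (task, MC) blocks already staged

    def predict(TQ, vm_mcs, memory_manager):
        """Called when a task begins executing on the MCs `vm_mcs`."""
        trial = [deque(tq) for tq in TQ]   # copy queues; do NOT dequeue
        for task, mcs in ffmq_schedule(trial, set(vm_mcs)):
            for mc in mcs:                 # one low-priority request per block
                memory_manager.request_preload(task, mc)

    def load_task(task, mcs, memory_manager):
        """On real assignment: request every data block; requests for
        blocks preloaded into the right memory units are voided."""
        for mc in mcs:
            if (task, mc) in preloaded:
                continue                   # voided: block already in place
            memory_manager.request_load(task, mc)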

The following example will illustrate the use of prediction on a PASM with four MC-groups (i.e., Q = 4).



Fig. 6. Status of a PASM with four MC-groups, with notation as defined in Fig. 4. Illustrates the unnecessary preloading which can occur from using the prediction scheme.

The status of the system is given as a function of time in Fig. 5. The status of the task queues is given whenever there is a change. At time 10.0 the system completes execution of task α, which required four MC-groups. The scheduler determines that task β will be executed next by the system. Since task β has been preloaded, execution begins immediately. The prediction algorithm is called by the task manager to predict which task or tasks may follow task β. The FFMQ algorithm determines that tasks γ and δ (which each require two MC-groups) may follow β. The task manager then requests that the data for tasks γ and δ be preloaded. Note that tasks γ and δ are not removed from the scheduler task queues as they were for the prescheduling scheme. The Memory Storage System unloads the output data from task α and preloads the input data for γ and δ. Recall that it takes four transfers to unload task α, two to preload γ, and two to preload δ. At time 10.8 task β is executing and tasks γ and δ are preloaded and ready to be executed. At time 12.0 task β completes execution. Since there are free MC-groups, the scheduler is called by the task manager. The scheduler then selects tasks γ and δ to be assigned to the free MC-groups. The tasks are then assigned and begin execution immediately since they have both been preloaded. Up to this point, the results of prescheduling and prediction are the same.

As indicated in Fig. 5, task ε was predicted to follow either task γ or δ. Therefore, it was preloaded into the MC-groups forming the virtual machines for both tasks. In this way, task ε can be executed by the virtual machine which becomes available first, preserving the FFMQ ordering. Thus, the normal scheduling policy is maintained with prediction and task ε does not experience the excessive delays that it did with prescheduling. Also note that in this particular example the structure of the Memory Storage System allows ε to be loaded into both MC-groups simultaneously. Since task δ was completed first, task ε was scheduled to follow it, and a new task was predicted to follow tasks γ and ε.

Summary. With the prediction scheme, the task manager predicts where the enqueued task might execute and preloads the data into the appropriate PCU memory units. The enqueued task may be loaded to follow more than one currently executing task. Independently of the preloading which has occurred, the scheduler selects which task will be executed next and to which MC-group(s) the task will be assigned. Thus, prediction does not alter the natural order in which tasks would have been scheduled without preloading. In contrast, the prescheduling scheme has the disadvantage that it alters the order in which the tasks are executed from the natural order resulting from the use of the FFMQ scheduling algorithm. For example, when prescheduling was used, task ζ was executed before task ε even though task ε was first in the task queue. As a result, prescheduling greatly increased the response time for task ε.

Fig. 5. An example of the use of the prediction scheme for determining where input data should be preloaded. Notation is the same as used in Fig. 4.


The prescheduling scheme has the advantage that it does not do any unnecessary loading of tasks which may not be used; i.e., with prescheduling a task is preloaded (or loaded) only one time. However, with prediction, a task may be preloaded one or more times. For example, in Fig. 6, the two MC-group task δ was predicted and preloaded to follow task γ. Since both tasks α and β completed execution before γ, task δ did not follow task γ. Hence, unnecessary loading of task δ occurred.

To demonstrate how it is not clear which preloading algorithm will yield better performance, consider the simple examples given in Figs. 7 and 8. In the example in Fig. 7, the system variation using the prescheduling scheme completes all of the tasks first. On the other hand, in the example in Fig. 8, the system variation using the prediction scheme completes all of the tasks first. Both preloading schemes have advantages and disadvantages. In order to evaluate and quantify their relative performance, simulation studies were conducted. These studies are described in Section V.

The preloading schemes can use any scheduling algorithm; they are not limited to the FFMQ scheduling algorithm. Since the preloading schemes use the processors currently assigned to a given task, they do not have to have a fixed MC-group structure. Hence, these hardware/software schemes can be adapted for use in other multiple-SIMD and partitionable SIMD/MIMD systems.



Fig. 8. Status of a PASM with two MC-groups, with notation as defined in Fig. 4. Example of a case where the (b) prediction scheme yields better performance than the (a) prescheduling scheme.


V. Performance Analysis

A PASM with 16 MCs (Q = 16) and 1024 PEs (N = 1024) was simulated using the PASM Operating System simulator, a discrete event simulator [5], under four variations in the control strategy used by the memory system (details of the simulations are given in [21]).

1. A PASM without double-buffered PCU memory modules was considered, i.e., only one memory unit per PE. This allowed for no overlapped loading or unloading of data, and is examined to demonstrate the need for the double-buffered PCU memory modules.

Fig. 9. Time-line which illustrates the definitions of load delay time and response time.

2. A PASM with double-buffered PCU memory modules was considered, i.e., two memory units per PE. With this variation the second memory unit was used for doing overlapped unloading of the output data from the previous task, but no preloading of the input data for the next task, i.e., there is no preloading scheme employed.

3. A PASM with double-buffered PCU memory modules was considered, using the prescheduling scheme for determining where to preload input data.

4. A PASM with double-buffered PCU memory modules was considered, using the prediction scheme for determining where to preload input data.

Performance measures to be considered are MC utilization, average load delay time, and average response time. The MC utilization is the fraction of time that the MCs are active during the simulation, specifically, the total MC active time divided by Q and by the total simulation time. MC utilization has been selected since the utilization of the MCs reflects the utilization of the PEs.

The average load delay time is the average delay time to load the memory frame for a task (see Fig. 9). The load delay time for a given task is the delay between the time when the MC-group(s) are ready to execute the task and the time when the task starts executing. This is of interest since it directly shows the decrease in the time the processors are idle waiting for data to be loaded.

The response time for a task is the delay between the time when the task arrives at the system and the time when the task completes execution on the system (see Fig. 9). The average response time is calculated by accumulating the response time for each task executed and dividing it by the number of task completions. The response time is being considered since a decrease in response time has the greatest direct effect on the user. Response time is a significant measure since it is expected that PASM will often be used interactively. Interactive users might be experimenting with different sequences of image processing algorithms on large images. It is desirable to be able to vary parameters for the algorithms and see the results in a reasonable amount of time (i.e., short response times).

The system throughput is the number of tasks completed per second by the system. It is not considered in detail since it is not an accurate performance measure for this type of analysis. The system throughput does not take into account the number of MC-groups the tasks required. For example, for two system variations, the throughput could be the same, but for one variation the system could be completing all 16 MC-group tasks and for the other the system could be completing all one MC-group tasks. Hence, for the system throughput to be of use, it is necessary to weight it with the task size, which is equivalent to looking at the MC utilization.
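For concreteness, the three measures can be computed from per-task simulation records as sketched below; the record fields (arrival, ready, start, finish, size in MC-groups) are our assumptions about what such a simulator would log:

    # Sketch of the three performance measures from per-task records.
    def performance(tasks, Q, sim_time):
        mc_active = sum((t["finish"] - t["start"]) * t["size_mcs"]
                        for t in tasks)
        return {
            # fraction of time the MCs are busy
            "mc_utilization": mc_active / (Q * sim_time),
            # idle wait between MC-group readiness and task start (Fig. 9)
            "avg_load_delay": sum(t["start"] - t["ready"]
                                  for t in tasks) / len(tasks),
            # arrival-to-completion delay seen by the user (Fig. 9)
            "avg_response": sum(t["finish"] - t["arrival"]
                                for t in tasks) / len(tasks),
        }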

Fig. 7. Status of a PASM with two MC-groups, with notation as defined in Fig. 4. Example of a case where the (a) prescheduling scheme yields better performance than the (b) prediction scheme.



Since each memory unit set is preloaded independently, it is possible for a task to be partially preloaded when the previous task completes execution. Therefore, when the new task is assigned, it is only necessary to load the memory unit sets which were not preloaded. The simulator is able to account for partial preloading.

The value of N (the number of PEs) is not varied since it would not affect the performance of the system. For example, if N were doubled, there would still be 16 MC-groups, but each MC-group would have twice as many PEs. Since there would also be twice as many MSUs (since there are N/Q MSUs), all of the memory units within one memory unit set could still be loaded in one parallel block transfer. Hence, the results of the simulations would not be affected.


On the other hand, if the value of Q were doubled, there would be 32 MC-groups and only 32 MSUs. This change enables the system to execute tasks which require 32 PEs, in addition to the other possible task sizes. Since there are half as many MSUs, it would take twice as many parallel block transfers to load the PCU memory units for a task. Hence, doubling the value of Q would have a similar effect to doubling the time to load/unload a data block for an MC-group, as is done in Experiment 2.

"In computer systems, the arrival of individuals at acard reader, the failure of circuits in a central processor,and requests from terminals in a time-sharing system areprocesses that are essentially Poisson in nature." [5]Since PASM serves requests from terminals (as does atime-sharing system), task arrivals are generated with aPoisson process. The mean task interarrival time wasselected to be 20 simulation seconds. A uniform distri-bution is used for determining the number of MC-groupsa task requires. Each simulation run was for 20,000"PASM seconds" and had approximately 1000 tasks exe-cuted. The performance analysis has been divided intotwo experiments.

Experiment 1. In this experiment the distribution for the task execution time was chosen to be exponential. The mean execution time was varied from five to 50 simulation seconds. The time to load/unload a data block for an MC-group has been selected to be 0.090 simulation seconds. This load time is based on the time to load 64 kilobytes of data into a memory unit assuming that each MSU was a CDC BK7XX Storage Module Drive (disk) [2]. This time accounts for the seek and latency times of the disk, which can be overlapped with the time to set the Memory Storage System busses. However, it does not account for any overhead from file system actions, which should be insignificant when compared to the seek and latency times.

The average response time is given for the four variations in control strategy as a function of the average task execution time in Tab. 1. In [19] it has been determined both analytically and by simulation that the average number of tasks being executed by the system for a uniform distribution of task sizes is 2.58 times the MC utilization. Hence, if the system is to be 100 percent utilized, the mean task execution time must be at least 51.6 seconds if the mean interarrival time is 20 seconds (by Little's law, 2.58 tasks in execution times a 20-second mean interarrival time gives 51.6 seconds). Therefore, when the average execution time is small (less than 20 seconds), the system (or MC) utilization is low (less than 40 percent, see Tab. 2). With the utilization so low, there are usually free MC-groups and tasks can normally be scheduled immediately upon arrival. If tasks are scheduled immediately upon arrival, there is no time period in which tasks can be preloaded and as a result no preloading of input data occurs.

Hence, for small average execution times (less than 20 seconds), the response time for all variations in control strategy is about the same.

The prescheduling scheme alters the order in which tasks are scheduled from the normal scheme since the tasks are scheduled in advance. The use of prescheduling sometimes results in bad scheduling decisions (or mis-scheduling). For example, consider a PASM with two MC-groups (Q = 2) which is executing tasks α and β, each requiring one MC-group (see Fig. 8a). Task γ has been prescheduled to follow α. Task β completes execution before task α. Now MC-group 1 is idle and task γ is waiting to be executed on MC-group 0. Hence, task γ has been mis-scheduled, resulting in increased response time for task γ. These bad decisions will have little effect when the average task execution time is small. However, when the average task execution time becomes large (i.e., greater than 25 seconds), the bad decisions have a greater effect. This effect is illustrated by the average response time for the prescheduling scheme becoming greater than the average response time for the prediction scheme for large execution times (see Tab. 1).

The MC utilization for the four system variations in control strategy is given in Tab. 2. For execution times of less than 40 seconds, the system is able to service all of the arriving tasks (task arrival rate equals throughput) under each variation in control strategy. As a result, for small execution times the MC utilization is the same for all control strategies since the same set of tasks is being executed. As the average execution time increases, the throughput of tasks requiring 16 MC-groups decreases for the single and double variations. Since the system is executing fewer 16 MC-group tasks, the MC utilization is lower for the variations without preloading. When the average execution time is 50 seconds, the 16 MC-group task throughput is less for prescheduling than for prediction, resulting in the difference in MC utilization.

Tab. 2. MC utilization is given for the four variations in control strategy as a function of the average task execution time (in simulation seconds).

Tab. 1. Average response time (in simulation seconds) is given for the four variations in control strategy as a function of the average task execution time (in simulation seconds).


This occurs since the prescheduling scheme can only preschedule tasks which require the same number or fewer MC-groups than a currently running task. Therefore, a 16 MC-group task can only be prescheduled if there is a 16 MC-group task running. Hence, the prescheduling scheme tends to favor tasks which do not require 16 MC-groups. As the demand on the system increases (e.g., longer average task execution times), the MC utilization becomes limited with the single and double variations since the processors cannot be utilized while they are waiting for data and programs to be loaded and unloaded. Hence, the maximum allowable system load (utilization) is higher when the preloading schemes are used.

In Tab. 3 the average task load delay time is given as a function of the average task execution time for the four variations in control strategy. This is given to show how the preloading schemes reduce the load delay times for tasks.

Tab. 3. Average load delay time (in simulation seconds) is given for the four variations in control strategy as a function of the average task execution time (in simulation seconds).

Consider the single-buffered variation. As the average task execution time increases, the system utilization approaches one (see Tab. 2). When a task is scheduled it is usually being executed by MC-groups which have just completed executing another task. As a result, the new task must wait for the output data from the previous task to be unloaded and its input data to be loaded. Hence, the load delay time increases with increased utilization for the single-buffered variation. However, with the prediction and prescheduling schemes, longer task execution times allow more time for task preloading. Therefore, for large task execution times, the average load delay time approaches zero for both preloading schemes (see Tab. 3). The average load delay time will never reach zero since there are constraints on when preloading is possible (e.g., the system cannot preload tasks which require more MCs than any given currently executing task) which will always prevent some tasks from being preloaded. The average load delay time for the prediction scheme does not approach zero as rapidly as it does for the prescheduling scheme since some tasks are not executed by the virtual machine into which they were preloaded (e.g., task δ of the example in Fig. 6).

In summary, for small execution times (less than 20 seconds) the system performs the same for all variations in control strategy. For large execution times (greater than 20 seconds) the prediction scheme performs best. For a given task, the load delay time is a component of the response time (see Fig. 9). As a result, load delay time does not indicate the direct effect on the user, as does the average response time.

414

Tab. 4. Average response time (in simulation seconds) is given for the four variations in control strategy as a function of the time to load/unload one data block (in simulation seconds).

Hence, the lower average response times (for larger execution times) provided by the prediction scheme are more significant than the lower average load delay times provided by the prescheduling scheme. Therefore, this experiment indicates that the prediction scheme is the method of choice.

Experiment 2. In this experiment the distribution for the task execution time is exponential with a mean task execution time of 25 simulation seconds. The time to load/unload a data block for an MC-group is varied from 0 to 0.315 simulation seconds. Varying the time to load/unload a data block could result from varying the size of the data block (which would result from varying the size of the PCU memory units) or from changing the type or speed of the secondary storage device used by the MSUs. For example, the time to load 64 kilobytes of data from a disk which employs "Winchester" technology can range from 0.2 to 0.4 seconds, depending on the particular manufacturer (e.g., for the Hewlett-Packard Model 7910, the average load time is 0.236 seconds [6]).

The average response time is given for the four variations in control strategy as a function of the time to load/unload a data block in Tab. 4. Note that in Tab. 1 the average response time was given as a function of the average task execution time, while in Tab. 4 it is given as a function of the time to load/unload a data block. If the load/unload time is zero, the average response time for the single, double, and prediction variations is the same since loading and unloading a task requires no time. The response time for the prescheduling variation is greater than the other variations when the load/unload time is zero since with the prescheduling variation the system is still prescheduling tasks, resulting in some mis-scheduling. Hence, the zero load/unload time case directly illustrates the increase in response time resulting from the use of prescheduling.

As the load/unload time increases, the average response time for the single-buffered variation increases at a greater rate since the load and unload time for a task must be added to the execution time of every task. For all load/unload times, the prediction scheme yields the lowest response times. For load/unload times greater than 0.045 simulation seconds, the prescheduling scheme yields lower average response times than the single-buffered variation, and for load/unload times greater than 0.180 the prescheduling scheme yields lower average response times than the double-buffered variation without preloading. These cross-overs in the average response time occur since the benefit of the preloading (from the use of prescheduling) becomes more significant with greater load/unload times (and overcomes the increase resulting from mis-scheduling).


VI. Conclusion

Two schemes which can be used with the FFMQ scheduling algorithm for preloading input data into the PCU memory modules have been presented. The two schemes (prescheduling and prediction) make use of the double-buffered PCU memory modules. Since both schemes have advantages and disadvantages, in order to evaluate and quantify their relative performance, it was necessary to conduct simulation studies. The performance of the system has been evaluated with four variations in control strategy. It has been shown that the use of the double-buffered memory modules for overlapping the unloading of the output data from the previous task with the execution of the next task results in a significant decrease in average response time. Furthermore, it has been shown that the average response time can be decreased more significantly by using the double-buffered memories for input data preloading (along with overlapped unloading). When the system becomes heavily loaded, the system performs better with the prediction scheme than with the prescheduling scheme since the prescheduling scheme alters the natural ordering of the tasks which results from using the scheduling algorithm. However, the prescheduling scheme has the advantage that it does not do any unnecessary loading of input data which may not be used. The prediction scheme also has the advantage that in the worst case the resulting system performance will never be worse than that of the overlapped unloading case, since the same scheduling order is maintained and all preloading is done with lower priority. This claim cannot be made for the prescheduling scheme since it alters the scheduling order.

In summary, the "prediction" preloading scheme makes good use of the Memory Storage System architecture and the double-buffered PCU memory modules. It overcomes the problem of how to determine where the system can preload tasks prior to final processor selection. Thus, the double-buffered primary memory / parallel secondary storage device organization can be exploited for overlapped loading of tasks as well as overlapped unloading. The preloading schemes can use any scheduling algorithm and can be adapted for use in other multiple-SIMD and partitionable SIMD/MIMD systems.

References

[1] G. B. Adams III and H. J. Siegel, "The extra stage cube: a fault-tolerant interconnection network for supersystems," IEEE Trans. Comput., vol. C-31, pp. 443-454, May 1982.

[2] Control Data Corporation, CDC Storage Module Drive BK7XX Hardware Reference Manual, Control Data Corporation, Minneapolis, MN, 1979.

[3] T. Feng, "Data manipulating functions in parallel processors and their implementations," IEEE Trans. Comput., vol. C-23, pp. 309-318, Mar. 1974.

[4] M. J. Flynn, "Very high-speed computer systems," Proc. IEEE, vol. 54, pp. 1901-1909, Dec. 1966.

[5] S. H. Fuller, "Performance evaluation," in Introduction to Computer Architecture, 2nd edition, edited by H. S. Stone, SRA, Inc., Chicago, 1980, pp. 527-590.

[6] Hewlett-Packard, Electronic Instruments and Systems 1982, Hewlett-Packard, Palo Alto, CA, 1982.

[7] R. Y. Kain, A. A. Raie, and M. G. Gouda, "Multiple processor scheduling policies," 1st Int'l. Conf. Distributed Computing Systems, Oct. 1979, pp. 660-668.

[8] R. N. Kapur, U. V. Premkumar, and G. J. Lipovski, "Organization of the TRAC processor-memory subsystem," AFIPS 1980 Nat. Comput. Conf., May 1980, pp. 623-629.

[9] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. Comput., vol. C-24, pp. 1145-1155, Dec. 1975.

[10] R. J. McMillen and H. J. Siegel, "Routing schemes for the augmented data manipulator network in an MIMD system," IEEE Trans. Comput., vol. C-31, pp. 1202-1214, Dec. 1982.

[11] G. J. Nutt, "Microprocessor implementation of a parallel processor," 4th Symp. Comput. Architecture, Mar. 1977, pp. 147-152.

[12] G. J. Nutt, "A parallel processor operating system comparison," IEEE Trans. Software Engr., vol. SE-3, pp. 467-475, Nov. 1977.

[13] D. S. Parker, "Notes on shuffle/exchange-type switching networks," IEEE Trans. Comput., vol. C-29, pp. 213-222, Mar. 1980.

[14] D. S. Parker and C. S. Raghavendra, "The gamma network: a multiprocessor interconnection network with redundant paths," 9th Symp. Comput. Architecture, Apr. 1982, pp. 73-80.

[15] M. Pease, "The indirect binary n-cube microprocessor array," IEEE Trans. Comput., vol. C-26, pp. 458-473, May 1977.

[16] M. C. Sejnowski, E. T. Upchurch, R. N. Kapur, D. P. S. Charlu, and G. J. Lipovski, "An overview of the Texas reconfigurable array computer," AFIPS 1980 Nat. Comput. Conf., May 1980, pp. 631-641.

[17] H. J. Siegel, "The theory underlying the partitioning of permutation networks," IEEE Trans. Comput., vol. C-29, pp. 791-801, Sept. 1980.

[18] H. J. Siegel, L. J. Siegel, F. C. Kemmerer, P. T. Mueller, Jr., H. E. Smalley, Jr., and S. D. Smith, "PASM: a partitionable SIMD/MIMD system for image processing and pattern recognition," IEEE Trans. Comput., vol. C-30, pp. 934-947, Dec. 1981.

[19] D. L. Tuomenoksa and H. J. Siegel, "Analysis of the PASM control system memory hierarchy," 1982 Int'l. Conf. Parallel Processing, Aug. 1982, pp. 363-370.

[20] D. L. Tuomenoksa and H. J. Siegel, "Analysis of multiple-queue task scheduling algorithms for multiple-SIMD machines," 3rd Int'l. Conf. Distributed Computing Systems, Oct. 1982, pp. 114-121.

[21] D. L. Tuomenoksa and H. J. Siegel, Design of the Operating System for the PASM Parallel Processing System, TR-EE 83-14, School of Electrical Engineering, Purdue Univ., May 1983.
