Thread migration and its applications in distributed shared memory systems 1

Ayal Itzkovitz, Assaf Schuster *, Lea Shalev

Computer Science Department, Technion – IIT, 32000 Haifa, Israel

Received 23 July 1996; received in revised form 30 October 1996; accepted 27 January 1997

Abstract

In this paper we describe how thread migration can be carried out in distributed shared memory (DSM) systems. We discuss the advantages of multi-threading in DSM systems and the importance of preemptive dynamic thread migration. The proposed solution is implemented in MILLIPEDE: an environment for parallel programming over a network of (personal) computers. MILLIPEDE implements a transparent computation migration mechanism: a mobile computation thread in a MILLIPEDE application can be suspended at almost any point during its lifetime and resumed on another host. This mechanism can be used to better utilize system resources and improve performance by balancing the load and resolving ping-pong situations of memory objects, and to preserve users' ownership of their workstations. We describe how some of these are implemented in the MILLIPEDE system. MILLIPEDE, including its thread migration module, is fully implemented in user mode (currently on Windows-NT) using the standard operating system APIs. © 1998 Elsevier Science Inc. All rights reserved.

Keywords: Thread migration; Distributed shared memory; Load sharing; Virtual parallel machine

1. Introduction

Many attempts are being made to integrate the resources and services of distributed computational environments into virtual parallel machines, or metacomputing environments. While being very cheap and available to everyone, such metacomputing environments will exhibit very high computational power, large virtually shared memory, and high bandwidth of I/O and communication. Applications using these environments will have to adapt dynamically to the varying network configurations, utilizing idle resources and instantly evicting those resources reclaimed by their native users.

In order to integrate the resources of a distributed environment, some form of cooperation among the nodes (or computers) is necessary (Casavant and Kuhl, 1988; Chase et al., 1989; Krueger and Livny, 1988; Kumar et al., 1987; Willebek-Le-Mair and Reeves, 1993). Dynamic load sharing is the form of load distribution that has the potential of being efficient in a distributed system (Eager and Lazowska, 1986; Kremien, 1993). Load-sharing algorithms attempt to ensure that there are no idle hosts while there are tasks waiting for execution on other hosts. This is achieved by dynamic initial placement and by migration after startup. Multithreading further helps to achieve better load distribution between the nodes in the system by splitting the application into smaller chunks of work.

However, distributing an application over the network has its drawbacks. Components of an application need to communicate and synchronize, imposing overhead that, due to the relatively inefficient communication, is typically very high (Kumar et al., 1993). In fact, the optimal speedup in a distributed environment is commonly obtained by using fewer processors than the total number of those that are idle and available. The exact set of machines that take part in the computation must be determined dynamically, according to both the system's varying capabilities and the application's varying needs. The solution to all of these problems can be found in multithreading and thread migration. Multithreading can hide the latency by overlapping communication and computation. Thread migration can significantly reduce the amount of communication in DSM systems by migrating threads in order to improve the locality of shared data accesses.

The Journal of Systems and Software 42 (1998) 71–87

* Corresponding author. Tel.: +972 4 829 4330; fax: +972 4 822 1128; e-mail: [email protected].
1 Technion CS/LPCR Technical Report #9603, July 1996.



Although most of the power of metacomputing environments will typically come from personal machines, degradation of interactive response must be avoided. If the owner of a machine or a resource is not guaranteed to receive it at the moment he attempts to use it, he will not allow "invasion" by remote execution in the future (Douglis and Ousterhout, 1991). Here, once again, thread migration is the answer: it can be used to preserve user ownership in an efficient way.

Some of the machines in a metacomputing environment may be symmetric multiprocessors (SMPs), which are tightly coupled "shared-all" multiprocessor machines. In an SMP system all the components, such as processors, physical memory, buses, disks and controllers, are shared. A single copy of an operating system controls all components, manages the shared memory, and balances the load among the processors by dynamically reassigning processors to threads. Here, using multiple threads makes it possible to utilize the processors in a transparent and efficient way.

SMP systems are becoming widely available, and it is expected that this process will promote the development of parallel applications that use multithreading and shared memory. From the application's point of view, non-scalable parallel computing on SMP machines with shared memory and scalable parallel computing on metacomputing environments with virtually shared memory are at the same level of abstraction. Thus, given efficient run-time support for metacomputing environments, the transition from parallel computing on SMPs to parallel computing on distributed environments is just a small, natural step.

As argued above, efficient support for metacomputing environments must include migration of threads between machines. Unfortunately, implementing thread migration is not an easy task. In this paper we discuss the problems and complications of such implementations, with special emphasis on their relation to a possible neighboring DSM mechanism. We describe some flawed solutions that appear in the literature and present a working solution that is implemented in the MILLIPEDE virtual parallel machine.

We then proceed to describe the way thread migration is utilized in the MILLIPEDE system. MILLIPEDE is a thread-based system for the development and execution of parallel applications in distributed environments. It presents a strong application interface, including a flexible DSM mechanism along with a dynamic thread scheduling algorithm. The thread scheduling algorithm strives to reach the optimal speedup by dynamically solving the trade-off between minimal load and minimal communication. It also tries to minimize communication by migrating both threads and pages between machines, until maximal data locality is achieved. To this end, MILLIPEDE implements a transparent thread migration mechanism that is used by the thread scheduler.

MILLIPEDE is currently implemented on the Windows-NT operating system, using its support for multithreading and SMP thread scheduling. A detailed description of the MILLIPEDE system can be found in (Itzkovitz et al., 1997b).

1.1. Related systems

We now discuss the main differences between MILLIPEDE and several other systems that support thread migration.

· UPVM is a package that supports multi-threading and transparent migration for PVM applications (Casas et al., 1994). UPVM defines an abstraction, called a user level process (ULP), that has some of the characteristics of a thread and some of a process. ULPs differ from threads in that they define private data and heap space. ULPs communicate with each other via message passing. The ULP state that is transferred when a ULP migrates includes the context, the stack, the data, and the heap. As in MILLIPEDE, the mapping of a ULP to a set of virtual addresses is unique across all the processes of the application. The difference is that MILLIPEDE threads keep their non-local data in shared memory, which need not be transferred explicitly at migration time. The memory usage of a thread triggers the migration of pages to its new location, meaning that only the data that is actually used by the migrated thread is transferred, on demand, thus decreasing the cost of migration in MILLIPEDE.

· Ariadne (Mascarenhas and Rego, 1996) is a user-space threads system that runs on shared- and distributed-memory multiprocessors. In contrast to Ariadne, MILLIPEDE uses operating-system supported threads (also called kernel threads in UNIX-like environments). The advantage of user-space threads is their relative portability, since they may be implemented on an operating system that does not support threads. In addition, context switching between user threads is faster than context switching between kernel threads. However, this may change in the future, since the next generations of processors may support thread context switching in hardware, thus making switching of kernel threads less expensive than that of user threads. There are two main disadvantages of user threads. First, a user thread that blocks on a page fault or a blocking system call causes its whole process to block; if kernel threads are used, a thread that blocks does not prevent other threads of the same process from running. Second, on an SMP kernel threads are scheduled by the operating system automatically on the available processors, whereas with user threads an application has to be modified explicitly in order to use multiple processors. In Ariadne, additional processes are created for this purpose, imposing high overhead. Similar to MILLIPEDE, thread migration in Ariadne is supported at user level in homogeneous environments. However, the mechanism of migration in Ariadne differs from that of MILLIPEDE. We further discuss Ariadne's thread migration and the problems associated with it in Section 3.

· Amber (Chase et al., 1989) is an object-oriented DSM system that permits a single application to use a homogeneous network of computers. Each node may be a shared-memory multiprocessor. Amber supports data and thread migration; the location of objects is managed explicitly by the application. The mechanism of thread migration is essentially the same as in MILLIPEDE. The difference is that with MILLIPEDE a programmer does not have to deal with data and thread location issues, since MILLIPEDE provides a location-independent interface and automatically improves the locality of data accesses at run-time.

The rest of this paper is organized as follows. In Section 2 we discuss our motivation for using preemptive multithreaded DSM systems. Section 3 discusses some global aspects of thread migration, explains the various approaches introduced so far for its implementation, and proposes a new approach for implementing thread migration in user space which is applicable on most existing operating systems. Section 4 gives an overview of the MILLIPEDE system and discusses its implementation of thread migration. Section 5 describes the way MILLIPEDE utilizes thread migration in order to share the load and improve the locality of memory references. Section 6 presents some measurements taken with the MILLIPEDE system in a non-homogeneous environment, essentially giving examples of possible performance improvements that are enabled by thread migration. Finally, Section 7 gives some concluding remarks.

2. Motivation and discussion

In this section we discuss the advantages of the DSM model combined with multithreading. We also explain the benefits of dynamic load distribution schemes and thread migration in multithreaded DSM systems.

2.1. Why DSM systems

Distributed Shared Memory (DSM) is an implementation of the shared memory paradigm on a physically distributed system (Keleher et al., 1994; Li and Hudak, 1989). Parallel programming in this model is easy, since DSM is a natural generalization of sequential programming. Furthermore, with a DSM it is relatively easy to parallelize sequential programs. In this model, components of an application communicate using a virtually shared memory. Local and remote data accesses are carried out in a way that is transparent to the programmer, serviced by the underlying DSM mechanism. This makes DSM applications both easier to develop and more portable (across DSM architectures) than programs that use explicit message passing. In particular, metacomputing environments which exhibit virtually shared memory (and may consist of the cooperation of large suites of various machines and resources) are at the same level of abstraction as multiprocessor machines with physically shared memory. In fact, the programming paradigms are at the same level of abstraction as that of a multithreaded uniprocessor machine.

With the rapid growth of the popularity and availability of SMPs, it is expected that more users will attempt to utilize the power of their machines by parallelizing their applications. This will lead to a growing set of available parallel applications. These applications will assume the convenient programming paradigm provided by their native multiprocessor machines, namely multithreaded parallel computing with shared memory that does not assume a dedicated machine. Given this expected large volume of applications, it is a natural step to provide this interface (including, in particular, the DSM) also on top of physically distributed metacomputing environments. Such metacomputing environments have the additional advantage over SMPs of being scalable to higher levels of parallelism.

2.2. Why dynamic load sharing

Load distribution is necessary in a distributed system in order to better utilize its computational power. Various load balancing and load sharing algorithms appear in the literature. In general, the purpose of load balancing is to split the work evenly among the processors, whereas the approach of load sharing algorithms is to ensure that no processor stays idle or lightly loaded while there are heavily loaded processors in the system.

Static load distribution strategies are effective when applied to problems that can be partitioned into tasks with uniform computation and communication requirements. An additional requirement of static algorithms is that the environment is homogeneous, i.e., all machines in the system should have identical hardware parameters (such as processor speed) and similar load resulting from other activities. There exists, however, a large number of problems with non-uniform and unpredictable computation and communication requirements. Also, machines in a non-dedicated network of computers (such as a metacomputing environment) will commonly differ in their speed and load state; some of them may even be unavailable at certain times. Therefore, dynamic load distribution is essential both for efficiently solving non-uniform problems and for solving uniform problems in a non-uniform environment. Thus, it seems that in a metacomputing environment applying either dynamic load balancing or dynamic load sharing is unavoidable.

The overhead imposed by dynamic load balancing in a large distributed system may outweigh its potential benefits, for the following reasons. First, equalizing the load among all nodes in the system requires large amounts of precise, global information concerning the state of all the machines. For fairly large systems this may violate the scalability requirement. Furthermore, when the overall system load is high, load balancing strategies will cause transfer of work from highly overloaded hosts to other hosts that are overloaded as well. This may improve performance in some cases, e.g., when iterations of a loop are scheduled on a uniform system. However, in an environment such as a network of workstations, this strategy will only impose additional overhead, and may even cause unstable behavior.

In contrast to load balancing algorithms, dynamic load sharing strategies have the potential of achieving resource utilization that is almost as good, at a much lower cost. Due to their relaxed requirements, load sharing algorithms may avoid the need for global information, using restricted local information only. The algorithm may do very well even if machines know the status of only some of the other machines in the system. Moreover, the information may be less precise than that needed for load balancing, and thus may be exchanged less frequently. Another advantage is that load sharing algorithms can be designed so that no overhead is imposed when all nodes in the system have enough work to do. This makes load sharing strategies potentially more efficient, especially in dynamically changing environments.

2.3. Why multithreaded DSM systems

Multiple threads within a process share its virtual address space. Threads are the basic entity to which the operating system allocates CPU time. On a multiprocessor system, executable threads are distributed among the available processors. Therefore, multithreading allows an application to take advantage of an SMP architecture by using all the processors of a node in a way that is transparent to the programmer and natural to a shared memory application. As long as the level of parallelism in the application exceeds that of the actual machine, the application need not be changed in order to utilize multiple processors. Modern operating systems balance the load among the processors of the machine when enough threads are available, and this load distribution need not be programmed in advance. In addition to better utilization of multiprocessor machines, using multiple threads also allows better load distribution over the network, when the level of parallelism provided by the application is sufficiently high.

In an environment that does not support threads, an application must be divided into multiple processes in order to be parallelized. However, the cost of communication, synchronization, and context switching between processes is much higher than that between multiple threads that share the resources of the same process. The reasons are that threads can exchange data efficiently using the shared virtual address space, that their context is small relative to that of a process, and that their working sets may overlap, so that in many cases context switching between threads does not cause swapping, while process switching would.

Some additional overhead may be imposed by multithreading due to the need to switch contexts. This switching commonly occurs on a remote access. The associated overhead is thus justified, as it implies that the time one thread is waiting for the remote access to complete (called the latency of the system) is overlapped by computation carried out by a different thread. In this way we avoid stalling the processor during remote accesses, which may be frequent in a large metacomputing environment. When kernel threads are used, this overlap of communication and computation is easy, natural and efficient, thanks to the automatic scheduling of the operating system.

Another advantage of using multiple threads is the reduced cost of migration. Migrating a thread is less expensive than migrating a process, since process migration requires transferring the entire virtual space of the process (Zayas, 1987), while migration of threads in a DSM system requires only the transfer of the memory occupied by the threads' stacks.

2.4. Why thread migration

Dynamic initial placement of threads solves part of the problems arising from a non-uniform problem or environment. However, additional performance improvement can be achieved by thread migration, for the following reasons:

1. Load may change quickly, causing poor utilization of processors. Therefore, redistribution of the load is necessary.

2. Poor initial placement of threads may cause large communication overhead. In a DSM system this happens when threads that are executing on different hosts are using the same data. In such a case, migrating these threads to one host turns the remote data accesses into local ones, thus reducing communication overhead.

For a network of personal computers, there is one more reason why migration is important. A user expects to receive the full resources of the machine he is using. Therefore, threads executing remotely should not degrade interactive response. To achieve this, threads should be executed remotely only on idle machines, and if a user returns before they finish, they should be stopped. The thread migration mechanism makes it possible to continue the execution of such threads on other hosts.


3. Designing thread migration in a DSM system

This section discusses problems that need to be solved when designing user-level thread migration on top of a non-distributed operating system. We make several assumptions about the underlying system. We consider operating systems that provide support for multithreading at the kernel level. Migration is supported only across machines with processors of the same architecture, running the same operating system. We also assume that a conventional compiler is used, so that no extra information about the threads' state is available.

We assume that thread scheduling is transparent to an application, i.e., a thread is not required to execute any additional code in order to be "migratable". Thread migration may occur at (almost) any moment during the thread's lifetime, and not only at predefined points where the thread checks whether it should migrate.

However, we do not assume that any thread can be transferred at any moment regardless of what it is doing. A programmer should be aware of the possibility of migration and follow certain guidelines when writing or porting his application. We strive to minimize the involvement of the programmer, but in some cases it is unavoidable. A programmer is responsible both for defining blocks of computation that can be performed in parallel, and for assisting in determining whether a thread can migrate safely.

3.1. Requirements from the operating system

The following services of the operating system are vital in order to support a user-level implementation of a combined DSM and thread migration mechanism:

· Protection of pages in virtual memory and exception handling on a protection fault.

· An interface for the creation and management of threads, including a mechanism for obtaining and updating a thread's state.

· A virtual address space that is arranged identically for each instance of an application. In particular, the code section resides at the same virtual addresses in each copy of a program.

· Some mechanism for resetting the location of threads' stacks. It should be possible to reserve a range of virtual addresses for the stack of a thread.

The reasons behind the above requirements are described below.
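To make these requirements concrete, the sketch below shows one way they map onto the Win32 API on Windows-NT, the platform used by MILLIPEDE. This is our own illustration, not MILLIPEDE code: the handler and job function names are invented, and a real DSM would typically rely on structured exception handling rather than the process-wide filter shown here.

#include <windows.h>

static DWORD WINAPI some_job(LPVOID arg)   /* placeholder thread function */
{
    (void)arg;
    return 0;
}

/* Page protection and fault handling: a DSM page can be protected with
   VirtualProtect, and an access violation caught in an exception filter. */
static LONG WINAPI dsm_fault_filter(EXCEPTION_POINTERS *ep)
{
    if (ep->ExceptionRecord->ExceptionCode == EXCEPTION_ACCESS_VIOLATION) {
        void *addr = (void *)ep->ExceptionRecord->ExceptionInformation[1];
        /* ...fetch the page from its current owner here... */
        DWORD old;
        VirtualProtect(addr, 4096, PAGE_READWRITE, &old);
        return EXCEPTION_CONTINUE_EXECUTION;
    }
    return EXCEPTION_CONTINUE_SEARCH;
}

int main(void)
{
    SetUnhandledExceptionFilter(dsm_fault_filter);

    /* Thread creation and management, with access to the thread state. */
    DWORD tid;
    HANDLE t = CreateThread(NULL, 0, some_job, NULL, CREATE_SUSPENDED, &tid);
    CONTEXT ctx;
    ctx.ContextFlags = CONTEXT_FULL;
    GetThreadContext(t, &ctx);            /* obtain the register state      */
    SetThreadContext(t, &ctx);            /* ...and update it before resume */
    ResumeThread(t);

    /* Reserving a range of virtual addresses, e.g. for a stack slot. */
    void *slot = VirtualAlloc((void *)0x30000000, 1 << 20,
                              MEM_RESERVE, PAGE_NOACCESS);
    (void)slot;
    return 0;
}

The fourth requirement, an identically arranged address space, is not expressed by any single call; it is a property of how instances of the application are built and loaded (see Section 4.4).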

3.2. Restrictions on thread migration

Here we describe the thread state and the problems that arise when a thread migrates, i.e., when a thread is stopped on one host and resumed on another in the same state. We identify the restrictions on the state of the migrated thread that are necessary to make the migration possible.

Thread state consists of global data and thread-specific information: stack contents, register values and operating system internal control information. In the DSM model all global data are assumed to be allocated in the shared memory, so if a thread migrates the data is transferred by the DSM when needed; depending on the memory consistency protocol, this might be done at migration time or only after an attempt by the migrated thread to access the data. On the other hand, the stack contents and the register values must always be transferred at migration time.

The stack and the registers may contain pointers to code, to global data, or to data in the stack. A potential problem is that these pointers may not have the same meaning on different hosts. Thus, it is necessary either to ensure that the pointers retain their meaning, or to provide some translation mechanism. We assume that program code is automatically placed by the operating system at the same virtual addresses in each copy of the program (see Section 4.3 for a discussion of some OS-specific aspects of this assumption and the resulting limitations on applications). DSM addresses also have the same meaning in each instance; the consistency of the DSM data for a mobile thread is maintained by the DSM mechanism, as discussed in Section 4.3. The last problem that must be treated is pointers to data in the stack. We discuss this problem in detail in Section 3.3.

Another important issue is the usage of system calls. A user cannot access the internal control information of the operating system, so it cannot be updated or transferred when a thread migrates. Therefore, a thread that owns system resources cannot migrate. For example, a thread that entered a critical section (using the corresponding system call) and has not yet left it owns the critical section object; its migration in this state would prevent other threads from entering the critical section. Releasing the critical section on the destination host would not make much sense, because in a non-distributed operating system object handles are meaningful only on the host they were created on. It might be possible to redirect such calls to the relevant machines, but this would require redefinition of all the system calls and, in addition, would increase the cost of remote execution.

Many system calls (especially those used for synchronization) cannot be used explicitly at user level in a distributed system that supports thread migration, because the location of a job may change at any time. For example, jobs cannot communicate via pipes, since they have no information about each other's location. Even if they do have such information, the location of a job may change after a message was sent to it and before it arrives. Thus, some other mechanism of synchronization is necessary in such systems. Using the DSM for this purpose might be extremely ineffective; for example, implementing a critical section using shared variables inevitably involves busy-waiting and, in addition, imposes the high communication overhead associated with keeping these variables consistent.

3.3. Implementing thread migration

We now describe and discuss several approaches to the problem of transferring the stack contents of a migrating thread.

3.3.1. A simple approach that fails

The approach described here is used, for instance, in the Ariadne system (Mascarenhas and Rego, 1996). We will describe the method itself and the problems that it may cause, and try to explain why and when it works. The method is as follows. When a thread migrates, the contents of its stack are copied on the destination machine to addresses that might differ from the addresses on the source machine. Let us call stack self-references those pointers that reside in the stack and reference data that also resides in the stack. These self-references, as well as the stack pointer and the frame pointer, have to be translated when the stack is moved to different addresses. The offset used for this purpose is the difference between the stack bottom address on the origin machine and the stack bottom address on the destination machine.

The stack contains two types of self-references: saved frame pointers and addresses of stack data. The latter may reside in the stack in several ways: as parameters to functions, as values of local variables, as values of saved registers, or as intermediate values used by the compiler. The method suggested in (Mascarenhas and Rego, 1996) is to identify such references and update them (details are not provided). Saved frame pointers are easily identifiable, so they can be updated correctly. The problem is that addresses of local data in the stack cannot be identified in the general case. They may be anywhere in the stack; the data in the stack may even be misaligned (if compiler alignment must be disabled for some reason). The only way such addresses might be updated without additional information is to prohibit the use of data types, such as char, that may cause misalignment in the stack, to examine the value of each aligned entry in the stack and update it if it may be a stack self-reference, and to hope that the updated value was not a non-pointer that accidentally looks like a pointer to stack data.
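The following sketch is our own reconstruction of such an offset-based scan, not Ariadne's actual code. It makes the weakness concrete: any aligned word whose value happens to fall inside the old stack range is "translated", whether or not it is really a pointer, while misaligned pointers are missed entirely.

#include <stddef.h>
#include <stdint.h>

/* Naive stack relocation (reconstruction, for illustration only):
   copy the stack to a new base and patch every aligned word that looks
   like a pointer into the old stack range.                              */
void relocate_stack(uintptr_t *new_stack, const uintptr_t *old_stack,
                    size_t words, uintptr_t old_bottom, uintptr_t old_top)
{
    ptrdiff_t offset = (ptrdiff_t)((uintptr_t)new_stack - old_bottom);

    for (size_t i = 0; i < words; i++) {
        uintptr_t v = old_stack[i];
        /* Heuristic: anything in [old_bottom, old_top) is treated as a
           stack self-reference.  A plain integer (e.g. a millisecond
           timestamp) whose value happens to fall in this range is
           silently corrupted, and misaligned pointers are missed.       */
        if (v >= old_bottom && v < old_top)
            v = (uintptr_t)(v + offset);
        new_stack[i] = v;
    }
}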

Consider the following example. Suppose that the nodes in the system use perfectly synchronized clocks, and an application orders events of some type using timestamps. A thread performs an operation get_time that returns the number of milliseconds that have passed since some predefined moment. The thread stores the obtained value in a local variable t that resides in its stack. At this point the thread is preempted, and later it migrates to another host. If the value of t is in the range of stack addresses of the thread, it will be updated as if it were a stack reference. If the thread now stores the variable t as the timestamp of an event, the event ordering may become incorrect.

Another problem with this approach is that general purpose registers may also contain pointers to stack data. It is claimed in (Mascarenhas and Rego, 1996) that this occurs only when compiler optimizations are used, but this claim is clearly incorrect. Thus, the values of registers must be updated too, raising the same problem as that of identifying references to stack data.

We believe that, with this method, translating the state correctly in the general case is impossible without compiler support. A natural question to ask is how this method works in the systems that use it. The answer is that the probability of correct operation is high, given that only aligned data is used, that migration is initiated by the migrating thread itself (thus eliminating the problem of temporary addresses in registers), and that there are few non-pointer values on the stack (which might accidentally look like stack addresses). However, these limitations do not guarantee correctness of the state translation in the general case.

3.3.2. A popular approach

Since translating the pointers is impossible without extensive compiler support, it is necessary to ensure that the pointers retain their meaning after migration. To achieve this, the segment of virtual memory occupied by the stack on one host is reserved for it on all other hosts, so that the stack contents can be copied to the same addresses when a thread migrates.

The popular method (incorporated, e.g., in (Chase et al., 1989; Casas et al., 1994; Dubrovski et al., 1997)) for reserving memory for stacks is as follows. A region of virtual memory starting at a predefined location is reserved for the threads' stacks on every host; each thread is assigned a unique identification number that is used to find the thread's slot in the stack region. A newly created thread is forced to use the proper slot as its stack. Moreover, this slot can be allocated from the DSM, so the stack need not be explicitly transferred: it is brought transparently by the DSM mechanism when needed. This method is very easy to implement when user-level threads are used, since the programmer has control over the locations of the thread stacks. With kernel-level threads the situation is more difficult, since in this case the stacks are usually allocated by the operating system. This problem may be solved in the following way. The register context of a newly created thread is changed so that the thread will use the proper slot instead of the original stack; this is performed before the thread starts executing and before any values are written into the stack.
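With the Win32 API and the x86 CONTEXT layout, this trick might be coded roughly as follows. The sketch is ours: the slot address and job entry point are hypothetical, and a production system would also have to commit the slot and cope with the stack limits registered by the operating system.

#include <windows.h>

static DWORD WINAPI job_entry(LPVOID arg)   /* hypothetical job function */
{
    (void)arg;
    return 0;
}

/* Create a kernel thread and move its stack pointer into the slot
   reserved for it in the shared stack region before it ever runs.     */
HANDLE create_thread_in_slot(void *slot_top)
{
    DWORD tid;
    HANDLE t = CreateThread(NULL, 0, job_entry, NULL,
                            CREATE_SUSPENDED, &tid);
    CONTEXT ctx;
    ctx.ContextFlags = CONTEXT_CONTROL;
    GetThreadContext(t, &ctx);
    ctx.Esp = (DWORD)(ULONG_PTR)slot_top;   /* redirect the stack pointer */
    SetThreadContext(t, &ctx);
    ResumeThread(t);                        /* the first instruction already
                                               executes on the new stack   */
    return t;
}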


This method can be used in many systems; however, it has a serious disadvantage: it relies on the assumption that the operating system's behavior does not depend on the initial location of thread stacks. This is not the case for some existing systems; for example, the Windows-NT operating system checks the validity of a thread's stack pointer in certain cases, and if it decides that the stack pointer is illegal it simply terminates the program. It will certainly decide so if a thread uses a stack at a location other than the original one (registered by the operating system). Moreover, even if the assumption above holds for some operating system, it may be violated in future versions. Thus, this approach lacks portability.

3.3.3. Our approach

We solve the problem described above by using stacks allocated by the operating system, while ensuring that these stacks occupy the same addresses on all hosts.

A user application defines blocks of code that can be executed in parallel. These blocks are called jobs. The jobs are executed by separate threads. Instead of creating a thread each time a new job is spawned in a user program, a predefined number of threads, called workers, are used to receive jobs and execute them. The workers are created on each host at initialization time and run until the application completes. Since the virtual space of all copies is initially arranged identically and all instances perform their initialization in the same way, the copies of the same worker running on different hosts get their stacks at the same addresses. In this way the addresses are reserved for the stacks. A job that was already started by worker i can be executed on any copy of this worker, i.e., on worker i on any other host. To make sure that migration is always possible, at most one copy of each worker is executing a job at any given time. All idle workers are suspended.
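A schematic worker pool, with names of our own choosing, might look as follows. For simplicity the sketch parks idle workers on an event rather than literally suspending them as MILLIPEDE does.

#include <windows.h>

#define NUM_WORKERS 64            /* fixed at initialization on every host */

typedef void (*job_fn)(void *);

struct worker {
    HANDLE thread;                /* kernel thread created at startup        */
    HANDLE wakeup;                /* signalled when a job is assigned        */
    job_fn job;                   /* job to run; NULL while the worker idles */
    void  *arg;
};

static struct worker workers[NUM_WORKERS];

/* Worker i is created in the same order during initialization on every
   host, so its stack lands at the same virtual addresses in every
   instance of the application.                                           */
static DWORD WINAPI worker_main(LPVOID p)
{
    struct worker *w = p;
    for (;;) {
        WaitForSingleObject(w->wakeup, INFINITE);  /* idle until assigned */
        w->job(w->arg);                            /* run one user job    */
        w->job = NULL;
    }
}

void init_workers(void)
{
    for (int i = 0; i < NUM_WORKERS; i++) {
        workers[i].wakeup = CreateEvent(NULL, FALSE, FALSE, NULL);
        DWORD tid;
        workers[i].thread = CreateThread(NULL, 0, worker_main,
                                         &workers[i], 0, &tid);
    }
}

/* Hand a job to worker i, e.g. when a job is spawned locally or arrives
   by migration and this copy of worker i is the one chosen to run it.   */
void assign_job(int i, job_fn job, void *arg)
{
    workers[i].job = job;
    workers[i].arg = arg;
    SetEvent(workers[i].wakeup);
}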

As with the previous approach, the number of threads that can be created simultaneously on all nodes in the system is limited, since the threads share a single address space across all hosts. If this limit is too low, a single application will not benefit from a massively parallel architecture. However, this problem will be eliminated when 64-bit architectures come into use, since then the limit on the number of threads will be high enough.

4. Architecture of the MILLIPEDE system

In this section we give a brief overview of the MILLIPEDE DSM system and its relation to the MILLIPEDE thread migration mechanism. MILLIPEDE is a user-level implementation of a multi-threaded DSM system with transparent page and job migration. The current implementation of MILLIPEDE is on the Windows-NT operating system, and it employs our proposed design for thread migration.

4.1. Assumptions

MILLIPEDE was designed for the following type of environment and applications:

· Homogeneous environment. A network consisting of machines with processors of the same architecture is assumed, so that the representation of the program's state is the same on all machines. Processors may differ in their speed. The network may include SMP machines.

· Coarse granularity. The overhead associated with creating a thread and with initiating remote execution is relatively high. Therefore, a thread should have a sufficient amount of computation to do in order to justify the cost of its creation or migration. Thus, we assume that the expected lifetime of a thread is relatively long.

· Unpredictable computation and communication requirements. Requests for the execution of one or more threads arrive with arbitrary timing. No assumption is made about the memory access patterns of the threads. No a priori knowledge is assumed about the relative amounts of communication and computation used by the applications.

4.2. System overview

Each machine in the system runs a MILLIPEDE Daemon: a process that is in charge of collecting and disseminating load information, of managing MILLIPEDE applications, and of dynamic load sharing (Fig. 1).

The MILLIPEDE package includes two libraries: the DSM and the MGS. The DSM library provides an interface for allocating DSM memory and keeping it consistent; it supports various memory consistency protocols (see (Itzkovitz and Schuster, 1997)). The Migration Server (MGS) library provides an interface for creating multiple parallel activities and for managing them; it controls their locations and performs the migration.

Fig. 1. MILLIPEDE structure.

MILLIPEDE applications are written in a parallel language independent of the underlying operating system, of the number of available processors, and of the data and thread locations. Currently ParC (Ben-Asher et al., 1996) and ParC++ (Beery et al., 1997) (natural parallel extensions of C and C++) are supported, enhanced with flexible mechanisms for weak sharing of variables (porting is underway for ParFortran90 and Java). A ParC++ program is precompiled; the resulting C++ code is compiled using a conventional C++ compiler and is linked with the DSM and MGS libraries. The libraries are independent of the ParC++ language constructs; they provide an interface that allows the implementation of any similar precompiler for any other language. It is also possible to write an application in a conventional language and use the libraries directly; this (less convenient) way may be used if the application requires some exotic synchronization method that is not supported by the existing precompilers. The interface is further explained in (Itzkovitz et al., 1997a).

A MILLIPEDE application consists of instances (copies) of a user program running on different nodes in the system (see Fig. 2). If a node is an SMP, all available processors are used in a transparent way by a single instance of the application. Instances of an application share a single virtual space. They communicate in a location-independent way via the DSM mechanism and synchronize using the MGS primitives.

An instance of an application consists of the following parts (Fig. 2):

· A pool of workers: system threads that receive user jobs and execute them.

· Memory manager: threads needed to keep the DSM consistent.

· Migration Server (MGS): threads that take part in the decisions whether to migrate, and handle the migration of jobs to and from the host.

4.3. The way migration works in MILLIPEDE

In Section 3.2 we stated some requirements on the operating system and the applications which should be satisfied in order to make thread migration possible. Here we check that MILLIPEDE complies with these requirements and discuss their impact on MILLIPEDE applications.

4.4. Application code addresses

Our basic assumption is an identical virtual memory layout in all instances of an application. In particular, we assumed that the code of an application is automatically placed by the operating system at the same virtual addresses in each copy of the program. However, in the Windows-NT operating system this is not guaranteed in the general case. The problem is described below.

"Conventional" application code will always be loaded to the same addresses by the same version of the operating system. However, a problem might appear due to dynamic loading of dynamic-link libraries (DLLs). If a DLL is loaded at run-time, it might not be loaded to the same addresses in each copy of the program. Since DLLs are widely used in NT and, most importantly, all system services are accessed by an application via system DLLs, this might potentially make thread migration impossible. Fortunately, DLLs are typically loaded statically. Static loading is the standard mechanism, invoked when a program is built with import libraries referring to the DLL. In this case the application's executable image file contains the import library; the corresponding DLL is linked implicitly. The operating system loader, upon starting the process, loads all DLLs that are found in the executable image in the order they appear there. Thus, mapping to the same addresses is guaranteed (given that the DLL versions are the same).

We assume that MILLIPEDE applications do not use explicit linking of DLLs (i.e., dynamic loading). Note that if for some reason dynamic loading is desirable, a user may still use it (at his own risk). He may specify the DLL base address explicitly (at link time), thus forcing the operating system to load the DLL to predefined addresses. It is the user's responsibility to ensure that the required address space is available.

Thus, if an application does not load DLLs dynamically, or uses explicit DLL base addresses as described above, it is guaranteed that the virtual memory layout is identical for each instance of the application.

Fig. 2. Structure of a MILLIPEDE application.

4.4.1. Thread state

We made some assumptions to ensure that the thread state can be transferred between different machines. We assumed that all global data is allocated in the DSM area and that the "private" data of a thread is allocated in the DSM area or on the thread's stack. Static data, or data allocated in the non-DSM area, cannot be transferred when migrating a thread, and is therefore not allowed for a "mobile" thread. This limitation is relatively easy to satisfy in the application code. However, it is impossible to know whether it is satisfied in system or "third-party" libraries. They might allocate memory on the application heap (which is not managed by the DSM) or use static data, thus making it impossible to transfer the thread state correctly.

A similar problem arises from our restriction on using system calls. The solution for MILLIPEDE applications is described briefly in Section 4.6.2. Once again, a third-party library evidently does not use this solution and might still use system calls, thus violating our requirements: it might internally allocate some operating system resources associated with a thread. For example, it might store information on a per-thread basis using thread local storage (a widely used Win32 interface).

Hence, one simply cannot know whether using an external library is safe: it might maintain some non-migratable thread state (either within the scope of one function or even between calls). It is the programmer's responsibility to decide which libraries can be used safely by a mobile thread.

4.4.2. DSM consistency

We assumed that when a thread migrates, the DSM data is transferred by the DSM mechanism when needed. If a strong memory consistency model is used, the data need not be transferred at migration time, because this is done transparently when an attempt to access the data is made on the target host; if the data is already present there, then it is always synchronized with the data on the original host. However, under other consistency models this might not hold: the data on the target host might be older than the data on the original host, so that for this thread the logical order of changes to the data might not be preserved. If the model does not allow this, the DSM mechanism should perform the necessary memory synchronization before the migrating thread is resumed; to make this possible, the MGS should notify the DSM that a migration is taking place and wait until all synchronization is done.

Currently this mechanism is not implemented in MILLIPEDE, so thread migration cannot be used together with relaxed memory consistency (strong consistency is mandatory for mobile threads). We plan to implement it in the future, and we are investigating ways to optimize the memory synchronization for different consistency protocols. For example, if weak consistency (Itzkovitz and Schuster, 1997) is used, a page should be synchronized only in the case that the target host has a read-only copy of the page while there exists a writable (possibly more up-to-date) copy in the system. In this case the DSM mechanism on the target host can simply "forget" about its copy; when some thread on this host tries to access the page, the DSM will bring the most up-to-date copy, so it is guaranteed that the migrated thread will not "see" old values after new ones.
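A minimal sketch of this planned optimization, under our own assumed page-table layout and names (the text describes it only as future work), might look as follows.

#include <stddef.h>

enum page_state { NOT_PRESENT, READ_ONLY, WRITABLE };

struct page_entry {
    enum page_state state;              /* state of the local copy          */
    int writable_copy_elsewhere;        /* another host may hold newer data */
};

/* Called on the target host for each page before a migrated thread is
   resumed.  Dropping a possibly stale read-only copy forces the next
   access to fault and fetch the most up-to-date copy via the DSM, so
   the migrated thread never sees old values after new ones.            */
void sync_page_for_migration(struct page_entry *p)
{
    if (p->state == READ_ONLY && p->writable_copy_elsewhere) {
        p->state = NOT_PRESENT;
        /* in a real implementation the page would also be re-protected,
           e.g. with VirtualProtect(..., PAGE_NOACCESS, ...)             */
    }
}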

4.5. Relation between the DSM and the MGS libraries

The thread migration mechanism in MILLIPEDE is based on the assumption that all non-local data used by a thread resides in the DSM. Therefore, when a thread migrates, only its stack and context need to be transferred by the MGS; the issues of DSM synchronization are left to the DSM, which the MGS invokes when necessary.

The MGS collects information provided by the DSM to make decisions on migration, and in some cases also affects decisions of the DSM mechanism, as described below. The DSM mechanism passes to the MGS information about remote page accesses. The MGS uses this information to determine whether threads should be redistributed to decrease communication. In some cases the MGS may affect the behavior of the DSM by advising it to lock a page on the local host for a short time. In this way it is possible to stabilize the system when remote data accesses are causing high communication overhead but thread migration is not possible.

The MGS also uses the DSM mechanism to store part of the necessary information. The MGS of each instance should keep track of the location of each running thread. Since a thread may migrate several times, keeping this information consistent on each host may be expensive. We solve this problem simply by using the DSM to store the threads' locations.
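A minimal sketch of this idea, assuming a hypothetical dsm_alloc() call for allocating shared memory (the actual MILLIPEDE DSM interface is described in (Itzkovitz et al., 1997a)), is shown below.

#include <stddef.h>

void *dsm_alloc(size_t size);    /* hypothetical DSM allocation call */

#define MAX_JOBS 1024

/* A single array in the DSM maps each job id to the host currently
   running it.  Because the array itself lives in shared memory, the
   DSM keeps it consistent; no per-host copies are maintained.        */
static int *job_location;        /* job_location[job_id] == host id */

void init_job_table(void)
{
    job_location = dsm_alloc(MAX_JOBS * sizeof(int));
}

void note_job_moved(int job_id, int new_host)
{
    job_location[job_id] = new_host;   /* visible to every instance */
}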

4.6. Thread migration in MILLIPEDE

4.6.1. Migration policy

Thread scheduling in MILLIPEDE is transparent to an application. A computation thread (that is defined by the programmer as mobile) may be suspended at any moment and resumed on another host.

Thread migration occurs in the following cases:

· An overloaded node sends work to an underloaded one to decrease load imbalance.

· Threads that are causing high communication overhead are brought together.

· Remote threads are evicted from a machine when its native user starts working on it.

MILLIPEDE uses the history of remote page accesses for making decisions on migration, where the objective is to minimize the amount of communication. The MGS "learns" about the communication patterns of the threads by recording remote page accesses. What makes things interesting is that, for performance reasons, information about local accesses is not recorded. Thus, the knowledge about the communication pattern is incomplete, and incorrect decisions may be taken. For example, in the case that all threads which frequently access the same page are running on the same host, the MGS is not (initially) informed about it, so it may choose one of these threads for migration to another host. This will be a poor decision, since it will cause repeated transfers of this page between the hosts (a page ping-pong). However, information about this page will then become available, making it possible to correct the decision; the obtained information is kept in order to improve future decisions. A detailed description of the algorithms used to decide that a migration will take place and to select the threads that will migrate can be found in (Schuster and Shalev, 1997).
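As an illustration only (the actual data structures and algorithms are those of Schuster and Shalev, 1997, not the sketch below), the per-page history and a simple selection heuristic might look like this.

#include <stddef.h>

#define MAX_PAGES   4096
#define MAX_THREADS 64

/* Per-page record of remote accesses as seen by the MGS.  Local
   accesses are deliberately not recorded, so the picture is partial
   and a migration decision may later have to be revised.             */
struct page_history {
    unsigned remote_faults[MAX_THREADS]; /* remote faults per local thread */
};

static struct page_history history[MAX_PAGES];

/* Called by the DSM whenever servicing a local thread's access required
   fetching the page from a remote host.                                 */
void mgs_record_remote_access(size_t page_index, unsigned thread_id)
{
    history[page_index].remote_faults[thread_id]++;
}

/* One candidate heuristic: the local thread that caused the most remote
   faults on a page is the one to consider moving toward the page.       */
unsigned mgs_heaviest_user(size_t page_index)
{
    unsigned best = 0;
    for (unsigned t = 1; t < MAX_THREADS; t++)
        if (history[page_index].remote_faults[t] >
            history[page_index].remote_faults[best])
            best = t;
    return best;
}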

4.6.2. Migration implementation

Thread migration is implemented at user level in Windows-NT, using the standard Win32 API. The same implementation may be used in the Windows-95 environment as well. As explained in Section 3.3.3, a pool of workers is used, where workers are threads that receive user jobs and execute them (Fig. 2). The workers are created in each instance of an application when the MGS library performs its initialization; they run until the application completes. The copies of the same worker running on different hosts get their stacks at the same addresses (Fig. 3); therefore a job that was already started by worker i can be executed on any copy of this worker, i.e., on worker i on any other host. To make sure that migration is always possible, at most one copy of each worker is executing a job at any given time. All idle workers are suspended.

The problem of using system calls is solved by providing a location-independent interface and by migrating only those jobs that do not own operating system resources and are executing user-level code, so that their state can be reproduced on another host. The details follow.

Jobs are not allowed to use system calls explicitly unless they notify the MGS. Suppose a job wants to display some data in a graphics window. Then it cannot migrate from the moment it starts to create the window until it finishes closing it. The MGS must be informed about this; otherwise it may choose this job as a candidate for migration. Therefore, before it performs location-dependent activities, a job must notify the MGS. The MGS library provides functions to disable and re-enable migration, as illustrated below. These functions may be used at the language-implementation level to prevent migration when executing location-sensitive code. Note that a typical computation-intensive application (which is the most natural candidate for porting to MILLIPEDE) will rarely need to use these functions explicitly.
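The sketch below only shows how a job might bracket location-dependent code; the function names are hypothetical, since the paper does not list the actual MGS calls or the application's helpers.

#include <windows.h>

/* Hypothetical prototypes: the migration-control calls of the MGS
   library and the application's own plotting helpers.                */
void milli_disable_migration(void);
void milli_enable_migration(void);
HWND create_plot_window(void);
void plot(HWND w, const double *results, int n);

/* A job that must not migrate while it owns a host-local window. */
void show_results(const double *results, int n)
{
    milli_disable_migration();      /* the window handle is host-local */
    HWND w = create_plot_window();
    plot(w, results, n);
    DestroyWindow(w);
    milli_enable_migration();       /* the job is migratable again     */
}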

As was shown in Section 3.2, jobs cannot synchronize using the operating system interface. Therefore the MGS provides a general mechanism for inter-mobile-job communication, called MILLIPEDE Job Event Control (MJEC), which is described in detail in (Itzkovitz et al., 1997a). MJEC solves the problem of obtaining job locations by storing them in a shared array (that resides in the DSM). MJEC can be used to easily implement all the known synchronization protocols (semaphores, barriers, condition variables, monitors, etc.) in a location-independent way. Together with some basic interface functions that are used for creating and managing jobs, the interface supplied by MJEC is very flexible and powerful. It is designed to support the convenient implementation of various parallel languages, where the implementation is independent of the operating system and of location issues (see (Itzkovitz et al., 1997a)).

Fig. 3. Example of the virtual memory of a MILLIPEDE application running on 3 hosts. Job 1 is executed on instance 0; Job 2 is executed on instance 2; worker N is free.

The global design of MILLIPEDE makes it possible for the MGS to transfer a job in an extremely simple way. The MGS of the sender instance suspends a job and, if migration is enabled for this job, sends its worker id, context and the contents of its stack to the MGS of the receiver instance; otherwise it resumes the job locally. The design ensures that the proper worker on the receiver instance is idle and that its stack resides at the same addresses as on the sender instance. Thus, the MGS of the receiver instance simply copies the stack of the job and its context to the proper worker and resumes it.
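A simplified sketch of the two sides of such a transfer, using Win32 thread-control calls, a hypothetical transport routine and a stripped-down variant of the worker table from Section 3.3.3, might look as follows.

#include <windows.h>
#include <string.h>

/* Hypothetical transport routine: delivers the job to the MGS of the
   destination instance.                                               */
void send_to_instance(int dest_host, int worker_id, const CONTEXT *ctx,
                      const void *stack, size_t stack_size);

struct worker_slot {
    HANDLE thread;
    void  *stack_base;            /* identical address on every host */
    size_t stack_size;
};
static struct worker_slot worker_slots[64];

/* Sender side: freeze the job, capture its register context, and ship
   the worker id, the context and the stack contents to the receiver.  */
void migrate_job(int worker_id, int dest_host)
{
    struct worker_slot *w = &worker_slots[worker_id];
    SuspendThread(w->thread);

    CONTEXT ctx;
    ctx.ContextFlags = CONTEXT_FULL;
    GetThreadContext(w->thread, &ctx);

    send_to_instance(dest_host, worker_id, &ctx,
                     w->stack_base, w->stack_size);
    /* the local copy of the worker simply stays suspended (idle) */
}

/* Receiver side: the matching worker is idle and its stack occupies the
   same addresses, so restoring the job is a copy plus a resume.        */
void receive_job(int worker_id, const CONTEXT *ctx,
                 const void *stack, size_t stack_size)
{
    struct worker_slot *w = &worker_slots[worker_id];
    memcpy(w->stack_base, stack, stack_size);
    SetThreadContext(w->thread, ctx);
    ResumeThread(w->thread);
}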

4.7. MILLIPEDE daemons

MILLIPEDE daemons are in charge of dynamic load sharing. They collect and disseminate load information, identify idle workstations, and distribute the MILLIPEDE applications over these machines.

A MILLIPEDE daemon consists of the following modules.

Idle Detector. This module checks whether the host is idle, i.e., it is not being used interactively by its owner and the load caused by non-MILLIPEDE background processes is low. Each time the host becomes idle or non-idle, the Idle Detector notifies the Eviction Server.

Eviction Server. This module initiates and stops the eviction of foreign applications from the host. It uses information maintained by the Application Info Manager and notifies each local instance of a foreign application when it should start or stop the eviction of its local jobs.

Application Info Manager. The Application Info Manager collects administrative information about each MILLIPEDE application running on the host, such as its unique identifier, instance identifier, master host id, system process handle, and so on.

Local Load Info Manager. It collects load information about each application running on the host, such as the number of local jobs and the number of mobile local jobs. It also determines the global load state of the local host.

Masters. A separate manager, called a Master, is created for each new application that was started locally. The master is in charge of distributing its application over the network.

Master Info Manager. This module collects master information about each application that was started locally, e.g., the status of the application on each host. It also collects global load information about all hosts in the system. The information maintained by this module is used only by masters; therefore the module is activated only if there exist applications that were started locally, i.e., only if masters are running locally.

Communication. This module is used for communication with other Daemons and with the local MGSs of MILLIPEDE applications.
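The composition of a daemon can be summarized in an outline such as the following. The class and member names simply mirror the module names above; they are illustrative and are not MILLIPEDE's actual data structures.

#include <vector>

// Empty placeholder types; each would hold the state described above.
struct IdleDetector {};        struct EvictionServer {};
struct ApplicationInfoMgr {};  struct LocalLoadInfoMgr {};
struct Master {};              struct MasterInfoMgr {};
struct Communication {};

struct Daemon {
    IdleDetector        idle_detector;    // idle / non-idle transitions of the host
    EvictionServer      eviction_server;  // starts and stops eviction of foreign jobs
    ApplicationInfoMgr  app_info;         // administrative records per application
    LocalLoadInfoMgr    local_load;       // per-application job counts, host load state
    std::vector<Master> masters;          // one Master per locally started application
    MasterInfoMgr       master_info;      // global load view, consulted by the masters
    Communication       comm;             // peer daemons and the local MGSs
};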

4.8. Parallelism vs. communication in MILLIPEDE

There are several aspects to the parallelism-communication trade-off in the MILLIPEDE system. MILLIPEDE supports running multiple applications simultaneously. Our objective is to strike a balance between maximum parallelism and minimum communication. Since different applications do not communicate with each other, they should run on different hosts whenever possible. On the other hand, the communication between the components of an application should be minimized without causing load imbalance.

Control over parallelism and communication is exercised cooperatively by the Daemons and the MGSs in the following way. The Daemons distribute the applications over the network and determine the initial number of jobs of each application on each host. They strive to find an optimal assignment of hosts to applications, that is, to achieve sufficient load sharing using a minimal number of application instances.

The mission of the MGSs is to optimize the communication within the application with respect to the decisions of the Daemons. That is, the MGSs try to minimize the amount of communication caused by the DSM mechanism without breaking the load balance achieved by the Daemons.

The algorithms used by the MGSs to minimize the communication are described in (Schuster and Shalev, 1997). In Section 5 we describe in detail the algorithms used by the Daemons to distribute an application over the network.

5. Distributing an application

The daemons are in charge of determining the set of hosts on which an application executes. The objective is to run different applications on different hosts whenever possible, since different applications do not communicate, while the components of the same application do. Therefore, if an underloaded host already runs a certain application, the algorithm tries to send it additional jobs of the same application. Only if this is not possible, or if the host is not executing any application, may a new application instance be started on this host.

Each application is executed on a subset of the available hosts. Initially only the main copy of the application is created; if, as a result, the local host becomes overloaded, and the overall load of the system is sufficiently low, additional copies of the application are eventually created on underloaded hosts. The reverse process is initiated when the load decreases so that more than one underloaded host is running the application. In this case two such hosts are chosen; the application copy on one of them is forced to migrate its jobs to the other copy and is then disabled, in order to make it possible for another application to use this machine.

5.1. Information policy

In order to achieve speedups, only idle or slightly loaded hosts should be used for remote execution. Therefore, a certain amount of global information about the hosts' load states is needed to decrease the number of incorrect decisions. However, maintaining exact information about all hosts in the system is extremely expensive. Therefore each host reports only significant changes in its load state to its peers. In addition, since the decisions on distributing an application are made by its master, only master hosts need the global load information. Thus, each host that becomes a master or stops being a master reports this change to all the hosts in the system; each host maintains a list of all master hosts and reports all significant changes in its load state to them. Since coarse-grain applications are assumed, the state of the hosts is expected to change seldom, so the overhead associated with this policy is relatively low.

5.1.1. Load indicator

The load state of a host is determined by two factors. The first is the CPU utilization by non-MILLIPEDE processes and the presence of interactive work on the machine. In order to provide the user-ownership feature, MILLIPEDE avoids using a host for remote execution if its native load (caused by non-MILLIPEDE applications) is high or if the host is used for interactive work. The second factor is the load caused by MILLIPEDE applications. Since we assume CPU-intensive applications, the load state of a host is determined by the total number of MILLIPEDE jobs running on the host. Other possible sources of information are:
· The number of jobs waiting for synchronization (e.g., when using ParC++ statements such as sync).
· The CPU time consumed by the jobs and by paging or migration of threads.
· Memory utilization.
· Network utilization.
Using these factors to determine the load would possibly provide a better estimate. However, it would impose higher overhead, so it would not necessarily improve the speedups. Since we concentrate on other issues in this research, the problem of load indexes is still open in MILLIPEDE.

5.1.2. Host states

The state of a host is determined according to the two load factors described above. If the background load is too high or the machine's owner is working interactively on it, the machine becomes evicting. In this case, regardless of the second state component, namely the load of the host, the machine is not used for remote execution (this may change in future implementations). However, if a MILLIPEDE application was started locally, it may still run on both the remote and the local machines; e.g., if the local machine is overloaded (i.e., there are too many local MILLIPEDE jobs), the masters of the local applications will try to migrate part of these jobs to other machines.

The host load state is determined according to the load that is caused by MILLIPEDE applications. Two thresholds, low and high, are used to evaluate the host load state. We call a host underloaded if its load is below the low threshold, overloaded if the load is above the high threshold, and normal otherwise. The thresholds are constant for each host and depend only on its hardware parameters, such as the number of processors and their speed.
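The two-threshold classification can be written down directly; the enum and function names below are illustrative, and the load is taken, as in the text, to be the number of MILLIPEDE jobs on the host.

enum class LoadState { Underloaded, Normal, Overloaded };

// low_threshold and high_threshold are the per-host constants derived from the
// hardware parameters (number of processors, their speed, ...).
LoadState classify_load(int millipede_jobs, int low_threshold, int high_threshold)
{
    if (millipede_jobs < low_threshold)  return LoadState::Underloaded;
    if (millipede_jobs > high_threshold) return LoadState::Overloaded;
    return LoadState::Normal;
}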

Since the algorithm strives to execute different applications on different hosts, the master hosts should also have some additional information. They should be able to determine which applications are running on which host, and whether they are enabled or disabled (as explained below). Thus, the state of a host is also characterized by the number and the type of applications that are using it.

5.1.3. Information dissemination

A new host receives the list of all the master hosts from a host that is chosen dynamically when the system comes up. It then sends the first state message to all the master hosts. Additional state updates are sent each time the host state (as defined above) changes. The state update message contains the load of the host and the list of applications using it. For each such application the message contains its type (enabled or disabled) and the corresponding number of jobs. Note that a change in the number of jobs is not reported immediately. Rather, it is reported with the regular state updates to all masters. In addition, when a particular application crosses a load threshold, an update is sent to the master host of this application.
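A possible layout of such a state update message is sketched below; the field names and types are assumptions made for illustration, not MILLIPEDE's actual wire format.

#include <cstdint>
#include <vector>

struct AppEntry {
    std::uint32_t app_id;       // unique application identifier
    bool          enabled;      // enabled or disabled copy on this host
    std::uint32_t job_count;    // number of this application's jobs on the host
};

struct StateUpdate {
    std::uint32_t host_id;
    std::uint32_t load;         // MILLIPEDE-induced load (total number of jobs)
    bool          evicting;     // owner is active or background load is too high
    std::vector<AppEntry> apps; // one entry per application using the host
};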

The number and the type of applications running on a host are not expected to change frequently; only these changes and significant load changes are reported to the master hosts. In addition, we assume that the expected job lifetime is high and the number of workstations is not too large. Therefore the overhead imposed by the policy described above is relatively low. This overhead might be further reduced by using a filtering algorithm that prevents a daemon from sending unnecessary load updates when the host load state oscillates near one of the threshold values. This optimization is not yet implemented in MILLIPEDE.

5.2. Master protocol

We now describe the scheduling algorithm that the masters use for distributing their applications. This algorithm decides which, and how many, threads will be assigned to each host. The master common data structures (on a host) are updated each time the daemon receives a state update message. Then each master running on this host makes migration decisions regarding its application (if there are underloaded hosts in the system).

Let us describe the response of a master of application a, which resides at host H0, to a state update message from a host H1. We denote by nj the number of jobs belonging to the application a that run on the host Hj. We denote the low and the high load thresholds of the host Hj by lj and hj, respectively.

5.2.1. Treating an overloaded host

When the master receives a message from an overloaded host H1, it checks if that host is running the application a. If not, it takes no further action. Otherwise the master tries to initiate a transfer of all or part of a's jobs out of H1, depending on the number of these jobs (denoted n1).

The master makes the decision in the following way. It determines the number of jobs to transfer (denoted n), which depends on n1 and possibly on other load parameters. If n1 < l1, the master tries to evict a from H1, i.e., it attempts to transfer all jobs of a from H1. If the transfer succeeds, the master disables a's application copy on H1. In this case n = n1. Otherwise it tries to transfer excess jobs from H1. In this case the exact number of jobs to be sent depends both on n1 and on the load thresholds of the target host H2 (which is chosen as described below): n = min{n1/2, (l2 + h2)/2}.
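The transfer-size rule can be stated compactly as a small function; the function name is ours, and the variable names follow the notation above. The fallback to the min rule when a full eviction cannot be carried out is handled by the negotiation described below, not by this function.

#include <algorithm>

// n1: a's jobs on H1; l1: low threshold of H1; l2, h2: thresholds of the target H2.
int jobs_to_transfer(int n1, int l1, int l2, int h2)
{
    if (n1 < l1)
        return n1;                              // try to evict a from H1 entirely
    return std::min(n1 / 2, (l2 + h2) / 2);     // otherwise move only the excess jobs
}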

The master then looks for an underloaded host that can receive the jobs. It may create an additional copy of a on some underloaded host if this is necessary. However, its objective is to avoid creating redundant instances and to avoid executing several different applications on the same host. The master therefore looks for a preferred underloaded host using the following precedence order (a ranking sketch follows the list).
1. The hosts that have only an enabled copy of a, sorted by the number of the jobs of a in increasing order.
2. The hosts having an enabled copy of a and disabled copies of some other applications, sorted in the same way.
3. The hosts having an enabled copy of a and some other applications, sorted in the same way.
4. The hosts having a disabled copy of a and nothing else.
5. The hosts having a disabled copy of a and disabled copies of some other applications.
6. The hosts that do not have any copies.
7. The hosts having only disabled copies of other applications.
8. The hosts having enabled copies of other applications.
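One way to realize this preference is to map each candidate host to its class number and sort, as in the sketch below. HostInfo, its fields, and the uniform tie-break by a's job count are assumptions for illustration; the list above does not specify how a host holding a disabled copy of a together with enabled copies of other applications is ranked, so the sketch places it last.

#include <algorithm>
#include <vector>

// Illustrative bookkeeping only; the tie-break by a's job count is applied to every
// class although the text specifies it only for classes 1-3.
struct HostInfo {
    bool enabled_copy_of_a;    // host runs an enabled copy of application a
    bool disabled_copy_of_a;   // host holds a disabled copy of a
    bool other_enabled;        // enabled copies of other applications
    bool other_disabled;       // disabled copies of other applications
    int  jobs_of_a;            // number of a's jobs on the host
};

int class_of(const HostInfo& h)        // 1 = most preferred
{
    bool others = h.other_enabled || h.other_disabled;
    if (h.enabled_copy_of_a && !others)            return 1;
    if (h.enabled_copy_of_a && !h.other_enabled)   return 2;
    if (h.enabled_copy_of_a)                       return 3;
    if (h.disabled_copy_of_a && !others)           return 4;
    if (h.disabled_copy_of_a && !h.other_enabled)  return 5;
    if (!h.disabled_copy_of_a && !others)          return 6;
    if (!h.disabled_copy_of_a && !h.other_enabled) return 7;
    if (!h.disabled_copy_of_a)                     return 8;
    return 9;   // disabled copy of a plus enabled others: not covered by the list above
}

void rank_candidates(std::vector<HostInfo>& hosts)
{
    std::sort(hosts.begin(), hosts.end(),
              [](const HostInfo& x, const HostInfo& y) {
                  int cx = class_of(x), cy = class_of(y);
                  if (cx != cy) return cx < cy;
                  return x.jobs_of_a < y.jobs_of_a;   // fewer jobs of a first
              });
}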

The master then asks the chosen host H2 whether it can receive n jobs. H2 may refuse if someone else has already offered it enough jobs, or if it has only recently become underloaded and has no copy of a but has copies of other applications. In the latter case it expects job offers from the masters of the existing applications. If it remains underloaded long enough without receiving such offers, it assumes that it will not receive jobs from the existing applications, and consequently it may agree to create a copy of an additional application.

If the underloaded host refuses, the master asks the next underloaded host. For a certain period of time the master remembers that the host refused to receive work, assuming that the reasons for this do not change frequently. This allows the master to avoid sending useless work offers to such hosts the next time it looks for an underloaded host.

If the selected underloaded host agrees to receive work, the master checks whether H1 agrees to send the jobs out. The reason for this extra check is to avoid useless migrations in the case that several applications use H1 and all the corresponding masters decide to send their jobs out of H1. In this case, if there is no negotiation with H1, these masters' decisions may cause H1 to become underloaded, so that immediately after the jobs are transferred out of H1 the master will start the process of collecting jobs back to H1.

Consequently, the master asks H1 whether it can send n jobs. The daemon on H1 checks whether it still has a sufficient number of jobs. If sending n jobs would make it underloaded, it refuses; the master then cancels its offer to H2. Otherwise it agrees; in this case the master sends H2 a copy of a (or enables an existing one if needed) and sends H1 a final request to send work. Upon receipt of this request the daemon on H1 asks the MGS of a's local copy to send n jobs to H2. The MGS selects the jobs with respect to their remote access history, striving to achieve maximal locality of DSM accesses. The general method for the selection process, as well as its implementation in MILLIPEDE, is described in (Schuster and Shalev, 1997). The MGS then transfers the selected jobs to H2 and notifies the local daemon, which notifies the master. If the application is evicted from H1, the local daemon also disables its copy.



5.2.2. Treating an underloaded host

When the master on H0 receives a message from an underloaded host H1, it looks for an overloaded host H2 that has an enabled copy of a. If there exists such a host, the master tries to transfer the excess jobs from H2 to H1 in the same way as in Section 5.2.1.

Otherwise, all the hosts that are executing a are in either underloaded or normal load state. The master checks if H1 is running the application a. If it is, the master tries to find another underloaded host H2 that is executing a, attempting to merge the two copies onto a single host (thus eliminating unnecessary communication between them). It decides which of the two hosts will receive the work from the other one by using the same precedence order that was described in Section 5.2.1. The other host will evict a to the chosen host. The rejection mechanism is used here too, meaning that both the sender and the receiver may reject the master's request due to recent changes in their load state.

6. Performance evaluation

In this section we present the results of our experiments with the MILLIPEDE system. We show that thread migration can be used to improve load balancing and to reduce the amount of communication. We executed the tests on six x86 Pentium workstations running the Windows-NT operating system and connected by 100 Mb/s Ethernet. The workstations have different amounts of physical memory and different processor speeds. The average latency of thread migration in this environment is 70 ms, while the latency of a message of "zero" length (for example, an MJEC message or a page lookup message) is 2 ms.

6.1. The TSP problem

The Travelling Salesman Problem (TSP) is an example of an NP-hard optimization problem in graph theory. Given a connected graph with weighted edges, the shortest Hamiltonian path should be found, i.e., a path travelling through all the nodes of the graph such that the sum of the edge weights along the path is minimal. We give here a brief description of the parallel algorithm used to find an exact solution for this problem. Basically, the solution to the TSP problem is to scan a search tree having a node for each partial path of any Hamiltonian path which starts in a given node of the input graph, see Fig. 4. More precisely, for a node representing the path i0 → i1 → ··· → ik, its children are the nodes representing all paths of the form i0 → i1 → ··· → ik → s, where s is different from i1, ..., ik. Thus, each leaf of the tree represents a Hamiltonian path in the input graph, and the objective is to find the one that represents the shortest path.

In the parallel algorithm, work is divided among threads in the following way. For each node 0 → i, its subtree is searched by k threads; the sons of the node are evenly divided between the threads, so that each thread receives a set of initial paths of the form 0 → i → j, and its mission is to search for the minimal path in the subtrees of these paths. Each thread performs a DFS-style search in each of its subtrees; the search is exhaustive in the worst case. In order to optimize the search, all threads use a shared variable to store the weight of the shortest path; a thread cuts off the search in a certain subtree if the weight of the partial path at the root of that subtree is greater than the weight of the shortest path that was found so far.
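The per-thread search can be sketched as the usual pruned DFS over a weight matrix, as below. The shared bound is shown as a plain reference, ignoring its DSM placement and the synchronization of updates; each thread would seed path with one of its assigned prefixes 0 → i → j before calling the function.

#include <vector>

// Pruned DFS over a dense weight matrix w; best_len plays the role of the shared
// shortest-path weight (its DSM allocation and atomic update are omitted here).
void tsp_dfs(const std::vector<std::vector<int>>& w, std::vector<int>& path,
             std::vector<bool>& visited, int path_len, int& best_len)
{
    if (path_len >= best_len) return;          // cut off: bound already exceeded
    const int n = static_cast<int>(w.size());
    if (static_cast<int>(path.size()) == n) {  // leaf: a complete Hamiltonian path
        best_len = path_len;                   // new shortest path found
        return;
    }
    const int last = path.back();
    for (int next = 0; next < n; ++next) {
        if (visited[next]) continue;
        visited[next] = true;
        path.push_back(next);
        tsp_dfs(w, path, visited, path_len + w[last][next], best_len);
        path.pop_back();
        visited[next] = false;
    }
}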

Fig. 4. Search tree and parallelization.

Each thread uses a certain amount of dynamically allocated memory. This memory should be allocated in the DSM to make thread migration possible. Depending on the size of the DSM allocations, some of the threads might get their memory on the same pages, which is a typical example of false sharing. We compared three different variants. In the first variant, denoted NO-FS, false sharing is avoided by allocating more memory than necessary (the allocations are padded to precisely fit into a page). In the other two variants, k threads that search the paths starting with 0 → i store their private data on the same page. The variant called FS uses no optimizations, whereas the other one, called OPTIMIZED-FS, uses optimizations for data locality by enabling the histories mechanism.
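The NO-FS padding amounts to rounding every per-thread block up to a whole page, as in the sketch below. The page size, the function name, and the use of std::malloc as a stand-in for the DSM allocator (which would also have to return page-aligned addresses) are assumptions.

#include <cstddef>
#include <cstdlib>

constexpr std::size_t kPageSize = 4096;   // assumed page size on NT/x86

// Round the request up to a multiple of the page size so that no two threads
// ever share a page; std::malloc stands in for the DSM allocator here.
void* alloc_private_block(std::size_t bytes)
{
    std::size_t padded = ((bytes + kPageSize - 1) / kPageSize) * kPageSize;
    return std::malloc(padded);
}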

6.1.1. Improving load balancing

Uniform input. We now show that migrating threads can improve performance even in cases that are trivially parallelizable. We compare the execution time of the NO-FS variant of the TSP for two different scheduling strategies: static and dynamic. The TSP application receives a uniform input, i.e., all paths have almost the same length. Therefore all the threads have about the same amount of work, and there is almost no communication between them. The static policy is a round-robin strategy: when n threads are to be created in a system of m machines, n/m sequential threads are created on each host. The dynamic policy is our load sharing strategy. Since the system is not uniform, thread migration improves performance by about 30%, as one can see in Fig. 5.

Evidently, one could suggest improving the round-robin strategy by dividing the threads between the machines according to their respective performance. This might help, assuming that the expected execution time on each host can be accurately predicted. However, such a prediction is only approximate, even if all threads have exactly the same amount of work. The reason is that the prediction depends, in addition to the number of processors and their speed, on the amount of available physical memory and on the behavior of other processes that are using the machines. Certainly, no improvement to the prediction helps if the amount of work in the threads is not known in advance. As an example, we examine below an extreme case that cannot be treated by the static policy.

Unpredictable computation amount. Here we compare the static and dynamic policies applied to the TSP with extremely non-uniform input: the paths that are searched by the first six threads (out of a total of 36 created threads) all have the same length, while the other paths begin with heavy edges, so the threads that search them terminate almost immediately due to pruning. Thus, when the static policy is applied in a system with at most six machines, all jobs that do not terminate immediately will be scheduled to run on the same machine, while all other machines will become idle shortly after initiation. In contrast, dynamic load sharing keeps all the machines utilized. Fig. 6 shows that with the static policy there is no speedup at all, while the dynamic policy provides a speedup close to linear.

6.1.2. Optimizing locality

Improper placement of communicating threads can impose huge communication overhead and significantly increase execution time. Table 1 summarizes the results of running the TSP algorithm on six machines with different numbers of threads contending for the same page. The table shows a dramatic reduction in the network traffic when the optimizations for locality are applied: the number of DSM-related messages, which reflects the miss ratio (the ratio of remote accesses to the total number of accesses), drops by a factor of 30–40! Note that the number of extra messages added by the locality optimization mechanism itself is negligible compared to the improvement in the number of DSM-related messages.

Table 1
Statistics for applying locality optimization to the TSP application running on six hosts with false sharing, for different k (number of jobs contending for a page). Applying locality optimizations dramatically decreases the number of DSM messages (page lookup and transfer). The added overhead imposed by the ping-pong treatment mechanism and the increased number of thread migrations is negligible.

k   Optimized?   DSM-related messages   Ping-pong treatment messages   Thread migrations   Execution time (s)
2   Yes            5100                 290                             68                  645
2   No           176120                   0                             23                 1020
3   Yes            4080                 279                             87                  620
3   No           160460                   0                             32                 1514
4   Yes            5060                 343                             99                  690
4   No           155540                   0                             44                 1515
5   Yes            6160                 443                            139                  700
5   No           162505                   0                             55                 1442

Fig. 5. NO-FS TSP with uniform threads in a non-uniform environment, for static and dynamic scheduling.

Fig. 6. NO-FS TSP with static and dynamic scheduling for extremely non-uniform input.



7. Discussion

In this work we have shown that thread migration is one of the major capabilities of DSM systems. Systems that do not support migration of threads (or processes) may suffer poor performance due to load imbalance and due to improper placement of threads in the distributed environment.

Implementing preemptive thread migration is not a simple task. Several approaches to implementing thread migration have been introduced in the literature, but generally they are inappropriate for most operating systems, and in some cases they are even incorrect. Because portability is very important for DSM systems, it is vital to employ a solution that is guaranteed to be portable to various operating systems.

In this paper we proposed a correct system design for thread migration. We described the common approaches and discussed their advantages and flaws. Our proposed solution is the least demanding of the underlying operating system and is thus the most appropriate for implementation on a large variety of operating systems, making the DSM system portable to many platforms.

To validate our design, we implemented it in the MILLIPEDE DSM system, under Windows-NT. The MILLIPEDE system allows better utilization of a network of single-processor and multiprocessor machines. It provides a simple interface for multithreaded concurrent programming on such a network, so that the network is viewed by the application as a single multiprocessor machine with shared memory.

Transparent thread migration is used in MILLIPEDE to provide dynamic load sharing while decreasing communication overhead by improving the locality of data accesses in an application-transparent way. In addition, migration is used to capture idle machines and to preserve the user's ownership of his personal machine by evicting remote threads and data from a machine when its user starts using it.

References

Beery, M., Fleisher, A., Itzkovich, A., Schuster, A., Tzur, S., 1997. ParC++ – A natural parallel extension of C++. Technion Parallel and Distributed Systems Lab Internal Document.

Ben-Asher, Y., Feitelson, D.G., Rudolph, L., 1996. ParC: An extension of C for shared memory parallel processing. Software: Practice and Experience 26 (5), 581–612.

Casas, J., Konuru, R., Otto, S.W., Prouty, R., Walpole, J., 1994. Adaptive load migration systems for PVM.

Casavant, T.L., Kuhl, J.G., 1988. A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Trans. Software Engrg. SE (2), 141–154.

Chase, J.S., Amador, F.G., Lazowska, E.D., Levy, H.M., Littlefield, R.J., 1989. The Amber system: Parallel programming on a network of multiprocessors. In: Proceedings of the 12th ACM Symposium on Operating Systems Principles (SOSP'89), pp. 147–158.

Douglis, F., Ousterhout, J., 1991. Transparent process migration: Design alternatives and the Sprite implementation. Software: Practice and Experience 21, 757–785.

Dubrovski, A., Friedman, R., Schuster, A., 1997. Load balancing in distributed shared memory systems. International Journal of Applied Software Technology, to appear. Also Technion LPCR TR #9602, July 1996.

Eager, D.L., Lazowska, E.D., 1986. Adaptive load sharing in homogeneous distributed systems. IEEE Trans. Software Engrg. SE (5), 662–675.

Itzkovitz, A., Schuster, A., 1997. Algorithms for tracking mobile objects, their copies, and their access capabilities, in DSM systems. In preparation.

Itzkovitz, A., Schuster, A., Shalev, L., 1997. MILLIPEDE: Supporting multiple programming paradigms on top of a single virtual parallel machine. In: Proceedings of the HIPS Workshop, Geneve.

Itzkovitz, A., Schuster, A., Wolfovich, L., 1997. Millipede: A strong virtual parallel machine. In preparation; see also http://www.cs.technion.ac.il/Labs/Millipede.

Keleher, P., Dwarkadas, S., Cox, A.L., Zwaenepoel, W., 1994. TreadMarks: Distributed shared memory on standard workstations and operating systems. In: Proceedings of the Winter 1994 USENIX Conference, pp. 115–131.

Kremien, O., 1993. The design and evaluation of adaptive load-sharing algorithms for distributed systems. Ph.D. thesis, Bar Ilan University, Tel Aviv.

Krueger, P., Livny, M., 1988. A comparison of preemptive and non-preemptive load distributing. In: Proceedings of the 7th International Conference on Distributed Computing Systems (ICDCS-7), pp. 242–249.

Kumar, A., Singhal, M., Ming, T.L., 1987. A model for distributed decision making: An expert system for load balancing in distributed systems. In: Proceedings of the 11th Symposium on Operating Systems, pp. 507–513.

Kumar, A., Singhal, M., Ming, T.L., 1993. Locality-based scheduling for shared-memory multiprocessors. In: Proceedings of the Fourth ICS.

Li, K., Hudak, P., 1989. Memory coherence in shared virtual memory systems. ACM Trans. Computer Systems 7 (4), 321–359.

Mascarenhas, E., Rego, V., 1996. Ariadne: Architecture of a portable threads system supporting thread migration. Software: Practice and Experience 26 (3), 327–356.

Schuster, A., Shalev, L., 1997. Access histories: How to use the principle of locality in distributed shared memory systems. Technical Report #9701, Technion/LPCR.

Willebek-Le-Mair, M.H., Reeves, A.P., 1993. Strategies for dynamic load balancing on highly parallel computers. IEEE Trans. Parallel and Distributed Systems 4 (9), 979–993.

Zayas, E.R., 1987. Attacking the process migration bottleneck. In: Proceedings of the 11th Symposium on Operating Systems Principles (SOSP'87), pp. 13–24.

Ayal Itzkovitz is a doctoral student at the Department of Computer Science at the Technion – Israel Institute of Technology. He received his B.A. in Computer Science from the Technion. His research interests include operating systems, distributed systems and communication.

Assaf Schuster received his B.A., M.A. and Ph.D. degrees in Mathematics and Computer Science from the Hebrew University of Jerusalem, the latter in 1991. He is currently a senior lecturer at the Technion (Israel Institute of Technology). His main interests include networks and routing algorithms, parallel and distributed computation, optical computation and communication, and dynamically reconfiguring networks. During the last four years he has been leading the MILLIPEDE project, a part of which is described in this work.

Lea Shalev (Wolfovich) is a master's student at the Department of Computer Science at the Technion – Israel Institute of Technology. She received her B.Sc. in Computer Science from the Technion. Her research interests include communication, parallel and distributed programming, and software engineering. She is currently a software engineer in the Network Products Department of Intel Israel.
