Abstract of thesis entitled
A Distributed Object Model for Solving Irregularly Structured Problems on Distributed Systems
submitted by
Sun Yudong
for the degree of Doctor of Philosophy at The University of Hong Kong
in March 2001
This thesis presents a distributed object model, MOIDE (Multithreading Object-oriented
Infrastructure on Distributed Environment), for solving irregularly structured problems. The
primary appeal of MOIDE is its flexible, collaborative infrastructure, which is adaptive to
various system architectures and application patterns. The model integrates object-oriented
and multithreading methodologies to set up a unified computing environment on
heterogeneous systems. The kernel of the MOIDE model is the hierarchical collaborative
system (HiCS), constructed from the objects, i.e., the compute coordinator and compute engines,
that execute an application on the hosts. HiCS integrates the object-oriented and multithreading
methodologies to make its structure adaptive to hybrid hosts. Lightweight threads are
generated in the compute engines residing on SMP nodes, which is more efficient and stable
than a purely distributed-object scheme. The structure and work mode of HiCS adapt to
the computation and communication patterns of applications as well as to the architecture of
the underlying hosts. This adaptability is particularly beneficial for the high-performance
computing of irregularly structured problems.
A unified communication interface is built into MOIDE based on a two-layer
communication mechanism that integrates shared-data access and remote messaging for
inter-object communication. This flexible and efficient mechanism addresses the
high communication cost that arises in irregularly structured problems. Autonomous load
scheduling is proposed as a new approach to dynamic load balancing in irregular
computations based on the MOIDE model. A runtime support system is developed in Java and
RMI that implements MOIDE as a platform-independent infrastructure to support parallel and
distributed computation on varied systems.
Four irregularly structured applications are developed to demonstrate the advantages of
the MOIDE model. The N-body method demonstrates the capability of the object-based
methodologies of the MOIDE model in implementing adaptive task decomposition
and complex data structures. A distributed tree structure with a partial subtree scheme is devised
in the N-body method as a communication-efficient data structure to support the highly
data-dependent computation. The autonomous load scheduling approach in ray tracing
realizes the high parallelism available in MOIDE-based asynchronous computation. The MOIDE
model provides adaptability to the CG method for solving sparse linear systems: the CG
method can be dynamically mapped onto heterogeneous hosts and utilize the unified
communication interface to enhance communication efficiency. The radix sort verifies the
flexibility of MOIDE-based computation, in which grouped communication can
outperform MPICH on an SMP node and on a cluster of SMP nodes.
A Distributed Object Model for Solving Irregularly Structured Problems on
Distributed Systems
by
Sun Yudong
孫 昱 東
A thesis submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy at The University of Hong Kong
March 2001
Declaration
I declare that this thesis represents my own work, except where due acknowledgement is
made, and that it has not been previously included in a thesis, dissertation or report submitted
to this University or to any other institution for a degree, diploma or other qualification.
Signed ………………………………………….
Sun Yudong
Acknowledgements
First, I would like to express my deepest gratitude to my supervisor, Dr. Cho-Li Wang, for all his
guidance during my study in the past years. He has always given me invaluable encouragement
and advice toward completing this thesis.
I am deeply thankful to Dr. Francis C. M. Lau for his constructive advice and help with my
study. I am highly grateful to Dr. P.F. Liu for his important comments and suggestions for my
thesis revision.
Special thanks to the members of the System Research Group, Anthony Tam, Benny
Cheung, Matchy Ma, David Lee, and all other colleagues, whose help and cooperation have
been a strong support to my research.
Finally, I thank all the people who have helped me in all aspects of finishing this
thesis, especially the technical and office staff in our department.
Contents
Declaration ………………………………………………………………………………….. i
Acknowledgements ..……………………………………………………………………….... ii
Table of Contents …………………………………………………………………………… iii
List of Figures ……………………………………………………………………………... vii
List of Tables ……………………………………………………………………………….. x
6.15 All-to-all scattering of elements in parallel radix sort on four processors …………... 102
6.16 Execution time breakdowns of the MOIDE-based radix sort ……………………….. 104
6.17 Execution time breakdowns of the single-threading radix sort ……………………… 105
6.18 Execution time breakdowns of the C & MPI radix sort program ……………………. 107
6.19 Execution time breakdowns of three radix sort programs: Java MOIDE-based (Java-M),
Java single-threading (Java-S) and C & MPI (C-MPI) in sorting 10M elements …… 107
6.20 Execution time breakdowns of two radix sort programs on four quad-processor SMP
nodes: Java MOIDE-based program and C & MPI (C-MPI) in sorting 10M elements 109
6.21 Communication costs of two radix sort programs on four quad-processor SMP nodes:
Java MOIDE-based program (Java-M) and C & MPI program (C-MPI) in sorting 10M
elements ……………………………………………………………………………… 110
List of Tables
1.1 Characteristics of four irregularly structured problems ..………….…...…………….... 7
3.1 Major classes and methods implemented in MOIDE runtime support system …… 34-35
4.1 The times of the N-body method with/without load balancing (seconds) …………… 71
6.1 Execution time of the MOIDE-based radix sort (seconds) ………………………….. 103
6.2 Execution time of the single-threading radix sort (seconds) ………………………… 104
6.3 Execution time of the C & MPI radix sort (seconds) ………………………………... 106
7.1 Comparison of the related work in supporting heterogeneous computing …………... 116
7.2 Comparison of the programming models on cluster of SMPs ……………………….. 118
Chapter 1
Introduction
Irregularly structured problems are applications that have unstructured and/or
dynamically changing patterns of computation and communication. These problems exist
widely in scientific and engineering areas ranging from astrophysics, fluid dynamics, sparse
matrix computations, and system modeling and simulation to computer graphics and image
processing. Irregularly structured problems are usually computation-intensive applications
with high potential parallelism. However, the parallelism is difficult to exploit fully because
of the irregularity in computation and communication. The irregularities of these problems
are aggravated when they are solved on distributed systems. It is hard to partition the
irregular computation evenly among the processors. Moreover, complicated and unstructured
inter-process communication emerges in distributed computing and restrains the parallelism
of the computation. Therefore, irregularly structured problems are also
communication-intensive in distributed computing.
It is challenging to develop flexible and efficient methodologies for solving
irregularly structured problems on distributed systems. The methodologies should cover data
structures, task decomposition and allocation schemes, load balancing strategies, and
communication mechanisms that support high-performance distributed computing of
irregularly structured applications. Meanwhile, the methodologies should also take into
account the architectural features of the platforms on which an application runs, in order
to create an efficient mapping of the computation onto the underlying platforms.
My research concentrates on distributed object-oriented methods for high-performance
solutions of irregularly structured problems on distributed systems. A distributed
object-oriented model, MOIDE, has been built. The model sets up a flexible and efficient
software infrastructure for developing and executing irregularly structured applications on
varied distributed systems. The MOIDE model supports techniques that are effective for solving
irregularly structured problems with different irregular characteristics. A runtime support
system is developed to implement computations based on the MOIDE model.
1.1 Irregularly Structured Problems
1.1.1 Specification
Many scientific and engineering applications in various fields, including scientific
computing, computer graphics, physics, and chemistry, can be classified as irregularly
structured problems: for instance, sparse matrix computations in solving sparse linear
equation systems; finite element methods for solving partial differential equations; ray tracing
and radiosity in computer graphics; N-body methods in particle simulations; and system
modeling and simulation in many scientific, engineering, and social disciplines. Despite the
distinct features of these problems, they share the common characteristic of irregular data
distribution, which generates irregular computation patterns and thus incurs irregular
communication patterns. Irregularly structured problems can be described from different
aspects such as data structure, computation and communication patterns, and unpredictable
data distribution and workload [1-6]. Generally, irregularly structured problems can be
characterized by the following definition.
Definition
An Irregularly Structured Problem is an application whose computation and communication
patterns are input-dependent, unstructured, and evolving with the computation procedure.
The irregularity of irregularly structured problems makes it difficult to design
efficient parallel and distributed algorithms for them. The distribution of data and computing
workload cannot be determined exactly a priori, and both change dynamically during
the computation. When solving an irregularly structured problem in parallel, one has to deal
with the following issues:
(1) Irregular Data Representation
The irregular data distribution requires irregular data structures, e.g., special forms of
trees and graphs, to represent the data and their relations [54,55,60]. It is usually not easy to
exploit the parallelism in computations on such irregular data structures.
(2) Non-predetermined Load Scheduling
The workload of an irregularly structured problem depends on the input data and their
dynamic evolution during the computation. Because of the irregular and dynamic data
distribution, irregularly structured problems cannot be evenly partitioned and allocated
to multiple processors in advance of execution. It is impractical to measure the
computation workload of an irregularly structured problem accurately. High data dependency
may also exist in the problem, which further burdens the load scheduling operations [69,70].
(3) Complicated Communication Requirements
Due to the irregular data structures, computation patterns, and data dependencies,
irregularly structured problems also present irregular and complicated communication
requirements. The high data dependencies in some irregularly structured problems generate
complicated inter-process communication patterns and demand high communication
bandwidth [67,68]. The unstructured communication may severely restrict the performance of
irregularly structured applications.
(4) Adaptive Algorithmic Requirement
The unpredictable computation and communication patterns require adaptive algorithms
for solving irregularly structured problems. The algorithms should generate a task
decomposition in accordance with the specific patterns that emerge in the problem, in order to
attain high performance in distributed computing. The algorithms should also map the
irregular computations onto the underlying platforms in an adaptive way that matches the
hardware architecture [71].
Irregularly structured problems are mostly large-scale, computation-intensive and
communication-intensive applications. Their unstructured, dynamically evolving patterns
aggravate the computation and communication costs. In addition to the general strategies for
designing efficient parallel or distributed algorithms, such as scheduling computing tasks
evenly to balance the workload and minimizing communication, special techniques should
be devised for irregularly structured problems. The fundamental techniques for solving
irregularly structured problems include:
(1) Flexible Data Structures
The data structures should facilitate the effective representation and efficient
computation of irregular problems. They should be flexible enough to be partitioned during
task decomposition and to be reconstructed to reflect the evolution of the computation
patterns. The data structures should also efficiently satisfy the data sharing required in parallel
and distributed computing. Naturally, different irregular applications require special data
structures to represent their data and the related computations. For example, a distributed tree
structure is designed for the distributed N-body method in chapter 4.
(2) Dynamic Load Scheduling
The unpredictable and evolving data distribution, and therefore computation workload,
of irregularly structured problems requires dynamic load scheduling to allocate the workload
at run-time in distributed computing. Computing tasks should be allocated, or data
redistributed, among multiple processes to ensure dynamic load balance. The appropriate
load scheduling approach depends on the characteristics of the application. Global load
redistribution is probably required for applications with high data dependency, while runtime
task allocation suits applications with light data dependency. The space re-decomposition
scheme in the distributed N-body method in chapter 4 and the autonomous load scheduling
scheme in chapter 5 are two examples of dynamic load scheduling.
(3) Efficient Communication Methodologies
The irregular communication patterns in irregular applications usually produce high
communication overhead that deteriorates the overall performance. The communication
overhead is more critical in a distributed computing environment, where communication
goes through long-latency message passing. It is essential to reduce inter-process
communication by maintaining data locality in the computation and to build an efficient
communication mechanism for distributed computing. In the MOIDE model described in
chapter 2, a two-layer communication mechanism is created by integrating shared-data
access and message passing on heterogeneous systems.
(4) Adaptive Computing Infrastructure
Task decomposition depends on the computation and communication patterns of an
application. It should decompose the computation evenly and allocate workload fairly to each
processor while reducing inter-processor communication. An ideal task decomposition
scheme should also take into account the architecture of the underlying system. The
algorithms of the applications should therefore be developed on an adaptive computing
infrastructure, so that adaptive task decomposition and allocation strategies can be
implemented for varied hardware architectures to generate a task distribution that makes
full use of the architectural features and attains high performance on the hardware platforms.
1.1.2 Sample Applications
As previously indicated, many applications in various fields can be viewed as
irregularly structured problems. These applications possess different irregularities that call
for specific strategies to cope with them. Four sample irregularly structured problems are
studied in this thesis.
(1) N-body Problem
The N-body problem simulates the evolution of a system containing a great number of bodies
(particles) [7,16]. The bodies, distributed in a space, exert forces on one another.
The system's evolution is the consequence of the cumulative force influences of all bodies.
The force influences are determined by the interactions of the bodies and impel the bodies
to move to new positions, and they keep changing because of the continuous motion of the
bodies. Many physical systems exhibit this behavior, for example in astrophysics, plasma
physics, molecular dynamics, fluid dynamics, and radiosity calculations in computer graphics.
The common feature of these systems is the large range of precision in the data that bodies
require to compute their force influences on each other: a body needs progressively coarser
data, at lower frequency, from bodies that are farther away. The system evolution is a dynamic
procedure. The N-body problem has high irregularity in the data distribution and the
computation of force influences, and heavy irregular data
communication occurs in the distributed N-body method.
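To make the computational pattern concrete, the following is the standard gravitational formulation (a textbook statement, not spelled out in this thesis): the force on body i is the sum of pairwise interactions,

$$\mathbf{F}_i = \sum_{j \neq i} \frac{G\, m_i m_j\, (\mathbf{r}_j - \mathbf{r}_i)}{\lVert \mathbf{r}_j - \mathbf{r}_i \rVert^3}$$

Direct evaluation of these sums costs O(N^2) per time step, while tree codes such as Barnes-Hut approximate the influence of a distant group of bodies by a single aggregate, reducing the cost to O(N log N); this is the source of the progressively coarser, less frequent data requirement noted above.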
(2) Ray Tracing
Ray tracing is a rendering algorithm in computer graphics that synthesizes an image
from the mathematical description of the objects that constitute the scene [8,20]. It generates
a 2D rendered image of a 3D scene by calculating the color contributions of the objects to
each pixel on a view plane (screen). In ray tracing, primary rays are emitted from a viewpoint,
pass through the pixels on the view plane, and enter the space that encloses the objects.
When a ray encounters an object, it is reflected toward each light source to check whether the
point is shielded from that light source. If not, the light contribution from that light source
onto the view plane is computed. The ray is also reflected from and refracted through the
object to spawn new rays, and the ray tracing procedure is performed recursively on the new
rays. Thus each primary ray may generate a tree of new rays. Rays are terminated when they
leave the space or by some pre-defined criterion (e.g., the maximum number of levels
allowed in a ray tree). If a ray hits nothing, no further computation is needed; rays
hitting sophisticated objects generate a bundle of dispersed rays that require more
rendering computation. In ray tracing, the generation of rays is non-deterministic,
depending on the objects and light sources in the scene. The workload of the rendering
operations is totally irregular: the rendering of each pixel on the view plane has a highly
diverse workload. It is therefore difficult to partition the view plane evenly a priori for
parallel ray tracing.
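A minimal Java sketch of the recursive procedure just described; this is an illustrative fragment, not code from the thesis, and the types and helpers (Scene, Ray, Hit, Light, Color and the geometric methods) are hypothetical placeholders:

```java
// Illustrative recursive ray tracer skeleton; all helper types are assumptions.
class TracerSketch {
    static final int MAX_DEPTH = 5;               // pre-defined cutoff on the ray-tree depth
    Scene scene;                                  // assumed scene description

    Color trace(Ray ray, int depth) {
        if (depth > MAX_DEPTH) return Color.BLACK;
        Hit hit = scene.intersect(ray);           // nearest object hit by this ray, if any
        if (hit == null) return scene.background(); // ray leaves the space: no further work
        Color c = Color.BLACK;
        for (Light light : scene.lights()) {
            if (!scene.isShadowed(hit, light))    // shadow ray toward each light source
                c = c.add(hit.shade(light));      // direct light contribution
        }
        // Reflection and refraction spawn new rays: each primary ray grows a ray tree.
        c = c.add(trace(hit.reflectedRay(ray), depth + 1).scale(hit.reflectance()));
        c = c.add(trace(hit.refractedRay(ray), depth + 1).scale(hit.transmittance()));
        return c;
    }
}
```

The work per primary ray is data-dependent: a ray that hits nothing returns immediately, while a ray hitting a reflective or refractive object spawns a whole subtree of rendering work, which is precisely why the per-pixel workload is so diverse.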
(3) Sparse Matrix Computations
Sparse matrix computations appear broadly in scientific and engineering computing, from
basic operations such as sparse matrix-vector multiplication to complex computations such as
iterative methods for solving sparse linear systems. A sparse matrix is an unstructured data
structure. Parallel sparse matrix computations are irregular operations because of the
unbalanced matrix computation that results from the unstructured data density [62,63], and
unstructured communications arise in parallel sparse matrix computations. For example,
iterative methods for solving linear equation systems of the form Ax = b generate a
sequence of approximations to the solution vector x by iterating on matrix-vector
multiplication operations on the coefficient matrix A. The computation and communication
costs depend on the data density of the sparse matrix and the algorithmic operations
on it. The conjugate gradient (CG) method is one of the most powerful iterative methods for
large sparse linear systems [22]. The parallel CG method is based on the mesh topology of
multiprocessors, and large vectors are exchanged among the processors in each iteration. This
heavy communication may restrict the performance of the parallel CG method. An efficient
communication mechanism should be built to raise the communication efficiency and
enhance the overall performance.
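The kernel that dominates each CG iteration is the sparse matrix-vector product y = Ax. A self-contained Java sketch using compressed row storage (CRS), a common sparse format assumed here purely for illustration, shows why the work is irregular:

```java
// Sparse matrix-vector multiply y = A*x in compressed row storage (CRS).
// val holds the nonzeros row by row, colIdx their column indices, and
// rowPtr[i] .. rowPtr[i+1]-1 delimits the nonzeros of row i.
static double[] spmv(double[] val, int[] colIdx, int[] rowPtr, double[] x) {
    int n = rowPtr.length - 1;
    double[] y = new double[n];
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
            sum += val[k] * x[colIdx[k]];   // work per row = number of nonzeros in row i
        }
        y[i] = sum;
    }
    return y;
}
```

Because the number of nonzeros varies from row to row, an even split of rows does not give an even split of work, and the column indices determine which segments of the distributed vector x each processor must fetch, producing the unstructured communication described above.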
(4) Sorting
Sorting is a procedure that transforms a random sequence of elements into an ordered one. It
is one of the most common operations in computing. A sorting procedure is accomplished
by repeated comparison or non-comparison manipulations of the element sequence. Parallel
sorting algorithms involve the redistribution of the elements among multiple processors in
each sorting round. This data exchange has an irregular communication pattern and produces
heavy communication among all processors; hence sorting algorithms can also be recognized
as irregular problems [48,66]. For example, radix sort is a non-comparison sorting algorithm
[24]. It reorders a sequence of elements based on the integer value of bit sets, examining the
elements r bits at a time: it sorts the elements according to the ith least significant block of
r bits during iteration i. In each loop, all elements are redistributed across the processors
based on their new positions in the global sequence. This data redistribution is an irregular
all-to-all communication. Since the sorting algorithm itself performs only simple computation,
the performance is mainly determined by the communication operations, and an efficient
communication mechanism is needed to speed up the sorting procedure.
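A self-contained Java sketch of one pass of this procedure, a textbook stable counting sort on one r-bit digit (the distributed all-to-all redistribution itself is not shown):

```java
// One LSD radix-sort pass: stable counting sort on the pass-th r-bit digit.
static int[] radixPass(int[] a, int r, int pass) {
    int buckets = 1 << r;                     // 2^r possible digit values
    int shift = pass * r;                     // position of the current r-bit block
    int[] count = new int[buckets + 1];
    for (int v : a) count[((v >>> shift) & (buckets - 1)) + 1]++;    // histogram of digits
    for (int b = 0; b < buckets; b++) count[b + 1] += count[b];      // prefix sums -> offsets
    int[] out = new int[a.length];
    for (int v : a) out[count[(v >>> shift) & (buckets - 1)]++] = v; // stable scatter
    return out;
}
```

In the parallel version, the scatter step becomes the irregular all-to-all communication: each processor's digit histogram determines how many elements it must send to every other processor, and these volumes differ from pass to pass and from input to input.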
The four irregularly structured applications discussed above have unique irregular
characteristics that demand specific techniques to achieve high-performance computing.
These characteristics are summarized by four attributes: computation complexity,
communication complexity, data dependency, and synchronization requirement in Table 1.1.
Application   Computation  Communication          Data        Synchronization  Key Techniques
              Complexity   Complexity             Dependency  Requirement
N-body        High         High (all-to-all)      High        Yes              Distributed tree structure
Ray Tracing   High         Low                    None        No               Autonomous load scheduling
CG            High         High (point-to-point)  Medium      Yes              Two-layer communication
Radix         Low          High (all-to-all)      Low         Yes              Two-layer communication

Table 1.1 Characteristics of four irregularly structured problems
In Table 1.1, computation complexity is the computation workload of the application.
Communication complexity is the communication requirement of the application as it appears
in distributed computing. Data dependency is the level of correlation of the data in the
computation. Synchronization requirement indicates whether synchronization must be
enforced on the parallel computing procedure; no synchronization requirement implies a
totally asynchronous computation procedure that can attain the highest parallelism, while
otherwise synchronization must be imposed to coordinate the parallel computations on
multiple processes. The key techniques in Table 1.1 are used to handle the irregularities of
the individual irregularly structured problems and to achieve high performance in distributed
computing. These techniques are implemented in the distributed object model MOIDE (see
chapter 2). The effects of the key techniques will be demonstrated by the sample applications
in chapters 4, 5 and 6.
1.2 Distributed System and Distributed Object Computing
1.2.1 Distributed System
A distributed system is constructed from computer nodes linked across networks. The
computer nodes may be geographically distributed over a wide area. To run on a distributed
system, an application is decomposed into a group of concurrent computing tasks that
are dispatched to the distributed computer nodes, while the tasks still cooperate
during the computation. This is the basic model of distributed computing. Distributed
computing technologies are developing rapidly with the proliferation of smaller
computers, such as PCs and workstations, and the wide spread of networks. Networked
computers can support large applications through distributed computing without high-end
standalone computers.
A distributed system can be a homogeneous system consisting of machines of the same
type. More generally, it is a heterogeneous system composed of hybrid computers that have
different architectures and computing power; it may accommodate PCs, workstations, and
multiprocessors in the same system. With the fast advancement of high-speed networks and
powerful microcomputers and workstations, networked computers have been providing a
cost-effective environment for high-performance parallel and distributed computing.
The scope of networked systems is expanding quickly. Nowadays networked
computers go beyond traditional LAN-linked systems: local systems at different sites
can form a wide-area distributed system, from campus-wide to area- or nation-wide systems,
which provides strong computing power. This innovative system architecture is called a
Cluster of Clusters or Computational Grid [64,65].
A distributed system can be accessed concurrently by many users at different sites for
different applications. A software infrastructure is required to integrate the distributed
resources and provide a uniform interface for developing and running applications on the
distributed system. The infrastructure should offer sufficient flexibility on heterogeneous
platforms. It should hide the architecture of the platforms and create a uniform environment
for application developers.
In response to these requirements for an integrated computing infrastructure,
object-oriented methodology is recognized as an appropriate technique for constructing
distributed computing infrastructures. Object-oriented techniques are flexible enough to
create objects on varied platforms and organize the objects into a computing infrastructure
for executing applications. Such a distributed object infrastructure is highly adaptive to
heterogeneous platforms.
1.2.2 Distributed Object Computing
Object-oriented technology is appropriate for computation on distributed systems [9,
10]. An object is a software unit that encapsulates data and behavior. Object-oriented
computing can be viewed as providing an interface that specifies the functions and
arguments of an object while encapsulating the details of the internal implementation. The
interface hides the hardware characteristics from the applications, so applications can be
developed in a uniform model regardless of what platforms they will run on. Applications
based on an object-oriented model can be properly mapped onto the platforms, and high
performance can be achieved.
Distributed object computing is the integration of object-oriented computing and
networking. It provides high flexibility for computation on distributed systems. Objects are
created on distributed computer nodes when an application is submitted to run. An object on
one host can interact with objects on remote hosts, can be transferred to another host for the
sake of load balancing or fault tolerance, and can create remote objects on other hosts.
Distributed object computing also supports the multithreading methodology, which
implements lightweight computation on SMP nodes.
Remote method invocation is a more powerful communication mechanism in a distributed
object system than ordinary message passing: through remote method invocation, an object
can transfer not only data but also control to remote objects. It has the character of one-sided
communication, in which a communication operation can be started by the sender or the
receiver alone. One-sided communication contributes to the high asynchrony of distributed
object computing. The polymorphism of a distributed object system guarantees the flexibility
of object-oriented computing, and a distributed object system can expand its computing
power by incorporating objects created on new hosts at runtime. In summary, a
distributed object system should possess the following capabilities:
(1) Runtime Host Selection
The first step in creating a distributed object system is the selection of the available hosts in
a distributed system. The computers in a distributed system are simultaneously accessible by
many users and applications, and a distributed system is a loosely-coupled, non-permanent
system: local systems or hosts can join or leave it. When building up a distributed object
system, the appropriate hosts need to be selected based on their current states. Objects are
then created on the selected hosts to perform distributed computing.
(2) Adaptive Task Mapping
In distributed computing, an application should be decomposed into a group of tasks that
are allocated to the distributed objects. The tasks should be mapped onto the distributed
objects in accordance with the computation pattern, the data locality, and the architecture of
the target hosts. The task mapping should exploit the computing power of the hosts and
minimize the data communication between the objects.
(3) Multithreading Computation
Multithreading is a lightweight method for parallel computing within an
object. An object residing on an SMP node can spawn a group of threads to execute
parallel computation on the multiprocessors. A group of threads within an object consumes
fewer system resources than multiple objects and can maintain higher data sharing and
tighter cooperation in parallel computing.
(4) Efficient Communication Mechanism
Distributed object computing is usually accompanied by heavy inter-object
communication. An efficient communication mechanism is demanded to support flexible and
fast communication. The communication mechanism should integrate the inter-object
communication methodologies according to the physical communication paths to achieve
high communication efficiency.
(5) Dynamic Object Creation and Load Scheduling
A distributed object system should be adaptive to its computing environment. It should
be able to dynamically create objects on new hosts to utilize the available computing
resources and improve performance, and it needs to support dynamic load scheduling to
balance the workload on the objects.
These capabilities are especially useful in solving irregularly structured problems. The
unstructured computation and communication patterns of these problems create strong
demand for adaptive task mapping and dynamic load scheduling to produce a balanced
distribution of the computation workload. Dynamic object creation is also helpful for
developing adaptive algorithms on distributed systems. An efficient communication
mechanism can mitigate the overwhelming overhead of unstructured communication, and
multithreading computation helps smooth the unstructured computation and increase
performance on SMP nodes.
1.2.3 Object-Oriented Programming Language
A distributed system may contain different types of platforms, so object-oriented
computing on such a system must be executable on heterogeneous platforms. Java is an
architecture-neutral object-oriented language that offers plenty of services for distributed
computing on heterogeneous platforms [11]. The services support multithreading and
distributed object computing as well as a remote method invocation mechanism, and they
enable dynamic object creation and redistribution. Java provides a homogeneous,
language-centric view over a heterogeneous environment. Platform-independent Java
bytecode is executable on any host on which a Java Virtual Machine is installed; therefore
Java classes and objects are portable from one host to any other host without recompilation.
RMI (Remote Method Invocation) is a Java-based interface for distributed object-oriented
computing [12,77]. It provides a registry in which object references are registered, and it
allows an object running on one host to make remote method invocations on an object on
another host. Distributed objects can thus transfer data through both the arguments and the
return value of a method. A distributed object system implemented in Java can be flexibly
built on any platform at runtime. In this thesis, Java and RMI are used to implement the
MOIDE model and the irregularly structured applications.
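As a minimal illustration of the RMI mechanism just described, consider the following generic example of the standard java.rmi API (the Engine interface and the names here are hypothetical, not taken from the MOIDE runtime):

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

// A remote interface: every remotely callable method must be declared here
// and must be able to throw RemoteException.
interface Engine extends Remote {
    double execute(double input) throws RemoteException;
}

// The remote object; extending UnicastRemoteObject exports it for remote calls.
class EngineImpl extends UnicastRemoteObject implements Engine {
    EngineImpl() throws RemoteException { super(); }
    public double execute(double input) throws RemoteException {
        return input * input;   // placeholder computation
    }
}

// Server side:  Naming.rebind("rmi://host0/engine1", new EngineImpl());
// Client side:  Engine e = (Engine) Naming.lookup("rmi://host0/engine1");
//               double r = e.execute(3.0);  // arguments and return value carry the data
```

The client-side call looks like an ordinary method call, but its arguments travel to the remote host and its return value travels back, which is exactly the data-transfer path described above.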
1.3 Motivation
The motivation of my research comes from the recognition of all the previously
discussed requirements. The goal of this research is to develop a generic computing model
that supports efficient and flexible computing on various distributed systems. With regard
to the main target applications, the model should provide powerful support for solving
irregularly structured problems. Runtime support software is needed to integrate the
distributed and heterogeneous hosts and subsystems into a uniform computing
environment for the applications. This thesis presents a distributed object model, MOIDE, for
solving irregularly structured problems on distributed systems. MOIDE stands for
Multithreading Object-oriented Infrastructure on Distributed Environment. It integrates all
the capabilities discussed in Section 1.2.2 and sets up a flexible infrastructure for developing
and executing irregularly structured applications. Applications implemented in the MOIDE
model can achieve high performance on heterogeneous systems.
1.4 Contributions
The research work in this thesis focuses on the development of the MOIDE model and
related mechanisms to support the solution of irregularly structured problems on distributed
systems. The model provides the foundation for developing the solutions, and proprietary
techniques are designed on top of the MOIDE model for specific irregularly structured
applications. The major contributions of the thesis are as follows:
A distributed object model, MOIDE, is developed to establish a flexible and
architecture-independent computing infrastructure on distributed systems. The model
integrates the object-oriented and multithreading methodologies to provide a unified
infrastructure for developing and executing varied applications, especially irregularly
structured problems, on heterogeneous systems. The MOIDE model utilizes the
polymorphism and location-transparency properties of the object-oriented and
multithreading technologies. These properties facilitate the dynamic, adaptive creation
and reconfiguration of the distributed computing infrastructure on varied system
architectures and resources. The hierarchical collaborative system is proposed as the
fundamental software architecture of the model. It has a two-level structure that is
adaptive to the architecture of the underlying hosts and to the patterns of irregular
applications, and it supports dynamic system creation and reconfiguration to cope with
the uncertainty of resource availability and of the irregular computation. The
MOIDE-based computation can attain high performance on the available resources.
A unified communication interface is constructed by integrating local shared-data
access and remote messaging to support architecture-transparent and efficient
inter-object communication. The integrated two-layer communication based on the
object-oriented technique provides a simple, flexible, and extensible communication
mechanism for transmitting complex data structures and control information between
distributed objects. It can effectively improve the communication efficiency of HiCS
in solving irregularly structured problems. MOIDE-based applications can be
developed in an architecture-independent mode by calling the unified
communication interface; the applications are adaptively mapped onto the
underlying hosts at runtime by forming a hierarchical collaborative system and
creating the communication mechanism that matches the underlying architecture.
Generic task allocation strategies are proposed for task decomposition and
allocation in different irregularly structured problems. For example, the strategy of
initial task decomposition with runtime repartition can be applied to applications
with high data dependency, while dynamic task allocation can be used for applications
with low data dependency. The polymorphism and encapsulation of the object-oriented
methodologies enable adaptive task allocation that matches both the application
pattern and the system architecture.
A runtime support system called MOIDE-runtime is developed to implement
distributed computing based on the MOIDE model. It implements the functions and
mechanisms required in MOIDE-based computing, including host selection,
hierarchical collaborative system creation and reconfiguration, the unified
communication interface, object synchronization, and autonomous load scheduling.
Implemented in Java, MOIDE-runtime builds a platform-independent, extensible,
unified computing environment on a wide range of distributed systems, e.g., clusters of
SMP nodes, clusters of single-processor PCs, and clusters of hybrid hosts.
A distributed tree structure is designed for the N-body problem. It is a distributed
variation of the Barnes-Hut tree. A partial subtree scheme is proposed as the
communication-efficient solution to the data sharing of the distributed tree structure.
It differs from the tree structures of other parallel N-body methods on shared-memory
or distributed-memory systems. The distributed tree structure is supported
by the object-oriented approach based on the MOIDE model; the object-oriented
techniques facilitate the construction and transmission of tree structures in a
distributed environment.
Autonomous load scheduling is proposed as a dynamic task allocation approach for
highly asynchronous computation. The autonomous load scheduling method is
supported by the flexible, one-sided remote method invocation in object-based
communication. It can exploit the high parallelism in some irregular applications,
make full use of the computing power of the resources, and automatically achieve
dynamic load balancing at low overhead. Autonomous load scheduling is
used in the ray tracing method, and one of its schemes, individual scheduling, is
recognized as the better scheme for the highly asynchronous ray tracing procedure.
Grouped communication is adopted on the unified communication interface to reduce
the heavy and irregular communication cost in irregularly structured applications.
It is developed with the integration of the object-oriented and multithreading
methodologies in the MOIDE model. The grouped communication approach is used to
perform the large-volume all-to-all scattering operation in the radix sort, where the
grouped scatter outperforms the corresponding operation in MPICH.
1.5 Thesis Organization
In the following text, chapter 2 addresses the distributed object model MOIDE, with a
focus on the infrastructure of the hierarchical collaborative system. Chapter 3 describes the
runtime support system MOIDE-runtime. Chapter 4 presents the distributed N-body method
based on the MOIDE model, with emphasis on the distributed tree structure. Chapter 5
discusses the distributed ray tracing methods based on autonomous load scheduling. Chapter
6 presents the MOIDE-based CG and radix sort methods to illustrate the
architecture-independent nature of MOIDE-based computation and the efficiency of the
two-layer communication mechanism. Chapter 7 covers the related work and compares it
with my work. Chapter 8 concludes the thesis.
Chapter 2
MOIDE: A Distributed Object Model
2.1 Introduction
As discussed in Section 1.2.2, object-oriented technology is suitable for implementing
computations on distributed systems. A distributed system can be composed of
geographically scattered hosts whose computing resources are accessible to a large number
of users, and applications can be submitted to run on any of the hosts simultaneously. To
attain fair utilization of the resources and to organize the computations on them effectively,
a software facility is needed to support the various computing requirements on distributed
systems.
MOIDE (Multithreading Object-oriented Infrastructure on Distributed Environment) is a
distributed object model that implements high-performance computing on distributed
systems, with particular support for irregularly structured problems. It establishes a flexible
infrastructure that combines distributed objects with multithreading methodologies to support
parallel and distributed computing on the varied platforms of a distributed system.
The basic components of the MOIDE model are a group of objects dynamically created at
runtime on the hosts selected to run an application. The hosts may be scattered over a wide
range of a distributed system. The objects on the hosts are organized into a cooperative
working system, called a collaborative system, and execute the application together. The
collaborative system is a runtime infrastructure, built when an application is submitted to
run. It provides the mechanisms that allow the distributed objects to
interact with one another, and it is responsible for coordinating the computing procedures on
the distributed objects.
The construction of a collaborative system depends on the resources in the
underlying distributed system. It is always built on the most appropriate hosts, i.e., the hosts
with higher computing power and lower workload. For a distributed system containing SMP
nodes, multiple threads are generated inside the object on each SMP node. In this case, the
collaborative system has a hierarchical structure: the objects, one per SMP node, form
the upper level of the collaborative system, and the threads in the objects form the lower
level. Communication takes place on the two levels in different modes. The
combination of distributed objects and threads gives the collaborative system high
adaptability to heterogeneous system architectures.
The distributed object-oriented and multithreading features of the MOIDE model have the
following advantages.
(1) Architectural Transparency
The MOIDE model erects a uniform computing infrastructure for developing and executing
applications. The architecture of the underlying distributed system is transparent to the
applications: an application does not need to know on what hosts it runs; it just
requests a certain number of processors, which may be supplied by uniprocessor
hosts, SMP nodes, or hybrid hosts. The application is developed on an identical infrastructure
no matter what hosts it will be executed on. When running the application, the runtime
support system creates distributed objects and generates threads inside the objects
depending on the specific architecture of the underlying hosts.
(2) Combined Programming Methodology
MOIDE combines distributed object computing with multithreading methodologies,
making the model adaptive to heterogeneous architectures. A group of threads is generated
within an object to support parallel computing on an SMP node. These threads cooperate
tightly within the group and can share computation workload and data in the object. The
combined programming methodology is more efficient and resource-saving than a
completely distributed object approach, in which multiple objects would be created on each
SMP node to perform parallel computation on its processors. The threads can work in
different modes depending on the computation pattern of an application. An application is
decomposed into tasks; the threads in an object can work in cooperative mode,
cooperatively processing one task, or in independent mode, in which each thread
independently processes its own task.
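A minimal sketch of the independent work mode, in which a compute engine spawns one Java thread per local processor; the class and method names are illustrative assumptions, not the MOIDE runtime's actual classes:

```java
// Illustrative compute-engine fragment: one worker thread per local processor.
class ComputeEngineSketch {
    void runIndependent(final Runnable[] tasks) throws InterruptedException {
        Thread[] workers = new Thread[tasks.length];
        for (int p = 0; p < tasks.length; p++) {
            final int id = p;
            workers[p] = new Thread(new Runnable() {
                public void run() {
                    // Independent mode: each thread processes its own task.
                    // In cooperative mode the threads would share a single task.
                    tasks[id].run();
                }
            });
            workers[p].start();
        }
        for (int p = 0; p < tasks.length; p++) workers[p].join();  // wait for all threads
    }
}
```

Because the workers are threads of one object, they share the object's data directly through local memory, which is the basis of the shared-data communication layer described next.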
(3) Integrated Communication Mechanism
The MOIDE model supports two communication methods. The distributed objects
interact with each other by remote messaging, a data transmission approach between
distributed objects based on remote method invocation that is more powerful than message
passing. The threads inside an object access shared data through local memory; shared-data
access has low communication expense. The MOIDE model integrates the two
communication methods into a two-layer communication mechanism, and it provides a
unified communication interface on top of this mechanism that hides the two-layer
communication paths from the application. At the application level, communication between
objects or threads calls the same interface regardless of the physical path: a thread
communicates with a local thread in the same object or a remote thread in a remote object
through the same communication interface. The runtime support system carries out the
communication through the proper communication path, either shared-data access or remote
messaging.
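The idea behind the unified interface can be sketched as a single send method whose implementation picks the physical path; all names here are assumptions for illustration, not the actual MOIDE API:

```java
// Illustrative unified send: one call site, two physical communication paths.
void send(int destThread, Object data) throws Exception {
    if (isLocalThread(destThread)) {
        sharedBuffer(destThread).put(data);        // layer 1: shared-data access in local memory
    } else {
        Engine remote = lookupEngine(destThread);  // remote reference from the registration table
        remote.deliver(destThread, data);          // layer 2: remote messaging via RMI
    }
}
```

The application calls send identically in both cases; only the runtime support system knows whether the destination thread lives in the same object or on a remote host.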
(4) High Asynchrony
MOIDE supports highly asynchronous computation of the objects. The distributed objects
can execute their computing tasks asynchronously unless interaction is required between
them, and inter-object communication itself can be asynchronous. Ordinary message-passing
communication is fulfilled by the cooperation of a sender-receiver pair: communication
operations must be explicitly performed on both sides, with the sender issuing a send
operation and the receiver issuing a receive operation. The MOIDE model allows one-sided
communication, in which only the sender or the receiver starts the communication. An object
can send data to another object by writing the data directly to a variable in the destination
object via remote method invocation; similarly, a receiver can fetch data from a source object
by directly reading the data value there. The communication can be conducted at any time
without the explicit participation of the other side. One-sided communication contributes
to the high asynchrony of the computation on distributed objects.
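A sketch of the one-sided style under the same assumptions: the communication is a remote method that reads or writes a variable of the destination object, with no matching operation on the other side (the Mailbox interface is hypothetical):

```java
// Illustrative one-sided communication via remote method invocation.
interface Mailbox extends java.rmi.Remote {
    void put(int slot, double value) throws java.rmi.RemoteException;  // one-sided write
    double get(int slot) throws java.rmi.RemoteException;              // one-sided read
}

// Sender side only:   mailbox.put(7, partialSum);   // no receive is issued anywhere
// Receiver, whenever convenient:   double v = mailbox.get(7);
```

Either side can act at any moment, so neither object has to block waiting for its partner, which is what yields the high asynchrony claimed above.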
This chapter describes the fundamental structure and functionality of the MOIDE model.
The runtime support system that implements computation in the MOIDE model is addressed
in chapter 3.
2.2 Basic Collaborative System
The kernel of MOIDE model is collaborative system. Collaborative system is a runtime
software infrastructure. Its fundamental components are a group of objects distributed on the
hosts. The collaborative system is formed when an application is submitted to run, and it is
terminated when the execution has finished. The collaborative system can also be
reconfigured during the computation to match the runtime requirement of the computation
and the states of the underlying hosts.
2.2.1 System Structure
A collaborative system consists of a group of distributed objects. Fig 2.1 shows the basic
structure of a collaborative system built on P hosts. The object on host 0 is called the compute
coordinator. It is the first object created in the system, and it acts as the manager of the system.
Fig 2.1 The collaborative system built on P hosts
The compute coordinator is the initiator of the collaborative system. It starts to work on
the host where the application is submitted. It creates all remote objects on the other hosts and
allocates the computing tasks to them; it also coordinates the computing procedures on all
objects and conducts the system-wide synchronization. The other objects accept and execute
the assigned computing tasks, and hence are called compute engines. The collaborative system
has a registration mechanism that contains the references to the distributed objects in the
system. The distributed objects can locate each other by referring to the registration
mechanism: an object can find the reference of another object when it wants to communicate
with that object. The interaction mechanism provides the communication interface to the
distributed objects and implements all inter-object communication. The host selection
mechanism detects and selects hosts when a collaborative system is to be
created or reconfigured.
2.2.2 System Creation
A collaborative system is constructed through the cooperation of all objects. The compute
coordinator is the first object, started on one host; it then starts the remote objects on the other
hosts. The creation of a collaborative system is accomplished in four steps.
(1) Compute coordinator start
The compute coordinator is started on the host where the application is submitted.
(2) Host selection
If the application requests more than one processor, the compute coordinator searches
for available hosts in the underlying system to supply the required number of processors.
There may be many available hosts. The compute coordinator follows the host selection
policy to choose the most appropriate hosts, i.e., those that can provide high performance. It
examines the computing power and current states of the hosts by referring to the information
provided by the host selection mechanism. The host selection policy can be expressed by the
following priority:
$$\mathit{priority}_i = \frac{\mathit{performance}_i}{\mathit{workload}_i}$$

where priority_i is the priority of host i in the host selection, performance_i is the
computing power of host i, and workload_i is the current workload on host i. Precedence is
given to the hosts with higher priority.
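A small Java sketch of this selection policy, ranking candidate hosts by performance/workload and taking the top P; the HostInfo fields and class names are illustrative assumptions, not the runtime's actual types:

```java
import java.util.Comparator;
import java.util.List;

// Illustrative host selection by priority = performance / workload.
class HostInfo {
    String name;
    double performance;   // computing power of the host
    double workload;      // current workload on the host
    double priority() { return performance / Math.max(workload, 1e-9); } // guard for idle hosts
}

class HostSelector {
    static List<HostInfo> select(List<HostInfo> candidates, int p) {
        candidates.sort(Comparator.comparingDouble(HostInfo::priority).reversed());
        return candidates.subList(0, Math.min(p, candidates.size()));  // highest priority first
    }
}
```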
(3) Compute engine creation
The compute coordinator starts the creation of the objects on the selected hosts, one object
per host.
(4) Object registration
Each object registers itself with the registration mechanism. The registration mechanism
maintains the registration table, which stores the reference of every object. An object can get
the reference to another object by looking up the registration table, as Fig 2.2 shows. Thus an
object can locate other objects and communicate with them.
As Fig 2.2 shows, the registration table is duplicated on each object; the figure displays
only the registration tables on the compute coordinator and compute engine 1. The table
contains the following items: the logical name of the compute engine (CE), the name of its
residing host (HOST), and the reference to the compute engine (REF). All of the registration
tables together constitute the registration mechanism. When an object wants to communicate
with another object, it looks up the table, gets the reference to that object, and then performs
remote method invocation on the other object through the reference.
Fig 2.2 Registration tables
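In Java, the registration table might look like the following map from an engine's logical name to its host and remote reference; the class and field names are illustrative, not the runtime's actual types:

```java
import java.rmi.Remote;
import java.util.HashMap;
import java.util.Map;

// Illustrative registration table: CE -> (HOST, REF), duplicated on every object.
class RegistrationTable {
    static class Entry {
        final String host;
        final Remote ref;   // null for the object's own entry, as in Fig 2.2
        Entry(String host, Remote ref) { this.host = host; this.ref = ref; }
    }

    private final Map<String, Entry> table = new HashMap<>();

    void register(String ceName, String host, Remote ref) {
        table.put(ceName, new Entry(host, ref));
    }

    // Look up another object's reference before a remote method invocation.
    Remote lookup(String ceName) {
        return table.get(ceName).ref;
    }
}
```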
The collaborative system is fully built up once all objects have registered themselves with
the registration mechanism. The compute coordinator and the compute engines then
execute the application together.
2.2.3 System Work
To execute a computation on the collaborative system, the compute coordinator first
decomposes the application into computing tasks, which are allocated to all
compute engines. The compute coordinator itself also works as a compute engine. At the
same time, it performs the required coordination of the computing procedures on all
compute engines.
For a collaborative system having m compute engines, let power_i be the computing power
of compute engine i. The computing power of a compute engine is determined by the
computing power of the underlying host. Assuming the overall workload of an application
is W, the task allocated to compute engine i should have workload w_i, where

$$w_i = W \cdot \frac{\mathit{power}_i}{\sum_k \mathit{power}_k}$$
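For example, with W = 100 units and three engines of relative power 1, 2, and 2, the shares are 20, 40, and 40. A one-method Java sketch of this proportional split (illustrative; rounding to integral task counts is left out):

```java
// Proportional task sizing: w_i = W * power_i / sum_k(power_k).
static double[] shares(double totalWork, double[] power) {
    double sum = 0.0;
    for (double p : power) sum += p;          // total computing power of all engines
    double[] w = new double[power.length];
    for (int i = 0; i < power.length; i++) {
        w[i] = totalWork * power[i] / sum;    // engine i's share of the workload
    }
    return w;
}
// shares(100, new double[]{1, 2, 2})  ->  {20.0, 40.0, 40.0}
```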
The compute coordinator starts the computing procedure on the compute engines by
assigning the tasks to them. The compute engines process the tasks asynchronously except
where data communication and synchronization are required. The compute coordinator is
responsible for performing global synchronization across all compute engines at each
synchronization point. When the computation has finished, the compute coordinator ceases
all compute engines, and the collaborative system is thus terminated.
The computing procedure on a collaborative system is application-dependent; MOIDE is
an infrastructure that supports the implementation of applications with different computation
and communication requirements.
2.2.4 System Reconfiguration
The collaborative system is also flexible enough to conduct runtime system reconfiguration
to improve performance. The computing power of the collaborative system can be
enhanced by adding more compute engines on new hosts, and the computing task on an
overloaded host can be moved to another available host.
The MOIDE model has the flexibility to create a new compute engine at any time on any
host. The distributed compute engines are linked together via the registration mechanism,
whose registration table contains the references to all compute engines. The table can be
updated by inserting new references or removing old ones; therefore a new compute engine
can join the collaborative system and an old compute engine can be removed from the system.
2.2.4.1 System Expansion
Generally, there are many hosts available in a distributed system, while a collaborative
system is established on only the selected hosts. Moreover, irregularly structured problems
have non-predetermined computation patterns. For instance, an application initially running
on P compute engines may generate extremely high computation workload during execution.
If the distributed system has extra hosts available, the collaborative system can select
additional hosts at runtime to share the computation workload: the collaborative system can
be expanded by incorporating new compute engines on new hosts. The new compute engines
work in the same way as the old compute engines in the collaborative system. This is
horizontal system expansion, shown in Fig 2.3.
Fig 2.3 Horizontal system expansion
During the execution of an irregular application, the runtime computation workload may
become highly imbalanced among the compute engines, and a heavily-loaded compute engine
becomes a bottleneck. To alleviate the bottleneck, an extra compute engine can be attached
to the overloaded compute engine. The attached compute engine works under the overloaded
compute engine to share its workload; it is the assist-engine of the overloaded one and is
visible only to its parent compute engine. This is vertical system expansion.
Fig 2.4 shows the vertical system expansion with an assist-engine attached to compute engine
2.
Fig 2.4 Vertical system expansion
2.2.4.2 Host Replacement
The hosts in a distributed system are shared resources accessible to many users and
applications, so the states of the hosts, such as their workload, change from time to time.
Only idle or lightly-loaded hosts are suitable to accommodate a collaborative system.
At the system creation stage, the compute coordinator selects lightly-loaded hosts.
However, the workload on a host may later rise to a high level, caused not only by the
computation of this collaborative system but also by jobs from other users. Overloaded
hosts slow down the collaborative computation. If idle or lightly-loaded hosts are
available in the distributed system, they can be used to replace the overloaded ones.
In a host replacement, a new compute engine is created on the newly-selected host. The
new compute engine takes over the computing task from the compute engine on the
overloaded host and replaces its role in the collaborative system, and the old compute
engine is ceased. The registration table is updated accordingly to reflect the change:
the reference to the new compute engine replaces
the entry of the old compute engine in the table. The logical structure of the collaborative
system remains unchanged after the replacement, and the computing procedure continues on the
collaborative system as before. The system size is the same after the replacement. Fig 2.5
shows an example of host replacement in which the compute engine on Host 2 is replaced by a
new compute engine on the New Host 2.
Fig 2.5 Host replacement in a collaborative system
System reconfiguration is managed by the compute coordinator, except for vertical
system expansion, which is conducted by the parent compute engine only. The host selection
mechanism provides real-time information about the available hosts; the compute
coordinator reads this state information and performs host replacement operations when
necessary.
2.3 Hierarchical Collaborative System
The basic collaborative system is built with compute engine objects on single-processor
hosts only, and all inter-object communication is accomplished via remote messaging. For
a collaborative system on a heterogeneous system, the multithreading methodology can be
incorporated into the compute engines to suit the hierarchical architecture.
Nowadays symmetric multiprocessors (SMPs) made from off-the-shelf microprocessors
are widely used as cost-effective multiprocessor machines. Networked SMP machines
(clusters of SMP nodes) can deliver high performance for large-scale computing, and a
cluster of SMP nodes is considered a low-end substitute for high-end supercomputers [25]. A
distributed system may consist of both SMP and single-processor nodes, and the SMP nodes
may contain different numbers of processors and varying computing power. A heterogeneous
system like this therefore has a hierarchical architecture: there are loosely-coupled nodes
system-wide and tightly-coupled processors inside each SMP node. This heterogeneity
requires the MOIDE model to be adaptive to the architecture of the underlying hosts.
Therefore the basic collaborative system in 2.2 is expanded into the hierarchical
collaborative system (HiCS).
2.3.1 Heterogeneous System
Consider building a collaborative system on heterogeneous hosts. The basic
collaborative system structure is still applicable in this case: a P-processor SMP node
can be treated as P individual hosts, so P compute engines are created on it, and all
compute engines on the same or different SMP nodes simply form a basic collaborative
system. This, however, is not a proper approach, because it does not exploit the advantages
of the SMP node. For example, the objects residing on an SMP node can communicate with one
another via shared-data access, a more efficient communication method than remote messaging.
The communication mechanism on a heterogeneous system should integrate swift shared-data
access with widely-applicable message passing. Multithreading programming techniques
can be used on an SMP node: a thread is a lightweight entity that occupies fewer system
resources than a full object, so multithreading is a cost-effective methodology for parallel
computation on an SMP node.
Fig 2.6 shows the architecture of a heterogeneous system. It is a two-level hierarchical
structure: an SMP node is composed of tightly-coupled multiprocessors connected by shared
memory modules, and all nodes are linked across the network into a loosely-coupled cluster.
The relations between the objects residing on the nodes can also be treated at two
levels. The objects on the same SMP node are tightly related to each other; these sibling
objects can cooperate more tightly in the computation than objects on different nodes. The
sibling objects can be created as multiple threads to take advantage of multithreading
techniques such as efficient data sharing and low resource consumption.
Fig 2.6 Cluster of SMPs
Though message passing is the ordinary communication method on distributed systems,
the threads on the same SMP node can interact with each other by shared-data access, because
they are instantiated from the same object and can access its public data.
Shared-data access is a fast communication method on an SMP node. A two-layer communication
mechanism can therefore be built on heterogeneous hosts, integrating local shared-data access
and remote messaging on the two levels. With this mechanism, the performance of
communication-intensive applications can be improved.
To realize the adaptability of the MOIDE model, the basic collaborative system in 2.2 is
modified into a hierarchical structure that incorporates the distributed object and
multithreading methodologies. This is the hierarchical collaborative system structure.
2.3.2 Hierarchical Collaborative System
The hierarchical collaborative system (HiCS) is an infrastructure expanded from the basic
collaborative system. In HiCS, the compute engine on an SMP node spawns a group of
threads inside it; the number of threads equals the number of processors on the SMP
node, and the threads run on the multiprocessors in parallel. Multithreading supplements
the distributed object model to make the collaborative system adaptive to the
hierarchical architecture of hybrid hosts. The computation is more efficient when performed
by threads on an SMP node. The two-layer communication mechanism implements efficient
communication on HiCS.
Fig 2.7 shows the structure of the hierarchical collaborative system. The compute engine on
an SMP node can generate multiple threads inside it. For example, assuming Host 1 in Fig 2.7
is an SMP with k processors, the compute engine on it spawns k threads, as shown in the
attached box. The creation of a hierarchical collaborative system includes the following operations.
Fig 2.7 Hierarchical collaborative system built on heterogeneous hosts
(1) Host selection and compute engine creation
This is similar to the creation of the basic collaborative system described in 2.2. At first
the compute coordinator is started on the node where an application is initiated. The
compute coordinator selects the hosts to supply the processors required by the application.
The host selection policy on a heterogeneous system should include an index for the size of
the SMP node, giving higher priority to SMP nodes with more processors:
    priority_i = (P_i · performance_i) / workload_i

where priority_i is the priority of host i to be selected, P_i is the number of processors
on host i, performance_i is the computing power of each processor in host i, and workload_i is the
computation workload on host i. The number of processors in a host is thus one of the indices
in host selection: an SMP node with more processors has higher priority to be selected.
The compute coordinator starts the compute engines on the selected hosts, one compute
engine per host regardless of how many processors the host has, as the compute engines on
Host 1 to Host (P-1) in Fig 2.7 show. The compute engines register themselves to the
registration mechanism, and each compute engine keeps a registration table that records the
references to the remote compute engines.
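The selection policy can be sketched as follows; the sketch simply ranks candidate hosts by the priority formula above, and all names are hypothetical rather than the thesis's actual code:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    /** Hypothetical sketch of priority-based host selection:
        priority_i = P_i * performance_i / workload_i (higher is better). */
    public class HostSelector {
        public static class HostInfo {
            final String name;
            final int processors;        // P_i
            final double performance;    // per-processor computing power
            final double workload;       // current load; assumed > 0
            HostInfo(String name, int processors, double performance, double workload) {
                this.name = name; this.processors = processors;
                this.performance = performance; this.workload = workload;
            }
            double priority() { return processors * performance / workload; }
        }

        /** Returns the hosts sorted by descending priority; the coordinator
            takes the first entries until enough processors are supplied. */
        public static List<HostInfo> rank(List<HostInfo> hosts) {
            List<HostInfo> ranked = new ArrayList<HostInfo>(hosts);
            Collections.sort(ranked, new Comparator<HostInfo>() {
                public int compare(HostInfo a, HostInfo b) {
                    return Double.compare(b.priority(), a.priority());
                }
            });
            return ranked;
        }
    }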
(2) Thread generation
Each compute engine on an SMP node generates a group of threads inside it. The threads run
on the local multiprocessors.
The hierarchical collaborative system is a two-level infrastructure. All compute engines
form the upper level, which is directly managed by the compute coordinator; the lower level
contains the threads in each compute engine. The main thread is the original thread in each
compute engine, and all other threads are instantiated from it. The group of threads can
jointly process one computing task or independently process their own tasks. The main thread
acts as the local coordinator and performs the necessary synchronization of the thread group
(sketched below).
In HiCS, the threads in the same compute engine can access shared data, so inter-thread
communication can be accomplished by data access via shared memory. The distributed
compute engines, however, are shared-nothing objects; they communicate with one another by
remote messaging.
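The local synchronization provided by the main thread can be pictured with a minimal counting barrier; this is only a sketch of the idea, not the thesis's own primitive (MOIDE-runtime supplies its own barrier method, see Table 3.1 in chapter 3):

    /** Minimal reusable counting barrier for a group of sibling threads
        inside one compute engine. Illustrative only. */
    public class LocalBarrier {
        private final int parties;   // number of threads in the group
        private int waiting = 0;
        private int generation = 0;

        public LocalBarrier(int parties) { this.parties = parties; }

        public synchronized void await() throws InterruptedException {
            int myGeneration = generation;
            if (++waiting == parties) {   // last thread to arrive
                waiting = 0;
                generation++;             // release the whole group
                notifyAll();
            } else {
                while (myGeneration == generation) wait();
            }
        }
    }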
The hierarchical collaborative system is an adaptive infrastructure on a heterogeneous
system. Owing to the conditional creation of threads in a compute engine, HiCS sets up a
uniform infrastructure for developing and running applications on varied architectures.
Multithreading and single-threading compute engines can exist in a HiCS at the same time.
For instance, an application requests to run on P processors; the processors may be supplied
by different hosts each time the application runs. The application may run on a group of
SMP nodes of varied sizes, or on hybrid hosts mixing SMP nodes and single-processor hosts.
The underlying hosts are transparent to an application developed on the HiCS infrastructure:
a HiCS is created at runtime with a structure that matches the architecture of the hosts, to
achieve the best performance in executing the application.
The group of threads generated in the compute coordinator or a compute engine can be
organized to work in two modes, according to the computation patterns of the applications:
Cooperative mode: the main thread acts as the local coordinator inside a compute
engine and accepts the computing task from the compute coordinator. The group of
threads shares the computing task, i.e., each thread executes a part of it; a thread can
thus be called a sub-engine of the compute engine. The main thread coordinates the
computing procedure on the group of threads. The threads conduct only local
communication with each other, via shared-data access, while the main thread is
responsible for the communication with other compute engines.
Independent mode: each thread works as an independent compute engine and processes
a computing task independently. Even so, the threads in the same compute engine can
still access shared data, and any thread is allowed to communicate with the threads in
other compute engines through remote messaging. Unlike in cooperative mode, there is
no local coordinator; each thread performs an individual computing task and can thus be
called a pseudo-engine.
The work mode of the threads is determined by the computation pattern of an application:
the algorithm can be designed based on one of the modes, and the work mode can also be
switched from one to the other within an application, depending on the computational
requirements of its different phases.
2.3.3 Task Allocation
After a HiCS has been built, the compute coordinator should allocate the computing tasks
to all compute engines. Due to the highly diverse characteristics of irregularly structured
problems, different task allocation strategies should be used for different applications; a
good task decomposition cannot be discovered by inspection before execution because of the
irregularity in computation and communication [73]. The following strategies can be used as
general approaches to task allocation in irregularly structured problems.
(1) Initial task decomposition with runtime repartition
The workload of an irregular computation is not pre-determined and evolves during
execution, so it is not possible to decompose the computation evenly a priori. Task
allocation can therefore adopt the strategy of initial task decomposition with runtime
repartition: an application is initially divided into tasks based on an estimation of the
workload, and if the workload is found to be unbalanced among the processes during
execution, runtime task repartition re-decomposes the tasks based on the real workload.
This strategy is suitable for applications with high data-dependency.
(2) Dynamic task allocation
For applications with light data-dependency, it is not required to allocate all tasks to
the processes before execution. The tasks can be progressively allocated to the processes,
one at a time, in accordance with the computing progress on each process. With dynamic task
allocation, the workload is automatically balanced over the multiple processes without a
specific load-balancing operation (see the task-pool sketch at the end of this section).
(3) Balance in both computation and communication
Generally, load balancing refers to balancing the computation workload. However, load
balancing in irregularly structured problems should pay special attention to the
communication workload, because irregular communication patterns usually incur high
and diverse communication overhead that severely restricts performance. The task
allocation strategy for communication-intensive applications must therefore take
communication into account. The task decomposition should preserve data locality within
the tasks and generate a data distribution among the tasks that reduces inter-process
communication. Task allocation should also even out the diverse communication requirements
of the multiple processes, so that communication is balanced among them and communication
bottlenecks are alleviated.
The computation and communication patterns of irregularly structured problems are
extremely complicated, so task allocation strategies should be closely tied to the
applications. The strategies above are generic; specific task allocation approaches
should be derived from them for different applications.
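As an illustration of strategy (2), dynamic task allocation can be realized with a shared task pool from which each process fetches one task at a time; MOIDE-runtime exposes a primitive of this kind (getTask, see Table 3.1 in chapter 3), but the sketch below is hypothetical rather than the actual implementation:

    import java.util.LinkedList;
    import java.util.Queue;

    /** Hypothetical global task pool: workers pull one task at a time,
        so the load balances automatically across the processes. */
    public class TaskPool<T> {
        private final Queue<T> tasks = new LinkedList<T>();

        public synchronized void add(T task) { tasks.add(task); }

        /** Returns the next task, or null when the pool is exhausted. */
        public synchronized T getTask() { return tasks.poll(); }
    }

Each compute engine then runs a loop of the form: fetch a task with getTask(), process it, and repeat until the pool returns null.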
2.3.4 Unified Communication Interface
The communication in a hierarchical collaborative system takes place on two levels. The
threads inside a compute engine share public data and variables, so communication between
the threads can be realized through shared memory; this is the efficient way for local data
communication. The communication between compute engines is delivered by remote messaging
across the network, which incurs high communication latency. Fig 2.8 shows the structure of
the two-layer communication mechanism.
The communications between the tasks are delivered either by local shared-memory access
or by message passing through the network, depending on the locations of the communication
partners. A unified communication interface is provided to all tasks at the application
level: the tasks call the same interface for communication no matter where the destination
is, and the interface implicitly decides the proper communication path.
Fig 2.8 Two-layer communication mechanism
The two-layer communication is appropriate for a hierarchical collaborative system built
on heterogeneous hosts, and it helps to reduce the heavy and unpredictable communication
overhead of irregularly structured problems. With the flexible computing modes (cooperative
or independent), the tasks (threads) can be organized into different modes to exploit the
efficiency of the two-layer communication. In chapter 6, the CG method will demonstrate the
transparent mapping of the communication paths, and the radix sort will illustrate the
flexible use of the computing modes to implement efficient global communication.
The pseudo-engines have two communication paths: pseudo-engines on the same SMP node can
use shared-data access, while pseudo-engines residing on different nodes communicate via
remote messaging. Owing to the architecture-transparent feature of the MOIDE model, the
locations of the pseudo-engines are transparent to the applications. At the application
level, any communication between pseudo-engines calls the unified communication interface,
and the MOIDE runtime support system (see chapter 3) decides the exact communication path
according to the locations of the pseudo-engines, as the sketch below illustrates.
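A minimal sketch of how such an interface might dispatch between the two layers is given below; it assumes a location table mapping target IDs to hosts, and all names are hypothetical rather than the actual MOIDE-runtime API:

    /** Hypothetical unified communication interface: the caller names only the
        target ID; the interface picks shared-data access for local targets and
        remote messaging for targets on other hosts. */
    public abstract class UnifiedComm {
        /** True if the target runs on this SMP node (from the location table). */
        protected abstract boolean isLocal(int targetId);

        /** Deposit data into a shared buffer readable by the local target. */
        protected abstract void sharedWrite(int targetId, Object data);

        /** Ship data to a remote compute engine via remote method invocation. */
        protected abstract void remoteSend(int targetId, Object data);

        /** The single entry point called at the application level. */
        public final void send(int targetId, Object data) {
            if (isLocal(targetId)) sharedWrite(targetId, data);  // shared-data layer
            else remoteSend(targetId, data);                     // remote-messaging layer
        }
    }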
2.4 Implementation
The MOIDE model is implemented in Java with the RMI (Remote Method Invocation) interface
on distributed systems. The motivation for using Java and RMI has been discussed in 1.2.3.
Java's object-oriented, multithreading and platform-independent features are suitable for
implementing the hierarchical collaborative system infrastructure on heterogeneous systems,
and the remote method invocation mechanism of RMI facilitates the implementation of flexible
interaction and communication between distributed objects. A brief introduction to the
implementation follows; the details are covered in chapter 3.
(1) Compute Coordinator and Compute Engine
The compute coordinator and compute engines are the major components of the hierarchical
collaborative system. Two classes, the compute coordinator and the compute engine, are
defined as the kernel of the implementation; the other components are built around them.
(2) Object Registration and Interaction Interface
The registration mechanism of the collaborative system is implemented on the RMI registry,
rmiregistry. The registry runs on each selected host and provides a naming service to the
distributed objects; a compute engine registers itself to the registry when created. The
compute coordinator assigns computing tasks and other arguments to the remote compute
engines and triggers the computation on them through remote method invocation. The
compute coordinator generates a name list of the compute engines and broadcasts it to the
remote compute engines. Each compute engine can then independently obtain the reference to
any other remote compute engine by looking up the registry on the remote host once; the
references are stored in the registration table on each compute engine for later use, as
Fig 2.2 shows.
The interaction and communication between the compute coordinator and the compute engines
go through remote method invocation via the RMI interface. An interface class is declared,
and its implementation is provided in the compute engine class.
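The registration step can be sketched with the standard java.rmi.Naming API; the binding names and the helper class below are hypothetical:

    import java.rmi.Naming;
    import java.rmi.Remote;

    /** Sketch of registration via rmiregistry. */
    public class Registration {
        /** A newly created compute engine binds itself into the registry on its host. */
        public static void register(int id, Remote engine) throws Exception {
            Naming.rebind("rmi://localhost/engine" + id, engine);
        }

        /** Look up a remote compute engine once and return its reference,
            to be cached in the local registration table. */
        public static Remote lookup(String host, int id) throws Exception {
            return Naming.lookup("rmi://" + host + "/engine" + id);
        }
    }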
(3) Multithreading
A compute engine on an SMP node will instantiate a group of threads before it starts to run
the computing task. Recent JDK versions, e.g. IBM JDK 1.1.8 and Blackdown JDK 1.2 and up,
support kernel-based threads (native threads), and the JVM can schedule native threads onto
the multiprocessors of an SMP node.
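The thread generation step can be sketched as follows; this is illustrative rather than the thesis code, and the processor count is passed in explicitly (in MOIDE it can be taken from the host-state information):

    /** Spawn one worker thread per local processor on an SMP node (sketch). */
    public class ThreadGroupSpawner {
        public static Thread[] spawn(int processors, Runnable work) {
            Thread[] workers = new Thread[processors];
            for (int i = 0; i < processors; i++) {
                workers[i] = new Thread(work, "sub-engine-" + i);
                workers[i].start();   // the JVM schedules native threads onto the processors
            }
            return workers;
        }

        public static void join(Thread[] workers) throws InterruptedException {
            for (Thread t : workers) t.join();   // main thread waits for the group
        }
    }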
(4) Two-layer Communication
The data sharing among a group of threads is accomplished by accessing public data objects;
in cooperative mode, the threads in a compute engine may work on the same data set and share
the computing task. Compute engines transmit data to each other by remote messaging, i.e.,
by passing data through remote method invocation; RMI supports object serialization, which
enables the direct transfer of complex data objects between compute engines. An application
calls the unified communication interface to realize the two-layer communication. The
uniform interface provides the communication methods to the applications and seamlessly
integrates shared-data access and remote-messaging communication on the two levels. The
unified communication interface is particularly useful for computation in independent mode,
where each thread works independently as a compute engine. The location of each thread is
registered at the creation stage of HiCS; when a thread wants to communicate with another
thread, it just specifies the ID of the target when calling a communication method, and the
two-layer communication mechanism chooses the proper path and delivers the data depending
on the locations of the two threads.
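At the application level, a call might look like the following; the exchDoubleArray signature is assumed from its description in Table 3.1 and may differ from the real MOIDE-runtime interface, and all other names are hypothetical:

    /** Hypothetical shape of the communication call; only the target ID is named. */
    interface CommLibSketch {
        double[] exchDoubleArray(int targetId, double[] data);
    }

    class NeighborExchange {
        /** The two-layer mechanism decides internally whether the exchange goes
            through shared data (same SMP node) or remote messaging. */
        double[] exchange(CommLibSketch comm, int neighborId, double[] boundary) {
            return comm.exchDoubleArray(neighborId, boundary);
        }
    }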
2.5 Summary
This chapter has described the distributed object model MOIDE, which supports flexible and
efficient computing on distributed systems. The MOIDE model presents a computing
infrastructure that is adaptive to heterogeneous system architectures: it creates a
collaborative system on the available hosts to execute an application, and the collaborative
system can be reconfigured to adapt to changes in the states of the underlying hosts. By
combining the distributed object and multithreading methodologies, the hierarchical
collaborative system realizes high-performance computing on heterogeneous systems. The
compute engines in a hierarchical collaborative system can work in two modes, in accordance
with the computation patterns of the applications, and the two-layer communication mechanism
built on the hierarchical collaborative system supports efficient communication.
The MOIDE model is implemented by the runtime support system MOIDE-runtime; chapter 3
gives the implementation details. The MOIDE model is suitable for various applications,
especially for solving irregularly structured problems on distributed systems. It is used
to implement four irregularly structured applications: the N-body problem, ray tracing, CG,
and radix sort. The utilization and advantages of the MOIDE model will be demonstrated by
these four applications in chapters 4, 5 and 6.
Chapter 3
Runtime Support System
A runtime support system is developed to support distributed computing on the MOIDE model.
It provides the fundamental classes and APIs for implementing MOIDE-based computation, and
it is programmed in Java and RMI.
3.1 Overview
The runtime support system provides the components and functions of the MOIDE model.
Table 3.1 lists the major classes and methods implemented in the runtime support system,
which is called MOIDE-runtime in the following text for brevity.
Class         Method           Description
StartEngine                    build a collaborative system
              getHost          select hosts
              createEngine     create compute engines on the hosts
              invokeEngine     assign tasks to remote compute engines
Codr                           compute coordinator
              main             create collaborative system
              run              run application
Engine                         object interface of compute engine
EngineImpl                     implementation of compute engine
              run              start compute engine
              ceaseEngine      terminate compute engine
ExpandEngine                   expand collaborative system
              addEngine        add new compute engine
RecfgEngine                    reconfigure collaborative system
              checkEngine      check the states of the hosts
              replaceEngine    replace a compute engine with a new one
CommLib                        unified communication interface
              exchDouble       exchange a double
              exchDoubleArray  exchange an array of double
              exchIntArray     exchange an integer array
              allReduce        global reduction
              scan             global scan
Util                           miscellaneous utilities
              barrier          synchronize a group of threads
              remoteBarrier    synchronize compute engines
              getTask          get a task from global task pool
              getSubtask       get a subtask from local subtask queue
Table 3.1 Major classes and methods in MOIDE runtime support system
The compute engine is specified as the interface Engine together with its implementation
EngineImpl. The compute coordinator class Codr is an application-dependent class that calls
the main method of the application.
Fig 3.1 Organization of MOIDE runtime support system
Fig 3.1 depicts the organization of MOIDE-runtime. The runtime support system implements
the creation and reconfiguration of the hierarchical collaborative system: it defines the
class StartEngine for system creation, ExpandEngine for system expansion, and
RecfgEngine for host replacement. These three classes call the same methods to
search for the available hosts in a distributed system and detect their states. The state
detection is implemented with the aid of ClusterProbe [13], a Java-based tool, developed by
our research group, for reporting the states of clustered computers. ClusterProbe runs on a
server, monitors the hosts in a distributed system, and periodically reports the states and
other information of the hosts, including processor type, performance, memory size,
workload, and the number of processors in each host. The host selection is based on this
state information.
The runtime support system provides the primitives for dynamic load scheduling and
object synchronization. It also defines a unified communication interface for the two-layer
communication.
3.2 Principal Objects
The principal objects in a collaborative system are the compute coordinator and the
compute engines. They are defined as two classes, and all other classes are built around
them to complete the system functions.
The compute coordinator is the first object started, on the host where an application
begins to run. The coordinator is responsible for host selection and remote compute engine
creation, and it coordinates the whole computing procedure on the collaborative system. A
compute engine is an object created on a remote host; it accepts and processes computing tasks.
After establishing the collaborative system, the compute coordinator also works as a compute
engine and executes a computing task.
Fig 3.2 Class description and relation of compute coordinator and compute engine
The interactions between the objects (compute coordinator and compute engines) are
implemented on the RMI interface. Following the convention of an RMI-based distributed
object scheme, an interface and its implementation must be specified for remote
objects. Therefore the compute engine is defined as the interface Engine and its
implementation EngineImpl. The interface contains the declarations of all methods available
for remote method invocation; the implementation specifies the details of the compute
engine class, including the bodies of the methods declared in the interface.
class Codr
    Appl appl;
    main() { StartEngine.run(); appl = new Appl(); appl.run(); ceaseEngine(); }
    run()  { appl = new Appl(); appl.run(); }

class EngineImpl        (one instance on each remote host)
    Codr codr;
    ...
    codr = new Codr(); codr.run();
From the description above, we can outline the structures of the compute coordinator and
compute engine classes as well as their relationship. As Fig 3.2 shows, the compute
coordinator class Codr is a comprehensive class that encapsulates the system creation and
coordination work. EngineImpl is the interface implementation of the compute engine
class; it includes the methods for remote object interaction, such as remote messaging and
global synchronization (see 3.6.2). The constructor of a compute engine is invoked by the
compute coordinator through remote method invocation. When a compute engine is created on
an SMP node, it spawns a group of threads according to the number of local processors.
Appl is the class of the application program to be run; both the compute coordinator and
the compute engines instantiate the application class and run its main method as
appl.run().
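Following this convention, the declarations can be sketched as below; the method set is abridged to the two methods listed in Table 3.1, and the bodies are placeholders rather than the thesis code:

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.server.UnicastRemoteObject;

    /** Remote interface of the compute engine: declares the methods that the
        compute coordinator and peer engines may invoke remotely. */
    public interface Engine extends Remote {
        void run() throws RemoteException;          // start the compute engine
        void ceaseEngine() throws RemoteException;  // terminate the compute engine
    }

    /** Implementation skeleton of the compute engine. */
    public class EngineImpl extends UnicastRemoteObject implements Engine {
        public EngineImpl() throws RemoteException { super(); }

        public void run() throws RemoteException {
            // On an SMP node, spawn a group of worker threads here,
            // then execute the assigned computing task.
        }

        public void ceaseEngine() throws RemoteException {
            // Stop the worker threads and release resources.
        }
    }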
3.3 System Creation
The compute coordinator is the initiator of a collaborative system. It first runs the class
StartEngine to create the collaborative system; the creation of a compute engine on another
host is triggered by a remote method invocation from the compute coordinator.
3.3.1 class StartEngine
The class StartEngine builds a collaborative system based on the required number of
processors and the available system resources. It includes the methods to select the
hosts, create and invoke compute engines on them, and start the computation on the
system. The methods implement the three stages of system creation: the host selection method
getHosts(), the compute engine creation method createEngine(), and the remote compute
engine invocation method invokeEngine(). These methods are called in the run()
method of StartEngine, which is executed by the compute coordinator, as Fig 3.3 shows.
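The three-stage run() can be outlined as follows; this is a plausible outline consistent with the description above (reusing the Engine interface sketched in 3.2), not the actual code of Fig 3.3, and the signatures are assumptions:

    /** Outline of StartEngine's three-stage system creation (sketch). */
    public class StartEngineSketch {
        public void run(int requiredProcessors) throws Exception {
            String[] hosts = getHosts(requiredProcessors);  // stage 1: host selection
            Engine[] engines = createEngine(hosts);         // stage 2: engine creation
            invokeEngine(engines);                          // stage 3: task assignment and start
        }

        String[] getHosts(int n) { /* rank hosts by priority, pick the best */ return new String[n]; }
        Engine[] createEngine(String[] hosts) throws Exception { /* remote creation via RMI */ return new Engine[hosts.length]; }
        void invokeEngine(Engine[] engines) throws Exception { /* remote method invocation */ }
    }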
In group scheduling, the group-based task allocation and delayed strip send-back aim
to decrease the frequency of remote method invocation. These approaches were proposed on
the expectation that frequent remote method invocation produces high communication
overhead and should therefore be reduced during the computation. However, the runtime tests
lead to the opposite conclusion: the row-based individual task allocation produces low
communication overhead and has little influence on the overall performance.
Now inspect the communication overhead of individual scheduling. Fig 5.9 shows the
detailed time breakdowns of individual scheduling. The communication time is the cost of
row fetching and send-back. In individual scheduling, a rendered row is directly stored
into the corresponding entry of the image buffer in the compute coordinator via remote
method invocation; there is no strip reordering operation. In the first case in Fig 5.9
(P = 4, i.e., four processors), the rendering work is evenly performed by the four threads
residing on one SMP node, which incurs no communication cost.
Fig 5.9 Execution time breakdowns of individual scheduling
The second case is the execution time of eight threads on two SMP nodes (P = 8). The
four threads on the left exist in the compute coordinator; their execution time is mostly spent