
A Framework for Fully Decentralised Cycle Stealing

Richard Samuel Mason

April 2007

A dissertation submitted in partial fulfilment

of the requirements for the degree of

DOCTOR OF PHILOSOPHY

School of Software Engineering and Data Communications

Faculty of Information Technology

Queensland University of Technology

Brisbane, Australia


Keywords

Cycle Stealing, Cycle scavenging, Volunteer computing, Peer‐to‐peer, Fully decentralised networking, Pure P2P, Distributed computing


Abstract

Ordinary desktop computers continue to obtain ever more resources – increased processing power, memory, network speed and bandwidth – yet these resources spend much of their time underutilised. Cycle stealing frameworks harness these resources so they can be used for high‐performance computing. Traditionally, cycle stealing systems have used client‐server based architectures, which place significant limits on their ability to scale and the range of applications they can support. By applying a fully decentralised network model to cycle stealing, the limits of centralised models can be overcome.

Using decentralised networks in this manner presents some difficulties which have not been encountered in their previous uses. Generally, decentralised applications do not require any significant fault tolerance guarantees. High‐performance computing, on the other hand, requires very stringent guarantees to ensure correct results are obtained. Unfortunately, mechanisms developed for traditional high‐performance computing cannot be simply translated because of their reliance on a reliable storage mechanism. In the highly dynamic world of P2P computing this reliable storage is not available. As part of this research a fault tolerance system has been created which provides considerable reliability without the need for persistent storage.

As well as increased scalability, fully decentralised networks offer the ability for volunteers to communicate directly. This ability provides the possibility of supporting applications whose tasks require direct, message passing style communication. Previous cycle stealing systems have only supported embarrassingly parallel applications and applications with limited forms of communication, so a new programming model has been developed which can support this style of communication within a cycle stealing context.

In this thesis I present a fully decentralised cycle stealing framework. The framework addresses the problems of providing a reliable fault tolerance system and supporting direct communication between parallel tasks. The thesis includes a programming model for developing cycle stealing applications with direct inter‐process communication and methods for optimising object locality on decentralised networks.


Table of Contents

KEYWORDS
ABSTRACT
TABLE OF CONTENTS
TABLE OF FIGURES
TABLE OF CODE LISTINGS
STATEMENT OF ORIGINAL AUTHORSHIP
ACKNOWLEDGEMENTS
1 INTRODUCTION
1.1 DECENTRALISED P2P
1.2 CYCLE‐STEALING
1.3 DECENTRALISED CYCLE‐STEALING
1.4 CONTRIBUTIONS
2 RELATED WORK
2.1 DECENTRALISED NETWORKING
2.1.1 CHORD
2.1.2 CONTENT‐ADDRESSABLE NETWORK
2.1.3 PASTRY
2.2 CYCLE STEALING
2.2.1 DREAM
2.2.2 BUTT ET AL
2.2.3 AWAN ET AL
2.2.4 G2 CLASSIC
2.2.5 LOAD BALANCING
3 DECENTRALISED CYCLE‐STEALING
3.1 G2:P2P DESIGN
3.1.1 JOB ASSIGNMENT
3.2 PROGRAMMING MODEL
3.2.1 DISTRIBUTED OBJECT MODEL
3.2.2 INTER‐OBJECT COMMUNICATION
3.2.3 WELL‐KNOWN OBJECTS
3.2.4 OBJECT LIFETIME
3.3 VOLUNTEER ARRIVAL & DEPARTURE
3.4 IMPLEMENTATION
3.4.1 PROTOTYPE ARCHITECTURE
3.4.2 .NET REMOTING BACKGROUND
3.4.3 INTEGRATING G2:P2P INTO REMOTING
3.4.4 ACTIVATING OBJECTS
3.5 CONCLUSION
4 FAULT TOLERANCE
4.1 BACKGROUND
4.1.1 CHECKPOINT BASED PROTOCOLS
4.1.2 LOG‐BASED PROTOCOLS
4.2 FAULT TOLERANCE IN G2:P2P
4.2.1 LOGGING PROCEDURE
4.3 CHECKPOINTING
4.3.1 SUPPORT FOR BLOCKING METHODS
4.3.2 SUPPORT FOR LONG RUNNING METHODS
4.4 CONCLUSION
5 IMPROVING LOCALITY
5.1 RELATED WORK
5.2 OPTIMISATIONS
5.2.1 OPTIMISATION 1 – OBJECTID ORDERING
5.2.2 OPTIMISATION 2 – OBJECT COLLOCATION
5.2.3 OPTIMISATION 3 – VOLUNTEER BALANCING
5.2.4 OPTIMISATION 4 – NODE ORDERING
5.3 PROGRAMMING MODEL EXTENSIONS
5.4 CONCLUSION
6 EVALUATION
6.1 TEST APPLICATIONS
6.1.1 MANDELBROT – EMBARRASSINGLY PARALLEL
6.1.2 LATTICE GAS SIMULATION – CELLULAR AUTOMATON
6.2 SPEEDUP TESTS
6.2.1 MULTI‐CORE SPEEDUP
6.3 FAULT TOLERANCE OVERHEAD
7 CONCLUSIONS
7.1 FUTURE WORK
BIBLIOGRAPHY


Table of Figures

Figure 2‐1 – Pastry Routing Table (8‐bit NodeID, b=2)
Figure 2‐2 – Pastry Routing from ID:2000 ‐> ID:0301
Figure 2‐3 – DREAM Architecture
Figure 3‐1 – G2:P2P Overview
Figure 3‐2 – Assigning jobs to Volunteers
Figure 3‐3 – Sending Messages to Well‐Known Objects
Figure 3‐4 – Standard G2:P2P Object Creation Sequence
Figure 3‐5 – Well‐Known G2:P2P Object Creation Sequence
Figure 3‐6 – G2:P2P Prototype Architecture
Figure 3‐7 – External Client Message Redirection
Figure 3‐8 – .NET Remoting Structure
Figure 3‐9 – Activation via CustomActivatorSink
Figure 3‐10 – G2:P2P Remoting Structure
Figure 4‐1 – Simple Rollback Example
Figure 4‐2 – Domino Rollback
Figure 4‐3 – Overview of G2:P2P Message Logging
Figure 5‐1 – Unoptimised Ring Communication
Figure 5‐2 – Optimised Ring Communication
Figure 6‐1 – Mandelbrot Visualisation
Figure 6‐2 – Lattice Gas Simulation of Immiscible Fluids
Figure 6‐3 – Speedup of Object Ordering Optimised Cellular Automata
Figure 6‐4 – Speedup of Mandelbrot with Volunteer Balancing
Figure 6‐5 – Speedup of Cellular Automata with Volunteer Balancing
Figure 6‐6 – Speedup of Mandelbrot on Dual‐Core Machine
Figure 6‐7 – Speedup of Cellular Automata on Dual‐Core Machine
Figure 6‐8 – Fault Tolerance Overhead for Cellular Automaton


Table of Code Listings

Listing 3‐1 – Creating G2:P2P Jobs
Listing 3‐2 – Inter‐object Communication
Listing 3‐3 – Connecting to Well Known Objects using Type Registration
Listing 3‐4 – Connecting to Well Known Objects using ‘Connect’ API
Listing 4‐1 – Non‐G2:P2P Style Blocking
Listing 4‐2 – G2:P2P Style Blocking
Listing 4‐3 – Long Running G2:P2P Method
Listing 4‐4 – Interruptable G2:P2P Loop
Listing 4‐5 – Long running Interruptable Task with Return Value
Listing 4‐6 – Method with Multiple Blocking Points
Listing 5‐1 – Using the Object Spacing Optimisation
Listing 5‐2 – Using the Object Collocation Optimisation


Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet

requirements for an award at this or any other higher education institution. To

the best of my knowledge and belief, the thesis contains no material previously

published or written by another person except where due reference is made.

Signature: ___________________

Date: _________________________


Acknowledgements

First and foremost I would like to acknowledge my Lord and saviour, Jesus Christ, without whom this thesis would not exist. He has used the process of preparing this thesis to humble me and teach me and I now offer it to Him as I do all parts of my life.

Secondly, I thank my wife, Tania, for her support and encouragement. We started our marriage during this process and without her love and understanding it quite likely would not have reached its conclusion.

Similarly, I thank my family – parents, brothers, sister, parents‐in‐law and acquaintances. They have all provided their fair share of encouragement over the last years and I thank you all for it; especially my parents who, more than anyone, are responsible for getting me to, and through, this candidature. Also thanks to the many friends, from the Shallow‐but‐Friendly home groupers to The Wiggles to the other individuals, too many to single out, though I will save a special thanks for Steve Pynor – I might get that real job now, if not the haircut.

To the PLASers (as I will always know them, including (but not limited to) Greg,

Jiro, Simon, Jens, Dominic, Doug, Joel, Asbjorn, & Aaron) thanks for making the

lab a great place to work in. I’m certain I would’ve given up had I not had all of

you keeping the vortex of procrastination spinning.

And finally to my supervisors, Wayne Kelly and Paul Roe. Thank you for your guidance. Wayne, I truly appreciated your ability to keep me on track, whilst giving me room when needed, and your tips which helped me persist.


1 Introduction

Peer‐to‐peer (P2P) computing has made a significant impact on Internet computing. The increase in P2P computing has been made possible by the increasing resources of personal computers. Modern PCs usually have good Internet connections, powerful processors and significant memory assets. P2P applications are designed to utilise these resources more effectively than standard web based applications, which are server oriented. Server based applications make use of very few of the client’s resources (1).

The most common applications associated with the P2P movement are in the file‐sharing arena: Napster, Gnutella, and their more recent offspring. These applications use the increased connectivity of home machines to distribute the cost of network bandwidth across a large number of users. The applications rely on there being a significant number of people connected to the system for their services to be useful. If only a few people are connected then a centralised system can generally provide better download speeds, but centralised services have difficulty scaling. P2P file sharing networks can scale to millions of users with relatively few resources being provided by the user who initially offers the files.

P2P systems rely on users sharing their resources. Some systems, such as the file‐sharing systems, pay for the resources shared by providing additional files which users can then download – a kind of big swap meet. Other P2P systems rely on more charitable donations, usually for the advancement of science. The resources in these cases tend to be computing cycles, and the systems include SETI@Home and various medical research systems. Alternatively, some networks are run on internal business networks and machines participate due to company policies.

A succinct definition of P2P computing is difficult to find. Some view any system which takes advantage of resources located on a large number of desktop computers as P2P, while others require that these machines have some form of direct communication, or even that every machine in the network plays an equal role and there are no special server style machines involved.

P2P systems typically have some of the following characteristics:

• A large number of standard desktop machines involved. Standard desktop machines exclude servers or supercomputers.

• Direct communication between these “peer” machines.

• Highly volatile membership – peers are free to come and go as they please (and typically do so quite often).

These characteristics are different to those of other computing patterns such as client‐server and clustering. In client‐server systems a central authority, the server, is responsible for coordinating and servicing the system. These server systems generally require considerable resources and are expensive to build and maintain. Cluster systems are more similar to P2P in that the machines involved are often standard desktop machines with direct communication links; however, clusters generally consist of a dedicated collection of machines rather than the highly dynamic sets associated with P2P.

1.1 Decentralised P2P

The purest form of P2P computing is when there are no central authorities coordinating the system. These systems, hereafter referred to as fully decentralised systems, provide significant benefits over more centralised approaches, including:

• Improved stability – the system cannot be disabled by any single machine failing or being disconnected, and

• Easier deployment/maintenance – servers are generally more powerful and require more maintenance than desktop machines. Additionally, desktop machines are usually being maintained and used for other purposes.


Fully decentralised systems are, however, more difficult to develop. Whereas client‐server systems have a controlling body which maintains global information, decentralised systems must perform all operations using only the local information available at whichever node is performing the operation. Global operations, such as searching the entire network, must be performed through a series of local operations.

The large size of the networks involved means that such operations cannot be performed by simply contacting every node involved, as that would quickly overload the network’s resources and cause failure of the system. Instead, operations must be performed using sophisticated algorithms which involve only a limited set of nodes. Despite these restrictions the algorithms must still obtain optimal, or at least near‐optimal, solutions.
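The idea of answering a global question through a series of local steps can be illustrated with a small sketch. The example below is a simplified, Chord‐style key lookup on an identifier ring (Chord is reviewed in Chapter 2); it is not presented as the routing scheme used by the framework in this thesis. Each node holds only a short table of other nodes, and a lookup is forwarded from node to node until the node responsible for the key is found. The function names, the 6‐bit identifier space and the node identifiers are assumptions chosen purely for illustration.

```python
def ring_distance(a, b, size):
    """Clockwise distance from a to b on a ring of `size` identifiers."""
    return (b - a) % size


def build_fingers(node_ids, bits):
    """Give every node a small routing table with `bits` entries.

    Entry i of node n is the first node at or after n + 2**i on the ring.
    The global view is used only to set up this simulation; lookups read
    nothing but the table of the node they are currently visiting.
    """
    size = 1 << bits
    ring = sorted(node_ids)

    def successor(key):
        return min(ring, key=lambda n: ring_distance(key, n, size))

    return {n: [successor((n + (1 << i)) % size) for i in range(bits)]
            for n in ring}


def lookup(fingers, start, key, bits):
    """Find the node responsible for `key`, starting at node `start`.

    Returns (owner, hops), where hops counts the forwarding steps taken.
    """
    size = 1 << bits
    node, hops = start, 0
    while True:
        successor = fingers[node][0]
        to_key = ring_distance(node, key, size) or size
        to_succ = ring_distance(node, successor, size) or size
        if to_key <= to_succ:
            # The key falls between this node and its immediate successor.
            return successor, hops
        # Otherwise forward to the known node that most closely precedes the key.
        node = max((f for f in fingers[node]
                    if 0 < ring_distance(node, f, size) < to_key),
                   key=lambda f: ring_distance(node, f, size))
        hops += 1


# Hypothetical 10-node network in a 6-bit (64 identifier) space.
nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
fingers = build_fingers(nodes, bits=6)
owner, hops = lookup(fingers, start=8, key=54, bits=6)
print(owner, hops)  # prints "56 2"
```

In this sketch node 8 reaches the owner of key 54 by forwarding through nodes 42 and 51, consulting three routing tables rather than all ten nodes. With exact tables of this kind the number of forwarding steps never exceeds the number of identifier bits.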

1.2 Cycle‐Stealing

Cycle‐stealing is a term used to describe P2P systems which share computing cycles. Generally these systems are designed to make use of the spare cycles available when the machine is not being actively used, for example during the idle times overnight or during lunch breaks. The concept of cycle‐stealing is reasonably well known, primarily due to popular centralised systems such as the “@Home” projects (SETI@Home, Folding@Home).

Cycle‐stealing systems can be split into two broad categories – application specific systems and frameworks. Application specific systems, such as SETI@Home, are designed to solve a specific problem, while cycle‐stealing frameworks provide a more general infrastructure which application programmers can then make use of to solve a variety of different problems.

Cycle‐stealing frameworks allow application programmers to make use of

shared computing cycles without having to implement the actual cycle‐stealing

portion of the project. Additionally, these frameworks often allow people to

contribute cycles to a variety of projects using a single client.


The participants in traditional cycle‐stealing systems can be classified into three roles: volunteers, clients and brokers. Volunteers are machines that are offering cycles to the system. These cycles may be offered for charitable purposes or in return for some form of payment. Client machines are the consumers of these cycles. Clients submit work to the system to be distributed amongst the volunteers, then collect the results of that work.

The final role, broker, is the interface between clients and volunteers. The exact details of what work is performed by brokers depend on the specific cycle‐stealing system. In some cases, such as G2 (2; 3), brokers are separate machines which store work requests from clients and distribute this work to volunteers. Brokers in other systems, such as Condor (4), simply act as mediators for setting up direct connections between clients and volunteers. Clients must then handle the actual distribution of the work themselves. In some systems these roles are not kept separate and particular machines may take on multiple roles. It is particularly common to have client machines that also perform some, or all, of the brokerage role.
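The three roles can be sketched in a few lines of code. The following is a minimal, in‐memory illustration of the store‐and‐distribute style of brokerage described above; it is not the API of G2, Condor or any other system, and every class, method and function name in it is hypothetical.

```python
from collections import deque


class Broker:
    """Central go-between: stores client work and hands it to volunteers."""

    def __init__(self):
        self._pending = deque()  # work submitted by clients, not yet taken
        self._results = {}       # job_id -> result reported by a volunteer

    def submit(self, job_id, func, arg):
        """Called by a client to offer one unit of work."""
        self._pending.append((job_id, func, arg))

    def request_work(self):
        """Called by an idle volunteer; returns a work unit or None."""
        return self._pending.popleft() if self._pending else None

    def report(self, job_id, result):
        """Called by a volunteer when a work unit is finished."""
        self._results[job_id] = result

    def collect(self):
        """Called by the client to gather the results received so far."""
        return dict(self._results)


def volunteer_step(broker):
    """One volunteer donating a slice of idle time; False if no work is left."""
    unit = broker.request_work()
    if unit is None:
        return False
    job_id, func, arg = unit
    broker.report(job_id, func(arg))
    return True


def square(x):
    return x * x


# Client role: submit five independent work units.
broker = Broker()
for i in range(5):
    broker.submit(i, square, i)

# Volunteer role: keep taking work until the broker has none left.
while volunteer_step(broker):
    pass

# Client role again: collect the results.
results = broker.collect()
```

Every work unit and every result passes through the single `Broker` object, which is the property that makes this arrangement simple to reason about and also what limits how far it can scale.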

A common feature of most existing cycle‐stealing systems is that brokerage is

performed by a centralised body. Centralised brokerage is the obvious solution

since brokerage requires knowledge of how many volunteers and clients are

using the system and how much work is currently available. However, for most

systems distributing work requires considerable resources, especially when

applications are creating many small work packages. Centralised brokers in

these systems often present a bottleneck which prevents systems from scaling

effectively.
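A rough back‐of‐the‐envelope model shows why small work packages are a problem for a central broker. The model is an assumption made here for illustration, not a measurement from this thesis: it supposes the broker spends a fixed time handling each package (dispatching it and receiving its result) and can handle only one package at a time. Under that assumption a single broker can keep at most compute time divided by handling time volunteers busy. The function names and the timings are hypothetical.

```python
def max_busy_volunteers(compute_ms, dispatch_ms):
    """Upper bound on the volunteers one serial broker can keep busy.

    compute_ms:  time a volunteer spends computing one work package
    dispatch_ms: time the broker spends handling one work package
    """
    if compute_ms <= 0 or dispatch_ms <= 0:
        raise ValueError("times must be positive")
    return compute_ms / dispatch_ms


def ideal_speedup(volunteers, compute_ms, dispatch_ms):
    """Best-case speedup: limited by the volunteers or by the broker."""
    return min(volunteers, max_busy_volunteers(compute_ms, dispatch_ms))


# 1000 volunteers, broker needs 10 ms per package (illustrative numbers).
coarse = ideal_speedup(1000, compute_ms=60_000, dispatch_ms=10)  # one-minute packages
fine = ideal_speedup(1000, compute_ms=500, dispatch_ms=10)       # half-second packages
print(coarse, fine)  # prints "1000 50.0"
```

With these illustrative figures, one‐minute packages leave the volunteers as the limiting factor, whereas half‐second packages cap the speedup at 50 no matter how many volunteers join.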

The usual approach to solving the scaling problem is to separate the process of

distributing work from the process of connecting volunteers with clients. To do

this, volunteers contact a central body which redirects them to a client which

has work. The client (or a machine administered by the client) is responsible

for distributing the actual work to the volunteers. This approach places a heavy

burden on the client, as they must supply a machine capable of handling the


work administration. Additionally, this separation makes it difficult to keep a

fair balance of volunteers amongst the various clients.

Another potential solution to the scalability issue is to decentralise the brokerage operation. As stated previously, decentralised systems typically scale very well, but are considerably more difficult to design, especially for operations that rely on information about the entire network. Since brokerage relies on knowledge of what volunteers are available and what work is required, decentralised brokerage presents a difficult problem, but offers the ability to create highly scalable cycle‐stealing systems without burdening clients with brokerage responsibilities.

1.3 Decentralised Cycle‐Stealing

A fully decentralised cycle‐stealing framework has the potential to offer additional benefits over traditional designs. As well as their scalability benefits, decentralised networks by their very nature require direct communication links between their nodes. Decentralised cycle‐stealing systems should therefore be able to make use of these direct links to provide efficient communication channels between work units. Communication on centralised cycle‐stealing systems has previously been very limited and has often added to the load on the central server by relying on it to provide message delivery and robustness.

However, decentralised cycle‐stealing presents a number of challenges. These

primarily occur because there is no centralised body to coordinate the system.

The biggest challenges include:

• how to distribute and balance work amongst the group of volunteers,

• how to deliver results back to the clients, and

• how to guarantee work completion despite constant node arrivals and

departures.

This thesis describes how these challenges can be met by a decentralised P2P network. In addition to these basic problems, I will also address how a decentralised network can help extend the boundaries of cycle stealing by supplying direct communication with adequate robustness guarantees. The work presented here is, to the best of my knowledge, the first fully decentralised cycle‐stealing model to address general purpose distributed computing.

1.4 Contributions

The major contributions of this thesis are:

• a design for a fully decentralised, general purpose, cycle stealing framework,

• a programming model suitable for developing distributed object applications on a decentralised P2P cycle stealing system including direct object‐to‐object communication,

• a fully decentralised fault tolerance system which handles the highly dynamic nature of P2P networks. The system is tuneable to provide greater efficiency on networks which are less volatile, and

• methods for improving object locality on distributed hash table (DHT) overlay networks. These optimisations provide considerable performance benefits for applications using inter‐object communication.

The work presented in this thesis has resulted in three publications(5; 6; 7).

Chapter 2 gives an overview of related work in the areas of peer‐to‐peer computing and cycle‐stealing.

Chapter 3 outlines the design of a decentralised cycle stealing framework. This includes details on how cycle‐stealing brokerage can be solved on a decentralised system. Also addressed are the programming model used by application developers and details on how inter‐object communication is achieved.

Chapter 4 describes a fault tolerance system which ensures the correct execution of applications on the framework. Fault tolerance of this form has not been required for previous pure P2P networks, but is essential for cycle‐stealing. The fault tolerance system developed is tuneable to allow for different reliability levels depending on the type of application being used and the reliability of the physical network and nodes that the framework is being hosted on.

Chapter 5 describes how object locality can be improved to increase both

communication and overall application performance. Some of this work is gen‐

eral in nature and can be adjusted for use by other DHT based applications.

This locality work has previously been unexplored in cycle‐stealing frame‐

works as they have not provided the communication mechanisms that make it

necessary.

In chapter 6 I evaluate the work presented in the previous chapters. This

evaluation consists of developing applications on a prototype implementation

of a decentralised cycle stealing system. Performance tests are run on these

applications testing the efficacy of the fault tolerance system and optimisations

presented in chapters 4 and 5.

I conclude in chapter 7 and present avenues for further development of this

work.

2 Related Work

This thesis extends two distinct areas – pure P2P computing and cycle‐stealing.

Previous work in pure P2P computing has concentrated on file sharing applica‐

tions and on generic pure P2P overlay networks suitable for use in a variety of

applications. File sharing applications have received the bulk of development

due to their popularity in large scale distribution of files across the Internet. In

particular, the fully decentralised nature of pure P2P networks is attractive in

distributing copyrighted files as it makes it more difficult for copyright holders

to identify and prosecute specific individuals who are providing illegal files or

rendezvous services to many subscribers.

However, there are a number of non‐file sharing applications which have been

developed using the pure P2P model. These applications benefit from the in‐

creased scalability and lower costs that the fully decentralised networks pro‐

vide. Despite the success of decentralised applications in overcoming these

problems there has been little investigation into cycle‐stealing on fully decen‐

tralised platforms.

In this chapter I will explore the existing work in both decentralised network‐

ing and cycle‐stealing. Within the decentralised networking area I will concen‐

trate on how the applications and network have been implemented and how

those choices affect the properties of the network such as scalability and ro‐

bustness. In the cycle‐stealing projects I will concentrate on what features each

project provides, particularly to the developers of applications on those

projects.

2.1 Decentralised Networking

The first popular pure P2P system was the Gnutella(8) network. The original

Gnutella network provides the facility to share users’ files across an unstruc‐

tured decentralised network. Each node connects to a set of neighbours in an

arbitrary manner. A search for files is initiated on a specific node. This node

sends a search request to all of its immediate neighbours. These neighbours

then pass this message on to their neighbours and execute the search on themselves. Each message has a time‐to‐live (TTL) attached to it, which is decremented as it passes through each node until it reaches zero and the search is

terminated. This style of messaging is commonly referred to as query flooding.
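The flooding behaviour described above can be illustrated with a minimal Python sketch. The graph representation, the function name `flood_search` and the set used to drop duplicate copies of a query are assumptions made for illustration; they are not taken from the Gnutella specification.

```python
from collections import deque


def flood_search(neighbours, files, start, wanted, ttl):
    """Flood a query from `start`; return (nodes holding `wanted`, messages sent).

    neighbours: dict mapping each node to a list of its neighbours
    files:      dict mapping a node to the set of file names it shares
    ttl:        number of hops the query may travel from `start`
    """
    hits = set()
    seen = {start}            # nodes that have already handled this query
    messages = 0
    queue = deque([(start, ttl)])
    while queue:
        node, remaining = queue.popleft()
        if remaining == 0:    # TTL exhausted: search locally only, do not forward
            continue
        for peer in neighbours[node]:
            messages += 1     # every forwarded copy costs one message
            if peer in seen:  # a duplicate copy is dropped on arrival
                continue
            seen.add(peer)
            if wanted in files.get(peer, ()):
                hits.add(peer)
            queue.append((peer, remaining - 1))
    return hits, messages
```

On a four‐node chain a, b, c, d, a query issued at a with a TTL of 2 never reaches a file held on d, even though the file is available on the network; a TTL of 3 finds it at the cost of additional messages.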

The primary goal of Gnutella was to provide a file‐sharing utility which could

not be terminated by switching off a single server machine. Earlier centralised

networks, such as Napster, could be disabled by simply removing a small set of

machines, whereas decentralised networks are not reliant on any single node.

While Gnutella achieved this goal, its inefficient routing protocol caused signif‐

icant problems(9). The most notable of these was that in larger Gnutella net‐

works, searches often didn’t find any results despite matching files being avail‐

able on the network. Actual file transfers were also significantly slower due to

the high overhead of the query protocol.

Hybrid systems were quickly developed to address Gnutella’s scalability issues.

The most prominent of these were the “ultrapeer” extensions to Gnutella, the

commercial FastTrack network and the proposed Gnutella2(10) network.

These systems build on the basic Gnutella approach by acknowledging that dif‐

ferent nodes have different bandwidth resources. High bandwidth nodes can

be promoted to supernode status and are responsible for handling search re‐

quests for a group of leaf nodes. This significantly decreases the amount of traf‐

fic generated by the network whilst simultaneously improving the quality of

search results(11); however, it still does not guarantee that an item will be discovered when searched for.

Other hybrid systems have separated the discovery protocols from the actual

transfer protocols. The extremely effective and popular BitTorrent network

does not include capabilities for discovering files. File references are ex‐

changed through standard web sites usually discovered using ordinary web

search engines. Once a reference is found it is submitted to a BitTorrent client

which proceeds to download the file. The use of ordinary web searching for

discovering files results in extremely low overhead during the transfer portion

since the peer is not burdened with search queries. To perform the actual file

transfer BitTorrent clients connect to one or more peers and download different blocks in parallel. By using multiple sources the file transfer speed is increased significantly. The BitTorrent protocol also includes algorithms which

automatically choose which portions of the file to transfer first. The goal of

these algorithms is to maximise the number of times the file is replicated. This

file replication makes it less likely that part of a file becomes unavailable when

any single peer leaves the network.

There has been significant research work aimed at developing P2P networks

which could guarantee discovery of data whilst still maintaining scalability.

The most prominent approach used is the distributed hash table (DHT). Like

standard hash tables, distributed hash tables store data using an associated

key. However, in a DHT, the actual data is stored on one of the nodes within a

decentralised network. The specific node used for storing the data is chosen by

providing the key to some routing algorithm. The key can therefore be used to

retrieve the data efficiently, even on very large networks.

A number of P2P DHT projects were developed independently and released in

a relatively short period of time. These projects provide similar external inter‐

faces but differ in their internal representation. These internal differences re‐

sult in different memory requirements and routing performance.

2.1.1 Chord

The Chord project(12) from MIT provides a lookup service which resolves all

lookups in O(log N) messages where N represents the maximum number of

nodes the network can accommodate. Each node within a Chord network is as‐

signed an n‐bit identifier generated by passing some unique descriptor of the

node, such as an IP address, through a cryptographic hash function such as

SHA‐1. The value of n dictates the maximum size of the network (2^n).

Items to be stored in the network are given a key using the same cryptographic

hash function. Items are then stored on the node whose key is numerically

closest to the item’s key. The pseudo‐random properties of the hash function

provide a load balancing effect, ensuring that each node receives the same

number of keys on average.
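The identifier assignment and item placement described above can be sketched in a few lines of Python. The function names are hypothetical, and the placement rule follows the description given here (numerically closest identifier on the circular space); the published Chord protocol assigns each key to its successor on the ring, which would change only `responsible_node`.

```python
import hashlib


def identifier(descriptor: str, n_bits: int) -> int:
    """Hash a descriptor (e.g. an IP address) with SHA-1 and keep the top n_bits."""
    digest = hashlib.sha1(descriptor.encode("utf-8")).digest()  # 160-bit digest
    return int.from_bytes(digest, "big") >> (160 - n_bits)


def ring_distance(a: int, b: int, n_bits: int) -> int:
    """Shortest distance between two identifiers on a circular n-bit space."""
    size = 1 << n_bits
    d = abs(a - b) % size
    return min(d, size - d)


def responsible_node(key_id: int, node_ids, n_bits: int) -> int:
    """Identifier of the node numerically closest to key_id (ties go to the lower ID)."""
    return min(node_ids, key=lambda nid: (ring_distance(nid, key_id, n_bits), nid))
```

Because the same hash function is applied to node descriptors and item keys, both are spread pseudo‐randomly over the same 2^n identifier space, which is what produces the load balancing effect.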

The Chord routing mechanism requires nodes to maintain information about

O(log N) other nodes. At each node, messages are forwarded to a node that is numerically closer to the destination address. Although this could be achieved simply by maintaining the node’s immediate neighbours,

Chord defines a routing table called the finger table which can be used to accel‐

erate the process by making larger jumps around the circular Chord identifier

space.

2.1.2 Content‐Addressable Network

The Content‐Addressable Network (CAN)(13) uses a d dimensional address

space. Each node in a CAN network is assigned a zone within this space which

it is responsible for. Applications submit key‐value pairs which will be stored

on the network for later retrieval. Each key‐value pair is assigned a point

within the address space and is hosted by the node whose zone covers that

point. As nodes join/leave the network the zones of responsibility of other

nodes are adjusted to ensure full coverage.

Routing within CAN uses an O(d) sized routing table which, unlike Chord’s, does not grow with the size of the network. Routing is

performed by passing messages to the immediate neighbour whose zone is

closer to the target. Because of the layout of the CAN node space, the routing

method delivers messages in O(dN^(1/d)) hops.

Both Chord and CAN allow the size of the routing table to be traded off against

the efficiency of the routing scheme. In practice the network variables such as

the size of Chord’s finger table or the number of dimensions in CAN can be set

appropriately for the expected size of the network.
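The trade‐off can be made concrete by evaluating the two asymptotic expressions for a given network size. The sketch below is only an order‐of‐growth comparison: the constants hidden by the O‐notation are dropped, so the numbers indicate trends rather than actual hop counts, and the function names are hypothetical.

```python
import math


def chord_hops(n_nodes: int) -> float:
    """Growth term for Chord routing: log2(N)."""
    return math.log2(n_nodes)


def can_hops(n_nodes: int, d: int) -> float:
    """Growth term for CAN routing with d dimensions: d * N^(1/d)."""
    return d * n_nodes ** (1.0 / d)
```

For a network of 2^20 nodes the expressions give 20 for Chord, 2048 for CAN with d=2 and approximately 40 for CAN with d=10, showing how increasing the number of dimensions (and therefore the routing table size) closes the gap.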

Plaxton, Rajaraman and Richa(14) developed the basis of two decentralised

P2P projects which are similar to both Chord and CAN. The projects,

Pastry(15) and Tapestry(16), combine O(log N) routing schemes with knowl‐

edge of the physical relationships between nodes to further minimise the la‐

tency when sending messages. Typical knowledge used includes network hops

or ping time. Both projects extend the work of Plaxton, et al, by allowing the

network to be self‐organising, that is, when nodes wish to join or leave, the

network will automatically adjust to ensure correctness; Plaxton’s work re‐

quired the network to be static. Both projects, although created separately, are

quite similar. Since the Pastry network is used as the basis for the decentral‐

ised cycle‐stealing system presented in chapter 3 it will now be analysed in de‐

tail.

2.1.3 Pastry

Like Chord, Pastry uses an n‐bit NodeID to identify individual nodes. This ID is

analogous to an IP address in IP routing. Messages may be sent to any of the 2^n

possible NodeIDs. Unlike IP routing, if a message is sent to an address which is

not currently inhabited by a node the message delivery does not fail. Instead

the message is redirected to the node with the numerically closest address. Pa‐

stry’s routing mechanism guarantees a message will be delivered to the correct

node despite concurrent node failures unless a large number of nodes with ad‐

jacent NodeIDs all fail simultaneously. The specific number of nodes that must

fail is a configuration setting normally set to 8 or 16.

Routing State

For the purposes of routing, Pastry NodeIDs are split into a series of b‐bit digits. For example, a 128‐bit NodeID can be expressed as a series of eight 16‐bit digits.

Each node maintains three sets of data used to perform routing – the leaf set,

the routing table and the neighbourhood set.

A node’s leaf set contains the nodes whose IDs are numerically closest. A node

must keep regular contact with its leaf set to detect if one of these nodes leaves

the network. Departing nodes must be replaced in the leaf set to ensure that it

is always fully populated, assuming that there are sufficient nodes on the net‐

work to do so. The size of the leaf set is configurable and is directly related to

the stability of the network. Message delivery is guaranteed in a Pastry net‐

work as long as no leaf set becomes invalid. This can only occur if a set of adja‐

cent nodes equal to half a leaf set fail at essentially the same time. Typically leaf

sets are set to contain 16 or 32 members.

The neighbourhood set contains the nodes which are physically closest. While

it is not used directly in routing, the neighbourhood set is essential in main‐

taining the locality properties of the network.

The routing table is the primary source of routing information. The routing table contains ⌈log_{2^b} N⌉ rows with 2^b − 1 entries each. The nth row of the table contains a set of nodes whose NodeIDs share the first n digits with the present node. Each column of the table represents one of the 2^b possible digits. The

n+1th digit of each entry corresponds to that column’s digit. Figure 2‐1 shows a

sample routing table with the n+1th digit highlighted in each cell. Note that the

routing table may have empty entries where there is no suitable node to fill the

cell.

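Routing table for node 2132 (rows: number of leading digits shared with 2132; columns: next digit)

Row   0      1      2      3
0     0312   1012   *      3010
1     2022   *      2213   2301
2     2101   -      2120   *
3     2130   -      *      -

* = column of the node's own digit (no entry needed); - = empty entry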

FIGURE 2‐1 – PASTRY ROUTING TABLE (8‐bit NodeID, b=2)
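The position of an entry in the routing table follows directly from the digits it shares with the owning node. The sketch below assumes NodeIDs are written as equal‐length strings of base 2^b digits (base 4 for b=2, as in Figure 2‑1) and numbers rows from zero; the function names are hypothetical.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits two NodeIDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def table_cell(owner: str, other: str, b: int = 2):
    """Return (row, column) at which `other` belongs in `owner`'s routing table."""
    row = shared_prefix_len(owner, other)
    if row == len(owner):
        raise ValueError("a node does not store itself in its routing table")
    column = int(other[row], 2 ** b)   # the first digit that differs picks the column
    return row, column


def build_routing_table(owner: str, known, b: int = 2):
    """Fill a routing table from a list of known NodeIDs; unused cells stay None."""
    table = [[None] * (2 ** b) for _ in range(len(owner))]
    for other in known:
        if other == owner:
            continue
        row, column = table_cell(owner, other, b)
        if table[row][column] is None:   # keep the first candidate for each cell
            table[row][column] = other
    return table
```

Applied to node 2132, the entry 2213 shares one leading digit with the owner and its next digit is 2, so it is placed in row 1, column 2.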

Routing Protocol

Pastry messages are sent in a series of hops between nodes. The routing proto‐

col ensures that each individual hop sends the message at least one node closer

to its target. Routing ceases and the message is delivered when there are no

closer nodes to send the message to.

Message hops are selected from two sources – the routing table and the leaf set.

Nodes use the following method in selecting how to forward a message:

1. First the node checks whether the target ID is within the range of its leaf set. If it is, the node selects the appropriate node from its leaf set and forwards the message to its final destination.

2. If the ID is outside of the leaf set’s range the routing table is used. To find the next target a node simply calculates how many digits, n, it has in common with the message’s target. It then looks up the n+1th row of the routing table and finds the entry corresponding to the next digit in the message’s target. Using this method the message reaches its target in l steps, where l is the number of digits in the NodeID, assuming the routing tables have sufficient entries.

3. In the rare case that an appropriate node cannot be found in the table the node uses the nth row to select a node which is closer to the target. Since it has already been established that the target is not in the node’s leaf set there should be a suitable node in the routing table.

4. If a suitable node cannot be found in the routing table then the leaf set can be used as a fallback. This fallback position provides the worst case routing situation of O(N).

Figure 2‑2 demonstrates how a message is quickly routed across the network without any single node requiring global knowledge of the network. Pastry’s small routing state and efficient routing allow very large networks to be created and used without suffering from significant performance penalties.

FIGURE 2‑2 – PASTRY ROUTING FROM ID:2000 ‐> ID:0301
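The forwarding decision made at each hop can be sketched as a single function. This is a simplified illustration under several assumptions rather than the Pastry implementation: NodeIDs are equal‐length strings of base 2^b digits, table rows are numbered from zero (so row n holds nodes sharing n leading digits), the wrap‐around of the circular ID space is ignored, and the rare case is interpreted as searching the same row for any entry numerically closer to the target. All names are hypothetical.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits two NodeIDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(self_id, target, leaf_set, table, b=2):
    """Return the NodeID to forward to, or None if this node is the destination."""
    base = 2 ** b

    def value(node):
        return int(node, base)

    def dist(node):
        return abs(value(node) - value(target))

    # Leaf set check: if the target lies within the leaf set's range,
    # hand the message to the numerically closest member (possibly ourselves).
    members = list(leaf_set) + [self_id]
    values = [value(m) for m in members]
    if min(values) <= value(target) <= max(values):
        best = min(members, key=dist)
        return None if best == self_id else best

    # Routing table: row = shared digits, column = next digit of the target.
    n = shared_prefix_len(self_id, target)
    entry = table[n][int(target[n], base)]
    if entry is not None:
        return entry

    # Rare case: the cell is empty, so take any entry in that row that is closer.
    closer = [e for e in table[n] if e is not None and dist(e) < dist(self_id)]
    if closer:
        return min(closer, key=dist)

    # Fallback: use the leaf set, or deliver locally if nothing is closer.
    closer = [m for m in leaf_set if dist(m) < dist(self_id)]
    return min(closer, key=dist) if closer else None
```

Each successful routing table lookup extends the matched prefix by at least one digit, which is why the number of hops is bounded by the number of digits in the NodeID when the tables are sufficiently populated.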

Joining Protocol

To join a Pastry network a node requires a NodeID and the address of one oth‐

er node already attached to the network. Nodes generate their own NodeIDs

independently. NodeIDs can be created by passing the node’s network address

through a cryptographic hash function or by generating a random ID. It is im‐

portant that the NodeIDs are generated with a uniform distribution across the

entire NodeID address space. This ensures that each node is responsible for

approximately the same range of addresses. The discovery of a node to connect

through is outside the scope of Pastry. It is usually done through a rendezvous

server or by using an expanding ring search.

Once an existing node is discovered, a special “join” message is routed, via this

node, towards the new NodeID, eventually arriving at the node whose ID is

closest. Each node that receives a join message replies with a sample of their

routing state. These replies are used to initialise the new node’s state. Three

different classes of nodes are encountered during the routing process and each

replies with a different type of routing information:

• All nodes which receive a join message provide any appropriate entries

from their routing table to the new node. As the message gets closer to

its destination more of the routing table will be relevant to the joining

node because they will have more NodeID digits in common.

• The last node, that is the node closest to the new NodeID, additionally

provides the new node’s leaf set. This is simply a copy of its own leaf set.

Once the node has joined the last node’s leaf set will be adjusted to in‐

clude the new node.

• The first node contacted provides the new node’s neighbourhood set. It

is assumed that when discovering a node to connect to a Pastry network

a node will be selected which is physically close to the joining node. This

implies that the first node and the joining node’s neighbourhood set will

intersect considerably. As stated earlier, even if this neighbourhood set

is not particularly accurate it will not affect the validity of the routing

protocol, though it may affect its performance.

Once the node has received all of these replies and has initialised its routing

state it can fully participate in routing. At this stage it contacts all of the mem‐

bers of its leaf set who will update their information to include the new node.

If, during joining, it is found that the NodeID selected is already in use the

nearest free ID is selected and a special reply is returned indicating the change

to the joining node. The new node must then contact all of the nodes currently

aware of its presence and provide them with the updated value. If two nodes

attempt to join at the same time the Pastry routing mechanism will cause their joins to be handled sequentially by the same node, preventing a potential

race condition.

Maintenance

Integrity of the routing state in a Pastry network is essential for the correct delivery of messages. The leaf set is the most important aspect of the routing state; in fact, provided the leaf set is correctly maintained, message delivery will be correct, albeit potentially slow. Leaf set nodes must therefore keep in

regular contact. All members of a leaf set will periodically exchange messages

to monitor their health. Nodes that have left the network for any reason are

discovered through this mechanism.

When a node is discovered missing, its leaf set will request information from

other members of the set so they can fill the missing node’s position. Provided

that there are at least two members of the missing node’s leaf set remaining

who know each other’s addresses, the network will be able to recover. This

means that unless m/2 adjacent nodes (where m represents the number of

nodes in a leaf set) disappear from the network at the same time there will be

no long term effect on the network.

It follows that the reliability of the network increases with the size of its leaf set. It is also for this reason that physically related Pastry nodes should be dispersed amongst the entire Pastry address space so

that it is less likely that a loss of an entire set of related machines (such as a

university lab) due to power loss or network failure will result in breakdown

of the Pastry network.

2.2 Cycle Stealing

Cycle‐stealing is fundamentally the attempt to harness the spare computing

cycles from standard desktop machines. There are two important aspects of

this definition which separate cycle‐stealing from other parallel distributed

computing disciplines. The first is that cycle‐stealing uses the “spare” cycles.

Volunteer machines are expected to primarily be used for another purpose but

offer some of their resources to the cycle‐stealing network. This contrasts with

cluster computing, where a set of machines is permanently dedicated to being linked and participating in parallel computing endeavours.

The second defining aspect is that cycle‐stealing is targeted at “standard desk‐

top machines”. This aspect contrasts with Grid computing which is primarily

focused on managing connections between large computing resources; though

some of those resources may be collections of desktop machines. Under some

definitions cycle‐stealing can be classified as a subset of computational grids; however, these definitions still highlight the fact that cycle‐stealing is focused

on non‐dedicated machines.

The feasibility of cycle‐stealing was proven on a large scale by the highly suc‐

cessful SETI@home(17) and distributed.net(18) projects. These projects each

solve a particular problem using a client‐server based master‐worker style. Vo‐

lunteers contact a central server and retrieve a job which they then process in

their idle time. Once the job’s results are calculated they are returned to the

server and a new job is retrieved.

Several projects have since been created which offer generic platforms for

writing cycle‐stealing applications. These projects aim to simplify cycle‐

stealing application development by handling the details of communication,

job allocation and retrieval of results. By handling these common cycle‐stealing

features, the frameworks free the application developers so they can concen‐

trate on their particular problem.

The Butler system(19), developed at Carnegie‐Mellon University, was one of

the early attempts at utilising idle workstations for useful computation. The

goal of Butler was to allow users to execute jobs on otherwise idle worksta‐

tions without requiring modifications to the operating system or applications.

However, the restrictions required the system’s features to stay similarly sim‐

ple with no form of process migration or re‐execution. The system did not ad‐

dress running single applications in parallel across multiple workstations as

more recent cycle stealing systems do.

When a user returned to a workstation that was being used by Butler a 30 sec‐

ond warning was given to the remote user before the workstation was re‐

claimed by killing any remote processes. This reclamation process was re‐

ported to be one of the most annoying features of the Butler system especially

when users were executing interactive programs remotely.

Condor’s(4) basic premise is similar to Butler’s; however, it effectively addresses

the concerns of users by introducing job checkpointing and re‐execution. Con‐

dor includes a checkpointing facility which is used to take snapshots of jobs as

they are running. These checkpoints are used to relocate a job when its host

workstation is reclaimed or to store jobs indefinitely if there are no idle work‐

stations. Many extensions to Condor have also been developed including sup‐

port for master‐worker parallel applications communicating through PVM(20)

and using Condor on wide area networks(21).

The Piranha project(22) at Yale University is responsible for applying the

term “adaptive parallelism” to networks of workstations. Piranha recognised

the need for systems to allow workstations to come and go from a computation

as their users needed them as opposed to cluster style systems such as Beo‐

wulf(23) and Berkeley‐NOW(24). Piranha applied adaptive parallelism to the

Linda coordination language to provide a cycle‐stealing platform for master‐

worker style applications.

One goal of the Piranha project was to allow an application to gracefully de‐

grade as the number of participating workstations decreases. In particular the

degenerate “non‐parallel” case should have almost no overhead to promote the

development of all intensive applications under the Piranha model. Piranha

provides no form of task migration, though the application programmer may

explicitly checkpoint tasks using the tuplespace. The original Piranha implementation was written in C; however, a heterogeneous version was developed using

the cross‐platform features of Java.

The Charlotte(25) project is an Internet based cycle‐stealing system developed

using the Java platform. Java is a common platform for many Internet based

projects because of the importance of security and heterogeneity for Internet

volunteers. Charlotte applications consist of alternating sequential and parallel

steps. During a parallel step, application routines are distributed amongst the

set of volunteers. Like most systems, Charlotte doesn’t support communication

amongst these routines, but a later extension, Knitting Factory(26), supplied

this.

In Knitting Factory, Java RMI references are passed, via the server, to the vo‐

lunteers. This allows volunteers to communicate directly in a P2P manner, but

only while those specific volunteers remain available. This made the Knitting Factory communication system somewhat fragile as RMI references could become invalid when volunteers left the system. Most other frameworks offer no

direct communication between volunteers, instead opting to route communica‐

tion through the server.

In 1997 the University of California, Santa Barbara (UCSB) proposed the devel‐

opment of an internet based grid‐like system, SuperWeb(27). This began a se‐

ries of cycle‐stealing projects with slightly different focuses. The SuperWeb

proposal outlined three participants which have persisted through the subse‐

quent projects: brokers, clients and hosts. Brokers collect and monitor the re‐

sources in the SuperWeb, clients utilise the resources by distributing tasks and

hosts volunteer their resources. The original proposal outlined a number of

resources to be supported by the system including computing cycles, data

storage and economic credits. It also discussed the need for a trust model to

guarantee the correctness of applications executed on the system.

The Javelin(28) project, the first of the SuperWeb projects, uses a centralised

broker to facilitate discovery of volunteer machines by clients. This centralised

broker was unable to scale sufficiently so a network of brokers was developed

for Javelin++ (29). Javelin++ supports a network of brokers which share the

burden of tracking volunteers. Additionally, if the load becomes too large, vo‐

lunteers may be promoted to act as additional brokers. To be eligible for pro‐

motion, volunteers must meet three conditions: having a “permanent” internet

connection (i.e. not a modem connection), being connected to the system for a

“long” duration and providing “ample warning” before withdrawing.

The Computation eXchange (CX) project combined the system developed in Javelin

with the communication ideas developed in Linda(30). Like Piranha, CX uses

tuples to store arguments for jobs. This provides a far simpler interface for

application programmers when compared to Javelin; however, it limits the class

of applications that can be executed on the system. Like Javelin++, CX contains

a set of brokers termed Task Servers. Each Task Server keeps its own volun‐

teers but maintains links to the other Task Servers to provide backup in case of

failure.

The Berkeley Open Infrastructure for Network Computing (BOINC) is a gener‐

alisation of the SETI@home project (31). BOINC provides a general platform

for cycle stealing using a client server approach. BOINC avoids the scalability

issues encountered by client server frameworks by concentrating on long run‐

ning jobs. BOINC jobs are expected to take many hours of processing. This lim‐

its the load on the server allowing it to handle many volunteers simultaneously.

BOINC is currently running multiple real world applications with almost a mil‐

lion active volunteers.

Cycle‐stealing on fully decentralised networks has not been explored as com‐

prehensively as client‐server based networks. Those few projects which have

addressed the area have concentrated on distinct aspects. In this section three

fully decentralised systems are examined – the DREAM project(32), the Java

based structured system developed by Butt, Fang, Hu and Midkiff(33) at Pur‐

due University and the unstructured network developed by Awan, Ferreira,

Jagannathan and Grama(34).

2.2.1 DREAM

The Distributed Resource Evolutionary Algorithm Machine (DREAM) project

aims to provide a large fully decentralised P2P network which can be used for

distributed computing. However, DREAM networks are not suitable for general

purpose cycle‐stealing. Development of the DREAM project has been guided by

evolutionary computing, and whilst it is not limited solely to evolutionary applications, it still has a rather limited range of applications. Suitable applications must have the following characteristics(35):

• Be massively parallelizable

• Have little communication between subprocesses

• Have large resource requirements

• Be robust – the success of the application does not depend on the suc‐

cess of any given subprocess

The last characteristic places a severe restriction on what applications are

suitable but does allow the DREAM designers to simplify their design consid‐

erably.

DREAM consists of a number of layers which aim to ease the development of

evolutionary applications (see Figure 2‐3). The lowest evolutionary computing

layer is the JEO (Java Evolutionary Object) library, which is a low level Java li‐

brary providing interfaces for evolutionary algorithm components such as isl‐

ands, individuals, operators and evaluators along with standard implementa‐

tions of these components for ease of development. The JEO uses the distri‐

buted resource machine (DRM) to provide distributed parallel execution of

evolutionary applications, but also supports sequential execution without the

DRM. The DRM is the actual P2P network layer. The higher layers of DREAM,

the EASEA and GUIDE, will be explored before examining the DRM in detail.

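Figure 2‑3 – DREAM Architecture (layers shown: GUIDE, EASEA, JEO, Evolutionary Application Libraries, DRM; user classes shown: Non‐programmers, Intermediate Users, Advanced Users)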

The EASEA (Easy Specification of Evolutionary Algorithms) layer allows evolu‐

tionary programs to be expressed in human readable language. This language

provides a method of simply expressing evolutionary programs without tying

them to a specific platform. The language can be compiled into Java classes

which use the JEO, or to other forms such as C++ source.

The GUIDE (Graphical User Interface for DREAM Experiments) layer provides

the simplest method of generating DREAM projects. It provides a graphical en‐

vironment where evolutionary problems can be expressed through point and

click methods by non‐expert programmers. GUIDE projects are compiled into

the EASEA language before final transition into JEO classes suitable for use on

the DRM.

The DRM layer represents the actual distributed processing network. The DRM

consists of a set of volunteer machines (termed nodes) hosting a number of ex‐

ecution agents (termed islands). The network is fully decentralised but, unlike

DHT networks such as Pastry, does not have any structure. Each node keeps a

list of other nodes in the network. Periodically nodes will exchange lists to

learn about other active nodes. To limit the amount of memory required on

each node, these lists may be truncated. To join a network a node simply con‐

tacts any other node, provides its address and receives a set of nodes from its

contact. The new node’s address is disseminated across the network through

the periodic list exchanges. The reliability and effectiveness of this approach is

discussed by Jelasity et al (35).
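The periodic list exchange can be sketched as follows. The function names, the use of Python sets and the random truncation policy are assumptions made for illustration; the description above does not state which entries are discarded when a list is truncated.

```python
import random


def truncate(view, max_size, rng):
    """Keep at most max_size addresses, chosen at random (assumed policy)."""
    items = sorted(view)              # sort first so a seeded rng is reproducible
    if len(items) <= max_size:
        return set(items)
    return set(rng.sample(items, max_size))


def exchange(view_a, view_b, addr_a, addr_b, max_size, rng):
    """Two nodes swap their lists; each learns the other's address and contacts."""
    merged_a = (set(view_a) | set(view_b) | {addr_b}) - {addr_a}
    merged_b = (set(view_a) | set(view_b) | {addr_a}) - {addr_b}
    return truncate(merged_a, max_size, rng), truncate(merged_b, max_size, rng)
```

Joining is the special case in which the new node starts with an empty list: after one exchange it knows its contact and the contact's list, and the contact knows the new node's address, which later exchanges spread further.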

The actual problem is solved by the DRM islands. Applications are started by

creating an island on any node. Each island has a number of tasks which it executes sequentially until each task is completed. When the island’s host exchanges addresses with neighbouring nodes, the island can check if the neighbour is currently hosting an island. If not, the island can initiate execution on that

node by splitting its set of tasks and creating a new island on the neighbouring

node. Like new nodes joining the network, new applications are disseminated

across the network through this procedure which is termed an epidemic proto

col.

DREAM islands cannot communicate with each other after they are started. Islands are also incapable of migrating when a node leaves the network; in fact, any work allocated to such an island is lost. This severely limits the type of application which can be implemented using DREAM, since applications must be able to tolerate the loss of any individual work item. Because of this limitation DREAM cannot be considered a general purpose cycle‐stealing system.

2.2.2 Butt et al

Butt et al(33) present a structured P2P network for sharing computing cycles.

The system uses a Pastry network to coordinate meetings between resource

consumers and providers. The project has a heavy emphasis on the economy of

the computing cycles, but nonetheless provides a simple fully decentralised

cycle‐stealing system.

Each node in the Pastry network is a resource consumer, resource provider or

both. To run applications, resource consumers query the network to find a

suitable provider. Suitable nodes are selected based on their credit information.

Once a node is selected all further communication is performed using direct

connections. The system does not supply any extra support for parallel appli‐

cations such as fault tolerance or communication. If a provider leaves the sys‐

tem while hosting an application, the consumer must renegotiate a new host

and restart the application.

The credit system is the most innovative aspect of the project. Applications are

compiled with additional beacon code added. These beacons report the

progress of the application as it is running to a separate reporting module. The

consumer can query the reporting module to get feedback on the provider’s

progress. If the provider is making progress then the consumer will transfer

credits to the provider. If the consumer does not supply credits then the pro‐

vider is free to stop executing the application. This simple approach is designed

to work like a real economy by minimising the effect of fraud rather than pre‐

venting it entirely.

Apart from the credit system, the most interesting aspect of the system is the

process for discovering resource providers. Butt et al have combined the in‐

formation dispersion style of projects like DREAM with a structured Pastry

network. Periodically each node passes its resource availability and characte‐

ristics to the nodes in its routing table. These messages are forwarded on in a

broadcast fashion until their specified TTL is reached. Nodes cache this infor‐

mation to allow for prompt response to requests for providers.

While this system provides a fully decentralised system, it provides very few

services for parallel distributed algorithms. Application programmers are re‐

sponsible for load balancing, providing fault tolerance and any direct commu‐

nication mechanisms that are needed.

2.2.3 Awan et al

Awan et al(34) present a contrasting system which uses an unstructured pure

P2P network. Like the system designed by Butt et al, Awan et al’s system is de‐

signed for embarrassingly parallel applications, however they have addressed

the issue of node failure through replication.

Job allocation is performed using a random walk algorithm. The job creator will

generate a set of n tasks to be performed. These tasks are grouped into batches

to reduce network communication before allocating to volunteers for computa‐

tion. Each group of tasks is then sent to a randomly selected host. This host is

selected by performing a random walk – each node has a set of other nodes it

knows of. It randomly selects one of these nodes to send the group to along

with a designated TTL value. This node decrements the TTL and forwards onto

another randomly selected node. This continues until the TTL value reaches 0

and the current node is selected as the random host. This random walk algorithm selects nodes with a reasonably uniform distribution, assuming that the network has uniform connectivity (i.e. each node is connected to the same number of other nodes).
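The random walk used to pick a host can be sketched as follows. This is an illustration rather than Awan et al's implementation: the helper names and the test topology (a ring in which every node also links to the nodes three positions away, so that all nodes have the same number of neighbours) are assumptions made for the example.

```python
import random
from collections import Counter


def random_walk(neighbours, start, ttl, rng):
    """Forward from node to node until the TTL reaches 0; return the final node."""
    current = start
    while ttl > 0:
        current = rng.choice(neighbours[current])  # pick a random known node
        ttl -= 1                                   # each hop decrements the TTL
    return current


def ring_with_chords(n):
    # Assumed test topology: every node has exactly four neighbours.
    return {i: [(i - 1) % n, (i + 1) % n, (i - 3) % n, (i + 3) % n]
            for i in range(n)}


def host_counts(n=11, walks=5000, ttl=25, seed=7):
    """Count how often each node is selected as host over many walks from node 0."""
    rng = random.Random(seed)
    graph = ring_with_chords(n)
    return Counter(random_walk(graph, 0, ttl, rng) for _ in range(walks))
```

With uniform connectivity and a sufficiently large TTL the counts returned by `host_counts()` are close to equal for every node, which is the property the job allocation relies on.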

Job groups are replicated to allow for unexpected host departures and to detect fraud. When an application creates and submits jobs it specifies a replication factor. The receiving volunteer decrements this replication factor and forwards a replica of the group to another random node. When a node receives a group it acknowledges receipt to the node that submitted the group to it. This parent node must then periodically monitor its child nodes. If a child node fails then the job is resubmitted to another randomly selected node. Since this process is repeated at every level up to the original job creator, the job is guaranteed to complete provided that the creator itself does not fail.

A simple communication mechanism is provided for sending results back to

the originating node. This communication is built upon a rendezvous service set

(RS‐set). Each node maintains its own independent RS‐set of log nodes

(where N is the size of the entire network). Messages are sent to the RS‐set and

include the ID of the target node for the message. There is a high probability

( ) that at least one node in the sender’s RS‐set has the target in its RS‐set. Any

such node forwards the message on to the target or stores it for when the tar‐

get node requests the data.

Awan et al’s system provides a fully decentralised cycle‐stealing system, however it is quite limited in the types of application it is capable of hosting. Embarrassingly parallel applications have been shown to be attainable using client‐server cycle‐stealing architectures far simpler than the decentralised network demonstrated by Awan et al. There has also been little examination of what benefits a P2P system could offer other than scalability, and the P2P network described adds additional problems, particularly from malicious volunteers, which centralised systems like BOINC avoid.

2.2.4 G2 Classic

G2 Classic (2) provides a cycle‐stealing framework on the Microsoft .NET platform which is simple to use for both application programmers and potential

volunteers. The project’s programming model is designed to allow program‐

mers not familiar with parallel programming to take advantage of cycle‐

stealing by using a well‐known programming pattern. Volunteering to the sys‐

tem requires almost no special configuration or installation on the volunteer

machine.

Programming for G2 Classic is very similar to programming ASP.NET web ser‐

vices. The G2 Classic tools create automatically generated G2 proxies which

allow the application programmer to create tasks by asynchronously calling on

a web‐service like interface. This is a direct analogy to the .NET Web Services

approach. The tasks are submitted by the proxy to a central server which then

distributes the jobs amongst the volunteers.

Writing the actual tasks is identical to writing ASP.NET web services. Custom

tools, or a Visual Studio .NET add‐in, are used to generate the G2 proxy, similar

to how standard web service proxies are created. Machines volunteer to do

work by contacting the server and requesting jobs. Since the volunteer process

can be hosted in a web browser, the entire volunteering process is performed

by simply browsing to a website.

2.2.5 Load Balancing

The issue of balancing the load across volunteers is of particular interest when examining decentralised cycle‐stealing. In centralised systems such as G2 Classic, load can be balanced simply by providing tasks one at a time to each volunteer. As volunteers complete their work they request another task, easily guaranteeing that no single node is overloaded and simultaneously providing more work to more capable nodes, which will complete, and hence request, work more often.
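The effect of handing out one task per request can be seen in a short simulation. This is a generic sketch of the pull model, not G2 Classic's scheduler; the function name and the assumption that each volunteer works at a constant speed are made up for the example.

```python
import heapq


def pull_schedule(task_count, speeds):
    """Simulate a central server handing out one task per request.

    speeds maps volunteer name -> tasks completed per unit time (assumed constant).
    Returns how many tasks each volunteer ended up executing.
    """
    completed = {name: 0 for name in speeds}
    # Every volunteer requests its first task at time 0.
    ready = [(0.0, name) for name in sorted(speeds)]
    heapq.heapify(ready)
    for _ in range(task_count):
        now, name = heapq.heappop(ready)     # next volunteer to ask for work
        completed[name] += 1
        finish = now + 1.0 / speeds[name]    # time at which it asks again
        heapq.heappush(ready, (finish, name))
    return completed
```

For example, `pull_schedule(100, {"fast": 4.0, "slow": 1.0})` gives the faster volunteer four times as many tasks as the slower one, without the server needing to know either volunteer's speed.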

This approach to load balancing relies on a central system controlling the dispatch of jobs. The system described by Butt et al can take advantage of this, even though their underlying system is fully decentralised, but systems such as DREAM and Awan et al’s must provide alternative load balancing mechanisms.

DREAM’s epidemic protocol provides load balancing for that system. Essentially each volunteer balances its jobs with its local neighbours. These local exchanges result in general load balancing when repeated across the entire network, but this relies on the particular nature of DREAM jobs and is not suitable for a general purpose cycle‐stealing system.

Assuming a sufficient number of tasks, Awan et al’s random walk protocol will

distribute these tasks uniformly across all volunteers. This approach provides

basic load balancing but does not take into account the varying capabilities of

different volunteers; more powerful volunteers are not allocated additional

tasks.

3 Decentralised Cycle‐Stealing

Cycle‐stealing frameworks are designed to simplify the development of cycle‐

stealing applications. The frameworks aim to handle the cycle‐stealing aspects

of the application, allowing application developers to concentrate on their spe‐

cific problem.

The most significant problem that must be addressed by a cycle‐stealing

framework is the brokerage functionality. Cycle‐stealing brokers are responsi‐

ble for locating computing cycles on volunteer machines and facilitating their

use by client applications. Typically brokers are implemented in a centralised

manner, either as a single server or a network of servers. These approaches are

problematic because they place a large load on the servers resulting in limited

scalability and heavy cost to the maintainers of the broker.

In this chapter I present a fully‐decentralised cycle‐stealing framework called

G2:P2P and examine how it overcomes the challenges inherent in performing

cycle‐stealing on a fully decentralised network. G2:P2P is the first general pur‐

pose, pure P2P cycle‐stealing system. Previous attempts at decentralised cycle‐

stealing have concentrated on specific problem areas such as evolutionary

computing and are unsuitable for general purpose applications. G2:P2P not

only supports standard cycle‐stealing applications, but actually expands the

range of applications which may be solved through cycle‐stealing by providing

direct communication channels between executing tasks. This direct communi‐

cation allows problems which were previously only addressable on cluster and

multi‐core machines to be approached with cycle‐stealing. To support these

applications I have developed a distributed object programming model which

provides a simple interface for application developers with support for flexible

communication patterns.

The contributions of this chapter are:

• a design for a fully decentralised, general purpose, cycle stealing

framework, and

• a programming model suitable for developing distributed object appli‐

cations on a decentralised P2P cycle stealing system, including direct

object‐to‐object communication.

G2:P2P is an entirely new framework which, whilst building upon general cycle‐stealing knowledge gained during the Gardens and G2 Classic projects, has

been designed and developed from scratch and is not an incremental develop‐

ment of any previous project.

This chapter consists of five sections. In section 3.1 I present a model for cycle

stealing on a fully decentralised network. This section presents the design of

the G2:P2P framework, particularly how it fulfils the brokerage function that

traditionally has been performed by a central machine in previous cycle steal‐

ing frameworks. Section 3.2 presents a new programming model which allows

application developers to take advantage of the new facilities enabled by de‐

centralised cycle stealing. In section 3.3 I discuss how volunteer machines ar‐

rive and depart from a G2:P2P network. This includes how to handle providing

jobs to these volunteers as they arrive and how to redeploy jobs when they de‐

part. Section 3.4 discusses the details of G2:P2P’s implementation and I con‐

clude in section 3.5.

3.1 G2:P2P Design

A G2:P2P system consists of a set of volunteers organised as a decentralised

P2P network. Clients connect to this network and submit jobs for execution.

The volunteers are collectively responsible for assigning the jobs to specific

volunteers so they can be executed, and returning the results of the execution

to the client. The volunteers also provide additional services to the executing

job, including the ability to create additional jobs, and to communicate be‐

tween running jobs.

Essentially the P2P network of volunteers acts as the broker of the cycle‐

stealing system. Whereas centralised cycle‐stealing systems would use a server

for connecting volunteers to clients (or their jobs), G2:P2P relies on the volun‐

teers performing that process collectively. Figure 3‐1 gives a high level view of

the system, illustrating how clients communicate with the cloud of volunteers

just as they would communicate with a server in a centralised system. The

principal benefit of this is of course scalability; as more volunteers join the

network, more clients can be serviced. Additionally, this approach offers the

opportunity for direct communication between running jobs. Previously such

communication channels have had to be regulated by a server.

FIGURE 3‐1 – G2:P2P OVERVIEW

A design goal for G2:P2P was to simplify the task of writing cycle‐stealing ap‐

plications. This goal has been used to guide decisions which were not con‐

cerned with the fundamental aspects of decentralised cycle‐stealing. By simpli‐

fying the cycle‐stealing aspect of application development, application devel‐

opers are able to concentrate on their problem domain rather than the intrica‐

cies of cycle‐stealing. Obviously they must still be aware that the application

will be executed within the G2:P2P environment, but the consequences of this

should be minimised.

3.1.1 Job Assignment

Many cycle‐stealing systems use a “pull” model for distributing jobs to volun‐

teers. Volunteer machines connect to a job server and request some work. The

job server maintains a list of jobs from which it can select an appropriate job to

assign to that volunteer. This approach places a lot of load on the job server

and limits the scalability of the entire system. Whilst systems of complicated

server hierarchies have been developed to allow systems to scale(29), these

complex hierarchies significantly increase the management effort and cost re‐

quired to set up and maintain the system. A pure P2P system offers the oppor‐

tunity to provide this scalability by distributing the job server’s role to the vol‐

unteers themselves.

Assigning jobs to volunteers is however a difficult process in a fully decentral‐

ised system. In centralised systems the process is relatively simple. The server

machine has a list of outstanding jobs and simply assigns one of these jobs to

each volunteer as it requests work. However, in a fully decentralised system

this approach is useless. Clients must be able to submit jobs through any volun‐

teer in the network. Maintaining a global list would therefore place too much

load on the list’s maintainer, essentially creating the same bottleneck as a cen‐

tralised system and similarly limiting scalability.

G2:P2P’s job assignment algorithm is built on the properties provided by P2P

distributed hash table (DHT) overlays. DHTs differ significantly from other P2P

networks by providing efficient lookup of resources on large networks. The

unstructured networks used in many P2P applications, such as Gnutella or Ka‐

zaa, do not allow for this precise addressing of items on the network. It is this

precise addressing and the manner in which the addresses are resolved which

allows G2:P2P to provide an efficient distributed job assignment mechanism.

G2:P2P uses the Pastry network for its underlying network infrastructure,

however the basic concepts developed could be suitably implemented on any

of the major DHT systems.

Within a DHT, each node is assigned a unique ID. Messages can be routed to the

node using its ID, however, unlike IP routing, messages sent to non‐existent

addresses are automatically routed to the node with the numerically closest ID.

This ensures that all addresses in the address space are valid and can be used

to store resources. As nodes join and leave the network the range of addresses

assigned to each node will change, however, there will always be one node re‐

sponsible for each address.
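The ownership rule can be made concrete with a few lines of code. This is a simplified sketch and not Pastry's routing algorithm: it assumes the full set of live node IDs is known in one place, uses a small 16-bit circular ID space for readability (Pastry IDs are 128 bits long), and breaks ties towards the smaller ID, which is an arbitrary choice for the example.

```python
ID_BITS = 16               # small ID space for readability; Pastry uses 128-bit IDs
ID_SPACE = 1 << ID_BITS


def ring_distance(a, b):
    # Distance between two IDs on a circular ID space.
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)


def responsible_node(key, node_ids):
    """Return the live node whose ID is numerically closest to key."""
    # Ties are broken towards the smaller ID (arbitrary for this sketch).
    return min(node_ids, key=lambda n: (ring_distance(n, key), n))
```

Every key has exactly one responsible node for any non‐empty set of nodes. When a node joins or leaves, only the keys closest to that node change owner, for instance `responsible_node(30000, {100, 20000, 45000})` is 20000, but becomes 45000 once node 20000 has left.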

Volunteer hosts for G2:P2P jobs are assigned by allocating each job an ID and

using the DHT’s routing scheme to automatically match that ID to a volunteer.

By using the DHT routing it is guaranteed that a job is hosted by the volunteer whose ID is numerically closest to the job’s ID. This connection between job IDs and volunteer IDs is maintained for the entire lifetime of the job, even as volunteers come and go from the network. Section 3.3 will describe how this connection is maintained.

FIGURE 3‐2 – ASSIGNING JOBS TO VOLUNTEERS

Since there is no central authority to manage the assignment of job IDs two problems that arise must be catered for:

1. Load Imbalance – Some volunteers may be assigned many jobs while other volunteers receive very few jobs. This results in inefficient use of the volunteer pool.

2. Job ID Conflicts – Multiple jobs may be assigned the same ID, interfering with future management of those jobs.

Load imbalance can be allayed by ensuring that job IDs are dispersed across the entire address space. This dispersal can be performed by generating random IDs with a uniform distribution. This will minimise load imbalance, especially in large networks with many jobs. While this basic load balancing approach performs quite well, there are more advanced techniques which can be

used to further improve load balancing. Chapter 5 outlines some extensions to

the basic job assignment protocol which provide even better distribution of

jobs to volunteers, particularly in small networks.
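The basic scheme of uniformly random job IDs can be tried out with a small experiment. The sketch below is illustrative only: the 32‐bit ID space, the function names and the use of randomly placed volunteer IDs are assumptions, and the closest‐ID lookup is done centrally rather than by DHT routing.

```python
import random
from collections import Counter

ID_SPACE = 1 << 32   # assumed size of the ID space for this sketch


def closest(key, node_ids):
    """Return the node ID numerically closest to key on a circular ID space."""
    def dist(n):
        d = abs(n - key)
        return min(d, ID_SPACE - d)
    return min(node_ids, key=lambda n: (dist(n), n))


def assign_jobs(job_count, volunteer_count, seed=3):
    """Give every job a uniformly random ID and count the jobs per volunteer."""
    rng = random.Random(seed)
    volunteers = [rng.randrange(ID_SPACE) for _ in range(volunteer_count)]
    jobs = [rng.randrange(ID_SPACE) for _ in range(job_count)]
    return Counter(closest(job, volunteers) for job in jobs)
```

Inspecting the result of `assign_jobs(2000, 20)` shows how evenly (or unevenly) jobs fall on a small pool of volunteers, since each volunteer's share depends on the size of the address range it happens to own.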

While job ID conflicts are expected to be rare because the address space for job

IDs is very large, any conflicts that do occur could cause significant problems.

Thankfully, ID conflicts are relatively simple to resolve. When a job is first cre‐

ated a “creation” message is routed to its ID with the details of the job. If a job

already exists with the same ID then it is guaranteed that the conflicting job

will be hosted on a volunteer which receives this creation message. Therefore,

a conflict will be detected as soon as the creation message is received, prior to

the job actually being instantiated. If a conflict is detected the volunteer simply

selects another ID within its realm of responsibility and assigns the job this

new ID. The new ID is also included in the reply to the creation message so the

client that created the job will have the correct ID for future reference. Job ID

conflicts are rare since the job address space is large so this scheme adds no

significant overhead to the creation process. In the extremely rare case that an

ID is unavailable in the volunteer’s address space it can simply generate a new

random ID and forward the creation message on.
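The conflict check performed on receipt of a creation message can be sketched as below. This is not G2:P2P's code: the `Volunteer` class, the half‐open range `[low, high)` standing in for the volunteer's realm of responsibility, and the linear scan for a free ID are assumptions made to keep the example short.

```python
class Volunteer:
    def __init__(self, low, high):
        # Assumption: this volunteer is responsible for IDs in [low, high).
        self.low, self.high = low, high
        self.jobs = {}

    def handle_creation(self, job_id, details):
        """Host the job and return the ID it was finally given.

        Returns None if no ID is free in this volunteer's range; the caller
        would then generate a new random ID and forward the message on.
        """
        if job_id in self.jobs:            # conflict detected before instantiation
            job_id = self._free_id()
            if job_id is None:
                return None
        self.jobs[job_id] = details
        return job_id                      # included in the reply to the client

    def _free_id(self):
        # Linear scan keeps the sketch short; any free ID in range would do.
        for candidate in range(self.low, self.high):
            if candidate not in self.jobs:
                return candidate
        return None
```

The client always uses the ID returned in the reply, so a reassigned job can still be addressed correctly afterwards.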

3.2 Programming Model

An important aspect of a cycle‐stealing framework’s design is its programming

model. The programming model defines how users of the framework, i.e. cycle‐

stealing application programmers, access the framework’s features. In existing

frameworks, jobs are usually independent processes that run once then return

some result. Some frameworks allow for sub‐jobs to be created to further de‐

compose the problem(36), but these sub‐jobs are still fundamentally inde‐

pendent processes.

G2:P2P differs from previous frameworks by offering direct communication

between executing jobs via message passing. While this has been available in

other parallel computing fields for many years, the client‐server nature of most

cycle stealing projects has discouraged its adoption in cycle‐stealing. Current

cycle‐stealing programming models are hence not flexible enough to effectively

support inter‐job communication.

There are two candidates for providing message passing style communication

in the programming model of G2:P2P – an explicit message passing library or a

distributed object model. Message passing libraries provide API calls which

explicitly send or receive messages between tasks. To use the library the appli‐

cation programmer must keep track of addresses of jobs they wish to commu‐

nicate with. Additionally the programmer must supply explicit points where

they will retrieve incoming messages. This approach is simple to implement

but places a large burden on the application programmer.

In a distributed object model the message passing is abstracted away as

method calls. Each job consists of an object instance which exposes a number

of methods. Messages are passed by obtaining a reference to another object

and calling these methods with the appropriate parameters. This approach is

familiar to users of some distributed programming APIs such as Java RMI

and .NET Remoting.

A distributed object model has been adopted for G2:P2P since it is most famil‐

iar to non‐expert programmers, that is, programmers who don’t have experi‐

ence with parallel programming. Whilst message passing libraries are very

common in parallel computing systems, they are not common in general pur‐

pose programming. The explicit message passing model can also be easily emu‐

lated using distributed objects, however distributed objects are generally pro‐

vided as a core feature of the framework and are more difficult to emulate at

the application level.

3.2.1 Distributed Object Model

Selecting the distributed object model as the programming model for G2:P2P opens the possibility to further simplify the application programmer’s job by fully integrating G2:P2P into an existing distributed object API.

Both Java and .NET provide remote object APIs through their Remote Method

Invocation and Remoting libraries respectively. While both APIs have some ex‐

tension support, the .NET Remoting approach provides greater flexibility in its

extension mechanisms(37). For this reason G2:P2P has been implemented as

an extension to .NET Remoting.

By integrating with an existing API a very simple model for writing G2:P2P ap‐

plications can be provided. To create jobs the application programmer simply

needs to instantiate a class they have previously marked for remote execution.

// Mark type for remoting
G2P2PChannel.Current.RegisterActivatedClientType(typeof(MyType));

// The construction call is intercepted by Remoting and becomes a G2:P2P job
MyType remoteObject = new MyType(arg1, arg2);

LISTING 3‐1 – CREATING G2:P2P JOBS

Once a type is registered, the Remoting infrastructure will intercept any con‐

struction calls and convert them into a message. This message is passed to a

G2:P2P filter which has been registered with Remoting. This filter generates an

ID for the new object and routes the message to its appropriate host. The Re‐

moting infrastructure on the host automatically checks for conflicts, instanti‐

ates the object, and stores a reference for future remoting calls.

On the client side a proxy object is created by Remoting using the ID generated

by G2:P2P. This proxy object presents the same interface to the application as

the actual object would. It is through this proxy that application programmers

can launch remote method invocations on the objects. Whenever a method is

called on the proxy, the Remoting infrastructure converts the method call into

message format. This message is provided to G2:P2P which then uses Pastry to

pass the message to the remote object’s host. At this point G2:P2P passes the

message back to the Remoting infrastructure which converts the message into

a standard stack‐based method call, executes it and returns the results to G2:P2P so they can be routed back to the caller through the same Pastry procedure.
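The round trip from method call to message and back can be mimicked in a language‐neutral way. The sketch below is not .NET Remoting and not G2:P2P: `Network`, `Proxy`, `Accumulator` and `Worker` are hypothetical names, object IDs are simple counters rather than random DHT IDs, and delivery is a local dictionary lookup standing in for Pastry routing.

```python
class Network:
    """Stand-in for the overlay: maps object IDs to hosted instances."""

    def __init__(self):
        self.hosted = {}
        self.next_id = 0

    def create(self, cls, *args):
        # Construction becomes a message; the "host" instantiates the object.
        object_id = self.next_id
        self.next_id += 1
        self.hosted[object_id] = cls(*args)
        return Proxy(self, object_id)

    def deliver(self, message):
        # Host side: turn the message back into an ordinary method call.
        target = self.hosted[message["target"]]
        return getattr(target, message["method"])(*message["args"])


class Proxy:
    """Holds only the target ID; every method call is sent as a message."""

    def __init__(self, network, object_id):
        self._network = network
        self._object_id = object_id

    def __getattr__(self, name):
        def call(*args):
            message = {"target": self._object_id, "method": name, "args": args}
            return self._network.deliver(message)
        return call


class Accumulator:
    def __init__(self, start):
        self.value = start

    def add(self, amount):
        self.value += amount
        return self.value


class Worker:
    def start(self, partner):
        # 'partner' is a Proxy received as an ordinary parameter.
        return partner.add(1)
```

Because a proxy carries nothing but the target's ID, it can be handed to another object as a parameter, as `Worker.start` does here, and calls made through it still reach the original instance.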

3.2.2 Inter‐Object Communication

Objects in a distributed object model can also communicate with each other

through remote method calls. To deliver data from one object to another, an object simply invokes the appropriate method and provides the data as parameters. This allows a well defined communication interface to be declared easily by simply marking the appropriate methods as “public”.

The proxy objects generated when jobs are created provide an ideal method of

initiating this communication, however the communicating object must some‐

how obtain one of these proxies. Since proxy objects simply store the ID of the

target object, they can easily be passed between objects as parameters like any

other object. The G2:P2P routing scheme will correctly route method calls to

the target object regardless of where they are initiated provided it has the tar‐

get object’s ID. Listing 3‐2 demonstrates how proxy objects can be passed and

used for inter‐object communication.

class Client
{
    public void Main()
    {
        // Assume MyType1 & MyType2 are configured for Remoting
        MyType1 remoteObject1 = new MyType1();
        MyType2 remoteObject2 = new MyType2();

        // Start processing in remote object 1 and pass 2nd object
        // to allow inter-object communication
        remoteObject1.Start(remoteObject2);
    }
}

class MyType1
{
    public void Start(MyType2 partner)
    {
        // Do some work
        partner.SendData(someData);
    }
}

LISTING 3‐2 – INTER‐OBJECT COMMUNICATION

3.2.3 Well‐Known Objects

Section 3.2.2 described how remote references could be passed between the

objects in an application to allow them to communicate through method calls.

However, it is common for applications to include some well‐known objects

which are required by all, or at least many, other objects in the application. For

example, an application may include a central object which monitors the

progress of the application. Each worker object would periodically contact this

object and update its status. A user‐interface could also contact this object to

retrieve the status and display it for the user’s benefit. Using the previous methods, references to this monitoring object would need to be passed manually to each of the worker objects. The user‐interface would also need to obtain a reference to the object somehow. Obtaining these references has been simplified

by including a special mechanism for obtaining proxies to this type of object,

which are termed “well‐known” objects.

Since proxy objects are essentially a vessel for storing an object ID and expos‐

ing a facade for a particular type, these proxies can be created on any node, as‐

suming the object ID and object type are available. Typically the object ID is

generated by the G2:P2P runtime when an object is created, however for well‐

known objects it would be more suitable if a more user friendly ID could be

used, such as an application defined string. Such a string can easily be embed‐

ded in the actual code for the objects, eliminating the need for object refer‐

ences to be passed around.

G2:P2P still requires a Pastry style object ID to find the object’s host and route

messages to it. Such an ID is generated by passing the application defined

string ID through a cryptographic hash function. The outputs from crypto‐

graphic hash functions typically have a uniform random distribution. This is

necessary when generating object IDs to maintain the same load balancing fea‐

tures as the random object ID generation.
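Deriving an ID from an application‐defined string might look like the following. This is a sketch under assumptions: the text does not say which hash function G2:P2P uses, so SHA‐1 is chosen here purely for illustration, and the function name and the truncation to the top 128 bits (the length of a Pastry ID) are likewise choices made for the example.

```python
import hashlib

ID_BITS = 128  # Pastry IDs are 128 bits long


def well_known_id(name):
    """Map an application-defined string to a fixed ID in the overlay's ID space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()      # 160-bit digest
    return int.from_bytes(digest, "big") >> (160 - ID_BITS)   # keep the top 128 bits
```

Any node that knows the string can compute the same ID independently, which is what allows proxies to well‐known objects to be created without passing references around.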

Unlike regular G2:P2P objects, well‐known objects are not created at the same time that their proxy objects are created. Standard objects are created when the application encounters a “new” operation for a registered type. At this point a creation message is generated and submitted to the G2:P2P system. A proxy object is also created and returned to the application for future communication with the object. For standard objects these proxy objects must be passed around the network if other objects need to communicate with the new object. One of the main goals of well‐known objects is to avoid this need to pass proxy objects around, therefore, multiple objects must be able to create their own proxies without actually initiating construction of the object. Figure 3‐4 and Figure 3‐5 demonstrate the difference between how standard and well‐known objects are created.

FIGURE 3‐3 – SENDING MESSAGES TO WELL‐KNOWN OBJECTS

FIGURE 3‐4 – STANDARD G2:P2P OBJECT CREATION SEQUENCE

FIGURE 3‐5 – WELL‐KNOWN G2:P2P OBJECT CREATION SEQUENCE

Unlike standard objects, when creating a proxy to a well‐known object the application must provide the URL where the object is hosted. This can be done by either configuring a URL for a specific type or by using a library function to generate the proxy. If a type is configured with a URL for a well‐known object

then any attempt to create an object of that type will actually create a proxy for

contacting the single well‐known instance. Alternatively, using the library

function allows multiple instances of the type to be generated at different URLs.

Section 3.4.4 will provide details on how objects can connect to and communi‐

cate with well‐known objects.

Since proxy generation does not actually create the object instance, a separate mechanism must be supplied to do so – either an explicit creation procedure or some implicit mechanism. Since an application may consist of many distributed objects all connecting to a single well‐known object it may be difficult for an application programmer to identify a single point to perform an explicit creation. Therefore an implicit mechanism for creating well‐known objects is provided. When a message is received for a well‐known object for the first time, the hosting volunteer creates an instance of the appropriate type and assigns it the appropriate ID (see Figure 3‐5). The volunteer must use the type’s parameterless constructor to do this. If the object requires data for initialisation then the application programmer must call an explicit initialisation method before any other communication with the object. It is the responsibility of the application programmer to ensure this call is made before any other calls.
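Implicit creation on first contact can be sketched as follows. The `Host` and `ProgressMonitor` classes and the string ID are hypothetical and only illustrate the behaviour described above; they are not part of G2:P2P.

```python
class Host:
    def __init__(self, well_known_types):
        # Maps a well-known ID to the class to instantiate for it (assumed lookup).
        self.well_known_types = well_known_types
        self.objects = {}

    def deliver(self, object_id, method, *args):
        if object_id not in self.objects:
            # First message for this ID: create the instance implicitly,
            # using the parameterless constructor.
            self.objects[object_id] = self.well_known_types[object_id]()
        return getattr(self.objects[object_id], method)(*args)


class ProgressMonitor:
    def __init__(self):
        self.total = None
        self.done = 0

    def initialise(self, total):
        # Explicit initialisation; the application must call this first.
        self.total = total

    def report(self, count):
        self.done += count
        return self.done / self.total
```

Here the first `deliver` call for the ID creates the `ProgressMonitor`, and later calls reach the same instance. If `report` were called before `initialise` it would fail, which mirrors the obligation placed on the application programmer.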

3.2.4 Object Lifetime

As well as creating objects, G2:P2P must provide mechanisms for removing ob‐

jects when they are no longer needed. The simplest method of providing this

would be to require application programmers to explicitly call an API method

to remove all objects when they have completed their work, however this plac‐

es an extra burden on application programmers and also introduces the possi‐

bility of orphaned objects. Orphaned objects may occur if an application crash‐

es before cleaning up its resources, or simply because an application pro‐

grammer has forgotten to include the object cleanup code.

Both .NET Remoting and Java RMI include an object lifetime service based on

leases(37). Each object is provided with a lease when it is created. A LeaseManager on each Remoting server periodically inspects each object’s lease to see if any have expired. If a lease has expired the object is destroyed and removed

from the server. Leases are automatically renewed whenever a method call is

received by an object. Additionally, an object can be sponsored by another ob‐

ject. If a sponsored object’s lease expires then the sponsor is contacted to see if

they wish to renew the lease. The length of an object’s lease can be set by the

object itself which allows application programmers to adjust the lease length

appropriately for the frequency with which an object will be contacted. Ulti‐

mately these leases can be sponsored by the originating machine which en‐

sures that objects are kept alive for the length of the application and will be

collected once the application, and hence that object, is destroyed.
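The lease rules described above (renewal on each call, a periodic sweep, and a sponsor consulted on expiry) can be sketched as follows. This is an illustrative Python model, not the Remoting API: the class and method names, the logical clock and the rule that a call guarantees at least `renew_on_call` remaining time are all assumptions made for the example.

```python
class Lease:
    def __init__(self, initial_time, renew_on_call, sponsor=None):
        self.expires_at = initial_time      # logical clock starts at 0
        self.renew_on_call = renew_on_call  # minimum time left after a call
        self.sponsor = sponsor              # callable(object_id) -> extra time; 0 declines


class LeaseManager:
    def __init__(self):
        self.leases = {}

    def register(self, object_id, lease):
        self.leases[object_id] = lease

    def on_method_call(self, object_id, now):
        # A method call renews the lease, but never shortens it.
        lease = self.leases[object_id]
        lease.expires_at = max(lease.expires_at, now + lease.renew_on_call)

    def sweep(self, now):
        """Periodic inspection: ask sponsors about expired leases, drop the rest."""
        for object_id, lease in list(self.leases.items()):
            if lease.expires_at > now:
                continue
            extra = lease.sponsor(object_id) if lease.sponsor else 0
            if extra > 0:
                lease.expires_at = now + extra
            else:
                del self.leases[object_id]  # object would be destroyed here
        return sorted(self.leases)


manager = LeaseManager()
manager.register("worker", Lease(initial_time=10, renew_on_call=5))
manager.register("results", Lease(initial_time=10, renew_on_call=5,
                                  sponsor=lambda object_id: 20))
manager.on_method_call("worker", now=8)  # worker now expires at 13
alive_at_12 = manager.sweep(now=12)      # results is renewed by its sponsor
alive_at_15 = manager.sweep(now=15)      # worker has no sponsor and is collected
```

If the sponsor itself disappears (for example because the originating application has ended), the sponsored object is collected at the next sweep after its lease runs out.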

Since G2:P2P is built on .NET Remoting, it can automatically take advantage of
Remoting’s lease‐based lifetime service. This provides flexible object lifetime
management while avoiding the potential problems that an explicit destruction
method call would introduce.

3.3 Volunteer Arrival & Departure

A major benefit of using a fully decentralised P2P network for cycle‐stealing is

its ease of management. While centralised systems require central server com‐

ponents that must be set up, maintained and extended to adjust for load, de‐

centralised solutions are entirely managed by each node. Since these nodes

are already being maintained for other purposes the maintenance cost of the

P2P network is negligible.

As is typical for many decentralised networks, G2:P2P has departed from a

“pure” P2P implementation in one area – node discovery. Node discovery is the

process by which a new node initially contacts an existing network. The prob‐

lem is addressed in a number of P2P projects. The simplest, and most common,

approach is to use a central server which keeps a list of nodes which are active

on the network. When a new node wishes to join it simply makes a request to

this server asking for a node to connect to. The server may either choose a

random node, or may attempt to provide a node which has good communica‐

tion channels with the requesting node.

Other common bootstrapping techniques include address probing, either randomly
or using mechanisms from the underlying network layer (38). This approach
involves selecting a machine and attempting to connect to the P2P network on
that machine, usually by connecting to a well‐known port. If a connection is
established the node is able to become part of the network. If the connection
fails the machine is assumed not to be part of the network and a new candidate
machine is selected. These candidates can either be selected randomly or by
using multicast technology if the underlying network supports it. The
effectiveness of random probing is directly related to the size of the P2P
network: the more machines that already participate, the more likely a probe is
to find one.

For this research the simplest approach has been taken. A simple rendezvous

service on a central server is used to advertise nodes’ addresses. This centra‐

lised approach was chosen to simplify the implementation of G2:P2P. It does

not affect the actual processing of the network and could easily be replaced

with a decentralised method if there was a benefit to the research.

When a volunteer joins a G2:P2P network it generates an ID for itself, either
randomly or by hashing some unique attribute such as its network address.
After completing the standard Pastry joining process (15) the volunteer will
start receiving any messages for IDs within its current address range. Since
volunteers can join at any time, there may already be live objects on the
network. If one of those objects has an ID within the new volunteer’s address
range then its existing host will no longer receive incoming messages for it.
To resolve this issue the object must be migrated from its old host to the new
volunteer.

The final step a volunteer performs when joining a network is to inform its
new leaf set, including its two immediate neighbours, of its arrival. Any
objects that should now be hosted on the new volunteer will be located on these
immediate neighbours. When a volunteer detects that it has a new neighbour it
checks all of the objects it is currently hosting to see if they should be
migrated to the new volunteer. If any objects require migration then they are
immediately packaged and sent to the new volunteer.
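The check an existing volunteer makes when a neighbour arrives can be sketched as a comparison of ring distances. The sketch below is illustrative Python and rests on stated assumptions: a toy 16‐bit ID space (Pastry itself uses 128‐bit IDs), ownership by numerical closeness on the ring, and ties left with the current host. The function names are hypothetical.

```python
ID_SPACE = 2 ** 16  # toy address space for the example


def ring_distance(a, b):
    """Shortest distance between two IDs on a circular address space."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)


def objects_to_migrate(host_id, new_neighbour_id, hosted_object_ids):
    """Objects whose ID is now strictly closer to the new neighbour than to us."""
    return [oid for oid in hosted_object_ids
            if ring_distance(oid, new_neighbour_id) < ring_distance(oid, host_id)]


# A volunteer with ID 1000 learns that a volunteer with ID 2000 has joined.
moving = objects_to_migrate(1000, 2000, [900, 1400, 1600, 1900])
```

Only the objects in `moving` would be packaged and sent; the others remain where they are because the existing host is still the closest node to their IDs.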

There is a period between when a new volunteer joins the network and when it
receives the objects which it is now responsible for hosting. During this period
the volunteer may receive method calls for the incoming objects. Since it cannot
begin to process those calls, it must instead keep them in a storage queue until
the objects are received. Once an object is recreated on the new host the
messages are replayed in the order they were received.
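A minimal sketch of this queue‐and‐replay behaviour is given below. It is illustrative Python with hypothetical names (`PendingCalls`, `receive`, `object_arrived`), and it models an object simply as a callable that handles one call at a time.

```python
from collections import defaultdict, deque


class PendingCalls:
    """Holds calls for objects that have not arrived yet, then replays them in order."""

    def __init__(self):
        self.objects = {}                 # object ID -> handler for calls
        self.queued = defaultdict(deque)  # object ID -> calls awaiting the object

    def receive(self, object_id, call):
        if object_id in self.objects:
            return self.objects[object_id](call)
        self.queued[object_id].append(call)  # object still in transit
        return None

    def object_arrived(self, object_id, handler):
        self.objects[object_id] = handler
        waiting = self.queued.pop(object_id, deque())
        return [handler(call) for call in waiting]  # FIFO replay


host = PendingCalls()
seen = []
host.receive("obj-7", "first")
host.receive("obj-7", "second")
replayed = host.object_arrived("obj-7", lambda call: seen.append(call) or call.upper())
direct = host.receive("obj-7", "third")
```

After the object arrives, later calls bypass the queue and are handled directly.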

This process that volunteers go through to join the network is further devel‐

oped in Chapters 4 & 5 to add support for fault‐tolerance and better communi‐

cation performance.

The basic departure process for volunteers is kept simple. When a volunteer
decides to leave a network it may be hosting a number of running objects.
These objects must be relocated so they can continue executing. Volunteer
departure is performed in two phases to allow for this object migration.

First the volunteer departs the Pastry network so that it is no longer involved
in routing messages. At this point any incoming messages for the objects will
automatically be redirected to the objects’ new hosts. These hosts queue the
messages using the same procedure that a joining volunteer uses. The departing
volunteer then creates migration messages for its hosted objects and sends
these to its previous neighbours. By separating the two phases the volunteer
can safely migrate the objects without being interrupted by new messages for
those objects.

This simple departure model is unrealistic because it expects all volunteers to

completely migrate objects before departing. Chapter 4 will address this short‐

coming by describing methods for supporting unexpected departures such as

from crashing volunteers or network failures.

3.4 Implementation

A prototype G2:P2P framework has been implemented for the Microsoft .NET
platform. The “Remoting” infrastructure of .NET provides an ideal extension
point for tightly integrating G2:P2P into the platform. This allows G2:P2P
applications to be written using familiar distributed object techniques.

3.4.1 Prototype Architecture

The prototype system consists of two distinct layers. The lowest layer is a
custom implementation of the Pastry P2P overlay network. This layer is
implemented separately from the cycle stealing aspects of the system, but does
include some unique features which support the requirements of G2:P2P. The
top layer contains all of the cycle stealing aspects of the system including code
for integration with .NET Remoting, and for hosting and managing remote
objects. Figure 3‐6 provides an overview of the prototype system’s architecture.

Pastry Layer

The Pastry layer provides two external interfaces to the cycle stealing layer –
an interface for creating Pastry nodes and an “external connection” method
which allows interaction with a Pastry network without actually creating a
node and joining the network.

FIGURE 3‐6 ‐ G2:P2P PROTOTYPE ARCHITECTURE

Pastry nodes are the typical way of using the layer. The nodes provide a
simple send/receive interface similar to what is described in the Pastry
literature (15). In addition, a broadcast method has been added. This broadcast
sends a message to every node in the network by passing it incrementally
around the address space. Because the message visits the nodes one after
another, this broadcast method is inefficient and unsuitable for use in real
world networks, but it can be useful during testing stages. The node interface
also provides a number of events which allow changes to the node’s routing
state to be tracked. Events are fired when nodes are added to or removed from
the node’s leaf set. These events are used by G2:P2P’s fault tolerance system
which will be discussed in Chapter 4.
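The cost of passing a broadcast around the address space can be seen in a short sketch. This is illustrative Python, not the prototype's code: it assumes each node hands the message to its successor in ID order and that the originating node does not deliver the message to itself. The function name `broadcast` is hypothetical.

```python
def broadcast(node_ids, origin, deliver):
    """Pass a message once around the ring, starting after the origin node."""
    ring = sorted(node_ids)
    start = ring.index(origin)
    hops = 0
    for step in range(1, len(ring)):
        deliver(ring[(start + step) % len(ring)])  # hand over to the next node
        hops += 1
    return hops


received = []
hops = broadcast([40, 10, 30, 20], origin=30, deliver=received.append)
```

With N nodes the message takes N - 1 sequential hops, so delivery time grows linearly with the size of the network, which is consistent with the method being reserved for testing.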

External connections are an enhancement to the standard Pastry design which
allow messages to be sent on a network without the overhead of running an
actual Pastry node. External connections are used within G2:P2P to allow client
applications to submit work without becoming volunteers themselves, saving
considerable overhead when submitting work and also simplifying the system.
Without external connections the system would need to support nodes which
were in the network but were not actually available for hosting jobs. External
connections are also used when a volunteer leaves a network to redeploy the
objects that volunteer was hosting, greatly simplifying the redeployment process.

An external connection can be created through any node on a network. This

connecting node is termed the ‘host’ node for the external connection. The ex‐

ternal client must generate a unique ID for itself which is submitted to the host

to allow for identification. This ID has the same form as NodeIDs but has an ex‐

tra marker which distinguishes it as an external ID. The ID allows the external

client to send and receive messages the same as actual nodes.

Since the external ID is randomly generated it is unlikely that it will have any
resemblance to the host node’s ID. This presents a problem for routing
messages to the external client since it is not part of the standard Pastry
routing layout. This problem is solved by setting up a redirection pointer from
the node whose ID is closest to the external ID to the node hosting the external
connection. Figure 3‐7 demonstrates how a reply message to an external client
is redirected through a redirection pointer.

FIGURE 3‐7 ‐ EXTERNAL CLIENT MESSAGE REDIRECTION

This redirection provides some benefits for external connections. By keeping a
separation between the external client’s ID and the host’s ID it is possible for
an external client to switch hosts and still receive messages. When a new host
is chosen that host simply updates the redirection pointer for the client so it
will then receive any future messages. Host switching is necessary to allow
external clients to continue functioning even if their host has to leave the
network, but it also provides the possibility of having disconnected clients.
External clients can connect to the network, submit some work, then disconnect.
While they are disconnected the redirection pointer is set to null and any
messages they receive are simply stored. When the external client reconnects it
uses the same external ID and retrieves all of the stored messages for
processing.

This redirection mechanism does introduce some extra cost when communicating
with external hosts. It is expected that external hosts have minimal
communication and so this cost should not be an issue. If heavier communication
is required then a node should be created which will exist within the Pastry
network and participate fully in message routing.
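The behaviour of a redirection pointer, including host switching and storage of messages for a disconnected client, can be sketched as a small state machine. The Python below is illustrative only; the class and method names are hypothetical, and `None` stands in for the null pointer mentioned in the text.

```python
class RedirectionPointer:
    """Kept by the node whose ID is closest to the external client's ID."""

    def __init__(self):
        self.host = None   # None models the null pointer of a disconnected client
        self.stored = []   # messages held while the client is away

    def connect(self, host):
        """Point at the (possibly new) host and hand back anything stored."""
        self.host = host
        pending, self.stored = self.stored, []
        return pending

    def disconnect(self):
        self.host = None

    def route(self, message):
        if self.host is None:
            self.stored.append(message)  # keep until the client reconnects
            return None
        return (self.host, message)      # forward via the current host


pointer = RedirectionPointer()
pointer.connect("node-A")
first = pointer.route("reply-1")
pointer.disconnect()
pointer.route("reply-2")
pointer.route("reply-3")
backlog = pointer.connect("node-B")  # client returns through a different host
later = pointer.route("reply-4")
```

The extra hop through the pointer is the additional cost noted above: every message to an external client is routed to the closest node first and only then forwarded to the host.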

The Pastry layer also contains the networking code for the actual
communication between Pastry nodes. This networking code is abstracted within a
communication module which allows alternate communication methods to be
substituted. In the current implementation a TCP adapter is provided and used
by default; however, a simulation adapter was also developed which allows a
Pastry network to be simulated on a single machine. This simulation is useful
for testing the Pastry layer and some aspects of the cycle stealing layer but is
not sophisticated enough to provide a full simulation of G2:P2P.
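The value of abstracting the transport can be shown with a toy in‐memory adapter. The following Python sketch is an assumption‐laden illustration, not the prototype's design: `SimulatedTransport`, `Node` and their methods are hypothetical names, and a real TCP adapter would expose the same `listen`/`send`/`inbox` operations over sockets.

```python
class SimulatedTransport:
    """In-memory stand-in for a network adapter: one inbox per address."""

    def __init__(self):
        self.inboxes = {}

    def listen(self, address):
        self.inboxes[address] = []

    def send(self, address, payload):
        inbox = self.inboxes.get(address)
        if inbox is None:
            return False  # nobody is listening at that address
        inbox.append(payload)
        return True

    def inbox(self, address):
        return list(self.inboxes[address])


class Node:
    """A node only talks to its transport, so the transport can be swapped."""

    def __init__(self, address, transport):
        self.address = address
        self.transport = transport
        transport.listen(address)

    def send(self, other_address, payload):
        return self.transport.send(other_address, (self.address, payload))

    def received(self):
        return self.transport.inbox(self.address)


network = SimulatedTransport()
node_a = Node("node-a", network)
node_b = Node("node-b", network)
delivered = node_a.send("node-b", "join-request")
undeliverable = node_a.send("node-z", "ping")
```

Because `Node` depends only on the three transport operations, a whole network of nodes can be exercised inside one process during testing.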

Cycle Stealing Layer

The prototype’s cycle stealing layer consists of two main modules – the object

manager and the .NET Remoting integration module.

The object manager is responsible for supervising the G2:P2P remote objects

which are currently hosted on the volunteer. Each G2:P2P volunteer has a sin‐

gle object manager responsible for matching incoming messages to their target

object, migrating objects when a volunteer departs the network, and monitor‐

ing any communication to and from the objects. The manager also includes the

facilities for fault tolerance and locality optimisation which are discussed in

Chapters 4 and 5.

The Remoting module contains all of the code necessary for integrating G2:P2P

into the Remoting infrastructure. This integration greatly simplifies writing

applications for G2:P2P and also allows for easy migration of existing Remoting

applications onto G2:P2P. The Remoting module provides the cycle‐stealing

application programmers’ interface to G2:P2P; it allows programmers to create

G2:P2P objects and initiate method calls between those objects. All other fea‐

tures of G2:P2P are implemented within the object manager.

The details of how G2:P2P is integrated into Remoting are discussed in the
following sections. Since Remoting was designed with a client‐server
architecture, integrating G2:P2P into Remoting requires significant effort and a
number of unique techniques.

3.4.2 .NET Remoting Background

.NET Remoting is a core feature of the .NET runtime which allows method calls
to occur between objects in separate application domains. Application domains
in .NET are the unit of isolation for an application. They ensure that separate
applications cannot access each other’s code or resources and that faults in
one application do not affect other applications. They are somewhat analogous
to operating system processes, except that there may be multiple application
domains within a single process. Remoting allows applications to communicate
across the application domain boundary, whether that boundary is within the
same operating system process, between separate operating system processes or
between separate physical machines.

Communication in Remoting occurs along transport channels. A number of

standard channels are supplied with the framework including channels to

communicate on TCP and HTTP. Additionally, users may extend Remoting by

developing their own custom channels. Channels generally consist of two parts

– a server side and a client side, each hosted in separate application domains.

The client side is responsible for taking a chunk of data supplied by Remoting

and transferring it to the server side which then provides it to the Remoting

infrastructure in its domain. The Remoting infrastructure handles the details of

translating a method or construction call into a chunk of data, and recovering

and executing the call on the other side.

In addition to channels, Remoting allows extension through message sinks.
Message sinks are used to provide channel‐agnostic processing of Remoting
messages. A standard Remoting installation includes message sinks used for
converting the original message into different serialised forms such as binary
or SOAP. Other potential uses for message sinks include encryption of messages
or redirection of messages for load balancing.

Remoting calls are initiated by executing methods on proxy objects. These
proxies masquerade as normal objects and are entirely transparent to the
application developer. Since Remoting is a core feature it is deeply integrated
into the .NET runtime. This deep integration allows it to intercept object
construction calls and substitute these proxies at creation time. When a call is
made on a proxy the call is actually converted from a standard stack based
method call into a message. This message is passed through a chain of client
side message sinks until it reaches the client side of the transport channel.
The channel then sends the serialised message to the server side which sends it
down the chain of server side message sinks. At the end of this chain is a real
proxy object. This object takes the message, converts it back to a stack based
call and executes it on the target object. Figure 3‐8 shows the standard .NET
Remoting method call process.

FIGURE 3‐8 ‐ .NET REMOTING STRUCTURE

To perform all of this the transport channels and the real proxy need sufficient
information to find the correct server to send the message to, and identify
which object on that server to execute the method on. Remoting assigns each
object a URL to uniquely identify it for this purpose. URLs consist of three
parts:

• A scheme specification which identifies which channel should be used
for communication. These are common to all URLs and include “tcp:”
and “http:” for the built in schemes.

• A channel specific section which allows the channel to correctly identify

which server to send messages to.

• An object identifier which identifies which object on the server the mes‐

sages should be delivered to.
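The three parts listed above can be separated with a few lines of string handling. The sketch below is illustrative Python; the function name and the example URL (server name, port and object name) are hypothetical and are not taken from the text.

```python
def split_remoting_url(url):
    """Split a URL into (scheme, channel-specific section, object identifier)."""
    scheme, _, rest = url.partition("://")
    channel_part, _, object_id = rest.partition("/")
    return scheme, channel_part, object_id


# Hypothetical example of a URL served by the built-in TCP channel.
parts = split_remoting_url("tcp://server01:8080/Calculator.rem")
```

Here the channel‐specific section carries the server address, which is the part a channel uses to decide where a message must be sent.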

There are two types of object identifiers, one for identifying standard G2:P2P

objects and one for identifying well‐known objects. Typically standard identifi‐

ers are GUIDs created by the server when the object is created.

Application developers must register any types that will be used for Remoting

before creating instances of those types. As part of this registration process the

developer must supply a base URL for the type. This base URL includes the

scheme and channel specifications. When an object of that type is created the

Remoting infrastructure iterates through all of the currently registered chan‐

nels and asks them if they are capable of servicing the base URL. The first

channel to indicate that it can is selected and the object is associated with that

channel’s message sink chain. Typically channels will inspect the URL’s scheme

to decide if they should service the object.
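The selection rule (ask each registered channel in turn and take the first that accepts the base URL) can be sketched as follows. This is illustrative Python rather than the Remoting API; the `Channel` class and `select_channel` function are hypothetical, and the case‐insensitive scheme comparison is an assumption made for the example.

```python
class Channel:
    def __init__(self, name, scheme):
        self.name = name
        self.scheme = scheme.lower()

    def can_service(self, base_url):
        # A typical channel inspects only the URL's scheme.
        return base_url.lower().startswith(self.scheme + "://")


def select_channel(channels, base_url):
    """Return the first registered channel that accepts the base URL."""
    for channel in channels:
        if channel.can_service(base_url):
            return channel
    return None


registered = [Channel("tcp channel", "tcp"),
              Channel("http channel", "http"),
              Channel("p2p channel", "g2p2p")]
chosen = select_channel(registered, "g2p2p://network_name")
unknown = select_channel(registered, "ftp://somewhere")
```

Because the first match wins, the order in which channels are registered matters whenever two channels accept the same scheme.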

3.4.3 Integrating G2:P2P into Remoting

G2:P2P’s structure differs significantly from the structure Remoting was

designed for. Remoting objects are typically hosted on a single server machine

which is specified when the client first connects to the object. The entire

framework is built around a client‐server paradigm. This presents two major

problems for integrating G2:P2P:

1. G2:P2P clients do not know which server an object is hosted on when

creating/connecting to a new object.

2. G2:P2P objects may need to move between machines during their life‐

time because of the dynamic nature of the volunteer network.

The first issue is relatively simple to solve. Since the server is specified in the

channel specific portion of the URL a channel must be supplied which can tar‐

get objects at their correct host. To allow multiple G2:P2P networks to be run

from a single rendezvous server a network name must be specified in the URL.

This network name is passed to the server when requesting the set of nodes

used for connecting to the network. A URL scheme is also needed so the

G2:P2P channel can correctly identify which objects it should work with. This

leaves us with a URL of the form:

g2p2p://network_name/{object identifier}

The object identifier in this URL still presents a problem. The standard process

for creating a new object generates the object identifier on the machine which

will host the object. For G2:P2P that host machine is selected using the Objec‐

tID. This means the ID must be generated on the client so it can be used to cor‐

rectly route the creation message to the host. Additionally, the object identifier

in the Remoting URL is supplied by the Remoting infrastructure and is unique

to the machine on which the object is hosted. This means that a table must be

kept which maps G2:P2P ObjectIDs to their Remoting identifiers. Using this

mapping, the server side of the G2:P2P channel can transparently rewrite ob‐

ject URLs at the machine boundary, substituting G2:P2P IDs in outgoing mes‐

sages and Remoting IDs in incoming ones.
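The mapping table and the rewriting at the machine boundary can be sketched with two dictionaries. The Python below is illustrative; the class name, method names and the example identifiers are hypothetical, and the only assumption about URLs is that the object identifier is the final path segment, as in the URL form given above.

```python
class IdentifierMap:
    """Two-way table between G2:P2P ObjectIDs and local Remoting identifiers."""

    def __init__(self):
        self._remoting_by_object = {}
        self._object_by_remoting = {}

    def register(self, object_id, remoting_id):
        self._remoting_by_object[object_id] = remoting_id
        self._object_by_remoting[remoting_id] = object_id

    @staticmethod
    def _swap_identifier(url, table):
        prefix, _, identifier = url.rpartition("/")
        return prefix + "/" + table[identifier]

    def incoming(self, url):
        """Network -> machine: replace the ObjectID with the Remoting identifier."""
        return self._swap_identifier(url, self._remoting_by_object)

    def outgoing(self, url):
        """Machine -> network: replace the Remoting identifier with the ObjectID."""
        return self._swap_identifier(url, self._object_by_remoting)


ids = IdentifierMap()
ids.register("object-42", "a1b2c3.rem")  # both identifiers are made up
local_url = ids.incoming("g2p2p://network_name/object-42")
network_url = ids.outgoing("g2p2p://network_name/a1b2c3.rem")
```

When an object migrates, only the entry on its new host needs to be created; the ObjectID seen by the rest of the network does not change.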

Allowing objects to move between machines is significantly harder to solve.

Both Remoting and Java RMI are simply not designed to facilitate the migration

of objects between processes. The primary problem with migration is how to

serialise the object and deserialise it on the new host. Whilst .NET natively

supports object serialisation, it does not have any method for serialising active

threads. Therefore either a method of serialising .NET threads must be found

or G2:P2P must ensure an object has no active threads on it before migration

occurs.

Thread serialisation on .NET has been researched for use with mobile
agents (39); however, this process requires special preparation of assemblies
before it will work. Since thread serialisation is a common problem on many
managed platforms I have instead investigated how migration can be provided
without requiring thread serialisation.

.NET’s native serialisation framework is sufficient for transferring a G2:P2P
object’s state from one volunteer to another, assuming that the object is not
currently servicing method calls. The naïve way of achieving this is to simply
stop executing new method calls and wait for existing calls to finish. However,
this may result in a deadlock and prevent the executing threads from ever
completing. Deadlocks may occur due to any of the following circumstances:

• A thread is blocked waiting for a signal from another method call

(which will never occur if new method calls are not being intercepted)

• The object contains an endless loop, e.g. a message processing loop

To ensure objects can be migrated when necessary (and, in future,
checkpointed – see Chapter 4) restrictions must be placed on how application
programmers can implement G2:P2P classes. These restrictions are particularly
important for G2:P2P’s fault tolerance mechanisms, so detailed discussion of
them will be delayed until section 4.3.

3.4.4 Activating Objects

.NET Remoting provides two methods of activating remote objects – Client Ac‐

tivated Objects and Server Activated Objects. Client Activated Objects are acti‐

vated by the client side of a Remoting channel, allowing clients to pass initiali‐

sation variables and control exactly when an object is created. Server Activated

Objects are activated on the server in response to incoming messages. There

are two types of Server Activated Objects – singlecall and singleton. If a type is

registered as a singlecall object then a new object is created every time a mes‐

sage is received for the configured URL. Conversely, singleton objects are

created once when the first message is received and are kept alive to service

any future messages.
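The difference between the two server activation modes can be made concrete with a small sketch. This is illustrative Python, not Remoting code: `ServerActivator` and `Hits` are hypothetical names, and the mode strings simply reuse the terms from the text.

```python
class ServerActivator:
    """Creates the target object on the server in response to incoming messages."""

    def __init__(self, factory, mode):
        self.factory = factory  # parameterless constructor for the type
        self.mode = mode        # "singlecall" or "singleton"
        self.instance = None

    def handle(self, method, *args):
        if self.mode == "singlecall":
            target = self.factory()          # fresh object for every message
        else:
            if self.instance is None:
                self.instance = self.factory()  # created on the first message only
            target = self.instance
        return getattr(target, method)(*args)


class Hits:
    def __init__(self):
        self.count = 0

    def hit(self):
        self.count += 1
        return self.count


singleton = ServerActivator(Hits, "singleton")
singlecall = ServerActivator(Hits, "singlecall")
singleton_results = [singleton.handle("hit") for _ in range(3)]
singlecall_results = [singlecall.handle("hit") for _ in range(3)]
```

Only the singleton mode preserves state between calls, which is why it is the mode that resembles a long‐lived shared object.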

These two activation mechanisms have parallels in G2:P2P’s objects. Client Ac‐

tivated Objects are the standard activation mechanism in G2:P2P. Normally

Remoting requires any client activated types to be registered on the server be‐

fore they are created. However, unlike standard Remoting applications, G2:P2P

networks do not have prior knowledge of which types will be required by ap‐

plications. Therefore the Remoting activation sequence must be augmented to

allow types to be dynamically registered when their activation requests arrive.

Server Activated Objects are similar to the well‐known objects described in

section 3.2.3. In particular, the singleton style Remoting objects have the same

activation mechanism as G2:P2P server activated objects. This allows G2:P2P

to take advantage of the existing Remoting methods for connecting to well‐

known objects. There are two methods available – registering well‐known

types so that a “normal” construction call actually generates a proxy to a well‐

known object (Listing 3‐3) or using the RemotingServices.Connect method to

generate a proxy (Listing 3‐4).

RemotingConfiguration.RegisterWellKnownClientType(
    typeof(WellKnownType), "G2P2P://Rik/WKOServer");
WellKnownType server = new WellKnownType();

LISTING 3‐3 – CONNECTING TO WELL KNOWN OBJECTS USING TYPE REGISTRATION

WellKnownType server = (WellKnownType)RemotingServices.Connect(
    typeof(WellKnownType), "G2P2P://Rik/WKOServer");

LISTING 3‐4 – CONNECTING TO WELL KNOWN OBJECTS USING ‘CONNECT’ API

Object activation is a particularly low level operation in Remoting and there

are no simple extension points for customising the activation process. Object

activation is performed by sending a special activation message to an activator

object. Each application domain hosts a single activator with the object iden‐

tifier “RemoteActivationService.rem”. Since this standard activator re‐

quires the type being activated to be registered before it receives an activation

message an alternate activator object must be substituted which will allow un‐

registered types to be activated.

The G2:P2P CustomActivatorSink is a server side Remoting sink which allows a
custom activator object to be used instead of the standard .NET Remoting
activator. The sink monitors all incoming messages on the volunteer until an
activation message is received. Activation messages are identified by their
target URI. When a message which targets the “RemoteActivationService.rem”
object is received the CustomActivatorSink redirects the message to a new URI
where a custom activator object is hosted. This redirection essentially replaces
the standard Remoting activator with a custom G2:P2P activator (see Figure 3‐9).
The CustomActivatorSink is designed as a general purpose Remoting sink and
could be used in other Remoting applications which require custom activation
facilities.

FIGURE 3‐9 – ACTIVATION VIA CUSTOMACTIVATORSINK

Once the CustomActivatorSink is installed all activation messages are received
by the G2:P2P activator object. Unlike the standard activator, the G2:P2P
activator does not check that types are registered before creating them.
However, the G2:P2P activator must still register any objects it creates with
the Remoting infrastructure. This allows any future method calls on the object
to be correctly handled by the Remoting infrastructure without requiring further
involvement by the G2:P2P activator.

FIGURE 3‐10 ‐ G2:P2P REMOTING STRUCTURE

Figure 3‐10 shows the Remoting process with the custom G2:P2P items in‐

cluded. As can be seen when comparing to Figure 3‐8, G2:P2P takes advantage

of a considerable amount of the standard Remoting structure, simply inserting

its own custom channel which makes use of the G2:P2P Pastry network.
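The redirection performed by the CustomActivatorSink, and the on‐demand registration performed by the custom activator, can be sketched in a few lines. The Python below is an illustration under stated assumptions, not the prototype's implementation: messages are modelled as dictionaries, `endpoints` stands in for the objects reachable by URI, and the identifier `G2P2PActivator.rem` and all class and function names are hypothetical. Only `RemoteActivationService.rem` comes from the text.

```python
STANDARD_ACTIVATOR = "RemoteActivationService.rem"  # identifier quoted in the text
CUSTOM_ACTIVATOR = "G2P2PActivator.rem"             # hypothetical identifier


class CustomActivator:
    """Creates objects of types that were not registered in advance."""

    def __init__(self, available_types):
        self.available_types = available_types  # type name -> class
        self.registered = {}                    # types registered on demand
        self.objects = []                       # objects made known to the infrastructure

    def activate(self, type_name):
        # Register the type when its first activation request arrives.
        cls = self.registered.setdefault(type_name, self.available_types[type_name])
        instance = cls()
        self.objects.append(instance)  # later calls can now be dispatched normally
        return instance


def custom_activator_sink(message, endpoints):
    """Redirect activation messages; pass every other message through unchanged."""
    if message["uri"] == STANDARD_ACTIVATOR:
        message = dict(message, uri=CUSTOM_ACTIVATOR)
    return endpoints[message["uri"]](message)


class Task:
    pass


activator = CustomActivator({"Task": Task})
endpoints = {
    CUSTOM_ACTIVATOR: lambda m: activator.activate(m["type"]),
    "echo.rem": lambda m: m["body"],
}
created = custom_activator_sink({"uri": STANDARD_ACTIVATOR, "type": "Task"}, endpoints)
echoed = custom_activator_sink({"uri": "echo.rem", "body": "hi"}, endpoints)
```

The sink touches only the target URI, which is what makes it reusable: any application needing its own activation policy could place a different activator behind the redirected URI.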

3.5 Conclusion

Pure P2P networks have proven to be effective at solving issues of scalability in

a variety of situations. In this chapter I have presented a fully decentralised

cycle‐stealing framework, G2:P2P, which performs its brokerage function using

the actual volunteer machines in the network. This decentralisation naturally

scales and also provides a solid foundation for extra features not available in

previous cycle stealing frameworks.

Applying a decentralised model to cycle‐stealing requires a programming

model which will take advantage of the direct links available between the vo‐

lunteer machines. G2:P2P has addressed this with a distributed object model

which allows for direct inter‐object communication using method calls. The

programming model has been designed so that it will integrate well into exist‐

ing distributed object models such as .NET Remoting and Java RMI.

A prototype implementation of the framework has been developed and inte‐

grated into the .NET Remoting infrastructure. This demonstrates that the pro‐

gramming model integrates well with existing distributed object models and

provides a test bed for evaluating the effectiveness of the framework. The pro‐

totype framework includes a custom implementation of the Pastry P2P overlay

which includes extensions for supporting communication between a Pastry

network and machines external to the network.

In summary, the following aspects of peer‐to‐peer cycle‐stealing have been ex‐

amined and addressed by G2:P2P:

• How to perform the “broker” role of typical cycle‐stealing systems in a

decentralised manner. This includes being able to distribute work to a

decentralised network of volunteers whilst ensuring reasonable load

balance between those volunteers even during regular arrival and de‐

parture of members of that network.

• Providing a communication model which takes advantage of the possi‐

bilities of a peer‐to‐peer network. Notably allowing direct communica‐

tion between running jobs. To facilitate this, a distributed object pro‐

gramming model is provided to allow non‐expert programmers to easi‐

ly use direct communication in a cycle‐stealing environment. This in‐

cludes providing a job addressing scheme which allows for direct ad‐

dressing of objects, even when their hosts may be frequently changing.

• Supplying a well‐known object facility to allow for objects which will be

addressed from all parts of an application without explicitly passing

references.

• Ensuring the system is fault tolerant, that is, it is robust in highly dy‐

namic peer‐to‐peer networks, imperfect network conditions and unreli‐

able volunteers.

• Providing a system of cleaning up objects which are no longer in use in

the system to prevent resource wastage.

All of these aspects have been addressed by the prototype system described in

Section 3.4.

4 Fault Tolerance

Fault tolerance mechanisms on P2P networks are generally restricted to the
routing layer. Considerable work has been done on ensuring messages are
delivered between nodes reliably despite node dropouts; at the application
layer, however, there is far less work. P2P applications generally do not require
stringent guarantees. If a node is removed from a network then it is assumed
that any data that node held is either replicated at another node, or the
application can continue without that data.

Cycle‐stealing frameworks, however, require a reliable foundation to ensure
client applications complete correctly. For example, when a peer leaves a
file‐sharing network the network may lose access to certain files that only that
specific peer holds. For most networks this is accepted as a normal restriction
of file sharing; for a cycle‐stealing framework, however, that missing peer may
have been hosting crucial state for a running application. The application will
now be unable to finish until that peer returns or the missing portion is
somehow recovered.

Considerable literature is available on fault tolerance in the distributed com‐

puting community. Fault tolerance mechanisms in distributed computing gen‐

erally fall into three broad categories – replication, checkpointing, and message

logging. A common requirement of checkpointing and message logging is that

there is a reliable storage mechanism which will maintain the information re‐

quired to recover from faults should they occur. Typically a reliable central

machine is used or the computing nodes use a local storage mechanism. If local

storage is used it is assumed that the machines will recover from any faults

relatively quickly and return to the computation. These approaches cannot be

used directly by fully decentralised P2P networks as there are no reliable cen‐

tral machines available and hosts are expected to be predominantly transient.

To reliably store data on a decentralised network it must be replicated across

multiple nodes. Although this does not provide an absolute guarantee it will be


sufficient provided enough nodes are used. The problem with this approach is

that it requires additional network communication for storing and recovering

the data. For traditional checkpointing or logging schemes this communication

pressure would cause significant performance degradation due to the fre‐

quency that logging data must be stored and the size of that data.

This chapter describes a fault tolerance scheme for G2:P2P designed to mini‐

mise performance impact, particularly from network communication. The

scheme is unique in providing considerable application level fault tolerance on

a highly dynamic P2P platform. Previous pure P2P applications have not pro‐

vided any significant form of fault tolerance, largely because they have not required it. Comparable cycle stealing systems such as Awan et al.(34) have not addressed fault tolerance sufficiently. Existing P2P fault tolerance work has been limited to protecting the routing layer, not the applications. The scheme can also be customised to provide varying levels of protection at varying performance costs.

4.1 Background

Distributed fault tolerance schemes fall into three broad categories:

replication, checkpointing and message logging.

Replication schemes work by creating replicas of any work items and submit‐

ting them to multiple processors. If an error occurs on one of these replicas it is

simply ignored as there are other replicas which are still completing the work.

If all replicas fail before completion then the work must be restarted from the beginning. This method is commonly used in embarrassingly parallel cycle‐stealing

frameworks where it is sometimes referred to as eager scheduling. Replication

has the added benefit of assisting in fraud detection. By accepting results from

multiple distinct processors, the controlling process can compare results and

detect inconsistencies. This ability has been particularly important in large

scale public Internet cycle‐stealing projects such as SETI@home which have

been targeted with significant fraud attempts(40).
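The result comparison described above can be sketched as a simple vote over replica results. This is a minimal illustration in Python rather than code from any particular cycle‐stealing system; the function name `accept_result` and the `quorum` parameter are assumptions made for the example.

```python
from collections import Counter

def accept_result(replica_results, quorum=2):
    """Return the agreed result if at least `quorum` replicas match, else None.

    `replica_results` holds one entry per replica; None marks a replica that
    failed before returning anything.
    """
    completed = [r for r in replica_results if r is not None]
    if not completed:
        return None  # every replica failed: the work item must be restarted
    value, votes = Counter(completed).most_common(1)[0]
    return value if votes >= quorum else None

agreed = accept_result([42, None, 42])    # one replica failed, two agree
disputed = accept_result([42, 41, None])  # inconsistent results are rejected
```

A failed replica is simply ignored, while a disagreement between completed replicas is treated as a possible fraud attempt and no result is accepted.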


Replication schemes, however, are unsuitable for systems with inter‐process communication. When communication is involved a task cannot simply be restarted, as this will cause it to resend any outgoing communication that it had previously sent. The task will also miss incoming messages which had been handled before the crash. While this can be overcome in a number of ways (which are outlined in the following section on checkpointing schemes), replication has a further problem with communication. For communication to work correctly any messages sent to a task would have to be directed to all replicas of that task. This would require significant extra overhead in communication and tracking of tasks. Since replication’s main benefit is simplicity, adding these extra complications means it offers few benefits over the checkpointing and message logging approaches.

I will now examine checkpointing and message logging in more detail. Check‐

pointing schemes periodically record the state of a system. When an error oc‐

curs the most recent recorded state is loaded and processing restarts from that

point. Message logging schemes track all communication between tasks in an

application. If an error occurs then just the affected task is restarted and any

communication messages are replayed to restore it to its pre‐error state. These

two recovery schemes have a number of sub‐classes which supply better per‐

formance under different circumstances.

4.1.1 Checkpoint Based Protocols

In a checkpoint based system the entire recovery process relies on a set of

checkpoints. There are two main variants of the checkpoint based class: uncoordinated checkpointing and coordinated checkpointing(41). A third variant, communication‐induced checkpointing, attempts to combine these two approaches to simultaneously minimise communication and persistent storage space. Table 1 at the end of this section provides a quick comparison of these three variants.


Uncoordinated Checkpointing

In uncoordinated checkpointing each process independently chooses when to

take checkpoints. This can allow processes to decide the optimal point at

which to checkpoint, e.g. when the amount of state information is minimal.

During rollback an uncoordinated system must determine which checkpoint on

each process is required to find a consistent system state. There are a number

of disadvantages to uncoordinated checkpointing:

1. There is the possibility of creating a domino effect when rolling back an

uncoordinated system. This is discussed further below.

2. Useless checkpoints may be taken that will never be part of a consistent

system state.

3. Multiple checkpoints must be maintained for each process to ensure that a

consistent system state can be obtained.

Domino Effect

The domino effect can occur with uncoordinated checkpointing during the re‐

covery stage. When an object is rolled back it must invalidate any messages

that the object had sent since its last checkpoint because those messages may

no longer be valid. This means that the receivers of these messages must also

be rolled back since they are relying on invalid data. Figure 4‐1 demonstrates

how a failure of one object, P2, can invalidate a message. In this case the other

process, P1, would be rolled back to its last checkpoint to reach a consistent

system state.

FIGURE 4‐1 – SIMPLE ROLLBACK EXAMPLE

[Figure: timelines of processes P1 and P2 exchanging messages m1 and m2, marking the failure, the invalidated message and the consistent system state.]

The domino effect starts to appear if the last checkpoint is not part of a consis‐

tent system state. This can occur when each rollback invalidates new messages

which in turn cause additional rollbacks. Figure 4‐2 shows an example of a sys‐

tem that would suffer from the domino effect. When P2 fails it invalidates the

message, m4. This in turn causes a rollback which invalidates m3. It can be seen

that all of the messages passed so far in the system are invalidated in turn until

the initial state is reached.

FIGURE 4‐2 – DOMINO ROLLBACK

There are two options for avoiding the domino effect. Coordinated checkpoint‐

ing allows processes to communicate to ensure that the recovery line1 is ad‐

vanced. Message logging schemes allow processes to log messages so that roll‐

back of one process does not necessarily require another to rollback, even if

they have exchanged messages.
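The cascading rollback behind the domino effect can be sketched as a fixed‐point computation. This is a minimal illustration in Python; the function `recovery_line`, the use of integer timestamps and the example schedule are all assumptions made for the sketch and are not taken from the figures.

```python
import math

def recovery_line(checkpoints, messages, failed):
    """Compute the checkpoint each process must roll back to.

    checkpoints: {process: sorted checkpoint times, starting with 0 (initial state)}
    messages:    list of (sender, send_time, receiver, receive_time)
    Returns {process: rollback time}; math.inf means no rollback is needed.
    """
    line = {p: math.inf for p in checkpoints}
    line[failed] = checkpoints[failed][-1]      # failed process uses its last checkpoint
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            # The message was sent after the sender's rollback point (so it is
            # invalidated) but the receiver's state still reflects its receipt.
            if sent > line[sender] and received <= line[receiver]:
                line[receiver] = max(t for t in checkpoints[receiver] if t < received)
                changed = True
    return line

checkpoints = {"P1": [0, 6], "P2": [0, 3, 9]}
messages = [
    ("P1", 1, "P2", 2),    # m1
    ("P2", 4, "P1", 5),    # m2
    ("P1", 7, "P2", 8),    # m3
    ("P2", 10, "P1", 11),  # m4
]
domino = recovery_line(checkpoints, messages, failed="P2")
print(domino)
```

With this schedule each rollback invalidates an earlier message, so both processes end up at their initial state (time 0) even though only P2 failed.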

Coordinated Checkpointing

In coordinated checkpointing processes must coordinate to ensure that every

checkpoint is part of a consistent system state. This allows previous check‐

points to be discarded as the latest checkpoint is always part of a consistent

system state. However coordinated checkpointing requires far more commu‐

nication during normal execution. Before a process may checkpoint it must

contact every other process to build the global checkpoint. This can introduce a significant degradation in performance even when there have been no failed processes.

1 Recovery Line: The set of checkpoints that represent a consistent system state.

The main benefit of coordinated checkpointing is its complete avoidance of the

domino effect and also the simplicity of rollback. When a process fails it

merely informs all processes to rollback to their last checkpoint and restart

execution. The cost of this is an appreciably more expensive checkpointing procedure. This cost is significant since it will impact on the system even during

fault‐free execution.

Communication‐Induced Checkpointing

Communication‐induced checkpointing (CIC)(42) encapsulates a third ap‐

proach which allows processes to independently checkpoint, while avoiding

the domino effect. CIC systems define two different types of checkpoints, local

and forced. Local checkpoints correspond to uncoordinated checkpointing,

that is, they can be taken at any time independently of any other process.

Forced checkpoints are triggered when the process determines that a check‐

point is required to prevent the domino effect. CIC protocols use extra proto‐

col specific data piggybacked on the normal communication to evaluate the

need for a forced checkpoint.

Briatico, Ciuffoletti and Simoncini (BCS) presented the first attempt at a CIC

protocol(43). BCS requires each process to maintain a logical clock which is

used to timestamp that process’ checkpoints. The entire protocol can be ex‐

plained with three rules:

1. The clock starts at zero and is incremented by 1 whenever a local

checkpoint is taken.

2. The clock value is piggybacked on any outgoing message.

3. If the process receives a message with a higher clock value, a forced

checkpoint is taken and its own clock is updated to equal the received

clock.


This protocol ensures that a set of checkpoints with the same timestamp is

guaranteed to provide a consistent system state.
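The three BCS rules can be sketched directly in code. This is a minimal illustration in Python; the class `BcsProcess` and its method names are hypothetical, and checkpoints are recorded only as (timestamp, kind) pairs rather than as saved state.

```python
class BcsProcess:
    def __init__(self, name):
        self.name = name
        self.clock = 0           # rule 1: the logical clock starts at zero
        self.checkpoints = []    # (timestamp, kind) pairs

    def local_checkpoint(self):
        self.clock += 1          # rule 1: increment on every local checkpoint
        self.checkpoints.append((self.clock, "local"))

    def send(self, payload):
        return (self.clock, payload)   # rule 2: piggyback the clock value

    def receive(self, message):
        sender_clock, payload = message
        if sender_clock > self.clock:  # rule 3: higher clock forces a checkpoint
            self.clock = sender_clock
            self.checkpoints.append((self.clock, "forced"))
        return payload

p1, p2 = BcsProcess("P1"), BcsProcess("P2")
p1.local_checkpoint()
p1.local_checkpoint()
p2.receive(p1.send("work"))    # P2 is behind, so it takes a forced checkpoint
p1.receive(p2.send("reply"))   # clocks are equal, so no checkpoint is forced
```

After this exchange both processes hold a checkpoint with timestamp 2, which is the set the protocol treats as a consistent system state.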

The BCS protocol is an example of an Index‐based CIC protocol. Model‐based

protocols also exist which rely on preventing certain patterns from forming

within the system however it has been proven by Hélary, Mostefaoui and Ray‐

nal that these two types are fundamentally equivalent(44).

                                Uncoordinated   Coordinated   CIC
Requires Only Last Checkpoint
Avoids Domino Effect
Simple Recovery
Avoids Extra Communication

TABLE 1 ‐ CHECKPOINTING OVERVIEW

4.1.2 Log‐Based Protocols

Log‐based recovery protocols extend checkpointing protocols by creating a log of non‐deterministic events, such as messages received from other processes. When a failure occurs this log is used to replay the events to a process, removing the need for related processes to be rolled back. All log‐based protocols rely on a concept called piecewise determinism, which is explained below.

Log‐based protocols can be classified into three groups:

1. Pessimistic Log‐Based Protocols

2. Optimistic Log‐Based Protocols

3. Causal Log‐Based Protocols

The major difference between pessimistic and optimistic protocols is whether they can create orphaned processes. Orphaned processes can complicate rollback and are explained below.


Generally log‐based protocols also use periodic checkpointing to limit the

number of events that need to be replayed during rollback. Table 2 at the end

of this section provides an overview of the three groups of log‐based protocols.

Piecewise Determinism

Piecewise determinism (PWD) is a necessary assumption for all log‐based re‐

covery protocols. It assumes that all nondeterministic events can be identified

and stored by the system. For a process that has no direct interaction with the

outside world2 piecewise determinism can be expressed as follows:

That the only non-determinism in a process arises from the nondeterministic order in which messages

are delivered

The actual data stored for an event is called a determinant and must contain

sufficient information to allow the system to replay the event in the case of a

failure. This information could include messages from other processes or in‐

ternal data such as random seeds.

Orphaned Processes

Rollback of a process when using log‐based recovery requires the determi‐

nants of all events received since the last checkpoint. Processes become or‐

phaned when one or more determinants that would be required to recover a

process are not available on persistent storage. This may occur if a process

starts processing a message before the message is persisted or if another proc‐

ess which was responsible for storing the determinant fails.

Failure of an orphaned process may require other processes to be rolled back so that the system can return to a consistent system state.

2 Interaction with the outside world may include displaying something to a user or storing/deleting data from a database or disk or retrieving the value of the system clock.


Pessimistic Log‐Based Protocols

Pessimistic protocols take the view that failures can occur after any nondeterministic event. The simplest form, synchronous logging, saves each event to persistent storage before it is provided to the process. This protects the system completely against orphaned processes.

The primary benefit of pessimistic protocols is their simplicity. Rollback in a

pessimistic system only ever requires the latest checkpoint and will not affect

any other process. However there can be a significant performance penalty,

especially if saving the determinants takes anything but a trivial amount of

time.

Optimistic Log‐Based Protocols

Optimistic protocols utilise the observation that failures are relatively rare. In

optimistic protocols determinants are logged asynchronously to persistent

storage. This introduces the chance that failure will occur before the determi‐

nant is stored, and hence orphaned processes may be created.

Typically optimistic protocols will keep a cache of determinants which is peri‐

odically flushed to persistent storage. This greatly reduces the overhead of

logging during failure‐free execution, especially where writing to persistent

storage is a costly procedure. However, the possibility of orphaned processes

increases the complexity of rollback and may also require multiple processes

to be rolled back to obtain a consistent state.
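The difference between synchronous (pessimistic) and asynchronous (optimistic) logging can be sketched as follows. This is a minimal illustration in Python; `StableStorage`, both logger classes and the `flush_every` parameter are hypothetical names, and persistent storage is simulated with an in‐memory list that counts writes.

```python
class StableStorage:
    """Stand-in for persistent storage; counts how often it is written."""
    def __init__(self):
        self.determinants = []
        self.writes = 0

    def write(self, batch):
        self.determinants.extend(batch)
        self.writes += 1

class PessimisticLogger:
    def __init__(self, storage):
        self.storage = storage

    def on_event(self, determinant, deliver):
        self.storage.write([determinant])   # persist first (synchronous)
        deliver(determinant)                # only then hand it to the process

class OptimisticLogger:
    def __init__(self, storage, flush_every=3):
        self.storage, self.flush_every, self.cache = storage, flush_every, []

    def on_event(self, determinant, deliver):
        self.cache.append(determinant)
        deliver(determinant)                # delivered before it is stable
        if len(self.cache) >= self.flush_every:
            self.storage.write(self.cache)  # periodic flush of the cache
            self.cache = []

events = ["e1", "e2", "e3", "e4"]
pess_store, opt_store = StableStorage(), StableStorage()
pess, opt = PessimisticLogger(pess_store), OptimisticLogger(opt_store)
handled = []
for e in events:
    pess.on_event(e, handled.append)
    opt.on_event(e, handled.append)
```

After four events the pessimistic logger has made four storage writes, while the optimistic logger has made one and still holds `e4` only in its cache; a crash at that point would lose that determinant, which is how an orphaned process can arise.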

Causal Log‐Based Protocols

Causal protocols combine some of the advantages of both pessimistic and op‐

timistic protocols. In particular, they allow asynchronous logging like optimis‐

tic protocols while still avoiding the creation of orphaned processes. However

they still require a complex recovery process which can rely on information

from the determinant cache of related processes.


Causal protocols prevent orphaned processes by piggybacking non‐stable de‐

terminants in their determinant cache on the messages they send to other

processes. These determinants are entered into the receiver’s determinant

cache before delivering the message to the process. There are a number of dif‐

ferent methods of implementing this style of fault recovery with varying trade‐

offs(45). Ultimately causal protocols trade increased message size and roll‐

back complexity for asynchronous logging and orphan prevention.
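The piggybacking step can be sketched as follows. This is a deliberately simplified illustration in Python and not one of the published causal protocols; the class `CausalProcess`, the determinant format and the `mark_stable` method are assumptions made for the example.

```python
class CausalProcess:
    def __init__(self, name):
        self.name = name
        self.cache = {}       # determinant id -> determinant not yet on stable storage
        self.delivered = []
        self.next_seq = 0

    def send(self, payload):
        # Piggyback every non-stable determinant currently in the cache.
        return {"sender": self.name, "payload": payload,
                "determinants": dict(self.cache)}

    def receive(self, message):
        # Record the piggybacked determinants before delivering the message.
        self.cache.update(message["determinants"])
        det_id = (self.name, self.next_seq)
        self.next_seq += 1
        self.cache[det_id] = ("delivered", message["sender"], message["payload"])
        self.delivered.append(message["payload"])

    def mark_stable(self, det_ids):
        # Determinants that reached stable storage no longer need piggybacking.
        for det_id in det_ids:
            self.cache.pop(det_id, None)

a, b, c = CausalProcess("a"), CausalProcess("b"), CausalProcess("c")
b.receive(a.send("m1"))   # b now holds the determinant for receiving m1
c.receive(b.send("m2"))   # that determinant travels to c along with m2
```

If `b` now crashes before logging anything, `c` still holds the determinant `("b", 0)`, so `c` does not become an orphan. The cost is visible in the message: every send carries a copy of the sender's cache.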

                                 Pessimistic   Optimistic   Causal
Prevents Orphaned Processes
Relies Only on Last Checkpoint
Simple Recovery Procedure

TABLE 2 ‐ MESSAGE LOGGING OVERVIEW

4.2 Fault Tolerance in G2:P2P

The fault tolerance scheme for G2:P2P is designed to minimise the impact on

the system during normal operation. Since G2:P2P is designed to scale to In‐

ternet style networks any fault tolerance scheme needs to minimise any net‐

work communication. At the same time, when running G2:P2P on networks

with more reliable nodes it would be beneficial if the fault tolerance system

can adapt to provide less overhead while taking advantage of the more stable

conditions.

Previous cycle‐stealing systems have had fairly simplistic fault tolerance sys‐

tems. Generally when a volunteer is leaving the network they have been con‐

tent to either:

• Save a checkpoint of any application work performed which will be re‐

covered when the volunteer next joins the network or


• Discard any application work and allow the system to reallocate it to

another volunteer

Neither of these approaches works well when inter‐task communication is in‐

volved. Allowing objects to disappear for a significant period of time (option 1)

could seriously impact the rest of the application since they will be unable to

contact the missing portion. Option 2 allows objects to be always available,

however, since objects maintain state which may be influenced by the commu‐

nication they have received, the system cannot simply restart an object; effort

must be made to return the object to a valid state.

I have chosen to provide fault tolerance in G2:P2P using a message logging sys‐

tem. The message logging approach allows each object to independently per‐

form the necessary steps to ensure recovery is possible. This independent

processing is an important feature because of G2:P2P’s decentralised nature. If

a less distributed approach was used such as coordinated checkpointing it

would require significant network communication each time a checkpoint was

taken. Additionally, since G2:P2P expects that volunteers will be arriving and

departing from the network regularly it would be difficult to actually coordi‐

nate all of an application’s objects for checkpointing.

G2:P2P’s fault tolerance system differs from many cycle‐stealing systems by

requiring more commitment in regards to leaving the network. Whereas other

cycle‐stealing systems such as G2 allow volunteers to simply leave the network

without informing any other machine of their departure, G2:P2P expects vo‐

lunteers to perform some clean up operations before they disconnect from the

other volunteers. Realising that this cleanup will not always occur because of

unexpected problems or malicious volunteers, three distinct scenarios are

identified for nodes leaving the system. G2:P2P must correctly handle these

scenarios to ensure the system executes correctly. They are presented here in

decreasing order of likelihood:

1. Standard departure – Volunteer leaves gracefully


2. Volunteer crash – Volunteer crashes or temporarily loses connection,

but rejoins in a reasonable amount of time

3. Unexpected exit – Volunteer disappears due to crash, network outage or

maliciousness and does not return in a reasonable amount of time

The expected scenario is a standard departure and was described in section 3.3.

In this case no additional work is required. All objects are migrated to their

new host before the volunteer leaves and they start executing again when they

arrive on the new host. In the second case objects will be temporarily unavail‐

able while the volunteer restarts. During this period any messages sent to

these objects must be stored remotely pending the return of the node. When

the node returns it must recreate any objects from a local checkpoint of their

state. Section 4.2.1 outlines how the objects are recovered from the local

checkpoint.

The third scenario requires similar recovery steps to the second; however, in

this case the state of any objects must be available even when the original host

node is unavailable. If this level of fault tolerance is required then object state

must be stored on other nodes within the network or on a third‐party server.

Since every method call must be logged to ensure recovery is possible, storing

this data remotely requires potentially expensive network access during every

single method invocation. This remote storage on every call could cause signif‐

icant performance degradation during the normal execution of the system. The

message logging method has specifically been designed to minimise the cost of

logging method invocations.

4.2.1 Logging Procedure

For the purposes of logging, each volunteer in a G2:P2P system has a local and

a remote storage mechanism. Local storage represents data which is held by

the volunteer itself. It is only accessible from that specific volunteer but should

remain available if the volunteer crashes unexpectedly. To provide this persis‐

tent storage it needs to be implemented using the file system rather than simp‐

ly memory based storage.


Remote storage contains data which must be available if a volunteer is re‐

moved from the network unexpectedly. It will be accessed by the system to re‐

cover any objects previously hosted on the absent volunteer. In centralised

systems this type of storage would be provided by a reliable node such as a

server, however in a decentralised system this option is unavailable. Instead

other volunteers within the network must be used for remote storage.

The approach used in G2:P2P is to replicate remote storage data across a num‐

ber of other nodes in the network. Since the goal of the remote store is to pro‐

vide data when a node fails, the nodes selected for holding the store must be

discoverable after the failure. The Pastry leaf set, i.e. the set of neighbours closest to a node in the Pastry address space, provides a suitable set of nodes. The leaf set is responsible for detecting when a node fails, so its members are the best candidates for storing the remote storage data. If other, unrelated nodes were

selected for storing the data then some mechanism for notifying those nodes of

the failure would be required. The neighbourhood set is an attractive choice

since it would provide better performance, however, membership of the

neighbourhood set is not deterministic and there is a greater chance of all

neighbourhood set nodes failing at the same time due to failures in network

connections.

The largest problem with this approach is the potential performance issues

that regular network communication raises. For this reason if a G2:P2P net‐

work is being run in a fully controlled environment it may be appropriate to

break from the pure P2P approach and use a central machine as the remote

storage facility. This central machine would provide a reliable storage mechan‐

ism, removing the need for replication across multiple nodes, but would intro‐

duce a potential bottleneck to the system. G2:P2P’s logging mechanism is de‐

signed to allow a variety of storage mechanisms to be substituted into the sys‐

tem, though each volunteer must be configured to use the same mechanism.

For Internet based G2:P2P networks it may be more efficient to split the data

using erasure coding (46; 47). Erasure codes take a block of data and trans‐

form it into a set of n blocks. The total size of these n blocks is actually greater


than the original block of data, however the original data may be recovered us‐

ing only a subset of these blocks. The number of blocks required to recover the

data is called the rate. This rate is configurable, with smaller rates requiring more CPU time to prepare but requiring fewer blocks to recover the data. By using erasure codes a G2:P2P volunteer would not need to send a complete copy

of its remote data to every member of its leaf set, saving costly network com‐

munication. However, whereas full replication would only require one leaf set

member to survive for fault recovery to be possible, erasure codes would re‐

quire multiple leaf set members to survive.
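The trade‐off can be illustrated with the simplest possible erasure code, a single XOR parity block. This is a minimal sketch in Python and is not the coding scheme of the cited work; the functions `encode` and `decode` are hypothetical, and the code produces three blocks of which any two recover the data.

```python
def encode(data: bytes):
    """Split data into two halves and add one XOR parity block (3 blocks total)."""
    if len(data) % 2:
        data += b"\x00"                      # pad to an even length
    half = len(data) // 2
    a, b = data[:half], data[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def decode(blocks, length):
    """Recover the data when at most one of the three blocks is missing (None)."""
    a, b, parity = blocks
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, parity))
    elif b is None:
        b = bytes(x ^ y for x, y in zip(a, parity))
    return (a + b)[:length]                  # drop any padding

data = b"checkpoint"
blocks = encode(data)
blocks[1] = None                             # one leaf set member has failed
restored = decode(blocks, len(data))
```

Each of the three holders stores a block half the size of the original, so 1.5 times the data is sent in total instead of three full copies. In exchange, two of the three holders must survive, whereas full replication needs only one.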

Temporary Volunteer Crash

The second departure scenario, a temporary volunteer crash, can be handled

primarily through local storage. Each incoming message (method invocation or

method result) to the object is persisted to local storage. Additionally, the sys‐

tem takes periodic checkpoints of the object and also stores these in the local

storage space. When a failure occurs the system simply recreates the object

from its latest checkpoint and replays any messages received since that check‐

point was taken. All incoming messages can be discarded when a new check‐

point is taken. The benefit of this approach is that it has very low overhead. No

network communication is needed during normal running or at recovery time.

However, the approach is only appropriate when the machines are in a tightly

controlled environment where it is highly likely that any machines that crash

will return promptly. If a machine does not rejoin the network then it is likely

that any applications that had objects on that machine will not complete cor‐

rectly.
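The local logging and replay procedure described above can be sketched as follows. This is a minimal illustration in Python, not G2:P2P's implementation; the classes `Tally`, `LocalStore` and `LoggedObject` are hypothetical, and the local store is held in memory here although the text requires it to be file based.

```python
import copy

class Tally:
    """Toy application object whose state depends on the messages it receives."""
    def __init__(self):
        self.total = 0

    def handle(self, amount):
        self.total += amount

class LocalStore:
    """Stand-in for file-backed local storage (in memory for brevity)."""
    def __init__(self):
        self.checkpoint = None
        self.log = []

class LoggedObject:
    def __init__(self, obj, store):
        self.obj, self.store = obj, store
        self.take_checkpoint()

    def deliver(self, message):
        self.store.log.append(message)    # persist the message before handling it
        self.obj.handle(message)

    def take_checkpoint(self):
        self.store.checkpoint = copy.deepcopy(self.obj)
        self.store.log.clear()            # earlier messages are no longer needed

    @classmethod
    def recover(cls, store):
        restored = cls.__new__(cls)
        restored.store = store
        restored.obj = copy.deepcopy(store.checkpoint)
        for message in store.log:         # replay messages since the checkpoint
            restored.obj.handle(message)
        return restored

store = LocalStore()
live = LoggedObject(Tally(), store)
live.deliver(5)
live.take_checkpoint()
live.deliver(3)
live.deliver(4)
recovered = LoggedObject.recover(store)   # simulate a restart after a crash
```

Recovery touches only the local store, which is why this scheme needs no network communication but fails if the machine never returns.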

Applications could be designed to handle such failures, but in most cases this

places significant burden on the application programmer. There are certain

application domains, such as evolutionary computing, where providing appli‐

cation level fault tolerance is reasonably simple. These applications may wish

to forego the more comprehensive fault tolerance techniques described below

in favour of the less expensive scheme presented here. G2:P2P’s configurable

fault tolerance mechanisms allow system administrators to make this decision.


Unexpected Volunteer Departure

To handle the more general case where a volunteer unexpectedly leaves the

network the remote storage mechanism must be used. The goal remains to mi‐

nimise the network traffic involved since that is by far the most expensive as‐

pect of the system. To achieve this as much data is stored on local storage as

possible and sufficient information is stored in the remote storage to retrieve

this data when required.

When a volunteer crashes, some other volunteer must be nominated to detect

this and be responsible for recreating any objects the crashed volunteer was

hosting. In Pastry it is the members of a node’s leaf set which are first to dis‐

cover if a node has crashed. Since these nodes are also responsible for storing

object checkpoints it is simple for them to decide independently who is now

responsible for the missing objects. This decision can be made using the stan‐

dard procedure of hosting objects on the node whose ID is closest to the ob‐

ject’s ID. When a volunteer dies its closest neighbours iterate through each of

its object IDs and test whether they are now responsible for hosting that object.

This testing can be performed without any single node coordinating the recov‐

ery process.
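The uncoordinated responsibility test can be sketched as follows. This is a minimal illustration in Python; the 16‐bit ring, the function names and the tie‐breaking rule are assumptions made for the example (Pastry itself uses 128‐bit identifiers).

```python
RING = 2 ** 16   # illustrative identifier space, much smaller than Pastry's

def ring_distance(a, b):
    """Distance between two identifiers on a circular address space."""
    d = abs(a - b) % RING
    return min(d, RING - d)

def new_host(object_id, live_nodes):
    # Ties are broken by the lower node ID (an assumption) so that every
    # node evaluating this function independently reaches the same answer.
    return min(live_nodes, key=lambda n: (ring_distance(n, object_id), n))

def objects_to_recover(my_id, failed_objects, live_nodes):
    """Objects of the crashed volunteer that this node must now host."""
    return [o for o in failed_objects if new_host(o, live_nodes) == my_id]

live_nodes = [100, 300, 900]           # surviving leaf set members
failed_objects = [450, 520, 700]       # objects hosted by the crashed node
assignments = {n: objects_to_recover(n, failed_objects, live_nodes)
               for n in live_nodes}
print(assignments)
```

Each surviving node runs the same test on the same inputs, so every object is claimed by exactly one node without any coordinator.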

To recover the object the volunteer requires two things – the object’s latest

checkpoint and any messages received by the object since that checkpoint was

taken. While all of this information could be stored in the volunteer’s remote

storage this would require significant network traffic, particularly for storing

the complete details of each remote method call. A method call’s details include

the identity of the method along with all of the parameters being passed to the

method.

The cost of sending method invocation details to remote storage can be

avoided by storing the method details with the originator of the method call.

This volunteer can safely store the details in its local storage and provide them

if they are required. Even if this originating volunteer crashes, the object will

be replayed and will regenerate the messages. When an object is being recovered it can obtain the message details by requesting them from the calling objects. This significantly decreases the amount of data being stored in the remote storage. Objects simply need to store a list of all objects that have sent them messages since their last checkpoint.

The following four points outline the details of what data is stored. Figure 4‐3 provides a visual overview of the system.

• The method caller stores the details of the method invocation in its local store.

• The method receiver stores the caller object’s identity and a unique identifier for the message in its local store.

• If this is the first message received by this calling object then the caller’s GUID and method’s identifier are stored in remote storage.

• The result of the method call is stored on both the caller’s and the receiver’s local store.

FIGURE 4‐3 – OVERVIEW OF G2:P2P MESSAGE LOGGING

To recover the object the node will need the latest checkpoint along with the list of which objects have sent messages to the object, both of which are available through the original object’s remote store. The object is first recovered from the checkpoint. The volunteer then sends a request to every object which

had sent the object messages requesting them to be resent. When an object is

checkpointed it can include the identifier for the latest message it has handled

from each object. This can be used to ensure only outstanding messages are

resent. As these messages are received they are replayed on the new object.
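The resend step can be sketched as follows. This is a minimal illustration in Python, not G2:P2P's implementation; the `Caller` class, the `recover` function and the assumption that message identifiers increase monotonically per caller are all hypothetical. The sketch also replays messages caller by caller and ignores ordering between different callers.

```python
class Caller:
    """Keeps the details of every call it makes in its own local store."""
    def __init__(self, name):
        self.name = name
        self.next_id = 1
        self.sent = {}                     # receiver -> [(message_id, details)]

    def call(self, receiver, details):
        message_id = self.next_id          # assumed to increase per caller
        self.next_id += 1
        self.sent.setdefault(receiver, []).append((message_id, details))
        return message_id

    def resend_after(self, receiver, last_handled):
        """Messages the receiver had not yet handled at its last checkpoint."""
        return [(m, d) for m, d in self.sent.get(receiver, []) if m > last_handled]

def recover(checkpoint, callers):
    """checkpoint: {'state': [...], 'last_handled': {caller_name: message_id}}"""
    state = list(checkpoint["state"])
    for caller in callers:                 # callers come from the remote store's list
        last = checkpoint["last_handled"].get(caller.name, 0)
        for _, details in caller.resend_after("target", last):
            state.append(details)          # replay on the recreated object
    return state

a = Caller("A")
for details in ("a1", "a2", "a3"):
    a.call("target", details)
checkpoint = {"state": ["a1"], "last_handled": {"A": 1}}
recovered_state = recover(checkpoint, [a])
```

Because the checkpoint records that message 1 from `A` was already handled, only messages 2 and 3 are resent and replayed.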

As the messages are replayed they may cause messages to be sent from the ob‐

ject. These messages may have already been sent before the crash. To prevent

the method being re‐executed, volunteers must test any messages they receive against their list of results. If a result is already available for a message the volunteer simply returns that result without executing the message.

Storing all results for incoming method calls could result in a substantial bur‐

den for volunteers. To prevent these results stores from becoming too large

there needs to be a method of clearing results that are no longer needed. Re‐

sults are only needed if the calling object crashes between when it makes the

call and is next checkpointed. Therefore, when an object is checkpointed it sends a message to any objects it has communicated with indicating the latest message it has processed. Objects receiving this message can then remove any previous results. These messages simply consist of a message ID and

should not place any significant burden on the network.
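Duplicate suppression and result pruning can be sketched together. This is a minimal illustration in Python; the `Receiver` class and its method names are hypothetical, and the sketch assumes that a checkpoint notice makes every result up to and including the given message ID unnecessary.

```python
class Receiver:
    def __init__(self):
        self.results = {}       # (caller, message_id) -> stored result
        self.executions = 0

    def invoke(self, caller, message_id, func, *args):
        key = (caller, message_id)
        if key in self.results:             # duplicate from a replaying caller
            return self.results[key]        # return the result, do not re-execute
        self.executions += 1
        self.results[key] = func(*args)
        return self.results[key]

    def on_checkpoint_notice(self, caller, latest_id):
        # The caller has checkpointed, so it will never replay these calls.
        stale = [k for k in self.results if k[0] == caller and k[1] <= latest_id]
        for key in stale:
            del self.results[key]

r = Receiver()
first = r.invoke("A", 1, pow, 2, 3)
replayed = r.invoke("A", 1, pow, 2, 3)      # same message sent again after recovery
r.invoke("A", 2, pow, 2, 4)
r.on_checkpoint_notice("A", 1)              # caller A checkpointed after message 1
```

The replayed call returns the stored result without running the method a second time, and the checkpoint notice lets the receiver discard the result for message 1 while keeping message 2.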

Unexpected Volunteer Returning

It is possible that a volunteer may return after its leaf set has decided it has

crashed. This suggests that there is the potential for the same object to exist on

multiple nodes, however, the existing object recovery and migration procedure

will correctly handle this event. When a volunteer returns it must notify its leaf

set as part of the standard Pastry arrival procedure. This notification will trig‐

ger the normal object migration procedure. As soon as this migration is trig‐

gered all incoming messages for the object will be correctly handled by

queuing on the returning volunteer until the migrating object is ready. In this

situation the returning node will not make use of its local store, however, there

is the possibility to optimise the recovery by using a combination of the return‐

ing node’s local store and the object’s new host’s logs.


A further complication occurs if a node has been temporarily disconnected, but

does not realise its leaf set has marked it as crashed. This situation is unlikely

since the timeout period for marking a node as crashed is monitored by both

the leaf set and the potentially crashed node, so there is a very short period in which the node could return. However, if this does occur then objects may be created on two nodes, similar to the situation described in the previous paragraph. Once again, this does not cause any correctness issues. All volunteers periodically send heartbeat messages to their neighbours. When the leaf set

receives a heartbeat from a node it thought had crashed it will immediately go

through the same process for correcting the duplicated objects. In this case the

object simply needs to pass on any messages that were received to the return‐

ing volunteer.

This logging mechanism provides robust protection for G2:P2P systems but

does entail a relatively expensive recovery mechanism. For this reason it is ex‐

pected that the full recovery process only be used in extraordinary circums‐

tances. In standard execution volunteers should use the previous mechanisms

of graceful departure. Volunteers should also allow some time for crashed vo‐

lunteers to return and recover from their own local store as much as possible.

This will limit the expensive recovery process to only a few rare occasions such

as network outages, severe volunteer crashes and malicious volunteers. The

low overhead of this comprehensive logging during normal execution ensures

that it can be made available for those extraordinary circumstances without

significant detrimental effect on the system. Section 6.3 will examine the cost

of each logging mechanism during normal operation.

4.3 Checkpointing

While G2:P2P relies on message logging for the majority of its fault tolerance

needs, periodic checkpointing is also used to limit the number of messages that

must be replayed during the recovery stage. This checkpointing mechanism is

also used during object migration.

G2:P2P checkpointing is built upon standard .NET serialization. This reliance

on .NET serialization has one significant restriction – it is unable to take a

snapshot of executing threads. This means that any ongoing execution must be

halted before objects are checkpointed in G2:P2P. The simplest method of halt‐

ing execution is to simply set a flag when a checkpoint is required for the ob‐

ject. Once this flag is set the system will not start executing new methods on

the object.
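As a minimal sketch of this flag-based scheme, the class below postpones any call that arrives while a checkpoint is required and takes the snapshot only once no method is executing. CheckpointGate and all of its members are hypothetical names introduced for illustration; the actual G2:P2P dispatcher is not shown in this thesis text, and the snapshot itself is passed in as a delegate rather than performed with .NET serialization.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of G2:P2P: models the "checkpoint required" flag.
public sealed class CheckpointGate
{
    private readonly object sync = new object();
    private readonly Queue<Action> postponed = new Queue<Action>();
    private int running;              // methods currently executing on the object
    private bool checkpointRequired;  // the flag described in the text

    public int PostponedCount { get { lock (sync) { return postponed.Count; } } }

    // Runs the call now, or postpones it if a checkpoint is pending.
    // Returns true if the call was executed immediately.
    public bool Invoke(Action call)
    {
        lock (sync)
        {
            if (checkpointRequired) { postponed.Enqueue(call); return false; }
            running++;
        }
        try { call(); }
        finally { lock (sync) { running--; } }
        return true;
    }

    // Sets the flag; from now on no new method is started.
    public void RequestCheckpoint() { lock (sync) { checkpointRequired = true; } }

    // Takes the checkpoint if no method is executing, then releases postponed calls.
    // Returns false if methods are still running and the caller must retry later.
    public bool TryCheckpoint(Action takeSnapshot)
    {
        Queue<Action> toRun;
        lock (sync)
        {
            if (!checkpointRequired || running > 0) return false;
            takeSnapshot();
            checkpointRequired = false;
            toRun = new Queue<Action>(postponed);
            postponed.Clear();
        }
        while (toRun.Count > 0) Invoke(toRun.Dequeue());
        return true;
    }
}
```

The sketch makes the weakness of the simple scheme visible: if a running method waits for one of the postponed calls, TryCheckpoint can never succeed.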

When all methods are complete standard .NET serialization is used to take a checkpoint of the object’s current state. .NET serialization is the standard method for capturing an object’s state and only requires programmers to mark their types with a “serializable” attribute. It does not require any special code to be written; however, programmers may customise how their types are serialized if they wish. .NET serialization also captures any objects which are referenced by the object being serialized.

Blocking and waiting for methods to complete introduces a potentially serious

problem – deadlock. If a thread executing a method on the object is blocked

waiting for another incoming method call then blocking these incoming calls

will prevent the thread from ever completing. There are two options available

for avoiding this problem:

1. disallow blocking inside G2:P2P objects, or

2. create some mechanism for indicating which methods may block and

which may trigger the completion of those blocks. When a checkpoint is

required any non‐blocking methods will still be allowed to execute but

blocking methods will be queued. Once all blocking methods are com‐

pleted then all incoming methods are queued till the checkpoint is taken.

There are a variety of ways in which the second approach may be taken; however, they all considerably complicate the application programmer’s job and are hence undesirable. Obviously application programmers must correctly identify blocking methods and methods which trigger blocks to end, but it may be desirable to mark some other methods as “blocking”. For example, if a method takes a long time to complete then it should not be started while there is an impending checkpoint. Therefore application developers should mark such a method as “blocking” even though it does not actually block. The scheme is also complicated by methods which both block and trigger the completion of blocks.

The first approach is considerably simpler but restricts the application pro‐

grammer since they are no longer allowed to use blocking calls. Instead, re‐

placement mechanisms must be provided to overcome the restriction on block‐

ing. In the following sections I will demonstrate a replacement mechanism for

the restriction on blocking which is based on familiar techniques and is not

overly burdensome for the application programmer.

Another problem introduced by simply blocking till threads complete relates to

applications that have long running methods which take considerable time to

complete. Obviously waiting a long time for methods to complete is undesira‐

ble, especially when a volunteer is being shut down and must migrate its ob‐

jects as quickly as possible. It should be possible for the application program‐

mer to write long running methods in a manner which allows for checkpoint‐

ing during their execution.

In the following sections I will discuss methods for allowing application pro‐

grammers to use blocking and long running methods with G2:P2P without af‐

fecting the checkpointing scheme detrimentally.

4.3.1 Support for Blocking Methods

G2:P2P’s support for blocking methods is inspired by the existing support for

asynchronous methods in ASP.NET web services. This keeps in line with the

goal of providing a programming interface which is familiar to programmers

who are not familiar with parallel programming, but are familiar with existing

commercial frameworks.

Blocking would be used where an object needs to collate information from

multiple method calls, and hence must pause its execution until all of these me‐

thods have been called and the required information is available. It is this

communication model which G2:P2P must support, without using explicit

blocking.

Blocking Method Overview

The basic problem which must be solved is to take a method which would

normally include a block, and convert it into a form which does not include an

explicit block. When considering this it is important to realise that for any me‐

thod with a block in it, there must be a corresponding method which triggers

the end of this block. This trigger must be supplied by a call to another method

on the same object. This extra method will be considered when developing the

alternate approach to blocking.

The blocking method can be split into two sections – the section preceding the

block and the one following. Since G2:P2P only allows migration and check‐

pointing at method boundaries, both of these sections must be converted into

complete methods. To enable this some alternative mechanism for simulating a

block must be provided. This process is analogous to the asynchronous imple‐

mentation of web services in ASP.NET which will be used as an inspiration for

G2:P2P’s blocking mechanism.

If we are using a custom mechanism for the actual block then a custom me‐

chanism must also be used to trigger this block. This means that the blocking

model must alter both the blocking method and the method triggering the

block.

The following section gives details on how the blocking method will be split

into two methods corresponding to the sections before and after the block, and

how the triggering method is altered to support this change.

Alternate Blocking Details

The basic approach for blocking methods is to split the method into two parts –

the section before blocking and the section afterwards. These two parts are

split into separate methods and use G2:P2P to perform the actual block and

trigger the second part. Consider the following object which uses blocking:

    public class BlockingObject : ContextBoundObject
    {
        // Initially non-signalled, so WaitOne blocks until TriggerBlock is called.
        AutoResetEvent waitHandle = new AutoResetEvent(false);
        object incomingData;

        public object BlockingCalculation(object input)
        {
            PreProcess(input);
            waitHandle.WaitOne();
            return ProcessData(input, incomingData);
        }

        public void TriggerBlock(object extraData)
        {
            incomingData = extraData;
            waitHandle.Set();
        }
    }

LISTING 4‐1 – NON‐G2:P2P STYLE BLOCKING

In this example the BlockingCalculation method includes a blocking call

(waitHandle.WaitOne). This causes the calculation to pause midway through its

processing until another method (TriggerBlock) is called. The TriggerBlock

method is used to provide some additional data which is used to calculate the

final result of BlockingCalculation. By using a blocking call the application de‐

veloper has allowed for some of the processing to start while another object is

still generating part of the input.

This is a basic example of how blocking may be used in a G2:P2P object, if it

were allowed. The blocking method (BlockingCalculation) has 2 parts; the cal‐

culation of its return value and the actual returning of that value separated by a

block which is triggered by another method (TriggerBlock). To actually use this

in G2:P2P these two parts must be placed in separate methods using a particu‐

lar naming convention and a blocking handle must be returned to G2:P2P so it

can detect when to call the second part. Listing 4‐2 demonstrates how this ex‐

ample could be implemented in a G2:P2P object.

    public class BlockingObject : ContextBoundObject
    {
        G2AsyncResult waitHandle;
        object incomingData;

        #region BlockingCalculation Implementation
        [AsyncImpl]
        public object BlockingCalculation(object input)
        {
            // G2:P2P will not allow this content to be called.
            throw new NotImplementedException();
        }

        public IAsyncResult BeginBlockingCalculation(object input)
        {
            PreProcess(input);
            // Carry the input as state so the end method can read it from AsyncState.
            waitHandle = new G2AsyncResult(input);
            return waitHandle;
        }

        public object EndBlockingCalculation(IAsyncResult ar)
        {
            return ProcessData(ar.AsyncState, incomingData);
        }
        #endregion

        public void TriggerBlock(object extraData)
        {
            incomingData = extraData;
            waitHandle.Complete();
        }
    }

LISTING 4‐2 – G2:P2P STYLE BLOCKING

As you can see, the first step is separated into a method titled BeginBlocking‐

Calculation and the second into EndBlockingCalculation. The begin method

must return an IAsyncResult which encapsulates the blocking handle. It also

may contain state information which will be passed to the end method. Since

asynchronous methods are an implementation detail, the original BlockingCal‐

culation method is still defined as part of the object’s interface. Clients of the

object actually perform calls on this method. G2:P2P will intercept these calls

and redirect them through the actual asynchronous implementation. The

AsyncImpl attribute indicates to G2:P2P that it should perform this interception.

When G2:P2P sees an AsyncImpl attribute it assumes that there will be ‘begin’

and ‘end’ methods of the same name with a specific signature:

• The ‘begin’ method will have a return type of IAsyncResult

• The ‘end’ method will have the same return type as the actual method

• The ‘begin’ method will have the same parameter list as the actual

method, except:

o ‘ref’ parameters will be passed as standard ‘in’ parameters

o ‘out’ parameters will be removed from the list

• The ‘end’ method parameter list will contain:

o An IAsyncResult as its first member. This will be the object that

was passed back by the Begin method

o All ‘out’ and ‘ref’ parameters from the actual method
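The signature rules listed above can be expressed as a reflection check. The following is a sketch under stated assumptions rather than the check G2:P2P performs: AsyncImplAttribute is redeclared here as a stand-in for the real attribute, and AsyncConvention, HasValidPair and Sample are hypothetical names. The check also assumes the prototype method is not overloaded.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Stand-in for the G2:P2P attribute; the real attribute's definition is not shown in the text.
[AttributeUsage(AttributeTargets.Method)]
public sealed class AsyncImplAttribute : Attribute { }

public static class AsyncConvention
{
    // Returns true if 'type' declares BeginX/EndX methods matching the rules
    // listed above for the prototype method called 'name'.
    public static bool HasValidPair(Type type, string name)
    {
        MethodInfo proto = type.GetMethod(name);
        if (proto == null || proto.GetCustomAttribute<AsyncImplAttribute>() == null) return false;

        // Strip the by-reference marker so that 'ref int' is compared as 'int'.
        Func<ParameterInfo, Type> plain = p =>
            p.ParameterType.IsByRef ? p.ParameterType.GetElementType() : p.ParameterType;

        // Begin: every parameter except 'out', with 'ref' passed by value.
        Type[] beginArgs = proto.GetParameters().Where(p => !p.IsOut).Select(plain).ToArray();
        MethodInfo begin = type.GetMethod("Begin" + name, beginArgs);
        if (begin == null || begin.ReturnType != typeof(IAsyncResult)) return false;

        // End: an IAsyncResult first, then every 'out' and 'ref' parameter, still by reference.
        Type[] endArgs = new[] { typeof(IAsyncResult) }
            .Concat(proto.GetParameters().Where(p => p.ParameterType.IsByRef).Select(p => p.ParameterType))
            .ToArray();
        MethodInfo end = type.GetMethod("End" + name, endArgs);
        return end != null && end.ReturnType == proto.ReturnType;
    }
}

// Example type following the convention, with one 'ref' and one 'out' parameter.
public class Sample
{
    [AsyncImpl]
    public int Calc(int a, ref int b, out int c) { throw new NotImplementedException(); }
    public IAsyncResult BeginCalc(int a, int b) { return null; }
    public int EndCalc(IAsyncResult ar, ref int b, out int c) { c = 0; return 0; }
}
```

For the Sample type, the check on "Calc" looks for BeginCalc(int, int) and EndCalc(IAsyncResult, ref int, out int), which is exactly the pair declared.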

The IAsyncResult returned by the begin method is monitored by G2:P2P. When

this object is triggered by another method G2:P2P will queue up a call to the

end method. The results of the end method are then returned to the client ex‐

actly as if they had been calculated by the original prototype method.

G2:P2P saves the IAsyncResult as part of the object’s checkpoint. If an object is

migrated or a checkpoint recovered the new host will automatically start

monitoring any of the object’s IAsyncResults so it can correctly trigger the end

methods as normal.
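To make the monitoring behaviour concrete, the sketch below shows one way a completion handle of this kind could be written. SketchAsyncResult is an illustrative stand-in and not the G2AsyncResult class itself; the Monitor method and the callback used to queue the end method are assumptions. The callback and wait handle are excluded from serialization, so a new host would call Monitor again after restoring a checkpoint, which mirrors the re-monitoring step described above. The sketch is not thread-safe.

```csharp
using System;
using System.Threading;

// Illustrative stand-in for G2AsyncResult; the real class is not shown in the text.
[Serializable]
public sealed class SketchAsyncResult : IAsyncResult
{
    [NonSerialized] private ManualResetEvent waitHandle;      // recreated on demand after a restore
    [NonSerialized] private Action<IAsyncResult> onComplete;  // re-attached by the new host
    private bool completed;

    public object AsyncState { get; set; }
    public bool IsCompleted { get { return completed; } }
    public bool CompletedSynchronously { get { return false; } }
    public WaitHandle AsyncWaitHandle
    {
        get { return waitHandle ?? (waitHandle = new ManualResetEvent(completed)); }
    }

    // Called by the framework to start (or, after migration, resume) monitoring.
    public void Monitor(Action<IAsyncResult> queueEndMethod)
    {
        onComplete = queueEndMethod;
        if (completed) onComplete(this);   // completion may have preceded monitoring
    }

    // Called by the triggering method, for example TriggerBlock in Listing 4-2.
    public void Complete()
    {
        if (completed) return;
        completed = true;
        if (waitHandle != null) waitHandle.Set();
        if (onComplete != null) onComplete(this);
    }
}
```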

G2:P2P also supports methods with multiple blocking points. The following

section outlines the QueueMethodCall API including how it can be used for pro‐

viding multiple blocking points in a method.

4.3.2 Support for Long Running Methods

Most long running methods can be described as either a series of sequential

steps, or as a loop. Using QueueMethodCall, application programmers can de‐

velop both of these styles while still allowing G2:P2P to checkpoint and mi‐

grate objects relatively promptly.

QueueMethodCall is used by an object to call one of its own methods. Unlike

simply calling the method itself, QueueMethodCall uses the standard logging

mechanism, just as if the method had been called by another remote object.

This means that if there is a pending checkpoint, the call is postponed just like

any other call. By breaking a single method call into a sequence of steps and

calling each step from the previous one using QueueMethodCall, the applica‐

tion programmer can provide the opportunity for G2:P2P to checkpoint or mi‐

grate the object even during a long running process. Listing 4‐3 demonstrates

using QueueMethodCall in a long running method.

    public class LongSequentialTask : ContextBoundObject
    {
        public void LongTaskPart1()
        {
            // Performs some work
            G2P2PChannel.Current.QueueMethodCall(new NullDelegate(LongTaskPart2));
        }

        public void LongTaskPart2()
        {
            // Performs some work, producing intermediateValue
            // ObjectDelegate is assumed here: a delegate type for a method taking
            // one object argument, by analogy with IntDelegate in Listing 4-4.
            G2P2PChannel.Current.QueueMethodCall(new ObjectDelegate(LongTaskPart3), intermediateValue);
        }

        public void LongTaskPart3(object intermediateValue)
        {
            // Completes work
        }
    }

LISTING 4‐3 – LONG RUNNING G2:P2P METHOD

Loop constructs can also be implemented using QueueMethodCall in a tail call

style. Instead of a conventional loop construct, the programmer simply inserts

a call to QueueMethodCall on the end of the loop body.

    public class InterruptableLoop : ContextBoundObject
    {
        public void Loop(int count)
        {
            // Do the body of the loop
            if (count > 0)
                G2P2PChannel.Current.QueueMethodCall(new IntDelegate(Loop), count - 1);
        }
    }

LISTING 4‐4 – INTERRUPTABLE G2:P2P LOOP

By using a combination of asynchronous methods and QueueMethodCall the

application programmer can create long running methods that are interrup‐

table, but still return values.

    public class LongRunningWithReturn : ContextBoundObject
    {
        [AsyncImpl]
        public object LongTask()
        {
            throw new NotImplementedException();
        }

        public IAsyncResult BeginLongTask()
        {
            G2AsyncResult ar = new G2AsyncResult();
            G2P2PChannel.Current.QueueMethodCall(new G2Callback(LongTaskPart1), ar);
            return ar;
        }

        public void LongTaskPart1(G2AsyncResult ar)
        {
            // Some processing
            G2P2PChannel.Current.QueueMethodCall(new G2Callback(LongTaskPart2), ar);
        }

        public void LongTaskPart2(G2AsyncResult ar)
        {
            // Some processing
            ar.AsyncState = calculatedValue;
            ar.Complete();
        }

        public object EndLongTask(G2AsyncResult ar)
        {
            return ar.AsyncState;
        }
    }

LISTING 4‐5 – LONG RUNNING INTERRUPTABLE TASK WITH RETURN VALUE

Finally, since QueueMethodCall executes methods just as if they had been

called from another object, it can be used to execute asynchronous methods.

This allows us to simulate a method that has multiple blocking points by sim‐

ply splitting each blocking section into separate asynchronous methods.

Listing 4‐6 demonstrates a method with multiple blocking points. The main

entry method, MultiBlock, is implemented as an asynchronous method. The be‐

gin portion initiates a call to Block1 which is also implemented asynchronously.

The first blocking point occurs at the end of BeginBlock1. This block is ended

by the call to Trigger1 which causes EndBlock1 to be executed by the G2:P2P

framework. When EndBlock1 finishes it sets a flag allowing Trigger2 to be

called and enters the 2nd block. When Trigger2 is executed it completes the

outer trigger starting the call to EndMultiBlock. This pattern can be continued

to allow as many blocks as required.

    public class MultipleBlockingPoints : ContextBoundObject
    {
        G2AsyncResult outerTrigger, block1;
        bool trigger2Ready;

        [AsyncImpl]
        public object MultiBlock()
        {
            throw new NotImplementedException();
        }

        public IAsyncResult BeginMultiBlock()
        {
            outerTrigger = new G2AsyncResult();
            G2P2PChannel.Current.QueueMethodCall(new G2Callback(Block1), outerTrigger);
            return outerTrigger;
        }

        public object EndMultiBlock(G2AsyncResult ar)
        {
            return ar.AsyncState;
        }

        [AsyncImpl]
        public void Block1(G2AsyncResult outerResult)
        {
            throw new NotImplementedException();
        }

        public IAsyncResult BeginBlock1(G2AsyncResult outerResult)
        {
            // Assign the field (not a local) so that Trigger1 can complete it.
            block1 = new G2AsyncResult();
            return block1;
        }

        public void EndBlock1(G2AsyncResult ar)
        {
            // Do work then return which blocks until Trigger2 is called
            trigger2Ready = true;
        }

        public void Trigger1()
        {
            block1.Complete();
        }

        public void Trigger2()
        {
            if (trigger2Ready)
                outerTrigger.Complete();
        }
    }

LISTING 4‐6 – METHOD WITH MULTIPLE BLOCKING POINTS

QueueMethodCall also allows for objects to initiate multiple threads by simply

queuing up multiple calls. G2:P2P volunteers can be configured to allow multi‐

ple threads to run on each volunteer. If this is done then the queued method

calls will be started concurrently. Threads started this way will still work cor‐

rectly with the checkpointing mechanism assuming they use QueueMethodCall

and asynchronous methods appropriately. As with any other call, extra threads

are not allowed to block, although short thread synchronisation calls are ac‐

ceptable provided care is taken to prevent deadlock.

4.4 Conclusion

Fault tolerance is an essential feature of a cycle stealing system. The applica‐

tion of a decentralised network model to cycle stealing creates a new and diffi‐

cult situation for providing fault tolerance. The techniques applied in centra‐

lised cycle stealing and in traditional high‐performance computing are not

suitable in a decentralised network, primarily due to their reliance on a relia‐

ble storage mechanism. Additionally, there has been little investigation into

providing fault tolerance on decentralised systems because their previous ap‐

plications have not required it to any significant level.

In this chapter I have presented a fault tolerance system for cycle stealing on a

fully decentralised network. The system adapts existing message logging sys‐

tems for use on a fully decentralised network. It provides reliability through

data replication, but minimises the network traffic required for this replication

to ensure it does not cause unreasonable performance degradation.

The system provides two distinct tolerance levels with differing performance

costs. These levels allow the system to take advantage of more reliable under‐

lying networks when available.

Section 6.3 shows the results of performance tests which measure the overhead caused by the fault tolerance system.

5 Improving Locality

Unlike previous cycle stealing frameworks, G2:P2P supports direct communi‐

cation between executing jobs. While some collections of objects have largely

ad‐hoc communication patterns, a significant proportion have more structured

communication such as nearest neighbour or tree style patterns. These pat‐

terns provide an opportunity for optimising the performance of inter‐object

communication by improving the locality of communicating objects. This opti‐

misation is important as communication costs can be a major drain on the per‐

formance of a G2:P2P application.

In this chapter I present a series of locality optimisations designed to improve

the communication efficiency, and hence the overall performance, of applica‐

tions using G2:P2P. Locality refers to the physical relationship between two

objects. This physical relationship manifests itself in the latency and band‐

width of their communication channels. When an object calls a method on a

remote object the method details, including the parameters, must be trans‐

ferred to the target object. Ideally this object would be hosted on the same

node so this transferral would not require any network communication. Of

course, in the extreme, simply hosting all objects on the same volunteer would

ensure excellent communication channels for all inter‐object communication,

but would completely remove all parallelism from the system. Therefore a

balance must be struck. When there are more objects than volunteers and mul‐

tiple objects must be hosted on the same node, objects which are likely to

communicate should be chosen, rather than unrelated ones.

The importance of optimising for locality in parallel programs is well understood and there has been extensive work in this area (48; 49). However, in the

context of cycle‐stealing systems and more generally DHT based P2P systems

this topic has been completely unexplored. The DHT concept allows a reasona‐

ble amount of flexibility in the structure of the network. Ultimately, just the

ability to perform key‐value lookups on the distributed network must be maintained. Within this guideline there is room for improving locality by altering

how keys are assigned to nodes and objects.

Ultimately, the goal of this work is to extend the decentralised cycle‐stealing

principles developed in the previous chapters so that objects which are com‐

municating regularly with each other are more likely to have better communi‐

cation channels. This means that the objects must be hosted on nodes which

are physically closer to each other, such as both within the same organisation,

or hosting multiple objects on the same node. This improved locality is

achieved through alterations to the generation of ObjectIDs and through

changes to the underlying DHT layer. Whilst the work is motivated by parallel

programming, it is entirely likely that the locality ideas proposed here will

have wider applicability to other DHT applications.

The contributions of this chapter are:

• A method for improving object locality on DHT overlay networks, and

• A method for improving the physical relationship between nodes in a

DHT overlay network without reducing the distribution of those nodes

across the DHT’s address space.

Both of these contributions are applied to cycle‐stealing through integration

into G2:P2P. The optimisations result in measurable performance gains, par‐

ticularly for applications involving inter‐object communication.

5.1 Related Work

There are a number of systems which deal with optimizing distributed object

systems. JavaSymphony (50) provides a programming paradigm for distri‐

buted, parallel computing based on distributed objects. JavaSymphony objects

may be mapped to hosts using either an automatic mapping or by relating

them to other objects. This allows the programmer to indicate that objects

which have frequent communication should be hosted near each other. To

provide this JavaSymphony relies on a virtual architecture to be defined for the

system. This virtual architecture requires significant effort to set up and is not

suitable for highly dynamic networks like G2:P2P. A manual migration facility

is also provided based on this virtual architecture.

Mobile agent systems such as ObjectSpace Voyager (51) and Aglets (52) pro‐

vide distributed object systems with specific support for data locality optimisa‐

tion, but are not designed for high frequency, fault tolerant communication be‐

tween the agents. Migration in mobile agents generally requires specific inter‐

vention by the programmer indicating which host to migrate to unlike the au‐

tomated optimizations proposed by this chapter.

Other cycle‐stealing systems like Charlotte (25) and Javelin (28) are limited in

the type of applications they can support and hence do not provide specific lo‐

cality optimizations.

5.2 Optimisations

G2:P2P contains two distinct addressing schemes – the virtual DHT addressing

scheme provided by Pastry and the underlying physical addressing of the

transport layer (TCP/IP). To improve the communication channels between

G2:P2P objects the physical locality of the objects must be improved, however

because the physical layer is abstracted by the DHT layer the locality optimisa‐

tions must address both layers.

Four optimisations have been developed which each improve the object locali‐

ty in different ways. The optimisations consist of both changes to the G2:P2P

object addressing scheme as well as changes to the underlying DHT layer. The

DHT layer optimisations are general in nature and hence could be applied to

other DHT based applications which would benefit from object locality. For ex‐

ample, a DHT based data storage mechanism may benefit from improving the

locality of related files stored in the system. This could improve the speed of

retrieval request for related files similar to how data caches benefit from spa‐

tial locality.

The optimisations described here rely on a priori knowledge of how objects

will communicate during their lifetime. In many applications objects communicate in well known patterns such as ring or mesh layouts, or at least in some

pattern that is determined by the design of the application. These patterns can

be used when objects are created to optimise their layout on the G2:P2P virtual

address space.

Since the G2:P2P address space natively has a ring layout (i.e. the NodeID ad‐

dress space), applications that use other communication patterns must map

their objects onto a ring layout. It is expected that mappings for common

layouts could be provided by libraries, removing this burden from the applica‐

tion programmer.
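As an example of the kind of mapping such a library might offer, the sketch below places the cells of a two-dimensional mesh onto ring positions in a boustrophedon ("snake") order. This particular mapping is an assumption chosen for illustration and is not one defined by G2:P2P; RingMapping and MeshToRing are hypothetical names. It keeps horizontal neighbours adjacent on the ring, while vertical neighbours are generally further apart.

```csharp
using System;

public static class RingMapping
{
    // Position on the ring (0 .. rows*cols-1) of mesh cell (row, col), using a
    // snake order: even rows run left to right, odd rows run right to left.
    public static int MeshToRing(int row, int col, int cols)
    {
        int offset = (row % 2 == 0) ? col : cols - 1 - col;
        return row * cols + offset;
    }
}
```

The resulting ring position would then be used as the object's index within its group when ObjectIDs are assigned in order.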

5.2.1 Optimisation 1 – ObjectID Ordering

The first optimisation is designed to increase the likelihood that two objects

which communicate with each other are hosted on the same volunteer. This is

achieved by altering how ObjectIDs are generated for objects. Since an object’s

ID determines which node an object will be hosted on, the chance of hosting

two communicating objects on the same volunteer can be increased by assign‐

ing them numerically adjacent IDs. However, this simple procedure must be

balanced with the need to distribute the objects amongst all of the volunteers

in the network.

In the original object distribution mechanism described in Chapter 3, load ba‐

lancing was achieved by generating random IDs for objects. These random IDs

ensured a relatively even spread of objects over the entire address space, and

hence across all of the volunteers in the network. This random assignment was

also notable because it achieved this spread without using any central service

for generating the IDs. It is essential that any changes to the ObjectID genera‐

tion procedure still disperse the IDs across the entire address space to main‐

tain the system’s load balance while maintaining its decentralised nature.

It is common for there to be significantly more objects on the system than there are volunteers. This means that each volunteer will be hosting multiple objects. With the random ObjectID generation the objects assigned to a volunteer are relatively unlikely to have any direct communication. Instead all communication will require expensive messages being sent across the network. Figure 5‐1 shows communication links between a set of objects randomly assigned to a G2:P2P network. These objects communicate in a nearest neighbour ring manner, but this communication scheme is not apparent from their layout on the network.

FIGURE 5‐1 – UNOPTIMISED RING COMMUNICATION

The optimised ID generation procedure makes use of the knowledge the application programmer has of their design’s communication patterns to increase the number of intra‐volunteer communication links. The generation procedure takes a collection of communicating objects and assigns them uniformly around the entire ObjectID address space – that is, each ID is chosen so there is an equal spacing between each ID in the address space. The uniform assignment maintains the load balancing properties of the previous generation procedure, but it is the order of the IDs that is important. The group of objects is ordered according to their likely communication patterns.

Figure 5‐2 demonstrates how a set of objects communicating in a ring topology could be laid out using the optimised ID generation. In this figure the ring communication is obvious because it is directly reflected in their order on the underlying DHT address space. It can also be seen that in the two cases where a volunteer is hosting two objects these objects will be able to perform at least some of their communication without using the network. As the number of objects in the group increases the number of these intra‐volunteer communication links will increase faster than inter‐volunteer communication because more communicating objects will be hosted on the same volunteers. However, the load on each volunteer will stay even because the objects are still dispersed over the entire address space.

FIGURE 5‐2 – OPTIMISED RING COMMUNICATION

It is important to note that this optimised ID generation stays valid for the entire life of the application, even as volunteers arrive and depart from the network. Objects will be migrated to other volunteers but any time that multiple objects are hosted on the same volunteer they will benefit from intra‐volunteer communication links.

This optimisation relies entirely on altering the manner in which objects are assigned IDs at the point of creation. An obvious alternative to this is to change an object’s ID during the execution of the application to take advantage of the current communication patterns. However, updating objects’ IDs once they have been assigned complicates G2:P2P’s communication mechanisms. Inter‐object communication is addressed based on an object’s ID. This addressing allows objects to be contacted even though their host volunteer may change as volunteers come and go from the network. If an object’s ID is changed once there are other references to that object on the network some method of redirecting communication must be provided.

There are two options available for handling altered ObjectIDs:

1. Update all references to the object with the new ID, or

2. Leave a forwarding indicator at the object’s previous ID and forward

messages as they arrive.

Updating all the object references would potentially be a very expensive opera‐

tion. To do this the update message would either have to be passed to every

node in the network or a central list of all references would need to be kept.

Contacting every node in a large P2P network is prohibitively expensive and

hence not a reasonable option. Keeping reference lists would be possible but

would require close monitoring of all inter‐object messages to detect when

references are copied to other objects. Ultimately this would significantly in‐

crease the overhead of the G2:P2P communication system.

Message forwarding is a fairly simple approach and has been used in some

other distributed systems (51). However, message forwarding does require

each node to keep lists of forwarding addresses and increases the cost of com‐

munication by including more nodes in a message path. While message for‐

warding may be acceptable for small numbers of objects with limited reloca‐

tion the static method described here is significantly simpler and is suitable for

a large number of applications. Additionally, none of the optimisations de‐

scribed in this chapter preclude the future development of a dynamic reloca‐

tion system utilising message forwarding.

This optimisation significantly improves the locality of objects assigned to the

same volunteer but it does not address how objects that do not end up being

hosted on the same node can improve their communication performance.

However, the ObjectID ordering presented here forms the basis for further op‐

timisations presented later in this chapter which will improve inter‐volunteer

communication.

This optimised allocation scheme requires slightly more information than the

standard method. Previously the only information required to create a G2:P2P

object was the type of the object and any parameters to its constructor. To

generate a uniform distribution of ObjectIDs the ID generator needs objects to

be created as part of a group. These object groups are designated by the appli‐

cation programmer when they create the object. Initially just the size of the

group, m, is required so that the distance between each object in the address

space can be calculated by dividing the address space into m equal segments.

Once the generator has this information each object in the group can be

created as usual. The generator generates a random ID for the first object then

allocates each subsequent object by adding the calculated distance to the pre‐

vious ID. This will ensure that each object in the group is evenly spaced be‐

tween its neighbours (that is, neighbours within the group not including other

non‐related objects). This generation process is handled entirely by the creat‐

ing node without the need for global coordination.
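
The generation step described above can be sketched as follows. This is a minimal illustration only: the function name `spaced_object_ids`, the 128-bit ID length and the use of Python are assumptions made for the example and are not taken from the G2:P2P implementation.

```python
import random

ID_BITS = 128                      # assumed ID length for illustration
ADDRESS_SPACE = 1 << ID_BITS       # size of the circular address space


def spaced_object_ids(m, rng=random):
    """Return m ObjectIDs evenly spaced around a circular address space."""
    step = ADDRESS_SPACE // m              # distance between group members
    first = rng.randrange(ADDRESS_SPACE)   # random ID for the first object
    # Each later ID adds the step to the previous ID, wrapping at the end.
    return [(first + i * step) % ADDRESS_SPACE for i in range(m)]
```

Only the creating node's local state is used, which matches the observation that no global coordination is required. Clash handling is deliberately left out of the sketch.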

Since these object groups may coexist with other objects, ObjectID clashes are

still possible for each individual object. These are handled in the usual manner

by allocating the closest available ID. The slight variation in distance caused by

these clashes will not significantly alter the locality properties.

This optimisation changes the basic manner in which application programmers

instantiate G2:P2P objects. Whereas previously they have simply used the

standard .NET “new” operator, they must now provide extra information when

object groups which require this enhanced ObjectID assignment are created.

Whilst the current system requires application programmers to supply these

details manually, it is possible that some form of automated analysis, either

static or dynamic, could be used to generate these mappings. Such automated

analysis is beyond the scope of this thesis. Section 5.3 will outline how this op‐

timisation is made available to application developers.

5.2.2 Optimisation 2 – Object Collocation

The previous optimisation increases the chances that two objects that

communicate often will be located on the same volunteer; however, it does not

guarantee that they are always hosted together. The optimisation described in this

section is designed for situations in which a small set of objects communicate

so frequently with one another that they should always be collocated on the

same host.

Collocation is provided by adding extra bits to ObjectIDs, extending them

beyond the length of the NodeIDs. This can be thought of as turning ObjectIDs

into fixed‐point numbers rather than integers. Objects which share the same

integer part will always be mapped to the same volunteer.

Application programmers use this optimisation by indicating at instantiation

time that a group of objects should be collocated. This group of objects is then

assigned ObjectIDs with the same, randomly assigned, integer part, ensuring

they are placed on the same volunteer. The actual physical volunteer may

change over time as volunteers come and go, but the group of objects will al‐

ways be collocated. Note that this locality comes at the cost of load balance and

therefore parallelism. If a group of objects are assigned the same integer part,

they will always map to a single machine, even if there are a large number of

other nodes in the network which are completely unused. Usually, however,

these nodes will be populated by other objects used by the application or by

other applications.
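
A small sketch may help to make the fixed‐point view concrete. The bit lengths, the function names and the rule that an object is hosted by the node circularly closest to the integer part of its ID are all assumptions chosen for illustration; the actual G2:P2P mapping may differ in detail.

```python
import random

NODE_ID_BITS = 16    # assumed NodeID length, kept small for readability
FRACTION_BITS = 8    # assumed number of extra bits appended to ObjectIDs


def collocated_object_ids(count, rng=random):
    """Return `count` extended ObjectIDs sharing one random integer part."""
    if count > (1 << FRACTION_BITS):
        raise ValueError("group too large for the fractional part")
    integer_part = rng.randrange(1 << NODE_ID_BITS)
    # The fractional bits distinguish the members of the group.
    return [(integer_part << FRACTION_BITS) | i for i in range(count)]


def hosting_node(object_id, node_ids):
    """Pick the node whose ID is circularly closest to the integer part."""
    integer_part = object_id >> FRACTION_BITS
    space = 1 << NODE_ID_BITS

    def ring_distance(node_id):
        d = abs(node_id - integer_part)
        return min(d, space - d)

    return min(node_ids, key=ring_distance)
```

Because only the integer part takes part in the mapping, every member of the group resolves to the same node, and continues to do so if that node departs and the group is rehosted elsewhere.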

An alternative method of achieving this optimisation is to encapsulate these

objects inside a single remote object container that forwards messages to them.

The advantage of the approach described above is that each of the objects in

the group remains individually addressable by remote clients. Whether or not a

set of objects should always be collocated is a performance optimisation

which ideally should be kept separate from the application logic and the ab‐

stractions used.

5.2.3 Optimisation 3 – Volunteer Balancing

The assignment of ObjectIDs and NodeIDs described so far will lead to approx‐

imately the same number of objects being allocated to each volunteer. In some

cases however, particularly with smaller networks, this balance may not be

reached. The third optimisation presented is used to improve load balancing,

particularly with networks with only a small number of volunteers. More im‐

portantly, this optimisation also provides a basis for the fourth optimisation

which directly addresses the physical relationship between volunteers.

The goal of this optimisation is to achieve a more uniform spread of volunteers

around the entire DHT address space. Currently volunteers are assigned ran‐

dom IDs when they join the network. While this random assignment provides a

reasonable distribution this optimisation aims to provide a much more even

spread of volunteers. Because the set of volunteers is continually changing this

optimisation is an ongoing process during the lifetime of the network.

As discussed earlier, it is extremely problematic to change an object’s ID after it

is initially assigned because references to that object may have spread

throughout the entire network. It is however possible for a volunteer’s NodeID

to change at a later time. A simple way to explain why this is possible is to view

the process as equivalent to a volunteer leaving the network and then imme‐

diately rejoining (with a new NodeID). Obviously performing a full depar‐

ture/arrival process for each NodeID change would be expensive, however

with a little ingenuity a process can be developed that is much more efficient

than the naïve implementation hinted at above.

Since G2:P2P is a decentralised system, the uniform volunteer distribution

must be obtained through a series of local operations performed by each vo‐

lunteer. From the point of view of an individual volunteer, uniform distribution

is demonstrated by being equidistant between its two immediate neighbours.

If each individual volunteer in the network adjusts itself so it is evenly spaced

between its neighbours then the entire network will move towards a stable

uniform distribution. The beauty of this approach is that even if new volun‐

teers join the network or old ones depart the process automatically adjusts to

incorporate these changes without any special cases. Obviously in a real net‐

work a perfect uniform distribution is unlikely to form if there are constant

changes in membership however a reasonable distribution should be quickly

obtained and still provide the improved load balancing that is being aimed for.
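
The local rule can be illustrated with a small simulation. The sketch below is an assumption‐laden simplification: it uses only the two immediate neighbours (not the whole leaf set), treats positions as floating point numbers on a small ring, and lets volunteers move one at a time rather than simultaneously. The names `balance_round` and `gaps` are hypothetical.

```python
SPACE = float(1 << 16)   # assumed small circular address space


def clockwise_gap(a, b):
    """Distance travelled going clockwise from a to b on the ring."""
    return (b - a) % SPACE


def balance_round(ids):
    """Each volunteer, in turn, moves to the midpoint of its two immediate
    neighbours. `ids` must be in ring order and hold at least three nodes."""
    n = len(ids)
    for i in range(n):
        prev_id, next_id = ids[i - 1], ids[(i + 1) % n]
        ids[i] = (prev_id + clockwise_gap(prev_id, next_id) / 2) % SPACE
    return ids


def gaps(ids):
    """Clockwise gaps between consecutive volunteers."""
    n = len(ids)
    return [clockwise_gap(ids[i], ids[(i + 1) % n]) for i in range(n)]
```

Each move replaces two adjacent gaps with their average, so repeated rounds drive all gaps towards the same size while the relative order of the volunteers is preserved.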

Since each volunteer in the network is making adjustments to its NodeID inde‐

pendently it may take a large number of small adjustments to get to a steady

state. This convergence may be sped up by making use of the extra information

each volunteer is holding in its leaf set. The leaf set allows the volunteer to

predict the changes that its neighbours will be making, most likely simulta‐

neously, to their NodeIDs. To allow for this, instead of a volunteer simply plac‐

ing itself halfway between its two immediate neighbours, it measures the dis‐

tances to all of its known neighbours and attempts to balance them. This ba‐

lancing is particularly effective if there are large gaps in the leaf set that are not

immediately adjacent to the volunteer.

The following formulas describe how volunteers calculate their incremental

NodeID changes.

[Equations for Wclock, Wanti and the new NodeID are not recoverable from the source.]

Where Wdir indicates the weight of the respective half of the leaf set and LSetdir

represents the set of volunteers in each half of the leaf set. Essentially Wclock

and Wanti calculate the force exerted on the volunteer by the nodes in the

clockwise and anti‐clockwise half leaf set. The term Sleafset ensures that volun‐

teers that are further away express less force than nearby volunteers. The new

position for the volunteer is calculated such that those two forces will be equa‐

lised.

A weighting factor (nweight) is also included on each volunteer. This weight al‐

lows the volunteer movement process to take into account volunteers that

have greater processing power than normal (e.g. a cluster computer rather

than a PC). Such volunteers will be responsible for a greater portion of the ad‐

dress space and hence will automatically be assigned more objects.

Since it is imperative that the relative order of the volunteers is maintained,

volunteer movement is further restricted so that any single move may not tra‐

vel more than half the distance to the next volunteer. Without this restriction

volunteers may “cross over” each other as they independently calculate their

moves. This restriction may slow down the progress towards the global optimum,

however, once the network is well dispersed volunteer movements are gener‐

ally small anyway and this restriction is rarely encountered.
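
The restriction can be expressed as a simple clamp on each proposed move. The function below is a sketch under the assumption that IDs are integers on a ring of size `space`; the name `clamp_move` and its parameters are hypothetical.

```python
def clamp_move(current, target, prev_id, next_id, space):
    """Limit a proposed NodeID move to half the gap to the neighbour in the
    direction of travel, so that volunteers cannot pass each other."""
    forward = (target - current) % space     # clockwise distance to target
    backward = (current - target) % space    # anti-clockwise distance
    if forward <= backward:                  # moving clockwise
        limit = ((next_id - current) % space) // 2
        return (current + min(forward, limit)) % space
    limit = ((current - prev_id) % space) // 2   # moving anti-clockwise
    return (current - min(backward, limit)) % space
```

Since two neighbours moving towards each other are each limited to half of the gap between them, their independently calculated moves cannot carry one past the other.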

Now that it has been established that moving NodeIDs is a beneficial operation,

it is worth looking at a method of moving IDs which is more efficient than the

naive disconnect/reconnect approach described earlier in this section. It is es‐

sential that any enhanced moving scheme does not corrupt the routing infor‐

mation that each node maintains. In section 2.1.3 three categories of routing

information that is used by the Pastry network were described – the neigh‐

bourhood set, leaf set and routing table.

The neighbourhood set is not used directly in routing, rather it is used to pro‐

vide physical locality information and therefore cannot be corrupted by

changes in the virtual address of a node.

The leaf set holds the addresses of nodes whose NodeIDs are closest to the leaf

set’s owner. The leaf set is essential to the Pastry routing system. Provided the

leaf sets of all volunteers on the system stay valid then routing is guaranteed to

complete correctly. For this reason a volunteer keeps in regular contact with

its leaf set which means that any changes to a volunteer’s ID can be quickly

communicated to its leaf set. The one potential problem with moving a

volunteer’s ID is that it may change the order of volunteers in the address space. If this

happens then the leaf set will become invalid and routing may be corrupted.

The volunteer balancing mechanism must therefore include safeguards to en‐

sure that volunteers do not move their IDs past either of their immediate

neighbours.

The routing table maintains links to various volunteers across the system and

is the primary routing mechanism in Pastry. Unlike the leaf set, a volunteer

does not know which routing tables it is part of and therefore can’t inform

them of any changes to its NodeID. While this would appear to be a significant

problem, when the role of the routing table is examined it is apparent that in

most cases changing NodeIDs will not cause ruinous problems. The routing ta‐

ble is used to quickly communicate a message to the general area of its target

node. Since it is expected that most changes to a NodeID should be relatively

small, at least once a network is up and established, even when a routing table

entry is selected for hopping a message the small differences in NodeID will

not alter the overall effect of the message hop. That is, the message will still be

significantly closer to its target than it previously was. Even in the worst case

when a NodeID has changed significantly, routing will not be broken by deli‐

vering to an incorrect entry in the routing table. The message may simply pass

through more intermediate nodes than it ideally would.

To maintain the routing state in the long term nodes should be informed if they

have an incorrect entry. Incorrect entries can be detected by simply including

the NodeID each individual hop was sent to as part of the hop. When a volun‐

teer receives a message and detects that it was sent to the wrong ID because of

a change in NodeID, the volunteer replies to the previous node, providing it with

a new NodeID so it can update its state appropriately. The handling of the ac‐

tual message will continue normally.
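
The detection of stale entries can be sketched as below. The classes `Hop` and `Volunteer` and their fields are hypothetical, and the reply to the previous node is modelled as a direct update of the sender's table rather than as a network message.

```python
from dataclasses import dataclass, field


@dataclass
class Volunteer:
    name: str
    node_id: int
    routing_entries: dict = field(default_factory=dict)  # name -> believed ID
    inbox: list = field(default_factory=list)

    def receive(self, hop):
        if hop.sent_to_id != self.node_id:
            # The sender used an outdated NodeID: give it the current one.
            hop.sender.routing_entries[self.name] = self.node_id
        # Handling of the actual message continues normally.
        self.inbox.append(hop.payload)


@dataclass
class Hop:
    sender: Volunteer
    sent_to_id: int   # the NodeID this hop was addressed to
    payload: str
```

The only addition to a hop is the `sent_to_id` field, so the cost of detection is one comparison per received message.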

So it can be seen that provided the volunteer movement mechanism maintains

the relative order of volunteers there is no need to disconnect and reconnect

the volunteers. When a volunteer changes its ID it simply needs to inform its

leaf set of the change. It also needs to monitor incoming messages to detect if

other volunteers have outdated information. When such inconsistencies are

discovered the volunteer simply sends updated details to the other node.

The volunteer’s routing state must also be updated to reflect the new NodeID.

The leaf set is not affected because the volunteers with adjacent NodeIDs are

guaranteed not to change. Similarly, the neighbourhood set is unchanged since

it is only related to physical locality, not the NodeID. The amount of the routing

table affected is directly related to the distance moved by the node. Generally

the top rows of the routing table will stay constant, while bottom rows will

need to be repopulated. Repopulation of these rows can generally be done by

obtaining leaf set members’ routing tables during standard heartbeat messages

without much extra overhead.

Since each NodeID movement requires a certain amount of overhead in com‐

municating the change to its leaf set, and may trigger other costly events such

as object migrations, there should be some attempt to limit the frequency of

these movements. At some point the benefit to load balancing gained from a

movement is outweighed by the cost of performing the movement. Each node

movement incurs a cost in communicating the new ID to its neighbours, but

more importantly, each movement may result in the migration of objects be‐

tween volunteers. These object migrations can be relatively expensive and

hence should be avoided if the gain is not significant. This cost is difficult to

quantify at the volunteer level since it requires global knowledge of the net‐

work layout and load, but it highlights the need to place a minimum threshold

for any single NodeID movement. If, once an ID movement is calculated, it is

found to be under this threshold the calculation is simply discarded and no

movement occurs.

The threshold used depends on the number of volunteers in the network and

the size of the address space. As the number of volunteers in the network in‐

creases there should be less gap between volunteers and so smaller move‐

ments become more important. Conversely, the larger the address space the

larger the gap between nodes and hence small movements become less impor‐

tant. The size of the address space is directly related to the length of the Node‐

IDs. The threshold, T, can be derived from these two relationships:

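[Equation for T is not recoverable from the source; it combines the constant c, the number of volunteers and the size of the address space.]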

The constant c is a configuration value which represents the sensitivity of the

volunteer balancing mechanism. This sensitivity controls how quickly the vo‐

lunteers move to their optimum position. It is analogous to the Gain Factor in a

control system using proportional control. If the sensitivity is set too low the

volunteers will take a long time to reach their balanced position. A high

sensitivity will cause the volunteers to respond quickly to changes in their

balance, but may cause them never to reach a steady state since even small

changes in their neighbours will cause movement. These two behaviours are

well understood in control theory and are referred to as overdamping and

underdamping.

When calculating this threshold the limitation of no global knowledge is en‐

countered once again. A single node cannot know the current number of nodes

involved in a network. It can however approximate this value by examining

how much of the address space its leaf set is occupying and extrapolate to find

the total network size. This approximation actually becomes quite accurate as

the network is balanced via the volunteer balancing process.
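
One way to perform this extrapolation is sketched below. It assumes the leaf set is supplied as two lists (clockwise and anti‐clockwise neighbours) and that the network is larger than the leaf set, so the covered arc does not wrap around the whole ring. The function name and parameters are hypothetical.

```python
def estimate_network_size(own_id, clockwise, anticlockwise, space):
    """Extrapolate the total node count from the arc the leaf set covers."""
    # Length of the arc from the furthest anti-clockwise leaf to the
    # furthest clockwise leaf, passing through this node.
    span = (max((n - own_id) % space for n in clockwise)
            + max((own_id - n) % space for n in anticlockwise))
    # The arc holds one gap per leaf-set member.
    gaps_in_span = len(clockwise) + len(anticlockwise)
    # Assume the same density of nodes over the rest of the address space.
    return round(space * gaps_in_span / span)
```

For a perfectly balanced ring the estimate is exact, which is consistent with the observation that the approximation improves as volunteer balancing proceeds.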

Optimised Joining ID

When a node joins there is a significant opportunity to optimise its location be‐

fore it even advertises itself to other nodes. The standard Pastry join procedure

uses a randomly generated NodeID. This random NodeID helps to distribute

the nodes across the NodeID address space. However, with the addition of the

volunteer balancing scheme described above this random distribution is less

important. If a node could be placed equidistant between two neighbours in‐

stead of simply randomly assigned an ID then the network would already be

partially balanced and there would be fewer movement steps required to

stabilise the network. Since each movement step requires extra communication

amongst the leaf set any reduction in movement results in less communication

overhead.

The standard Pastry join procedure starts with the new node routing a special

join message to its randomly generated NodeID. This join message is received

by the node whose ID is closest to the new position, which then replies with

confirmation of the ID’s acceptance and with some initial data to start populating

the node’s routing state. This join message can be used to detect NodeID con‐

flicts. If a node receives a join message for its own NodeID it simply replies

with a message indicating the NodeID is unavailable and the joining node se‐

lects a new ID to join with.

For the enhanced joining process the node needs to find two nodes which it

will position itself between, then calculate their midpoint and select that as its

NodeID. Since there is no central body which can perform the calculation the

node must somehow select its own prospective neighbours. The existing join‐

ing process provides an effective method of doing this and allows the enhanced

process to be implemented with few changes.

The joining node generates a random NodeID as per usual and routes a join

message to that ID. When that join message is received it is processed slightly

differently. Instead of simply checking that the ID is valid and does not conflict

with existing IDs, the node now uses its leaf set to establish the new node’s

neighbours. The node then calculates the balanced position for the new node,

exactly as it would during a NodeID movement calculation. The node has suffi‐

cient information for this calculation because the new node’s leaf set will con‐

sist of the calculating node along with a subset of its own leaf set.

Once the balanced NodeID has been calculated the ID is returned in the join

message’s reply. When the joining node receives this reply it continues its join‐

ing process in the same manner it normally would, but with the new NodeID

returned from the join message.

In the enhanced scheme described so far the initial random NodeID is used to

select the neighbours of the joining node. However, this ID is simply used to

find the initial target of the join message. There is no reason why this join

message can’t be redirected to another position that offers a better balancing

prospect for the network if one can be found. The actual suitability of the ran‐

domly selected position is not known until the join message is received by its

target node. However, at that point the node can examine its leaf set and search

for better positions. The best position for a node to fill is the largest gap in the

address space between two nodes. While the receiving node does not know the

global best position, it can easily select the largest gap within its leaf set. Once

this is found it simply redirects the join message to the new position. The new

target node does not need to know that the join message had been redirected

and can perform the entire process again.
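
The search for the largest gap within a leaf set can be sketched as follows. The input is assumed to be the receiving node together with its leaf set, listed in ring order from the furthest anti‐clockwise member to the furthest clockwise member; the function name is hypothetical.

```python
def largest_gap_midpoint(known_ids, space):
    """Return (midpoint, gap) for the widest gap between consecutive known
    nodes. The arc outside the leaf set is unknown and is not considered."""
    best_start, best_gap = known_ids[0], -1
    for a, b in zip(known_ids, known_ids[1:]):
        gap = (b - a) % space          # clockwise distance, handles wrap
        if gap > best_gap:
            best_start, best_gap = a, gap
    return (best_start + best_gap // 2) % space, best_gap
```

The returned midpoint is the position to which the join message would be redirected; the gap size lets the node compare it with the gap around the original target.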

This redirection does however need to be limited somewhat. It is possible that

a join message may be redirected repeatedly, passing it gradually around the

node address space. In fact, because of the dynamic nature of the network –

there are regular node arrivals and departures and nodes’ IDs are being moved

– it is possible that the join message will never actually be processed. To

prevent this a join message needs to keep a count of how many times it has been

redirected, and redirection must cease once a certain threshold is reached. It is not a

significant issue if this threshold is reached as the goal of this enhancement is to

simply improve the initial joining position, not to select the absolute optimum.
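
The redirect counter can be carried in the join message itself, as in the following sketch. The limit of three redirects, the `JoinMessage` class and the `next_action` function are assumptions for illustration; the source does not state a particular threshold.

```python
from dataclasses import dataclass

MAX_REDIRECTS = 3   # assumed value; the text only requires some threshold


@dataclass
class JoinMessage:
    target_id: int
    redirects: int = 0   # how many times this message has been redirected


def next_action(join, better_position):
    """Return a redirected join message, or None to process the join here.
    `better_position` is the ID of a larger gap found in the leaf set,
    or None if the current target is already the best known position."""
    if better_position is not None and join.redirects < MAX_REDIRECTS:
        return JoinMessage(better_position, join.redirects + 1)
    return None
```

Once the counter reaches the limit the receiving node processes the join at the current position, which guarantees that every join message is eventually handled.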

5.2.4 Optimisation 4 – Node Ordering

The optimisation in section 5.2.1 describes how an alternate ObjectID assign‐

ment method can increase the chance that two objects that communicate often

will be located on the same machine. There are however usually situations

where objects that communicate often are located on different machines. If

these objects cannot be on the same machine (and indeed the entire goal of

G2:P2P is to distribute objects amongst multiple machines to benefit from ad‐

ditional computing power) then it is preferable that they are hosted on ma‐

chines that are physically close. This section describes a change to the layout of

Pastry nodes which allows physical locality to be reflected in the virtual Pastry

address space. By connecting the volunteer’s NodeIDs to their physical location

the uniform ObjectID distribution described in section 5.2.1 will benefit from

improved inter‐node communication along with the intra‐node communication

it already achieved.

Essentially what the optimisation proposes is to assign NodeIDs based on some

information which reflects the volunteer’s physical locality. For example, by

comparing two volunteers’ IP addresses, some insight can be gained as to their

position within a network. Nodes with similar IP addresses, especially identical

subnets, presumably have good communication links, while totally unrelated

addresses are similarly unrelated in their communication links. Therefore, if

volunteers with similar IP addresses are assigned similar NodeIDs, the speed

of volunteers’ communication links will be reflected by the proximity of their

NodeIDs. Once a link is formed between NodeIDs and their physical proximity,

the previous ObjectID ordering optimisation can be used to place objects which

regularly communicate on physically close volunteers.

IP addresses can be converted to NodeIDs using any simple mapping function.

For example, if the NodeID is the same length (i.e. has the same number of bits)

as the IP address then a simple identity function will suffice. Otherwise some

form of truncation or zero‐padding can be used to adjust the IP address to the

correct size. In the case of truncation the standard G2:P2P joining procedure

will correctly handle NodeID clashes. The only important property of the map‐

ping function is that it must maintain the relative order of the IDs. That is, No‐

deIDs generated from IP addresses must be in the same order as their original

IP addresses were.
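
An order‐preserving mapping of this kind can be sketched as below for IPv4 addresses. Padding on the low‐order side is one possible choice of zero‐padding, made here so that the IDs spread over the whole address space; the function name is hypothetical.

```python
import ipaddress


def ip_to_node_id(ip, node_id_bits):
    """Map an IPv4 address to a NodeID of the given length, keeping order."""
    value = int(ipaddress.IPv4Address(ip))    # 32-bit integer form
    if node_id_bits >= 32:
        return value << (node_id_bits - 32)   # zero-pad the low-order bits
    return value >> (32 - node_id_bits)       # truncate the low-order bits
```

With truncation, addresses that differ only in the discarded bits map to the same NodeID, which is the clash case left to the joining procedure.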

When volunteers are assigned IDs randomly, they are naturally distributed

across the entire address space. By tying the volunteers’ IDs to their IP address

this natural distribution will no longer occur and a severely unbalanced net‐

work could easily form. For this reason this optimisation must always be com‐

bined with the previous volunteer balancing optimisation. The volunteer ba‐

lancing process will overcome any imbalance caused by the systematic genera‐

tion of NodeIDs.

However, using the volunteer balancing technique complicates the joining

process. As volunteers adjust their IDs, they may find themselves moving sig‐

nificantly away from the ID that was generated based on their IP address.

While this is not a problem during the standard operation of the network, it

does present an issue when volunteers join. As usual, volunteers will send a

join message to their prospective NodeID (that is the ID generated from their

IP address). Without the volunteer balancing in effect that prospective NodeID

will place them amongst physically related volunteers. However, if those re‐

lated volunteers have moved since they joined, the generated ID and its

neighbours may have no physical relationship. This issue is entirely restricted

to join messages and hence does not require substantial changes to resolve.

The issue fundamentally comes down to the disconnect between a node’s

current ID and its original ID. While normal communication messages use the

current NodeID, the join messages need to be routed based on nodes’ original IDs.

Unfortunately the routing table maintained by each node is designed for

standard routing and hence cannot be used for routing based on original IDs.

However, due to the properties of the volunteer balancing technique, the order

of current IDs will be identical to their original ID order. This means leaf sets

can be used to route messages, albeit with a worst case of N/2 network hops.

In practice, joining is generally performed by contacting a physically close node

and using it to initiate the join message. This means that regardless of how far

volunteers’ IDs move during execution, join messages will always be initiated

reasonably close to their final target.

There is one disadvantage of applying this optimisation. In a normal Pastry

network, nodes in a particular physical locality are likely to be widely spread

throughout the network’s address space. This means that if some fault in that

physical locality occurs (such as a local power loss or a local network issue)

then loss of nodes will be felt in a dispersed fashion across the network rather

than in a single large cluster. Pastry networks are designed to be able to recov‐

er from individual nodes disappearing unexpectedly, provided that other

nodes in the leaf set remain. So by changing this aspect of the Pastry network

its ability to recover from local faults is decreased. This is obviously a trade‐off

that must be made between efficiency and reliability.

While this trade‐off may not be suitable for ad‐hoc networks hosted on the In‐

ternet, it would be of considerable use for networks spread across a number of

reliable sites with good interconnections. Each site’s nodes would be posi‐

tioned next to each other within the virtual address space and hence would

benefit from high speed communication links. These communication links

would be utilised by communicating objects, especially when combined with

the first optimisation presented in this chapter, but would also benefit the

standard upkeep of the network since most network overhead is performed

amongst leaf set nodes. Longer range routing would also benefit as a message

originating on one side of the network would very quickly arrive at the physi‐

cal destination site due to Pastry’s log(N) routing scheme and then efficiently

redirect to its ultimate target on the high speed internal network.

5.3 Programming Model Extensions

While most of the optimisations presented in this chapter require no special

effort by application developers, the first two optimisations – ObjectID order‐

ing and object collocation – do require some extra information to be provided.

In both cases the new features work with groups of objects. The improved ob‐

ject spacing takes a group of objects and assigns their IDs systematically to im‐

prove their locality. Similarly, the object collocation also takes a group of ob‐

jects, but instead ensures they are located on the same processing node. This

similarity allows us to provide a consistent programming interface for both op‐

timisations.

It is important that these optimisations do not limit how objects are created.

For instance, it would be possible to provide an API call which took a type pa‐

rameter and a parameter indicating the number of objects to create. The call

would simply create the indicated number of objects of the specified type as a

group, either uniformly dispersing the group as the first optimisation describes

or collocating them according to the second optimisation. However, this inter‐

face would only allow groups to contain a single object type and also does not

allow for parameters to be passed to the object constructors.

Arguably the simplest interface for creating object groups would be to provide

a method which takes an array of objects and groups them for spacing or collo‐

cation. Unfortunately, such an interface is impossible to implement. Any im‐

plementation of this method would require objects’ IDs to be altered to per‐

form the spacing or collocation. Changing ObjectIDs like this is impossible, as

was discussed earlier, making any post‐creation grouping of objects unattaina‐

ble.

It is therefore necessary to ensure object groups are defined at, or before,

the time the objects are created. To allow the maximum amount of flexibility it is important

that objects can be created using the standard “new” operator. Since the “new”

operator cannot easily be extended in .NET, the size of the group must be

communicated to the ObjectID generator prior to creating the objects. A simple

method call can supply the group size, n, to the generator before any objects in

the group are created. Once this call is made the generator is placed in a special

mode where it will generate the appropriate ObjectIDs (either spacing them or

collocating them) for the next n objects.
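
The idea of a generator that enters a special mode for the next n objects can be sketched as follows. This is not the G2:P2P generator: the class, its method names, the 32‐bit address space and the per‐thread state are assumptions, and only the spacing mode is shown.

```python
import random
import threading

SPACE = 1 << 32   # assumed address-space size for illustration


class ObjectIdGenerator:
    def __init__(self, rng=None):
        self._rng = rng if rng is not None else random.Random()
        self._state = threading.local()   # grouping mode is kept per thread

    def start_spacing_objects(self, n):
        """Enter spacing mode for the next n IDs requested on this thread."""
        s = self._state
        s.remaining = n
        s.step = SPACE // n
        s.next_id = self._rng.randrange(SPACE)

    def next_object_id(self):
        s = self._state
        if getattr(s, "remaining", 0) > 0:
            object_id = s.next_id
            s.next_id = (s.next_id + s.step) % SPACE
            s.remaining -= 1
            return object_id
        return self._rng.randrange(SPACE)   # default: random allocation
```

After n IDs have been issued the generator falls back to random allocation without any further call, which is why no matching "end" call is needed in this design.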

Listing 5‐1 demonstrates how the object spacing optimisation is used. This

sample generates a set of “Island” objects which are evenly spaced around the

network. In this case all of the objects are of the same type (Island) but each

object takes a different integer parameter which is used for identification.

Island[] islands = new Island[numIslands];
G2P2PChannel.Current.StartSpacingObjects(numIslands);
for (int i = 0; i < numIslands; i++)
    islands[i] = new Island(i);

LISTING 5‐1 – USING THE OBJECT SPACING OPTIMISATION

Listing 5‐2 demonstrates how collocation is used. As expected the interface is

very similar to the object spacing interface. In this sample a group of 3 related

objects are collocated for improved communication. The group of objects that

are collocated consist of three different types, ClassA, ClassB and ClassC, and

take entirely different parameter lists. It is important to note that G2:P2P only

intercepts the “new” operator for G2:P2P objects, not standard .NET objects.

The only change from the original, unoptimised code is the addition of the

StartCollocatingObjects call. This is important as it allows the application de‐

veloper to take advantage of collocating objects without compromising the de‐

composition of their solution.

Once StartCollocatingObjects is called the next n objects on that thread are al‐

located using the collocation optimisation; this decision is made at runtime. An

alternative method, using a pair of StartCollocating/EndCollocating methods

could have been provided, but the single call was chosen instead to keep the

interface consistent with the object spacing optimisation. There is no technical

reason why a Start/End pair could not also be provided.

G2P2PChannel.Current.StartCollocatingObjects(3);
ClassA a = new ClassA(new int[] {2, 3, 4});
ClassB b = new ClassB(a);
ClassC c = new ClassC(a, b);

LISTING 5‐2 – USING THE OBJECT COLLOCATION OPTIMISATION

In summary, there are three distinct methods of allocating ObjectIDs in G2:P2P.

1. The original, random ID generation which provides load balancing by

distributing the objects around the entire address space. This is the de‐

fault allocation scheme in G2:P2P.

2. The object spacing optimisation which takes a group of objects and

evenly spaces them around the address space. This is activated through

a call to G2P2PChannel.Current.StartSpacingObjects as shown in Listing

5‐1.

3. The object collocation optimisation which takes a group of objects and

ensures they are hosted on the same volunteer machine. This is acti‐

vated through a call to G2P2PChannel.Current.StartCollocatingObjects as

shown in Listing 5‐2.
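The three allocation schemes can be contrasted with a small sketch. It is illustrative only: a 64-bit ring stands in for the overlay's ID space, and the class and method names (ObjectIdSketch, RandomIds, SpacedIds, CollocatedIds) are hypothetical rather than part of the G2:P2P API. It also assumes that collocation is achieved by handing out adjacent IDs, so that a volunteer responsible for a contiguous range of the address space hosts them together.

```csharp
using System;

// Illustrative sketch only: a 64-bit ring stands in for the real ID space.
public static class ObjectIdSketch
{
    // Scheme 1: random IDs, spreading objects over the whole address space.
    public static ulong[] RandomIds(int count, Random rng)
    {
        var ids = new ulong[count];
        var buffer = new byte[8];
        for (int i = 0; i < count; i++)
        {
            rng.NextBytes(buffer);
            ids[i] = BitConverter.ToUInt64(buffer, 0);
        }
        return ids;
    }

    // Scheme 2: evenly spaced IDs. The gap depends on the group size,
    // which is why the count must be known before the first allocation.
    public static ulong[] SpacedIds(int count, ulong start)
    {
        if (count <= 0) throw new ArgumentOutOfRangeException(nameof(count));
        ulong step = ulong.MaxValue / (ulong)count;
        var ids = new ulong[count];
        for (int i = 0; i < count; i++)
            ids[i] = unchecked(start + step * (ulong)i); // wraps around the ring
        return ids;
    }

    // Scheme 3 (assumed mechanism): adjacent IDs, so the group falls inside
    // one volunteer's portion of the address space.
    public static ulong[] CollocatedIds(int count, ulong start)
    {
        if (count <= 0) throw new ArgumentOutOfRangeException(nameof(count));
        var ids = new ulong[count];
        for (int i = 0; i < count; i++)
            ids[i] = unchecked(start + (ulong)i);
        return ids;
    }
}
```

The sketch also suggests why a Start/End pair is easy for collocation but hard for spacing: adjacent IDs can be handed out one at a time, whereas the spacing step cannot be computed until the group size is known.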

These additions to the programming model place a small burden on the application programmer. The burden is kept as small as possible: the programmer simply provides the information the system needs to enact the optimisations, and is still free to employ their usual techniques for creating and using objects. It is possible, though unlikely, that the number of objects being created is not easily known when the call to StartSpacingObjects or StartCollocatingObjects must be made. If this were the case, an alternative procedure which marked the start and end of the region in which the optimisation applies would be useful. This would be simple in the case of StartCollocatingObjects but would be extremely difficult to implement in the StartSpacingObjects case, since an even spacing cannot be computed until the number of objects is known. For this reason the described method was chosen, keeping a common pattern for both calls.

5.4 Conclusion

In this chapter I have presented four methods of optimising the performance of

applications running on a fully decentralised cycle stealing framework. These

optimisations are particularly useful to applications with inter‐object commu‐

nication, but also provide benefits for non‐communicating applications.

The first two optimisations use the concept of object locality, familiar from

other high‐performance computing endeavours, to improve inter‐object com‐

munication performance by systematically assigning their ObjectIDs. These op‐

timisations improve performance by increasing the probability that regularly

communicating objects are hosted on the same volunteer.

The final two optimisations alter the underlying decentralised network layer.

These optimisations adjust the layout of the volunteers in the virtual P2P ad‐

dress space to improve locality of objects hosted on different volunteers. These

optimisations build on the previous two to further improve communication

performance. Since these optimisations are implemented at the P2P layer they

are not coupled to cycle stealing and could be used by other fully decentralised

networks.

6 Evaluation

A prototype implementation of G2:P2P has been developed in C#. Although this prototype has not been optimised for maximum performance, it provides a reasonable test bed for evaluating the design of G2:P2P. Like G2:P2P, the prototype has been developed from scratch without any input from G2:Classic. The entire prototype consists of 19 500 lines of code including tests. Sections of the prototype are also of interest as separate modules. In particular the TcpEx module – a bidirectional TCP channel for .NET – has been released publicly under a BSD license.

Three test applications have been developed using this prototype. Two of these applications were then used to test the performance of the system in typical university computing laboratories. The third application was developed by Johan Berntsson as part of research into distributed evolutionary computing (52) but was not used during performance testing. Together, these three applications demonstrate how G2:P2P handles a variety of different parallel application styles.

The prototype includes a complete implementation of the G2:P2P system as described in Chapter 3 and the fault tolerance extensions described in Chapter 4. Additionally, two of the four optimisations described in Chapter 5 – ObjectID ordering and volunteer balancing – have been implemented and their effect measured.

In this chapter I will evaluate the G2:P2P framework. Section 6.1 presents the test applications. Section 6.2 evaluates the performance of the prototype G2:P2P system, including the effectiveness of the optimisations presented in Chapter 5. Section 6.3 evaluates the overhead of the fault tolerance system presented in Chapter 4.

6.1 Test Applications

The first application is an embarrassingly parallel application for calculating the Mandelbrot set. Embarrassingly parallel applications are common candidates for cycle stealing systems because they do not require inter‐task communication. As a test application it provides a basic example useful for measuring raw performance.

The second, more sophisticated application uses a lattice gas model to simulate surface tension between two fluids. The application uses a cellular automaton to run the simulation. Cellular automata use a considerable amount of structured inter‐object communication. This test demonstrates how G2:P2P can be used to address problems that were essentially impossible to approach with traditional cycle stealing systems. The communication style is also similar to other parallel applications, such as finite element simulations, which are typically run on multi‐core or cluster machines. In addition, the cellular automaton provides a test bed for evaluating the efficacy of the locality optimisations introduced in Chapter 5.

G2:P2P was also used by Johan Berntsson to develop a cycle‐stealing genetic algorithm framework called G2DGA (52). The library uses the parallel form of genetic algorithm called the island model. While the island model can be correctly implemented in an embarrassingly parallel form, the distributed object form used by G2:P2P more closely reflects the structure of the island model. G2DGA has not been used to gather performance data since it was developed with an early version of G2:P2P; however, its basic structure still conforms to G2:P2P.

The following sections will examine the two test applications in detail.

6.1.1 Mandelbrot – Embarrassingly Parallel

The majority of existing cycle‐stealing applications are embarrassingly parallel. Embarrassingly parallel problems are well suited to cycle‐stealing because the only communication link required is between the task and the client (or the broker if it is collecting results on the client’s behalf). Additionally, embarrassingly parallel designs do not require complex fault tolerance mechanisms. If an error occurs with a single task the task can simply be restarted without affecting any other part of the application. Although embarrassingly parallel approaches are only suitable for some problems, this set of problems still includes some significant members including protein folding and astrophysics (53).

This test application uses an embarrassingly parallel algorithm to parallelise the calculation of the Mandelbrot set. The algorithm takes a region of the complex number plane and dissects it into a number of sub‐regions. Each sub‐region is assigned to a separate G2:P2P object which is responsible for calculating which portions of that region are part of the Mandelbrot set and which are not. The results of this calculation are returned to the client and displayed in the manner typically used to visualise the Mandelbrot set (54) as shown in Figure 6‐1.

This test application uses the basic remote object features supplied by G2:P2P but does not make use of inter‐object communication.

FIGURE 6‐1 – MANDELBROT VISUALISATION

6.1.2 Lattice Gas Simulation – Cellular Automaton

Cellular automata are discrete models consisting of a set of cells each with its own state (55). Cells change states according to a set of rules which are applied in discrete time steps. Cellular automata can be used to model a variety of areas including physics, biology and artificial life.

Lattice gas automata (56) are cellular automata used to model the interaction of fluids at the particle level. Because cellular automata are entirely discrete, lattice gas models allow straightforward implementation on computers and allow more particles to be simulated than finite element models which have to keep track of a number of continuous values, such as position, velocity and interaction potentials, for each particle. The specific problem solved by this test application is a simulation of the interaction between two immiscible fluids (see Figure 6‐2).

FIGURE 6‐2 – LATTICE GAS SIMULATION OF IMMISCIBLE FLUIDS

Cellular automata are good candidates for parallelisation. At each time step every cell in the automaton is inspected and its new state is calculated. This calculation only requires the cell’s current state and possibly the state of its immediate neighbours. This means that each cell can potentially be processed in parallel. Between each time step a cell simply needs to communicate its value to its neighbours so they have sufficient information to perform their next calculation. However, this requires frequent, direct communication which has kept cellular automata out of the reach of centralised cycle stealing frameworks. Since G2:P2P provides direct inter‐volunteer communication a cycle stealing implementation of cellular automata can now be realised.

Since the calculations required for each cell are typically very quick, parallelising at the cell level is too fine grained for cycle‐stealing. Instead, this test application splits the cells into a finite set of groups. Each of these groups is assigned to a separate G2:P2P object which will handle the calculation of all cells in the group. This limits the amount of data that needs to be communicated between objects to the states of just the edge cells.

The test application has further reduced inter‐object communication by not exchanging data on every time step. Each object calculates a number of steps and then exchanges the data. This results in some replication of calculations because there is a common area on each group’s border which must be calculated by multiple objects. However, the decrease in the frequency with which data needs to be communicated more than compensates for this extra processing.
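The trade‐off between replicated border calculation and less frequent exchange can be sketched as follows. This is a minimal illustration under stated assumptions, not the lattice gas code used in the tests: it uses a one‐dimensional ring of cells with a simple XOR‐of‐neighbours rule as a stand‐in, the names (HaloExchangeSketch, RunSequential, RunGrouped) are hypothetical, and the "exchange" is simulated by copying from a shared array rather than by messages between G2:P2P objects. Each group copies a border of `interval` cells from each side, runs `interval` steps locally, and keeps only its own cells.

```csharp
using System;

// Illustrative sketch only: a 1D ring with an XOR rule stands in for the
// 2D lattice gas; copying from a shared array stands in for messaging.
public static class HaloExchangeSketch
{
    // One local step; cells beyond the buffer ends are read as 0, so the
    // outermost valid cell moves inwards by one position per step.
    private static int[] Step(int[] cells)
    {
        var next = new int[cells.Length];
        for (int i = 0; i < cells.Length; i++)
        {
            int left = i > 0 ? cells[i - 1] : 0;
            int right = i < cells.Length - 1 ? cells[i + 1] : 0;
            next[i] = left ^ right;
        }
        return next;
    }

    // Reference run: the whole ring is updated together on every step.
    public static int[] RunSequential(int[] ring, int steps)
    {
        int n = ring.Length;
        var current = (int[])ring.Clone();
        for (int s = 0; s < steps; s++)
        {
            var next = new int[n];
            for (int i = 0; i < n; i++)
                next[i] = current[(i + n - 1) % n] ^ current[(i + 1) % n];
            current = next;
        }
        return current;
    }

    // Grouped run: data is exchanged only once every `interval` steps.
    public static int[] RunGrouped(int[] ring, int groups, int steps, int interval)
    {
        int n = ring.Length;
        if (groups <= 0 || interval <= 0 || n % groups != 0 || steps % interval != 0)
            throw new ArgumentException("Ring and step count must divide evenly.");
        int size = n / groups;
        var current = (int[])ring.Clone();
        for (int done = 0; done < steps; done += interval)
        {
            var next = new int[n];
            for (int g = 0; g < groups; g++)
            {
                // Exchange: own cells plus `interval` border cells per side.
                var local = new int[size + 2 * interval];
                for (int j = 0; j < local.Length; j++)
                    local[j] = current[((g * size - interval + j) % n + n) % n];
                // The border cells are recalculated here and by the neighbour.
                for (int s = 0; s < interval; s++)
                    local = Step(local);
                // After `interval` steps only the group's own cells are valid.
                Array.Copy(local, interval, next, g * size, size);
            }
            current = next;
        }
        return current;
    }
}
```

With a neighbourhood radius of one cell, a border of k cells stays valid for exactly k steps, so exchanging every fifth step requires each group to recalculate five extra cells on each side in this sketch.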

6.2 Speedup Tests

Two typical university computing laboratories were used to evaluate the per‐

formance of G2:P2P. The two labs consisted of a total of 56 desktop machines –

26 Core Duo 3GHz machines and 30 Pentium 4 3GHz machines connected by a

100Mbps Fast Ethernet network.

The two test applications described in Section 6.1 were run on this network. The Mandelbrot test was run on an area of the complex plane between [‐2‐2j] and [2+2j] at a resolution of 0.0001 units in both dimensions. The lattice gas simulation was run on a 1600×1600 cellular automaton for 50 steps. The objects exchanged data on every fifth step. Both applications used 144 objects (a 12×12 grid).
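As a rough illustration of how the Mandelbrot workload divides across a 12×12 grid, the sketch below cuts the square between ‐2‐2j and 2+2j into tiles and counts the sample points of one tile that lie in the set. It is an assumption‐laden sketch rather than the test application itself: the names (MandelbrotTilesSketch, TileBounds, InSet, CountInSet) and the iteration limit are hypothetical, and the standard escape‐time test is used because the source does not describe the membership test.

```csharp
using System;
using System.Numerics;

// Illustrative sketch only; not the test application's code.
public static class MandelbrotTilesSketch
{
    // Bounds of strip `index` (0-based) when [min, max] is cut into `grid` equal strips.
    public static (double Lo, double Hi) TileBounds(double min, double max, int grid, int index)
    {
        double width = (max - min) / grid;
        return (min + index * width, min + (index + 1) * width);
    }

    // Escape-time test: c is treated as inside the set if the orbit of
    // z -> z*z + c stays within radius 2 for maxIterations steps.
    public static bool InSet(Complex c, int maxIterations)
    {
        Complex z = Complex.Zero;
        for (int i = 0; i < maxIterations; i++)
        {
            z = z * z + c;
            if (z.Magnitude > 2.0) return false;
        }
        return true;
    }

    // The work one object would do: scan its tile at the given resolution.
    // Accumulating the step in floating point is adequate for a sketch.
    public static long CountInSet(int row, int col, int grid, double resolution, int maxIterations)
    {
        var re = TileBounds(-2.0, 2.0, grid, col);
        var im = TileBounds(-2.0, 2.0, grid, row);
        long count = 0;
        for (double y = im.Lo; y < im.Hi; y += resolution)
            for (double x = re.Lo; x < re.Hi; x += resolution)
                if (InSet(new Complex(x, y), maxIterations)) count++;
        return count;
    }
}
```

At a resolution of 0.0001 each axis holds roughly 40 000 sample points, so each of the 144 tiles covers on the order of ten million points. Tiles far from the origin escape almost immediately while tiles overlapping the set run to the iteration limit, which is one reason per‐object running times differ.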

Two optimisations have been implemented and tested within the prototype

G2:P2P system – the ObjectID ordering optimisation presented in Section 5.2.1

and the volunteer balancing optimisation presented in Section 5.2.3.

The first optimisation, ordering ObjectIDs according to their communication patterns, was tested with the cellular automata application. This application uses a lot of inter‐object communication in a nearest neighbour pattern so it should expect considerable benefits from this optimisation.

Figure 6‐3 shows the effect of optimising the order of ObjectIDs on the cellular automata. It shows that the optimised form gains a reasonable performance benefit with networks of all sizes.

FIGURE 6‐3 – SPEEDUP OF OBJECT ORDERING OPTIMISED CELLULAR AUTOMATA

These results show that the system provides speedup over the entire set of machines, however this speedup is less than linear. There are two primary reasons for these results.

The test network consisted of two sets of disparate machines. The first 26 machines were dual core machines capable of executing multiple objects in parallel. However, the remaining 30 machines had only a single core. While these 30 machines did have hyper‐threaded processors, the computationally intense nature of the application prevented hyper‐threading from providing any significant benefit. This disparity is apparent in the results between the 20 and 30 volunteers. By the 30 volunteer mark the slower single core machines had started to be included in the network. These single core machines act as a bottleneck on the application reducing its overall performance.

The second issue affecting these results is the load balance of the system. G2:P2P relies on the random generation of IDs for both volunteers and objects to provide load balancing. However, on the small networks used in this test there are insufficient volunteers to provide a good random distribution across the entire address space. This load imbalance is addressed by the volunteer balancing optimisation which was also implemented in the prototype.

This volunteer balancing optimisation was tested with both the Mandelbrot

and cellular automata applications. Figure 6‐4 and Figure 6‐5 show the

speedup when using the volunteer balancing optimisation compared to an un‐

optimised test run. This optimisation provides considerable benefits for both

applications. This benefit is primarily due to the improved load balancing that

the optimisation provides. By spreading the nodes evenly around the address

space it has ensured that each node services approximately the same number

of objects. In the unoptimised tests the load balancing can be quite uneven re‐

sulting in the application’s results relying on the completion of one particular

volunteer.

FIGURE 6‐4 – SPEEDUP OF MANDELBROT WITH VOLUNTEER BALANCING

[Plot: Speedup against Number of Volunteers, with series “Not Optimised” and “Optimised”]

FIGURE 6‐5 – SPEEDUP OF CELLULAR AUTOMATA WITH VOLUNTEER BALANCING

With this optimisation applied the results show speedup above the linear speedup line. This reflects the use of the dual core machines. The speedup moves below linear speedup again when the 30 volunteer mark is reached since these volunteers include single core processors and act as a bottleneck to the system.

This bottleneck could be reduced by using the number of cores in the machine

as a parameter to the volunteer balancing formula presented in section 5.2.3.

This formula already supports a weight factor for each node. If this weight is

used to reflect the overall processing power of the node, including the number

of cores, it would automatically assign dual core machines more of the object

address space and hence assign them more objects to process.
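The effect of such a weighting can be sketched as below. The formula from Section 5.2.3 is not reproduced here; the sketch simply assumes that each node's share of the address space is proportional to its weight, and that objects are spread uniformly over the address space. The names (WeightedBalancingSketch, AddressSpaceShares, ExpectedObjects) are hypothetical.

```csharp
using System;
using System.Linq;

// Illustrative sketch only: assumes share of address space is proportional
// to weight, which may differ in detail from the Section 5.2.3 formula.
public static class WeightedBalancingSketch
{
    // Fraction of the address space assigned to each node.
    public static double[] AddressSpaceShares(int[] weights)
    {
        if (weights.Length == 0 || weights.Any(w => w <= 0))
            throw new ArgumentException("Weights must be positive.", nameof(weights));
        double total = weights.Sum();
        return weights.Select(w => w / total).ToArray();
    }

    // Expected number of objects per node for uniformly distributed object IDs.
    public static double[] ExpectedObjects(int[] weights, int objectCount)
    {
        return AddressSpaceShares(weights).Select(share => share * objectCount).ToArray();
    }
}
```

For the test network, giving the 26 dual core machines a weight of 2 and the 30 single core machines a weight of 1 yields a total weight of 82, so with 144 objects a dual core machine would expect about 3.5 objects and a single core machine about 1.8.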

Although the tests have only been performed on relatively small networks, sca‐

lability to 56 nodes is a good demonstration for G2:P2P. The structure of the

Pastry network that G2:P2P uses changes significantly as the network grows

larger than its leaf set. At 56 nodes the network will have achieved the same

structure it uses for any larger sizes.

6.2.1 Multi‐Core Speedup

G2:P2P also offers benefits when run on a single machine with multiple pro‐

cessors or multi‐core processors. Each G2:P2P object is run on its own thread.

On multi‐processor machines this means that multiple objects can be executed

in parallel. Application developers can take advantage of these processors by

using the G2:P2P programming model and running a volunteer on the same

machine as the actual application. Volunteers can be run in either a separate

process or hosted within the main application process.

Both test applications were run on a dual core machine with significant performance benefits. Figure 6‐6 and Figure 6‐7 show the improvements provided by running the Mandelbrot and cellular automata applications on a dual core machine. In both cases G2:P2P almost halved the running time.
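For reference, the relationship between running time, speedup and efficiency used throughout this chapter can be written as a short sketch. The helper names (SpeedupSketch, Speedup, Efficiency) are hypothetical and the definitions are the conventional ones, assumed rather than quoted from the thesis.

```csharp
using System;

// Illustrative sketch of the conventional definitions.
public static class SpeedupSketch
{
    // Speedup: sequential running time divided by parallel running time.
    public static double Speedup(double sequentialMs, double parallelMs)
    {
        if (sequentialMs <= 0 || parallelMs <= 0)
            throw new ArgumentOutOfRangeException("Times must be positive.");
        return sequentialMs / parallelMs;
    }

    // Efficiency: speedup divided by the number of processing units used.
    public static double Efficiency(double sequentialMs, double parallelMs, int processingUnits)
    {
        if (processingUnits <= 0)
            throw new ArgumentOutOfRangeException(nameof(processingUnits));
        return Speedup(sequentialMs, parallelMs) / processingUnits;
    }
}
```

Halving the running time on two cores corresponds to a speedup of 2 and an efficiency of 1, the ideal case; a run that almost halves the time therefore has a speedup a little below 2.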

[Plots: Time (ms) for Sequential and Parallel runs]

FIGURE 6‐6 – SPEEDUP OF MANDELBROT ON DUAL‐CORE MACHINE

FIGURE 6‐7 – SPEEDUP OF CELLULAR AUTOMATA ON DUAL‐CORE MACHINE

6.3 Fault Tolerance Overhead

The main issue raised by the fault tolerance scheme presented in Chapter 4 is its potential overhead. This overhead has been measured through the prototype implementation. Since the system only introduces overhead during method calls, the cellular automata application was used for testing. The objects in the cellular automata frequently exchange messages with their neighbours. These messages act as a synchronisation point in the application, halting further processing until they are received. For this reason, any overhead on the message transmission could result in significant performance costs.

Figure 6‐8 provides the speedup values when running the cellular automaton on a G2:P2P network with varying fault tolerance levels. It shows that, as expected, the fault tolerance system incurs a slight overhead, but despite that overhead the system still gains significant speedup.

[Plot: Speedup against Number of Volunteers, with series “No Fault Tolerance”, “Local” and “Remote”]

FIGURE 6‐8 ‐ FAULT TOLERANCE OVERHEAD FOR CELLULAR AUTOMATON

Like the rest of the prototype system, the fault tolerance scheme has not benefited from any optimisation work. There is considerable opportunity for optimising the scheme. In particular, the local storage scheme currently serialises its entire data store each time it receives a message. This was done to simplify the implementation, but it could be improved by serialising only the latest message, which would significantly increase its performance.
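The cost difference between the two storage strategies can be sketched by counting the bytes each one writes. This is an illustration under assumptions, not the prototype's storage code: the names (MessageStoreSketch, BytesWrittenRewritingAll, BytesWrittenAppending) and the length‐prefixed record format are hypothetical, and an in‐memory stream stands in for whatever local storage the prototype uses.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Illustrative sketch only: compares bytes written by two storage strategies.
public static class MessageStoreSketch
{
    // Strategy A: re-serialise every stored message each time one arrives.
    public static long BytesWrittenRewritingAll(IEnumerable<byte[]> messages)
    {
        long total = 0;
        var store = new List<byte[]>();
        foreach (var message in messages)
        {
            store.Add(message);
            using (var stream = new MemoryStream())
            {
                foreach (var stored in store) WriteRecord(stream, stored);
                total += stream.Length;
            }
        }
        return total;
    }

    // Strategy B: append only the newly received message.
    public static long BytesWrittenAppending(IEnumerable<byte[]> messages)
    {
        using (var stream = new MemoryStream())
        {
            foreach (var message in messages) WriteRecord(stream, message);
            return stream.Length;
        }
    }

    // Hypothetical record format: 4-byte length prefix followed by the payload.
    private static void WriteRecord(Stream stream, byte[] payload)
    {
        byte[] length = BitConverter.GetBytes(payload.Length);
        stream.Write(length, 0, length.Length);
        stream.Write(payload, 0, payload.Length);
    }
}
```

Rewriting the whole store makes the total bytes written grow quadratically with the number of messages, while appending grows linearly, which is consistent with the expectation that serialising only the latest message would be considerably cheaper.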

7 Conclusions

This thesis demonstrates how a fully decentralised network model can offer a

number of benefits to cycle stealing. I have designed a scalable cycle stealing

framework using a fully decentralised network model. Previous cycle stealing

frameworks have used predominantly centralised network models. This cen‐

tralisation has placed significant limits on the frameworks, particularly in the

areas of scalability and inter‐volunteer communication. G2:P2P demonstrates

how a fully decentralised network can provide the basis for a cycle stealing

framework which naturally overcomes these limitations.

Chapter 3 described the design of G2:P2P. G2:P2P improves on existing cycle

stealing work by providing direct inter‐volunteer communication and by pro‐

viding scalability through its underlying network model. Since previous cycle

stealing work has not provided general purpose communication facilities, a

new programming model was required for G2:P2P. A distributed object based

model is presented which integrates with the .NET Remoting framework. This

allows non‐expert programmers to approach parallel, distributed computing

using a familiar programming model.

The direct communication provided by the decentralised model allows a wider scope of applications to be developed when compared with centralised cycle‐stealing frameworks. This has been demonstrated with the development of a parallel genetic algorithm framework and a parallel cellular automata application. Whilst genetic algorithms have been implemented with centralised methods, the decentralised version allows for a more natural implementation of the island model. Cellular automata have not previously been attempted on a cycle stealing system because of the large amount of communication required. The prototype implementation of G2:P2P has proven effective at providing speedup for this application.

Chapter 4 addresses fault tolerance in G2:P2P. Stringent fault tolerance on decentralised networks has not been required by previous decentralised applications. For cycle stealing it is an essential aspect. I have developed a reliable fault tolerance system which takes into account the restrictions of decentralisation. Since decentralised networks do not provide any reliable storage mechanism, data must be stored by replicating it across multiple nodes. My fault tolerance system is designed to minimise the amount of data that requires replication while still ensuring recovery is possible.

Chapter 5 presents four optimisations for improving application performance in G2:P2P. Two optimisations are specific to cycle stealing and concentrate on altering object locality to improve communication performance. The other two optimisations work at the underlying P2P layer to improve the layout of volunteers in the virtual P2P address space. Two of these optimisations, ObjectID ordering and volunteer balancing, have been implemented in the prototype and provided significant performance benefits to applications running on G2:P2P.

The aim of this research was to investigate how a fully decentralised network

model could be used to improve cycle stealing. A prototype system, G2:P2P,

was designed and developed which verifies that decentralised cycle stealing is

possible and yields benefits. This system extends the current cycle stealing

possibilities by providing direct inter‐object communication and by using an

underlying network model which naturally scales. I have addressed fault toler‐

ance on the network, which is essential for any cycle stealing framework, and

have introduced a number of optimisation techniques which significantly im‐

prove application performance.

7.1 Future Work

The performance testing of G2:P2P suggests that load balancing is currently causing the largest performance degradation. While the volunteer balancing optimisation presented in Section 5.2.3 improves load balancing considerably, there are a number of other potential methods of addressing this issue. The Javelin project uses a “work stealing” process to improve load balancing. Although this concept does not directly map to the distributed object programming model presented by G2:P2P, it could be adapted to some form of “object stealing” process.

While the distributed object programming model provided by G2:P2P is necessary for supporting its communication facilities, it does not support direct porting of existing cycle stealing applications. A more functional style programming model could be developed for decentralised cycle stealing which forgoes the communication facilities in exchange for easier porting of existing cycle stealing applications. The decentralised network model would still provide more natural scalability than the hybrid network models that systems like Javelin have required.

The integration of G2:P2P into the .NET Remoting framework simplifies development of cycle stealing applications. However, there are a significant number of applications which already use .NET Remoting which cannot be directly ported due to the restrictions of the G2:P2P programming model. Many of these applications may benefit from being able to distribute their processing across a cluster of computers for better scalability. Since most of the restrictions introduced by G2:P2P are necessary to correctly support fault tolerance with inter‐object communication, it may be possible to relax these restrictions in exchange for limits on how the objects communicate. This would allow G2:P2P to be used to easily distribute existing Remoting applications on large clusters of machines by simply changing some configuration settings.

G2:P2P also offers benefits for multi‐core/multi‐processor machines by in‐

creasing performance while avoiding concurrency issues through its restricted

programming model. Since these machines are increasingly common there

could be significant benefits in further work on improving multi‐core/multi‐

processor performance using G2:P2P as a basis.

G2:P2P does not address how malicious volunteers or clients could affect the

system. There is considerable work which could be performed to address how

to protect applications from attacks at the framework layer. This work could

build from existing work in protecting P2P applications from malicious nodes.

Finally, G2:P2P also offers an alternative method of developing P2P applica‐

tions. Currently P2P applications require significant knowledge of networking

to perform the necessary communications. The distributed object program‐

ming model provided by G2:P2P could potentially be adapted to provide a

simple API for developing pure P2P applications.

Bibliography

1. Oram, A., ed. Peer‐to‐Peer: Harnessing the Power of Disruptive Technologies. First ed. 2001, O'Reilly & Associates.

2. Kelly, W., P. Roe, and J. Sumitomo. G2: A Grid Middleware for Cycle Donation using .NET. Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications. 2002, pp. 699‐705.

3. Sumitomo, J., A Programming Model and Performance Model for Cycle Stealing, PhD Thesis, Queensland University of Technology, 2005.

4. Litzkow, M.J., M. Livny, and M.W. Mutka. Condor – A Hunter of Idle Workstations. in Proceedings of the 8th International Conference on Distributed Computer Systems. San Jose, California, USA, 1988.

5. Mason, R. and W. Kelly. G2‐P2P: A Fully Decentralised Fault‐Tolerant Cycle Stealing System. in Proceedings of the 2005 Australasian workshop on Grid computing and e‐research. Newcastle, New South Wales, Australia, 2005, pp. 33‐39.

6. Mason, R. and W. Kelly. Peer‐to‐Peer Cycle Sharing via .NET Remoting. in Proceedings of the Ninth Australian World Wide Web Conference. Gold Coast, Queensland, Australia, 2003, http://ausweb.scu.edu.au/aw03/papers/mason/paper.html

7. Mason, R. and W. Kelly. Enhancing Data Locality in a Fully Decentralised P2P Cycle‐Stealing Framework. in Proceedings of the Thirtieth Australasian Computer Science Conference. Ballarat, Victoria, Australia, 2007.

8. Kan, G., Gnutella, in Peer‐to‐Peer: Harnessing the Power of Disruptive Technologies, A. Oram, Editor. 2001, O'Reilly & Associates, Inc.: Sebastopol. p. 94‐122.

9. Hong, T., Performance, in Peer‐to‐Peer: Harnessing the Power of Disruptive Technologies, A. Oram, Editor. 2001, O'Reilly & Associates, Inc.: Sebastopol. p. 203‐241.

10. Gnutella2 Standard. Gnutella2 Developer Network [Online] January 16, 2006. [Cited: October 2, 2006], http://www.gnutella2.com/index.php/Gnutella2_Standard.

11. Loo, Boon Thau, et al. Measurement and Analysis of Ultrapeer‐based P2P Search Networks. UC Berkeley Technical Report UCB/CSD‐03‐1277, 2003.

12. Stoica, I., et al. Chord: A Scalable Peer‐to‐Peer Lookup Service for Internet Applications. in Proceedings of the 2001 ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. San Diego, California, 2001.

13. Ratnasamy, S., et al. A Scalable Content Addressable Network. in Proceedings of the 2001 ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. San Diego, California, 2001.

14. Plaxton, C.G., R. Rajaraman, and A.W. Richa, Accessing Nearby Copies of Replicated Objects in a Distributed Environment, in ACM Symposium on Parallel Algorithms and Architectures. 1997. p. 311‐320.

15. Rowstron, A. and P. Druschel. Pastry: Scalable, decentralized object location and routing for large‐scale peer‐to‐peer systems. in 18th IFIP/ACM International Conference on Distributed Systems Platforms (Middleware 2001). Heidelberg, Germany, November 2001.

16. Zhao, B., et al. Tapestry: A Resilient Global‐scale Overlay for Service Deployment. IEEE Journal on Selected Areas in Communications, Vol. 22, 2004.

17. Anderson, D. P., et al., SETI@home: An Experiment in Public‐Resource Computing. Communications of the ACM, Vol. 45, pp. 56‐61.

18. distributed.net. distributed.net Homepage [Online] December 16, 2006 [Cited: March 20, 2007], http://www.distributed.net/.

19. Nichols, D. Using idle workstations in a shared computing environment. in Proceedings of the Eleventh ACM Symposium on Operating Systems Principles (Austin, Texas, United States). ACM Press, New York, NY, November 8‐11, 1987.

20. Pruyne, J. and M. Livny, Interfacing Condor and PVM to harness the cycles of workstation clusters. Future Generation Computer Systems, 1996. 12(1), pp. 67‐85.

21. Epema, D.H.J., et al., A Worldwide Flock of Condors: Load Sharing Among Workstation Clusters. Journal on Future Generations of Computer Systems, 1995. 12(1), pp. 53‐65.

22. Carriero, N., et al., Adaptive Parallelism with Piranha. IEEE Computer, 1995. 28(1), pp. 40‐49.

23. Becker, D.J., et al. Beowulf: A Parallel Workstation for Scientific Computation. in Proceedings, International Conference on Parallel Processing, 1995. pp. 11‐14.

24. Culler, D.E., et al. Parallel Computing on the Berkeley NOW. in Proceedings of the 9th Joint Symposium on Parallel Processing (JSPP'97). Kobe, Japan, 1997.

25. Baratloo, A., et al., Charlotte: Metacomputing on the Web. in Proceedings of the Ninth International Conference on Parallel and Distributed Computing Systems, 1996.

26. Baratloo, A., et al. An Infrastructure for Network Computing with Java Applets. Concurrency: Practice and Experience, Vol. 10, 1998, pp. 1029‐1041.

27. Alexandrov, A.D., et al., SuperWeb: Research Issues in Java‐Based Global Computing. Concurrency: Practice and Experience, 1997. 9(6): p. 535‐553.

28. Capello, P., et al., Javelin: Internet‐Based Parallel Computing Using Java. In Proceedings of the Sixth ACM Symposium on Principles and Practice of Parallel Programming, 1997.

29. Neary, M. O., et al., Javelin++: Scalability Issues in Global Computing. Concurrency: Practice and Experience, Vol. 12, 2000, pp. 727‐753.

30. Cappello, P. and D. Mourloukos. CX: A Scalable, Robust Network for Parallel Computing. in ACM Java Grande/ISCOPE Conference, 2001.

31. Anderson, D. P., BOINC: A System for Public‐Resource Computing and Storage. 5th IEEE/ACM International Workshop on Grid Computing. November 8, 2004, Pittsburgh, USA.

32. Collet, M., G. A., et al., A Framework for Distributed Evolutionary Algorithms. in Proceedings of Parallel Problem Solving from Nature 2002, 2002.

130

33. Butt, A. R., et al., Java, PeertoPeer, and Accountability: Building Blocks for

Distributed Cycle Sharing. in Proceedings of the 3rd Virtual Machine Re‐

search and Technology Symposium, San Jose, California, 2004.

34. Awan, A., et al., Unstructured PeertoPeer Networks for Sharing Processor

Cycles. Parallel Computing, Vol. 32(2), 2006.

35. Jelasity, M., M. Preuss and B. Paechter, A Scalable and Robust Framework for

Distributed Applications. in Proceedings of the 2002 Congress on Evolution‐

ary Computing, 2002.

36. Sumitomo, J., W. Kelly, An Enhanced Programming Model for Internet Based

Cycle Stealing. in Proceedings of the 2003 International Conference on Par‐

allel and Distributed Processing Techniques and Applications. Las Vegas,

Nevada, 2003.

37. Rammer, I., Advanced .NET Remoting. Apress, Berkeley, CA. 2002. ISBN: 1‐

59059‐025‐2.

38. Cramer, C., Kutzner, K., and Fuhrmann, T., Bootstrapping Locality‐Aware P2P

Networks. in Proceedings of the IEEE International Conference on Networks

(ICON), Singapore, 2004.

39. Cooney, D., P. Roe, Experiences with a Mobile Process Oriented Middleware.

in Proceedings of the Tenth Australian World Wide Web Conference. Gold

Coast, Queensland, Australia, 2004,

http://ausweb.scu.edu.au/aw04/papers/refereed/cooney/paper.html.

40. Anderson, D. P., et al., SETI@home: an experiment in public‐resource

computing. Communications of the ACM, 2002. 45(11), pp. 56‐61.

41. Elnozahy, E., D. Johnson, and Y. Wang, A survey of rollback‐recovery

protocols in message‐passing systems. 1996, Carnegie Mellon University.

42. Alvisi, L., et al., An Analysis of Communication‐Induced Checkpointing, in

Symposium on Fault‐Tolerant Computing. 1999. p. 242‐249.

43. Briatico, D., A. Ciuffoletti, and L. Simoncini. A Distributed Domino‐Effect

Free Recovery Algorithm. in Proceedings of the IEEE International Symposium

on Reliability, Distributed Software, and Databases, Dec. 1984.

44. Hélary, J.M., A. Mostefaoui, and M. Raynal. Virtual Precedence in Asyn‐

chronous Systems: Concepts and Applications. in Proceedings of the 11th

Workshop on Distributed Algorithms, 1997.

45. Alvisi, L. and K. Marzullo. Trade‐Offs in Implementing Causal Message Log‐

ging Protocols. in Proceedings of the 1996 ACM SIGACT‐SIGOPS Symposium

on Principles of Distributed Computing Systems (PODC'96). Philadelphia,

PA, USA, 1996.

46. Plank, J. S., A Tutorial on Reed‐Solomon Coding for Fault‐Tolerance in

RAID‐like Systems. Software Practice and Experience, Vol. 27, 1997, pp. 995‐1012.

47. Luby, M. G., et al., Practical loss‐resilient codes. in Proceedings of the

twenty‐ninth annual ACM symposium on Theory of computing, 1997, pp. 150‐159.

48. Anderson, J. M. and Lam, M. S., Global optimizations for parallelism and

locality on scalable parallel machines. In Proceedings of the ACM SIGPLAN 1993

Conference on Programming Language Design and Implementation (Albu‐

querque, New Mexico, United States, June 21 ‐ 25, 1993). ACM Press, New

York, NY, 1993, pp. 112‐125.

49. B. Maggs, F. Meyer auf der Heide, B. Voecking, M. Westermann. Exploiting

Locality for Data Management in Systems of Limited Bandwidth. In 38th Annual

Symposium on Foundations of Computer Science (FOCS '97), 1997, p. 284.

50. Thomas Fahringer, JavaSymphony: A System for Development of Locality‐Oriented

Distributed and Parallel Java Applications, p. 145, IEEE International

Conference on Cluster Computing (Cluster'00), 2000.

51. G. Glass, ObjectSpace voyager— the agent ORB for Java, Lecture Notes in

Computer Science, 1998.

52. D. B. Lange and M. Oshima, Programming and Deploying Mobile Agents with

Java Aglets, Addison‐Wesley, Reading, MA, USA, Sept. 1998.

53. Philippsen, M. and M. Zenger, JavaParty – Transparent Remote Objects in

Java. Concurrency: Practice and Experience, Vol. 9, 1997, pp. 1225‐1242.

54. Berntsson, J., G2DGA: an adaptive framework for internet‐based distributed

genetic algorithms, in Proceedings of the 2005 workshops on Genetic and

evolutionary computation, 2006, pp. 346‐349.

55. Choosing BOINC projects. University of California [Online] April 16, 2007.

[Cited: April 26, 2007], http://boinc.berkeley.edu/projects.php.

56. Douady, A., Julia Sets and the Mandelbrot Set, in The Beauty of Fractals:

Images of Complex Dynamical Systems. H. O. Peitgen and D. H. Richter [ed],

Berlin: Springer‐Verlag, 1986.

57. Weisstein, Eric W. Cellular Automaton. From MathWorld‐‐A Wolfram Web

Resource. March 26, 2006, [Cited: April 26, 2007],

http://mathworld.wolfram.com/CellularAutomaton.html

58. Chopard, B., et al., Cellular automata and lattice Boltzmann techniques: An

approach to model and simulate complex systems. Complex Systems, Vol. 5,

2002.
