
Eternal: Fault Tolerance and Live Upgrades for Distributed Object Systems*

L. E. Moser, P. M. Melliar-Smith, P. Narasimhan, L. A. Tewksbury, V. Kalogeraki

Department of Electrical and Computer Engineering
University of California, Santa Barbara, CA 93106

* This research has been sponsored by the Defense Advanced Research Projects Agency in conjunction with the Air Force Research Laboratory Rome under Contract F3602-97-1-0248.

Abstract

The Eternal system supports distributed object applications that must operate continuously, without interruption of service, despite faults and despite upgrades to the hardware and the software. Based on the CORBA distributed object computing standard, the Eternal system replicates objects, invisibly and consistently, so that if one replica of an object fails, or is being upgraded, another replica is still available to provide continuous service. Through the use of interceptors, Eternal renders the object replication transparent to the application and also to the CORBA ORB. Consequently, Eternal is able to provide fault tolerance, and live hardware and software upgrades, for existing unmodified CORBA application programs, using unmodified commercial-off-the-shelf ORBs.

1 Introduction

Many computer systems used in defense applications must operate continuously without interruption of service. Such systems must provide continuous service in the presence of hardware and software faults, hardware that has been repaired and returned to service, hardware that has been upgraded, and software that has been corrected or upgraded to provide improved service. A typical requirement for continuous service is that the computer system must generate its response within a prescribed deadline, perhaps five seconds, and the deadline must be respected even in the presence of a fault or an upgrade. Such a deadline requirement implies almost immediate recovery from a fault. The interval between events, in which the system fails to achieve such immediate recovery, is required to be many years. This demanding requirement is no longer unique to defense systems; many e-commerce and e-business applications now have essentially the same requirement.



It is certain that every hardware unit will fail eventually. Many software components will also fail; indeed, in complex systems, software faults are substantially more frequent than hardware faults. Under conditions of warfare, computer system failures occur even more frequently because of damage, hostile intrusions, or use of hardware or software rushed into service without adequate testing. It is also likely that, within the lifetime of a major defense system, every hardware component will be replaced by different hardware, possibly completely different, and every software component will be replaced by enhanced software, also potentially very different. Under conditions of warfare, whatever hardware and software are available are what must be used; in a future information war, it may be necessary to modify the software very frequently to counter new strategies or attacks of the enemy, or to exploit opportunities or vulnerabilities.

Unfortunately, the current state-of-the-practice is inadequate in the reliability and recovery time that is achieved, in the ability to use heterogeneous hardware and software, in the ability to upgrade hardware or software without interruption of service, in the cost and difficulty of developing new software, and most particularly in the development timescale for new software. Fault-tolerant defense systems (and many are not) may require several minutes to recover, long enough for an enemy to detect the fault and to launch an attack to exploit it. Very few defense systems can exploit heterogeneous hardware and software. Even fewer defense systems attempt live upgrades, and existing live upgrade technology has a high risk of failure. The cost and timescales for the development of defense software are notorious; major systems typically require years to reach deployment, rather than the days or hours that will be essential in a future information war.


The Eternal system attempts to address these problems by providing:

• Operation in heterogeneous hardware and software environments by use of the Common Object Request Broker Architecture (CORBA) distributed object computing standard [22], and unmodified commercial-off-the-shelf (COTS) hardware platforms, operating systems, and CORBA ORBs.

• Robust fault tolerance by use of object replication, so that if one replica is disabled by a fault then another replica can still continue to provide the required service. Both active and passive replication are provided; however, only active replication can achieve the rapid recovery from faults that is required in many defense systems.

• Live software upgrades by use of object replication, so that one replica can be upgraded while another replica continues to provide service. The live upgrade technology of Eternal is robust and requires no more skill of the application programmers than is required for programming the application.

• Reduced software development costs and timescales. With Eternal, an application programmer writes a standard CORBA program and that program is automatically rendered fault tolerant by Eternal. In current practice, the highly-specialized mechanisms for fault tolerance are inextricably mixed into the algorithms of the application program, greatly increasing the difficulty and complexity of programming the application. Moreover, the testing of fault tolerance is time consuming, and skimping on such testing can be catastrophic.

Application programmers must be experts in the application domain; they cannot also be expected to be experts in fault tolerance. It is more appropriate for experts in fault tolerance to program the fault tolerance mechanisms once only (as in the Eternal system), and then apply them to many defense applications. The application programming is simpler, with lower costs and shorter development timescales.

2 The Eternal System

The Eternal system is based on the Common Object Request Broker Architecture (CORBA) defined by the Object Management Group (OMG) [22]. CORBA provides modular distributed object programming, location transparency within a distributed system, portability of programs across platforms, and interoperability between diverse platforms.

Eternal extends CORBA with capabilities for fault tolerance and live upgrades of application objects.

Figure 1: The Eternal system includes the Interceptor, Replication Manager, Evolution Manager and Resource Manager.

Eternal replicates objects, distributes the object replicas across the system and maintains consistency of the states of the replicas of an object. Both client and server objects can be replicated, and objects can act as both clients and servers.

In the Eternal system, the replicas of an object form an object group, a collection of objects typically located on different computers. An object can invoke methods on an object group (i.e., simultaneously on all of the members) in a transparent manner so that the invoker of the method is not aware of the type or degree of the replication, membership of the object group, or location of the members. A client object invokes the methods on a server object group as though the server were a single object.

The Eternal system includes the Interceptor, Replication Manager, Evolution Manager and Resource Manager, as shown in Figure 1. The CORBA Object Request Broker (ORB) packages the method invocations and responses into messages formatted according to the Internet Inter-ORB Protocol (IIOP), which are transmitted via TCP/IP. The Interceptor [17] captures the IIOP messages, and diverts them to the Replication Manager which, in turn, passes the messages to a multicast group communication system.

The group communication system multicasts the messages to the group and delivers the messages reliably and in total order to all members of the group. Over local-area networks, multicast group communication protocols, such as Totem [13], are now as efficient as point-to-point communication using TCP/IP, so no performance disadvantage results from replacing TCP/IP by such a multicast protocol. For operation over the Internet, multicast group communication protocols, such as FTMP [15], are currently being developed.



Figure 2: The Eternal Replication Manager, working in concert with the Message Handling, Logging and Recovery Mechanisms, provides object replication and maintains consistency of the states of the replicas. The mechanisms provide detection of duplicate invocations and duplicate responses, transfer of state between the object replicas, and consistent scheduling of concurrent operations.

The Replication Manager translates method invocations on an object group into method invocations on the individual object replicas. To achieve replica consistency, the Replication Manager utilizes the reliable totally ordered message delivery of the group communication system. Using the Message Handling, Logging and Recovery Mechanisms, shown in Figure 2, the Replication Manager provides detection of duplicate invocations and duplicate responses, transfer of state between the object replicas, and consistent scheduling of concurrent operations.

The Evolution Manager performs the automated upgrade and evolution of application objects while they continue to execute. The replicas of an object can be stopped and replaced, one at a time, while other replicas continue to provide service. A sequence of replacements, each implementing a small modification to a single replica, can achieve substantial modifications without stopping the system. The Evolution Manager invokes the Replication Manager to create replicas of the new versions of the objects and to remove replicas of the old versions.

The Resource Manager determines appropriate allocations of objects to processors and distributes the replicas across the system. It monitors the behavior of the objects and the use of the resources, and can move objects and change the type or the number of replicas to meet performance objectives. To meet soft real-time deadlines, it uses a distributed least-laxity scheduling algorithm.

The Replication, Evolution and Resource Managers are themselves implemented as collections of CORBA objects, and can thereby benefit from CORBA's interoperability and Eternal's fault tolerance and live upgrade capabilities.

3 Interception

The Eternal Interceptor [17] is a non-ORB-level, non-application-level module that "attaches" itself to every executing application object, transparently to the application and the ORB, and is capable of modifying the object's behavior as desired.

Current operating systems provide hooks that can be exploited to develop modules such as interceptors. With the Unix operating system, there are at least two possible implementations of interceptors. The /proc-based implementation provides for interception at the level of system calls. The library interpositioning implementation provides for interception at the level of library routines. The techniques differ, but the intent and the use of interceptors is the same in both cases.

The specific system calls to intercept in a /proc-based implementation, or the specific library routines to redefine in a library-interpositioning implementation, depends on the extent of the information that the interceptor must extract (from the ORB or the application) to enhance the application with new features. The interceptor may capture all, or a particular subset, of the system calls or library routines used by the application, depending on the feature being added. The Eternal Interceptor currently employs the library interpositioning implementation, because of its lower overheads and ease of deployment with various ORBs.

The Interceptor monitors the operating system calls made by the objects to establish IIOP connections over TCP/IP, and to communicate IIOP messages over those connections. The Interceptor catches the IIOP messages before they reach TCP/IP, and diverts them instead to the Replication Manager. The Replication Manager multicasts the messages to the object groups using the group communication system that delivers the messages to the object groups reliably and in total order. The interception approach of Eternal requires no modifications to the ORB, the operating system, or the application.
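To make the library interpositioning idea concrete, the following is a minimal sketch, not Eternal's actual code: a C++ shared library, activated with LD_PRELOAD, wraps the libc write() call so that outgoing bytes (for example, IIOP messages) can be examined before they reach TCP/IP. The diversion hooks named in the comments are hypothetical.

```cpp
// Minimal interpositioning sketch (illustrative only; not Eternal's code).
// Built as a shared library and preloaded into the ORB process, it wraps
// the libc write() call so outgoing bytes can be inspected before TCP/IP.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // needed for RTLD_NEXT
#endif
#include <dlfcn.h>
#include <unistd.h>

extern "C" ssize_t write(int fd, const void* buf, size_t count) {
    using write_fn = ssize_t (*)(int, const void*, size_t);
    // Locate the real write() the first time this wrapper is called.
    static write_fn real_write =
        reinterpret_cast<write_fn>(dlsym(RTLD_NEXT, "write"));

    // Hypothetical hooks: if this descriptor carries a tracked IIOP
    // connection, divert the message to the replication layer instead.
    // if (is_tracked_iiop_connection(fd)) {
    //     divert_to_replication_manager(fd, buf, count);
    //     return static_cast<ssize_t>(count);   // report success to the ORB
    // }
    return real_write(fd, buf, count);
}
```

Such a library might be built with g++ -shared -fPIC -ldl and preloaded into the unmodified ORB process, which is what makes the approach transparent to both the ORB and the application.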

4 Replication Management

In Eternal, the replicas of an object form an object group. Communication occurs between client and server object groups, rather than between individual client and server objects. Each object group has a unique object group identifier.

A reliable totally ordered multicast group communication system is used to communicate invocations and responses between client and server object groups. This ensures that all of the replicas of an object receive the same sequence of messages in the same order, which facilitates replica consistency.



In Eternal, a Logging Mechanism on each processor is responsible for recording invocations, responses and checkpoints of the replicas hosted on that processor. Typically, each processor hosts many different object groups. Thus, the Logging Mechanism maintains a single physical log for the processor, and the log is indexed by the object group identifier.

The log is a sequence of log records, each containing both an IIOP message (or a checkpoint represented as an IIOP message) and a special Eternal-specific header associated with the IIOP message for duplicate detection, garbage collection of the log, etc. The records are stored in the log as they are delivered reliably and in total order to the Logging Mechanism by the underlying multicast group communication system. The log contains only non-duplicate records.
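As a rough illustration of such a log (assumed field and type names; the actual record layout is not given in the paper), a per-processor log indexed by object group identifier might look like the following C++ sketch.

```cpp
// Illustrative sketch of a per-processor log indexed by object group id
// (assumed names, not Eternal's actual data structures).
#include <cstdint>
#include <map>
#include <vector>

struct EternalHeader {            // Eternal-specific header described in the text
    std::uint64_t operationId;    // used for duplicate detection
    std::uint32_t sourceGroupId;
    std::uint32_t targetGroupId;
    bool isCheckpoint;            // true if the payload is a checkpoint
};

struct LogRecord {
    EternalHeader header;
    std::vector<std::uint8_t> iiopMessage;  // IIOP request, reply, or checkpoint
};

class ProcessorLog {
public:
    // Records are appended in the reliable total order in which the group
    // communication system delivers them; duplicates are filtered beforehand.
    void append(std::uint32_t objectGroupId, LogRecord record) {
        logByGroup_[objectGroupId].push_back(std::move(record));
    }
    const std::vector<LogRecord>& recordsFor(std::uint32_t objectGroupId) const {
        static const std::vector<LogRecord> empty;
        auto it = logByGroup_.find(objectGroupId);
        return it == logByGroup_.end() ? empty : it->second;
    }
private:
    // One physical log per processor, indexed by object group identifier.
    std::map<std::uint32_t, std::vector<LogRecord>> logByGroup_;
};
```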

For all outgoing messages from the replicas that it manages, the Replication Manager receives the IIOP message from the Interceptor, and passes it to the Message Handling and Logging Mechanisms, which record the message header for duplicate detection, encapsulate the message in an Eternal-specific header, and pass it to the underlying multicast group communication system for transmission.

For all incoming messages to the replicas that it manages, the Replication Manager receives the encapsulated IIOP messages from the underlying multicast group communication system and routes the encapsulated messages with the Eternal-specific header to the Message Handling and Logging Mechanisms. The Logging Mechanism determines whether the message is a duplicate and records its header to allow detection of future duplicates. For passively replicated object groups, the Logging Mechanism logs the entire message for possible use during recovery. The Message Handling Mechanism uses the header to determine the target group and local replica for the message and delivers the message to the application.

For every replicated CORBA object that it manages, the Replication Manager also receives group membership change notifications from the underlying multicast group communication system. Thus, the Replication Manager is aware of the addition of replicas to, or the removal of replicas from, the object groups that it manages. When a new or recovering or upgraded replica is introduced, the Replication Manager initiates the transfer of state to the new replica by invoking the Logging Mechanism.

5 Replica Consistency

To ensure that the states of the replicas of an object are updated consistently, Eternal exploits the reliable totally ordered message delivery service of an underlying multicast group communication system that provides virtual synchrony guarantees. The replicas start in the same initial state. The method invocations, and the corresponding responses, are contained in multicast messages.

The multicast messages are delivered in the same total order at each of the object replicas. Consequently, each of the object replicas performs the operations in the same order, and the states of the object replicas remain consistent.
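The following small C++ sketch (assumed types, not Eternal's interfaces) illustrates why this is sufficient: if every replica starts in the same state and applies a deterministic operation for each message in the same delivered order, all replicas end in the same state.

```cpp
// Sketch of state-machine-style replication over totally ordered delivery
// (assumed types; not Eternal's API).
#include <vector>

struct Invocation { int methodId; std::vector<int> args; };

class Replica {
public:
    // Called by the group communication layer, in the same total order
    // at every replica of the object group.
    void deliver(const Invocation& inv) {
        apply(inv);                   // deterministic: same input, same effect
    }
    long state() const { return state_; }
private:
    void apply(const Invocation& inv) {
        // Application-specific but deterministic; e.g., update a counter.
        if (inv.methodId == 0 && !inv.args.empty()) state_ += inv.args[0];
    }
    long state_ = 0;                  // identical initial state at all replicas
};
```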

In addition to the reliable totally ordered multicast group communication system, Eternal employs the following mechanisms to maintain replica consistency:

• Transfer of state between replicas, ensuring that all of the replicas agree on which operations precede the state transfer and which follow it,

• Detection of duplicate invocations and duplicate responses that are generated by two or more replicas of an object, and

• Consistent scheduling of multithreaded concurrent operations.

5.1 State Transfer

Every replicated object can be regarded as having three kinds of state: application state, programmed into the object by the application programmer; ORB state, maintained by the ORB for the object; and infrastructure state, invisible to the application programmer and maintained by Eternal for the object. Application state is typically represented by the values of the data structures of the object. ORB state is vendor-dependent and consists of the values of the data structures (last-seen request identifier, threading policy, etc.). Infrastructure state is independent of, and invisible to, the object as well as to the ORB, and involves only information that is needed to maintain replica consistency.

The Logging and Recovery Mechanisms of Eternal ensure that all of the replicas of an object are consistent in application, ORB and infrastructure state. State transfer to a new or recovering or upgraded replica includes the transfer of application state to the replica, ORB state to the ORB hosting the new replica, and infrastructure state to the Message Handling and Logging Mechanisms for the replica.

In order that application state can be transferred from one replica to another, every object that is replicated must inherit the Checkpointable interface. This interface contains a get_state() method for retrieving an object's state and a set_state() method for assigning an object's state. The frequency of checkpointing is determined as a property by the application deployer for each object group individually.
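A plain C++ sketch of this pattern is shown below; the actual interface is defined for CORBA objects and is not reproduced here, and the Account example is hypothetical.

```cpp
// Hedged sketch of the Checkpointable pattern (plain C++, not the
// IDL-generated CORBA code): a replicated object exposes get_state()
// and set_state() so its application state can be captured and installed.
#include <string>

class Checkpointable {
public:
    virtual ~Checkpointable() = default;
    virtual std::string get_state() const = 0;        // serialize application state
    virtual void set_state(const std::string& s) = 0; // install transferred state
};

// Example application object whose state is a single account balance.
class Account : public Checkpointable {
public:
    void deposit(long amount) { balance_ += amount; }
    std::string get_state() const override { return std::to_string(balance_); }
    void set_state(const std::string& s) override { balance_ = std::stol(s); }
private:
    long balance_ = 0;
};
```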

For a passively replicated object group (see Section 6.2), the Recovery Mechanism transfers the state of the primary replica to the backup replicas periodically. For an actively replicated object group, the Recovery Mechanism transfers state only to activate a new or recovering or upgraded replica. In both cases, the Recovery Mechanism fabricates an IIOP message for get_state() to dispatch to the object group to obtain the application state from one of the existing replicas.


The Recovery Mechanism at the existing replica(s) piggybacks the ORB state and infrastructure state for the existing replica(s) onto the application state returned in the response message for the invocation of get_state(). The response message is then transmitted to the backup replica(s) or the new or recovering or upgraded replica, and is logged at those replicas.

The Recovery Mechanism at a new or recovering or upgraded replica, whether actively or passively replicated, extracts that message from the log, strips off the infrastructure state and the ORB state, and uses it to initialize the infrastructure state and the ORB state for that replica. The application state (the return value of get_state()) is passed by the Recovery Mechanism as an argument to set_state(), which it invokes on the new or recovering or upgraded replica.

All of the incoming invocations and responses must be enqueued until the set_state() invocation is delivered. This queue also contains the get_state() invocation dispatched earlier to the object group; all of the invocations and responses prior to this invocation are discarded from the queue. The get_state() and set_state() invocations must appear to occur at the same logical point in time because the return value of the get_state() is the parameter of the set_state().

Thus, when the set_state() invocation is received by the Recovery Mechanism at a new or recovering or upgraded replica, the invocation moves to the head of the incoming message queue (a position previously occupied by the get_state() message), and is delivered to the new or recovering or upgraded replica. The remaining enqueued messages are applied after the state transfer. The logged get_state() invocation is never applied to the new or recovering or upgraded replica; it simply serves to represent the synchronization point, in the totally ordered message sequence, at which the state transfer must occur through its corresponding set_state() message.
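The queue handling just described might be sketched as follows (assumed types and method names; a simplification of the actual mechanism): messages delivered while the replica awaits its state are queued, everything ordered before the logged get_state() marker is dropped, and the arriving set_state() is delivered in the marker's position.

```cpp
// Sketch of the state-transfer synchronization point (assumed names).
#include <deque>
#include <string>

struct Message { std::string op; std::string payload; };

class RecoveryQueue {
public:
    void enqueue(const Message& m) { pending_.push_back(m); }

    // Called when the set_state() carrying the transferred state arrives.
    // Returns the sequence to deliver to the new replica: set_state() first,
    // then every queued message that followed get_state() in total order.
    std::deque<Message> completeStateTransfer(const Message& setState) {
        // Drop everything ordered before the logged get_state() marker.
        while (!pending_.empty() && pending_.front().op != "get_state")
            pending_.pop_front();
        if (!pending_.empty()) pending_.pop_front();  // remove the marker itself
        pending_.push_front(setState);                // set_state takes its place
        std::deque<Message> toDeliver;
        toDeliver.swap(pending_);
        return toDeliver;
    }
private:
    std::deque<Message> pending_;
};
```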

5.2 Duplicate Detection and Suppression

The Message Handling Mechanism matches up responses with their corresponding invocations, and detects and suppresses duplicate invocations, responses and state transfer messages. To be able to match incoming response messages with their corresponding invocations, the Message Handling Mechanism inserts an invocation (response) identifier into the Eternal-specific header for each outgoing IIOP invocation (response) message. A part of the invocation identifier, the operation identifier, uniquely represents the operation consisting of the invocation-response pair. All replicas in an object group choose the same operation identifier, which is included in both invocation and response messages.

The Message Handling Mechanism records, for each source group on its processor, the invocation identifiers of all outgoing invocations for which responses are expected. When a response arrives, the Message Handling Mechanism delivers the response only if the operation identifier in the received response identifier corresponds to the operation identifier in the invocation identifier of an outstanding invocation.

Operation identifiers are also used to discard duplicate invocations and duplicate responses, so that only non-duplicate messages are delivered to the destination group. Each destination group corresponds to one or more source groups. To enable duplicate detection, for each source group that sends messages to a destination group, the Message Handling Mechanism at a destination processor records the operation identifier associated with the last-received message from the source group.

The list of operation identifiers for the outstanding invocations for which the existing replicas are awaiting responses is part of the infrastructure state that the Logging Mechanism stores and manages. The infrastructure state also contains information that the Logging Mechanism uses for duplicate detection and garbage collection of the log, including the list of last-seen operation identifiers from every sender group.

For every outgoing IIOP message that it receives from the Replication Manager, the Message Handling Mechanism inserts, but does not record, a unique operation identifier into the Eternal-specific header of the encapsulated message. For every incoming encapsulated IIOP message, the Message Handling Mechanism uses the information in the Eternal-specific header to detect and suppress duplicate messages, and passes only non-duplicate messages (along with sufficient information about the target object group) to the Replication Manager for delivery to the application.
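A simplified C++ sketch of duplicate suppression keyed on operation identifiers is shown below. Unlike the mechanism described above, which records only the last-received identifier per source group, this illustration keeps a set of identifiers for clarity.

```cpp
// Sketch of per-source-group duplicate suppression (assumed structure,
// not Eternal's actual mechanism).
#include <cstdint>
#include <map>
#include <set>

class DuplicateFilter {
public:
    // Returns true if the message should be delivered, false if it is a duplicate.
    bool acceptIncoming(std::uint32_t sourceGroup, std::uint64_t operationId) {
        auto& seen = seenBySource_[sourceGroup];
        return seen.insert(operationId).second;   // false if already present
    }
private:
    // Simplification: a set per source group; the text keeps only the
    // last-received operation identifier per source group.
    std::map<std::uint32_t, std::set<std::uint64_t>> seenBySource_;
};
```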

5.3 Multithreading

Many commercial ORBs are multithreaded, and multithreading can yield substantial performance advantages. Unfortunately, the specification of multithreading in the CORBA standard does not place any guarantee on the order of operations dispatched by a multithreaded ORB. In particular, the specification of the Portable Object Adapter (POA), which is a key component of the CORBA standard, provides no guarantee about how the ORB or the POA dispatches requests across threads. The ORB may dispatch several requests for the same object within multiple threads at the same time.

In addition to ORB-level threads, the CORBA application itself can be multithreaded, with the thread scheduling determined by the application programmer. The application programmer must ensure correct sequencing of operations and must prevent thread hazards.


Careful application programming can ensure thread-safe operations within a single replica of an object; however, it does not guarantee that threads and operations are dispatched in the same order across all of the replicas of an object. The application programmer should not need to be responsible for concurrency control and ordering of dispatched operations in replicated objects to provide strong replica consistency.

To maintain strong replica consistency for multithreaded objects, the Eternal system enforces deterministic behavior across all of the replicas of a multithreaded object by controlling the dispatching of threads and operations identically within every replica through a deterministic operation scheduler.

The operation scheduler dictates the creation, activation, deactivation and destruction of threads within every replica of a multithreaded object, as required for the execution of the current operation "holding" the logical thread-of-control. Exploiting the thread library interpositioning mechanisms of Eternal's Interceptor, the scheduler can override any thread or operation scheduling that either the multithreaded ORB, or the replica itself, performs.

Based on the incoming reliable totally ordered message sequence, the scheduler at each replica decides on the immediate delivery, or the delayed delivery, of the messages to that replica. At all of the replicas, the schedulers' decisions are the same and, thus, operations and threads are dispatched in the same order at all of the replicas of the multithreaded object.
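A much-simplified sketch of such a deterministic scheduler (assumed design, not Eternal's implementation) is shown below: operations are queued in total-order delivery sequence and dispatched one at a time, so every replica makes the same scheduling decisions.

```cpp
// Sketch of a deterministic operation scheduler (assumed design).
#include <functional>
#include <mutex>
#include <queue>

class DeterministicScheduler {
public:
    // Called in total-order delivery sequence; 'operation' is the work the
    // delivered invocation should perform on this replica.
    void submitInOrder(std::function<void()> operation) {
        std::lock_guard<std::mutex> lock(mutex_);
        ready_.push(std::move(operation));
    }

    // Runs queued operations one at a time, in submission (= total) order.
    void runNext() {
        std::function<void()> op;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (ready_.empty()) return;
            op = std::move(ready_.front());
            ready_.pop();
        }
        op();   // exactly one operation holds the logical thread-of-control
    }
private:
    std::mutex mutex_;
    std::queue<std::function<void()>> ready_;
};
```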

A more complete description of the mechanisms that Eternal uses for state transfer, duplicate detection and suppression, and multithreading can be found in [14, 18, 19].

6 Fault Tolerance

6.1 Fault Model

Eternal protects against either omission faults only, or against both omission faults and commission faults. These types of faults are broadly defined as follows:

• An omission fault occurs when an object or processor sends no further messages, i.e., it crashes, or omits to send an expected message but does send subsequent messages.

• A commission fault occurs when an object or processor sends a message that is syntactically or semantically incorrect, such as a mutant message. Mutant messages are two or more messages that purport to be the same message but that have different contents.

6.2 Types of Replication

The application and the types of faults that must be tolerated dictate the types of replication that are employed.

Figure 3: Passive replication. The primary replica in the object group executes the method and the Recovery Mechanism transfers the state of the primary to the nonprimary replicas at the end of the method invocation.

To protect against omission faults, Eternal uses the standard techniques of passive replication and active replication (without majority voting). To protect against commission faults, Eternal uses active replication with majority voting together with a more robust underlying group communication protocol, such as SecureRing [10], which substantially increases the cost of replication and fault tolerance.

6.2.1 Passive Replication

In passive replication, when a client object invokes a method on a server object group, Eternal multicasts the invocation to the server object group and only one of the server replicas, the primary replica, executes the method, as shown in Figure 3. The Replication Manager at each of the other replicas retains the message containing the method invocation so that those replicas can execute the method, if the primary replica fails. At the end of the method execution, Eternal multicasts the updated state of the primary replica to the nonprimary replicas and multicasts the results to the invoking object. The state transferred to the nonprimary replicas serves as a checkpoint to which the state can be rolled back, if the primary replica fails. During the execution of the method, the states of the nonprimary replicas may differ from that of the primary replica; however, the state transfer achieves replica consistency at the end of the method execution. The underlying reliable totally ordered multicast protocol ensures that either all of the nonprimary replicas have the updated state of the object or, alternatively, none of them has the updated state.


Figure 4: Active replication. All of the replicas within the object group execute the method. Duplicate invocations and duplicate responses are detected and suppressed by the Message Handling Mechanism.

In Figure 3, object groups A and B each contain three passive replicas. A client object invokes a method of object group A, and only the primary replica in object group A executes the method. That method invocation results in the invocation of a further method on object group B, and again, only the primary replica in object group B executes the method. When the primary replica in object group B has completed the method execution, the Recovery Mechanism transfers the state of that replica to the nonprimary replicas in object group B and returns the results to the replicas in object group A.
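The passive replication flow can be sketched as follows (assumed names; the checkpoint and retained-request handling is simplified): only the primary executes a delivered invocation, while the backups retain the invocation and later install the primary's transferred state.

```cpp
// Rough sketch of passive replication (assumed names; simplified).
#include <string>
#include <vector>

class PassiveReplica {
public:
    explicit PassiveReplica(bool isPrimary) : isPrimary_(isPrimary) {}

    // Delivered, in total order, to every member of the object group.
    std::string onInvocation(const std::string& request) {
        if (isPrimary_) {
            std::string reply = execute(request);  // application method runs here
            // The primary's updated state is then multicast to the backups.
            return reply;
        }
        retained_.push_back(request);  // kept in case the primary fails
        return {};
    }

    // Backups apply the primary's checkpoint and discard retained requests
    // that the checkpoint already reflects.
    void onStateTransfer(const std::string& checkpoint) {
        state_ = checkpoint;
        retained_.clear();
    }
private:
    std::string execute(const std::string& request) {
        state_ += request;             // placeholder for real method execution
        return "ok";
    }
    bool isPrimary_;
    std::string state_;
    std::vector<std::string> retained_;
};
```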

6.2.2 Active Replication

In active replication, when a client object invokes a method on a server object group, Eternal multicasts the method invocation to the server object group and each of the server replicas then executes the method, as shown in Figure 4. The underlying reliable totally ordered multicast protocol ensures that all of the replicas of an object receive the same messages in the same order, and that they can thus execute the methods in the same order. This ordering of method invocations and responses ensures that the states of the replicas are consistent at the end of each operation.

In Figure 4, object groups A and B each contain three active replicas. A client object invokes a method of object group A, which in turn invokes a method of object group B. Each of the replicas in object group B executes the method invocation, and Eternal multicasts the results at the end of the execution and, similarly, for each of the replicas in object group A.

Figure 5: Active replication with majority voting. Using a Voting Mechanism, the Replication Manager subjects the invocations (responses) to majority voting in order to produce a single invocation (response).

6.2.3 Active Replication with Majority Voting

To tolerate both omission and commission faults, active replication with majority voting must be used, as shown in Figure 5. In the past, majority voting was used in synchronous systems for safety-critical applications such as aircraft flight control [28]. Multicast group communication systems that provide reliable totally ordered message delivery in a model of virtual synchrony make majority voting possible in asynchronous systems. Most group communication systems tolerate only crash faults. A more robust group communication system, such as SecureRing [10], must be used to tolerate arbitrary faults.

Majority voting requires at least three-way active replication. In an environment that is subject to arbitrary faults, the invocation (response) first received might be erroneous and, thus, duplicate invocations (responses) cannot be suppressed. Rather, to tolerate arbitrary faults, the invocations (responses) from different replicas in the same object group are collected at the invokee (invoker) and combined using majority voting to produce a single invocation (response).

In Figure 5, object groups A and B each contain three active replicas and majority voting is used. A client object invokes a method of object group A, which in turn invokes a method of object group B. Using a Voting Mechanism, the Replication Manager subjects the invocations to majority voting in order to produce a single invocation. Then, each of the replicas in object group A executes the method, and Eternal multicasts the results, and similarly for each of the replicas in object group B.


Using the Voting Mechanism, the Replication Manager likewise subjects the responses to majority voting in order to produce a single response.
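A minimal sketch of the majority vote itself (assumed representation of the collected replies) is shown below; a value is accepted only if more than half of the replicas in the group produced it.

```cpp
// Sketch of majority voting over replies from an actively replicated group
// (assumed representation; not Eternal's Voting Mechanism).
#include <cstddef>
#include <map>
#include <optional>
#include <string>

std::optional<std::string> majorityVote(
        const std::map<std::string, std::string>& replies, std::size_t groupSize) {
    // 'replies' maps replica id -> reply contents for one operation identifier.
    std::map<std::string, std::size_t> tally;
    for (const auto& entry : replies) ++tally[entry.second];
    for (const auto& candidate : tally)
        if (candidate.second > groupSize / 2)     // strict majority required
            return candidate.first;
    return std::nullopt;  // no majority yet (e.g., mutant messages disagree)
}
```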

More details about the mechanisms needed to tolerate both omission faults and commission faults can be found in [16].

6.3 Fault Detection and Recovery

Detection of faults in Eternal (and Totem) is based on unreliable fault detectors because, in an asynchronous distributed system that is subject to faults, it is impossible to distinguish between an object or processor that has failed and one that is merely slow. However, if commission faults must also be tolerated, then a more robust fault detector is required [9].

Recovery from faults requires more care for passive replication than for active replication. For an object that is passively replicated, the effect of a fault depends on whether the failed replica is a primary or nonprimary replica. If a nonprimary replica fails during the execution of a method, the Replication Manager removes it from the group, while the primary replica continues to execute the method. Thus, the failure of a nonprimary replica is transparent to the client object that invoked the method.

To consider the failure of a primary replica, envisage a passively replicated object with several replicas, each on a different processor. A method invoked on an object group containing those replicas is multicast to both the primary and nonprimary replicas. The nonprimary replicas do not execute the method; rather, the Logging Mechanism for a nonprimary replica logs the method invocation until it receives the response from the primary replica. If the primary replica fails, the Replication Manager determines a new primary replica. The new primary replica must execute all of the methods for which it has not received a response from the prior primary replica. Invocations on other objects, and even responses, can be generated by both the original replica and the new primary replica. The Message Handling Mechanism suppresses duplicate invocations (responses) that occur after the fault.

For active replication, if any one of the server replicas fails, or is removed for upgrading during the execution of a method, service is not interrupted (because the method is executed by the other server replicas) and the results are returned to the client object. The server replicas also preserve the state of the server object for subsequent transfer to a new or recovered or upgraded server replica. To sustain operation, at least one server replica must be operational and, to tolerate a fault, two or more server replicas are needed.

7 Evolution Management

Without the ability to upgrade the software, and also the hardware, no application can claim to be able to operate continuously.

Exploiting object replication, the Eternal Evolution Manager readily supports evolution of, and upgrades to, the hardware. Hardware components can fail, be repaired or replaced, and be reintegrated into the system without interruption of service. If the replacement hardware is of a different design (e.g., different byte order or data representation), the interoperability features of CORBA enable the application to adapt to the change.

With existing technologies, in a conventional application, the system must be stopped in order to upgrade the software (i.e., the application objects themselves). With Eternal, the system need not be stopped to upgrade the application objects. By exploiting object replication, the Evolution Manager accomplishes the overall change to the objects incrementally and systematically, while the application continues to operate.

The Evolution Manager, which comprises the Preparer and the Upgrader shown in Figure 6, performs the upgrade of a large program in a sequence of steps. Each step of the sequence is completed and demonstrated to operate satisfactorily before the next step is undertaken. A step in the sequence consists of three phases:

• A preparation phase in which the programmer prepares a new upgraded program that is to replace the existing program,

• A preliminary preprocessing phase that involves the Preparer with the assistance of the human, and

• A fully automatic upgrade phase that involves the Upgrader.

7.1 The Preparer

To upgrade an object class, the programmer submits, to the Preparer, the code of both the existing object class and a new version of the class. The Preparer compares the two classes and determines the differences. With assistance from the programmer, the Preparer generates one or more intermediate classes to facilitate the upgrade, and compiles and deposits those classes into the CORBA Implementation Repository for use by the Upgrader. No special skill is required of the application programmer beyond that required to program the application program being upgraded.

7.2 The Upgrader

The Upgrader upgrades an object of a class using a sequence of invisible upgrades, each of which moves closer to the desired overall upgrade. The actual upgrade is performed in a single atomic action when all program code and data structures are in place. Further invisible upgrades then remove obsolete or transitional code. The Upgrader ensures that at least one replica of an object continues to provide service while another replica is being upgraded.


Figure 6: The Eternal Evolution Manager, consisting of the Preparer and the Upgrader, exploits object replication to achieve hardware and software upgrades without stopping the system.

Particular care is required for upgrades that modify the attributes (local state variables) of an object. Code must be generated by the Preparer with the assistance of the programmer, and invoked by the Upgrader, to set appropriate values for new attributes. Even more care is required if the signatures (parameters and their types) of the methods of the objects are changed. Several classes of objects may need to be upgraded together in a coordinated upgrade sequence.

If evolution is required but fault tolerance is not, then replication is necessary only while the system is being upgraded. Such systems operate normally with unreplicated objects, but additional replicas are introduced when needed to support live upgrade and evolution. Replication thus provides not only fault tolerance but also live upgrades that allow distributed object applications to grow and evolve without interruption of service.

8 Resource Management

Typical distributed object applications, particularly defense applications, are very complex and are incompletely understood, and it is difficult for the application programmer to obtain accurate projections of resource requirements and behavioral characteristics. Consequently, resource management in Eternal is implemented as a Resource Manager for the entire system and a Profiler and a Scheduler for each processor, as shown in Figure 7. Logically, there is only a single copy of the Resource Manager, although it may be replicated for fault tolerance. Each Profiler and each Scheduler, however, is specific to an individual processor and is interfaced to both the ORB and the operating system.

8.1 The Resource Manager

The Resource Manager allocates the object replicas to the processors based on the current loads on the processors, and moves objects from one processor to another. As new tasks are introduced into the system, the Resource Manager decides whether the available resources can satisfy the requests and allocates the resources accordingly. During operation, the Resource Manager might determine that a resource is overloaded or that a task is not meeting its deadlines, necessitating reallocation of the object replicas. If a processor is lost because of a fault, the Resource Manager might need to reallocate the object replicas to maintain a sufficient degree of replication to satisfy fault tolerance requirements.

For each application task, the Resource Manager maintains a list of the method invocations required for that task, a deadline for completion of the task, and an importance metric that is used to decide which tasks should be abandoned if a system overload occurs.

For each method of each object, the Resource Manager maintains estimates of the processing and communication times for invocations of that method. These estimates are used to determine the initial laxity of the application task when it starts to execute.

The Resource Manager works in concert with the Profilers and Schedulers on the processors. The Profilers monitor the behavior of the application objects and measure the current load on the processors' resources. The Resource Manager allocates objects to processors, the objects execute and use resources, and the Profilers report resource usage to the Resource Manager. The Resource Manager can determine the degree of replication of each application object and can reallocate resources and reschedule objects on the processors, to maximize the system utility and to maintain a uniform load on the resources. The Schedulers exploit information collected by the Resource Manager to schedule the tasks to meet soft real-time deadlines.

The Resource Manager employs a three-level feedback loop. The tightest level (milliseconds) uses measurements of elapsed time to refine the estimated residual laxities of executing tasks, which are used for least-laxity scheduling. The second level (fractions of a second) uses measurements of elapsed time and measurements of the resource load to refine the initial estimates of the laxity for the tasks as they start. The third level (several seconds) uses the measured resource load and residual laxities to revise the allocation of objects to processors.

8.2 The Profilers

The Profilers monitor the current load on the processors and, therefore, can detect significant deviations in performance. They supply feedback to the Resource Manager, which determines the allocation of the objects to the processors.


Figure 7: The Eternal Resource Manager, working in concert with the Profilers and the Schedulers on each of the processors, employs object migration algorithms to balance the loads on the processors and also a distributed least-laxity scheduling algorithm to schedule the tasks.

Each Profiler is specific to an individual processor and reports, for each resource associated with its processor, the current utilization of the resource, which determines whether a resource is overloaded and whether to reallocate objects to other processors to balance the load or to decrease their degrees of replication.

Each Profiler also monitors the behavior of the application objects located on its processor and reports, for each application object, the time required for the object to execute on the processor and the proportion of each resource allocated to the object.

8.3 The Schedulers

The Schedulers employ a distributed least-laxity scheduling algorithm, which is effective, provided that the processors are not overloaded. The processor utilizations are assumed to provide sufficient margins to accommodate statistical fluctuations in the load with acceptable probability.

In least-laxity scheduling, the laxity of a task t represents a measure of the urgency of the task. The laxity is defined by:

Laxity_t = Deadline_t - Projected latency_t

where Deadline_t is the time by which task t must be completed and Projected latency_t is the estimated time to complete task t.
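A small C++ sketch of least-laxity selection under this definition (assumed task representation) is shown below; the task with the smallest laxity is the most urgent and is scheduled first.

```cpp
// Sketch of least-laxity task selection (assumed task representation).
#include <algorithm>
#include <vector>

struct Task {
    double deadline;          // absolute time by which the task must complete
    double projectedLatency;  // estimated remaining time to complete the task
    double laxity() const { return deadline - projectedLatency; }
};

// Returns an index into 'tasks' for the task to run next, or -1 if none.
int pickLeastLaxity(const std::vector<Task>& tasks) {
    if (tasks.empty()) return -1;
    auto it = std::min_element(tasks.begin(), tasks.end(),
        [](const Task& a, const Task& b) { return a.laxity() < b.laxity(); });
    return static_cast<int>(it - tasks.begin());
}
```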

A more complete description of the resource management strategies employed by the Eternal system can be found in [6, 7, 12].

9 Prototype Implementation

Our prototype implementation of the Eternal system operates using unmodified commercial ORBs, including Inprise's VisiBroker, Iona's Orbix, Xerox PARC's ILU, Object-Oriented Concepts' ORBacus and Washington University's real-time TAO ORB. The implementation is designed for Solaris 2.6 but also operates on RedHat Linux. A port to Windows NT is in progress.


The current implementation exploits library interpositioning, which is less dependent on operating-system-specific mechanisms and has lower overheads than our initial implementation, which was based on intercepting the /proc interface of the Solaris operating system. Either approach (library interpositioning or using /proc) allows Eternal to be used with diverse commercial ORBs, with no modification to either the ORB or the application. The vendor's implementation of CORBA must, of course, support IIOP, as the CORBA standard mandates.

The overheads of Eternal are in the range of 10-15% for remote invocations/responses with triplicated clients and triplicated servers. These low overheads include the cost of interception and replication, as well as that of multicasting GIOP messages using the Totem multicast group communication system [13].

For example, using Sun UltraSPARC2 167 and 200 MHz workstations and 100 Mbit/s Ethernet, a remote invocation and response with an unreplicated client and an unreplicated server running over the VisiBroker ORB, without Eternal, required 0.330 ms. In this case, the client and server communicate using IIOP messages transmitted over TCP/IP.

Using Eternal, for the same platform and application, with three-way actively replicated client and server objects running over VisiBroker, a remote invocation and response required 0.369 ms, which represents an overhead of 12% over the unreplicated case. These measurements involved an actively replicated client object repeatedly invoking an actively replicated server object using deferred synchronous communication without message packing.

With three-way passively replicated clients and servers, a remote invocation and response required 0.374 ms, which represents an overhead of 15% over the unreplicated case. These measurements involved a passively replicated client object, with the primary client replica repeatedly invoking the passively replicated server object. The state transfers for both client and server objects were hand-coded.

10 Related Work

Several systems that extend CORBA with object replication and fault tolerance have been developed.

The Electra toolkit implemented on top of Horus provides support for fault tolerance by replicating CORBA objects, as does Orbix+Isis on top of Isis [1, 11]. Unlike Eternal, Electra and Orbix+Isis are non-hierarchical object systems that support only active replication.


Both Electra and Orbix+Isis use an integration approach in that the replication and group communication mechanisms are integrated into the ORB and require modification to the ORB. In contrast, Eternal uses the interception approach, which requires no modification to the ORB.

Another approach to fault tolerance, adopted by the OpenDREAMS toolkit [4], adds replication and group communication as services (implemented as CORBA objects) on top of the ORB, and requires no modification to the ORB. The service approach exposes the replication of objects to the application programmer and allows the programmer to modify the class library to construct customized replication and group communication services. In contrast, Eternal is transparent to the application and the application programmer.

The Distributed Object-Oriented Reliable Service (DOORS) [24] adds support for fault tolerance to CORBA applications by providing replica management and fault detection as service objects above the ORB. DOORS supports passive but not active replication and is not based on multicast group communication, as is the Eternal system. The DoorMan management interface monitors DOORS and the underlying system to fine-tune the functioning of DOORS and to take corrective action, if their hosts are suspected of being faulty.

The Maestro toolkit [27] adds reliability and high availability to CORBA applications in settings where it is not feasible to make modifications at the client side. It includes an IIOP-conformant ORB with an open architecture that supports multiple execution styles and request processing policies. The replicated updates execution style can be used to add reliability and high availability on the client side.

The AQuA framework [3] employs the Ensemble/Maestro [26, 27] toolkits, the Proteus dependability property manager, and the Quality of Service for CORBA Objects (QuO) runtime system. Proteus determines the type of faults to tolerate, the replication policy, the degree of replication, the type of voting to use and the location of the replicas. Using a Quality of Service Description Language (QDL) to specify an application's expected usage patterns and QoS requirements, QuO modifies the configuration to meet those requirements dynamically, and provides mechanisms for measuring and enforcing Quality of Service contracts and taking appropriate actions when those contracts are violated.

Another CORBA-based system that provides adaptation to dynamic and unpredictable changes in the computing environment has been developed by Nett, Gergeleit and Mock [20]. Like Eternal, their system uses integrated monitoring, dynamic execution time prediction, and scheduling to provide time-awareness for standard CORBA object invocations.

Shokri, Hecht, Crane, Dussault and Kim [25] make the case that effective fault handling in complex distributed applications requires the ability to adapt resource allocation and fault tolerance policies dynamically in response to changes in the environment, application requirements and available resources. The Eternal system supports this viewpoint in its implementation of fault tolerance and resource management.


While the above systems provide support for fault tolerance and resource management, they do not provide support for live upgrade and evolution of objects.

The Simplex Architecture [5], which is intended for online upgrades of control systems, is based on two abstractions, the replaceable unit abstraction and the cell abstraction. The replaceable unit abstraction allows an existing software module to be replaced online by another module with similar or enhanced functionality, while the cell abstraction represents a protected module which cannot be affected by other modules. These abstractions have been implemented in a real-time POSIX testbed, based on publish/subscribe communication, which is quite different from the multicast group communication employed by the Eternal system.

Another system that supports upgrades of system software, hardware and application software has been developed by Kanevsky, Krupp and Wallace [8]. Their system shares many characteristics with the Simplex Architecture but differs in its application to the evolution of long life-cycle defense systems. In particular, they have applied their technology to the multiple target tracking part of a surveillance radar system.

Neither of those two systems has addressed the general problem of live upgrades of object-oriented programs that the Eternal system addresses.

11 Conclusion

Defense applications of the future will be complex distributed object applications that must operate continuously without ever stopping. Those applications will be difficult enough to develop without the additional programming required to provide fault tolerance and live upgrades.

By replicating CORBA objects and maintaining strong replica consistency, the Eternal system provides support for fault tolerance and live upgrades of defense applications. Eternal simplifies the programming of those applications by exploiting the location transparency, portability and interoperability that CORBA provides and hiding the difficult issues of replication, consistency and recovery from the application programmer.

The prototype implementation of the Eternal system works with unmodified commercial-off-the-shelf CORBA ORBs, with overheads for triply replicated applications in the range of 10-15%, compared to their unreplicated counterparts. These low overheads include the cost of interception and replication, as well as that of multicasting messages.

During the past year, we have been working with the Object Management Group to establish a standard for fault tolerance for CORBA [21, 23]. Currently, we are adapting the technology of the Eternal system to meet the requirements of that standard and to meet the needs of commercial CORBA applications.

References

[1] K. P. Birman and R. van Renesse, Reliable Distributed Computing with the Isis Toolkit, IEEE Computer Society Press, Los Alamitos, CA (1994).

[2] D. H. Craft, "A study of pickling," Journal of Object-Oriented Programming, vol. 5, no. 8, SIGS Publications, New York (January 1993), pp. 54-66.

[3] M. Cukier, J. Ren, C. Sabnis, W. H. Sanders, D. E. Bakken, M. E. Berman, D. A. Karr and R. E. Schantz, "AQuA: An adaptive architecture that provides dependable distributed objects," Proceedings of the IEEE 17th Symposium on Reliable Distributed Systems, West Lafayette, IN (October 1998), pp. 245-253.

[4] P. Felber, B. Garbinato and R. Guerraoui, "Designing a CORBA group communication service," Proceedings of the IEEE 15th Symposium on Reliable Distributed Systems, Niagara on the Lake, Canada (October 1996), pp. 150-159.

[5] M. Gagliardi, R. Rajkumar and L. Sha, "Designing for evolvability: Building blocks for evolvable real-time systems," Proceedings of the IEEE 1996 Real-Time Technology and Applications Symposium, Brookline, MA (June 1996), pp. 100-109.

[6] V. Kalogeraki, L. E. Moser and P. M. Melliar-Smith, "Dynamic modeling of replicated objects for dependable soft real-time distributed object systems," Proceedings of the IEEE 4th Workshop on Object-Oriented Real-Time Dependable Systems, Santa Barbara, CA (January 1999), pp. 48-55.

[7] V. Kalogeraki, P. M. Melliar-Smith and L. E. Moser, "Using multiple feedback loops for object profiling, scheduling and migration in soft real-time distributed object systems," Proceedings of the IEEE 2nd International Symposium on Object-Oriented Real-Time Distributed Computing, Saint Malo, France (May 1999), pp. 291-300.

[8] A. Kanevsky, P. C. Krupp and P. J. Wallace, "Paradigm for building robust real-time distributed mission-critical systems," Proceedings of the IEEE 1995 Real-Time Technology and Applications Symposium, Chicago, IL (May 1995), pp. 33-40.

[9] K. P. Kihlstrom, L. E. Moser and P. M. Melliar-Smith, "Solving consensus in a Byzantine environment using an unreliable fault detector," Proceedings of the International Conference on Principles of Distributed Systems, Picardie, France (December 1997), pp. 61-75.

[10] K. P. Kihlstrom, L. E. Moser and P. M. Melliar-Smith, "The SecureRing protocol for securing group communication," Proceedings of the IEEE 31st Hawaii International Conference on System Sciences, vol. 3, Kona, HI (January 1998), pp. 317-326.

[11] S. Landis and S. Maffeis, "Building reliable distributed systems with CORBA," Theory and Practice of Object Systems, vol. 3, no. 1 (April 1997), pp. 31-43.

[12] P. M. Melliar-Smith, L. E. Moser, V. Kalogeraki and P. Narasimhan, "The Realize middleware for replication and resource management," Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, Middleware '98, The Lake District, England (September 1998), pp. 123-138.

[13] L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, R. K. Budhia and C. A. Lingley-Papadopoulos, "Totem: A fault-tolerant multicast group communication system," Communications of the ACM, vol. 39, no. 4 (April 1996), pp. 54-63.

[14] L. E. Moser, P. M. Melliar-Smith and P. Narasimhan, "Consistent object replication in the Eternal system," Theory and Practice of Object Systems, vol. 4, no. 2 (1998), pp. 81-92.

[15] L. E. Moser, P. M. Melliar-Smith, R. R. Koch and K. Berket, "A group communication protocol for CORBA," Proceedings of the 1999 ICPP International Workshop on Group Communication, Aizu, Japan (September 1999), pp. 30-36.

[16] P. Narasimhan, K. P. Kihlstrom, L. E. Moser and P. M. Melliar-Smith, "Providing support for survivable CORBA applications with the Immune system," Proceedings of the IEEE 19th International Conference on Distributed Computing Systems, Austin, TX (May/June 1999), pp. 507-516.

[17] P. Narasimhan, L. E. Moser and P. M. Melliar-Smith, "Using interceptors to enhance CORBA," IEEE Computer, vol. 32, no. 7 (July 1999), pp. 62-68.

[18] P. Narasimhan, L. E. Moser and P. M. Melliar-Smith, "Replication and recovery mechanisms for strong replica consistency in reliable distributed systems," Proceedings of the ISSAT 5th International Conference on Reliability and Quality in Design, Las Vegas, NV (August 1999), pp. 26-31.

[19] P. Narasimhan, L. E. Moser and P. M. Melliar-Smith, "Enforcing determinism for the consistent replication of multithreaded CORBA applications," Proceedings of the IEEE 18th Symposium on Reliable Distributed Systems, Lausanne, Switzerland (October 1999), pp. 263-273.

[20] E. Nett, M. Gergeleit and M. Mock, "An adaptive approach to object-oriented real-time computing," Proceedings of the IEEE 1st International Symposium on Object-Oriented Real-Time Distributed Computing, Kyoto, Japan (April 1998), pp. 342-349.

[21] Object Management Group, "Fault Tolerance for CORBA," OMG Technical Committee Document orbos/98-10-08 (October 1998).

[22] Object Management Group, "The Common Object Request Broker: Architecture and Specification," 2.3 edition, OMG Technical Committee Document formal/98-12-01 (June 1999).

[23] Object Management Group, "Fault Tolerant CORBA," OMG Technical Committee Document orbos/99-10-05 (October 1999).

[24] J. Schonwalder, S. Garg, Y. Huang, A. P. A. van Moorsel and S. Yajnik, "A management interface for distributed fault tolerance CORBA services," Proceedings of the IEEE 3rd International Workshop on Systems Management, Newport, RI (April 1998), pp. 98-107.

[25] E. Shokri, H. Hecht, P. Crane, J. Dussault and K. H. Kim, "An approach for adaptive fault tolerance in object-oriented open distributed systems," Proceedings of the IEEE 3rd International Workshop on Object-Oriented Real-Time Dependable Systems, Newport Beach, CA (February 1997), pp. 298-305.

[26] R. van Renesse, K. Birman, M. Hayden, A. Vaysburd and D. Karr, "Building adaptive systems using Ensemble," Software - Practice and Experience, vol. 28, no. 9 (July 1998), pp. 963-979.

[27] A. Vaysburd and K. Birman, "The Maestro approach to building reliable interoperable distributed applications with multiple execution styles," Theory and Practice of Object Systems, vol. 4, no. 2 (1998), pp. 73-80.

[28] J. Wensley, P. M. Melliar-Smith, et al., "SIFT: Design and analysis of a fault-tolerant computer for aircraft control," Proceedings of the IEEE, vol. 66, no. 10 (October 1978), pp. 1240-1255.