
Autonomic Computing – a method for automated systems management

Systems management using distributed local control

Patrick Hinnelund, [email protected]

TRITA-NA-Eyynn

Master’s Thesis in Computer Science (20 credits) at the School of Computer Science and Engineering,

Royal Institute of Technology, March 2004. Commissioned by Sverker Janson, SICS

Supervisor at Nada: Inge Frick. Examiner: Stefan Arnborg


Abstract

Autonomic Computing is a long-term IBM initiative aiming for a unification of research on areas related to computer system self-management. Methods of automated systems management have been studied at SICS (the Swedish Institute of Computer Science), mostly from the decision method perspective. The idea is to unify the handling of system management issues such that system complexity decreases and system autonomy increases.

In this thesis we study a control method for making automated distributed management decisions. We also study the consequences of externalised management on services and the system architecture. A simulation experiment is performed to evaluate the suggested control method. We assume a distributed, dynamic and unreliable environment. The automated manager’s goal is to compensate for service disturbance caused by the environment. The automated manager needs 1) a system view classifying the current system state based on local knowledge, 2) a learner providing an optimal decision policy, and 3) an adaptive service architecture with abilities to compensate for disturbances.

As a result, we have found that one of the greatest challenges to automated systems management is the creation of an architecture allowing adaptation and configuration of services. The more issues that are managed externally, the more constraints are placed on the internal structure of services.


Autonomic Computing – a method for automated systems management

Systems management using distributed local control

Summary

Autonomic Computing is a long-term initiative from IBM that aims to unify research in the area of self-managing computer systems. Methods for self-management have been studied at SICS (the Swedish Institute of Computer Science), mainly with respect to decision methods. The idea is to unify the handling of systems management issues so that system complexity decreases and autonomy increases.

In this report, a control method for automated distributed management decisions is studied. We also study the consequences of external management for services and the system architecture. A simulation experiment is performed to evaluate the suggested control method. We assume that the system environment is distributed, dynamic and unreliable. The automated system manager’s goal is to compensate for disturbances in the service caused by changes in the system environment. The system manager needs 1) a view of the system that classifies the system state based on local knowledge, 2) a learning service that provides an optimal decision policy, and 3) an adaptive service architecture with properties for compensating for the disturbances.

As a result, we have found that one of the greatest challenges for automated systems management is the creation of an architecture that enables adaptation and configuration of services. The more properties that are managed externally, the more constraints are required on the internal structure of services.


Contents

1 Introduction and Motivation
1.1 Thesis purpose
1.2 Problem description
1.3 A solution approach using control problem models
1.4 Thesis outline

2 Background and Related Work
2.1 Autonomic Computing
2.2 Signs of Autonomic Computing
2.3 Recovery Oriented Computing
2.4 Externalised architecture adaptation
2.5 Summary

3 Systems Management as Automatic Control
3.1 System environment model
3.2 Challenges of system management
3.3 The control problem point of view
3.4 Modeling a control problem for systems management
3.5 Summary

4 Control Mechanisms for System Adaptation
4.1 System model
4.2 Installing services
4.3 Adaptation of dataflow
4.4 Redundancy
4.5 Upgrading
4.6 Summary

5 An Experiment
5.1 Scenario description
5.2 The control problem
5.3 Experiment setup
5.4 Experiment execution
5.5 Resulting data
5.6 Analysis
5.7 Discussion

6 Discussion
6.1 Thesis summary
6.2 What has been achieved?

References


Chapter 1

Introduction and Motivation

Developers of large-scale applications have experienced increased complexity in their software systems due to the relentless integration of services. This affects not only software developers, but also the end users who have to manage the software systems. The increasing complexity and its consequences were observed by IBM, which launched the Autonomic Computing initiative as a countermeasure in 2001.

Autonomic Computing is a vision striving for system self-management with respect to four areas: system configuration, protection, healing and optimisation. All of these areas are under constant investigation by researchers and have been for a long time. Autonomic Computing is an attempt to unify related research on areas of computer self-management.

Self-management can, in general terms, be achieved in three steps. First, an information infrastructure is required to provide sufficient information for system awareness. In the second step, the awareness triggers decisions, deduced using system knowledge. Third, the decisions taken are executed by exploiting the adaptive capabilities of the system.
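As a deliberately minimal illustration of these three steps, the following Python sketch wires them into a management loop; the observe, decide and execute callbacks are illustrative placeholders, not part of any system described in this thesis.

```python
# Minimal sketch of the monitor-decide-execute loop described above.
# The callbacks are illustrative placeholders.

def management_loop(observe, decide, execute, keep_running):
    while keep_running():
        state = observe()        # step 1: awareness via the information infrastructure
        action = decide(state)   # step 2: decision deduced from system knowledge
        execute(action)          # step 3: exploit the system's adaptive capabilities
```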

In this thesis we approach system self-management using the perspective of control problems, illustrated in Figure 1.1. A controller’s objective is to keep the system within given constraints by reacting to changes and anticipating the coming needs of the system. System constraints follow from limitations in computing power, storage and bandwidth, as well as users’ and services’ expectations. The disturbances causing the system to deviate may be normal usage, abnormal usage, crashes, bugs, modifications of hardware or software, etc.

In distributed systems, control cannot always be efficiently achieved using centralised methods, due to communication delays, costs, scalability, etc. For this reason, it is desirable to find a method for efficient and optimal decentralised decision-making.

There are some questions related to this approach. Looking at automatic management of distributed systems, how are the control objectives formulated, and how is the control problem identified? How is system knowledge represented to map awareness information into adaptation control?

Figure 1.1. A control system. The purpose of the controller is to mitigate the effect of disturbances.


1.1 Thesis purpose

The project described in this thesis was initiated by the Swedish Institute of Computer Science (SICS) in the Intelligent Systems Laboratory (ISL). The project commissioner was Sverker Janson. The project collaborators at ISL were Joakim Eriksson and Niclas Finne.

The thesis purpose is to investigate a method of system self-management for distributed systems. This work has four phases. First, the Autonomic Computing initiative and related work are surveyed. Second, systems management challenges are presented, and a method for automated decision-making for systems management is suggested. Third, a model of system adaptation relevant to autonomic systems management is suggested. Fourth, an experiment is performed to concretise and analyse the challenges and problems of the described methods.

1.2 Problem description

We address the problem of ensuring dependability and performance of networked systems in the face of uncertain, and often largely unknown, environments.

An increasingly popular approach is to apply concepts and techniques from automatic control and related fields. A simple example is managing the CPU load and memory use of a web server using traditional linear control. Controlling a networked system is much harder. The system components all have local observation and control points and incomplete knowledge of the overall system state. The effects of controls are uncertain, interact, and may appear long after their application.


For such problems, recent advances in approximate dynamic programming (a field at the intersection of control, operations research, and computer science) offer powerful new tools.

Problem statement. Is it possible to achieve good systems management using distributed control in unreliable environments?

1.3 A solution approach using control problem models

Control systems (Figure 1.1) are well studied in control theory and in the contexts of dynamic programming and Markov decision processes. The generality of the methods used in these disciplines allows them to be applied to a large class of problems. Systems management problems likewise form a large class of heterogeneous problems. The idea behind using the control problem point of view is to unify the handling of system management issues such that system complexity decreases and system autonomy increases.

We would like to develop system management knowledge without explicit programming. With a feedback mechanism of rewards (from a system critic), an optimal strategy is found using machine learning. System knowledge will thus be characterised by the critic, and will take the form of a mapping from an observed system state to a control output. Thus, in this method, system knowledge is represented as a function from system states to adaptation actions.
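One standard learner of this kind is tabular Q-learning, sketched below. The thesis does not commit to a particular algorithm at this point, so this is an illustrative instance: the critic's reward drives the updates, and the learned table is exactly the mapping from abstracted system states to adaptation actions. All parameter values are arbitrary.

```python
import random
from collections import defaultdict

class QLearner:
    """Tabular Q-learning: learns a mapping from abstracted system
    states to adaptation actions, driven only by the critic's rewards."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def policy(self, state):
        if random.random() < self.epsilon:  # explore occasionally
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])  # exploit

    def update(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error
```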

Our hypothesis is that systems can manage themselves successfully using this general-purpose problem solving method. If this is found to be true, a substantial number of system self-management problems should be solvable without writing new algorithms for every new decision problem. Instead, a new decision problem is introduced by including new control input or new control output and then creating a new decision policy. If a new aspect of the system is to be managed, the critic is modified and existing policies reevaluated.

1.4 Thesis outline

Chapter 2 provides a background of Autonomic Computing and work related to self-management and adaptable architectures.

Chapter 3 discusses systems management as automatic control. It first proposes a system environment model and follows up with some management challenges inherent in the environment and system model. The general control problem approach is covered, followed by how to design a control problem for systems management. The suggested control problem addresses the challenges discussed earlier in the chapter.


Chapter 4 discusses control mechanisms in the adaptive architecture for software services. It first proposes a system model based on the workflow process model, and later outlines mechanisms supporting the decision process.

Chapter 5 describes an experiment based on the topics covered in Chapters 3 and 4. It is based on a scenario with properties from the system environment model. The scenario, experiment setup and execution are described. The resulting data is presented, analysed and discussed.

Chapter 6 concludes the thesis. First is a discussion answering some of the questions asked in the introduction. Last, a summary of our achievements is presented.


Chapter 2

Background and Related Work

This chapter surveys the Autonomic Computing concept, targeting the following questions: What is the Autonomic Computing vision? What are the technological signs of a coming Autonomic Computing era? What work is related to system self-management and Autonomic Computing?

2.1 Autonomic Computing

In October 2001, IBM released a manifesto [13] describing the vision of Autonomic Computing. The purpose is to counter the complexity of software systems by making systems self-managing. A paradox has been noted: to achieve this, systems need to become even more complex. The complexity, it is argued, can be embedded in the system infrastructure, which in turn can be automated. The similarity of the described approach to the autonomic nervous system of the body, which relieves our consciousness of basic control, gave birth to the term Autonomic Computing.

In Kephart and Chess’s vision of Autonomic Computing [17], which is cited throughout the rest of this section, the following four system abilities are discussed:

• Self-configuring

• Self-optimising

• Self-healing

• Self-protecting

Self-configuration involves automatic incorporation of new components and automatic component adjustment to new conditions. Self-optimisation on a system level is about automatic parameter tuning of services. A suggested method to do this is to explore, learn and exploit. Self-healing from bugs and failures can be accomplished using components for detection, diagnosis and repair.

Figure 2.1. An autonomic element (adopted from [17]).

Self-protection of systems will prevent large-scale correlated attacks or cascading failures from permanently damaging valuable information and critical system functions. It may also act proactively to mitigate reported problems. It is claimed that as these aspects become properties of a general architecture, they will merge into a single self-maintenance quality.

The architecture of an Autonomic Computing system will be a collection of components called autonomic elements (Figure 2.1), which encapsulate managed elements. The managed element can be hardware, application software or an entire system. An autonomic element is an agent, managing its internal behaviour and its relationships with others in accordance with policies. It is driven by goals, by other elements, or by contracts established through negotiation with other elements. System self-management will arise both from interactions among agents and from agents’ internal self-management.

An autonomic multi-agent system will run in agent-oriented architectures of yet unknown proportions. These architectures present numerous engineering challenges to be solved, involving agent lifecycles and relationship management such as negotiation and trust, just to mention a few. There are also scientific challenges such as induction of global behaviour, control theories, machine learning, etc.

At IBM, the Autonomic Computing initiative spans all levels of computer management, from the lowest levels of hardware to the highest levels of software systems. The IBM Systems Journal has collected some of this work. On the hardware level, systems are dynamically upgradable [15]. On the operating system level, active operating system code is replaced dynamically [2]. Some work on autonomic middleware can be found at other sources [16] [30] [6]. On the application level, databases self-validate optimisation models [21] and web servers are dynamically reconfigured by agents to adapt service performance [9].


These examples illustrate that, although some of the characteristic Autonomic Computing features seem far away, some of the ideas from Autonomic Computing have already been put into practice. In the next section, we examine other current technological signs of a coming new computing era.

2.2 Signs of Autonomic Computing

A broad and general base of supporting technologies is the foundation of the grand and pervasive vision of Autonomic Computing. Many recent advances in technology are in favour of Autonomic Computing and similar initiatives.

The Grid computing environment provides many services [14] which are useful for the development of autonomic software. The services themselves present autonomic behaviour [18] [1]. The Grid thus becomes a natural platform for Autonomic Computing systems. The Grid is also a platform for IBM applications, which are being reconciled by the autonomic initiative. There is therefore undoubtedly a close relationship between the two initiatives.

With the release of Java 2 SE 1.5, an infrastructure for monitoring virtual machine and operating system properties will be made available [29]. Since self-awareness is the basis of Autonomic Computing, Java 2 SE 1.5 is a contributing platform for the development of applications with autonomic behaviour. Also, the ability to run separate Java virtual machines efficiently will provide a basis for the isolation of Java processes and promote self-healing through fast restarts of erroneous processes.

There are more signs of the autonomic future. The service-based system model has been developing for years with technologies such as Corba, Enterprise Java Beans (EJB) and .NET. With today’s level of maturity, these technologies provide a foundation for autonomic systems as depicted by Kephart. The Forrester report on “The Fabric Operating System” [26] predicts that dedicated service-oriented systems will be running enterprise applications before 2007, and provides a roadmap for companies on how to prepare their software systems for enablement on the coming computing platforms.

The Dynamic Systems Initiative (DSI) from Microsoft is mentioned in the Forrester report. It has a purpose similar to the Autonomic Computing vision: to deliver a coordinated set of solutions that simplify and automate how businesses design, deploy, and operate distributed systems [22]. The idea of tying business policies directly to IT systems is also similar to Autonomic Computing. DSI is likewise claimed to be the result of a commitment to reduce complexity, analogous to the Autonomic Computing initiative.

This short reflection reveals that there exist technological bases for Autonomic Computing systems and that important institutions are strongly convinced that future computing will be organised in ways similar to the Autonomic Computing vision. Next, we look at work related to Autonomic Computing.



2.3 Recovery Oriented Computing

Recovery Oriented Computing (ROC) [25] is frequently cited in Autonomic Computing contexts. The ROC hypothesis, “Repair fast to improve dependability and to lower cost of ownership,” embraces the fact that errors (human, software and hardware) are facts that have to be coped with. ROC provides several techniques for this purpose, some of which are covered below.

System reboot is sometimes the only way to restore system operation. Rebooting reclaims resources and returns software to its initial state. Rebooting should be considered a normal part of system operation, and services could proactively terminate themselves (or restart) to remove latent errors. Systems tolerating bounded partial restarts can recover very fast.

A reversible system allows an operator to undo recent actions, repair the system at a previous point of execution, and replay the latest actions. This ability provides a margin of safety against operator slips and mistakes.

Fault injection allows for software recovery experiments and validation of error recovery mechanisms. The ROC team found four successful practices: resource preallocation, graceful degradation, selective retry and process pools.

2.4 Externalised architecture adaptation

Adaptation of software architectures using externalised managers has been explored in numerous environments, mainly to provide self-healing of large systems. Three primitive parts seem to constitute an adaptive software architecture. The first part, the architecture itself, defines the adaptive operations and provides basic monitoring. The second part is an integrated information infrastructure which exchanges monitored information and adaptation commands. The third part is an analysis and decision mechanism maintaining the architecture.

IBM’s Tivoli monitoring is a systems manager using an expert system to isolate problems and correct them on a local machine [20]. Here, the architecture is the operating system itself. The monitoring software maintains an external model which is accessed by scripts that isolate problems and repair faults.

Garlan and Schmerl provide a more general approach for externalised adaptation of distributed applications [8]. The component infrastructure is mapped to an architecture model using styles [10], which allows for generalised checking of consistency, constraints and dependencies. Repair scripts are executed when system properties fall outside acceptable bounds.


The approach is general enough to manage any software system architecture, but the lack of information and adaptation support in some systems requires some amount of manual adjustment in the implementation step.

In [23] the authors argue that components designed for blind communication, being unaware of their location in the network and of the exact identity of their communication counterparts, may be moved between hosts and have their connection paths altered by the runtime system. Using this approach, the runtime adaptation system imposes constraints on the design of components, in contrast to the other adaptation schemes mentioned thus far, where the adaptation system accommodates the components as they are.

Externalised adaptation favours a centralised system organisation. In centralised model-based adaptation, an architecture manager maintains, analyses and corrects the system model. In the Willow system for critical networked infrastructures [19], a central priority enforcer is used to avoid conflicts caused by simultaneous asynchronous adaptations.

The area of self-organising architectures provides a different perspective on the topic. In a study by Georgiadis, Magee and Kramer, architectural constraints are imposed on the components, and the components are responsible for arranging themselves accordingly [11]. Responsibility is thus shifted towards the component developer, making services less dependent on features of the surrounding system architecture. Broadcasting is used to maintain a correct system view, and architecture knowledge is self-contained in all components. A self-organising architecture approached this way scales poorly.

Aside from organising and managing the system architecture, the flow of data in a distributed environment can also be organised and managed. A workflow process is a series of steps that involves coordination of tasks, handoff or routing of information, and synchronisation of activities [7]. A workflow management system manages connections between tasks located on distributed hosts. A workflow management model defined by Shrivastava and Wheater supports transactional dynamic reconfiguration of workflow schemas and executing workflow instances [27]. Workflow systems are composed from other applications and systems.

2.5 Summary

Autonomic Computing attempts to unify research on areas of self-management. The vision of Autonomic Computing suggests a service-centric multiagent system as a platform for self-managing systems. The Autonomic Computing initiative has technological support at a conceptual level, and some early results have been presented by IBM. Autonomic Computing promises dependability and adaptability, and some influential research from these two areas has been briefly presented.


Chapter 3

Systems Management as Automatic Control

This chapter describes an approach to systems management based on a method of external control. The purpose of the controller is to regulate the system as disturbances from the environment cause the system to deviate from its optimal performance. The controller can manage systems where system information is accessible and external control is permitted. It therefore depends on system architectures with well-defined capabilities and interaction protocols.

A model of the system environment, and the challenges of managing a system operating in it, are presented first. The control problem point of view of systems management is then described in general terms. Last, a model of a control problem addressing the challenges is presented.

3.1 System environment model

The system environment is categorised into three areas: the network environment, the resource environment and the service environment.

The network environment

The Internet and local area network environments enable hosts to establish links via the network protocol. These links are typically robust and have no communication cost. With the coming of ubiquitous computing, mobile units will create networks where these properties are not necessarily fulfilled. We believe a generalised network environment has the following properties:

• Links have limited bandwidth.


• There is a cost when sending data on a link.

• There is a delay when sending data on a link.

• Network endpoints might be sporadically connected.

The network environment is the set of all network properties, physical links, logical connections, bandwidths, disturbances, etc. If network links are sparse and non-robust, network reconfigurations will frequently affect the connection properties between components. But even in a dynamic network, some parts of the network are, for periods of time, more static than others, presenting opportunities for collaborative decision-making.

The resource environment

A host is assembled from resources, which can be physical or virtual. The resource environment consists of the current resource configuration on all system hosts. Components must allocate resources to run, so service deployment is constrained by the availability of resources. At least two resources are included on a host: a limited amount of storage capacity and a limited amount of computing power.

Hardware failures may cause resources or hosts to crash. As a consequence, information that has not been processed, sent and properly received by another host may be lost. It is uncertain how often this is expected to happen, and how large the affected parts of the system will be. A crashing host causes the system to lose its network links, resources and components. Hosts may install, remove, upgrade or downgrade resources at any time, possibly causing components to crash.

Resources are controlled by local applications, but can be made available to other applications via network connections. There are two ways to do this. In the first case, the local application provides a service to the remote applications. In the other case, the resource is emulated on the remote host, which incurs a latency penalty and increased network load. Remote network storage is an example of a remote resource or service.

The service environment

As components are created on hosts, their interfaces are published to other applications and services. The service environment for a host is the set of services found locally or remotely. The properties of the service environment are therefore affected by the network environment and the resource environment on the service’s host. The service environment describes properties such as connection bandwidth, connection latency, estimated time to task completion, etc., for every pair of components.


3.2 Challenges of system management

What are the challenges of system management in the described environment?

• When hosts fail or become isolated, missing components need to be installed and existing components may need to be redistributed. Likewise, the appearance of a new host may present an opportunity for a service to improve its performance. A challenge when distributing components is to utilise the resource and network environments optimally while at the same time creating a workflow that tolerates resource and network disturbances.

• Running distributed components will incur costs, and communications might be limited and erratic. Adapting service workflow to variations in the service environment will be an important management concern.

• When links and hosts fail regularly, there may be a need to use redundant or replicated components to process tasks in parallel. Redundant components can be used for load balancing purposes.

The relation between these challenges and the Autonomic Computing vision will be discussed next. Recall the four main points of the vision: self-healing, self-protection, self-configuration and self-optimisation.

A self-healing system could be expected to heal program parts that malfunction. From the Recovery Oriented Computing (ROC) experiments, we find that the most reliable way to restore system operation is to reboot. If a failing component can be restarted, the interrupted task can also be restarted, provided the input data to the task is managed by an external service and the component state was saved. In a workflow process, the input and output data are managed and the interactions between components are supervised, allowing an interrupted task to be recovered.
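A minimal sketch of this restart-based recovery, assuming (as above) that the task's input is held by an external service and so survives the crash; the function names and the retry limit are illustrative.

```python
# ROC-style recovery by restart: re-run a failed task from its
# externally managed input. Names and limits are illustrative.

def run_with_restart(task, managed_input, max_restarts=3):
    for _ in range(1 + max_restarts):
        try:
            return task(managed_input)  # input survives the component crash
        except Exception:
            continue                    # reboot the component and retry
    raise RuntimeError(f"task failed after {max_restarts} restarts")
```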

It is, however, difficult to handle dynamic workflows in a generalised (perhaps peer-to-peer) network. According to our network environment model, there may be costs and delays associated with communications, making a centralised or synchronised workflow view disadvantageous. It might even be impossible to maintain a complete workflow view if parts of the network lose connection. It is a challenge to find a reliable decentralised workflow model using only partial knowledge.

A self-protecting ability is to prevent the loss of data, tasks or services. The use of redundant or replicated components allows tasks to tolerate some failures in components, connections, and hosts. For a task whose component state needs to be synchronised, either the entire workflow is required to pass through every replicated component, or the state must be transferred and synchronised between the components. We consequently face a great challenge dealing with component synchronisation, unless components are stateless.


Self-configuring abilities could be to automatically deploy services and to discover existing running services. A purpose of the workflow manager will be to create workflows by interconnecting existing components. A challenge is how to resolve broken workflows. If the missing component is deployed or redeployed, the state of an existing (but broken or unavailable) component must not disappear. Imagine a populated database disappearing and a new empty database appearing as its replacement. Again, this is not a problem for stateless components.

Self-optimising abilities are to maximise performance and value metrics provided by a service. Common optimisation features related to the resource and network environments are to balance workload and to avoid network congestion. We assume the optimisation is performed by setting or tuning service parameters.

These are some of the challenges of automatic systems management in our suggested system environment model. Next, we describe a model in which we will capture and deal with these challenges.

3.3 The control problem point of view

Our approach to solving the management challenges is to use the control problem point of view. The system to control consists of components running in an adaptive architecture, monitored by an information infrastructure. The system receives disturbances from the environment. The controller’s purpose is to compensate for disturbances causing degraded service performance or complete failure. The control signal targets both the services and the adaptive architecture surrounding them.

An ordinary control problem model is illustrated in Figure 3.1. An ordinary control method measures system properties and derives a control error using a set of reference values. The objective of the controller is to minimise the control error. The control solution can be found analytically, as was done in the control of a Lotus Notes server using integral control [24].
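To make the ordinary case concrete, here is a minimal sketch of an integral controller of the kind referred to above; it is not the controller of [24], and the setpoint and gain values are illustrative. The control output accumulates the control error over time, which drives the steady-state error towards zero.

```python
# Minimal integral controller: the control signal is the integral gain
# times the accumulated control error. Values are illustrative.

class IntegralController:
    def __init__(self, setpoint, ki):
        self.setpoint = setpoint  # reference value, e.g. a target queue length
        self.ki = ki              # integral gain
        self.accumulated_error = 0.0

    def control(self, measurement, dt):
        error = self.setpoint - measurement
        self.accumulated_error += error * dt
        return self.ki * self.accumulated_error  # control signal to the actuator
```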

In the case of systems management, problems have large dimensionality, and an explicit problem model can be difficult to create and reason about. Systems might have complex or unclear relations between cause and effect. For ordinary control methods to be stable, information must be recent and decisions taken frequently. This is hard to guarantee in the general case. Therefore, we have chosen a slightly different model. We wish to optimise system operation rather than a specific set of performance values. Consequently, the control error is replaced by a more general system model. The objective to minimise the control error is replaced by the more general objective to maximise system quality. The suggested model is illustrated in Figure 3.2.

The system quality is determined by a critic overseeing the system state. The reward might be delayed, thus a learner is needed to resolve the temporal credit assignment problem. The learner observes the system state, explores actions and observes the reward. This knowledge is then used by the learner to deduce a decision policy for the controller.

Figure 3.1. An ordinary control problem.

Figure 3.2. A generalised control problem for systems management.


Next, we concretise the control problem by identifying the knowledge and information needed to handle the challenges of systems management discussed earlier in this chapter.

3.4 Modeling a control problem for systems management

Three issues define our control problem. First, the utility of the system state must be evaluated. Second, the system state must be abstracted to be manageable by the decision algorithm. Third, the control actions define the system adaptation capabilities.

Evaluating system state

The control problem requires a utility value or reward to be calculated for the evaluation of system properties. There can be several critics, each having a bias towards certain system aspects. A critic’s bias is determined by bias parameters. The following three (high-level) properties could be considered interesting for a system manager:

• Reliability

• Efficiency

• Service cost

The evaluation measures depend on the view. The ordinary control problem in Figure 3.1 uses continuous performance measures, while the critic in the systems control problem in Figure 3.2 uses discrete rewards triggered by conditions. The following measures can be used to evaluate the properties above:

• A continuous reliability measure could be the ratio of completed to total assignments. A discrete reward can be triggered by a completed assignment.

• A continuous efficiency measure could be the average assignment time to completion. A discrete reward can be inversely proportional to the assignment completion time.

• A cost metric is generic and can be used for almost any system aspect in both problem views. Cost can be measured using time, money, battery, occupied bandwidth capacity, etc.

The utility or reward may be a linear function of the performance properties, with the bias parameters as coefficients. In contrast to the ordinary control problem, the critic may reinforce decisions based on delayed information. It also enjoys the freedom of rewarding overall good service performance in cases where the contribution made by each of the participating components is difficult or impossible to determine.


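As a sketch of such a critic, the following computes the reward as a linear combination of the three properties above; the bias coefficients (the critic's bias parameters) and the sign convention for cost are illustrative.

```python
# Linear reward: bias coefficients weight the performance properties.
# All values are illustrative; cost enters with a negative weight.

def reward(reliability, efficiency, cost, bias=(1.0, 0.5, 0.2)):
    w_rel, w_eff, w_cost = bias
    return w_rel * reliability + w_eff * efficiency - w_cost * cost
```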

Abstracting system state

The dimensionality of the total system state is enormous, and the state needs to be abstracted before it can be handled by a decision mechanism. The information is abstracted in three steps. The first abstraction is implied by the scope of system monitoring. The second abstraction is implied by the scope of distribution by the information infrastructure. The third abstraction (the abstraction box in Figure 3.2) is a selection, by design, of the information needed by the controller to make decisions.

Monitoring is limited for several reasons. Some existing systems have barricaded themselves against external insight. Another reason is performance. Monitoring should be simple, cheap, exact and provide up-to-date information. Finding a unified way to access and understand the information provided is a basis for extensible management.

The distribution of information by the information infrastructure needs to be limited for scalability purposes. The information infrastructure will be truly scalable only when adopting the end-to-end principles as discussed by Beck, Moore and Plank in the case of distributed computations [4].

The design abstraction is a part of the manager’s analysis before a decision is taken by the system. It focuses on the information relevant to the decision at hand. A problem representation that clearly separates circumstances causing the critic to deliver a reward from circumstances where it does not will allow the manager to make meaningful decisions in each abstracted state.

As an example, on the service workflow level, the abstraction includes a classification of components in the neighbourhood. Components will have certain properties such as bandwidth, reliability, latency, cost, computing power, etc. The workflow topology is created using decisions based on these properties. The variances of the properties can be used to find static situations in the configuration, which is useful for creating groups which can coordinate their decisions and manage themselves on a higher level, possibly creating dynamic decision hierarchies.
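A sketch of such a classification, discretising a neighbour's continuous properties into one of a small number of abstracted states for the decision policy; the thresholds and class names are illustrative.

```python
# Design abstraction: map continuous neighbour properties onto a small
# discrete state space. Thresholds and labels are illustrative.

def classify_neighbour(bandwidth_kbps, latency_ms, reliability):
    bw = "high" if bandwidth_kbps > 512 else "low"
    lat = "near" if latency_ms < 50 else "far"
    rel = "stable" if reliability > 0.95 else "flaky"
    return (bw, lat, rel)  # one of 2 * 2 * 2 = 8 abstracted states
```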

It is not necessary to create decision hierarchies to make global decisions. In some cases, globally optimal decisions can be made by refinement or successive aggregation of data. For example, a network node routing a message along the shortest path through a network does not need complete network knowledge. It only needs to know the shortest distance to the destination from each of its neighbouring nodes to make an optimal decision. Such basic principles of information aggregation should also be supported by the information infrastructure.
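The routing example can be made concrete with a single aggregation step, as in distance-vector routing: each node picks the neighbour minimising its own link cost plus that neighbour's advertised distance. The dictionary-based interface is illustrative.

```python
# Aggregation principle: choose the optimal next hop from local
# knowledge only (each neighbour's advertised distance to the goal).

def best_next_hop(link_cost, neighbour_distance):
    """Both arguments are dicts keyed by neighbour id."""
    return min(neighbour_distance,
               key=lambda n: link_cost[n] + neighbour_distance[n])
```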


Control mechanisms

The last piece of the control problem design is to define the actions with which the decision-maker exercises system control. This assumes that components relinquish some of their control to external managers, or that they are designed to conform to an adaptive system structure. Components and services fulfilling these criteria can benefit from the surrounding protecting, healing, optimising and configuring mechanisms to survive and function in the dynamic environment. The control mechanisms below target the challenges discussed earlier in this chapter, and they will be covered more extensively in the next chapter.

Automated installation of components on hosts enables new services to launch automatically. For a component to be installed, it typically requires some minimum host resources to be present. Likewise, a host may agree to install only a specific subset of components. A component may be installed automatically when an existing component disappears from the network, when the workload becomes too high for an existing component, or when redundancy is needed.

Failing hosts and links suggest that a protection mechanism be present, for instance support for service redundancy and replication. Redundancy control delegated to the system is undetectable by the components. It could be actuated by submitting a job to two separate processes on distinct hosts and later merging the results. As mentioned in the challenges section, stateless components can support this. If components require synchronisation, the synchronisation mechanisms must also be controlled by the system. It is a system challenge to prevent redundancy from overconsuming system resources when decision-making is distributed and information is inaccurate.

Workflow processes manage interactions and information exchange between complex services. One of the challenges found earlier in this chapter was to deliver data to its destinations. In a dynamic situation, the network and service environments might cause workflows to fail, implying that the workflow managers should be tolerant, having mechanisms to deal with uncertainties arising after a system disruption. The mechanisms for workflow management therefore include both installation and redundancy.

As computing systems and services evolve, they require components to be upgraded. A running service should upgrade without restarting and without manual configuration. Seamless upgrading is important because it promotes adaptivity.

3.5 Summary

In this chapter we approached systems management as automatic control. First, a system environment model was developed, with a network, resource and service environment. Second, we presented some challenges of systems management in the described environment, and related these challenges to the Autonomic Computing vision.


Third, the control problem point of view was examined. In the last section, this point of view was used to formulate a control problem. To solve the control problem, the system state needs to be evaluated and abstracted, and adequate control mechanisms need to be provided by the system architecture.


Chapter 4

Control Mechanisms for System Adaptation

This chapter describes the mechanisms surrounding the decision process. Compared with the autonomic element (Figure 2.1), this process has a slightly different slant: we use plan, analyse, decide and execute. Excluding the decision process described in the previous chapter, there are now three parts to consider. First, the planning phase determines the legitimate actions in the current situation. Then the analysis phase provides the information needed to decide upon the suggested actions. After the decision has been made, based on the information from the two preceding phases, the action is executed.

First, a simple system model adopted to support the adaptation is described. Second, a mechanism for distribution of components is discussed. Third, workflow management is an essential part of system adaptation. Fourth, redundancy is accomplished by workflow manipulation among several distributed components. Fifth, component upgrading requires security mechanisms to guarantee service correctness.

4.1 System model

Our selection of system model is motivated by the choice to maximally separate management from services. We use a workflow process model to allow component connections to be managed externally. Having control of component connections allows external managers to control several interesting aspects of services. The data of services participating in a workflow will be protected by the workflow managers, and the responsibility for successfully delivering a service will shift from the individual components to the service workflow manager.


Figure 4.1. Input and output states of a component. The graphical notation is adopted from [27]. The boxes represent states and the circles represent packets.

In our model, all components are considered to provide services. A compound service consists of several interacting components. Services can thus be constructed recursively. A compound service has a workflow schema describing the interactions between components. A component participating in a service is running a task.

A workflow instance is created for a particular service when it receives a request. A workflow instance creates a task instance on the components affected by the flow. It may also be possible to create multiple workflow instances in advance to allow a service to have continuous operation. This creates a stream of workflow instances, which is good for reliability and load balancing. The more fine-grained the service flow, the smaller the chance of the service becoming unbalanced, but the larger the management and communication overhead. Fine-grained computations are also potentially more reliable, since they are easier to clone.

A component may have several distinguishable input and output states (the boxes in Figure 4.1). When a host has received data packets (the circles) belonging to the same workflow instance and matching the input state of a component, the component starts the task, produces output, then terminates. The system recognises the output state and is responsible for delivering data to their destinations, using the information provided by the workflow schema. If several destinations match the criteria, the workflow manager decides which of these will be targeted. The decision is based on properties of the system and properties of the connections.
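A minimal sketch of the data structures this model implies: a schema naming components and their connections, and an instance tracking per-task progress. All field names are illustrative, not taken from any implementation in the thesis.

```python
# Sketch of the workflow model: a schema describes the service, an
# instance tracks one execution of it. Field names are illustrative.

from dataclasses import dataclass, field

@dataclass
class WorkflowSchema:
    components: list   # component identifiers
    connections: list  # (producer, consumer) pairs

@dataclass
class WorkflowInstance:
    schema: WorkflowSchema
    pending_inputs: dict = field(default_factory=dict)  # component -> arrived data
    completed: set = field(default_factory=set)         # components whose task finished
```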

In our approach, there are three assumptions simplifying adaptation:

• All tasks have finite input/output data which is acquired/produced in finite time.


• All workflow instances have finite computing paths from sources to destinations.

• The result of a workflow instance is a function of the explicitly declared source data only. It is impossible to change the result as a side effect of modifying global variables or using component states from previous tasks.

With the first assumption fulfilled, a component participating in a task will eventually finish. With the second assumption fulfilled, all workflow instances will eventually finish, given that the first assumption is fulfilled. With the third assumption fulfilled, concurrent tasks are processed in isolation from each other, and independently of their location.

Using this system model, we will now look at the control mechanisms for service installation, adaptation of dataflow, redundancy and upgrading.

4.2 Installing services

The workflow schema describes interactions among components participating in a service. The workflow schema must be distributed to a number of hosts before a workflow can be created. It is a challenge to do this in a scalable manner. Some are approaching this using information hierarchies [28] [5]. Finding an optimal component distribution is a global optimisation problem. In an information hierarchy, local decisions can improve the scalability of the optimisation problem. An example of this is the hierarchical iterative gather-compute-scatter algorithm for path-based deployment [12]. For a hierarchy-based solution to be scalable, the higher levels of the hierarchy must rarely be invoked.

The approach presented here is non-scalable and assumes that all hosts in an administrative system region decide to deploy the service. Some components already exist in the system when the service is launched, and if these are expensive to install, the launching service will probably build up around the existing components. The mechanism is illustrated in Figure 4.2 and elaborated below.

The purpose of the planning phase is to propose decision actions. A service S has a set of components C = {c_1, c_2, ..., c_n}. The combinations of components runnable on a host H form C_H, a subset of the powerset of C. The purpose of the planning phase for the installation problem is to find or estimate C_H. C_H depends on the resources on H, the constraints on each c_i, the availability of installation programs for the c_i, and the c_i already present but not uninstallable by the automated manager. Later, the decision process picks an element from C_H, and the appropriate components will be installed or uninstalled by the execution phase.
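A brute-force sketch of this planning step under a deliberately simple resource model (a single additive resource per host); the enumeration over the powerset mirrors the non-scalability conceded above, and all names are illustrative.

```python
# Planning phase sketch: enumerate C_H, the component combinations
# runnable on host H. One additive resource; names are illustrative.

from itertools import combinations

def runnable_configurations(components, demand, capacity):
    """components: ids; demand: id -> resource need; capacity: host limit."""
    valid = []
    for r in range(len(components) + 1):
        for subset in combinations(components, r):
            if sum(demand[c] for c in subset) <= capacity:
                valid.append(set(subset))
    return valid  # C_H: the decision process later picks one element
```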


Figure 4.2. A control mechanism for component installation.

The purpose of the analysis phase is to provide information to the decision process. The analysis phase provides an abstraction of the system state suitable for installation decisions. The information needed is the available components and their properties. The properties could be cost, reliability and processing rate. It is necessary to make a distinction between local and remote components, since it is costly to install and reconfigure components. The installation cost is also relevant for the decision.

When a decision has been made, the execution phase compares the current configuration with the suggested one and triggers a set of installation and uninstallation actions. The methods are invoked using a protocol for communication with the adaptive architecture. The execution will alter the host’s properties. For example, tasks might compute slower, free storage space might shrink, communication links might become constricted, etc. These changes might affect the overall system. Adaptations by other hosts are likely to follow.

4.3 Adaptation of dataflow

When a task has finished, the result needs to be delivered to other components. The planning phase provides a set of components which can receive the result. The decision problem is to find a subset of the components to notify, based on the system’s preference for reliability, efficiency and economy. The mechanism is illustrated in Figure 4.3.


Figure 4.3. A control mechanism for dataflow between components.

The planning phase uses information from the workflow system to find all sets of components required and able to receive the result. The planner must ensure that it provides the manager with alternatives that do not cause the service’s objective to fail.

The analysis provides performance estimates of the remote tasks and their connections. Such estimates can be cost, reliability (or probability of successful task termination), average CPU capacity (for load balancing purposes), etc.

The result of the decision is a set of destination components. The workflow management system executes the delivery of data to the selected components. This dynamic way of workflow processing is favourable in a dynamically changing environment where the task connections might be disconnected or affected during execution.
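A sketch of the decision step, scoring each valid receiver set from the planning phase by estimated reliability (the probability that at least one receiver succeeds, assuming independent failures) against total cost; the weights and the estimate interface are illustrative.

```python
# Dataflow decision sketch: pick the receiver set with the best
# reliability/cost trade-off. Weights and interfaces are illustrative.

def choose_receivers(candidate_sets, estimate, w_rel=1.0, w_cost=0.3):
    def score(receivers):
        p_all_fail = 1.0
        cost = 0.0
        for r in receivers:
            p_all_fail *= 1.0 - estimate[r]["reliability"]
            cost += estimate[r]["cost"]
        return w_rel * (1.0 - p_all_fail) - w_cost * cost
    return max(candidate_sets, key=score)
```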

Interesting possibilities arise if information value can be measured. When resources are meagre, information value gives the system the possibility to deliberately discard or delay less important information, to uphold service quality in the most demanding situations. Another method is to classify the importance of specific messages. Having separated this adaptation aspect from the service, the service needs a way of communicating this value or priority information to the managing service.


4.4 Redundancy

Redundancy is an ability which extends naturally from the two previously described mechanisms. Redundancy is accomplished by simply installing several components and, during service execution, distributing data to multiple receivers. If the redundant components require synchronisation, the state needs to be replicated.

The need to use redundancy in the first place is motivated by the fact that, under certain circumstances, information about data loss is not available due to partial observability of the workflow state. As an example, a task finishes computation and delivers its result to the next task on a remote host. Before the next task finishes and delivers a confirmation, the connection between the two hosts fails. The producer is now uncertain about the status of the workflow. Did the remote host crash, or did the task finish? If redundancy was used, the probability of two or more hosts failing is reduced considerably. Adequate decisions can be made if the probability of link and host failure can be estimated by the information infrastructure. Certain tasks may have a higher probability of failure than others, but environmental conditions also contribute.
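A worked version of this argument, assuming independent branch failures: with per-branch failure probability p, all k redundant branches fail with probability p^k, so a manager can pick the smallest replication degree meeting a target success probability. The target and cap below are illustrative.

```python
# Redundancy degree sketch: with independent per-branch failure
# probability p_fail, all k branches fail with probability p_fail**k.

def replication_degree(p_fail, target_success=0.99, k_max=10):
    k = 1
    while 1.0 - p_fail ** k < target_success and k < k_max:
        k += 1
    return k  # e.g. p_fail=0.1 gives k=2 (success probability 0.99)
```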

If a task result has a unique or random factor, then cloning the task and running a workflow in parallel will yield different results for each branch. There are two possibilities to solve this problem. Either 1) such tasks should not be cloned, or 2) the redundant components will have to synchronise the unique factor. The first case requires the service to depend on a centralised task. A central task is vulnerable and suffers from location dependence and inability to scale. The second case suffers from inconsistency if the network connections fail. In addition, the workflow manager must provide a synchronisation mechanism, which must be used by the service (which is impossible to guarantee) in order to avoid inconsistencies. Synchronisation is the most difficult challenge to solve, and suggests that the success of the approach outlined here depends on the possibility of using stateless components in the service.

4.5 Upgrading

Upgrading a software component means replacing a running task with another while fulfilling the same obligations as the old component. Note that simply replacing a running component with any other is not upgrading. First, the component interfaces should be compatible, to ensure that the total service is still in a valid functional state. Second, the component’s purpose in the system must be similar, to ensure that the service actually does what it is supposed to do. To ensure that a component’s purpose has not been altered, component behaviour can be guarded by contracts [3], providing self-protection. If the upgraded component fails to fulfil the service’s contract, the manager will have the possibility to undo the upgrade.


An intermediate task is easily upgraded by starting the upgraded component in parallel with the old one (not necessarily on the same host) and routing computations to the upgraded component. Since we assume there is a finite computing path from the sources to the destinations, all running computations will eventually reach their destinations, and the old components are then terminated. Such a workflow model is described by Shrivastava and Wheater [27]. If state needs to be synchronised between new and old components, the upgrade will need to run in a replication phase until all running instances use the new component.

Location-dependent components such as sources (for example a sensor) and destinations (for example a specific terminal) are trickier to upgrade, since the components might hold vital resources or live connections to other systems. Components requiring termination to release the vital resources might cause disturbances in the service, freezing it while the upgrade is performed. A component holding a connection to another system might need to release those connections. In such cases, the new components must be able to reestablish the connections automatically and seamlessly.

These are some of the issues to resolve before the ability to upgrade components can be managed by the autonomic system.

4.6 Summary

In this chapter we observed the consequences of the decision to separate workflow management from compound services. To assure sound system behaviour, we assumed that the result of a task depends only on the input data, making it possible for the workflow system to relocate, rerun, and parallelise task executions. The most difficult challenge involves synchronisation of component states.


Chapter 5

An Experiment

A simplified version of the system model outlined in the previous two chapters is tested in an experiment. The purpose is to evaluate the decision process and the significance of adaptation actions. In the experiment, the workflow system is trained to configure itself dynamically in short episodes of randomly generated scenarios.

First, the scenario and the control problem are described. Then, the experiment setup and execution are described. The resulting data is presented, analysed and discussed.

5.1 Scenario description

The physical scenario consists of distributed hosts with unreliable links. The deployed service streams information from sensors to a command centre, with some optional processing occurring in between. The service tolerates loss of information, but information is valuable, and the system should employ a strategy maximising that value.

There are three types of nodes: sensors, computational nodes, and a command centre (C2). The network system can be configured for unreliable (UDP) or reliable (TCP) communication. Sensors have three modes, covering different amounts of area and producing different amounts of output. Sensor data must be processed in a computational node before being sent to the command centre. The computational node may be configured to perform data fusion, meaning data packets are merged but delayed, or to simply feed the results forward as quickly as possible. The network has limited capacity, and the fusion process is also limited. The goal of the system is to decide on control parameters such that:

• the amount of data delivered to C2 is maximised,

• the age of data delivered to C2 is minimised,


[Figure 5.1 here: sensor nodes emit single packets to compute nodes, which forward single packets onward or deliver fused packets to the C2 node.]

Figure 5.1. The experiment workflow schema. Optional data packets are dashed. The C2 node has unrestricted input capacity.

• sending data over links is minimised,

• sensors’ modes are configured to optimise performance according to system topology.

This is accomplished by a reward function which:

• rewards data holding valuable information,

• discounts the value according to data age,

• negatively rewards a sent packet,

• does not reward sensor overlaps.

The system workflow is organised as follows (see Figure 5.1). Sensor data is linked through compute nodes (at least one). The data is possibly fused with other data packets arriving in the same timeframe. The fused data packet holds a larger information value, but its arrival at C2 is slightly delayed. Compute nodes send the information either to other compute nodes, or directly to C2. Sensors and compute nodes are stateless; we thus avoid many of the problematic issues mentioned in Chapter 4.

5.2 The control problem

We aim to find a decision policy for local control which is globally optimal. The assumption is that all nodes of the same type locally evaluate and execute the same decision policy.


Since the dimensionality of the problem is infinite (for example, the number of nodes and their locations are unconstrained), there exists an infinite number of training episodes. We expect the optimal policy (π∗_N for nodes of type N) to generalise over all of these episodes. However, since the number of state-action pairs is finite, it is possible to use a finite number of training episodes to derive π∗_N. The problem is that we do not know which episodes to use, nor how to derive π∗_N from them.

We use an iterative policy search algorithm to find successively better policies. It runs a series of episodes where the visited state-action pairs (s, a) are rewarded according to the global reward delivered by the critic after each episode. The value function used in this experiment,

    Q(s, a) = ( Σ_{i ∈ Z_{s,a}} r(i) ) / ||Z_{s,a}||,

returns the mean reward of the episodes where state-action pair (s, a) was visited. Here r(i) is the reward delivered in episode i, and Z_{s,a} is the set of episodes where (s, a) was visited (note that ||∅|| = 1). The best policy π^b_N is then

    π^b_N(s) = argmax_a Q(s, a).

The convergence of π^b_N using the policy search algorithm depends on the degree of significance of each action in each state. For a specific state s, if an action a is chosen regularly but the result of a is insignificant for the reward, Q(s, a) will converge to the mean reward. On the other hand, if the result of a is significant, Q(s, a) will diverge from the mean reward, proportionally to the significance of action a in state s. If several actions have similar significance over all training episodes, their Q-values will be similar, making π^b_N sensitive to variations between episodes.

The rate of exploration must be low for the algorithm to converge, and should be inversely proportional to the number of nodes and time units. If there are 100 nodes and an episode is 100 time units, an exploration rate of 10⁻⁴ causes on average one explored state-action pair per episode.
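The value estimate and the greedy policy above translate directly into a small tabular learner. The following Java sketch assumes states and actions are encoded as integer indices; the class and method names are illustrative, not taken from the actual simulator.

    import java.util.Random;

    // Sketch: Q(s, a) as the mean episode reward over the episodes in which
    // (s, a) was visited, with rare epsilon-style exploration in decide().
    final class MeanRewardLearner {
        private final double[][] sum;    // accumulated episode rewards per (s, a)
        private final int[][] visits;    // |Z_{s,a}|
        private final double explorationRate;
        private final Random rng = new Random();

        MeanRewardLearner(int states, int actions, double explorationRate) {
            this.sum = new double[states][actions];
            this.visits = new int[states][actions];
            this.explorationRate = explorationRate;
        }

        double q(int s, int a) {
            return sum[s][a] / Math.max(1, visits[s][a]);   // ||empty set|| = 1
        }

        // pi^b_N(s) = argmax_a Q(s, a), with occasional random exploration.
        int decide(int s) {
            if (rng.nextDouble() < explorationRate) {
                return rng.nextInt(sum[s].length);
            }
            int best = 0;
            for (int a = 1; a < sum[s].length; a++) {
                if (q(s, a) > q(s, best)) {
                    best = a;
                }
            }
            return best;
        }

        // After an episode, credit its global reward to every visited pair.
        void endEpisode(Iterable<int[]> visitedPairs, double episodeReward) {
            for (int[] sa : visitedPairs) {
                sum[sa[0]][sa[1]] += episodeReward;
                visits[sa[0]][sa[1]]++;
            }
        }
    }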

It is worth noting that the best policy π^b_N for node type N is correlated with the policies of other node types acquired in the same experiment. This is true even if there is no explicit collaboration among nodes, as in this experiment. Informally, a correlation between policies is present if any change of behaviour in one node affects the expected distribution of rewards over state-action pairs in another node. This causes implicit collaboration to form among node types as the policy is being improved. An interesting aspect of the implicit collaboration is that it is not programmed; it follows naturally from the objective of finding the optimal policies for all node types in the service.


5.3 Experiment setup

The setup of the decision problem is described briefly: the configurational actions, the abstracted system states, and the critic.

Actions

Sensors can configure their coverage to 1, 2 or 3 area units. Each monitored area unit causes the sensor to produce a packet of sensor information. The sensor also has a scan mode, which causes the sensor to switch off, to sweep clockwise, or to choose a random sector each time unit. The associated actions are:

• Set sensor coverage to 1 area unit

• Set sensor coverage to 2 area units

• Set sensor coverage to 3 area units

• Set sensor scan mode to off

• Set sensor scan mode to sweep

• Set sensor scan mode to random

Compute nodes can be configured to feed sensor information forward or to perform information fusion. The associated actions are:

• Set compute mode to link

• Set compute mode to fusion

Network and workflow configuration is performed by both sensors and compute nodes. It is possible to switch network protocols between TCP and UDP. TCP resends a packet the next time unit if the receiver does not acknowledge; UDP sends only once. The workflow routing can be configured to characterise the preferred data receiver: either pick the node having the estimated shortest distance to C2, the node having the estimated shortest delay to C2 (in time units), or any random node. Redundancy can also be configured to on or off. When on, data packets are sent to two nodes, or twice to the same node (if only one receiver is present). The associated actions are:

• Set network protocol to TCP

• Set network protocol to UDP

• Set workflow routing to shortest distance

• Set workflow routing to shortest delay

• Set workflow routing to random


• Set redundancy on

• Set redundancy off

A reconfiguration takes 1 time unit, meaning there is a cost associated with reconfiguration. Each node can make a decision each time unit.
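Enumerated in code, the complete action set could look as follows; this is a hypothetical Java rendering, since the simulator's real representation is not specified here.

    // Sketch: the fifteen configuration actions listed above. A node selects
    // one action per time unit; changing the configuration costs one time unit.
    enum Action {
        SENSOR_COVERAGE_1, SENSOR_COVERAGE_2, SENSOR_COVERAGE_3,
        SCAN_OFF, SCAN_SWEEP, SCAN_RANDOM,
        COMPUTE_MODE_LINK, COMPUTE_MODE_FUSION,
        PROTOCOL_TCP, PROTOCOL_UDP,
        ROUTE_SHORTEST_DISTANCE, ROUTE_SHORTEST_DELAY, ROUTE_RANDOM,
        REDUNDANCY_ON, REDUNDANCY_OFF
    }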

States

The sensor nodes have the largest abstracted state space. It consists of the current configuration (3 × 3 states for sensor mode and 2 × 2 × 3 states for network mode) and environmental attributes such as connection quality to C2 (3 states: bad, medium, good) and group information (15 states). Sensors can group if they are in close vicinity of each other. Groups are created with a maximum of five sensors, and each sensor receives a rank in the group (based on a total ordering of nodes). With groups, sensors can coordinate decisions according to group roles to achieve a more efficient service. The grouping scheme used here is rather primitive but simple to implement.

Using a tabular method, the size of the sensor state space becomes

    2 × 2 × 3 × 3 × 3 × 3 × 15 = 4860.

A compute node’s state space consists of the current configuration (2 states for compute mode and 2 × 2 × 3 for network mode) and environmental attributes such as connection quality to C2 (3 states) and network stress (4 states).

The size of the compute node state space is

    2 × 2 × 2 × 3 × 3 × 4 = 288.
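With a tabular method, each combination of feature values must map to a unique table row. A mixed-radix encoding achieves this; the Java sketch below assumes the sensor features are ordered as redundancy, protocol, routing, coverage, scan mode, C2 connection quality and group information (the ordering is an illustrative assumption).

    // Sketch: mixed-radix encoding of an abstracted state. For the sensor
    // radices below the index ranges over 2*2*3*3*3*3*15 = 4860 table rows,
    // and for the compute radices over 2*2*2*3*3*4 = 288 rows.
    final class StateIndex {
        static final int[] SENSOR_RADICES  = {2, 2, 3, 3, 3, 3, 15};
        static final int[] COMPUTE_RADICES = {2, 2, 2, 3, 3, 4};

        static int encode(int[] features, int[] radices) {
            int index = 0;
            for (int i = 0; i < radices.length; i++) {
                index = index * radices[i] + features[i];   // features[i] in [0, radices[i])
            }
            return index;
        }
    }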

Critic

The critic delivers a reward based on the area coverage of an episode and the number of packets sent. We used the linear model

    r = αc − (1 − α)p,

where the coverage c is calculated as the union of the areas reaching C2, discounted (not proportionally) by the time delay for the data to reach C2. The value of p is the total number of packets sent in the episode. α is the bias parameter, determining whether the manager is biased towards maximising sensor coverage or minimising communication cost.
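The linear critic is a one-line computation. A Java sketch, with illustrative names and with the time discounting of c assumed to happen before the call:

    // Sketch: the linear critic r = alpha*c - (1 - alpha)*p, where c is the
    // (already time-discounted) covered area reaching C2 and p the number of
    // packets sent during the episode.
    final class LinearCritic {
        private final double alpha;   // bias: coverage versus communication cost

        LinearCritic(double alpha) {
            this.alpha = alpha;
        }

        double reward(double coverage, int packetsSent) {
            return alpha * coverage - (1.0 - alpha) * packetsSent;
        }
    }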


Figure 5.2. A screenshot of the simulation software developed at SICS.

Policy   ℓ     α
1        1.0   0.8
2        1.0   0.5
3        0.6   0.8
4        0.6   0.5

Table 5.1. Settings for link reliability (ℓ) and bias (α) for the four policies.

5.4 Experiment execution

An illustration of the simulation software is provided in Figure 5.2. For each episode, the experiment engine creates a static environment, where nodes have fixed positions.

Four different policies were trained. The first two were trained with link reliability ℓ = 1.0, using the bias parameters α = 0.8 and α = 0.5, respectively. The third and fourth policies were trained with a lower link reliability, ℓ = 0.6, and the same respective biases. The second and fourth policies are biased towards a system with lower communication cost. The settings are listed in Table 5.1.

The gradual convergence of the policies during training is illustrated by the episode-reward plot of Policy 1 in Figure 5.3. The reward value in each evaluation point is averaged over ten distinct scenarios and over five training rounds.


[Figure 5.3 here: averaged reward (y-axis, 0–2000) plotted against training episode (x-axis, 0–5000).]

Figure 5.3. Averaged reward per episode during training of Policy 1.


Policy   Coverage   Packets   Reward
1        2781       1418      1940
2        2757       1036       860
3        2356       1782      1528
4        1944       1099       422

Table 5.2. Evaluation of policies averaged over 100 scenarios.

Policy   ℓ = 1.0   ℓ = 0.6
1        1940      1545
3        1877      1528

Table 5.3. Rewards of policies 1 and 3 evaluated using alternative link reliabilities (bias α = 0.8).

5.5 Resulting data

The four policies were each applied to 100 distinct scenarios for evaluation of coverage, communication and reward. The same 100 scenarios were used for the evaluation of all policies. Each policy was created in five versions, and the best policy was selected from this set. The results are listed in Table 5.2.

Tables 5.3 and 5.4 display the results of evaluating the policies using alternative link reliabilities.

5.6 Analysis

The trained policies are characterised by both the bias parameters and the environmental conditions present during training. From Tables 5.3 and 5.4 it is clear that the policies received similar rewards, even though they were trained in environments with different network link reliability. This suggests that the system abstraction used in the experiment successfully captures environment variations

Policy   ℓ = 1.0   ℓ = 0.6
2        860       373
4        848       422

Table 5.4. Rewards of policies 2 and 4 evaluated using alternative link reliabilities (bias α = 0.5).


and that it is possible to find general policies capable of managing a wide variety of situations. It should be noted that some policies do not generalise well when the overall link reliability falls, particularly variations of Policy 2, which is biased towards sending few packets and was trained in a high-reliability environment. To ensure good generalisation, the link reliability parameter ℓ should also be varied during policy learning and evaluation.

From Table 5.2 we can calculate the coverage-to-packet ratio. Policy 2 has a higher ratio (2.66) than Policy 1 (1.96): the system did not only send fewer packets, it also ran the service more efficiently. The same observation holds for Policy 4 (1.77) compared to Policy 3 (1.32).

5.7 Discussion

Using the policy search algorithm to create decision policies automatically increases system robustness. During prototyping, bugs in the system implementation caused the policies to exhibit strange behaviours. However irrational the behaviour appeared to an observer, it was optimal for the system as implemented: since the bugs had a negative impact on rewards, the system learned to avoid the affected settings. As the strange system behaviour caught the attention of the development crew, the bugs were eventually found and fixed, making behaviour more “rational” and increasing the reward of subsequently trained policies.

Is a system of distributed managers stable? A decisive property is how sensitive managers are to changes occurring in their environment, for example decisions made by neighbours. Using simulations, the stability of distributed decision systems can be verified statistically. A stabilising feature of the training presented here is that reconfiguration consumes valuable processing time, implicitly lowering the reward and making unstable policies suboptimal.

In this experiment, a linear critic was used. If a non-linear critic is used, it is possible to enforce maximum cost limits. This can be done, for example, by delivering a huge penalty each time a performance measure becomes negative.
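For example, a non-linear critic could be obtained by adding a large fixed penalty to the linear model whenever a performance measure goes negative. In the Java sketch below, the PENALTY constant and all names are illustrative assumptions:

    // Sketch: a non-linear critic enforcing a hard cost limit. Identical to
    // the linear model until the performance measure drops below zero.
    final class PenalisingCritic {
        private static final double PENALTY = 1.0e6;   // hypothetical magnitude
        private final double alpha;

        PenalisingCritic(double alpha) {
            this.alpha = alpha;
        }

        double reward(double coverage, int packetsSent, double performanceMeasure) {
            double r = alpha * coverage - (1.0 - alpha) * packetsSent;
            return (performanceMeasure < 0) ? r - PENALTY : r;
        }
    }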

Another interesting feature to test would be a more advanced grouping model, allowing decision hierarchies to form spontaneously over locally static system regions. The group feature implemented in this experiment did not allow group-specific decisions to be taken by a central manager in the group. Instead, the decisions are executed by each individual group member based only on its local knowledge. The decisions made by the members of a group are therefore only weakly coordinated. Another consequence is that group behaviour may diverge if the policies differ.


Chapter 6

Discussion

This chapter concludes the thesis. All the chapters are summarised, and the achievements made in this project by the research group are discussed.

6.1 Thesis summary

In Chapter 1 the study of automatic systems management was motivated. The complexity of modern software systems gave birth to the vision of Autonomic Computing, an initiative aiming for greater system self-management.

In Chapter 2 several such systems were described, starting with the multiagent environment of autonomic elements suggested by IBM. The approach is interesting, but the areas of multiagent systems, agent cooperation, contracts, policies, and so on, have not yet reached the desired level of maturity. Later in the chapter, the principles of Recovery Oriented Computing provided ideas for self-healing systems. The most important lessons from ROC are to treat failures as a normal part of system operation, to tolerate them, and to adapt the system quickly. Adaptive architectures, discussed later, are a problematic area because there is little or no consensus about what an adaptive architecture is, or about what basic operations such an architecture should provide. The area of workflow management processes seems well defined and provides a reasonably good starting point for the creation of an adaptive architecture for component-based systems.

In Chapter 3 we approached systems management as automatic control. Using a general and simplified model of the system environment, some general challenges of distributed system management were identified. Using these challenges, a control problem was modelled by identifying system state evaluation, system state abstraction and system control mechanisms. The controller’s objective is to maximise the system value, which is also referred to as a reward function. The reward function is decisive for the quality of the resulting system. The controller itself is regulated through


higher-level parameters, possibly managed by other controllers. Using the approach suggested in this chapter, control complexity in any problem is naturally decreased, and can be successively decreased using multiple layers of controllers.

The decision process relies on a platform of supporting services for information acquisition and system adaptation. In Chapter 4 the mechanisms surrounding the decision were described. A planning phase provides decision alternatives, and an analysis phase provides the precise information needed for the decision. These two phases use the system’s information architecture for information acquisition and must also take into account any currently executing decisions. In the execution phase, the selected action is interpreted as control commands directed to the adaptive architecture. The workflow process model was suggested as a possible basis for system adaptation. It allows component dataflows to be adapted and services to be installed and upgraded. In some cases, services can improve their performance and reliability using redundancy, managed by the workflow system. The chapter concluded that all these basic adaptation decisions and operations have a number of interesting challenges to solve. The key issue concerns components with state, since it is difficult to replicate state between components using external managers. If the suggested abilities are wanted in a system, hard architectural constraints are imposed on all components.

In Chapter 5 we attempted to concretise some of the presented ideas in an experiment. We described explicitly how the system knowledge is represented in the mapping from system states to actions for the service. In the experiment we found policies which successfully managed the system in various situations. In Section 5.2 we discussed that the stability, and in the end the success, of the decision process depends on the significance of the actions in each state. The behaviour of the policy is also affected by the covered space of episodes. We thus need to consider three issues: 1) including all significant issues in the state, 2) using actions with a large impact on the state, and 3) covering a relevant amount of the total episode space. We also noted the formation of implicit collaboration between components.

6.2 What has been achieved?

We have attempted to approach a system management model through a separation of service management decisions (in Chapter 3) and control mechanisms (in Chapter 4). A simplified version of the decision model was evaluated experimentally. The suggested decision mechanism and adaptation actions are general enough to support system self-configuration, self-optimisation and self-healing.

We wanted to answer the question of whether it is possible to manage systems using distributed control in unreliable environments. The answer is that it is very difficult to achieve systems management addressing all the challenges described in Chapter 3. In Chapter 4 we discovered why. A general system is difficult to manage


if the system state is not fully observable. It is non-trivial to create a service architecture manageable by an externalised controller. In the experiment we used a simple model with stateless components and a trivial workflow. We also used a continuous stream of data, and tolerated loss of data. Despite this, the thesis has revealed and discussed some of the challenges involved in the management of distributed systems using automatic control.

It is unclear how the Autonomic Computing vision can be realised. The software industry will put legions of techniques to the test. In this thesis, we assumed services operate in an architecture providing some basic abilities, such as information extraction and adaptation of running services. The decision model is generic and simple. We believe this approach can be used to solve problems without explicitly programming the solutions: a way to both provide system autonomy and decrease system complexity.


Acknowledgements

I wish to thank Sverker Janson, Joakim Eriksson and Niclas Finne at SICS, the Swedish Institute of Computer Science, for all their support. Joakim Eriksson and Niclas Finne contributed to the implementation of the simulation software. Joakim Eriksson also contributed to Section 5.3 (Experiment setup) with his draft of the experiment specification. Sverker Janson contributed to the thesis problem description.

I would also like to thank all the reviewers and all the persons who supported meduring the work.

Patrick Hinnelund


References

[1] M. Agarwal and M. Parashar. Enabling autonomic compositions in grid environments. In Proceedings of the Fourth International Workshop on Grid Computing, page 34. IEEE Computer Society, 2003.

[2] J. Appavoo, K. Hui, C. A. N. Soules, R. W. Wisniewski, D. M. Da Silva, O. Krieger, M. A. Auslander, D. J. Edelsohn, B. Gamsa, G. R. Ganger, P. McKenney, M. Ostrowski, B. Rosenburg, M. Stumm, and J. Xenidis. Enabling autonomic behavior in systems software with hot swapping. IBM Systems Journal, 42(1):60–76, 2003.

[3] J. Armstrong. Making reliable distributed systems in the presence of software errors. PhD thesis, The Royal Institute of Technology, Stockholm, Sweden, 2003.

[4] M. Beck, T. Moore, and J. S. Plank. An end-to-end approach to globally scalable programmable networking. In Proceedings of the ACM SIGCOMM workshop on Future directions in network architecture, pages 328–339. ACM Press, 2003.

[5] E. M. Belding-Royer. Multi-level hierarchies for scalable ad hoc routing. Wirel. Netw., 9(5):461–478, 2003.

[6] G. S. Blair, G. Coulson, L. Blair, H. Duran-Limon, P. Grace, R. Moreira, and N. Parlavantzas. Reflection, self-awareness and self-healing in OpenORB. In Proceedings of the first workshop on Self-healing systems, pages 9–14. ACM Press, 2002.

[7] G. A. Bolcer and G. Kaiser. SWAP: Leveraging the web to manage workflow. IEEE Internet Computing, 3(1):85–88, 1999.

[8] S-W. Cheng, D. Garlan, B. R. Schmerl, J. P. Sousa, B. Spitznagel, P. Steenkiste, and N. Hu. Software architecture-based adaptation for pervasive systems. In Proceedings of the International Conference on Architecture of Computing Systems, pages 67–82. Springer-Verlag, 2002.

[9] Y. Diao, J. L. Hellerstein, S. Parekh, and J. P. Bigus. Managing web server performance with autotune agents. IBM Systems Journal, 42(1):136–149, 2003.


[10] D. Garlan, R. Allen, and J. Ockerbloom. Exploiting style in architectural design environments. In Proceedings of the 2nd ACM SIGSOFT symposium on Foundations of software engineering, pages 175–188. ACM Press, 1994.

[11] I. Georgiadis, J. Magee, and J. Kramer. Self-organising software architectures for distributed systems. In Proceedings of the first workshop on Self-healing systems, pages 33–38. ACM Press, 2002.

[12] R. Haas, P. Droz, and B. Stiller. Autonomic service deployment in networks. IBM Systems Journal, 42(1):150–164, 2003.

[13] P. Horn. Autonomic Computing: IBM’s Perspective on the State of Information Technology, 2001. http://www.research.ibm.com/autonomic/manifesto/.

[14] B. Jacob, L. Ferreira, N. Bieberstein, C. Gilzean, J-Y. Girard, R. Strachowski, and S. Yu. Enabling Applications for Grid Computing with Globus. IBM Corp., 1st edition, 2003.

[15] J. Jann, L. M. Browning, and R. S. Burgula. Dynamic reconfiguration: Basic building blocks for autonomic computing on IBM pSeries servers. IBM Systems Journal, 42(1):29–37, 2003.

[16] S. A. Jarvis, D. P. Spooner, H. N. Lim Choi Keung, J. R. D. Dyson, L. Zhao, and G. R. Nudd. Performance-based middleware services for grid computing. In Fifth Annual International Workshop on Active Middleware Services, pages 151–159, 2003.

[17] J. O. Kephart and D. M. Chess. The vision of autonomic computing. Computer, 36(1):41–50, 2003.

[18] B. Khargharia, S. Hariri, M. Parashar, L. Ntaimo, and B. uk Kim. vGrid: A framework for building autonomic applications. In Proceedings of the International Workshop on Challenges of Large Applications in Distributed Environments, page 19. IEEE Computer Society, 2003.

[19] J. Knight, D. Heimbigner, A. L. Wolf, A. Carzaniga, J. Hill, P. Devanbu, and M. Gertz. The Willow Architecture: Comprehensive Survivability for Large-Scale Distributed Applications. Technical report, University of Colorado, Department of Computer Science, 2001.

[20] G. Lanfranchi, P. Della Peruta, A. Perrone, and D. Calvanese. Toward a new landscape of systems management in an autonomic computing environment. IBM Systems Journal, 42(1):119–128, 2003.

[21] V. Markl, G. M. Lohman, and V. Raman. LEO: An autonomic query optimizer for DB2. IBM Systems Journal, 42(1):98–106, 2003.

[22] Microsoft Corporation. Microsoft Dynamic Systems Initiative, 2003. http://www.microsoft.com/windowsserversystem/dsi/.


[23] P. Oreizy, M. M. Gorlick, R. N. Taylor, D. Heimbigner, G. Johnson, N. Medvidovic, A. Quilici, D. S. Rosenblum, and A. L. Wolf. An architecture-based approach to self-adaptive software. IEEE Intelligent Systems, 14(3):54–62, 1999.

[24] S. Parekh, N. Gandhi, J. Hellerstein, D. Tilbury, T. Jayram, and J. Bigus. Using control theory to achieve service level objectives in performance management. Real-Time Syst., 23(1-2):127–141, 2002.

[25] D. Patterson, A. Brown, P. Broadwell, G. Candea, M. Chen, J. Cutler, P. Enriquez, A. Fox, E. Kiciman, M. Merzbacher, D. Oppenheimer, N. Sastry, W. Tetzlaff, J. Traupman, and N. Treuhaft. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. Technical report, UC Berkeley, 2002.

[26] G. Schreck, T. Schadler, C. Rutstein, and A. Tseng. The Fabric Operating System. Technical report, Forrester, 2003.

[27] S. K. Shrivastava and S. M. Wheater. Architectural support for dynamic reconfiguration of large scale distributed applications. In Proceedings of the 4th International Conference on Configurable Distributed Systems, pages 10–17. IEEE Computer Society Press, 1998.

[28] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan. Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Trans. Netw., 11(1):17–32, 2003.

[29] Sun Microsystems, Inc. Java™ 2 SDK, Standard Edition, Version 1.5.0: Summary of New Features and Enhancements, Feb. 2004. http://java.sun.com/j2se/1.5.0/docs/relnotes/features.html.

[30] S. S. Yau, F. Karim, Y. Wang, B. Wang, and S. K. S. Gupta. Reconfigurable context-sensitive middleware for pervasive computing. IEEE Pervasive Computing, 1(3):33–40, 2002.
